-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathregex.Rmd
480 lines (310 loc) · 15.4 KB
/
regex.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
---
title: Using regular expression in R and Python
layout: default
---
## 1 Overview
Regular expressions are a domain-specific language for
finding patterns and are one of the key functionalities in scripting
languages such as Perl and Python, as well as the UNIX utilities `grep`, `sed`, and
`awk`.
The basic idea of regular expressions is that they allow us to find
matches of strings or patterns in strings, as well as do substitution.
Regular expressions are good for tasks such as:
- extracting pieces of text;
- creating variables from information found in text;
- cleaning and transforming text into a uniform format; and
- mining text by treating documents as data.
## 2 Regular expression syntax
Please use one or more of the following resources to learn regular expression syntax.
- [Our tutorial on using the bash shell](https://berkeley-scf.github.io/tutorial-using-bash/regex)
- Duncan Temple Lang (UC Davis Statistics) has written a [nice tutorial](regexpr-Lang.pdf) that is part of the repository for this tutorial
- Check out Sections 9.9 and 11 of [Paul Murrell's book](http://www.stat.auckland.ac.nz/~paul/ItDT)
Also, see the back/second page of RStudio's stringr cheatsheet for a [cheatsheet on regular expressions](https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf) for a regular expression cheatsheet. And here is a [website where you can interactively test regular expressions on example strings](https://regex101.com).
## 3 Versions of regular expressions
One thing that can cause headaches is differences in version of regular expression syntax used. As discussed in `man grep`, *extended regular expressions* are standard, with *basic regular expressions* providing somewhat less functionality and *Perl regular expressions* additional functionality.
In R, `stringr` provides *ICU regular expressions* (see `help(regex)`), which are based on Perl regular expressions. More details can be found in the [regex Wikipedia page](https://en.wikipedia.org/wiki/Regular_expression).
The [bash shell tutorial](https://berkeley-scf.github.io/tutorial-using-bash/regex) provides a full documentation of the various *extended regular expressions* syntax, which we'll focus on here. This should be sufficient for most usage and should be usable in R and Python, but if you notice something funny going on, it might be due to differences between the regular expressions versions.
## 4 General principles for working with regex
The syntax is very concise, so it's helpful to break down
individual regular expressions into the component parts to understand
them. As Murrell notes, since regex are their own language, it's
a good idea to build up a regex in pieces as a way of avoiding errors
just as we would with any computer code. `str_detect` in R's `stringr` and `re.findall` in Python are particularly
useful in seeing *what* was matched to help in understanding
and learning regular expression syntax and debugging your regex. As with
many kinds of coding, I find that debugging my regex is usually what takes
most of my time.
## 5 Using regex in R
The `grep`, `gregexpr` and `gsub` functions and
their `stringr` analogs are more powerful when used with regular
expressions. In the following examples, we'll illustrate usage of `stringr` functions, but
with their base R analogs as comments.
### 5.1 Working with patterns
First let's see the use of character sets and character classes.
```{r}
library(stringr)
text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
"They bought 731 bananas", "Please call 919.554.3800")
str_detect(text, "[[:digit:]]") ## search for a digit
## Base R equivalent:
## grep("[[:digit:]]", text, perl = TRUE)
```
```{r}
str_detect(text, "[:,\t.]") # search for various punctuation symbols
## Base R equivalent:
## grep("[:,\t.]", text)
```
```{r}
str_locate_all(text, "[:,\t.]")
## Base R equivalent:
## gregexpr("[:,\t.]", text)
```
```{r}
str_extract_all(text, "[[:digit:]]+") # extract one or more digits
## Base R equivalent:
## matches <- gregexpr("[[:digit]]+", text)
## regmatches(text, matches)
```{r}
str_replace_all(text, "[[:digit:]]", "X")
## Base R equivalent:
## gsub("[[:digit:]]", "X", text)
```
Challenge: how would we find a spam-like pattern with digits or non-letters inside a word? E.g., I want to find "V1agra" or "Fancy repl!c@ted watches".
Next let's consider location-specific matches.
```{r}
str_detect(text, "^[[:upper:]]") # text starting with upper case letter
## Base R equivalent:
## grep("^[[:upper:]]", text)
```
```{r}
str_detect(text, "[[:digit:]]$") # text ending with a digit
## Base R equivalent:
## grep("[[:digit:]]$", text)
```
Now let's make use of repetitions.
Let's search for US/Canadian/Caribbean phone numbers in the example text we've been using:
```{r}
text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
"They bought 731 bananas", "Please call 919.554.3800")
pattern <- "[[:digit:]]{3}[-.][[:digit:]]{3}[-.][[:digit:]]{4}"
str_extract_all(text, pattern)
## Base R equivalent:
## matches <- gregexpr(pattern, text)
## regmatches(text, matches)
```
Challenge: How would I extract an email address from an arbitrary text string?
Next consider grouping.
For example, the phone number detection problem could have been done a bit more compactly (and more generally, in case the area code is omitted or a 1 is included) as:
```{r}
str_extract_all(text, "(1[-.])?([[:digit:]]{3}[-.]){1,2}[[:digit:]]{4}")
## Base R equivalent:
## matches <- gregexpr("(1[-.])?([[:digit:]]{3}[-.]){1,2}[[:digit:]]{4}", text)
## regmatches(text, matches)
```
Challenge: the above pattern would actually match something that is not a valid phone number. What can go wrong?
Here's a basic example of using grouping via parentheses with the OR operator.
```{r}
text <- c("at the site http://www.ibm.com", "other text", "ftp://ibm.com")
str_locate(text, "(http|ftp):\\/\\/") # http or ftp followed by ://
## Base R equivalent:
## gregexpr("(http|ftp):\\/\\/", text)
```
Parentheses are also used for referencing back to a detected pattern when doing a replacement. For example, here we'll find any numbers and add underscores before and after them:
```{r}
text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
"They bought 731 bananas", "Please call 919.554.3800")
str_replace_all(text, "([0-9]+)", "_\\1_") # place underscores around all numbers
```
Here we'll remove commas not used as field separators.
```{r}
text <- ('"H4NY07011","ACKERMAN, GARY L.","H","$13,242",,,')
clean_text <- str_replace_all(text, "([^\",]),", "\\1")
clean_text
cat(clean_text)
## Base R equivalent:
## gsub("([^\",]),", "\\1", text)
```
Challenge: Suppose a text string has dates in the form “Aug-3”, “May-9”, etc. and I want them in the form “3 Aug”, “9 May”, etc. How would I do this search/replace?
Finally let's consider greedy matching.
```{r}
text <- "Do an internship <b> in place </b> of <b> one </b> course."
str_replace_all(text, "<.*>", "")
## Base R equivalent:
## gsub("<.*>", "", text)
```
What went wrong?
One solution is to append a ? to the repetition syntax to cause the matching to be non-greedy. Here's an example.
```{r}
str_replace_all(text, "<.*?>", "")
## Base R equivalent:
## gsub("<.*?>", "", text)
```
However, one can often avoid greedy matching by being more clever.
Challenge: How could we change our regexp to avoid the greedy matching without using the “?”?
### 5.2 'Escaping' special characters
Using backslashes to 'escape' particular characters can be tricky. One rule of thumb is to just keep adding backslashes until you get what you want!
```{r}
## last case here is literally a backslash and then 'n'
strings <- c("Hello", "Hello.", "Hello\nthere", "Hello\\nthere")
cat(strings, sep = "\n")
```
```{r}
str_detect(strings, ".") ## . means any character
## This would fail because \. looks for the special symbol \.
## (which doesn't exist):
## str_detect(strings, "\."))
str_detect(strings, "\\.") ## \\ says treat \ literally, which then escapes the .
str_detect(strings, "\n") ## \n looks for the special symbol \n
## \\ says treat \ literally, but \ is not meaningful regex
try(str_detect(strings, "\\"))
## R parser removes two \ to give \\; then in regex \\ treats second \ literally
str_detect(strings, "\\\\")
```
### 5.3 Other comments
If we are working with newlines embedded in a string, we can include the newline character as a regular character that is matched by a “.” by first creating the regular expression with `stringr::regex` with the `dotall` argument set to `TRUE`:
```{r}
myex <- regex("<p>.*</p>", dotall = TRUE)
html_string <- "And <p>here is some\ninformation</p> for you."
str_extract(html_string, myex)
```
```{r}
str_extract(html_string, "<p>.*</p>") # doesn't work because \n is not matched
```
Regular expression can be used in a variety of places. E.g., to split by any number of white space characters
```{r}
line <- "a dog\tjumped\nover \tthe moon."
cat(line)
str_split(line, "[[:space:]]+")
str_split(line, "[[:blank:]]+")
```
## 6 Using regex in Python
### 6.1 Working with patterns
For working with regex in Python, we'll need the `re` package. It provides Perl-style regular expressions, but it doesn't seem to support named character classes such as `[:digit:]`. Instead use classes such as `\d` and `[0-9]`.
Again, in the code chunks that follow, all the explicit print statements are needed for R Markdown to print out the values.
In Python, you apply a matching function and then query the result to get information about what was matched and where in the string.
```{python}
import re
text = "Here's my number: 919-543-3300."
m = re.search("\d+", text)
m
m.group()
m.start()
m.end()
m.span()
```
Notice that that showed us only the first match.
We can instead use `findall` to get all the matches.
```{python}
re.findall("\d+", text)
```
To ignore case, do the following:
```{python}
import re
str = "That cat in the Hat"
re.findall("hat", str, re.IGNORECASE)
```
There are several other regex flags (also called compilation flags) that can control the behavior of the matching engine in interesting ways (check out re.VERBOSE and re.MULTILINE for instance).
We can of course use list comprehension to work with multiple strings. But we need to be careful to check whether a match was found.
```{python}
import re
text = ["Here's my number: 919-543-3300.", "hi John, good to meet you",
"They bought 731 bananas", "Please call 919.554.3800"]
def return_group(pattern, txt):
m = re.search(pattern, txt)
if m:
return(m.group())
else:
return(None)
[return_group("\d+", str) for str in text]
```
Next, let's look at replacing patterns, using `re.sub`.
```{python}
import re
text = ["Here's my number: 919-543-3300.", "hi John, good to meet you",
"They bought 731 bananas", "Please call 919.554.3800"]
re.sub("\d", "Z", text[0])
```
Next let's consider grouping using `()`.
Here’s a basic example of using grouping via parentheses with the OR operator (`|`).
```{python}
text = "At the site http://www.ibm.com. Some other text. ftp://ibm.com"
re.search("(http|ftp):\\/\\/", text).group()
```
However, if we want to find all the matches and try to use `findall`, we see that it returns only the captured groups when grouping operators are present, as discussed a bit in `help(re.findall)`,
so we'd need to add an additional grouping operator to capture the full pattern when using `findall`:
```{python}
re.findall("(http|ftp):\\/\\/", text)
re.findall("((http|ftp):\\/\\/)", text)
```
When you are searching for all occurrences of a pattern in a large text object, it may be beneficial to use `finditer`:
```{python}
it = re.finditer("(http|ftp):\\/\\/", text) # http or ftp followed by ://
for match in it:
match.span()
```
This method behaves lazily and returns an iterator that gives us one match at a time, and only scans for the next match when we ask for it.
Groups are also used when we need to reference back to a detected pattern when doing a replacement. This is why they are sometimes referred to as "capturing groups". For example, here we’ll find any numbers and add underscores before and after them:
```{python}
text = "Here's my number: 919-543-3300. They bought 731 bananas. Please call 919.554.3800."
re.sub("([0-9]+)", "_\\1_", text)
```
Here we’ll remove commas not used as field separators by replacing all commas except those occurring
after another comma or after a quotation mark. This is an attempt to remove all commas not used
as field delimiters.
```{python}
text = '"H4NY07011","ACKERMAN, GARY L.","H","$13,242",,,'
re.sub("([^\",]),", "\\1", text)
```
How does that work? Consider that "[^\",]" matches a character that is not a quote and not a comma. The regex is therefore such a non-quote, non-comma character followed by a comma, with the matched character saved in `\\1` because of the grouping operator.
Groups can also be given names, instead of having to refer to them by their numbers, but we will not demonstrate this here.
Finally, let's consider where a match ends when there is ambiguity.
As a simple example consider that if we try this search, we match as many digits as possible, rather than returning the first "9" as satisfying the request for "one or more" digits.
```{python}
text = "See the 998 balloons."
re.findall("\d+", text)
```
That behavior is called *greedy* matching, and it's the default. That example also shows why it
is the default. What would happen if it were not the default?
However, sometimes greedy matching doesn't get us what we want.
Consider this attempt to remove multiple html tags from a string.
```{python}
text = "Do an internship <b> in place </b> of <b> one </b> course."
re.sub("<.*>", "", text)
```
Notice what happens because of greedy matching.
One way to avoid greedy matching is to use a `?` after the repetition specifier.
```{python}
re.sub("<.*?>", "", text)
```
However, that syntax is a bit frustrating because `?` is also used to indicate
0 or 1 repetitions, making the regex a bit hard to read/understand.
> **Challenge**: Suppose I want to strip out HTML tags but without using the `?` to avoid
greedy matching. How can I be more careful in constructing my regex?
### 6.2 'Escaping' special characters
For reasons explained in the [re documentation](https://docs.python.org/3/howto/regex.html), to match an actual backslash, such as `"\section"`, you'd need `"\\\\section"`. This can be avoided by using raw strings: `r"\\section"`.
Here are some more examples of escaping characters.
```{python}
strings = ["Hello", "Hello.", "Hello\nthere", "Hello\\nthere"]
strings[2]
strings[3]
print(strings[2])
print(strings[3])
```{python}
re.search(".", strings[0]) ## . means any character
re.search("\.", strings[0]) ## \. escapes the period and treats it literally
re.search("\.", strings[1]) ## \. escapes the period and treats it literally
re.search("\n", strings[2]) ## \n looks for the special symbol \n
re.search("\n", strings[3]) ## \n looks for the special symbol \n
re.search("\\\\", strings[3]) ## string parser removes two \ to give \\;
## then in regex \\ treats second \ literally
```
### 6.3 Other comments
You can also compile regex patterns for faster processing when working with a pattern multiple times.
```{python}
import re
text = ["Here's my number: 919-543-3300.", "hi John, good to meet you",
"They bought 731 bananas", "Please call 919.554.3800"]
p = re.compile('\d+')
m = p.search(text[0])
m.group()
```