# Regular expressions
Regular expressions are tools to **describe patterns in strings**.
## Find simple matches with grep
* Find a pattern anywhere in the string (outputs the index of the element):
```{r}
# By default, outputs the index of the element matching the pattern
grep(pattern="Gen",
x="Genomics")
```
* Show actual element where the pattern is found (instead of the index only) with **value=TRUE**:
```{r}
# Set value=TRUE
grep(pattern="Gen",
x="Genomics",
value=TRUE)
```
* Case-insensitive search with **ignore.case=TRUE**:
```{r}
# Enter the pattern in lower-case, but case is ignored
grep(pattern="gen",
x="Genomics",
value=TRUE,
ignore.case=TRUE)
```
* Show the elements that DON'T match the pattern with **invert=TRUE**:
```{r}
# Returns the elements that don't match
# (none here, since "Genomics" does match once case is ignored)
grep(pattern="gen",
x="Genomics",
value=TRUE,
ignore.case=TRUE,
invert=TRUE)
```
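
The options above are easier to compare on a vector with several elements. The following is a minimal sketch: the vector `fields` is made up for illustration, and `grepl` (a base R relative of `grep` that returns one TRUE/FALSE per element) is not used elsewhere in this document.

```{r}
# Hypothetical vector, for illustration only
fields <- c("Genomics", "Proteomics", "genetics", "Metabolomics")

# Index of the elements containing "Gen" (case-sensitive): only element 1
idx <- grep(pattern="Gen", x=fields)

# The matching elements themselves, ignoring case
hits <- grep(pattern="gen", x=fields, value=TRUE, ignore.case=TRUE)

# The elements that do NOT match
misses <- grep(pattern="gen", x=fields, value=TRUE, ignore.case=TRUE, invert=TRUE)

# One logical value per element, handy for filtering
is_hit <- grepl(pattern="gen", x=fields, ignore.case=TRUE)
```

Here `hits` holds "Genomics" and "genetics", while `misses` holds the two remaining elements.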
## Regular expressions to find more flexible patterns
<h4>Special characters used for pattern recognition:</h4>

| Character | Meaning |
|-----------|---------|
| `$` | Find pattern at the end of the string |
| `^` | Find pattern at the beginning of the string |
| `{n}` | The previous pattern should be found exactly n times |
| `{n,m}` | The previous pattern should be found between n and m times |
| `+` | The previous pattern should be found at least 1 time |
| `*` | The previous pattern can be found any number of times, including 0 |
| `?` | The previous pattern is optional: found 0 or 1 time |

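
As a rough illustration of how the quantifiers differ, the sketch below applies each one to the same set of strings. The vector `x` is invented for this example; the patterns are anchored with `^` and `$` so that the whole element has to fit the pattern.

```{r}
# Hypothetical strings: "a", then 0 to 3 "b", then "c"
x <- c("ac", "abc", "abbc", "abbbc")

# {n}: exactly 2 "b"
exactly_two <- grep(pattern="^ab{2}c$", x=x, value=TRUE)

# {n,m}: between 2 and 3 "b"
two_to_three <- grep(pattern="^ab{2,3}c$", x=x, value=TRUE)

# +: at least 1 "b"
one_or_more <- grep(pattern="^ab+c$", x=x, value=TRUE)

# *: any number of "b", including none
zero_or_more <- grep(pattern="^ab*c$", x=x, value=TRUE)

# ?: no "b" or exactly 1 "b"
zero_or_one <- grep(pattern="^ab?c$", x=x, value=TRUE)
```

Only `*` and `?` let "ac" through, because they are the two quantifiers that accept zero occurrences.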
<h4>Match your own pattern inside **[]**</h4>
\[abc\]: matches a, b, or c.<br>
^\[abc\]: matches a, b, or c at the beginning of the element.<br>
^A\[abc\]+: matches A as the first character of the element, followed by one or more of a, b, or c.<br>
^A\[abc\]*: matches A as the first character of the element, optionally followed by any number of a, b, or c.<br>
^A\[abc\]{1}_: matches A as the first character of the element, followed by exactly one of a, b, or c, then an underscore.<br>
\[a-z\]: matches any lower-case letter from a to z.<br>
\[A-Z\]: matches any upper-case letter from A to Z.<br>
\[0-9\]: matches any digit from 0 to 9.<br>
* Match anything contained between brackets (here either g or t) at least once:
```{r}
grep(pattern="[gt]+",
x=c("genomics", "proteomics", "transcriptomics"),
value=TRUE)
```
* Match anything contained between brackets at least once AND at the start of the element:
```{r}
grep(pattern="^[gt]+",
x=c("genomics", "proteomics", "transcriptomics"),
value=TRUE)
```
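
Bracket expressions can also hold ranges and be combined with the quantifiers seen earlier. A small sketch, using a made-up vector of sample names (`ids` is an assumption for this example, not part of the original material):

```{r}
# Hypothetical sample names, for illustration only
ids <- c("A1_ctrl", "Ab_ctrl", "Aab_ctrl", "B2_ctrl", "A_ctrl")

# "A", then exactly one lower-case letter, then an underscore
one_letter <- grep(pattern="^A[a-z]{1}_", x=ids, value=TRUE)

# "A", then one or more digits or lower-case letters, then an underscore
letters_or_digits <- grep(pattern="^A[0-9a-z]+_", x=ids, value=TRUE)

# "A", then zero or more lower-case letters, then an underscore
optional_letters <- grep(pattern="^A[a-z]*_", x=ids, value=TRUE)
```

`one_letter` keeps only "Ab_ctrl": in "Aab_ctrl" a second letter sits where the pattern expects the underscore.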
* **Create a vector of email addresses:**
```{r}
vec_ad <- c("[email protected]", "[email protected]",
```
* Keep only email addresses finishing with "es":
```{r}
grep(pattern="es$",
x=vec_ad,
value=TRUE)
```
## Substitute or remove matching patterns with gsub
From the same vector of email addresses:
* Remove the "@" symbol and the email provider from each address
```{r}
gsub(pattern="@[a-z.]+",
replacement="",
x=vec_ad)
```
* Substitute the "@" symbol with "_at_"
```{r}
gsub(pattern="@",
replacement="_at_",
x=vec_ad)
```
* Replace a trailing "es" or "it" with "eu"
```{r}
gsub(pattern="es$|it$",
replacement="eu",
x=vec_ad)
```
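
`gsub` replaces every match in each element, whereas its base R sibling `sub` replaces only the first one. The sketch below shows the difference on a stand-in vector; the addresses in `addr` are invented placeholders on example domains, not the addresses stored in `vec_ad`.

```{r}
# Hypothetical addresses, for illustration only (not the original vec_ad)
addr <- c("anna22@example.es", "marco@mail.example.it", "lee7@example.com")

# Only a trailing "es" or "it" is replaced; ".com" is left alone
to_eu <- gsub(pattern="es$|it$", replacement="eu", x=addr)

# A dot means "any character" in a regular expression,
# so it is written as [.] to match a literal dot
first_dot <- sub(pattern="[.]", replacement="_", x=addr)   # first dot only
all_dots <- gsub(pattern="[.]", replacement="_", x=addr)   # every dot
```

The second address has two dots, so `first_dot` and `all_dots` differ only for that element.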
## Predefined character classes to use in regular expressions:

| Class | Matches |
|-------|---------|
| [:lower:] | Lower-case letters |
| [:upper:] | Upper-case letters |
| [:alpha:] | Alphabetic characters: [:lower:] and [:upper:] |
| [:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9 |
| [:alnum:] | Alphanumeric characters: [:alpha:] and [:digit:] |
| [:print:] | Printable characters: [:alnum:], [:punct:] and space. |
| [:punct:] | Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \\ ] ^ _ ` { \| } ~ |
| [:blank:] | Blank characters: space and tab |

* Take the previous character vector containing email addresses:
* Remove the @ and the email provider from each address
```{r}
gsub(pattern="@[[:lower:][:punct:]]+",
replacement="",
x=vec_ad)
```
* Same as above, but also remove any digits directly BEFORE the @ (if any):
```{r}
gsub(pattern="[[:digit:]]*@[[:lower:][:punct:]]+",
replacement="",
x=vec_ad)
```
* Same, but simplified by matching any printable character after the @ with **[[:print:]]**:
```{r}
gsub(pattern="[[:digit:]]*@[[:print:]]+",
replacement="",
x=vec_ad)
```
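
The predefined classes are written inside a bracket expression, which is why they appear with double brackets (for example `[[:digit:]]`) in the patterns above. The sketch below reuses those patterns on a stand-in vector; the addresses in `addr` are invented placeholders, not the contents of `vec_ad`.

```{r}
# Hypothetical addresses, for illustration only (not the original vec_ad)
addr <- c("anna22@example.es", "marco@mail.example.it", "lee7@example.com")

# Remove optional digits, the @ and everything after it
user_names <- gsub(pattern="[[:digit:]]*@[[:lower:][:punct:]]+",
                   replacement="",
                   x=addr)

# Remove digits only, keeping the rest of the address
no_digits <- gsub(pattern="[[:digit:]]", replacement="", x=addr)

# Keep only the addresses that contain at least one digit
with_digit <- grep(pattern="[[:digit:]]", x=addr, value=TRUE)
```

With these inputs, `user_names` contains "anna", "marco" and "lee".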
## Use grep and regular expressions to retrieve columns by their names
Example of a data frame:
```{r}
# Build data frame
df_regex <- data.frame(expression1=1:4,
expression2=2:5,
expression3=4:7,
annotation=LETTERS[1:4],
expression4=6:3,
average_expression=c(3.25, 3.75, 4.25, 4.75),
stringsAsFactors=FALSE)
# Indices of the column names that start with "expression"
grep(pattern="^expression",
x=colnames(df_regex))
# Select columns from df_regex if their names start with "expression"
df_regex[, grep(pattern="^expression", colnames(df_regex))]
```
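
A few variations on the same idea, as a sketch on a smaller made-up data frame (`df_small` and its values are assumptions for this example). It shows why the `^` anchor matters here, and uses `drop=FALSE` so that the result stays a data frame even when a single column matches.

```{r}
# Hypothetical data frame, for illustration only
df_small <- data.frame(expression1=1:4,
                       expression2=2:5,
                       annotation=LETTERS[1:4],
                       average_expression=c(1.5, 2.5, 3.5, 4.5),
                       stringsAsFactors=FALSE)

# Without the anchor, "average_expression" matches too
anywhere <- grep(pattern="expression", x=colnames(df_small), value=TRUE)

# With the anchor, only names starting with "expression" match
at_start <- grep(pattern="^expression", x=colnames(df_small), value=TRUE)

# Keep every column whose name does NOT start with "expression"
df_other <- df_small[, grep(pattern="^expression",
                            x=colnames(df_small),
                            invert=TRUE),
                     drop=FALSE]
```

`df_other` keeps the `annotation` and `average_expression` columns.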