
# Regular expressions

Regular expressions are tools to **describe patterns in strings**.

## Find simple matches with grep

* Find a pattern anywhere in the string (outputs the index of the element):

```{r}
# By default, outputs the index of the element matching the pattern
grep(pattern="Gen",
        x="Genomics")
```

* Show actual element where the pattern is found (instead of the index only) with **value=TRUE**:

```{r}
# Set value=TRUE
grep(pattern="Gen",
        x="Genomics",
        value=TRUE)
```

* Case-insensitive search with **ignore.case=TRUE**:

```{r}
# Enter the pattern in lower-case, but case is ignored
grep(pattern="gen",
        x="Genomics",
        value=TRUE,
        ignore.case=TRUE)
```

* Show the elements that DON'T match the pattern with **invert=TRUE**:

```{r}
# Returns the elements that do NOT match the pattern
# (here "Genomics" matches, so the result is empty)
grep(pattern="gen",
        x="Genomics",
        value=TRUE,
        ignore.case=TRUE,
        invert=TRUE)
```
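
The difference between the index output and the value output is easier to see on a vector with several elements. The following is a minimal sketch, assuming a hypothetical vector **omics** that is not part of the examples above; **grepl** is the base R companion of **grep** that returns one TRUE/FALSE per element.

```{r}
# Hypothetical vector, for illustration only
omics <- c("Genomics", "proteomics", "Genetics")

# Indices of the matching elements: 1 3
idx <- grep(pattern="Gen", x=omics)

# The matching elements themselves: "Genomics" "Genetics"
vals <- grep(pattern="Gen", x=omics, value=TRUE)

# One logical value per element: TRUE FALSE TRUE
hits <- grepl(pattern="Gen", x=omics)
```

The logical vector is convenient for filtering, for example `omics[hits]`.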

## Regular expressions to find more flexible patterns

<h4>Special characters used for pattern recognition:</h4>

| Character | Meaning |
|---|---|
| $ | Find pattern at the end of the string |
| ^ | Find pattern at the beginning of the string |
| {n} | The previous pattern should be found exactly n times |
| {n,m} | The previous pattern should be found between n and m times |
| + | The previous pattern should be found at least 1 time |
| * | The previous pattern may be found any number of times, including zero |
| ? | The previous pattern may be found zero or one time |
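
The quantifiers in the table can be compared side by side. This is a small sketch using a hypothetical vector **ids** (an assumption made for illustration); each call keeps the elements where "a" is followed by the stated number of "b" characters and then "c".

```{r}
# Hypothetical vector, for illustration only
ids <- c("ac", "abc", "abbc", "abbbc")

# ? : zero or one "b"            -> "ac" "abc"
q_optional <- grep(pattern="ab?c", x=ids, value=TRUE)

# * : zero or more "b"           -> all four elements
q_star <- grep(pattern="ab*c", x=ids, value=TRUE)

# + : one or more "b"            -> "abc" "abbc" "abbbc"
q_plus <- grep(pattern="ab+c", x=ids, value=TRUE)

# {2} : exactly two "b"          -> "abbc"
q_exact <- grep(pattern="ab{2}c", x=ids, value=TRUE)

# {2,3} : between two and three  -> "abbc" "abbbc"
q_range <- grep(pattern="ab{2,3}c", x=ids, value=TRUE)
```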

<h4>Match your own pattern inside **[]**</h4>

\[abc\]: matches a, b, or c.<br>
^\[abc\]: matches a, b or c at the beginning of the element.<br>
^A\[abc\]+: matches A as the first character of the element, followed by one or more of a, b or c<br>
^A\[abc\]*: matches A as the first character of the element, optionally followed by any number of a, b or c<br>
^A\[abc\]{1}_: matches A as the first character of the element, then either a, b or c (one time!) followed by an underscore<br>

\[a-z\]: matches any single lower-case letter from a to z.<br>
\[A-Z\]: matches any single upper-case letter from A to Z.<br>
\[0-9\]: matches any single digit from 0 to 9.<br>
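
The bracket patterns listed above can be tried on a short vector. This is a sketch under the assumption of a hypothetical vector **samples** of sample names, invented here for illustration.

```{r}
# Hypothetical sample names, for illustration only
samples <- c("Aa_1", "Ab_2", "Abc_3", "Ba_4", "A_5")

# A at the start, then one or more of a, b, c
# -> "Aa_1" "Ab_2" "Abc_3"
br_plus <- grep(pattern="^A[abc]+", x=samples, value=TRUE)

# A at the start, then zero or more of a, b, c
# -> "Aa_1" "Ab_2" "Abc_3" "A_5"
br_star <- grep(pattern="^A[abc]*", x=samples, value=TRUE)

# A at the start, then exactly one of a, b, c, then an underscore
# -> "Aa_1" "Ab_2"
br_one <- grep(pattern="^A[abc]{1}_", x=samples, value=TRUE)

# Upper-case letter at the start and a digit at the end
# -> all five elements
br_range <- grep(pattern="^[A-Z].*[0-9]$", x=samples, value=TRUE)
```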


* Match anything contained between brackets (here either g or t) at least once:

```{r}
grep(pattern="[gt]+",
        x=c("genomics", "proteomics", "transcriptomics"),
        value=TRUE)
```

* Match anything contained between brackets at least once AND at the start of the element:

```{r}
grep(pattern="^[gt]+",
        x=c("genomics", "proteomics", "transcriptomics"),
        value=TRUE)
```

* **Create a vector of email addresses:**

```{r}
vec_ad <- c("marie.curie@yahoo.es", "albert.einstein01@hotmail.com",
        "charles.darwin1809@gmail.com", "rosalind.franklin@aol.it")
```

* Keep only email addresses ending with "es":

```{r}
grep(pattern="es$",
        x=vec_ad,
        value=TRUE)
```
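
Note that "es$" matches any element ending in "es", not only the ".es" domain. To require a literal dot, the dot has to be escaped, because an unescaped dot matches any character. The sketch below assumes a hypothetical vector **addresses** (the second element is invented to show the difference).

```{r}
# Hypothetical addresses, for illustration only
addresses <- c("marie.curie@yahoo.es", "info@courses", "rosalind.franklin@aol.it")

# Ends with "es": also keeps "info@courses"
ends_es <- grep(pattern="es$", x=addresses, value=TRUE)

# Ends with a literal ".es": the backslash is doubled inside an R string
ends_dot_es <- grep(pattern="\\.es$", x=addresses, value=TRUE)
```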

## Substitute or remove matching patterns with gsub

From the same vector of email addresses:

* Remove the "@" symbol and the email provider from each address

```{r}
gsub(pattern="@[a-z.]+",
        replacement="",
        x=vec_ad)
```

* Substitute the "@" symbol with "_at_"

```{r}
gsub(pattern="@",
        replacement="_at_",
        x=vec_ad)
```

* Replace "es" or "it" at the end of the address with "eu"

```{r}
gsub(pattern="es$|it$",
        replacement="eu",
        x=vec_ad)
```
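
**gsub** replaces every match in each element, whereas its base R sibling **sub** replaces only the first one. The sketch below illustrates the difference on a single address, and also shows that the alternation above can be written with a group, "(es|it)$". The variable names are assumptions chosen for this example.

```{r}
# One address, for illustration
address <- "marie.curie@yahoo.es"

# sub: only the first dot is replaced -> "marie_curie@yahoo.es"
first_only <- sub(pattern="\\.", replacement="_", x=address)

# gsub: every dot is replaced -> "marie_curie@yahoo_es"
all_dots <- gsub(pattern="\\.", replacement="_", x=address)

# Grouped alternation, equivalent to "es$|it$"
to_eu <- gsub(pattern="(es|it)$",
        replacement="eu",
        x=c("marie.curie@yahoo.es", "rosalind.franklin@aol.it"))
```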

## Predefined character classes to use in regular expressions

| Class | Matches |
|---|---|
| [:lower:] | Lower-case letters |
| [:upper:] | Upper-case letters |
| [:alpha:] | Alphabetic characters: [:lower:] and [:upper:] |
| [:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9 |
| [:alnum:] | Alphanumeric characters: [:alpha:] and [:digit:] |
| [:print:] | Printable characters: [:alnum:], [:punct:] and space. |
| [:punct:] | Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { &#124; } ~ |
| [:blank:] | Blank characters: space and tab |

These classes must be written inside a bracket expression, for example \[\[:digit:\]\].


* Take the previous character vector containing email addresses:
	* Remove the @ and the email provider from each address
```{r}
gsub(pattern="@[[:lower:][:punct:]]+",
        replacement="",
        x=vec_ad)
```
	* Same, but also remove any digit(s) immediately BEFORE the @ (if any):
```{r}
gsub(pattern="[[:digit:]]*@[[:lower:][:punct:]]+",
        replacement="",
        x=vec_ad)
```
	* Same, but simplified with \[:print:\], which matches any printable character:
```{r}
gsub(pattern="[[:digit:]]*@[[:print:]]+",
        replacement="",
        x=vec_ad)
```
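
The character classes can also be combined with the anchors introduced earlier. This is a brief sketch, assuming a hypothetical vector **vec_users** holding only the part of each address before the @ symbol.

```{r}
# Hypothetical user names, for illustration only
vec_users <- c("albert.einstein01", "charles.darwin1809", "marie.curie")

# Remove one or more digits at the end of each element
# -> "albert.einstein" "charles.darwin" "marie.curie"
no_digits <- gsub(pattern="[[:digit:]]+$",
        replacement="",
        x=vec_users)

# Replace every punctuation character with a space
# -> "albert einstein01" "charles darwin1809" "marie curie"
spaced <- gsub(pattern="[[:punct:]]",
        replacement=" ",
        x=vec_users)

# Keep the elements that start with an upper-case letter -> "Genomics"
capitalised <- grep(pattern="^[[:upper:]]",
        x=c("Genomics", "proteomics"),
        value=TRUE)
```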

## Use grep and regular expressions to retrieve columns by their names

Example of a data frame:

```{r}
# Build data frame
df_regex <- data.frame(expression1=1:4,
        expression2=2:5,
        expression3=4:7,
        annotation=LETTERS[1:4],
        expression4=6:3,
        average_expression=c(3.25, 3.75, 4.25, 4.75),
        stringsAsFactors=FALSE)

# Select column names that start with "expression"
grep(pattern="^expression",
        x=colnames(df_regex))

# Select columns from df_regex if their names start with "expression"
df_regex[, grep(pattern="^expression", colnames(df_regex))]
```
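
The "^" anchor matters here: without it, "average_expression" would be selected as well. The sketch below shows this on a smaller, hypothetical data frame **df_demo** (an assumption for illustration). It also uses **drop=FALSE**, a standard argument of data frame subsetting, so that the result stays a data frame even when a single column matches.

```{r}
# Hypothetical data frame, for illustration only
df_demo <- data.frame(expression1=1:3,
        annotation=letters[1:3],
        average_expression=c(1.5, 2.5, 3.5),
        stringsAsFactors=FALSE)

# Anchored: only names that START with "expression" -> "expression1"
anchored <- grep(pattern="^expression", x=colnames(df_demo), value=TRUE)

# Not anchored: "expression" anywhere in the name
# -> "expression1" "average_expression"
unanchored <- grep(pattern="expression", x=colnames(df_demo), value=TRUE)

# A single matching column would be simplified to a vector;
# drop=FALSE keeps the data frame structure
sub_df <- df_demo[, grep(pattern="^expression", x=colnames(df_demo)), drop=FALSE]
```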