string-processing.Rmd

String Processing
======================================================================================
Manipulating Text Data in R and Python, including Regular Expressions 
----------------------------------------------------------

Chris Paciorek, Department of Statistics, UC Berkeley

```{r setup, include=FALSE}
library(knitr)
library(stringr)
read_chunk('string-processing.R')
read_chunk('string-processing.py')
```

# 0) This Tutorial

This tutorial covers tools for manipulating text data in R and Python, including the use of regular expressions. We also briefly discuss tools for reading and manipulating formatted text files such as HTML, XML, and JSON. At the moment, this tutorial is somewhat more focused on R than Python, but we hope to flesh out the Python sections more fully in the future.

If you have a standard R or Python installation and can install the *stringr* package for R and the *re* package for Python, you should be able to reproduce the results in this document.

This tutorial was originally developed using a virtual machine developed here at Berkeley, the [Berkeley Common Environment (BCE)](http://bce.berkeley.edu). BCE is a virtual Linux machine - basically it is a Linux computer that you can run within your own computer, regardless of whether you are using Windows, Mac, or Linux. This provides a common environment so that things behave the same for all of us. However, BCE has not been updated in a while, so I don't suggest you use it at this point in time.

This tutorial assumes you have a working knowledge of R or Python. 

Materials for this tutorial, including the R markdown file and associated code files that were used to create this document are available on Github at (https://github.com/berkeley-scf/tutorial-string-processing).  You can download the files by doing a git clone from a terminal window on a UNIX-like machine, as follows:
```{r, clone, eval=FALSE}
git clone https://github.com/berkeley-scf/tutorial-string-processing
```

To create this HTML document, simply compile the corresponding R Markdown file in R as follows (the following will work from within BCE after cloning the repository as above).
```{r, build-html, eval=FALSE}
Rscript -e "library(knitr); knit2html('string-processing.Rmd')"
```
This tutorial by Christopher Paciorek is licensed under a Creative Commons Attribution 3.0 Unported License.


# 1) Background

Text manipulations in R, Python, Perl, and bash have a number of things 
in common, as many of these evolved from UNIX. When I use the
term *string* here, I'll be referring to any sequence of characters
that may include numbers, white space, and special characters. Note that in R
a character vector is a vector of one or more such strings.  

# 2) Basic text manipulation 

Some of the basic things we need to do are paste/concatenate strings together,
split strings apart, take subsets of strings, and replace characters within strings.
Often these operations are done based on patterns rather than a fixed string 
sequence. This involves the use of regular expressions, covered in Section 3.

## 2.1) R

In general, strings in R are stored in character vectors. R's functions for string manipulation are fully vectorized and will work on all of the strings in a vector at once.

Here's a [cheatsheet from RStudio](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf) on manipulating strings in R.

### 2.1.1) String manipulation in base R

A few of the basic R functions for manipulating strings are *paste*,
*strsplit*, and *substring*. *paste* and *strsplit*
are basically inverses of each other: *paste* concatenates
together an arbitrary set of strings (or a vector, if using the *collapse*
argument) with a user-specified separator character, while *strsplit*
splits apart based on a delimiter/separator. *substring*
splits apart the elements of a character vector based on fixed widths.
*nchar* returns the number of characters in a string.
Note that all of these operate in a vectorized fashion.

```{r, r-basics1}
```

Note that *strsplit* returns a list because it can operate
on a character vector (i.e., on multiple strings).

```{r, r-basics2}
```


To identify particular subsequences in strings, there are several
related R functions. *grep* will look for a specified string
within an R character vector and report back indices identifying the
elements of the vector in which the string was found. Note that using the
`fixed=TRUE` argument ensures that regular expressions are NOT
used. *gregexpr* will indicate the position in each string
that the specified string is found (use *regexpr* if you only
want the first occurrence). *gsub* can be used to replace a
specified string with a replacement string (use *sub* if you
only want to replace only the first occurrence). 

```{r, r-pattern}
```

### 2.1.2) String manipulation using *stringr*

The *stringr* package wraps the various core string manipulation
functions to provide a common interface. It also removes some of the
clunkiness involved in some of the string operations with the base
string functions, such as having to to call *gregexpr* and
then *regmatches* to pull out the matched strings. In general, I'd suggest using *stringr* functions
in place of R's base string functions.

Here's 

First let's see *stringr*'s versions of some of the base string functions mentioned in the previous sections.

```{r, r-stringr}
```

The basic interface to *stringr* functions is `function(strings, pattern, [replacement])`. 

Table 1 provides an overview of the key functions related to working with patterns, which are basically
wrappers for *grep*, *gsub*, *gregexpr*, etc.


|  Function                         | What it does
|-----------------------------------|---------------------------------------------------------------------
| str_detect                        |                 detects pattern, returning TRUE/FALSE
| str_count                         |                 counts matches
| str_locate/str_locate_all         |                  detects pattern, returning positions of matching characters
| str_extract/str_extract_all       |                  detects pattern, returning matches
| str_replace/str_replace_all       |                  detects pattern and replaces matches

The analog of *regexpr* vs. *gregexpr* and *sub*
vs. *gsub* is that most of the functions have versions that
return all the matches, not just the first match, e.g. *str_locate_all*
*str_extract_all*, etc. Note that the *_all* functions return
lists while the non-*_all* functions return vectors.

To specify options, you can wrap these functions around the pattern
argument: `fixed(pattern, ignore_case)` and `regex(pattern, ignore_case)`.
The default is *regex*, so you only need to specify that if you also want to 
specify additional arguments, such as *ignore_case* or others listed under `help(regex)` (invoke the help after loading *stringr*)

Here's an example:
```{r, stringr-example}
```

## 2.2) Basic text manipulation in Python

Let's see basic concatenation, splitting, working with substrings, and searching/replacing 
substrings. Notice that Python's string functionality is object-oriented (though *len* is not).
Note: apologies for all the extra print statements in the code - this is required when running Python
chunks in R Markdown.

```{r, py-basics, engine='python'}
```

In IPython, hitting tab after typing `out.` when *out* is a string will show the full suite of string-related methods.

Unlike in R, you cannot use the string methods directly on a list or tuple of strings, but you of course can do things like list comprehension to easily process multiple strings.

Working with substrings relies on the fact that Python works with strings as if they are vectors of individual characters.

```{r, py-substrings, engine='python'}
```


However strings are immutable - you cannot alter a subset of characters in the string. Another option is to work with strings as lists.

```{r, py-lists, engine='python'}
```

Now let's consider finding substrings.

```{r, py-find-substrings, engine='python'}
```


# 3) Regular expressions (regex/regexp)

Regular expressions are a domain-specific language for
finding patterns and are one of the key functionalities in scripting
languages such as Perl and Python, as well as the UNIX utilities *sed*,
*awk* and *grep*. 

The basic idea of regular expressions is that they allow us to find
matches of strings or patterns in strings, as well as do substitution.
Regular expressions are good for tasks such as:
 - extracting pieces of text - for example finding all the links in an html document;
 -  creating variables from information found in text;
 -  cleaning and transforming text into a uniform format;
 -  mining text by treating documents as data; and
 -  scraping the web for data.

Please look at Section 3 of [our tutorial on using the bash shell](https://github.com/berkeley-scf/tutorial-using-bash) to learn the regular expression syntax that we'll use in the remainder of this tutorial. For other resources, Duncan Temple Lang (UC Davis Statistics)
has written a nice tutorial that is part of the repository for this tutorial 
or check out Sections 9.9 and 11 of [Paul Murrell's book](http://www.stat.auckland.ac.nz/~paul/ItDT)

Also, here's a [cheatsheet on regular expressions](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf) and here is a [website where you can interactively test regular expressions on example strings](https://regex101.com).

## 3.1) Versions of regular expressions

One thing that can cause headaches is differences in version of regular expression syntax used. As discussed in `man grep`, *extended regular expressions* are standard, with *basic regular expressions* providing somewhat less functionality and *Perl regular expressions* additional functionality. 
As can be seen in `help(regex)`, In R, *stringr* provides *ICU regular expressions*, which are based on Perl regular expressions. More details can be found in the [regex Wikipedia page](https://en.wikipedia.org/wiki/Regular_expression).

The tutorial on using bash provides a full documentation of the various *extended regular expressions* syntax, which we'll focus on here. This should be sufficient for most usage and should be usable in R and Python, but if you notice something funny going on, it might be due to differences between the regular expressions versions. 

## 3.2) General principles for working with regex

The syntax is very concise, so it's helpful to break down
individual regular expressions into the component parts to understand
them. As Murrell notes, since regex are their own language, it's
a good idea to build up a regex in pieces as a way of avoiding errors
just as we would with any computer code. *str_detect* in R's *stringr* and *re.findall* in Python are particularly
useful in seeing *what* was matched to help in understanding
and learning regular expression syntax and debugging your regex. As with
many kinds of coding, I find that debugging my regex is usually what takes
most of my time.

## 3.3) Using regex in R


The *grep*, *gregexpr* and *gsub* functions and
their *stringr* analogs are more powerful when used with regular
expressions. In the following examples, we'll illustrate usage of *stringr* functions, but 
with their base R analogs as comments.

### 3.3.1) Working with patterns

First let's see the use of character sets and character classes.

```{r, detect1}
```

```{r, detect2}
```

Challenge: how would we find a spam-like pattern with digits or non-letters inside a word? E.g., I want to find "V1agra" or "Fancy repl!c@ted watches".

Next let's consider location-specific matches.

```{r, location}
```


Now let's make use of repetitions.

Let's search for US/Canadian/Caribbean phone numbers in the example text we've been using: 


```{r, repetitions}
```

Challenge: How would I extract an email address from an arbitrary text string?

Next consider grouping.

For example, the phone number detection problem could have been done a bit more compactly (and more generally, in case the area code is omitted or a 1 is included) as:

```{r, grouping}
```

Challenge: the above pattern would actually match something that is not a valid phone number. What can go wrong?

Here's a basic example of using grouping via parentheses with the OR operator.

```{r, grouping2}
```

Parentheses are also used for referencing back to a detected pattern when doing a replacement. For example, here we'll find any numbers and add underscores before and after them:

```{r, references-basic}
```

Here we'll remove commas not used as field separators.

```{r, references}
```

Challenge: Suppose a text string has dates in the form “Aug-3”, “May-9”, etc. and I want them in the form “3 Aug”, “9 May”, etc. How would I do this search/replace?

Finally let's consider greedy matching.

```{r, greedy}
```


What went wrong?

One solution is to append a ? to the repetition syntax to cause the matching to be non-greedy. Here's an example.

```{r, greedy2}
```

However, one can often avoid greedy matching by being more clever. 

Challenge: How could we change our regexp to avoid the greedy matching without using the “?”?

### 3.3.2 Other comments

If we are working with newlines embedded in a string, we can include the newline character as a regular character that is matched by a “.” by first creating the regular expression with *stringr::regex* with the *dotall* argument set to `TRUE`:

```{r, newlines}
```

Regular expression can be used in a variety of places. E.g., to split by any number of white space characters

```{r, split}
```

Using backslashes to 'escape' particular characters can be tricky. One rule of thumb is to just keep adding backslashes until you get what you want!

```{r, escaping}
```


## 3.4) Using regex in Python

### 3.4.1) Working with patterns

For working with regex in Python, we'll need the *re* package. It provides Perl-style regular expressions, but it doesn't seem to support named character classes such as `[:digit:]`. Instead use classes such as `\d` and `[0-9]`.

Again, in the code chunks that follow, all the explicit print statements are needed for R Markdown to print out the values.

In Python, you apply a matching function and then query the result to get information about what was matched and where in the string. 

```{r, py-search, engine='python'}
```

```{r, py-search2, engine='python'}
```

To ignore case, do the following:
```{r, py-ignore-case, engine='python'}
```

We can of course use list comprehension to work with multiple strings. But we need to be careful to check whether a match was found.

```{r, py-list-comp, engine='python'}
```

Next, let's look at replacing patterns.

```{r, py-replace1, engine='python'}
```

```{r, py-replace2, engine='python'}
```

Finally, let's see the consequences of greedy matching and use of `?` to avoid greeding matching.

```{r, py-greedy, engine='python'}
```

### 3.4.2) Other comments

You can also compile regex patterns for faster processing when working with a pattern multiple times.

```{r, py-compile, engine='python'}
```

For reasons explained in the [re documentation](https://docs.python.org/2/howto/regex.html), to match an actual backslash, such as `"\section"`, you'd need `"\\\\section"`. This can be avoided by using raw strings: `r"\\section"`.

Here are some more examples of escaping characters.

```{r, py-escape, engine='python'}
```


# 4) Webscraping and formatted text files

There are lots of packages in both R and Python for downloading, reading in, and parsing formatted text files, such as HTML, XML, and JSON, as well as for submitting HTTP requests. So make sure you don't reinvent the wheel of what is already out there. 

For R, *httr* and *RCurl* are good for interacting with webpages and *rvest* and *XML2* for reading/parsing HTML and XML files. *jsonlite* is good for reading/parsing JSON. See the [WebTechnologies CRAN task view](https://cran.r-project.org/web/views/WebTechnologies.html) for a summary.

For Python, the *requests* module is good for interacting with webpages. For XML and HTML, *lxml* is good, in particular *lxml.etree* and *lxml.html*.  For HTML, one might also use *HTMLParser* in Python 2.7 and *html.parser* in Python 3, and for XML there is also the *xml* module. *json* is good for reading/parsing JSON files.