title | layout |
---|---|
Basic text manipulation in R and Python |
default |
Text manipulations in R, Python, Perl, and bash have a number of things in common, as many of these evolved from UNIX. When I use the term string here, I’ll be referring to any sequence of characters that may include numbers, white space, and special characters. Note that in R a character vector is a vector of one or more such strings.
Some of the basic things we need to do are paste/concatenate strings together, split strings apart, take subsets of strings, and replace characters within strings. Often these operations are done based on patterns rather than a fixed string sequence. This involves the use of regular expressions, covered in Section 3.
In general, strings in R are stored in character vectors. R’s functions for string manipulation are fully vectorized and will work on all of the strings in a vector at once.
Here’s a cheatsheet from
RStudio
on manipulating strings using the stringr
package in R.
A few of the basic R functions for manipulating strings are paste
,
strsplit
, and substring
. paste
and strsplit
are basically
inverses of each other:
paste
concatenates together an arbitrary set of strings (or a vector, if using thecollapse
argument) with a user-specified separator characterstrsplit
splits apart based on a delimiter/separatorsubstring
splits apart the elements of a character vector based on fixed widthsnchar
returns the number of characters in a string.
Note that all of these operate in a vectorized fashion.
out <- paste("My", "name", "is", "Chris", ".", sep = " ")
paste(c("My", "name", "is", "Chris", "."), collapse = " ") # equivalent
## [1] "My name is Chris ."
nchar(out)
## [1] 18
strsplit(out, split = ' ')
## [[1]]
## [1] "My" "name" "is" "Chris" "."
Note
Some string processing functions (such as
strsplit
above) can return multiple values for each input string (each element of the character vector). As a result, the functions will return a list, which will be a list with one element when the function operates on a single string.out <- c("Her name is Maya", "Hello everyone") strsplit(out, split = ' ')## [[1]] ## [1] "Her" "name" "is" "Maya" ## ## [[2]] ## [1] "Hello" "everyone"
Here are some examples of using substring
:
times <- c("04:18:04", "12:12:53", "13:47:00")
substring(times, 7, 8)
## [1] "04" "53" "00"
substring(times[3], 1, 2) <- '01' ## replacement
times
## [1] "04:18:04" "12:12:53" "01:47:00"
To identify particular subsequences in strings, there are several
closely-related R functions. grep
will look for a specified string
within an R character vector and report back indices identifying the
elements of the vector in which the string was found. Note that using
the fixed=TRUE
argument ensures that regular expressions are NOT used.
gregexpr
will indicate the position in each string that the specified
string is found (use regexpr
if you only want the first occurrence).
gsub
can be used to replace a specified string with a replacement
string (use sub
if you only want to replace only the first
occurrence).
dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
grep("2016", dates)
## [1] 1 3
gregexpr("2016", dates)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
gsub("2016", "16", dates)
## [1] "16-08-03" "2007-09-05" "16-01-02"
The stringr
package wraps the various core string manipulation
functions to provide a common interface. It also removes some of the
clunkiness involved in some of the string operations with the base
string functions, such as having to to call gregexpr
and then
regmatches
to pull out the matched strings. In general, I’d suggest
using stringr
functions in place of R’s base string functions.
First let’s see stringr
’s versions of some of the base R string
functions mentioned in the previous sections.
The basic interface to stringr
functions is
function(character_vector, pattern, [replacement])
.
Table 1 provides an overview of the key functions related to working
with patterns, which are basically wrappers for grep
, gsub
,
gregexpr
, etc.
Function | What it does |
---|---|
str_detect | detects pattern, returning TRUE/FALSE |
str_count | counts matches |
str_locate/str_locate_all | detects pattern, returning positions of matching characters |
str_extract/str_extract_all | detects pattern, returning matches |
str_replace/str_replace_all | detects pattern and replaces matches |
The analog of regexpr
vs. gregexpr
and sub
vs. gsub
is that most
of the functions have versions that return all the matches, not just the
first match, e.g. str_locate_all
str_extract_all
, etc. Note that the
_all
functions return lists while the non-_all
functions return
vectors.
To specify options, you can wrap these functions around the pattern
argument: fixed(pattern, ignore_case)
and
regex(pattern, ignore_case)
. The default is regex
, so you only need
to specify that if you also want to specify additional arguments, such
as ignore_case
or others listed under help(regex)
(invoke the help
after loading stringr
)
Here’s an example:
library(stringr)
str <- c("Apple Computer", "IBM", "Apple apps")
str_detect(str, fixed("app", ignore_case = TRUE))
## [1] TRUE FALSE TRUE
str_count(str, fixed("app", ignore_case = TRUE))
## [1] 1 0 2
str_locate(str, fixed("app", ignore_case = TRUE))
## start end
## [1,] 1 3
## [2,] NA NA
## [3,] 1 3
str_locate_all(str, fixed("app", ignore_case = TRUE))
## [[1]]
## start end
## [1,] 1 3
##
## [[2]]
## start end
##
## [[3]]
## start end
## [1,] 1 3
## [2,] 7 9
dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
str_locate(dates, "20[^0][0-9]") ## regular expression: years begin in 2010
## start end
## [1,] 1 4
## [2,] NA NA
## [3,] 1 4
str_extract_all(dates, "20[^0][0-9]")
## [[1]]
## [1] "2016"
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "2016"
str_replace_all(dates, "20[^0][0-9]", "XXXX")
## [1] "XXXX-08-03" "2007-09-05" "XXXX-01-02"
Let’s see basic concatenation, splitting, working with substrings, and
searching/replacing substrings. Notice that Python’s string
functionality is object-oriented (though len
is not).
Here, We’ll just cover the basic methods for the str
type. There’s
lots of additional functionality for working with strings in the re
package, discussed here in this
tutorial.
Of course in many cases of working with strings, one would need the full
power of regular expressions to do what one needs to do.
First let’s look at combining/concatenating strings. We can do this with
the +
operator or using the join
method, which is (perhaps
confusingly) called based on the separator of interest with the input
strings as arguments.
out = "My" + "name" + "is" + "Chris" + "."
out
## 'MynameisChris.'
out = ' '.join(("My", "name", "is", "Chris", "."))
out
## 'My name is Chris .'
len
simply returns the number of characters in the string.
len(out)
## 18
out.split(' ')
## ['My', 'name', 'is', 'Chris', '.']
To see the various string methods, we can hit tab after typing str.
or
based on any specific string:
out.
out.capitalize() out.index( out.isspace() out.removesuffix( out.startswith(
out.casefold() out.isalnum() out.istitle() out.replace( out.strip(
out.center( out.isalpha() out.isupper() out.rfind( out.swapcase()
out.count( out.isascii() out.join( out.rindex( out.title()
out.encode( out.isdecimal() out.ljust( out.rjust( out.translate(
out.endswith( out.isdigit() out.lower() out.rpartition( out.upper()
out.expandtabs( out.isidentifier() out.lstrip( out.rsplit( out.zfill(
out.find( out.islower() out.maketrans( out.rstrip(
out.format( out.isnumeric() out.partition( out.split(
out.format_map( out.isprintable() out.removeprefix( out.splitlines(
Unlike in R, you cannot use the string methods directly on a list or tuple of strings, but you of course can do things like list comprehension to easily process multiple strings.
Working with substrings relies on the fact that Python works with strings as if they are vectors of individual characters.
var = "13:47:00"
var[3:5]
## '47'
However strings are immutable - you cannot alter a subset of characters in the string. Another option is to work with strings as lists.
var[0:2] = "01"
## Error: TypeError: 'str' object does not support item assignment
Now let’s consider finding substrings. Here Python tells us that ‘2016’ starts in the 6th position in the first and third elements (with 0-based indexing).
var = "08-03-2016"
var.find("2016")
## 6
We can count occurrences with .count()
:
var = "08-03-2016; 07-09-2016"
var.count("2016")
## 2
And we can replace like this:
var = "13:47:00"
var.replace("13", "01")
## '01:47:00'