-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathtext-manipulation.Rmd
246 lines (177 loc) · 8.78 KB
/
text-manipulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
---
title: Basic text manipulation in R and Python
layout: default
---
Text manipulations in R, Python, Perl, and bash have a number of things
in common, as many of these evolved from UNIX. When I use the
term *string* here, I'll be referring to any sequence of characters
that may include numbers, white space, and special characters. Note that in R
a character vector is a vector of one or more such strings.
Some of the basic things we need to do are paste/concatenate strings together,
split strings apart, take subsets of strings, and replace characters within strings.
Often these operations are done based on patterns rather than a fixed string
sequence. This involves the use of regular expressions, covered in Section 3.
## 1 R
In general, strings in R are stored in character vectors. R's functions for string manipulation are fully vectorized and will work on all of the strings in a vector at once.
Here's a [cheatsheet from RStudio](https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf) on manipulating strings using the `stringr` package in R.
### 1.1 String manipulation in base R
A few of the basic R functions for manipulating strings are `paste`,
`strsplit`, and `substring`. `paste` and `strsplit`
are basically inverses of each other:
- `paste` concatenates together an arbitrary set of strings (or a vector, if using the `collapse` argument) with a user-specified separator character
- `strsplit` splits apart based on a delimiter/separator
- `substring` splits apart the elements of a character vector based on fixed widths
- `nchar` returns the number of characters in a string.
Note that all of these operate in a vectorized fashion.
```{r}
out <- paste("My", "name", "is", "Chris", ".", sep = " ")
paste(c("My", "name", "is", "Chris", "."), collapse = " ") # equivalent
nchar(out)
strsplit(out, split = ' ')
```
> *Note*
>
> Some string processing functions (such as `strsplit` above) can return multiple values for each input string (each element of the character vector). As a result, the functions will return a list, which will be a list with one element when the function operates on a single string.
>
> ```{r}
> out <- c("Her name is Maya", "Hello everyone")
> strsplit(out, split = ' ')
> ```
Here are some examples of using `substring`:
```{r}
times <- c("04:18:04", "12:12:53", "13:47:00")
substring(times, 7, 8)
```
```{r}
substring(times[3], 1, 2) <- '01' ## replacement
times
```
To identify particular subsequences in strings, there are several
closely-related R functions. `grep` will look for a specified string
within an R character vector and report back indices identifying the
elements of the vector in which the string was found. Note that using the
`fixed=TRUE` argument ensures that regular expressions are NOT
used. `gregexpr` will indicate the position in each string
that the specified string is found (use `regexpr` if you only
want the first occurrence). `gsub` can be used to replace a
specified string with a replacement string (use `sub` if you
only want to replace only the first occurrence).
```{r}
dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
grep("2016", dates)
```
```{r}
gregexpr("2016", dates)
```
```{r}
gsub("2016", "16", dates)
```
### 1.2 String manipulation using `stringr`
The `stringr` package wraps the various core string manipulation
functions to provide a common interface. It also removes some of the
clunkiness involved in some of the string operations with the base
string functions, such as having to to call `gregexpr` and
then `regmatches` to pull out the matched strings. In general, I'd suggest using `stringr` functions
in place of R's base string functions.
First let's see `stringr`'s versions of some of the base R string functions mentioned in the previous sections.
The basic interface to `stringr` functions is `function(character_vector, pattern, [replacement])`.
Table 1 provides an overview of the key functions related to working with patterns, which are basically
wrappers for `grep`, `gsub`, `gregexpr`, etc.
| Function | What it does
|-----------------------------------|---------------------------------------------------------------------
| str_detect | detects pattern, returning TRUE/FALSE
| str_count | counts matches
| str_locate/str_locate_all | detects pattern, returning positions of matching characters
| str_extract/str_extract_all | detects pattern, returning matches
| str_replace/str_replace_all | detects pattern and replaces matches
The analog of `regexpr` vs. `gregexpr` and `sub`
vs. `gsub` is that most of the functions have versions that
return all the matches, not just the first match, e.g. `str_locate_all`
`str_extract_all`, etc. Note that the `_all` functions return
lists while the non-`_all` functions return vectors.
To specify options, you can wrap these functions around the pattern
argument: `fixed(pattern, ignore_case)` and `regex(pattern, ignore_case)`.
The default is `regex`, so you only need to specify that if you also want to
specify additional arguments, such as `ignore_case` or others listed under `help(regex)` (invoke the help after loading `stringr`)
Here's an example:
```{r}
library(stringr)
str <- c("Apple Computer", "IBM", "Apple apps")
str_detect(str, fixed("app", ignore_case = TRUE))
str_count(str, fixed("app", ignore_case = TRUE))
```
```{r}
str_locate(str, fixed("app", ignore_case = TRUE))
str_locate_all(str, fixed("app", ignore_case = TRUE))
```
```{r}
dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
str_locate(dates, "20[^0][0-9]") ## regular expression: years begin in 2010
```
```{r}
str_extract_all(dates, "20[^0][0-9]")
str_replace_all(dates, "20[^0][0-9]", "XXXX")
```
## 2 Basic text manipulation in Python
Let's see basic concatenation, splitting, working with substrings, and searching/replacing
substrings. Notice that Python's string functionality is object-oriented (though `len` is not).
Here, We'll just cover the basic methods for the `str` type. There's lots of additional functionality for working with strings in the `re` package, discussed [here in this tutorial](http://berkeley-scf.github.io/tutorial-string-processing/regex#6-using-regex-in-python). Of course in many cases of working with strings, one would need the full power of regular expressions to do what one needs to do.
First let's look at combining/concatenating strings. We can do this with the `+` operator
or using the `join` method, which is (perhaps confusingly) called based on the separator
of interest with the input strings as arguments.
```{python}
out = "My" + "name" + "is" + "Chris" + "."
out
```
```{python}
out = ' '.join(("My", "name", "is", "Chris", "."))
out
```
`len` simply returns the number of characters in the string.
```{python}
len(out)
```
```{python}
out.split(' ')
```
To see the various string methods, we can hit tab after typing `str.` or based on any specific string:
```{python, eval=FALSE}
out.
```
```
out.capitalize() out.index( out.isspace() out.removesuffix( out.startswith(
out.casefold() out.isalnum() out.istitle() out.replace( out.strip(
out.center( out.isalpha() out.isupper() out.rfind( out.swapcase()
out.count( out.isascii() out.join( out.rindex( out.title()
out.encode( out.isdecimal() out.ljust( out.rjust( out.translate(
out.endswith( out.isdigit() out.lower() out.rpartition( out.upper()
out.expandtabs( out.isidentifier() out.lstrip( out.rsplit( out.zfill(
out.find( out.islower() out.maketrans( out.rstrip(
out.format( out.isnumeric() out.partition( out.split(
out.format_map( out.isprintable() out.removeprefix( out.splitlines(
```
Unlike in R, you cannot use the string methods directly on a list or tuple of strings, but you of course can do things like list comprehension to easily process multiple strings.
Working with substrings relies on the fact that Python works with strings as if they are vectors of individual characters.
```{python}
var = "13:47:00"
var[3:5]
```
However strings are immutable - you cannot alter a subset of characters in the string. Another option is to work with strings as lists.
```{python, error=TRUE}
var[0:2] = "01"
```
Now let's consider finding substrings. Here Python tells us that '2016' starts in the 6th position in the first and third elements (with 0-based indexing).
```{python}
var = "08-03-2016"
var.find("2016")
```
We can count occurrences with `.count()`:
```{python}
var = "08-03-2016; 07-09-2016"
var.count("2016")
```
And we can replace like this:
```{python}
var = "13:47:00"
var.replace("13", "01")
```