-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpeb_tutorial.Rmd
345 lines (221 loc) · 11 KB
/
peb_tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
---
title: "PEB advanced R workshop"
author: "Giovanni M DallOlio"
date: "28/04/2015"
output: html_document
---
PEB advanced R workshop
---------------------------------
Welcome to the PEB advanced R workshop!
We will take a "messy" gene expression table, and use the tidyr library to restructure it in a format that is better suited for data analysis. We will also group the data and calculate summaries using the dplyr library, and finally plot it using ggplot2.
- http://dalloliogm.github.io/peb_workshop/
- https://github.com/dalloliogm/peb_workshop
Requirements
-------------
Let's install all the required packages:
```{r message=FALSE}
pkgs = c("tidyr", "dplyr", "ggplot2")
#for (pkg in pkgs) install.packages(pkg)
for (pkg in pkgs) library(pkg, character.only=T)
```
The tidyr package requires at least R 3.1.2. If you can't install it, you can use reshape2 instead, but you won't be able to follow some of the exercises.
We also need to download the following **data files**:
```{r eval=FALSE}
download.file('https://raw.githubusercontent.com/dalloliogm/peb_workshop/master/peb_expression.csv', dest='peb_expression.csv')
download.file('https://raw.githubusercontent.com/dalloliogm/peb_workshop/master/cancer_genes.csv', dest='cancer_genes.csv')
```
the peb_expression.csv file
-----------------------------
Let's read the peb_expression.csv file that we just downloaded, and have a look at its contents.
```{r read}
peb = read.table("peb_expression.csv", header=T, sep='\t')
head(peb)
str(peb)
```
The first two columns contain the probe ID and the name of the gene. The other columns contain the expression levels of each gene in an individual. Notice how the name of the columns also encode the population of each individual, e.g. YRI, EUR, or EAS.
A piping system for R
----------------------
The dplyr package introduced a piping system for R, using the %>% symbol.
This works similarly to the piping system in bash, but using the %>% symbol instead of the pipe |. We first write the name of the dataframe to use, then all the operation that must be executed on it.
For example, the following is equivalent to head(peb)
```{r piping1}
peb %>% head
```
We can concantenate any number of operations on the same dataset, just as we would do in bash. We will play with this piping system in a few minutes.
Converting to a long format
------------------------------
A dataset can be encoded in a "wide" format (more columns and less rows), or in a "long" format (minimum number of columns and more rows). While the wide format can be more readable for the human eye, the long format is better suited for data analysis.
Our peb data frame is in a "wide" format, as it contains many columns, one for every individual. However, all the functions used in the rest of the workshop require the dataset to be in a long format. Let's convert it using the gather() function from tidyr:
```{r gather1}
peb.long = gather(peb, sample, expression, -c(ID_REF, IDENTIFIER))
head(peb.long)
```
Explanation:
- *sample* is the name of the new column containing the key variables
- *expression* is the new column containing the values variable
- we use the *-* operator to define which columns must **not** be converted to the long format
Note how the new format encodes exactly the same data as before, but has only four columns.
Nowadays many recent R libraries and functions are designed for datasets in the long format. A common beginner mistake is trying to apply these functions to datasets in the wide format, while the trick is to restructure the data first. If you learn how to reorganize your dataset into a long format, then you can solve most data analysis problems in R using always the same approach.
If you couldn't install the tidyr package, you can achieve the same format using the *melt* function from reshape2
```{r eval=FALSE}
install.packages("reshape2")
library("reshape2")
peb.long = melt(peb, id.vars=c("ID_REF", "IDENTIFIER"), variable.name="sample")
names(peb.long)[4] = "expression"
head(peb.long, 3)
```
Tidying-up the peb.long dataframe
---------------------------------------------
In a properly structured table each variable must contain only one type of information. If we look at our peb.long dataframe, the sample column contains both the individual ID and its population. Let's split this column into two, using *separate* from tidyr:
```{r gather2}
peb.long.tmp = gather(peb, sample, expression, -ID_REF, -IDENTIFIER)
peb.long = separate(peb.long.tmp, sample, into=c("individual", "population"), sep='\\.')
head(peb.long)
```
The above code can be simplified using the %>% operator:
```{r gather2.dplyr}
peb.long = peb %>%
gather(sample, expression, -ID_REF, -IDENTIFIER) %>%
separate(sample, into=c("individual", "population"), sep='\\.')
```
To prepare the dataset for our workshop, we need to apply a few more data filtering steps, like removing all the "Control" rows, eliminating all the duplicated genes (keeping only one probe per gene), and dropping the ID_REF column.
```{r tidy.all}
peb.long %>% nrow
peb.long = peb %>%
gather(sample, expression, -ID_REF, -IDENTIFIER) %>%
separate(sample, into=c("individual", "population"), sep='\\.') %>%
subset(!grepl("Control", IDENTIFIER)) %>%
subset(!duplicated(paste(IDENTIFIER, individual))) %>%
select(-ID_REF)
peb.long %>% nrow
```
Group operations
----------------------
Apart from the %>% operator, the dplyr library introduces three useful functions: group_by, summarise and mutate.
- *group_by* is used to define how the rows of a dataset must be grouped. For example, we can group the rows of peb.long by population.
- *summarise* is used to calculate summaries of a grouped dataset - for example we can use it to calculate the mean and standard deviation of the expression for every population:
```{r grouping1}
peb.means = peb.long %>%
group_by(population) %>%
summarise(
mean=mean(expression),
sd=sd(expression))
peb.means %>% print
```
- *mutate* is used to add columns to a data frame, or to modify existing columns. For example we can use it to annotate whether each gene is over or under expressed, compared to the expression of other genes in the same individual:
```{r mutate1}
peb.long = peb.long %>%
group_by(individual) %>%
mutate(
mean.expression=mean(expression) ) %>%
mutate(
overexpressed = ifelse(expression > mean.expression, "overexpressed", "underexpressed")
)
peb.long %>% head %>% print.data.frame(row.names=F)
```
Plotting the dataset
------------------------
We will use the ggplot2 package to plot our peb.long dataframe.
A ggplot2 plot is composed by a base ggplot() object, defining the dataset and the basic variables used, and by additional elements defining how to represent them. Let's see an example:
```{r fig.width=10, fig.height=6}
ggplot(peb.long, aes(x=individual, y=expression)) +
geom_point() +
ggtitle("Expression by individual")
```
Explanation:
- *ggplot(peb.long, aes(x=overexpressed, y=expression))* initialize the base ggplot2 object. We use the aes (for aesthetic) function to define the default x and y variables
- *geom_point* adds a scatterplot representation of the base plot
- *ggtitle* is another element of the plot, in this case the title.
Let' try a boxplot representation:
```{r fig.width=10, fig.height=6}
ggplot(peb.long, aes(x=individual, y=expression)) +
geom_boxplot()
```
Let's create a nicer plot, in which we:
- sort the individuals by population (x=reorder..)
- color by the population variable (color=population)
- set limits for the Y axis (scale_y_continuous)
- set a nice name for X axis (scale_x_discrete)
- use a whiter plot theme (theme_bw)
- rotate the X axis labels (theme)
```{r fig.width=10, fig.height=6}
ggplot(peb.long, aes(x=reorder(individual, desc(population)), y=expression, color=population)) +
geom_boxplot() +
scale_y_continuous("Expression", limits=c(0, 1000)) +
scale_x_discrete('Individuals sorted by Population') +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
ggtitle('Expression by individual')
```
Classifying cancer genes
------------------------
Remember that we also downloaded a file containing the classification of genes into oncogenes and tumor suppressors, from the NCG database:
```{r}
cancer = read.table("cancer_genes.csv", header=T)
cancer %>% head
```
This file doesn't follow the principles of "tidy" data, according to which each row should represent only one single observation.
In this case we can use the unnest function from tidyr to convert it to a "long" format:
```{r}
cancer.long = cancer %>%
mutate(genes=strsplit(as.character(genes), ",") ) %>%
unnest(genes) %>%
rename(IDENTIFIER=genes)
cancer.long %>% head
```
Now we can use left_join from dplyr to add a column to our peb expression data frame:
```{r}
peb.long %>% nrow
peb.long = peb.long %>%
left_join(cancer.long, by='IDENTIFIER') %>%
mutate(dataset = ifelse(is.na(dataset), "no-cancer", as.character(dataset)))
peb.long %>% nrow
```
**OPTIONAL exercise**
You may notice that the number of rows in the dataframe increased after the join. The reason is that three genes are classified both as Oncogenes and as Tumor Suppressors.
To find them you can do any of the following:
```{r}
# Detect duplicated IDENTIFIER & individual rows
peb.long %>%
subset(duplicated(paste(IDENTIFIER, individual)))
# Group by IDENTIFIER and individual, and summarise
peb.long %>%
group_by(IDENTIFIER, individual) %>%
mutate(dataset=paste(dataset, collapse=',')) %>%
subset(grepl('tumor', dataset) & grepl('oncogene', dataset))
# Reorganize the data frame to a "wide" format
peb.long %>%
mutate(value=T) %>%
spread(dataset, value) %>%
subset(oncogenes==T & tumorsuppressors==T)
```
End of the **OPTIONAL** exercise
advanced ggplot2: facetting and saving to a file
------------------------------------------------
Now that we have more qualitative variables in peb.long, we can create more sophisticated plots. For example, an useful way to represent multi-dimensional data is facetting:
```{r fig.width=10, fig.height=6}
myplot = ggplot(peb.long, aes(x=reorder(individual, desc(population)), y=expression, color=population)) +
geom_boxplot() +
scale_y_continuous("Expression", limits=c(0, 1000)) +
scale_x_discrete('Individuals sorted by Population') +
theme_bw() +
theme(axis.text.x=element_text(angle = -90, hjust = 0)) +
ggtitle('Expression by individual')
print(myplot + facet_wrap(~population, scales="free"))
```
If we want to save this plot to a file, we can do:
```
ggsave('expression_by_individual.pdf')
```
a Barchart plot
---------------
Let's see how many oncogenes/tumor suppressors are overexpressed in each population:
```{r fig.width=10, fig.height=6}
peb.long %>%
subset(dataset !='no-cancer') %>%
ggplot(aes(x=dataset, fill=overexpressed)) +
geom_bar(position='dodge') +
facet_wrap(~population) +
theme_bw() +
ggtitle('Number of overexpressed/underexpressed genes by population')
```