peb_tutorial.Rmd

---
title: "PEB advanced R workshop"
author: "Giovanni M DallOlio"
date: "28/04/2015"
output: html_document
---

PEB advanced R workshop
---------------------------------

Welcome to the PEB advanced R workshop! 

We will take a "messy" gene expression table, and use the tidyr library to restructure it in a format that is better suited for data analysis. We will also group the data and calculate summaries using the dplyr library, and finally plot it using ggplot2.


- http://dalloliogm.github.io/peb_workshop/
- https://github.com/dalloliogm/peb_workshop

Requirements
-------------

Let's install all the required packages:

```{r message=FALSE}
pkgs = c("tidyr", "dplyr", "ggplot2")
#for (pkg in pkgs) install.packages(pkg)
for (pkg in pkgs) library(pkg, character.only=T)
```

The tidyr package requires at least R 3.1.2. If you can't install it, you can use reshape2 instead, but you won't be able to follow some of the exercises.

We also need to download the following **data files**:

```{r eval=FALSE}
download.file('https://raw.githubusercontent.com/dalloliogm/peb_workshop/master/peb_expression.csv', dest='peb_expression.csv')
download.file('https://raw.githubusercontent.com/dalloliogm/peb_workshop/master/cancer_genes.csv', dest='cancer_genes.csv')
```


the peb_expression.csv file
-----------------------------

Let's read the peb_expression.csv file that we just downloaded, and have a look at its contents.

```{r read}
peb = read.table("peb_expression.csv", header=T, sep='\t')
head(peb)
str(peb)

```

The first two columns contain the probe ID and the name of the gene. The other columns contain the expression levels of each gene in an individual. Notice how the name of the columns also encode the population of each individual, e.g. YRI, EUR, or EAS.


A piping system for R
----------------------

The dplyr package introduced a piping system for R, using the %>% symbol.

This works similarly to the piping system in bash, but using the %>% symbol instead of the pipe |. We first write the name of the dataframe to use, then all the operation that must be executed on it.

For example, the following is equivalent to head(peb)

```{r piping1}
peb %>% head
```

We can concantenate any number of operations on the same dataset, just as we would do in bash. We will play with this piping system in a few minutes.

Converting to a long format
------------------------------

A dataset can be encoded in a "wide" format (more columns and less rows), or in a "long" format (minimum number of columns and more rows). While the wide format can be more readable for the human eye, the long format is better suited for data analysis. 

Our peb data frame is in a "wide" format, as it contains many columns, one for every individual. However, all the functions used in the rest of the workshop require the dataset to be in a long format. Let's convert it using the gather() function from tidyr:


```{r gather1}
peb.long = gather(peb, sample, expression, -c(ID_REF, IDENTIFIER))
head(peb.long)
```

Explanation:

- *sample* is the name of the new column containing the key variables
- *expression* is the new column containing the values variable
- we use the *-* operator to define which columns must **not** be converted to the long format

Note how the new format encodes exactly the same data as before, but has only four columns. 

Nowadays many recent R libraries and functions are designed for datasets in the long format. A common beginner mistake is trying to apply these functions to datasets in the wide format, while the trick is to restructure the data first. If you learn how to reorganize your dataset into a long format, then you can solve most data analysis problems in R using always the same approach. 

If you couldn't install the tidyr package, you can achieve the same format using the *melt* function from reshape2

```{r eval=FALSE}
install.packages("reshape2")
library("reshape2")
peb.long = melt(peb, id.vars=c("ID_REF", "IDENTIFIER"), variable.name="sample")
names(peb.long)[4] = "expression"
head(peb.long, 3)
```


Tidying-up the peb.long dataframe
---------------------------------------------

In a properly structured table each variable must contain only one type of information. If we look at our peb.long dataframe, the sample column contains both the individual ID and its population. Let's split this column into two, using *separate* from tidyr:


```{r gather2}
peb.long.tmp = gather(peb, sample, expression, -ID_REF, -IDENTIFIER)
peb.long     = separate(peb.long.tmp, sample, into=c("individual", "population"), sep='\\.')
head(peb.long)
```

The above code can be simplified using the %>% operator:
```{r gather2.dplyr}
peb.long = peb %>% 
  gather(sample, expression, -ID_REF, -IDENTIFIER) %>%
  separate(sample, into=c("individual", "population"), sep='\\.')

```


To prepare the dataset for our workshop, we need to apply a few more data filtering steps, like removing all the "Control" rows, eliminating all the duplicated genes (keeping only one probe per gene), and dropping the ID_REF column. 


```{r tidy.all}
peb.long %>% nrow

peb.long = peb %>% 
  gather(sample, expression, -ID_REF, -IDENTIFIER) %>%
  separate(sample, into=c("individual", "population"), sep='\\.') %>%
  subset(!grepl("Control", IDENTIFIER)) %>%
  subset(!duplicated(paste(IDENTIFIER, individual))) %>%
  select(-ID_REF)

peb.long %>% nrow
```


Group operations
----------------------

Apart from the %>% operator, the dplyr library introduces three useful functions: group_by, summarise and mutate.

- *group_by* is used to define how the rows of a dataset must be grouped. For example, we can group the rows of peb.long by population.

- *summarise* is used to calculate summaries of a grouped dataset - for example we can use it to calculate the mean and standard deviation of the expression for every population:

```{r grouping1}
peb.means = peb.long %>% 
  group_by(population) %>%
  summarise(
    mean=mean(expression),
    sd=sd(expression))

peb.means %>% print

```

- *mutate* is used to add columns to a data frame, or to modify existing columns. For example we can use it to annotate whether each gene is over or under expressed, compared to the expression of other genes in the same individual:


```{r mutate1}
peb.long = peb.long %>% 
  group_by(individual) %>%
  mutate(
    mean.expression=mean(expression) ) %>%
  mutate(
    overexpressed = ifelse(expression > mean.expression, "overexpressed", "underexpressed")
  )

peb.long %>% head %>% print.data.frame(row.names=F)

```


Plotting the dataset
------------------------

We will use the ggplot2 package to plot our peb.long dataframe.

A ggplot2 plot is composed by a base ggplot() object, defining the dataset and the basic variables used, and by additional elements defining how to represent them. Let's see an example:


```{r fig.width=10, fig.height=6}

ggplot(peb.long, aes(x=individual, y=expression)) +
  geom_point() + 
  ggtitle("Expression by individual")

```

Explanation:

- *ggplot(peb.long, aes(x=overexpressed, y=expression))* initialize the base ggplot2 object. We use the aes (for aesthetic) function to define the default x and y variables

- *geom_point* adds a scatterplot representation of the base plot

- *ggtitle* is another element of the plot, in this case the title.

Let' try a boxplot representation:

```{r fig.width=10, fig.height=6}

ggplot(peb.long, aes(x=individual, y=expression)) +
  geom_boxplot()

```

Let's create a nicer plot, in which we:

    - sort the individuals by population (x=reorder..)
    - color by the population variable   (color=population)
    - set limits for the Y axis          (scale_y_continuous)
    - set a nice name for X axis         (scale_x_discrete)
    - use a whiter plot theme            (theme_bw)
    - rotate the X axis labels           (theme)


```{r fig.width=10, fig.height=6}

ggplot(peb.long, aes(x=reorder(individual, desc(population)), y=expression, color=population)) +
  geom_boxplot() + 
  scale_y_continuous("Expression", limits=c(0, 1000)) + 
  scale_x_discrete('Individuals sorted by Population') +
  theme_bw() +
  theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  ggtitle('Expression by individual')

```


Classifying cancer genes
------------------------


Remember that we also downloaded a file containing the classification of genes into oncogenes and tumor suppressors, from the NCG database:

```{r}
cancer = read.table("cancer_genes.csv", header=T)
cancer %>% head

```

This file doesn't follow the principles of "tidy" data, according to which each row should represent only one single observation.

In this case we can use the unnest function from tidyr to convert it to a "long" format:


```{r}
cancer.long = cancer %>%
    mutate(genes=strsplit(as.character(genes), ",") ) %>%
    unnest(genes) %>%
    rename(IDENTIFIER=genes)

cancer.long %>% head
```

Now we can use left_join from dplyr to add a column to our peb expression data frame:


```{r}

peb.long %>% nrow

peb.long = peb.long %>%
    left_join(cancer.long, by='IDENTIFIER') %>%
    mutate(dataset = ifelse(is.na(dataset), "no-cancer", as.character(dataset)))

peb.long %>% nrow

```

**OPTIONAL exercise**

You may notice that the number of rows in the dataframe increased after the join. The reason is that three genes are classified both as Oncogenes and as Tumor Suppressors.

To find them you can do any of the following:

```{r}
# Detect duplicated IDENTIFIER & individual rows
peb.long %>% 
  subset(duplicated(paste(IDENTIFIER, individual)))

# Group by IDENTIFIER and individual, and summarise
peb.long %>% 
  group_by(IDENTIFIER, individual) %>%
  mutate(dataset=paste(dataset, collapse=',')) %>% 
  subset(grepl('tumor', dataset) & grepl('oncogene', dataset))

# Reorganize the data frame to a "wide" format 
peb.long %>%
  mutate(value=T) %>% 
  spread(dataset, value) %>% 
  subset(oncogenes==T & tumorsuppressors==T)
```

End of the **OPTIONAL** exercise


advanced ggplot2: facetting and saving to a file
------------------------------------------------


Now that we have more qualitative variables in peb.long, we can create more sophisticated plots. For example, an useful way to represent multi-dimensional data is facetting:

```{r fig.width=10, fig.height=6}

myplot = ggplot(peb.long, aes(x=reorder(individual, desc(population)), y=expression, color=population)) +
  geom_boxplot() + 
  scale_y_continuous("Expression", limits=c(0, 1000)) + 
  scale_x_discrete('Individuals sorted by Population') +
  theme_bw() +
  theme(axis.text.x=element_text(angle = -90, hjust = 0)) +
  ggtitle('Expression by individual')

print(myplot + facet_wrap(~population, scales="free"))
```

If we want to save this plot to a file, we can do:

```
ggsave('expression_by_individual.pdf')
```


a Barchart plot
---------------

Let's see how many oncogenes/tumor suppressors are overexpressed in each population:

```{r fig.width=10, fig.height=6}
peb.long %>% 
    subset(dataset !='no-cancer') %>% 
    ggplot(aes(x=dataset, fill=overexpressed)) + 
        geom_bar(position='dodge') + 
        facet_wrap(~population) +
        theme_bw() +
        ggtitle('Number of overexpressed/underexpressed genes by population')
```