Module1_June_2017.Rmd

---
title: "MODULE I. Descriptive Statistics & Intro to Probability. Hands-on practicum. "
author: "Julia Ponomarenko, CRG Bioinformatics core facility."
date: "June 6, 2017"
output:
  html_document:
    toc: true
    toc_depth: 4
    theme: readable
---

1. Descriptive statistics
--------------------------
"Descriptive statistics aim to summarize **a sample**, rather than use the data to learn about **the population** that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory." (Wikipedia)

### Explore data

We will use the package "car". For details on this package, see https://cran.r-project.org/web/packages/car/car.pdf

```{r, results="hide", warning = F}
# install.packages("car") - run this code if you do not have car" package installed
library(car) 
```

<br/>
Let's explore the dataset "Davis" from the car package. It is called "Self-Reports of Height and Weight Description".
The Davis data frame has 200 rows and 5 columns. The subjects were men and women engaged in regular exercise. There are some missing data. This data frame contains the following columns:
<br/>sex    - A factor with levels: F, female; M, male. 
<br/>weight - Measured weight in kg.
<br/>height - Measured height in cm.
<br/>repwt  - Reported weight in kg.
<br/>repht  - Reported height in cm.

```{r, results="hide"}
data <- Davis

dim(data) # first thing to do to see how many rows and columns
names(data) # columns names

class(data)  # should be "data.frame""
sapply(data[1,], class)  # to see types of variables in each column
head(data)

summary(data)  # shows quantiles for each column and how many NA's !!!!

quantile(data$repwt, na.rm = TRUE)  # shows quantiles for values in a specific column, ignoring 'NA'

unique(data$sex) # shows unique values for a specified column
unique(data$weight)
length(unique(data$weight))

table(data$sex) # table() builds a contingency table of the counts at each combination of factor levels
table(data$sex, data$weight)  # shows relationships between variables
table(data$sex, data$weight < 80) # gives the 2x2 contingency table

intervals <- cut(data$weight, quantile(data$weight)) # cut() assigns to each value a quantile - shows intervals in the order of data$weight
table(intervals, data$sex) # contigency table of sex by intervals (weight quantiles)

# contigency table of quantiles of weight versus repwt
x <- data$weight
y <- data$repwt
table(cut(x, quantile(x)), cut(y, quantile(y, na.rm = TRUE))) 
```

<br/>

####Exercise with table(), cut(), and quantile()
**Who is more inclined to report lower weight, males or females?**
```{r }
#males only
d <- data[data$sex == "M",]
x <- d$weight
y <- d$repwt
table(cut(x, quantile(x)), cut(y, quantile(y, na.rm = TRUE)))

#males only
d <- data[data$sex == "F",]
x <- d$weight
y <- d$repwt
table(cut(x, quantile(x)), cut(y, quantile(y, na.rm = TRUE)))
```
**Answer**: This is a rough observation from looking at the contingency table, where rows represent quantiles for measured weight and columns, reported weight, it can be seen that females have a tendendy to report lower weight. Thus, 9 females versus 4 males reported their weight being in the 1st quantile whicle their measured weight was in the 2nd quantile. And in total, there were 20 females and 7 males whose weights were below diagonal. 


<br/>


#### Looking at subset of data
```{r, results="hide"}
d <- data[data$weight < 60,] # rows with weight below 60 kg
summary(d)
summary(d$sex)
dim(d)
nrow(d)

data[data$weight < 60 & data$repwt >= 60,] # the results looks strange !!!
data[which(data$weight < 60 & data$repwt >= 60),] # use which() to exclude NA values
```

<br/>

#### Missing (NA) values
```{r}
sum(is.na(data$repwt))  # the number of missing values for repwt
table(is.na(data$repht))  # how many complete data (FALSE) and how many missing (TRUE) for data$repht

#by default table() doesn't show missing value; e.g.
table(data$repht)
table(data$repht, useNA = "ifany") #now it shows

#How many rows contain missing values (i.e., at least one 'NA')?
sum(!complete.cases(data)) 
data[!complete.cases(data), ] # here they are
```

<br/>

#### Exercise on missing values
**What is the proportion of females and males who didn't report the weight?**
```{r}
f <- data[which(data$sex == "F"), ] 
n <- nrow(f) #total number of females in the original data frame
y <- sum(is.na(f$repwt))/n
round(y, digits = 3)

m <- data[which(data$sex == "M"), ] 
n <- nrow(m) #total number of males in the original data frame
y <- sum(is.na(m$repwt))/n
round(y, digits = 3)
```

<br/>
<br/>

### Summary statistics 
Basic R functions: mean, sd, var, min, max, median, range, IQR, and quantile. All these function require special treatment for missing values, using parameter na.rm = TRUE. While summary(data) does not.

Let's look at the row with the maximum weight.
```{r}
max_weight <- max(data$weight)
data[which(data$weight == max_weight), ] # full info (row) on a person with the max_weight
```

<br/>

#### Outliers
Can be defined (by the Tuley's test) as values in the sample that differ from the Q1 (25% quantile) or Q3 (75% quantile) by more than 1.5*IQR (where IQR = Q3 - Q1). <br/>However, there is no formal definition of outliers; they need to be treated subjectively.

```{r}
quantile(data$weight)
q <- unname(quantile(data$weight)) # unname() removes names and gives the vector of quantile values that we use below

iqr <- q[4] - q[2] # or use IQR(data$weight) instead
upper_limit <- q[4] + 1.5 * iqr # values above this limit are outliers
upper_limit
max_weight > upper_limit # Is the maximum weight an outlier? TRUE or FALSE?
```

<br/>

How many "formal" outliers by weight in the dataset?
```{r}
data[which(data$weight > upper_limit), ]
```

<br/>

Let's remove these outliers and check which of the summary statistics change.
```{r}
d <- data[which(!data$weight > upper_limit), ] # d is our data frame without outliers; we will use it further in this practicum

# Did mean change?
mean(d$weight)
mean(data$weight)

# Did median change?
median(data$weight)
median(d$weight)

# Did SD change?
sd(data$weight)
sd(d$weight)

# Did quantiles change?
quantile(data$weight)
quantile(d$weight)
```

<br/>

### Plots
#### Box-plot
```{r}
boxplot(data$weight) # we can clearly see three outliers as individual dots
data[which(data$weight > upper_limit), ] # here they are

#boxplots of weight values with and w/o outliers side by side
par(mfrow = c(1,2)) # par() allows to plot graphs in a table; e.g., two plots in one row
boxplot(data$weight)
boxplot(d$weight) # now we got another outliers because IQR changed
par(mfrow = c(1,1)) # restore the default plot layout

# Let's explore these "new" outliers
quantile(d$weight)
q <- unname(quantile(d$weight)) # the vector of quantile values
upper_limit <- q[4] + 1.5 * IQR(d$weight) # values above this limit are 
d[which(d$weight > upper_limit), ]
#It is up to a researcher to remove those outliers or not. Let's not remove them.
```

<br/>
**How to plot box-plots side-by-side on one graph**
```{r}
boxplot(data$weight ~ data$sex)
boxplot(data$weight ~ data$sex, outline = FALSE) # boxplot has a parameter outline to not show outliers! (But it doesn't change your data)
```

<br/>
**Bonus: How to plot together data from two or more vectors of different lengths?**
```{r, results="hide", warning = FALSE}
#install.packages("reshape2")
library(reshape2) # for melt() to use below for transforming a data frame from the wide to the long format
```

<br/>
```{r}
male <- d[which(d$sex == "M"),] # we use data with outliers removed
female <- d[which(d$sex == "F"),]
m <- male$weight
f <- female$weight
length(m)
length(f)

# to make a data frame from these vectors, we first need to make them of the same length; that is, fill missing values in a smaller vector with 'NA'
len = max(length(m), length(f))
Female = c(f, rep(NA, len - length(f)))
Male = c(m, rep(NA, len - length(m)))
df <- structure(data.frame(Male, Female)) 
meltdf <- melt(df) # uses reshape2 package; transforms a data frame from the wide to the long format
# "No id variables; using all as measure variables" is a message that can be ignored
boxplot(data = meltdf, value ~ variable)
```

<br/>

**Bonus: Now to make the above plot more informative and look "publication-ready"?**
```{r, results="hide"}
# Define the plot parameters
y_limits <- c(30, 130)
colors <- c("blue", "red")
ylab <- "Weight, kg"
title <- "Distribution of weight by sex"

boxplot(data = meltdf, value ~ variable, outline = FALSE,
        #pars = list(boxwex = .4),
        ylab = ylab,
        cex.lab = 1.5, #to change (multiply) the font size of the axes legends
        cex.axis = 1.2, #to change (multiply) the font size of the axes
        ylim = y_limits,
        border = colors, #color the boxplot borders
        boxwex = 0.6,
        staplewex = 0.4,
        frame.plot = FALSE, #this removes upper and right borders on the plot area
        outwex = 0.5,
        cex.main = 1.5, #to change (multiply) the size of the title
        main = title # the title of the graph
        )

stripchart(data = meltdf, value ~ variable,
           col = colors,
           method = "jitter", jitter = .2,
           pch = c(16, 15), cex = c(1.0, 1.0), #different points and of different size can be used
           vertical = TRUE, add = TRUE)

# let's show the (yet unknown) significance level on the plot
text <- "p-value < ???"

y <- 120 # y position of the horizontal line
offset <- 5 # length of vertical segments
x <- 1

segments(x, y, x + 1, y)
segments(x, y - offset, x, y)
segments(x + 1, y - offset, x + 1, y)
text(x + 0.5, y + offset, paste(text), cex = 1) #cex defines the font size of the text

# you can also plot the means as sick (lwd = 3) horizontal black lines
y <- mean(Male, na.rm = T)
segments(x - 0.35, y, x + 0.35, y, lwd = 3)

y <- mean(Female, na.rm = T)
segments(x + 1 - 0.35, y, x + 1 + 0.35, y, lwd = 3)

```


<br/>

#### Histogram
```{r, results="hide"}
hist(data$weight)
hist(d$weight)
```


<br/>

You can control for the size of bins!
```{r}
bin_size <- 5
start <- 30
end <- 110
bins <- seq(start, end, by = bin_size)
hist(d$weight, breaks = bins, col = "blue")
```

<br/>

**Bonus: Two overlaying histograms on one graph**
```{r, results="hide"}
xlim = c(start, end)
title <- "Distribution of weights for males and females"
colors <- c("red", rgb(0, 0, 1, 0.7)) # the last number in rgb() is for the transparency of the second color

hist(Female, col = colors[1], xlim = xlim, xlab = "Weight, kg", main = title)
hist(Male, add = TRUE, col = colors[2], xlim = xlim)
legend("topright", legend = c("Females", "Males"), fill = colors, bty = "n", border = NA)
```


<br/>

####Exercise with hist()
**Draw the same graph as above with bins of size 2 and using colors <- c(rgb(1, 0, 1, 0.7), rgb(0, 0, 1, 0.5))**
```{r, echo = F, results = "hide"}
bin_size <- 2
start <- min(min(m), min(f)) - 3*bin_size
end <- max(max(m), max(f)) + 3*bin_size
bins <- seq(start, end, by = bin_size)

xlim = c(start, end)
title <- "Distribution of weights for males and females"
colors <- c(rgb(1, 0, 1, 0.7), rgb(0, 0, 1, 0.5)) # the last number in rgb() is for transparency

hist(f, breaks = bins, col = colors[1], xlim = xlim, xlab = "Weight, kg", main = title)
hist(m, breaks = bins, add = TRUE, col = colors[2], xlim = xlim)
legend("topright", legend = c("Females", "Males"), fill = colors, bty = "n", border = NA)
```


<br/>

#### Scatterplot
```{r, results="hide"}
plot(d$weight, d$repwt, pch = 20) # Just a simple plot of reported weights against measured weights

plot(d$weight, d$repwt, pch = 20, col = d$sex) # it can be colored by sex, for example
legend("topleft",legend = c("Females", "Males"), col = 1:2, pch = 20)
```

<br/>

Colors are taken in the order from the currently setup **palette()** . 
```{r}
palette()
```

<br/>


It can be changed to control for colors using palette() and then reset back to default, using palette("default").
```{r, results="hide"}
palette(c("orange", "darkgreen")) # changing palette()
plot(d$weight, d$repwt, col = d$sex, pch = 20, xlab = "Measured weight, kg", ylab = "Reported weight, kg", main = "Reported versus measured weights by sex") 
legend("bottomright",legend = c("Females", "Males"), col = 1:2, pch = 20, bty = "n")
palette("default") # reset back to the default
```
<br/>


#### Combining plots
Refer to http://www.statmethods.net/advgraphs/layout.html for using layout() function.


<br/>

#### Empirical cumulative distribution functions (eCDFs)
**eCDF(x) = the proportion of observations which values are <= x**
```{r, results="hide"}
weight.ecdf = ecdf(d$weight) # obtain empirical CDF values
plot(weight.ecdf, xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
```

<br/>

Let's draw vertical lines depicting median, Q1 and Q2, using abline(v = ...).
```{r}
plot(weight.ecdf, xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
summary(d$weight)
x <- unname(summary(d$weight)) # a vector of 6 elements {min, Q1, median, mean, Q3, max}

abline(v = median(d$weight), lwd = 2) # vertical line for median
abline(v = x[2], col = "blue", lwd = 2) # vertical line for Q1
abline(v = x[5], col = "red", lwd = 2) # vertical line for Q3
```

<br/>


Let's draw horizontal lines for quantiles 0.5, 0.25, 0.75, using abline(h = ...).
```{r, results="hide"}
plot(weight.ecdf, xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
x <- unname(summary(d$weight)) # a vector of 6 elements {min, Q1, median, mean, Q3, max}
abline(v = median(d$weight), lwd = 2) # vertical line for median
abline(v = x[2], col = "blue", lwd = 2) # vertical line for Q1
abline(v = x[5], col = "red", lwd = 2) # vertical line for Q3

abline(h = 0.5) # median
abline(h = 0.25, col = "blue")
abline(h = 0.75, col = "red")
```
<br/>
It can be seen that the CDF graph allows to identify the weight in kg for any quantile. For example, the lowest 10% of observations (CDF < 0.1) have weight below ~50 kg. Likewise, the top 10% of heavest people (CDF > 0.9) have weight above ~82 kg.

<br/>

####Exercise with eCDF
**Plot eCDFs for male and female weights in different colors on the same graph**
```{r, include = F}
plot(ecdf(Male), col = "blue", xlim = c(35, 110), xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
plot(ecdf(Female), xlim = c(35, 110), col = "red", add = T)
legend(x = 90, y = 0.5, legend = c("Females", "Males"), col = c("red", "blue"), pch = 20, bty = "n")

```


<br/>

2. Distributions
--------------------------
### Normal distribution
**We will learn four important functions dnorm(), rnorm(), pnorm(), and qnorm().**
See in addition https://www.r-bloggers.com/normal-distribution-functions/


<br/>


####Probability Density function (PDF) 
Let's look at three normal distributions with means and standard deviations of (0,1), and (8,0.5). 

<br/>We will use the following function:
<br/>**dnorm() - Probability Density Function (PDF)**

```{r, results="hide"}
x <- seq(-5, 15, length.out = 200)
y1 <- dnorm(x, mean = 0, sd = 1) # dnorm() !!!
y2 <- dnorm(x, mean = 8, sd = 0.5)

plot(x = x, y = y1, type = "l", lwd = 3, col = "black", ylim = c(0, 1), xlab = "X", ylab = "Density", yaxs = "i") # the last parameter forces the x-axes to meet the y-axes at 0
lines(x = x, y = y2, lwd = 3, col = "blue")
```


<br/>


####Z-transformation of a random sample from a normal distribution
<br/>We will use the following function:
<br/>**rnorm() - generates random sample from a normal distribution**

```{r, results="hide"}
mean <- 8
sd <- 0.5

x.sample <- rnorm(n = 5000, mean = mean, sd = sd) # rnorm() generates a random sample of size 500 from a normal distribution N(mean,sd)
z.sample <- (x.sample - mean) / sd # Z-transformation of x.sample

hist(x.sample, freq = F, ylim = c(0, 1), xlim = c(-5, 15), col = rgb(0,0,1,0.3), yaxs = "i")
hist(z.sample, freq = F, add = T)
```


<br/>

Now if we overlay these histograms with the density functions plotted for normal distributions obtained before, we will see that sample distributions can be modelled by normal distributions: N(mean=8, sd=0.5) for x.sample and N(mean=0, sd=1) for a transformed z.sample. The black distribution is for the **standard normal distribution**.
```{r, results="hide"}
hist(x.sample, freq = F, ylim = c(0, 1), xlim = c(-5, 15), col = rgb(0,0,1,0.3), yaxs = "i")
hist(z.sample, freq = F, add = T)

lines(x = x, y = y1, type = "l", lwd = 3) 
lines(x = x, y = y2, lwd = 3, col = "blue")
```


<br/>

####Cumulative Distribution Function (CDF)
**pnorm(z, mu, sd) - gives an area under the standard normal curve to the left of z**
```{r}
z <- seq(-4, 4, 0.01)
cdf <- pnorm(z, 0, 1)

plot(z, cdf, col = "blue", xlab = "z", ylab = "Cumulative Probability", type = "l", lwd = 3, cex = 2, main = "CDF of Standard Normal Distribution", cex.axis = .8, yaxs = "i")
```


<br/>

####The 68-95-99.7 rule
For the Normal distribution with mean **mu** and standard deviation **sd**,
**~68% of the observations fall within the range [mu - sd; mu + sd].**

```{r}
z <- seq(-4, 4, 0.01)
dens <- dnorm(z)
plot(x = z, y = dens, type = "l", ylim = c(0,0.45), lwd = 3, xlab = "z", ylab = "Standard Normal density", yaxs = "i")

sd <- 1 # !!!!
xs <- c(seq(-sd, sd, length.out = 200))
ys <- dnorm(xs)
polygon(c(-sd, xs, sd), c(0, ys, 0), col = "blue")

# What is the area of the blue area?
pnorm(sd) - pnorm(-sd)
```


<br/>

**~95.5% of the observations fall within the range [mu - 2sd; mu + 2sd].**

```{r}
plot(x = z, y = dens, type = "l", ylim = c(0,0.45), lwd = 3, xlab = "z", ylab = "Standard Normal density", yaxs = "i")

sd <- 2
xs <- c(seq(-sd, sd, length.out = 200))
ys <- dnorm(xs)
polygon(c(-sd, xs, sd), c(0, ys, 0), col = "green")

# What is the area of the green area?
pnorm(sd) - pnorm(-sd)
```


<br/>

**~99.7% of the observations fall within the range [mu - 3sd; mu + 3sd].**
```{r}
plot(x = z, y = dens, type = "l", ylim = c(0,0.45), lwd = 3, xlab = "z", ylab = "Standard Normal density", yaxs = "i")

sd <- 3
xs <- c(seq(-sd, sd, length.out = 200))
ys <- dnorm(xs)
polygon(c(-sd, xs, sd), c(0, ys, 0), col = "red")

# What is the area of the red area?
pnorm(sd) - pnorm(-sd)
```


<br/>

####Quantile function, or How to obtain the critical values of -z and z for a specified area under the standard normal curve.
**qnorm(p, mu, sd) - gives the value of z at which CDF (area on the left of z) is equal p**
```{r}
#Find the critical values -z and z for the area under the standard normal curve of 95.5%
p <- 0.955
qnorm(p) #gives z such that P(X < z) = p
qnorm(p + (1 - p) / 2) #gives a value of z such that P(-z < X < z) = p
```
It can be seen that this z = 2, or two standard deviations (sd = 1).