---
title: "Lab 2a. Community - Cluster"
author: "Ben Best"
date: "1/18/2022"
output: html_document
bibliography: ["ml-env.bib"]
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Learning Objectives {.unnumbered}
In this lab, you will play with **unsupervised classification** techniques while working with **ecological community** datasets.
- Comparing species counts between sites using **distance** metrics:
  - **Euclidean** distance is the straight-line distance between two sites in a virtual space of species counts, calculated with the Pythagorean theorem.
  - **Manhattan** distance is the "around the block" distance, i.e. the sum of the absolute differences in species counts.
  - **Bray-Curtis** dissimilarity is based on twice the sum of the lesser counts of species shared between two sites, divided by the sum of all species counts at both sites. A dissimilarity value of 1 is completely dissimilar, i.e. no species shared; a value of 0 is completely identical.
- **Clustering**
  - **_K_-Means clustering** with the function `kmeans()`, given a pre-assigned number of clusters, assigns each observation to the cluster with the nearest centroid so as to reduce within-cluster variation.
    - **Voronoi diagrams** visualize the regions closest to each point, useful here to show the membership of observations to the nearest cluster centroid.
  - **Hierarchical clustering** does not require the number of clusters to be specified in advance.
    - **Agglomerative hierarchical clustering**, such as with `agnes()`, builds the tree from the bottom up by merging (agglomerating) observations into clusters. It is good at identifying small clusters.
    - **Divisive hierarchical clustering**, such as with `diana()`, builds the tree from the top down by splitting (dividing) clusters. It is good at identifying large clusters.
    - **Dendrograms** visualize the branching tree.
- **Ordination** (coming Monday)
# Clustering
**Clustering** associates similar data points with each other, adding a grouping label. It is a form of **unsupervised learning** since we don't fit the model by feeding it a labeled response (i.e., $y$).
## _K_-Means Clustering
Source: [K Means Clustering in R | DataScience+](https://datascienceplus.com/k-means-clustering-in-r/)
In _k_-means clustering, the number of clusters needs to be specified. The algorithm randomly assigns each observation to a cluster, and finds the centroid of each cluster. Then, the algorithm iterates through two steps:
1. Reassign data points to the cluster whose centroid is closest.
1. Calculate new centroid of each cluster.
These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the Euclidean distances between the data points and their respective cluster centroids.
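For intuition, here is a minimal sketch of those two steps on toy data (illustrative only, not the internals of `kmeans()`; the object names are made up for the sketch):
```{r}
# minimal sketch of the k-means iteration on toy data (in practice, use kmeans())
set.seed(42)
xy   <- matrix(rnorm(100 * 2), ncol = 2)          # 100 points, 2 features
k    <- 3
ctrs <- xy[sample(nrow(xy), k), , drop = FALSE]   # random initial centroids

for (i in 1:10) {
  # 1. reassign each point to the cluster with the nearest centroid (Euclidean)
  d    <- as.matrix(dist(rbind(ctrs, xy)))[-(1:k), 1:k]
  memb <- apply(d, 1, which.min)
  # 2. recompute each centroid as the mean of its assigned points
  ctrs <- t(sapply(1:k, function(j) colMeans(xy[memb == j, , drop = FALSE])))
}
table(memb)  # cluster sizes after 10 iterations
```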
### Load and plot the `penguins` dataset
The `penguins` dataset comes from Allison Horst's [palmerpenguins](https://allisonhorst.github.io/palmerpenguins) R package and records biometric measurements of different penguin species found at Palmer Station, Antarctica [@gormanEcologicalSexualDimorphism2014]. It is an alternative to the `iris` dataset for exploratory data analysis examples (to avoid association with that 1935 dataset's collector, [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher), who "held strong views on race and eugenics"). Use of either dataset is acceptable for submission of this lab (and mention of `iris` or Fisher will be dropped next year).
```{=html}
<style>
.row {
display: flex;
}
.column {
flex: 50%;
padding: 5px;
}
</style>
<div class="row">
<div class="column">
<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" style="width:100%"/>
</div>
<div class="column">
<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png" style="width:100%"/>
</div>
</div>
```
```{r}
# load R packages
librarian::shelf(
dplyr, DT, ggplot2, palmerpenguins, skimr, tibble)
# set seed for reproducible results
set.seed(42)
# load the dataset
data("penguins")
# look at documentation in RStudio
if (interactive())
help(penguins)
# show data table
datatable(penguins)
# skim the table for a summary
skim(penguins)
# remove the rows with NAs
penguins <- na.omit(penguins)
# plot bill length vs. depth, species naive
ggplot(
penguins, aes(bill_length_mm, bill_depth_mm)) +
geom_point()
# plot bill length vs. depth, colored by species
legend_pos <- theme(
legend.position = c(0.95, 0.05),
legend.justification = c("right", "bottom"),
legend.box.just = "right")
ggplot(
penguins, aes(bill_length_mm, bill_depth_mm, color = species)) +
geom_point() +
legend_pos
```
### Cluster `penguins` using `kmeans()`
```{r}
# cluster using kmeans
k <- 3 # number of clusters
penguins_k <- kmeans(
penguins %>%
select(bill_length_mm, bill_depth_mm),
centers = k)
# show cluster result
penguins_k
# compare clusters with species (which were not used to cluster)
table(penguins_k$cluster, penguins$species)
```
**Question**: Comparing the observed species plot (3 species) with the `kmeans()` cluster plot (3 clusters), where does this "unsupervised" `kmeans()` technique (which does not use species to "fit" the model) produce similar versus different results? One or two sentences will suffice. Feel free to mention ranges of values along the axes.
```{r}
# extract cluster assignment per observation
Cluster = factor(penguins_k$cluster)
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
geom_point() +
legend_pos
```
```{r, eval=F, echo=F}
# **Task**: Highlight the "misclassified" points in the plot. _Hints: To get just the points misclassified, you can use `penguins_k$cluster != as.integer(penguins$species)`, which can feed as the second argument into `filter(penguins)`. To add another set of points to the ggplot, use `+ geom_point()` with arguments for: `data` with the additional points, `pch` [point shape](https://www.r-bloggers.com/2021/06/r-plot-pch-symbols-different-point-shapes-in-r/) with `fill=NA` for transparency and outline `color="red"`._
obs_mis <- penguins %>%
filter(penguins_k$cluster != as.integer(penguins$species))
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
geom_point() +
legend_pos +
geom_point(data = obs_mis, color="red", fill=NA, pch=21)
```
### Plot Voronoi diagram of clustered `penguins`
This form of clustering assigns points to the cluster based on nearest centroid. You can see the breaks more clearly with a [Voronoi diagram](https://en.wikipedia.org/wiki/Voronoi_diagram).
```{r}
librarian::shelf(ggvoronoi, scales)
# define bounding box for geom_voronoi()
xr <- extendrange(range(penguins$bill_length_mm), f=0.1)
yr <- extendrange(range(penguins$bill_depth_mm), f=0.1)
box <- tribble(
~bill_length_mm, ~bill_depth_mm, ~group,
xr[1], yr[1], 1,
xr[1], yr[2], 1,
xr[2], yr[2], 1,
xr[2], yr[1], 1,
xr[1], yr[1], 1) %>%
data.frame()
# cluster using kmeans
k <- 3 # number of clusters
penguins_k <- kmeans(
penguins %>%
select(bill_length_mm, bill_depth_mm),
centers = k)
# extract cluster assignment per observation
Cluster = factor(penguins_k$cluster)
# extract cluster centers
ctrs <- as.data.frame(penguins_k$centers) %>%
mutate(
Cluster = factor(1:k))
# plot points with voronoi diagram showing nearest centroid
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
geom_point() +
legend_pos +
geom_voronoi(
data = ctrs, aes(fill=Cluster), color = NA, alpha=0.5,
outline = box) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
geom_point(
data = ctrs, pch=23, cex=2, fill="black")
```
**Task**: Show the Voronoi diagram for fewer (`k=2`) and more (`k=8`) clusters to see how assignment to cluster centroids works. One way to set this up is sketched below.
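A sketch of one approach is to wrap the clustering and plotting in a function of `k` (the helper name `plot_voronoi()` is just illustrative and reuses `penguins`, `box`, and `legend_pos` from the chunks above):
```{r}
# illustrative helper: re-run kmeans and redraw the Voronoi diagram for a given k
plot_voronoi <- function(k) {
  penguins_k <- kmeans(
    penguins %>%
      select(bill_length_mm, bill_depth_mm),
    centers = k)
  Cluster <- factor(penguins_k$cluster)
  ctrs <- as.data.frame(penguins_k$centers) %>%
    mutate(Cluster = factor(1:k))
  ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
    geom_point() +
    legend_pos +
    geom_voronoi(
      data = ctrs, aes(fill = Cluster), color = NA, alpha = 0.5,
      outline = box) +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    geom_point(data = ctrs, pch = 23, cex = 2, fill = "black")
}
plot_voronoi(2)
plot_voronoi(8)
```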
## Hierarchical Clustering
Next, you'll cluster sites according to species composition. You'll use the `dune` dataset from the `vegan` R package and follow along with these source texts:
- [Ch. 8-9 (p.123-152)](https://drive.google.com/open?id=16T7kAO9W5Rz7xSHqwJrOdDu8RUTmTJ89&authuser=ben%40ecoquants.com&usp=drive_fs) [@kindtTreeDiversityAnalysis2005]
- [21.3.1 Agglomerative hierarchical clustering](https://bradleyboehmke.github.io/HOML/hierarchical.html#agglomerative-hierarchical-clustering) [@greenwellHandsOnMachineLearning]
### Load `dune` dataset
```{r}
librarian::shelf(
cluster, vegan)
# load dune dataset from package vegan
data("dune")
# show documentation on dataset if interactive
if (interactive())
help(dune)
```
**Question**: What are the rows and columns composed of in the `dune` data frame?
### Calculate Ecological Distances on `sites`
Before we calculate ecological distance between sites for `dune`, let's look at these metrics with a simpler dataset, like the example given in Chapter 8 by @kindtTreeDiversityAnalysis2005.
```{r}
sites <- tribble(
~site, ~sp1, ~sp2, ~sp3,
"A", 1, 1, 0,
"B", 5, 5, 0,
"C", 0, 0, 1) %>%
column_to_rownames("site")
sites
sites_manhattan <- vegdist(sites, method="manhattan")
sites_manhattan
sites_euclidean <- vegdist(sites, method="euclidean")
sites_euclidean
sites_bray <- vegdist(sites, method="bray")
sites_bray
```
### Bray-Curtis Dissimilarity on `sites`
Let's take a closer look at the [Bray-Curtis Dissimilarity](https://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity) distance:
$$
B_{ij} = 1 - \frac{2C_{ij}}{S_i + S_j}
$$
- $B_{ij}$: Bray-Curtis dissimilarity value between sites $i$ and $j$. \
1 = completely dissimilar (no shared species); 0 = identical.
- $C_{ij}$: sum of the lesser counts $C$ for shared species common to both sites $i$ and $j$
- $S_i$, $S_j$: sum of all species counts $S$ at site $i$ or $j$, respectively
So to calculate Bray-Curtis for the example `sites`:
- $B_{AB} = 1 - \frac{2 \times (1 + 1)}{2 + 10} = 1 - 4/12 = 1 - 1/3 = 0.667$
- $B_{AC} = 1 - \frac{2 \times 0}{2 + 1} = 1$
- $B_{BC} = 1 - \frac{2 \times 0}{10 + 1} = 1$
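As a quick check (a small sketch, not part of the original example; the helper `bray()` is made up here), the same values can be computed directly from the `sites` counts and compared with the `vegdist()` output above:
```{r}
# manual Bray-Curtis between two sites: 1 - 2 * sum(lesser shared counts) / sum(all counts)
bray <- function(a, b) 1 - 2 * sum(pmin(a, b)) / (sum(a) + sum(b))
bray(unlist(sites["A", ]), unlist(sites["B", ]))  # 0.667
bray(unlist(sites["A", ]), unlist(sites["C", ]))  # 1
bray(unlist(sites["B", ]), unlist(sites["C", ]))  # 1
```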
### Agglomerative hierarchical clustering on `dune`
See text to accompany code: _HOMLR_ [21.3.1 Agglomerative hierarchical clustering](https://bradleyboehmke.github.io/HOML/hierarchical.html#agglomerative-hierarchical-clustering).
```{r}
# Dissimilarity matrix
d <- vegdist(dune, method="bray")
# d is a "dist" object; convert to a matrix to see its dimensions and a corner
dim(as.matrix(d))
as.matrix(d)[1:5, 1:5]
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Dendrogram plot of hc1
plot(hc1, cex = 0.6, hang = -1)
# Compute agglomerative clustering with agnes
hc2 <- agnes(dune, method = "complete")
# Agglomerative coefficient
hc2$ac
# Dendrogram plot of hc2
plot(hc2, which.plot = 2)
# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
# function to compute coefficient
ac <- function(x) {
agnes(dune, method = x)$ac
}
# get agglomerative coefficient for each linkage method
purrr::map_dbl(m, ac)
# Compute ward linkage clustering with agnes
hc3 <- agnes(dune, method = "ward")
# Agglomerative coefficient
hc3$ac
# Dendrogram plot of hc3
plot(hc3, which.plot = 2)
```
### Divisive hierarchical clustering on `dune`
See text to accompany code: _HOMLR_ [21.3.2 Divisive hierarchical clustering](https://bradleyboehmke.github.io/HOML/hierarchical.html#divisive-hierarchical-clustering).
```{r}
# compute divisive hierarchical clustering
hc4 <- diana(dune)
# Divisive coefficient; amount of clustering structure found
hc4$dc
```
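To see the divisive tree itself (an optional addition beyond the coefficient; `pltree()` from the `cluster` package plots the dendrogram of a `diana` or `agnes` object, as in the HOMLR chapter):
```{r}
# dendrogram plot of the divisive clustering
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana(dune)")
```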
### Determining optimal clusters
See text to accompany code: _HOMLR_ [21.4 Determining optimal clusters](https://bradleyboehmke.github.io/HOML/hierarchical.html#determining-optimal-clusters).
```{r}
librarian::shelf(factoextra)
# Plot cluster results
p1 <- fviz_nbclust(dune, FUN = hcut, method = "wss", k.max = 10) +
ggtitle("(A) Elbow method")
p2 <- fviz_nbclust(dune, FUN = hcut, method = "silhouette", k.max = 10) +
ggtitle("(B) Silhouette method")
p3 <- fviz_nbclust(dune, FUN = hcut, method = "gap_stat", k.max = 10) +
ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)
```
### Working with dendrograms
See text to accompany code: _HOMLR_ [21.5 Working with dendrograms](https://bradleyboehmke.github.io/HOML/hierarchical.html#working-with-dendrograms).
```{r}
# Construct dendrogram for the dune example
hc5 <- hclust(d, method = "ward.D2" )
dend_plot <- fviz_dend(hc5)
dend_data <- attr(dend_plot, "dendrogram")
dend_cuts <- cut(dend_data, h = 8)  # h copied from the HOML example; adjust to the height range of this dendrogram if needed
fviz_dend(dend_cuts$lower[[2]])
# hc5 (Ward's method) was computed above; cut the tree into 4 groups
k <- 4
sub_grp <- cutree(hc5, k = k)
# Number of members in each cluster
table(sub_grp)
# Plot full dendrogram
fviz_dend(
hc5,
k = k,
horiz = TRUE,
rect = TRUE,
rect_fill = TRUE,
rect_border = "jco",
k_colors = "jco")
```