---
title: "Lab 2a. Community - Cluster"
author: "Ben Best"
date: "1/18/2022"
output: html_document
bibliography: ["ml-env.bib"]
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Learning Objectives {.unnumbered}
In this lab, you will play with **unsupervised classification** techniques while working with **ecological community** datasets.
- Comparing species counts between sites using **distance** metrics:
  - **Euclidean** distance is the straight-line distance between two sites in a virtual space of species counts, calculated with the Pythagorean theorem.
  - **Manhattan** distance is the "around the block" distance, i.e. the sum of the absolute differences in species counts.
  - **Bray-Curtis** dissimilarity is based on twice the sum of the lesser counts of species shared between two sites, divided by the sum of all species counts at both sites. A dissimilarity value of 1 is completely dissimilar, i.e. no species shared; a value of 0 is completely identical.
- **Clustering**
  - **_K_-Means clustering** with the function `kmeans()`, given a pre-assigned number of clusters, assigns each observation to the cluster with the nearest centroid so as to reduce within-cluster variation.
    - **Voronoi diagrams** visualize the regions closest to each point, useful here to show the membership of observations to the nearest cluster centroid.
  - **Hierarchical clustering** does not require the number of clusters to be specified in advance.
    - **Agglomerative hierarchical clustering**, such as with `agnes()`, builds the tree from the bottom up by merging (agglomerating) observations into clusters. It is good at identifying small clusters.
    - **Divisive hierarchical clustering**, such as with `diana()`, builds the tree from the top down by splitting (dividing) clusters. It is good at identifying large clusters.
    - **Dendrograms** visualize the branching tree.
- **Ordination** (coming Monday)
# Clustering
**Clustering** associates similar data points with each other, adding a grouping label. It is a form of **unsupervised learning** since we don't fit the model by feeding it a labeled response (i.e., $y$).
## _K_-Means Clustering
Source: [K Means Clustering in R | DataScience+](https://datascienceplus.com/k-means-clustering-in-r/)
In _k_-means clustering, the number of clusters needs to be specified. The algorithm randomly assigns each observation to a cluster, and finds the centroid of each cluster. Then, the algorithm iterates through two steps:
1. Reassign data points to the cluster whose centroid is closest.
1. Calculate new centroid of each cluster.
These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the Euclidean distances between the data points and their respective cluster centroids.
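For intuition, here is a minimal sketch of those two steps on toy data (illustrative only, not the internals of `kmeans()`; the object names are made up for the sketch):
```{r}
# minimal sketch of the k-means iteration on toy data (in practice, use kmeans())
set.seed(42)
xy   <- matrix(rnorm(100 * 2), ncol = 2)          # 100 points, 2 features
k    <- 3
ctrs <- xy[sample(nrow(xy), k), , drop = FALSE]   # random initial centroids

for (i in 1:10) {
  # 1. reassign each point to the cluster with the nearest centroid (Euclidean)
  d    <- as.matrix(dist(rbind(ctrs, xy)))[-(1:k), 1:k]
  memb <- apply(d, 1, which.min)
  # 2. recompute each centroid as the mean of its assigned points
  ctrs <- t(sapply(1:k, function(j) colMeans(xy[memb == j, , drop = FALSE])))
}
table(memb)  # cluster sizes after 10 iterations
```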
### Load and plot the `penguins` dataset
The `penguins` dataset comes from Allison Horst's [palmerpenguins](https://allisonhorst.github.io/palmerpenguins) R package and records biometric measurements of different penguin species found at Palmer Station, Antarctica [@gormanEcologicalSexualDimorphism2014]. It is an alternative to the `iris` dataset for exploratory data analysis examples (to avoid association with that 1935 dataset's collector, [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher), who "held strong views on race and eugenics"). Use of either dataset is acceptable for submission of this lab (and mention of `iris` or Fisher will be dropped next year).
```{=html}
<style>
.row {
display: flex;
}
.column {
flex: 50%;
padding: 5px;
}
</style>
<div class="row">
<div class="column">
<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" style="width:100%"/>
</div>
<div class="column">
<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png" style="width:100%"/>
</div>
</div>
```
```{r}
# load R packages
librarian::shelf(
dplyr, DT, ggplot2, palmerpenguins, skimr, tibble)
# set seed for reproducible results
set.seed(42)
# load the dataset
data("penguins")
# look at documentation in RStudio
if (interactive())
help(penguins)
# show data table
datatable(penguins)
# skim the table for a summary
skim(penguins)
# remove the rows with NAs
penguins <- na.omit(penguins)
# plot bill length vs. depth, species naive
ggplot(
penguins, aes(bill_length_mm, bill_depth_mm)) +
geom_point()
# plot bill length vs. depth, colored by species
legend_pos <- theme(
legend.position = c(0.95, 0.05),
legend.justification = c("right", "bottom"),
legend.box.just = "right")
ggplot(
penguins, aes(bill_length_mm, bill_depth_mm, color = species)) +
geom_point() +
legend_pos
```
### Cluster `penguins` using `kmeans()`
```{r}
# cluster using kmeans
k <- 3 # number of clusters
penguins_k <- kmeans(
penguins %>%
select(bill_length_mm, bill_depth_mm),
centers = k)
# show cluster result
penguins_k
# compare clusters with species (which were not used to cluster)
table(penguins_k$cluster, penguins$species)
```
**Question**: Comparing the observed species plot (3 species) with the `kmeans()` cluster plot (3 clusters), where does this "unsupervised" `kmeans()` technique (which does not use species to "fit" the model) produce similar versus different results? One or two sentences will suffice. Feel free to mention ranges of values along the axes.
```{r}
# extract cluster assignment per observation
Cluster = factor(penguins_k$cluster)
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
geom_point() +
legend_pos
```
```{r, eval=F, echo=F}
# **Task**: Highlight the "misclassified" points in the plot. _Hints: To get just the points misclassified, you can use `penguins_k$cluster != as.integer(penguins$species)`, which can feed as the second argument into `filter(penguins)`. To add another set of points to the ggplot, use `+ geom_point()` with arguments for: `data` with the additional points, `pch` [point shape](https://www.r-bloggers.com/2021/06/r-plot-pch-symbols-different-point-shapes-in-r/) with `fill=NA` for transparency and outline `color="red"`._
obs_mis <- penguins %>%
filter(penguins_k$cluster != as.integer(penguins$species))
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
geom_point() +
legend_pos +
geom_point(data = obs_mis, color="red", fill=NA, pch=21)
```
### Plot Voronoi diagram of clustered `penguins`
This form of clustering assigns points to the cluster based on nearest centroid. You can see the breaks more clearly with a [Voronoi diagram](https://en.wikipedia.org/wiki/Voronoi_diagram).
```{r}
librarian::shelf(ggvoronoi, scales)
# define bounding box for geom_voronoi()
xr <- extendrange(range(penguins$bill_length_mm), f=0.1)
yr <- extendrange(range(penguins$bill_depth_mm), f=0.1)
box <- tribble(
~bill_length_mm, ~bill_depth_mm, ~group,
xr[1], yr[1], 1,
xr[1], yr[2], 1,
xr[2], yr[2], 1,
xr[2], yr[1], 1,
xr[1], yr[1], 1) %>%
data.frame()
# cluster using kmeans
k <- 3 # number of clusters
penguins_k <- kmeans(
penguins %>%
select(bill_length_mm, bill_depth_mm),
centers = k)
# extract cluster assignment per observation
Cluster = factor(penguins_k$cluster)
# extract cluster centers
ctrs <- as.data.frame(penguins_k$centers) %>%
mutate(
Cluster = factor(1:k))
# plot points with voronoi diagram showing nearest centroid
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
geom_point() +
legend_pos +
geom_voronoi(
data = ctrs, aes(fill=Cluster), color = NA, alpha=0.5,
outline = box) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
geom_point(
data = ctrs, pch=23, cex=2, fill="black")
```
**Task**: Show the Voronoi diagram for fewer (`k=2`) and more (`k=8`) clusters to see how assignment to cluster centroids works. One way to set this up is sketched below.
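A sketch of one approach is to wrap the clustering and plotting in a function of `k` (the helper name `plot_voronoi()` is just illustrative and reuses `penguins`, `box`, and `legend_pos` from the chunks above):
```{r}
# illustrative helper: re-run kmeans and redraw the Voronoi diagram for a given k
plot_voronoi <- function(k) {
  penguins_k <- kmeans(
    penguins %>%
      select(bill_length_mm, bill_depth_mm),
    centers = k)
  Cluster <- factor(penguins_k$cluster)
  ctrs <- as.data.frame(penguins_k$centers) %>%
    mutate(Cluster = factor(1:k))
  ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = Cluster)) +
    geom_point() +
    legend_pos +
    geom_voronoi(
      data = ctrs, aes(fill = Cluster), color = NA, alpha = 0.5,
      outline = box) +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    geom_point(data = ctrs, pch = 23, cex = 2, fill = "black")
}
plot_voronoi(2)
plot_voronoi(8)
```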
## Hierarchical Clustering
Next, you'll cluster sites according to species composition. You'll use the `dune` dataset from the `vegan` R package and follow along with these source texts:
- [Ch. 8-9 (p.123-152)](https://drive.google.com/open?id=16T7kAO9W5Rz7xSHqwJrOdDu8RUTmTJ89&authuser=ben%40ecoquants.com&usp=drive_fs) [@kindtTreeDiversityAnalysis2005]
- [21.3.1 Agglomerative hierarchical clustering](https://bradleyboehmke.github.io/HOML/hierarchical.html#agglomerative-hierarchical-clustering) [@greenwellHandsOnMachineLearning]
### Load `dune` dataset
```{r}
librarian::shelf(
cluster, vegan)
# load dune dataset from package vegan
data("dune")
# show documentation on dataset if interactive
if (interactive())
help(dune)
```
**Question**: What are the rows and columns composed of in the `dune` data frame?
### Calculate Ecological Distances on `sites`
Before we calculate ecological distance between sites for `dune`, let's look at these metrics with a simpler dataset, like the example given in Chapter 8 by @kindtTreeDiversityAnalysis2005.
```{r}
sites <- tribble(
~site, ~sp1, ~sp2, ~sp3,
"A", 1, 1, 0,
"B", 5, 5, 0,
"C", 0, 0, 1) %>%
column_to_rownames("site")
sites
sites_manhattan <- vegdist(sites, method="manhattan")
sites_manhattan
sites_euclidean <- vegdist(sites, method="euclidean")
sites_euclidean
sites_bray <- vegdist(sites, method="bray")
sites_bray
```
### Bray-Curtis Dissimilarity on `sites`
Let's take a closer look at the [Bray-Curtis Dissimilarity](https://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity) distance:
$$
B_{ij} = 1 - \frac{2C_{ij}}{S_i + S_j}
$$
- $B_{ij}$: Bray-Curtis dissimilarity value between sites $i$ and $j$. \
1 = completely dissimilar (no shared species); 0 = identical.
- $C_{ij}$: sum of the lesser counts $C$ for shared species common to both sites $i$ and $j$
- $S_i$, $S_j$: sum of all species counts $S$ at site $i$ or $j$, respectively
So to calculate Bray-Curtis for the example `sites`:
- $B_{AB} = 1 - \frac{2 \times (1 + 1)}{2 + 10} = 1 - 4/12 = 1 - 1/3 = 0.667$
- $B_{AC} = 1 - \frac{2 \times 0}{2 + 1} = 1$
- $B_{BC} = 1 - \frac{2 \times 0}{10 + 1} = 1$
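As a quick check (a small sketch, not part of the original example; the helper `bray()` is made up here), the same values can be computed directly from the `sites` counts and compared with the `vegdist()` output above:
```{r}
# manual Bray-Curtis between two sites: 1 - 2 * sum(lesser shared counts) / sum(all counts)
bray <- function(a, b) 1 - 2 * sum(pmin(a, b)) / (sum(a) + sum(b))
bray(unlist(sites["A", ]), unlist(sites["B", ]))  # 0.667
bray(unlist(sites["A", ]), unlist(sites["C", ]))  # 1
bray(unlist(sites["B", ]), unlist(sites["C", ]))  # 1
```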
### Agglomerative hierarchical clustering on `dune`
See text to accompany code: _HOMLR_ [21.3.1 Agglomerative hierarchical clustering](https://bradleyboehmke.github.io/HOML/hierarchical.html#agglomerative-hierarchical-clustering).
```{r}
# Dissimilarity matrix
d <- vegdist(dune, method="bray")
# d is a "dist" object; convert to a matrix to see its dimensions and a corner
dim(as.matrix(d))
as.matrix(d)[1:5, 1:5]
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Dendrogram plot of hc1
plot(hc1, cex = 0.6, hang = -1)
# Compute agglomerative clustering with agnes
hc2 <- agnes(dune, method = "complete")
# Agglomerative coefficient
hc2$ac
# Dendrogram plot of hc2
plot(hc2, which.plot = 2)
# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
# function to compute coefficient
ac <- function(x) {
agnes(dune, method = x)$ac
}
# get agglomerative coefficient for each linkage method
purrr::map_dbl(m, ac)
# Compute ward linkage clustering with agnes
hc3 <- agnes(dune, method = "ward")
# Agglomerative coefficient
hc3$ac
# Dendrogram plot of hc3
plot(hc3, which.plot = 2)
```
### Divisive hierarchical clustering on `dune`
See text to accompany code: _HOMLR_ [21.3.2 Divisive hierarchical clustering](https://bradleyboehmke.github.io/HOML/hierarchical.html#divisive-hierarchical-clustering).
```{r}
# compute divisive hierarchical clustering
hc4 <- diana(dune)
# Divisive coefficient; amount of clustering structure found
hc4$dc
```
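To see the divisive tree itself (an optional addition beyond the coefficient; `pltree()` from the `cluster` package plots the dendrogram of a `diana` or `agnes` object, as in the HOMLR chapter):
```{r}
# dendrogram plot of the divisive clustering
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana(dune)")
```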
### Determining optimal clusters
See text to accompany code: _HOMLR_ [21.4 Determining optimal clusters](https://bradleyboehmke.github.io/HOML/hierarchical.html#determining-optimal-clusters).
```{r}
librarian::shelf(factoextra)
# Plot cluster results
p1 <- fviz_nbclust(dune, FUN = hcut, method = "wss", k.max = 10) +
ggtitle("(A) Elbow method")
p2 <- fviz_nbclust(dune, FUN = hcut, method = "silhouette", k.max = 10) +
ggtitle("(B) Silhouette method")
p3 <- fviz_nbclust(dune, FUN = hcut, method = "gap_stat", k.max = 10) +
ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)
```
### Working with dendrograms
See text to accompany code: _HOMLR_ [21.5 Working with dendrograms](https://bradleyboehmke.github.io/HOML/hierarchical.html#working-with-dendrograms).
```{r}
# Construct dendrogram for the dune example
hc5 <- hclust(d, method = "ward.D2" )
dend_plot <- fviz_dend(hc5)
dend_data <- attr(dend_plot, "dendrogram")
dend_cuts <- cut(dend_data, h = 8)  # h copied from the HOML example; adjust to the height range of this dendrogram if needed
fviz_dend(dend_cuts$lower[[2]])
# hc5 (Ward's method) was computed above; cut the tree into 4 groups
k <- 4
sub_grp <- cutree(hc5, k = k)
# Number of members in each cluster
table(sub_grp)
# Plot full dendrogram
fviz_dend(
hc5,
k = k,
horiz = TRUE,
rect = TRUE,
rect_fill = TRUE,
rect_border = "jco",
k_colors = "jco")
```