---
title: "Lab 4c. Deep Learning - iNaturalist"
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
```
# Deep Learning with R / Python Exercises
You'll first learn about computer vision techniques by working through the Chapter 5 lab exercises:
- 5.1 Introduction to convnets
  R: [html](./lab4c_5.1.intro-convnets.html), [Rmd](https://raw.githubusercontent.com/bbest/eds232-ml/main/lab4c_5.1.intro-convnets.Rmd) ; Python: [html](https://github.com/bbest/deep-learning-with-python-notebooks/blob/master/first_edition/5.1-introduction-to-convnets.ipynb), [ipynb](https://github.com/bbest/deep-learning-with-python-notebooks/raw/master/first_edition/5.1-introduction-to-convnets.ipynb)
- 5.2 Training a convnet from scratch on a small dataset
  R: [html](./lab4c_5.2.small-convnets.html), [Rmd](https://raw.githubusercontent.com/bbest/eds232-ml/main/lab4c_5.2.small-convnets.Rmd) ; Python: [html](https://github.com/bbest/deep-learning-with-python-notebooks/blob/master/first_edition/5.2-using-convnets-with-small-datasets.ipynb), [ipynb](https://github.com/bbest/deep-learning-with-python-notebooks/raw/master/first_edition/5.2-using-convnets-with-small-datasets.ipynb)
The subsequent lab exercises run up against the limits of using a CPU rather than a GPU, which is not available on `taylor.bren.ucsb.edu`. Here's as far as I was able to get for demonstration's sake; you're not expected to run this, but you might want to try if you have a personal computer with a GPU set up.
- 5.3 Using a pretrained convnet
  R: [html](./lab4c_5.3-using-a-pretrained-convnet.html), [Rmd](https://raw.githubusercontent.com/bbest/eds232-ml/main/lab4c_5.3-using-a-pretrained-convnet.Rmd) ; Python: [html](https://github.com/bbest/deep-learning-with-python-notebooks/blob/master/first_edition/5.3-using-a-pretrained-convnet.ipynb), [ipynb](https://github.com/bbest/deep-learning-with-python-notebooks/raw/master/first_edition/5.3-using-a-pretrained-convnet.ipynb)
# iNaturalist
The main lab that you'll turn in applies these techniques to a small subset of the iNaturalist species imagery. These data were downloaded from the links provided at [github.com/visipedia/inat_comp:2021/](https://github.com/visipedia/inat_comp/tree/master/2021). Of the 10,000 species, each with many images across the training (Train), training mini (Train Mini), validation (Val) and test sets, you'll draw only from the Train Mini set of images:
![](https://github.com/visipedia/inat_comp/raw/master/2021/assets/train_val_distribution.png)
```{r, echo=F, eval=F}
# in Terminal:
#   cd /courses/EDS232; mkdir 'inaturalist-2021'; cd 'inaturalist-2021'
#   curl -o train_mini.tar.gz https://ml-inat-competition-datasets.s3.amazonaws.com/2021/train_mini.tar.gz
#   tar -xf train_mini.tar.gz
librarian::shelf(
  dplyr, glue, jsonlite, listviewer, purrr, readr, tidyjson, tidyr)

# read the Train Mini metadata (lists of images, annotations and categories)
train_mini <- jsonlite::read_json("~/Desktop/iNat/train_mini.json")

# flatten one metadata element (m) into a rectangular CSV
write_meta <- function(m){
  train_mini[[m]] %>%
    tidyjson::spread_all() %>%
    tibble() %>%
    select(-document.id, -`..JSON`) %>%
    write_csv(
      glue("~/Desktop/iNat/train_mini_{m}.csv"))
}
write_meta("images")
write_meta("annotations")
write_meta("categories")
```
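If you'd like to inspect the metadata, the CSVs written above can be read back in. A minimal sketch, assuming the files were written to the same folder as in the chunk above:

```{r, eval=F}
librarian::shelf(dplyr, readr)

# peek at the species categories, one row per species
categories <- read_csv("~/Desktop/iNat/train_mini_categories.csv")
glimpse(categories)
```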
The first step is to move the images into the directory structure expected by the various models. The `keras::`[`flow_images_from_directory()`](https://keras.rstudio.com/reference/flow_images_from_directory.html) function expects its first argument `directory` to "contain one subdirectory per class". We are building models for two species `spp2` (binary) and ten species `spp10` (multi-class), and we want `train` (n=30), `validation` (n=10) and `test` (n=10) images assigned to each species. So we want a directory structure that looks something like this:
```
├── spp10
│   ├── test
│   │   ├── 01172_Animalia_Arthropoda_Insecta_Lepidoptera_Geometridae_Circopetes_obtusata
│   │   │   ├── cfd17d74-c7aa-49a2-9417-0a4e6aa4170d.jpg
│   │   │   ├── d6c2cf8f-89ef-40a2-824b-f51c85be030b.jpg
│   │   │   └── ...[+n_img=8]
│   │   ├── 06033_Plantae_Tracheophyta_Liliopsida_Asparagales_Orchidaceae_Epipactis_atrorubens
│   │   │   └── ...[n_img=10]
│   │   └── ...[+n_spp=8]
│   ├── train
│   │   ├── 01172_Animalia_Arthropoda_Insecta_Lepidoptera_Geometridae_Circopetes_obtusata
│   │   │   └── ...[n_img=30]
│   │   └── ...[+n_spp=9]
│   └── validation
│       ├── 01172_Animalia_Arthropoda_Insecta_Lepidoptera_Geometridae_Circopetes_obtusata
│       │   └── ...[n_img=10]
│       └── ...[+n_spp=9]
└── spp2
    ├── test
    │   └── ...[n_spp=2]
    ├── train
    │   └── ...[n_spp=2]
    └── validation
        └── ...[n_spp=2]
```
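With this structure in place, feeding images to Keras is straightforward. Here's a minimal sketch of how reading them might look, assuming the `~/inat` destination directory created in the next chunk and an arbitrary 150x150 target size (any consistent size works; it's not a value prescribed by the lab):

```{r, eval=F}
librarian::shelf(keras)

# rescale pixel values from [0, 255] to [0, 1]
train_datagen <- image_data_generator(rescale = 1/255)

# class labels are inferred from the subdirectory names
train_generator <- flow_images_from_directory(
  "~/inat/spp2/train",        # one subdirectory per class
  train_datagen,
  target_size = c(150, 150),  # resize all images to a consistent shape
  batch_size  = 5,
  class_mode  = "binary")     # use "categorical" for the 10-species models
```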
```{r}
librarian::shelf(
  digest, dplyr, DT, glue, purrr, readr, stringr, tidyr)

# path to folder containing species directories of images
dir_src  <- "/courses/EDS232/inaturalist-2021/train_mini"
dir_dest <- "~/inat"
dir.create(dir_dest, showWarnings = FALSE)

# get list of directories, one per species (n = 10,000 species)
dirs_spp <- list.dirs(dir_src, recursive = FALSE, full.names = TRUE)
n_spp <- length(dirs_spp)

# set a seed based on your username (unique within the class) for
# reproducible results; set it just before sampling, otherwise
# you'll get different results
Sys.info()[["user"]] %>%
  digest::digest2int() %>%
  set.seed()
i10 <- sample(1:n_spp, 10)

# show the 10 indices sampled of the 10,000 possible
i10

# show the 10 species directory names
basename(dirs_spp)[i10]

# show the first 2 species directory names
i2 <- i10[1:2]
basename(dirs_spp)[i2]

# set up data frame with source (src) and destination (dest) paths to images
d <- tibble(
  set     = c(rep("spp2", 2), rep("spp10", 10)),
  dir_sp  = c(dirs_spp[i2], dirs_spp[i10]),
  tbl_img = map(dir_sp, function(dir_sp){
    # each Train Mini species folder has exactly 50 images:
    # first 30 train, next 10 validation, last 10 test
    tibble(
      src_img = list.files(dir_sp, full.names = TRUE),
      subset  = c(rep("train", 30), rep("validation", 10), rep("test", 10))) })) %>%
  unnest(tbl_img) %>%
  mutate(
    sp       = basename(dir_sp),
    img      = basename(src_img),
    dest_img = glue("{dir_dest}/{set}/{subset}/{sp}/{img}"))

# show source and destination for first 10 rows of tibble
d %>%
  select(src_img, dest_img)

# iterate over rows, creating directory if needed and copying files
d %>%
  pwalk(function(src_img, dest_img, ...){
    dir.create(dirname(dest_img), recursive = TRUE, showWarnings = FALSE)
    file.copy(src_img, dest_img) })

# uncomment to show the entire tree of your destination directory
# system(glue("tree {dir_dest}"))
```
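Before modeling, it's worth a quick sanity check that the copy produced the expected 30/10/10 split per species. A sketch, reusing `dir_dest` from the chunk above:

```{r, eval=F}
# count the files actually copied: expect n = 30 train, 10 validation,
# 10 test per species within each set
tibble(path = list.files(dir_dest, recursive = TRUE)) %>%
  separate(path, c("set", "subset", "sp", "img"), sep = "/") %>%
  count(set, subset, sp)
```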
Your task is to apply your deep learning skills to build the following models:
1. **2 Species (binary classification) - neural net**. Draw from [3.4 🍿 Movies (binary classification)](./lab4b_examples.html). You'll first need to pre-process the images into a consistent shape, though (see 5.2.4 Data preprocessing, and the sketch after this list).
1. **2 Species (binary classification) - convolutional neural net**. Draw from the [dogs vs cats example](https://bbest.github.io/eds232-ml/lab4c_5.2.small-convnets.html).
1. **10 Species (multi-class classification) - neural net**. Draw from [3.5 📰 Newswires (multi-class classification)](./lab4b_examples.html).
1. **10 Species (multi-class classification) - convolutional neural net**. Draw from the [dogs vs cats example](https://bbest.github.io/eds232-ml/lab4c_5.2.small-convnets.html) and update the necessary values to go from binary to multi-class classification.
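For the standard neural nets (models 1 and 3), you'll want image tensors rather than generators. A minimal sketch of one way to do the pre-processing, assuming the `~/inat` paths from above; the helper `dir_to_tensor()` is hypothetical (not from the book), and the 32x32 size is only to keep the dense layers small:

```{r, eval=F}
librarian::shelf(keras, purrr)

# hypothetical helper: read all images under dir into one 4-D array,
# resized to a consistent shape and rescaled to [0, 1]
dir_to_tensor <- function(dir, size = c(32, 32)){
  paths <- list.files(dir, full.names = TRUE, recursive = TRUE)
  imgs  <- map(paths, function(p){
    image_load(p, target_size = size) %>%
      image_to_array() })
  x <- array(unlist(imgs), dim = c(size, 3, length(imgs)))
  aperm(x, c(4, 1, 2, 3)) / 255  # reorder to (image, height, width, channel)
}

x_train <- dir_to_tensor("~/inat/spp2/train")
# flatten to one row per image for a dense network
x_train <- array_reshape(x_train, c(dim(x_train)[1], 32 * 32 * 3))
```

Labels can be derived from the species subdirectory in each file path.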
In your models, be sure to include the following:
- Split the original images per species (n=50) into train (n=30), validation (n=10) and test (n=10). These are absurdly few images to feed into such complex deep learning models, but they will serve as a good learning example.
- Include the accuracy metric and validation data in the fitting process, and plot the fitting history.
- Evaluate loss and accuracy on your test results for each model. Compare the standard neural network and convolutional neural network results.
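To give a feel for the overall shape of the convnet models (2 and 4), here is a minimal sketch, assuming generators like `train_generator` sketched earlier (the validation and test generators are assumed to be built the same way); the layer sizes and epoch count are placeholders to adapt, not prescribed values, and older keras versions may need `fit_generator()` / `evaluate_generator()` in place of `fit()` / `evaluate()`:

```{r, eval=F}
librarian::shelf(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(150, 150, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%
  # binary: 1 unit + sigmoid; 10 species: 10 units + softmax
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  loss      = "binary_crossentropy",  # "categorical_crossentropy" for 10 species
  optimizer = optimizer_rmsprop(learning_rate = 1e-4),
  metrics   = "accuracy")

history <- model %>% fit(
  train_generator,                    # 60 train images / batch_size 5
  steps_per_epoch  = 12,
  epochs           = 10,
  validation_data  = validation_generator,
  validation_steps = 4)
plot(history)

# evaluate loss and accuracy on the held-out test set
model %>% evaluate(test_generator, steps = 4)
```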
# Deep Learning Cheat Sheets
Parameterizing all the image processing and layers can be confusing. Here are a few cheat sheets to help with a deeper understanding:
* [Deep learning with Keras cheatsheet | RStudio](https://raw.githubusercontent.com/rstudio/cheatsheets/main/keras.pdf)
* [Understanding Convolutions and Pooling in Neural Networks: a simple explanation | by Miguel Fernández Zafra | Towards Data Science](https://towardsdatascience.com/understanding-convolutions-and-pooling-in-neural-networks-a-simple-explanation-885a2d78f211)
* [Classification: Image Classification Cheatsheet | Codecademy](https://www.codecademy.com/learn/dlsp-classification-track/modules/dlsp-image-classification/cheatsheet)
* [CS 230 - Convolutional Neural Networks Cheatsheet | Stanford](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks)
* [CS231n Convolutional Neural Networks for Visual Recognition | Stanford](https://cs231n.github.io/convolutional-networks/)
# Submit Lab 4
To submit Lab 4, please submit the path (`/Users/*`) on [taylor.bren.ucsb.edu](https://taylor.bren.ucsb.edu) to your iNaturalist R Markdown (`*.Rmd`) or Jupyter Notebook (`*.ipynb`) file here:
- [Submit Lab 4. iNaturalist](https://docs.google.com/forms/d/e/1FAIpQLSddHuLejY_V-PAIQOO19TLHAyyQRyTUSsfEwpgeJP-Jx4MClA/viewform?usp=sf_link){target="_blank"}