* start porting scalable-ml
* predict working
* all notebooks working now
* remove hash from classification.py
* convert to running on WA only
* some typos
* fixed plot and title order
* Added words on linking feature importance, increased fontsize on violin plots

  Is this addition ok? "It is important to note the difference here between classes with good feature discrimination, as observed in the violin plots, and feature importance for the classifier, as estimated above. The violin plots demonstrate the relative biophysical importance of each feature to class types. However, the classifier estimates of feature importance are unrelated to biophysical processes. Instead, they represent the relative importance of each feature to the nominated classifier and may have no rational explanation or connection to physical processes - such is the mystery of machine learning models."

* minor editing to markdown in Plotting Results cell
* tidy markdown
* Consistency updates, add notebooks to readthedocs index
* Restore results files

Co-authored-by: Sean Chua <[email protected]>
Co-authored-by: Claire Phillips <[email protected]>
Co-authored-by: robbibt <[email protected]>
1 parent: ff2e374
Commit: e72aed7
Showing 18 changed files with 18,752 additions and 1 deletion.
Real_world_examples/Scalable_machine_learning/0_README.ipynb
133 changes: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img align=\"centre\" src=\"../../Supplementary_data/dea_logo_wide.jpg\" width=\"100%\">\n",
"\n",
"# Scalable Supervised Machine Learning on the Open Data Cube"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* **Prerequisites:** This notebook series assumes some familiarity with machine learning, statistical concepts, and python programming. Beginners should consider working through the earlier notebooks in the [dea-notebooks](https://github.com/GeoscienceAustralia/dea-notebooks) repository before attempting to run through this notebook series."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Background\n", | ||
"\n", | ||
"Classification of satellite images using supervised machine learning (ML) techniques has become a common occurence in the remote sensing literature. Machine learning offers an effective means for identifying complex land cover classes in a relatively efficient manner. However, sensibly implementing machine learning classifiers is not always straighforward owing to the training data requirements, the computational requirements, and the challenge of sorting through a proliferating number of software libraries. Add to this the complexity of handling large volumes of satellite data and the task can become unwieldy at best. \n", | ||
"\n", | ||
"This series of notebooks aims to lessen the difficulty of running machine learning classifiers on satellite imagery by guiding the user through the steps necessary to classify satellite data using the [Open Data Cube](https://www.opendatacube.org/) (ODC). This is achieved in two ways. Firstly, the critical steps in a ML workflow (in the context of the ODC) are broken down into discrete notebooks which are extensively documented. And secondly, a number of custom python functions have been written to ease the complexity of running ML on the ODC. These include (among others) `collect_training_data`, `spatial_cluster`, `SKCV`, and `predict_xr`, all of which are contained in the [dea_tools.classification](../../Tools/dea_tools/classification.py) package. These functions are introduced and explained further in the relevant sections of the notebooks.\n", | ||
"\n", | ||
"There are four primary notebooks in this notebook series (along with an optional fifth notebook) that each represent a critical step in a ML workflow. \n", | ||
"\n", | ||
"1. `Extracting_training_data.ipynb` explores how to extract training data (feature layers) from the ODC using geometries within a shapefile (or geojson). The goal of this notebook is to familarise users with the `collect_training_data` function so you can extract the appropriate data for your use-case.\n", | ||
"2. `Inspect_training_data.ipynb`: After having extracted training data from the ODC, its important to inspect the data using a number of statistical methods to aid in understanding if our feature layers are useful for distinguishing between classes.\n", | ||
"3. `Evaluate_optimize_fit_classifier.ipynb`: Using the training data extracted in the first notebook, this notebook first evaluates the accuracy of a given ML model (using nested, spatial k-fold cross validation), performs a hyperparameter optimization, and then fits a model on the training data.\n", | ||
"4. `Classify_satellite_data.ipynb`: This is where we load in satellite data and classify it using the model created in the previous notebook. The notebook initially asks you to provide a number of small test locations so we can observe visually how well the model is going at classifying real data. The last part of the notebook attempts to classify a much larger region. \n", | ||
"5. `Object-based_filtering.ipynb`: This notebook is provided as an optional extra. It guides you through converting your pixel-based classification into an object-based classification using image segmentation.\n", | ||
"\n", | ||
"The default example in the notebooks uses a training dataset containing \"crop\" and \"non-crop\" labels (labelled as 1 and 0 in the geojson, respectively) from across Western Australia. The training data is called `\"crop_training_WA.geojson\"`, and is located in the `'data/'` folder. This reference data was acquired and pre-processed from the USGS's Global Food Security Analysis Data portal [here](https://croplands.org/app/data/search?page=1&page_size=200) and [here](https://e4ftl01.cr.usgs.gov/MEASURES/GFSAD30VAL.001/2008.01.01/). By the end of this notebook series we will have produced a model for identifying cropland areas in Western Australia, and we will output a cropland mask (as a geotiff) for around region around south-east WA.\n", | ||
"\n", | ||
"If you wish to begin running your own classification workflow, the first step is to replace this training data with your own in the `Extract_training_data.ipynb` notebook. However, it is best to run through the default example first to ensure you understand the content before altering the notebooks for your specific use case. \n", | ||
"\n", | ||
"**Important notes**\n", | ||
"\n", | ||
"* There are many different methods for running ML models and the approach used here may not suit your own classification problem. This is especially true for the `Evaluate_optimize_fit_classifier.ipynb` notebook, which has been crafted to suit the default training data. It's advisable to research the different methods for evaluating and training a model to determine which approach is best for you. Remember, the first step of any scientific pursuit is to precisely define the problem. \n", | ||
"* The word \"**Scalable**\" in the title _Scalable ML on the ODC_ refers to scalability within the contraints of the machine you're running. These notebooks rely on [dask](https://dask.org/) (and [dask-ml](https://ml.dask.org/)) to manage memory and distribute the computations across mulitple cores. However, the notebooks are set up for the case of running on a single machine. For example, if your machine has 2 cores and 16 Gb of RAM (these are the specs on the default Sandbox), then you'll only be able to load and classify data up to that 16 Gb limit (and parallelization will be limited to 2 cores). Access to larger machines is required to scale analyses to very large areas. Its unlikley you'll be able to use these notebooks to classify satellite data at the country-level scale using laptop sized machines. To better understand how we use dask, have a look at the [dask notebook](../../Beginners_guide/09_Parallel_processing_with_Dask.ipynb).\n", | ||
"\n", | ||
"**Helpful Resources**\n", | ||
"\n", | ||
"* There are many online courses that can help you understand the fundamentals of machine learning with python e.g. [edX](https://www.edx.org/course/machine-learning-with-python-a-practical-introduct), [coursera](https://www.coursera.org/learn/machine-learning-with-python). \n", | ||
"* The [Scikit-learn](https://scikit-learn.org/stable/supervised_learning.html) documentation provides information on the available models and their parameters.\n", | ||
"* This [review article](https://www.tandfonline.com/doi/full/10.1080/01431161.2018.1433343) provides a nice overview of machine learning in the context of remote sensing.\n", | ||
"* The stand alone notebook, [Machine_learning_with_ODC](https://github.com/GeoscienceAustralia/dea-notebooks/blob/develop/Frequently_used_code/Machine_learning_with_ODC.ipynb), in the `Real_world_examples/` folder is a companion piece to these notebooks and provides a more succint (but less descriptive) version of the workflow demonstrated here.\n", | ||
"___\n" | ||
] | ||
}, | ||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting Started\n",
"\n",
"To begin working through the notebooks in this `Scalable ML on the ODC` workflow, go to the first notebook `Extracting_training_data.ipynb`.\n", | ||
"\n", | ||
"1. [Extract_training_data](1_Extract_training_data.ipynb) \n", | ||
"2. [Inspect_training_data](2_Inspect_training_data.ipynb)\n", | ||
"3. [Evaluate_optimize_fit_classifier](3_Evaluate_optimize_fit_classifier.ipynb)\n", | ||
"4. [Classify_satellite_data](4_Classify_satellite_data.ipynb)\n", | ||
"5. [Object-based_filtering](5_Object-based_filtering.ipynb)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"***\n", | ||
"\n", | ||
"## Additional information\n", | ||
"\n", | ||
"**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). \n", | ||
"Digital Earth Australia data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.\n", | ||
"\n", | ||
"**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).\n", | ||
"If you would like to report an issue with this notebook, you can file one on [Github](https://github.com/GeoscienceAustralia/dea-notebooks).\n", | ||
"\n", | ||
"**Last modified:** March 2021" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Tags\n", | ||
"Browse all available tags on the DEA User Guide's [Tags Index](https://docs.dea.ga.gov.au/genindex.html)" | ||
] | ||
}, | ||
{
"cell_type": "raw",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
"**Tags**: :index:`sandbox compatible`, :index:`machine learning`"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}