From cd8ac19df7fab0d29113af9a2abbe5ae96617057 Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Thu, 29 Oct 2020 22:24:38 +0100
Subject: [PATCH 1/7] Added an example notebook on dask-sql. Fixes #170

---
 applications/sql-queries.ipynb | 322 +++++++++++++++++++++++++++++++++
 1 file changed, 322 insertions(+)
 create mode 100644 applications/sql-queries.ipynb

diff --git a/applications/sql-queries.ipynb b/applications/sql-queries.ipynb
new file mode 100644
index 00000000..8c4b4857
--- /dev/null
+++ b/applications/sql-queries.ipynb
@@ -0,0 +1,322 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Analyzing dask data with SQL"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Dask's abilities to analyze data are huge, and because its dataframe API closely mirrors the pandas API, it is very easy for pandas experts to migrate.\n",
+    "However, for many applications and users, SQL still plays a large role in querying and retrieving data.\n",
+    "For very good reasons: it is easy to learn, it is a common language that many (data) systems understand, and it contains all the important elements for querying data.\n",
+    "\n",
+    "With [dask-sql](https://nils-braun.github.io/dask-sql/), which leverages [Apache Calcite](https://calcite.apache.org/), it is possible to query the data with SQL and still use the full power of a dask cluster."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`dask-sql` can be installed via conda, like in the following cell:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! conda install -y dask-sql -c conda-forge"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "If you want to analyze data with dask-sql in Python, you need to follow three steps:\n",
+    "\n",
+    "1. Create a context\n",
+    "2. Load and register your data\n",
+    "3. Start querying!\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Create a Context"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In SQL, all tables and functions are specified via names. Therefore we need to have some place to store all the registered tables (and functions), so that dask-sql knows which data they refer to.\n",
+    "This is the task of the Context.\n",
+    "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask_sql import Context\n",
+    "c = Context()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Set up a dask cluster"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now would be the best time to connect to your dask cluster if you have one.\n",
+    "dask-sql leverages dask for performing the computations on the data.\n",
+    "\n",
+    "Check out one of [the many ways](https://docs.dask.org/en/latest/setup.html) to create and connect to your dask cluster.\n",
+    "\n",
+    "For this example we will create a cluster running locally.\n",
+    "This is optional, as dask can also create one implicitly, but this way we get more diagnostics and insights.\n",
+    "You can click the link shown after the client initialization to open the dask dashboard."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask.distributed import Client\n",
+    "\n",
+    "client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')\n",
+    "client"
+   ]
+  },
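+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you already have a cluster running somewhere, you could instead connect to it by passing its scheduler address to the `Client` (a minimal sketch; the address below is a hypothetical placeholder):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch: connect to an already running cluster instead of the local one above.\n",
+    "# The scheduler address is a hypothetical placeholder, so the line stays commented out.\n",
+    "# client = Client(\"tcp://scheduler-address:8786\")"
+   ]
+  },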
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load and register the data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So far, no data was involved. Let's change that! There are many ways to get data into your cluster and to tell dask-sql where to find it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Register a dask or pandas dataframe"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you already have a dask dataframe (which is dask's abstraction of a pandas dataframe with nearly the same API), you can directly associate it with a name:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask.datasets import timeseries\n",
+    "df = timeseries()\n",
+    "type(df)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "c.register_dask_table(df.persist(), \"timeseries\")"
+   ]
+  },
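+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check that the registration worked, we can peek at a few rows of the new table (a minimal sketch, assuming your dask-sql version supports `LIMIT`; querying is covered in detail in section 3):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Peek at the registered table; LIMIT keeps the result small.\n",
+    "c.sql(\"SELECT * FROM timeseries LIMIT 5\").compute()"
+   ]
+  },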
\n", + "\n", + "Please note that we have persisted the data before passing it to dask-sql.\n", + "This will tell dask that we want to prefetch the data into memory.\n", + "Doing so will speed up the queries a lot, so you probably always want to do this.\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is also possible to register a pandas dataframe directly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.DataFrame({\"column\": [1, 2, 3]})\n", + "c.register_dask_table(df, \"pandas\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Read in data an from external location and register it" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In most of the cases however, your data will live on some external storage device, such as a local disk, S3 or hdfs.\n", + "You can leverage dask's large set of understood input formats and sources to load the data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have now registered tables in our context with the given names.\n", + "Let's do some data analysis!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Query the data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Whenever you call the `.sql()` method of the context, dask-sql will hand your query to Apache Calcite to turn it into a relational algebra and will then create a dask computation graph, which will execute your computation. \n", + "\n", + "Let's see how this works in action:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "c.sql(\"\"\"\n", + " SELECT AVG(x) FROM timeseries\n", + "\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result is again a dask dataframe, which only represents the computation (without having executed it) so far.\n", + "Lets trigger it now!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "c.sql(\"\"\"\n", + " SELECT AVG(x) FROM timeseries\n", + "\"\"\").compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Congratulations! You have just queried your data with SQL.\n", + "If you check out the dask dashboard, you see that this has triggered some computations in dask.\n", + "\n", + "Of course, it is also possible to calculate more complex queries:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "c.sql(\"\"\"\n", + " SELECT\n", + " lhs.name,\n", + " lhs.id,\n", + " lhs.x\n", + " FROM\n", + " timeseries AS lhs\n", + " JOIN\n", + " (\n", + " SELECT\n", + " name AS max_name,\n", + " MAX(x) AS max_x\n", + " FROM timeseries\n", + " GROUP BY name\n", + " ) AS rhs\n", + " ON\n", + " lhs.name = rhs.max_name AND\n", + " lhs.x = rhs.max_x\n", + "\"\"\").compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can find more information on dask-sql in the [documentation](https://dask-sql.readthedocs.io/)." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From e637822a7e8189f4325ee19e25c3a4f25bc531d6 Mon Sep 17 00:00:00 2001 From: Nils Braun Date: Thu, 29 Oct 2020 22:52:39 +0100 Subject: [PATCH 2/7] Fixes needed to make it run on binder --- applications/sql-queries.ipynb | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/applications/sql-queries.ipynb b/applications/sql-queries.ipynb index 8c4b4857..5a3aeb84 100644 --- a/applications/sql-queries.ipynb +++ b/applications/sql-queries.ipynb @@ -22,7 +22,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`dask-sql` can be installed via conda, like in the following cell:" + "`dask-sql` can be installed via conda (or mamba) or pip, like in the following cell:" ] }, { @@ -31,7 +31,7 @@ "metadata": {}, "outputs": [], "source": [ - "! conda install -y dask-sql -c conda-forge" + "! mamba install --quiet -y dask-sql " ] }, { @@ -59,7 +59,10 @@ "source": [ "In SQL, all tables and functions are specified via names. Therefore we need to have some place to store all the registered tables (and functions), so that dask-sql knows which data it refers to.\n", "This is the task of the Context.\n", - "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application." + "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application.\n", + "\n", + "As we have just installed dask-sql, the `$JAVA_HOME` is not picked up correctly.\n", + "We will help dask-sql a bit and specify it explicitly." ] }, { @@ -68,6 +71,9 @@ "metadata": {}, "outputs": [], "source": [ + "import os\n", + "os.environ[\"JAVA_HOME\"] = os.environ[\"CONDA_PREFIX\"]\n", + "\n", "from dask_sql import Context\n", "c = Context()" ] From 12840cdc6b6624140477e257c5ee3a6f37345140 Mon Sep 17 00:00:00 2001 From: Nils Braun Date: Fri, 30 Oct 2020 16:24:20 +0100 Subject: [PATCH 3/7] Rename the notebook and add it to the index --- index.rst | 1 + applications/sql-queries.ipynb => sql.ipynb | 0 2 files changed, 1 insertion(+) rename applications/sql-queries.ipynb => sql.ipynb (100%) diff --git a/index.rst b/index.rst index 2a867950..4c6679ad 100644 --- a/index.rst +++ b/index.rst @@ -22,6 +22,7 @@ You can run these examples in a live session here: |Binder| delayed futures machine-learning + sql xarray .. 
diff --git a/applications/sql-queries.ipynb b/sql.ipynb
similarity index 100%
rename from applications/sql-queries.ipynb
rename to sql.ipynb

From 3c0b2f2bf7f9fecfee49b4769ef7d61e6cbc6e87 Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Sat, 31 Oct 2020 11:38:22 +0100
Subject: [PATCH 4/7] Install dask-sql via conda before launching the notebook

---
 binder/environment.yml |  1 +
 sql.ipynb              | 11 +----------
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/binder/environment.yml b/binder/environment.yml
index e56a0977..7800d520 100644
--- a/binder/environment.yml
+++ b/binder/environment.yml
@@ -26,5 +26,6 @@ dependencies:
   - requests
   - py-xgboost
   - dask-xgboost
+  - dask-sql
   - pip:
     - mimesis
diff --git a/sql.ipynb b/sql.ipynb
index 5a3aeb84..0edd8e0d 100644
--- a/sql.ipynb
+++ b/sql.ipynb
@@ -22,16 +22,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "`dask-sql` can be installed via conda (or mamba) or pip, like in the following cell:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "! mamba install --quiet -y dask-sql "
+    "`dask-sql` can be installed via conda (or mamba) or pip (it is already installed in the environment)"
    ]
   },
   {

From fa29c8624973e5cc3bf8ecf46a96d6dce10a01f8 Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Tue, 3 Nov 2020 12:31:55 +0100
Subject: [PATCH 5/7] Revert "Install dask-sql via conda before launching the notebook"

This reverts commit 3c0b2f2bf7f9fecfee49b4769ef7d61e6cbc6e87.

---
 binder/environment.yml |  1 -
 sql.ipynb              | 11 ++++++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/binder/environment.yml b/binder/environment.yml
index 7800d520..e56a0977 100644
--- a/binder/environment.yml
+++ b/binder/environment.yml
@@ -26,6 +26,5 @@ dependencies:
   - requests
   - py-xgboost
   - dask-xgboost
-  - dask-sql
   - pip:
     - mimesis
diff --git a/sql.ipynb b/sql.ipynb
index 0edd8e0d..5a3aeb84 100644
--- a/sql.ipynb
+++ b/sql.ipynb
@@ -22,7 +22,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "`dask-sql` can be installed via conda (or mamba) or pip (it is already installed in the environment)"
+    "`dask-sql` can be installed via conda (or mamba) or pip, like in the following cell:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! mamba install --quiet -y dask-sql "
    ]
   },
   {

From 77f59f1ea19c4926efa8167f77fd76e873a2faae Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Tue, 3 Nov 2020 12:33:03 +0100
Subject: [PATCH 6/7] Try installation with conda

---
 sql.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql.ipynb b/sql.ipynb
index 5a3aeb84..0f421c3f 100644
--- a/sql.ipynb
+++ b/sql.ipynb
@@ -31,7 +31,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "! mamba install --quiet -y dask-sql "
+    "!conda install -y dask-sql "
    ]
   },
   {

From a4eed0ce2b32adaa9a6aa810e57a68167840b695 Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Mon, 30 Nov 2020 21:34:29 +0100
Subject: [PATCH 7/7] Fix for the newest dask-sql version and use mamba

---
 sql.ipynb | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/sql.ipynb b/sql.ipynb
index 0f421c3f..c584e534 100644
--- a/sql.ipynb
+++ b/sql.ipynb
@@ -31,7 +31,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!conda install -y dask-sql "
+    "! mamba install -y dask-sql"
    ]
   },
   {
@@ -59,10 +59,7 @@
    "source": [
     "In SQL, all tables and functions are specified via names. Therefore we need to have some place to store all the registered tables (and functions), so that dask-sql knows which data they refer to.\n",
     "This is the task of the Context.\n",
-    "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application.\n",
-    "\n",
-    "As we have just installed dask-sql, the `$JAVA_HOME` is not picked up correctly.\n",
-    "We will help dask-sql a bit and specify it explicitly."
+    "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application."
    ]
   },
   {
@@ -71,9 +68,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import os\n",
-    "os.environ[\"JAVA_HOME\"] = os.environ[\"CONDA_PREFIX\"]\n",
-    "\n",
     "from dask_sql import Context\n",
     "c = Context()"
    ]
@@ -156,7 +150,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "c.register_dask_table(df.persist(), \"timeseries\")"
+    "c.create_table(\"timeseries\", df.persist())"
    ]
   },
@@ -187,7 +181,7 @@
    "source": [
     "import pandas as pd\n",
     "df = pd.DataFrame({\"column\": [1, 2, 3]})\n",
-    "c.register_dask_table(df, \"pandas\")"
+    "c.create_table(\"pandas\", df)"
    ]
   },
@@ -202,7 +196,8 @@
    "metadata": {},
    "source": [
     "In most cases, however, your data will live on some external storage device, such as a local disk, S3 or HDFS.\n",
-    "You can leverage dask's large set of supported input formats and sources to load the data."
+    "You can leverage dask's large set of supported input formats and sources to load the data.\n",
+    "Find out more information in the [documentation](https://dask-sql.readthedocs.io/en/latest/pages/data_input.html) of dask-sql."
    ]
   },
   {