Added an example notebook on dask-sql. #171

Closed
wants to merge 7 commits into from
1 change: 1 addition & 0 deletions index.rst
@@ -22,6 +22,7 @@ You can run these examples in a live session here: |Binder|
delayed
futures
machine-learning
sql
xarray

.. toctree::
323 changes: 323 additions & 0 deletions sql.ipynb
@@ -0,0 +1,323 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analyzing dask data with SQL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dask's abilities to analyze data are huge and due to its large similarities with the pandas dataframe API, it is very easy for pandas experts to migrate.\n",
"However, for many applications amd users, SQL still plays a large role for querying and retrieving data.\n",
"For very good reasons: it is easy to learn, it is a common language many (data) systems understand and it contains all the important elements to query the data.\n",
"\n",
"With [dask-sql](https://nils-braun.github.io/dask-sql/), which leverages [Apache Calcite](https://calcite.apache.org/), it is possible to query the data with SQL and still use the full power of a dask cluster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`dask-sql` can be installed via conda (or mamba) or pip, like in the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! mamba install -y dask-sql"
]
},
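{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer pip, the equivalent command looks like the following (commented out here so that only one installer runs):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# alternative installation via pip (sketch; uncomment to use instead of mamba)\n",
"# ! pip install dask-sql"
]
},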
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"If you want to analyze data with dask-sql in python, you need to do three steps:\n",
"\n",
"1. Create a context\n",
"2. Load and register your data \n",
"3. Start querying!\n"
]
},
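{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick preview, the three steps together look roughly like the following sketch (`dask_df` stands for any dask dataframe; each step is explained in detail below):\n",
"\n",
"```python\n",
"from dask_sql import Context\n",
"\n",
"c = Context()                                     # 1. create a context\n",
"c.create_table(\"timeseries\", dask_df)             # 2. register a dataframe under a name\n",
"c.sql(\"SELECT AVG(x) FROM timeseries\").compute()  # 3. query it with SQL\n",
"```"
]
},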
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Create a Context"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In SQL, all tables and functions are specified via names. Therefore we need to have some place to store all the registered tables (and functions), so that dask-sql knows which data it refers to.\n",
"This is the task of the Context.\n",
"You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dask_sql import Context\n",
"c = Context()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up a dask cluster"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now would be the best time to connect to your dask cluster if you have one. \n",
"dask-sql leverages dask for performing the computations on the data.\n",
"\n",
"Check out one of [the many ways](https://docs.dask.org/en/latest/setup.html) to create and connect to your dask cluster.\n",
"\n",
"For this example we will create a cluster running locally.\n",
"This is optional as dask can also create one implicetly, but we can get more diagnostics and insights.\n",
"You can click the link shown after the client intialization to show the dask dashboard."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dask.distributed import Client\n",
"\n",
"client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')\n",
"client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Load and register the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far, no data was involved. Let's change that! There are many ways how you can get the data into your cluster and tell dask-sql, where to find it. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Register a dask or pandas dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you have already a dask dataframe (which is dask's abstraction of a pandas dataframe with nearly the same API), you can directly associate it with a name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dask.datasets import timeseries\n",
"df = timeseries()\n",
"type(df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"c.create_table(\"timeseries\", df.persist())"
Suggested change
"c.create_table(\"timeseries\", df.persist())"
"c.create_table(\"timeseries\", persist=True)"

This can be done using a kwarg of create_table.

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"\n",
"Please note that we have persisted the data before passing it to dask-sql.\n",
"This will tell dask that we want to prefetch the data into memory.\n",
"Doing so will speed up the queries a lot, so you probably always want to do this.\n",
Given some of the discussion on dask-contrib/dask-sql#218 and the fact that persisting is no longer dask-sql's default behavior (dask-contrib/dask-sql#245), it might be worth discussing here the trade-offs of persisting to memory before (speed up vs. potential OOM errors). cc @VibhuJawa in case you have any thoughts here

"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to register a pandas dataframe directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.DataFrame({\"column\": [1, 2, 3]})\n",
"c.create_table(\"pandas\", df)"
Suggested change
"c.create_table(\"pandas\", df)"
"c.create_table(\"pandas\", df, persist=True)"

By the next release of dask-sql, persisting will no longer be the default behavior, and we would need to include this to make sure that happens. Given the discussion above, we may opt to not persist here.

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read in data an from external location and register it"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In most of the cases however, your data will live on some external storage device, such as a local disk, S3 or hdfs.\n",
"You can leverage dask's large set of understood input formats and sources to load the data.\n",
Is there any publicly available database we could register to illustrate this concept?


Hi @charlesbluca

If you are looking for S3 datasets, can we make use of the New York taxi dataset used in the Coiled docs?

import dask.dataframe as dd

df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
        "store_and_fwd_flag": "string",
        "PULocationID": "UInt16",
        "DOLocationID": "UInt16",
    },
    storage_options={"anon": True},
    blocksize="16 MiB",
).persist()

c.create_table("trip_data", persist=True)
 

And try to add some groupby/aggregation examples? What do you think?

# something like
c.sql("SELECT passenger_count, AVG(tip_amount) FROM trip_data GROUP BY passenger_count")


Yeah that seems like a good idea! Do you know if it's possible to read this in directly with a query, with something like:

    c.sql(
        f"""
        CREATE TABLE
            trip_data
        WITH (
            location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
            format = 'csv',
            ...
        )
    """
    )

Passing the kwargs into the WITH (...)?


Haven't tried this yet, but can we try something like this? Expecting that it should parse and pass the kwargs on to the input plugins (inspired by this example here).

What do you think?

CREATE TABLE trip_data
 WITH (
    location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    format = 'csv',
    parse_dates = ARRAY ['pep_pickup_datetime', 'tpep_dropoff_datetime'],
    type = MAP ['payment_type', 'UInt8',
               'VendorID', 'UInt8',
               'passenger_count', 'UInt8',
               'RatecodeID', 'UInt8',
               'store_and_fwd_flag', 'string',
               'PULocationID', 'UInt16',
                'DOLocationID', 'UInt16']
	storage_options= MAP ['anon','true'],
	blocksize='16 MiB',
)



I have tried this SQL after fixing bugs and it seems to be working for me; let me know if this query works for you as well.

Refactored query:

CREATE TABLE trip_data
 WITH (
    location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    format = 'csv',
    parse_dates = ARRAY ['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    dtype = MAP ['payment_type', 'UInt8',
               'VendorID', 'UInt8',
               'passenger_count', 'UInt8',
               'RatecodeID', 'UInt8',
               'store_and_fwd_flag', 'string',
               'PULocationID', 'UInt16',
                'DOLocationID', 'UInt16'],
	storage_options= MAP ['anon','true'],
	blocksize='16 MiB'
)



@charlesbluca Oct 14, 2021

Yeah that works! The table is loaded in, though it looks like groupby operations may be a little complex for a simple demo:

In [7]: c.sql("select passenger_count, sum(tip_amount) from trip_data group by passenger_count")
Out[7]: 
Dask DataFrame Structure:
              passenger_count SUM("trip_data"."tip_amount")
npartitions=1                                              
                         Int8                       float64
                          ...                           ...
Dask Name: getitem, 11857 tasks

Maybe just showing the futures is sufficient to show that it works. Thanks for the help @rajagurunath 😄

"Find our more information in the [documentation](https://dask-sql.readthedocs.io/en/latest/pages/data_input.html) of dask-sql."
]
},
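{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a CSV file on S3 could be registered directly from its location.\n",
"The following cell is only a sketch (the path, `format` and `storage_options` are placeholders to adapt to your own data), so it is left commented out:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: register a table straight from an external location (not run in this notebook)\n",
"# c.create_table(\n",
"#     \"trip_data\",\n",
"#     \"s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv\",\n",
"#     format=\"csv\",\n",
"#     storage_options={\"anon\": True},\n",
"# )"
]
},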
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have now registered tables in our context with the given names.\n",
"Let's do some data analysis!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Query the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whenever you call the `.sql()` method of the context, dask-sql will hand your query to Apache Calcite to turn it into a relational algebra and will then create a dask computation graph, which will execute your computation. \n",
"\n",
"Let's see how this works in action:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"c.sql(\"\"\"\n",
" SELECT AVG(x) FROM timeseries\n",
"\"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is again a dask dataframe, which only represents the computation (without having executed it) so far.\n",
"Lets trigger it now!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"c.sql(\"\"\"\n",
" SELECT AVG(x) FROM timeseries\n",
"\"\"\").compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations! You have just queried your data with SQL.\n",
"If you check out the dask dashboard, you see that this has triggered some computations in dask.\n",
"\n",
"Of course, it is also possible to calculate more complex queries:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"c.sql(\"\"\"\n",
" SELECT\n",
" lhs.name,\n",
" lhs.id,\n",
" lhs.x\n",
" FROM\n",
" timeseries AS lhs\n",
" JOIN\n",
" (\n",
" SELECT\n",
" name AS max_name,\n",
" MAX(x) AS max_x\n",
" FROM timeseries\n",
" GROUP BY name\n",
" ) AS rhs\n",
" ON\n",
" lhs.name = rhs.max_name AND\n",
" lhs.x = rhs.max_x\n",
"\"\"\").compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can find more information on dask-sql in the [documentation](https://dask-sql.readthedocs.io/)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}