From cd8ac19df7fab0d29113af9a2abbe5ae96617057 Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Thu, 29 Oct 2020 22:24:38 +0100
Subject: [PATCH 1/7] Added an example notebook on dask-sql. Fixes #170

---
 applications/sql-queries.ipynb | 322 +++++++++++++++++++++++++++++++++
 1 file changed, 322 insertions(+)
 create mode 100644 applications/sql-queries.ipynb

diff --git a/applications/sql-queries.ipynb b/applications/sql-queries.ipynb
new file mode 100644
index 00000000..8c4b4857
--- /dev/null
+++ b/applications/sql-queries.ipynb
@@ -0,0 +1,322 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Analyzing dask data with SQL"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Dask's abilities to analyze data are huge, and because its dataframe API closely mirrors the pandas API, it is very easy for pandas experts to migrate.\n",
+    "However, for many applications and users, SQL still plays a large role in querying and retrieving data.\n",
+    "For very good reasons: it is easy to learn, it is a common language that many (data) systems understand, and it contains all the important elements for querying data.\n",
+    "\n",
+    "With [dask-sql](https://nils-braun.github.io/dask-sql/), which leverages [Apache Calcite](https://calcite.apache.org/), it is possible to query the data with SQL and still use the full power of a dask cluster."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`dask-sql` can be installed via conda, like in the following cell:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! conda install -y dask-sql -c conda-forge"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "If you want to analyze data with dask-sql in Python, you need to follow three steps:\n",
+    "\n",
+    "1. Create a context\n",
+    "2. Load and register your data\n",
+    "3. Start querying!\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Create a Context"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In SQL, all tables and functions are specified via names. Therefore we need to have some place to store all the registered tables (and functions), so that dask-sql knows which data they refer to.\n",
+    "This is the task of the Context.\n",
+    "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask_sql import Context\n",
+    "c = Context()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Set up a dask cluster"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now would be the best time to connect to your dask cluster if you have one.\n",
+    "dask-sql leverages dask for performing the computations on the data.\n",
+    "\n",
+    "Check out one of [the many ways](https://docs.dask.org/en/latest/setup.html) to create and connect to your dask cluster.\n",
+    "\n",
+    "For this example we will create a cluster running locally.\n",
+    "This is optional, as dask can also create one implicitly, but this way we get more diagnostics and insights.\n",
+    "You can click the link shown after the client initialization to open the dask dashboard."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask.distributed import Client\n",
+    "\n",
+    "client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')\n",
+    "client"
+   ]
+  },
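+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you already have a cluster running somewhere, you could instead connect to it by passing its scheduler address to the `Client` (a minimal sketch; the address below is a hypothetical placeholder):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch: connect to an already running cluster instead of the local one above.\n",
+    "# The scheduler address is a hypothetical placeholder, so the line stays commented out.\n",
+    "# client = Client(\"tcp://scheduler-address:8786\")"
+   ]
+  },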
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load and register the data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So far, no data was involved. Let's change that! There are many ways to get data into your cluster and to tell dask-sql where to find it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Register a dask or pandas dataframe"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you already have a dask dataframe (which is dask's abstraction of a pandas dataframe with nearly the same API), you can directly associate it with a name:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask.datasets import timeseries\n",
+    "df = timeseries()\n",
+    "type(df)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "c.register_dask_table(df.persist(), \"timeseries\")"
+   ]
+  },
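+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check that the registration worked, we can peek at a few rows of the new table (a minimal sketch, assuming your dask-sql version supports `LIMIT`; querying is covered in detail in section 3):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Peek at the registered table; LIMIT keeps the result small.\n",
+    "c.sql(\"SELECT * FROM timeseries LIMIT 5\").compute()"
+   ]
+  },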
\n", + "\n", + "Please note that we have persisted the data before passing it to dask-sql.\n", + "This will tell dask that we want to prefetch the data into memory.\n", + "Doing so will speed up the queries a lot, so you probably always want to do this.\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is also possible to register a pandas dataframe directly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.DataFrame({\"column\": [1, 2, 3]})\n", + "c.register_dask_table(df, \"pandas\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Read in data an from external location and register it" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In most of the cases however, your data will live on some external storage device, such as a local disk, S3 or hdfs.\n", + "You can leverage dask's large set of understood input formats and sources to load the data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have now registered tables in our context with the given names.\n", + "Let's do some data analysis!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Query the data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Whenever you call the `.sql()` method of the context, dask-sql will hand your query to Apache Calcite to turn it into a relational algebra and will then create a dask computation graph, which will execute your computation. \n", + "\n", + "Let's see how this works in action:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "c.sql(\"\"\"\n", + " SELECT AVG(x) FROM timeseries\n", + "\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result is again a dask dataframe, which only represents the computation (without having executed it) so far.\n", + "Lets trigger it now!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "c.sql(\"\"\"\n", + " SELECT AVG(x) FROM timeseries\n", + "\"\"\").compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Congratulations! You have just queried your data with SQL.\n", + "If you check out the dask dashboard, you see that this has triggered some computations in dask.\n", + "\n", + "Of course, it is also possible to calculate more complex queries:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "c.sql(\"\"\"\n", + " SELECT\n", + " lhs.name,\n", + " lhs.id,\n", + " lhs.x\n", + " FROM\n", + " timeseries AS lhs\n", + " JOIN\n", + " (\n", + " SELECT\n", + " name AS max_name,\n", + " MAX(x) AS max_x\n", + " FROM timeseries\n", + " GROUP BY name\n", + " ) AS rhs\n", + " ON\n", + " lhs.name = rhs.max_name AND\n", + " lhs.x = rhs.max_x\n", + "\"\"\").compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can find more information on dask-sql in the [documentation](https://dask-sql.readthedocs.io/)." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From e637822a7e8189f4325ee19e25c3a4f25bc531d6 Mon Sep 17 00:00:00 2001 From: Nils Braun Date: Thu, 29 Oct 2020 22:52:39 +0100 Subject: [PATCH 2/7] Fixes needed to make it run on binder --- applications/sql-queries.ipynb | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/applications/sql-queries.ipynb b/applications/sql-queries.ipynb index 8c4b4857..5a3aeb84 100644 --- a/applications/sql-queries.ipynb +++ b/applications/sql-queries.ipynb @@ -22,7 +22,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`dask-sql` can be installed via conda, like in the following cell:" + "`dask-sql` can be installed via conda (or mamba) or pip, like in the following cell:" ] }, { @@ -31,7 +31,7 @@ "metadata": {}, "outputs": [], "source": [ - "! conda install -y dask-sql -c conda-forge" + "! mamba install --quiet -y dask-sql " ] }, { @@ -59,7 +59,10 @@ "source": [ "In SQL, all tables and functions are specified via names. Therefore we need to have some place to store all the registered tables (and functions), so that dask-sql knows which data it refers to.\n", "This is the task of the Context.\n", - "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application." + "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application.\n", + "\n", + "As we have just installed dask-sql, the `$JAVA_HOME` is not picked up correctly.\n", + "We will help dask-sql a bit and specify it explicitly." ] }, { @@ -68,6 +71,9 @@ "metadata": {}, "outputs": [], "source": [ + "import os\n", + "os.environ[\"JAVA_HOME\"] = os.environ[\"CONDA_PREFIX\"]\n", + "\n", "from dask_sql import Context\n", "c = Context()" ] From 12840cdc6b6624140477e257c5ee3a6f37345140 Mon Sep 17 00:00:00 2001 From: Nils Braun Date: Fri, 30 Oct 2020 16:24:20 +0100 Subject: [PATCH 3/7] Rename the notebook and add it to the index --- index.rst | 1 + applications/sql-queries.ipynb => sql.ipynb | 0 2 files changed, 1 insertion(+) rename applications/sql-queries.ipynb => sql.ipynb (100%) diff --git a/index.rst b/index.rst index 2a867950..4c6679ad 100644 --- a/index.rst +++ b/index.rst @@ -22,6 +22,7 @@ You can run these examples in a live session here: |Binder| delayed futures machine-learning + sql xarray .. 
diff --git a/applications/sql-queries.ipynb b/sql.ipynb
similarity index 100%
rename from applications/sql-queries.ipynb
rename to sql.ipynb

From 3c0b2f2bf7f9fecfee49b4769ef7d61e6cbc6e87 Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Sat, 31 Oct 2020 11:38:22 +0100
Subject: [PATCH 4/7] Install dask-sql via conda before launching the notebook

---
 binder/environment.yml |  1 +
 sql.ipynb              | 11 +----------
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/binder/environment.yml b/binder/environment.yml
index e56a0977..7800d520 100644
--- a/binder/environment.yml
+++ b/binder/environment.yml
@@ -26,5 +26,6 @@ dependencies:
   - requests
   - py-xgboost
   - dask-xgboost
+  - dask-sql
   - pip:
     - mimesis
diff --git a/sql.ipynb b/sql.ipynb
index 5a3aeb84..0edd8e0d 100644
--- a/sql.ipynb
+++ b/sql.ipynb
@@ -22,16 +22,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "`dask-sql` can be installed via conda (or mamba) or pip, like in the following cell:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "! mamba install --quiet -y dask-sql "
+    "`dask-sql` can be installed via conda (or mamba) or pip (it is already installed in the environment)"
    ]
   },
   {

From fa29c8624973e5cc3bf8ecf46a96d6dce10a01f8 Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Tue, 3 Nov 2020 12:31:55 +0100
Subject: [PATCH 5/7] Revert "Install dask-sql via conda before launching the notebook"

This reverts commit 3c0b2f2bf7f9fecfee49b4769ef7d61e6cbc6e87.

---
 binder/environment.yml |  1 -
 sql.ipynb              | 11 ++++++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/binder/environment.yml b/binder/environment.yml
index 7800d520..e56a0977 100644
--- a/binder/environment.yml
+++ b/binder/environment.yml
@@ -26,6 +26,5 @@ dependencies:
   - requests
   - py-xgboost
   - dask-xgboost
-  - dask-sql
   - pip:
     - mimesis
diff --git a/sql.ipynb b/sql.ipynb
index 0edd8e0d..5a3aeb84 100644
--- a/sql.ipynb
+++ b/sql.ipynb
@@ -22,7 +22,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "`dask-sql` can be installed via conda (or mamba) or pip (it is already installed in the environment)"
+    "`dask-sql` can be installed via conda (or mamba) or pip, like in the following cell:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! mamba install --quiet -y dask-sql "
    ]
   },
   {

From 77f59f1ea19c4926efa8167f77fd76e873a2faae Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Tue, 3 Nov 2020 12:33:03 +0100
Subject: [PATCH 6/7] Try installation with conda

---
 sql.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql.ipynb b/sql.ipynb
index 5a3aeb84..0f421c3f 100644
--- a/sql.ipynb
+++ b/sql.ipynb
@@ -31,7 +31,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "! mamba install --quiet -y dask-sql "
+    "!conda install -y dask-sql "
    ]
   },
   {

From a4eed0ce2b32adaa9a6aa810e57a68167840b695 Mon Sep 17 00:00:00 2001
From: Nils Braun
Date: Mon, 30 Nov 2020 21:34:29 +0100
Subject: [PATCH 7/7] Fix for the newest dask-sql version and use mamba

---
 sql.ipynb | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/sql.ipynb b/sql.ipynb
index 0f421c3f..c584e534 100644
--- a/sql.ipynb
+++ b/sql.ipynb
@@ -31,7 +31,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!conda install -y dask-sql "
+    "! mamba install -y dask-sql"
    ]
   },
   {
@@ -59,10 +59,7 @@
    "source": [
     "In SQL, all tables and functions are specified via names. Therefore we need to have some place to store all the registered tables (and functions), so that dask-sql knows which data they refer to.\n",
     "This is the task of the Context.\n",
-    "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application.\n",
-    "\n",
-    "As we have just installed dask-sql, the `$JAVA_HOME` is not picked up correctly.\n",
-    "We will help dask-sql a bit and specify it explicitly."
+    "You typically create a single context once at the beginning of your python script/notebook and use it through the rest of the application."
    ]
   },
   {
@@ -71,9 +68,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import os\n",
-    "os.environ[\"JAVA_HOME\"] = os.environ[\"CONDA_PREFIX\"]\n",
-    "\n",
     "from dask_sql import Context\n",
     "c = Context()"
    ]
@@ -156,7 +150,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "c.register_dask_table(df.persist(), \"timeseries\")"
+    "c.create_table(\"timeseries\", df.persist())"
    ]
   },
@@ -187,7 +181,7 @@
    "source": [
     "import pandas as pd\n",
     "df = pd.DataFrame({\"column\": [1, 2, 3]})\n",
-    "c.register_dask_table(df, \"pandas\")"
+    "c.create_table(\"pandas\", df)"
    ]
   },
@@ -202,7 +196,8 @@
    "metadata": {},
    "source": [
     "In most cases, however, your data will live on some external storage device, such as a local disk, S3 or HDFS.\n",
-    "You can leverage dask's large set of supported input formats and sources to load the data."
+    "You can leverage dask's large set of supported input formats and sources to load the data.\n",
+    "Find out more information in the [documentation](https://dask-sql.readthedocs.io/en/latest/pages/data_input.html) of dask-sql."
    ]
   },
   {