diff --git a/index.rst b/index.rst
index 2a867950..4c6679ad 100644
--- a/index.rst
+++ b/index.rst
@@ -22,6 +22,7 @@ You can run these examples in a live session here: |Binder|
    delayed
    futures
    machine-learning
+   sql
    xarray
 
 .. toctree::
diff --git a/sql.ipynb b/sql.ipynb
new file mode 100644
index 00000000..c584e534
--- /dev/null
+++ b/sql.ipynb
@@ -0,0 +1,323 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Analyzing dask data with SQL"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Dask's data analysis capabilities are extensive, and because its dataframe API is very similar to pandas, it is easy for pandas experts to migrate.\n",
+    "However, for many applications and users, SQL still plays a large role in querying and retrieving data.\n",
+    "For very good reasons: it is easy to learn, it is a common language that many (data) systems understand, and it contains all the important elements for querying data.\n",
+    "\n",
+    "With [dask-sql](https://nils-braun.github.io/dask-sql/), which leverages [Apache Calcite](https://calcite.apache.org/), it is possible to query the data with SQL and still use the full power of a dask cluster."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`dask-sql` can be installed via conda (or mamba) or pip, as in the following cell:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# if you prefer pip, use: ! pip install dask-sql\n",
+    "! mamba install -y dask-sql"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you want to analyze data with dask-sql in Python, you need to take three steps:\n",
+    "\n",
+    "1. Create a context\n",
+    "2. Load and register your data\n",
+    "3. Start querying!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Create a Context"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In SQL, all tables and functions are specified via names. Therefore, we need a place to store all the registered tables (and functions), so that dask-sql knows which data they refer to.\n",
+    "This is the task of the `Context`.\n",
+    "You typically create a single context once at the beginning of your Python script or notebook and use it throughout the rest of the application."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask_sql import Context\n",
+    "\n",
+    "# a single context stores all registered tables and functions\n",
+    "c = Context()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Set up a dask cluster"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now is the best time to connect to your dask cluster, if you have one.\n",
+    "dask-sql leverages dask to perform the computations on the data.\n",
+    "\n",
+    "Check out one of [the many ways](https://docs.dask.org/en/latest/setup.html) to create and connect to your dask cluster.\n",
+    "\n",
+    "For this example, we will create a cluster running locally.\n",
+    "This is optional, as dask can also create one implicitly, but it gives us more diagnostics and insights.\n",
+    "You can click the link shown after the client initialization to open the dask dashboard."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask.distributed import Client\n",
+    "\n",
+    "# a small local cluster: a single worker with four threads, limited to 2GB of memory\n",
+    "client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')\n",
+    "client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load and register the data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So far, no data has been involved. Let's change that! There are many ways to get the data into your cluster and to tell dask-sql where to find it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Register a dask or pandas dataframe"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you already have a dask dataframe (dask's abstraction of a pandas dataframe with nearly the same API), you can directly associate it with a name:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask.datasets import timeseries\n",
+    "\n",
+    "# a small artificial demo dataset with the columns id, name, x and y\n",
+    "df = timeseries()\n",
+    "type(df)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# register the persisted dataframe under the name \"timeseries\"\n",
+    "c.create_table(\"timeseries\", df.persist())"
+   ]
+  },
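+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick preview of step 3: `c.sql` turns a SQL string into a regular (lazy) dask dataframe, which is only evaluated once you call `.compute()`. The following cell is a minimal sketch of such a query, run against the `timeseries` table we just registered:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# c.sql returns a lazy dask dataframe; nothing is computed yet\n",
+    "result = c.sql(\"\"\"\n",
+    "    SELECT name, AVG(x) AS average_x\n",
+    "    FROM timeseries\n",
+    "    GROUP BY name\n",
+    "\"\"\")\n",
+    "\n",
+    "# trigger the actual computation on the cluster\n",
+    "result.compute()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "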