Skip to content

Implementation of DeepDB: Learn from Data, not from Queries!

License

Notifications You must be signed in to change notification settings

DataManagementLab/deepdb-public

Repository files navigation

DeepDB: Learn from Data, not from Queries!

DeepDB is a data-driven learned database component achieving state-of-the-art-performance in cardinality estimation and approximate query processing (AQP). This is the implementation described in

Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, Carsten Binnig: "DeepDB: Learn from Data, not from Queries!", VLDB'2020. [PDF]

DeepDB Overview

Setup

Tested with python3.7 and python3.8

git clone https://github.com/DataManagementLab/deepdb-public.git
cd deepdb-public
sudo apt install -y libpq-dev gcc python3-dev
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt

For python3.8: Sometimes spflow fails, in this case remove spflow from requirements.txt, install them and run

pip3 install spflow --no-deps

Reproduce Experiments

Cardinality Estimation

Download the Job dataset. Generate hdf files from csvs.

python3 maqp.py --generate_hdf
    --dataset imdb-light
    --csv_seperator ,
    --csv_path ../imdb-benchmark
    --hdf_path ../imdb-benchmark/gen_single_light
    --max_rows_per_hdf_file 100000000

Generate sampled hdf files from csvs.

python3 maqp.py --generate_sampled_hdfs
    --dataset imdb-light
    --hdf_path ../imdb-benchmark/gen_single_light
    --max_rows_per_hdf_file 100000000
    --hdf_sample_size 10000

Learn ensemble with the optimized rdc strategy (requires postgres with imdb dataset)

python3 maqp.py --generate_ensemble
    --dataset imdb-light 
    --samples_per_spn 10000000 10000000 1000000 1000000 1000000
    --ensemble_strategy rdc_based
    --hdf_path ../imdb-benchmark/gen_single_light
    --max_rows_per_hdf_file 100000000
    --samples_rdc_ensemble_tests 10000
    --ensemble_path ../imdb-benchmark/spn_ensembles
    --database_name imdb
    --post_sampling_factor 10 10 5 1 1
    --ensemble_budget_factor 5
    --ensemble_max_no_joins 3
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl

Alternatively: Learn base ensemble over different tables with naive strategy. (Does not work with different dataset sizes because join sizes are hard coded but does not require postgres)

python3 maqp.py --generate_ensemble
    --dataset imdb-light 
    --samples_per_spn 1000000 1000000 1000000 1000000 1000000
    --ensemble_strategy relationship
    --hdf_path ../imdb-benchmark/gen_single_light
    --ensemble_path ../imdb-benchmark/spn_ensembles
    --max_rows_per_hdf_file 100000000
    --post_sampling_factor 10 10 5 1 1

Evaluate performance for queries.

python3 maqp.py --evaluate_cardinalities
    --rdc_spn_selection
    --max_variants 1
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl
    --dataset imdb-light
    --target_path ./baselines/cardinality_estimation/results/deepDB/imdb_light_model_based_budget_5.csv
    --ensemble_location ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl
    --query_file_location ./benchmarks/job-light/sql/job_light_queries.sql
    --ground_truth_file_location ./benchmarks/job-light/sql/job_light_true_cardinalities.csv

Updates

Conditional incremental learning (i.e., initial learning of all films before 2013, newer films learn incremental)

python3 maqp.py  --generate_ensemble
    --dataset imdb-light
    --samples_per_spn 10000000 10000000 1000000 1000000 1000000
    --ensemble_strategy rdc_based
    --hdf_path ../imdb-benchmark/gen_single_light
    --max_rows_per_hdf_file 100000000
    --samples_rdc_ensemble_tests 10000
    --ensemble_path ../imdb-benchmark/spn_ensembles
    --database_name JOB-light
    --post_sampling_factor 10 10 5 1 1
    --ensemble_budget_factor 0
    --ensemble_max_no_joins 3
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl
    --incremental_condition "title.production_year<2013"

Optimized Inference

Generate the C++ code. (Currently only works for cardinality estimation).

python3 maqp.py --code_generation 
    --ensemble_path ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl

Compile it in a venv with pybind installed. Sometimes installing this yields: ModuleNotFoundError: No module named 'pip.req' One workaround is to downgrade pip pip3 install pip==9.0.3 as described here.

The command below works for ubuntu 18.04. Make sure the generated .so file is in the root directory of the project.

g++ -O3 -Wall -shared -std=c++11 -ftemplate-depth=2048 -ftime-report -fPIC `python3 -m pybind11 --includes` optimized_inference.cpp -o optimized_inference`python3-config --extension-suffix`

If you now want to leverage the module you have to specify it for cardinalities.

python3 maqp.py --evaluate_cardinalities 
    --rdc_spn_selection 
    --max_variants 1 
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl 
    --dataset imdb-light 
    --target_path ./baselines/cardinality_estimation/results/deepDB/imdb_light_model_based_budget_5.csv 
    --ensemble_location ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl 
    --query_file_location ./benchmarks/job-light/sql/job_light_queries.sql 
    --ground_truth_file_location ./benchmarks/job-light/sql/job_light_true_cardinalities.csv 
    --use_generated_code

AQP

SSB pipeline

Generate standard SSB dataset (Scale Factor=500) and use the correct seperator.

for i in `ls *.tbl`; do
    sed 's/|$//' $i > $TMP_DIR/${i/tbl/csv} &
    echo $i;
done

Create lineorder sample

cat lineorder.csv | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .003333) print $0}' > lineorder_sampled.csv

Generate hdf files from csvs.

python3 maqp.py --generate_hdf
    --dataset ssb-500gb
    --csv_seperator \|
    --csv_path ../mqp-data/ssb-benchmark
    --hdf_path ../mqp-data/ssb-benchmark/gen_hdf

Learn the ensemble with a naive strategy.

python3 maqp.py --generate_ensemble 
    --dataset ssb-500gb
    --samples_per_spn 1000000
    --ensemble_strategy single 
    --hdf_path ../mqp-data/ssb-benchmark/gen_hdf 
    --ensemble_path ../mqp-data/ssb-benchmark/spn_ensembles
    --rdc_threshold 0.3
    --post_sampling_factor 10

Optional: Compute ground truth for AQP queries (requires postgres with ssb schema).

python3 maqp.py --aqp_ground_truth
    --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql
    --target_path ./benchmarks/ssb/ground_truth_500GB.pkl
    --database_name ssb

Evaluate the AQP queries.

python3 maqp.py --evaluate_aqp_queries
    --dataset ssb-500gb
    --target_path ./baselines/aqp/results/deepDB/ssb_500gb_model_based.csv
    --ensemble_location ../mqp-data/ssb-benchmark/spn_ensembles/ensemble_single_ssb-500gb_1000000.pkl
    --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/ssb/ground_truth_500GB.pkl

Optional: Create the ground truth for confidence interval. (with 10M because we also use 10M samples for the training)

python3 maqp.py --aqp_ground_truth
    --query_file_location ./benchmarks/ssb/sql/confidence_queries.sql
    --target_path ./benchmarks/ssb/confidence_intervals/confidence_interval_10M.pkl
    --database_name ssb

Evaluate the confidence intervals.

python3 maqp.py --evaluate_confidence_intervals
    --dataset ssb-500gb
    --target_path ./baselines/aqp/results/deepDB/ssb500GB_confidence_intervals.csv
    --ensemble_location ../mqp-data/ssb-benchmark/spn_ensembles/ensemble_single_ssb-500gb_1000000.pkl
    --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/ssb/confidence_intervals/confidence_interval_10M.pkl
    --confidence_upsampling_factor 300
    --confidence_sample_size 10000000

Flights pipeline

Generate flights dataset with scale factor 1 billion using IDEBench and generate a sample using

cat dataset.csv | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' > dataset_sampled.csv

Generate hdf files from csvs.

python3 maqp.py --generate_hdf
    --dataset flights1B
    --csv_seperator ,
    --csv_path ../mqp-data/flights-benchmark
    --hdf_path ../mqp-data/flights-benchmark/gen_hdf

Learn the ensemble.

python3 maqp.py --generate_ensemble 
    --dataset flights1B
    --samples_per_spn 10000000 
    --ensemble_strategy single 
    --hdf_path ../mqp-data/flights-benchmark/gen_hdf 
    --ensemble_path ../mqp-data/flights-benchmark/spn_ensembles
    --rdc_threshold 0.3
    --post_sampling_factor 10

Optional: Compute ground truth

python3 maqp.py --aqp_ground_truth
    --dataset flights1B
    --query_file_location ./benchmarks/flights/sql/aqp_queries.sql
    --target_path ./benchmarks/flights/ground_truth_1B.pkl
    --database_name flights   

Evaluate the AQP queries.

python3 maqp.py --evaluate_aqp_queries
    --dataset flights1B
    --target_path ./baselines/aqp/results/deepDB/flights1B_model_based.csv
    --ensemble_location ../mqp-data/flights-benchmark/spn_ensembles/ensemble_single_flights1B_10000000.pkl
    --query_file_location ./benchmarks/flights/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/flights/ground_truth_1B.pkl

Optional: Create the ground truth for confidence interval. (with 10M because we also use 10M samples for the training)

python3 maqp.py --aqp_ground_truth
    --dataset flights1B
    --query_file_location ./benchmarks/flights/sql/confidence_queries.sql
    --target_path ./benchmarks/flights/confidence_intervals/confidence_interval_10M.pkl
    --database_name flights10M_origsample 

Evaluate the confidence intervals.

python3 maqp.py --evaluate_confidence_intervals
    --dataset flights1B
    --target_path ./baselines/aqp/results/deepDB/flights1B_confidence_intervals.csv
    --ensemble_location ../mqp-data/flights-benchmark/spn_ensembles/ensemble_single_flights1B_10000000.pkl
    --query_file_location ./benchmarks/flights/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/flights/confidence_intervals/confidence_interval_10M.pkl
    --confidence_upsampling_factor 100
    --confidence_sample_size 10000000

TPC-DS (Single Table) pipeline

As an additional example on how to work with DeepDB, we provide an example on just a single table of the TPC-DS schema for the queries in ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql. As a prerequisite, you need a 10 million tuple sample of the store_sales table in the directory ../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv. Afterwards, you can run the following commands. To compute the ground truth, you need a postgres instance with a 1T TPC-DS dataset.

Generate hdf files from csvs

python3 maqp.py --generate_hdf
    --dataset tpc-ds-1t
    --csv_seperator |
    --csv_path ../mqp-data/tpc-ds-benchmark
    --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf

Learn the ensemble

python3 maqp.py --generate_ensemble 
    --dataset tpc-ds-1t
    --samples_per_spn 10000000 
    --ensemble_strategy single 
    --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf
    --ensemble_path ../mqp-data/tpc-ds-benchmark/spn_ensembles 
    --rdc_threshold 0.3 
    --post_sampling_factor 10

Compute ground truth

python3 maqp.py --aqp_ground_truth
    --dataset tpc-ds-1t
    --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
    --target_path ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl
    --database_name tcpds

Evaluate the AQP queries

python3 maqp.py --evaluate_aqp_queries
    --dataset tpc-ds-1t
    --target_path ./baselines/aqp/results/deepDB/tpcds1t_model_based.csv
    --ensemble_location ../mqp-data/tpc-ds-benchmark/spn_ensembles/ensemble_single_tpc-ds-1t_10000000.pkl
    --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
    --ground_truth_file_location ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl

How to experiment with DeepDB on a new Dataset

  • Specify a new schema in the schemas folder
  • Due to the current implementation, make sure to declare
    • the primary key,
    • the filename of the csv sample file,
    • the correct table size and sample rate,
    • the relationships among tables if you do not just run queries over a single table,
    • any non-key functional dependencies (this is rather an implementation detail),
    • and include all columns in the no-compression list by default (as done for the IMDB benchmark),
  • To further reduce the training time, you can exclude columns you do not need in your experiments (also done in the IMDB benchmark)
  • Generate the HDF/sampled HDF files and learn the RSPN ensemble
  • Use the RSPN ensemble to answer queries
  • For reference, please check the commands to reproduce the results of the paper

Reference

If you find this repository useful in your work, please cite our paper:

@inproceedings{deepdb,
  title={DeepDB: Learn from Data, not from Queries!},
  author = {Benjamin Hilprecht and Andreas Schmidt and Moritz Kulessa and Alejandro Molina and Kristian Kersting and Carsten Binnig},
  journal={Proceedings of the VLDB Endowment},
  volume={13},
  number={7},
  pages={992--1005},
  year={2020},
  publisher={VLDB Endowment}
}