Skip to content

PilotScope is a middleware to bridge the gaps of deploying AI4DB (Artificial Intelligence for Databases) algorithms into actual database systems.

Notifications You must be signed in to change notification settings

alibaba/pilotscope

Repository files navigation

PilotScope

PilotScope

PilotScope is a middleware to bridge the gaps of deploying AI4DB (Artificial Intelligence for Databases) algorithms into actual database systems. It aims at hindering the underlying details of different databases so that an AI4DB driver could steer any database in a unified manner. By applying PilotScope, we obtain the following benefits:

  • The DB users could experience any AI4DB algorithm as a plug-in unit on their databases with little cost. The cloud computing service providers could operate and maintain AI4DB algorithms on their database products as a service to users. (More Convenient for Usage! πŸ‘πŸ‘πŸ‘)

  • The ML researchers could easily benchmark and iterate their AI4DB algorithms in practical scenarios. (Much Faster to Iterate! ⬆️⬆️⬆️)

  • The ML and DB developers are liberated from learning the details in other side. They could play their own strengths to write the codes in their own sides. (More Freedom to Develop! πŸ„β€β™€οΈπŸ„β€β™€οΈπŸ„β€β™€οΈ)

  • All contributors could extend PilotScope to support more AI4DB algorithms, more databases and more functions. (We highly encourage this! 😊😊😊)

| Code Structure | Installation | Feature Overview | Documentation | License | Reference | Contributing |


News

  • πŸŽ‰ [2023-12-15] Our paper on PilotScope has been accepted by VLDB 2024!

Code Structure

PilotScope/
β”œβ”€β”€ algorithm_examples                         # Algorithm examples
β”œβ”€β”€ fig                                        # Saved some Figures
β”œβ”€β”€ paper                                 
β”‚   β”œβ”€β”€ PilotScope.pdf                         # Paper of PilotScope
β”œβ”€β”€ pilotscope
β”‚   β”œβ”€β”€ Anchor                                 # Base push and pull anchors for implementing push and pull opearators       
β”‚   β”‚   β”œβ”€β”€ AnchorHandler.py
β”‚   β”‚   β”œβ”€β”€ AnchorEnum.py
β”‚   β”‚   β”œβ”€β”€ AnchorTransData.py
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”œβ”€β”€ Common                                 # Useful tools for PilotScope
β”‚   β”‚   β”œβ”€β”€ Index.py
β”‚   β”‚   β”œβ”€β”€ CardMetricCalc.py                   
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”œβ”€β”€ DBController                           # The implemenation of DB controllers for different databased
β”‚   β”‚   β”œβ”€β”€ BaseDBController.py
β”‚   β”‚   β”œβ”€β”€ PostgreSQLController.py
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”œβ”€β”€ DBInteractor                           # The funtionalities for interaction with database
β”‚   β”‚   β”œβ”€β”€ HttpInteractorReceiver.py
β”‚   β”‚   β”œβ”€β”€ PilotDataInteractor.py
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”œβ”€β”€ DataManager                            # The management of data
β”‚   β”‚   β”œβ”€β”€ DataManager.py
β”‚   β”‚   └── TableVisitedTracker.py
β”‚   β”œβ”€β”€ Dataset                                # An easy-to-use API for loading benchmarks
β”‚   β”‚   β”œβ”€β”€ BaseDataset.py
β”‚   β”‚   β”œβ”€β”€ Imdb
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”œβ”€β”€ Exception                              # Some exception which may occur in the lifecycle of pilotscope
β”‚   β”‚   └── Exception.py
β”‚   β”œβ”€β”€ Factory                                # Factory patterns
β”‚   β”‚   β”œβ”€β”€ AnchorHandlerFactory.py
β”‚   β”‚   β”œβ”€β”€ DBControllerFectory.py
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”œβ”€β”€ PilotConfig.py                         # Configurations of PilotScope
β”‚   β”œβ”€β”€ PilotEnum.py                           # Some related enumeration types
β”‚   β”œβ”€β”€ PilotEvent.py                          # Some predefined events
β”‚   β”œβ”€β”€ PilotModel.py                          # Base models of pilotscope 
β”‚   β”œβ”€β”€ PilotScheduler.py                      # Sheduling data traing、inference、collection push-and-pull and so on
β”‚   β”œβ”€β”€ PilotSysConfig.py                      # System configuration of PilotScope 
β”‚   └── PilotTransData.py                      # A unified data object for data collection
β”œβ”€β”€ requirements.txt                           # Requirements for PilotScope
β”œβ”€β”€ setup.py                                   # Setup for PilotScope
β”œβ”€β”€ test_example_algorithms                    # Examples of some tasks, such as index recommendation, knob tuning, etc.
└── test_pilotscope                            # Unittests of PilotScope

Installation

Required Software Versions:

  • Python: 3.8
  • PostgreSQL: 13.1
  • Apache Spark: 3.3.2

You can install PilotScope Core and modified databases (e.g., PostgreSQL and Spark) following the documentation.

Feature Overview

The components of PilotScope Core in ML side can be divided into two categories: Database Components and Deployment Components. The Database Components are used to facilitate data exchange and control over the database, while the Deployment Components are used to facilitate the automatic application of custom AI algorithms to each incoming SQL query.

A high-level overview of the PilotScope Core components is shown in the following figure.

PilotScope

The Database Components are highlighted in Yellow, while the Deployment Components are highlighted in green. We will discuss each of these components in detail in the documentation.

An Example for Data Interaction with Database

The PilotConfig class is utilized to configure the PilotScope application, such as the database credentials for establishing a connection. We first create an instance of the PilotConfig where we can specify the database credentials and connected database name, i.e., stats_tiny.

# Example of PilotConfig
config: PilotConfig = PostgreSQLConfig(host="localhost", port="5432", user="postgres", pwd="postgres")
# You can also instantiate a PilotConfig for other DBMSes. e.g. 
# config:PilotConfig = SparkConfig()
config.db = "stats_tiny"
# Configure PilotScope here, e.g. changing the name of database you want to connect to.

The PilotDataInteractor class provides a flexible workflow for data exchange. It includes three main functions: push, pull, and execute. These functions assist the user in collecting data (pull operators) after setting additional data (push operators) in a single query execution process.

For instance, if the user wants to collect the execution time, estimated cost, and cardinality of all sub-queries within a query. Here is an example code:

sql = "select count(*) from votes as v, badges as b, users as u where u.id = v.userid and v.userid = b.userid and u.downvotes>=0 and u.downvotes<=0"
data_interactor = PilotDataInteractor(config)
data_interactor.pull_estimated_cost()
data_interactor.pull_subquery_card()
data_interactor.pull_execution_time()
data = data_interactor.execute(sql)
print(data)

The execute function returns a PilotTransData object named data, which serves as a placeholder for the collected data. Each member of this object represents a specific data point, and the values corresponding to the previously registered pull operators will be filled in, while the other values will remain as None.

execution_time: 0.00173
estimated_cost: 98.27
subquery_2_card: {'select count(*) from votes v': 3280.0, 'select count(*) from badges b': 798.0, 'select count(*) from users u where u.downvotes >= 0 and u.downvotes <= 0': 399.000006, 'select count(*) from votes v, badges b where v.userid = b.userid;': 368.609177, 'select count(*) from votes v, users u where v.userid = u.id and u.downvotes >= 0 and u.downvotes <= 0;': 333.655156, 'select count(*) from badges b, users u where b.userid = u.id and u.downvotes >= 0 and u.downvotes <= 0;': 425.102804, 'select count(*) from votes v, badges b, users u where v.userid = u.id and v.userid = b.userid and u.downvotes >= 0 and u.downvotes <= 0;': 37.536205}
buffercache: None
...

In certain scenarios, when the user wants to collect the execution time of a SQL query after applying a new cardinality (e.g., scaling the original cardinality by 100) for all sub-queries within the SQL, the PilotDataInteractor provides push function to achieve this. Here is an example code:

# Example of PilotDataInteractor (registering operators again and execution)
data_interactor.push_card({k: v * 100 for k, v in data.subquery_2_card.items()})
data_interactor.pull_estimated_cost()
data_interactor.pull_execution_time()
new_data = data_interactor.execute(sql)
print(new_data)

By default, each call to the execute function will reset any previously registered operators. Therefore, we need to push these new cardinalities and re-register the pull operators to collect the estimated cost and execution time. In this scenario, the new cardinalities will replace the ones estimated by the database's cardinality estimator. As a result, the partial result of the new_data object will be significantly different from the result of the data object, mainly due to the variation in cardinality values.

execution_time: 0.00208
estimated_cost: 37709.05
...

More functionalities please refer to the documentation.

Documentation

The classes and methods of PilotScope have been well documented. You can find the documentation in documentation.

License

PilotScope is released under Apache License 2.0.

Reference

If you find our work useful for your research or development, please kindly cite the following

@article{zhu2023pilotscope,
	title={PilotScope: Steering Databases with Machine Learning Drivers},
	author={Rong Zhu and Lianggui Weng and Wenqing Wei and Di Wu and Jiazhen Peng and Yifan Wang and Bolin Ding and Defu Lian Bolong Zheng and Jingren Zhou},
	journal = {Proceedings of the VLDB Endowment},
	year={2024}}

Contributing

As an open-sourced project, we greatly appreciate any contribution to PilotScope!