Skip to content

Latest commit

 

History

History
239 lines (140 loc) · 11.5 KB

README.en.md

File metadata and controls

239 lines (140 loc) · 11.5 KB

data-copilot

Natural Language Database Query System (RAG) based on langchain and Large Language Models (LLM)

Utilizing natural language queries, this system intelligently parses database structures using large language models, enabling intelligent multi-table structured queries and statistical computations on data. It also smartly generates various charts based on query results. Implemented with Pywebio interactive frontend web pages, without the need for OpenAI API, it's 100% pure Python code.

🚩简体中文文档

Related Projects

Personal website: www.bytesc.top

display video: ./display_video.mp4

🔔 For project-related inquiries, feel free to raise an issue in this repository. I typically respond within 24 hours.

Feature Overview

    1. Natural language querying
    1. Implementation of structured queries and statistical computations across multiple tables
    1. Smart generation of various types of charts and interactive chart creation
    1. Intelligent parsing of database structures, no additional configuration needed for different MySQL databases
    1. Support for concurrent multithreaded queries
    1. Ability to handle exceptions such as instability in large language model performance
    1. Support for local offline deployment (GPU required) using Hugging Face format models (e.g., qwen-7b)
    1. Support for API interfaces in OpenAI format and Dashscope's qwen

Technological Innovations

  • Enable retrying of questions by feeding back information from exceptions and assertions, improving the stability of outputs from LLM.
  • Implement multi-threading for concurrent questioning to enhance response speed and stability.
  • Utilize DataFrame mapping in databases to avoid the risk of SQL injection attacks by manipulating the LLM through induced queries.
  • Introduce word embedding models and vector databases as a replacement for simple regular expressions, in order to address the challenge of mapping fuzzy outputs from LLM to specific system code executions.

Basic Technical Principles

System Architecture

Users access the Web application through a browser, which communicates with the server using the WebSocket protocol via the Pywebio framework.

The service layer uses LLM to generate code and tool suggestions, the Python interpreter to execute code, and Pygwalker to provide interactive plotting support.

The data layer uses MySQL as the database.

Basic flow of single-instance generation

Basic Flow

  1. After the natural language question is input into the system, it will be combined with the pre-set toolset description information to form a prompt and input into the LLM, allowing the LLM to select the appropriate tool for solving the problem.

  2. Retrieve the structural information of the data from the database (Dataframe data summary).

  3. Input the data summary and tool suggestion information into the LLM to write Python code to solve the problem.

  4. Extract the code from the LLM's response and execute it with the Python interpreter.

  5. If an exception occurs during code execution, combine the exception information with the problem code to form a new prompt and re-input it into the LLM for another attempt (return to step 3). Continue until the code runs successfully or the maximum number of retries is exceeded.

  6. If no exception occurs during code execution, assert the program output. If it is not the expected type, combine the assertion information with the problem code to form a new prompt and re-input it into the LLM for another attempt (return to step 3). Continue until the assertion is successful or the maximum number of retries is exceeded.

  7. Display the successful code execution output (charts) on the user interface and launch the interactive plotting interface based on the output data.

Concurrency generation control

Concurrency Control

Repeated feedback of exceptions and assertions can cause the prompt to become increasingly long, causing the LLM to lose focus and affect the generation effect. The LLM's first incorrect response can also affect subsequent generations. Starting over may yield better results.

Therefore, multi-threaded concurrent execution is introduced to ask questions independently multiple times, reducing the probability of overall generation failure caused by unstable LLM outputs, and improving system stability and response speed.

Display

display video: ./display_video.mp4

Query Interface

Query Interface Query Interface

Based on the retrieved data, the system intelligently selects chart types for plotting, supporting structured queries across multiple tables. In this example, it intelligently connects the employee table and the salary table for a structured query, selecting a bar chart for plotting.

Database Structure Structured Query Result

If unsatisfied with the intelligent plotting results, advanced mode allows for manual interactive plotting based on the queried data.

Advanced Plotting

Supports intelligent statistical computations. In this example, it intelligently calculates the percentage of sales in each country, selecting a pie chart for plotting.

Database Structure Statistical Computation Result Statistical Computation Result

Advanced mode enables interactive plotting.

Advanced Plotting Advanced Plotting

Also supports statistical computations such as mean, sum, max, min, etc.

Statistical Computation Result Statistical Computation Result Statistical Computation Result Statistical Computation Result Statistical Computation Result Statistical Computation Result

How to Use

Installing Dependencies

Python version 3.9

pip install -r requirement.txt

Fill in Configuration Information

The configuration file is located at ./config/config.yaml.

Database Configuration

Simply provide the connection information. The model will automatically read the database structure, so no additional configuration is needed.

mysql: mysql+pymysql://root:[email protected]/data_copilot
# mysql: mysql+pymysql://username:password@host:port/database

Large Language Model Configuration

If using the Dashscope qwen API (recommended):

llm:
  model: qwen1.5-110b-chat # Model name
  url: ""  # No need to fill this when using the Dashscope `qwen` API

# qwen1.5-72b-chat   qwen1.5-110b-chat
# qwen-turbo  qwen-plus   qwen-max   qwen-long

Also, in llm_access/LLM.py, uncomment the following line:

# llm = llm_access.openai_access.llm
llm = llm_access.qwen_access.llm

If using the OpenAI API (the provided endpoint URL is for the GPT-like model's OpenAI compatible API):

llm:
  model: glm-4 # Model name
  url: "https://open.bigmodel.cn/api/paas/v4/"   # endpoint_url

# qwen1.5-72b-chat   qwen1.5-110b-chat
# qwen-turbo  qwen-plus   qwen-max   qwen-long

Also, in llm_access/LLM.py, uncomment the following line:

llm = llm_access.openai_access.llm
# llm = llm_access.qwen_access.llm

If offline deployment is needed, relevant code is in ./llm_access/qwen_access.py.

Obtaining API Key

If obtaining the API key from Alibaba Cloud Dashscope for the qwen large language model,

save the API key to llm_access/api_key_qwen.txt.

If using the API key for the openai format API,

save the API key to llm_access/api_key_openai.txt.

Running

main.py is the entry point of the project. Running this file will start the server.

python main.py

By default, you can access it via the browser at http://127.0.0.1:8087/.

Potential Future Directions

The challenge of this project lies in mapping from user's natural language vague queries and natural language responses from a large language model to deterministic data that traditional non-AI computer code can handle.

At its current stage, this project, along with some other open-source implementations (such as langchain agent), relies on repeated questioning to elicit responses from the large language model that closely match the predefined format of natural language responses. Then, regular expressions are used to match the data.

While large language models offer high flexibility and can implement various complex intelligent database queries, statistics, and plotting functions, they face fatal bottlenecks in terms of input-output scale. They cannot directly handle extremely long texts or large datasets. Therefore, they may not perform well when faced with structural information of large-scale data (e.g., hundreds of tables).

Although large language models can to some extent control relatively deterministic output formats through prompts, they are fundamentally still natural language outputs, exhibiting unstable output characteristics and emergent problems. They cannot guarantee acceptable results or the rationality and accuracy of the results.

Thus, innovation in this area may be a possible future direction for this project.

Vector Databases and Word Embedding Models

Introducing word embedding models (such as text2vec-base-multilingual) and vector databases (such as PGVector) can improve the handling of large-scale data.

By transforming large-scale data into vectors on a per-item basis and mapping vocabulary into a lower-dimensional vector space, semantic relationships between words can be captured. By storing these vectors in a vector database, we can swiftly perform similarity queries, clustering, and other operations, facilitating effective processing of large-scale data. Moreover, vector databases can match only predetermined results that are semantically closest, mitigating the instability and emergent issues often associated with large language models.

The integration of vector databases and word embedding models helps overcome the instability of large language models and their limitations in handling large-scale data. By combining the stability and reliability of vector databases with the intelligence and flexibility of large language models, efficient and stable processing and analysis of large-scale data can be achieved.

The relevant technical resources can be found in the ./pgv/ folder.

Related Links

Word Embedding Models:

Vector Databases:

To be continued...