✨ Natural Language Database Query System (RAG) based on langchain and Large Language Models (LLM)
Using natural language queries, this system leverages large language models to intelligently parse database structures, enabling multi-table structured queries and statistical computations, and smartly generates appropriate charts from the query results. The interactive frontend is built with Pywebio; no OpenAI API is required, and it is 100% pure Python code.
- Interpretable Natural Language Database Query System (RAG) based on Large Language Models (LLM) https://github.com/bytesc/data-copilot-steps
- Natural Language Database Query System (RAG) based on Large Language Models (LLM) and Concurrent Prediction Models https://github.com/bytesc/data-copilot-v2
Personal website: www.bytesc.top
Display video: `./display_video.mp4`
🔔 For project-related inquiries, feel free to raise an issue in this repository. I typically respond within 24 hours.
- Natural language querying
- Structured queries and statistical computations across multiple tables
- Smart generation of various chart types and interactive chart creation
- Intelligent parsing of database structures, so no additional configuration is needed for different MySQL databases
- Support for concurrent multithreaded queries
- Ability to handle exceptions such as unstable large language model output
- Support for local offline deployment (GPU required) using Hugging Face format models (e.g., `qwen-7b`)
- Support for API interfaces in OpenAI format and Dashscope's `qwen` API
- Retry questions by feeding exception and assertion information back into the prompt, improving the stability of LLM outputs.
- Ask questions concurrently with multi-threading to improve response speed and stability.
- Map database tables to DataFrames, so that induced queries cannot manipulate the LLM into SQL injection attacks (see the sketch after this list).
- Introduce word embedding models and vector databases in place of simple regular expressions, to address the challenge of mapping the LLM's fuzzy outputs to specific system code executions.
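A minimal sketch of the DataFrame-mapping idea, assuming a hypothetical helper `get_table_df` (not the project's actual API): LLM-generated code only ever touches pandas DataFrames, never raw SQL strings, so an induced query has nothing to inject into.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection URL; the real one comes from ./config/config.yaml.
engine = create_engine("mysql+pymysql://root:[email protected]/data_copilot")

def get_table_df(table_name: str) -> pd.DataFrame:
    # Whitelist table names read from the live schema, so LLM output
    # cannot smuggle SQL fragments into a statement.
    allowed = set(pd.read_sql("SHOW TABLES", engine).iloc[:, 0])
    if table_name not in allowed:
        raise ValueError(f"unknown table: {table_name}")
    return pd.read_sql_table(table_name, engine)

# LLM-generated code then works purely on DataFrames, e.g.:
# df = get_table_df("employees").merge(get_table_df("salaries"), on="emp_id")
```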
Users access the Web application through a browser, which communicates with the server using the WebSocket protocol via the Pywebio framework.
The service layer uses LLM to generate code and tool suggestions, the Python interpreter to execute code, and Pygwalker to provide interactive plotting support.
The data layer uses MySQL as the database.
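A minimal sketch of the Pywebio entry layer under these assumptions (the body of `main` is a placeholder for the real service layer):

```python
from pywebio import start_server
from pywebio.input import input as ask
from pywebio.output import put_text

def main():
    # Placeholder: in the real service layer, the question goes through
    # the LLM + interpreter pipeline described below.
    question = ask("Ask a question about the database")
    put_text(f"Received question: {question}")

if __name__ == "__main__":
    # Pywebio serves the page over HTTP and communicates via WebSocket.
    start_server(main, port=8087)
```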
1. After the natural language question is input into the system, it is combined with the preset toolset description to form a prompt for the LLM, which selects the appropriate tool for solving the problem.
2. Retrieve the structural information of the data from the database (DataFrame data summary).
3. Input the data summary and tool suggestion into the LLM to write Python code that solves the problem.
4. Extract the code from the LLM's response and execute it with the Python interpreter.
5. If an exception occurs during code execution, combine the exception information with the problematic code into a new prompt and re-input it into the LLM (return to step 3). Continue until the code runs successfully or the maximum number of retries is exceeded.
6. If no exception occurs during code execution, assert on the program output. If it is not the expected type, combine the assertion information with the problematic code into a new prompt and re-input it into the LLM (return to step 3). Continue until the assertion succeeds or the maximum number of retries is exceeded.
7. Display the successful code execution output (charts) on the user interface and launch the interactive plotting interface based on the output data.
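A minimal sketch of the retry loop in steps 3 to 6; `ask_llm`, `extract_code`, and `run_code` are hypothetical placeholders, not the project's actual functions:

```python
import pandas as pd

MAX_RETRIES = 5

def ask_llm(prompt: str) -> str: ...       # placeholder: call the LLM API
def extract_code(answer: str) -> str: ...  # placeholder: regex over the reply
def run_code(code: str): ...               # placeholder: exec in a sandbox

def solve(question: str, data_summary: str, tool_hint: str) -> pd.DataFrame:
    prompt = f"{tool_hint}\n{data_summary}\n{question}"
    for _ in range(MAX_RETRIES):
        code = extract_code(ask_llm(prompt))      # steps 3-4: write and extract code
        try:
            result = run_code(code)               # step 4: execute the code
        except Exception as exc:                  # step 5: feed the exception back
            prompt += f"\nThe code:\n{code}\nraised: {exc}\nPlease fix it."
            continue
        if not isinstance(result, pd.DataFrame):  # step 6: assert the output type
            prompt += f"\nThe code returned {type(result)}, expected a DataFrame."
            continue
        return result                             # step 7: success
    raise RuntimeError("maximum number of retries exceeded")
```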
Repeated feedback of exceptions and assertions makes the prompt increasingly long, which can cause the LLM to lose focus and degrade generation quality. The LLM's first incorrect response can also taint subsequent generations, so starting over may yield better results.
Therefore, multi-threaded concurrent execution is introduced to ask the question independently several times, reducing the probability that unstable LLM output causes overall generation failure, and improving system stability and response speed.
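A sketch of that concurrency using only the standard library, reusing the hypothetical `solve` function from the previous sketch:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def solve_concurrently(question, data_summary, tool_hint, n_workers=3):
    # Ask the same question independently in several threads; the first
    # attempt that succeeds wins, so one unstable LLM output does not
    # sink the whole request.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(solve, question, data_summary, tool_hint)
                   for _ in range(n_workers)]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception:
                continue  # this attempt failed; wait for another
    raise RuntimeError("all concurrent attempts failed")
```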
Display video: `./display_video.mp4`
Query Interface
Based on the retrieved data, the system intelligently selects chart types for plotting and supports structured queries across multiple tables. In this example, it intelligently joins the employee and salary tables for a structured query and selects a bar chart for plotting.
If unsatisfied with the intelligent plotting results, advanced mode allows for manual interactive plotting based on the queried data.
Supports intelligent statistical computations. In this example, it intelligently calculates the percentage of sales in each country, selecting a pie chart for plotting.
Advanced mode enables interactive plotting.
It also supports statistical computations such as mean, sum, max, and min.
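Under the hood such statistics reduce to ordinary pandas aggregations, for example (data and column names are illustrative):

```python
import pandas as pd

# Illustrative data: share of sales per country, as in the pie-chart example.
df = pd.DataFrame({"country": ["CN", "US", "CN", "DE"],
                   "sales": [120, 80, 60, 40]})
by_country = df.groupby("country")["sales"].sum()
percentages = 100 * by_country / by_country.sum()  # mean/max/min work the same way
print(percentages)
```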
Python version 3.9

```bash
pip install -r requirement.txt
```
The configuration file is located at `./config/config.yaml`.
Simply provide the connection information. The model will automatically read the database structure, so no additional configuration is needed.
```yaml
mysql: mysql+pymysql://root:[email protected]/data_copilot
# mysql: mysql+pymysql://username:password@host:port/database
```
If using the Dashscope `qwen` API (recommended):
```yaml
llm:
  model: qwen1.5-110b-chat  # Model name
  url: ""  # No need to fill this when using the Dashscope `qwen` API
  # qwen1.5-72b-chat qwen1.5-110b-chat
  # qwen-turbo qwen-plus qwen-max qwen-long
```
Also, in `llm_access/LLM.py`, uncomment the following line:
```python
# llm = llm_access.openai_access.llm
llm = llm_access.qwen_access.llm
```
If using an OpenAI-format API (the endpoint URL in the example below is the `glm` models' OpenAI-compatible API):
```yaml
llm:
  model: glm-4  # Model name
  url: "https://open.bigmodel.cn/api/paas/v4/"  # endpoint_url
  # qwen1.5-72b-chat qwen1.5-110b-chat
  # qwen-turbo qwen-plus qwen-max qwen-long
```
Also, in `llm_access/LLM.py`, uncomment the following line:
```python
llm = llm_access.openai_access.llm
# llm = llm_access.qwen_access.llm
```
If offline deployment is needed, the relevant code is in `./llm_access/qwen_access.py`.
If obtaining the API key from Alibaba Cloud Dashscope for the `qwen` large language model, save the API key to `llm_access/api_key_qwen.txt`.
If using an API key for the `openai`-format API, save the API key to `llm_access/api_key_openai.txt`.
`main.py` is the entry point of the project. Running this file will start the server.

```bash
python main.py
```
By default, you can access it via the browser at http://127.0.0.1:8087/.
The challenge of this project lies in mapping users' vague natural language queries, and the large language model's natural language responses, to deterministic data that traditional non-AI code can handle.
At its current stage, this project, like some other open-source implementations (such as the `langchain` agent), relies on repeated questioning to elicit responses from the large language model that closely match a predefined natural language response format; regular expressions are then used to extract the data.
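For instance, the code fence in an LLM reply can be pulled out with a simple regular expression (a sketch; the project's actual pattern may differ):

```python
import re

# The fence marker is built from parts so this sketch can itself live
# inside a Markdown code block.
FENCE = "`" * 3
PATTERN = re.compile(FENCE + r"python\s*(.*?)" + FENCE, re.DOTALL)

def extract_code(answer: str) -> str:
    # Match the first python code fence in the LLM's reply.
    match = PATTERN.search(answer)
    if match is None:
        raise ValueError("no Python code block found in the LLM reply")
    return match.group(1).strip()
```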
While large language models offer high flexibility and can implement various complex intelligent database queries, statistics, and plotting functions, they face fatal bottlenecks in terms of input-output scale. They cannot directly handle extremely long texts or large datasets. Therefore, they may not perform well when faced with structural information of large-scale data (e.g., hundreds of tables).
Although large language models can to some extent control relatively deterministic output formats through prompts, they are fundamentally still natural language outputs, exhibiting unstable output characteristics and emergent problems. They cannot guarantee acceptable results or the rationality and accuracy of the results.
Thus, innovation in this area may be a possible future direction for this project.
Introducing word embedding models (such as `text2vec-base-multilingual`) and vector databases (such as PGVector) can improve the handling of large-scale data.
By transforming large-scale data into vectors on a per-item basis and mapping vocabulary into a lower-dimensional vector space, semantic relationships between words can be captured. By storing these vectors in a vector database, we can swiftly perform similarity queries, clustering, and other operations, facilitating effective processing of large-scale data. Moreover, vector databases can match only predetermined results that are semantically closest, mitigating the instability and emergent issues often associated with large language models.
The integration of vector databases and word embedding models helps overcome the instability of large language models and their limitations in handling large-scale data. By combining the stability and reliability of vector databases with the intelligence and flexibility of large language models, efficient and stable processing and analysis of large-scale data can be achieved.
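A sketch of this combination, assuming the `shibing624/text2vec-base-multilingual` model loaded through `sentence-transformers` and a hypothetical PGVector table `items` with an `embedding` column:

```python
import psycopg2
from sentence_transformers import SentenceTransformer

# text2vec-base-multilingual is published in sentence-transformers format.
model = SentenceTransformer("shibing624/text2vec-base-multilingual")

def nearest_items(query: str, k: int = 5) -> list[str]:
    vec = model.encode(query).tolist()
    conn = psycopg2.connect("dbname=data_copilot user=postgres")
    with conn, conn.cursor() as cur:
        # "<->" is pgvector's L2-distance operator; closest matches first,
        # so only predetermined, semantically nearest items are returned.
        cur.execute(
            "SELECT name FROM items ORDER BY embedding <-> %s::vector LIMIT %s",
            (str(vec), k),
        )
        return [row[0] for row in cur.fetchall()]
```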
The relevant technical resources can be found in the `./pgv/` folder.
Word Embedding Models:
- text2vec-large-chinese (huggingface / hf-mirror)
- text2vec-base-multilingual (huggingface / hf-mirror)

Vector Databases:
- PGVector (DockerHub / dockerproxy)
To be continued...