GitHub - madeyexz/markdown-file-query: Semantic QA with a markdown database: Query any markdown file using vector embedding, Pinecone vector database and GPT (langchain). A weaker version of privateGPT

This project currently works best with English documents.

About This Project

This project

- uses the Pinecone vector database (VDB) and OpenAI's embedding model to turn text into vectors.
- works with any .md file, so it suits Notion and Obsidian notes (Notion pages must first be exported to .md manually).
- is the author's practice of the Feynman technique.
- is probably a weaker duplicate of privateGPT and llama_index. If you want a polished document query program, use llama_index instead of this toy.

Walkthrough of this Program

- Each markdown file in the target directory is cut into many small chunks using langchain.text_splitter.
- Each chunk is turned into a vector by OpenAI's embedding model (langchain.embeddings.OpenAIEmbeddings).
- The vectors are uploaded to the Pinecone vector database.
- The query is converted to a vector with the same embedding model and sent to Pinecone.
- Pinecone compares the query vector with the stored vectors by cosine similarity.
- The three closest chunks are fed into GPT-3 along with the question, and GPT-3 generates an answer in natural language.
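
The steps above can be sketched without any external service. This is a minimal illustration, not the code in main.py: `split_into_chunks`, `embed`, `top_k`, and `build_prompt` are hypothetical names, the bag-of-words `embed` is a toy stand-in for OpenAI's embedding model, and the in-memory ranking stands in for the Pinecone query.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=200, overlap=20):
    """Cut text into overlapping character windows (crude stand-in for a text splitter)."""
    step = chunk_size - overlap
    chunks = (text[i:i + chunk_size] for i in range(0, len(text), step))
    return [c for c in chunks if c.strip()]


def embed(text):
    """Toy embedding: word counts. The real project calls OpenAI's embedding model here."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query (assumed stand-in for the Pinecone query)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, contexts):
    """Combine the retrieved chunks and the question into one prompt for the language model."""
    joined = "\n---\n".join(contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"
    )


docs = [
    "Mary Ainsworth devised the Strange Situation.",
    "Pinecone stores vectors.",
    "Bananas are yellow.",
    "Obsidian uses markdown files.",
]
question = "Who devised the Strange Situation?"
print(build_prompt(question, top_k(question, docs)))
```

Swapping `embed` for a real embedding model and `top_k` for a vector database query leaves the overall flow unchanged: chunk, embed, retrieve the closest three, then prompt.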

TODO

add a --help option
deploy to Streamlit
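
One way the --help item might be handled is Python's standard `argparse` module, which generates `-h`/`--help` automatically. The sketch below is an assumption about how main.py could parse its two positional arguments; `build_parser` and the argument names `folder` and `question` are hypothetical and not taken from the repository.

```python
import argparse


def build_parser():
    """Build a parser for the two positional arguments shown in the Usage section."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Embed the markdown files in a folder and answer a question about them.",
    )
    parser.add_argument("folder", help="path of the folder that contains the .md files")
    parser.add_argument("question", help="question to ask about the documents")
    return parser


# Parse an explicit argument list here; a real script would call parse_args() with no arguments.
args = build_parser().parse_args(["markdown_database", "what's the strange situation"])
print(args.folder, "|", args.question)
```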

Getting Started

Prerequisites

Prepare your Pinecone and OpenAI API keys:
- Pinecone API key can be obtained here.
- OpenAI API key can be obtained here.
To export the Pinecone and OpenAI API keys to the system environment, run:
```
export PINECONE_API_KEY="your_pinecone_api_key"
export OPENAI_API_KEY="your_openai_api_key"
```
Then, in Python, run
```
import os
os.environ["PINECONE_API_KEY"]
os.environ["OPENAI_API_KEY"]
```
to check that they were exported to the system environment. If you get a KeyError, restart the terminal (and your IDE, if you are using one) and try again.
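
If you would rather get a readable message than a KeyError, a small check like the one below may help. It is only a sketch: `missing_keys` is a hypothetical helper, not part of this repo, and it treats an empty value as missing.

```python
import os

REQUIRED_KEYS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def missing_keys(environ=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Both API keys are set.")
```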

Installation

Clone this repo to your local machine

git clone https://github.com/madeyexz/markdown-file-query.git

Install the dependencies
```
pip install pinecone langchain tqdm
```

Usage

Put the markdown file(s) in a folder named FOLDER (any other name works, but you have to change the code accordingly). The folder should be in the same directory as main.py.
If this is your first time querying a set of documents, run main.py:
```
python3 main.py "PATH_OF_FOLDER" "QUESTION"
```
The answer and the reference passages GPT used to generate it are saved in answer.txt and contents.txt respectively.
To query the same batch of documents again, run query_only.py instead to avoid re-embedding the documents:
```
python3 query_only.py "QUESTION"
```

Example

I have a folder called markdown_database that contains a bunch of .md files, and I want to query it with the question "what's the strange situation":

❯ python3 main.py "markdown_database" "what's the strange situation"

initiating pinecone index...
digesting docs...
uploading datas to pinecone...
92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████          | 60/65 [00:29<00:02,  1.87it/s]
let's wait for 60 seconds to avoid RateLimitError... (since im not a paid user))
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:00<00:00,  1.00s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 65/65 [01:32<00:00,  1.42s/it]
querying pinecone...
querying gpt...
writing results to answer.txt and contents.txt
done! the answer to 'what's the strange situation' is: '
The Strange Situation is a standardized procedure devised by Mary Ainsworth in the 1970s to observe attachment security in children within the context of caregiver relationships. It applies to infants between the age of nine and 18 months and involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. The procedure is used to observe the quality of a young child’s attachment to his or her mother, and can also be applied to other attachment figures, such as God, through the use of Emotionally Focused Therapy (EFT) and religious beliefs, such as the saying “there are no atheists in foxholes”.'

If I want to query the same database again, I can use query_only.py to avoid re-embedding the documents.

❯ python3 query_only.py "Who is Mary Ainsworth?"

connecting to pinecone index...
getting docs
querying pinecone...
querying gpt...
done! the answer to 'Who is Mary Ainsworth?' is: '
Mary Ainsworth was a developmental psychologist who devised the Strange Situation in the 1970s to observe attachment security in children within the context of caregiver relationships. The Strange Situation involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. Ainsworth is also known for her observation that if you want to see the quality of a young child’s attachment to his or her mother, watch what the child does, not when Mother leaves, but when she returns. She is also known for her research on anxious babies and their inability to use their mothers as a secure base.'

Known Limitation

If you use Pinecone, then whenever you want to query a new set of documents (i.e. create a new database), you should create a new Pinecone index or delete the old one; otherwise answers may draw on the old documents. This is because Pinecone does not support updating the index (yet).

To delete the old index:
```
python3 delete_pinecone_index.py NAME_OF_INDEX
```
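
To keep document sets apart, each folder could be given its own index name. The helper below is a hypothetical sketch: `index_name_for`, the `md-query` prefix, and the 45-character cap are assumptions, and so is the restriction of names to lowercase letters, digits, and hyphens. How the scripts in this repo actually name their index is not shown here.

```python
import hashlib
import re
from pathlib import Path


def index_name_for(folder, prefix="md-query", max_length=45):
    """Derive a stable, distinct index name from a folder path."""
    folder_name = Path(folder).name.lower()
    # Replace anything outside a-z and 0-9 with a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", folder_name).strip("-") or "docs"
    # A short hash of the path as given keeps similarly named folders apart.
    digest = hashlib.sha1(str(folder).encode("utf-8")).hexdigest()[:8]
    head = f"{prefix}-{slug}"[: max_length - len(digest) - 1].rstrip("-")
    return f"{head}-{digest}"


print(index_name_for("markdown_database"))
```

The same folder path always maps to the same name, so the name could also be passed to delete_pinecone_index.py when that document set is no longer needed.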

Acknowledgements

Huge shout-out to the open-source community for providing straightforward examples and comprehensive tutorials!

- openai-cookbook: using vector database for embeddings search
- Build a Personal Search Engine Web App using Open AI Text Embeddings - Avra
- This project is heavily inspired by hwchase17/notion-qa.
- LangChain, a Python library for working with LLMs elegantly.
