Content Navigator lets users quickly search for specific information across multiple uploaded documents in a variety of formats. Its user-friendly interface and search capabilities make it easy to find relevant data in large sets of documents.
- Multi-Format Support: Upload and search across documents in various formats such as 'pdf', 'doc', 'docx', 'csv', 'json', 'txt', 'htm', 'html', 'ppt', and 'pptx'.
- Flexible Search Options: Specify the exact information you want to search for in each document.
- Traceability: View search results along with a trace of the relevant documents.
- Scalability: Handle large volumes of content by integrating with an external vector database through the Pinecone API.
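
As an illustration of the multi-format support listed above, an upload could be screened with a simple extension check. This is a minimal sketch, not the app's actual code: the extension set comes from the list above, while `is_supported` and `split_uploads` are hypothetical helper names.

```python
from pathlib import Path

# Extensions taken from the feature list above.
SUPPORTED_EXTENSIONS = {
    "pdf", "doc", "docx", "csv", "json",
    "txt", "htm", "html", "ppt", "pptx",
}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is in the supported set (case-insensitive)."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS


def split_uploads(filenames):
    """Partition filenames into (accepted, rejected) lists, preserving order."""
    accepted = [name for name in filenames if is_supported(name)]
    rejected = [name for name in filenames if not is_supported(name)]
    return accepted, rejected
```

For example, `split_uploads(["notes.txt", "photo.png"])` keeps `notes.txt` and rejects `photo.png`.
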
Note that by default the app tries to embed all document content in memory. If the combined content of all uploaded documents is too extensive for that, the app uses the Pinecone vector database instead.
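
The fallback described above can be pictured as a size check over the combined document text. The sketch below is an assumption about how such a decision might look, not the app's implementation: the `MAX_IN_MEMORY_CHARS` cutoff and the `choose_vector_store` function are hypothetical, and the real threshold is not documented here.

```python
# Hypothetical cutoff; the real app's threshold is not documented in this README.
MAX_IN_MEMORY_CHARS = 1_000_000


def choose_vector_store(documents, max_chars=MAX_IN_MEMORY_CHARS):
    """Return "in_memory" or "pinecone" based on the combined size of all texts.

    `documents` is an iterable of already-extracted document strings.
    """
    total_chars = sum(len(text) for text in documents)
    return "in_memory" if total_chars <= max_chars else "pinecone"
```

A character count is only one possible measure; a token count or chunk count would work the same way.
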
We split the UI into two pieces to accommodate details in snapshots.
- Clone the repository:
    git clone https://github.com/your-username/document-searcher.git
- Navigate to the project directory:
    cd document-searcher
- Install Poetry using pip (if not already installed):
    pip install poetry
- Activate the virtual environment using Poetry:
    poetry shell
- Install the project dependencies using Poetry:
    poetry install
Create a .env file in the root directory of the project:
- Add your own OpenAI API key:
    OPENAI_API_KEY = 'your-key-here'
- To handle large document content, the app uses the Pinecone vector database. Set your Pinecone API key:
    PINECONE_API_KEY = 'your_pinecone_api_key'
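
Before starting the app, it can help to confirm that the keys are visible to the process. The following is a minimal sketch using only the standard library. It assumes the .env values have already been loaded into the environment (how the app loads them is not stated here), and `missing_keys` is a hypothetical helper. Treating the Pinecone key as optional is also an assumption, based on it only being needed for large content.

```python
import os

# Key names come from the .env instructions above.
REQUIRED_KEYS = ("OPENAI_API_KEY",)
# Assumed optional: only needed when content is too large for in-memory embedding.
OPTIONAL_KEYS = ("PINECONE_API_KEY",)


def missing_keys(env=None, keys=REQUIRED_KEYS):
    """Return the names in `keys` that are unset or empty in `env`.

    `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [key for key in keys if not env.get(key)]


if __name__ == "__main__":
    for key in missing_keys():
        print(f"Missing required setting: {key}")
    for key in missing_keys(keys=OPTIONAL_KEYS):
        print(f"Optional setting not found: {key}")
```
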
- After installing the dependencies, run the Streamlit app from the root directory:
    streamlit run app.py
- Follow the prompts and upload your documents in the permitted formats.
- Specify the information you want to search for.
- The app searches for the requested information and displays the results along with a trace of the relevant documents.
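
To show what a result with a trace might look like, here is a toy sketch. It uses plain keyword overlap as a stand-in for the app's embedding-based search, so it does not reflect the actual retrieval logic. `search_with_trace` and the result fields (`source`, `score`, `passage`) are hypothetical names chosen for illustration.

```python
def search_with_trace(query, documents, top_k=3):
    """Return up to `top_k` matching passages, each tagged with its source document.

    `documents` maps a document name to its extracted text. Scoring is the
    number of query words that also appear in a line (a simple stand-in for
    vector similarity).
    """
    terms = set(query.lower().split())
    hits = []
    for name, text in documents.items():
        for line in text.splitlines():
            score = len(terms & set(line.lower().split()))
            if score:
                hits.append({"source": name, "score": score, "passage": line.strip()})
    # Highest score first; ties keep their original order (stable sort).
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]
```

Each hit carries the name of the document it came from, which is the kind of information a trace would present next to the answer.
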
This project is licensed under the MIT License - see the LICENSE file for details.