This project analyzes Draft Red Herring Prospectus (DRHP) documents—regulatory filings by companies planning to go public—to extract key insights such as financials, risk factors, business strategies, and operational details. The structured data is then enriched with AI-generated summaries and vector embeddings for efficient retrieval and analysis. The project currently relies only on free resources and could be improved with access to paid tools.
- Extract key details from DRHP documents, including:
- Section Number
- Chapter Name
- Company Name
- Full Text
- Tables (structured tabular data, extracted with Tabula)
- Key Findings (AI-generated summaries)
- Generate vector embeddings for the key findings using an embedding model.
- Store the processed data in a vector database for efficient similarity search and retrieval.
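Concretely, each extracted section can be thought of as one record carrying the fields listed above. A minimal sketch (the field names here are illustrative, not necessarily the exact ones defined in company_data_class):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DRHPSection:
    """One extracted DRHP section (illustrative field names)."""
    company_name: str                 # issuer named on the prospectus cover
    section_number: str               # section identifier from the filing
    chapter_name: str                 # chapter/section heading
    full_text: str                    # raw text extracted from the PDF
    tables: List[List[List[str]]] = field(default_factory=list)  # rows of cells per table
    key_findings: Optional[str] = None        # AI-generated summary, filled in later
    embedding: Optional[List[float]] = None   # vector embedding of key_findings
```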
- Download DRHP Documents: Download 5 DRHP documents from the SEBI website.
- Parse and Structure Data: Use Python scripts to extract the full text from the PDFs and structure the data into JSON/CSV. The structured data includes details such as company name, section number, chapter name, full text, and any tabular data.
- AI Summarization: Generate comprehensive AI-based summaries for each section using the Gemini API (Google Generative AI).
- Vector Embedding: Generate vector embeddings for each section's key findings using a SentenceTransformer model.
- Store in a Vector Database: Store the embeddings in a vector database (FAISS) to support efficient similarity search and retrieval (a sketch of the embed-and-index step follows this list).
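A minimal sketch of the embed-and-index step, assuming the all-MiniLM-L6-v2 SentenceTransformer model and a flat inner-product FAISS index (the actual model, index type, and file names in vector_embedder.py may differ):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical inputs: one "key findings" summary per DRHP section.
key_findings = [
    "Revenue grew 24% year over year, driven by the retail segment.",
    "The company is exposed to raw-material price volatility.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = model.encode(key_findings, normalize_embeddings=True)
embeddings = np.asarray(embeddings, dtype="float32")

# Flat inner-product index; with normalized vectors this is cosine similarity.
index = faiss.IndexFlatIP(int(embeddings.shape[1]))
index.add(embeddings)
faiss.write_index(index, "drhp_sections.index")  # illustrative index path
```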
- Python Scripts:
- doc_processor.py: Downloads and extracts text from DRHP PDFs.
- company_data_class: Holds the data structures used in this project.
- data_extractor.py: Parses and structures the extracted data.
- data_extractor.py: Parses and structures the extracted tables; a more capable OCR-based extractor is planned (see the Tabula sketch after this list).
- section_summarizer.py: Generates AI-based summaries for each section using the Together API.
- vector_embedder.py: Generates vector embeddings and stores them in a FAISS index.
- Pipeline.py: Orchestrates the entire pipeline.
- Output Files:
- JSON/CSV output containing the structured DRHP data.
- A vector database (FAISS index) populated with embedded key findings.
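For the table-extraction step, a minimal sketch using tabula-py (the library mentioned in the feature list); the paths and page range are illustrative:

```python
import os
import tabula  # tabula-py; requires a Java runtime on the machine

os.makedirs("output", exist_ok=True)

# Extract every table in the PDF as a pandas DataFrame (illustrative path).
tables = tabula.read_pdf("docs/sample_drhp.pdf", pages="all", multiple_tables=True)

for i, df in enumerate(tables):
    # Persist each table as CSV alongside the structured JSON/CSV output.
    df.to_csv(f"output/table_{i}.csv", index=False)
```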
Install the dependencies:
pip install -r requirements.txt
Create a .env file in the project root and add your Together API key:
TOGETHER_API_KEY=your_api_key_here
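At runtime the key can be loaded from the .env file; a minimal sketch, assuming python-dotenv is installed:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root
api_key = os.environ["TOGETHER_API_KEY"]
```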
- Download the DRHP PDFs: Place the downloaded DRHP documents in the docs/ folder.
- Set Input and Output Paths: Set the input and output paths in the main.py file.
- Run the Pipeline: Execute the pipeline script to process the PDFs, extract structured data, generate summaries, and store the vector embeddings:
  python main.py
- Query the Vector Database: Use the provided vector embedder scripts to perform similarity searches on the stored embeddings (see the query sketch below).
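A minimal sketch of such a query, assuming the same SentenceTransformer model that was used at indexing time and the illustrative index path from the earlier sketch:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # must match the indexing model
index = faiss.read_index("drhp_sections.index")   # illustrative index path

query = "What are the main risk factors disclosed by the company?"
query_vec = np.asarray(
    model.encode([query], normalize_embeddings=True), dtype="float32"
)

# Retrieve the five most similar key-findings vectors.
scores, ids = index.search(query_vec, 5)
print(list(zip(ids[0].tolist(), scores[0].tolist())))
```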