A Python-based multi-agent Retrieval-Augmented Generation (RAG) system for processing and querying large volumes of financial documents during M&A due diligence.
- Multi-agent architecture for parallel processing of large financial documents
- Specialized chunking strategies optimized for financial documents (PDFs, Excel, Word, etc.)
- Intelligent financial entity extraction and indexing
- Advanced semantic search with financial term expansion
- Topic modeling for document categorization
- Integration with LLMs for comprehensive analysis and summarization
- Vector database integration for efficient retrieval
- Distributed processing capabilities for handling large document collections
- REST API for interacting with the system
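As a rough illustration of what "financial term expansion" can mean in a semantic search step, the sketch below rewrites a query once per known synonym so that each variant can be searched separately. The `FINANCIAL_SYNONYMS` table and the `expand_query` function are hypothetical names invented for this example; they are not taken from this repository, and the project's own expansion logic may work quite differently (for example, via embeddings or a curated glossary).

```python
import re

# Hypothetical synonym table (an assumption for illustration only).
FINANCIAL_SYNONYMS = {
    "ebitda": ["earnings before interest, taxes, depreciation and amortization"],
    "revenue": ["sales", "turnover"],
    "capex": ["capital expenditure"],
    "m&a": ["mergers and acquisitions"],
}


def expand_query(query: str) -> list[str]:
    """Return the original query followed by one variant per matching synonym."""
    variants = [query]
    for term, synonyms in FINANCIAL_SYNONYMS.items():
        # Match the term as a whole token, case-insensitively.
        pattern = re.compile(rf"(?<!\w){re.escape(term)}(?!\w)", re.IGNORECASE)
        if pattern.search(query):
            for synonym in synonyms:
                # A callable replacement avoids any escape handling in the synonym text.
                variants.append(pattern.sub(lambda _match, s=synonym: s, query))
    return variants


if __name__ == "__main__":
    for variant in expand_query("What was the target's revenue growth?"):
        print(variant)
```

Each variant would then be embedded and searched on its own, with the results merged. A static table like this is easy to audit but only covers the terms someone has listed in advance.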
This system is specifically designed to assist financial analysts and investment bankers in the due diligence process for mergers and acquisitions. It can process and analyze:
- Financial statements and annual reports
- Legal contracts and agreements
- Regulatory filings
- Market analysis reports
- Valuation documents
- Tax documents
- Due diligence memos and reports
The system extracts key financial information, identifies risks and opportunities, and provides a comprehensive analysis to support M&A decision-making.
financial-due-diligence-rag/
├── config/ # Configuration files
├── data/ # Data storage location
│ └── financial_indices/ # Intelligent indices for financial documents
├── docs/ # Documentation
└── src/ # Source code
    ├── agents/ # Multi-agent system components
    ├── api/ # API endpoints
    ├── document_processing/ # Document processors for financial documents
    ├── utils/ # Utility functions
    └── vector_store/ # Vector database integration
- Document Loading: Supports various financial document formats (PDF, DOCX, XLSX, etc.)
- OCR Processing: Handles scanned documents with OCR capabilities
- Financial Entity Extraction: Identifies companies, monetary values, dates, percentages, etc.
- Intelligent Chunking: Splits documents based on semantic boundaries
- Metadata Extraction: Extracts key financial metrics and document categories
- Embedding Generation: Creates vector representations of document chunks
- Intelligent Indexing: Builds specialized indices for financial terms and entities
- Topic Modeling: Categorizes documents for better organization and retrieval
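Two of the steps above, chunking and financial entity extraction, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: it uses paragraph breaks as a stand-in for semantic boundaries and simple regular expressions in place of a real entity extractor. The names `Chunk`, `chunk_by_paragraph`, `extract_entities`, and `process` are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Illustrative patterns only (assumptions); real documents use many more
# currency, number, and date formats than these cover.
MONEY_RE = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion|[MBK])\b)?")
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s?%")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO dates only


@dataclass
class Chunk:
    text: str
    entities: dict = field(default_factory=dict)


def chunk_by_paragraph(document: str, max_chars: int = 500) -> list[str]:
    """Group whole paragraphs into chunks of roughly max_chars characters.

    Paragraphs are never split, so a single paragraph longer than
    max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for paragraph in re.split(r"\n\s*\n", document.strip()):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # +2 accounts for the blank line used to join paragraphs.
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks


def extract_entities(text: str) -> dict:
    """Collect monetary values, percentages, and ISO dates found in text."""
    return {
        "money": MONEY_RE.findall(text),
        "percentages": PERCENT_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


def process(document: str, max_chars: int = 500) -> list[Chunk]:
    """Chunk a document and attach extracted entities to each chunk."""
    return [
        Chunk(text=chunk, entities=extract_entities(chunk))
        for chunk in chunk_by_paragraph(document, max_chars)
    ]
```

Attaching entities to each chunk as metadata is one way to support the specialized indices mentioned above, since a retrieval step can then filter by entity before running a vector search.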
- Clone the repository
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment: `source venv/bin/activate` (Unix) or `venv\Scripts\activate` (Windows)
- Install dependencies: `pip install -r requirements.txt`
- Copy `.env.example` to `.env` and add your API keys (especially OpenAI for LLM integration)
- Start the system: `python src/main.py`
- Use the API to upload financial documents and query the system
- `POST /api/upload`: Upload and process a financial document
- `POST /api/query`: Query the system with financial questions
- `POST /api/task/status`: Check the status of a document processing task
- `GET /api/collections`: List all document collections
- `GET /api/collections/{collection_name}/stats`: Get statistics for a collection
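The endpoints listed above could be called from Python roughly as follows. Only the paths and HTTP methods come from this README. The base URL (`http://localhost:8000`), the JSON field names (`query`, `collection_name`), and the assumption that responses are JSON are all guesses made for illustration, so check the API code in `src/api/` for the actual request and response schemas.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # assumed host and port, not documented here


def build_query_request(
    question: str, collection: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST /api/query request (payload field names are hypothetical)."""
    payload = {"query": question, "collection_name": collection}
    return urllib.request.Request(
        url=f"{base_url}/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_stats_request(
    collection: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a GET /api/collections/{collection_name}/stats request."""
    return urllib.request.Request(
        url=f"{base_url}/api/collections/{quote(collection, safe='')}/stats",
        method="GET",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the body, assuming the server returns JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the server to be running (python src/main.py).
    request = build_query_request("What are the key liabilities?", "deal_docs")
    print(send(request))
```

Separating request construction from sending keeps the URL and payload logic easy to inspect without a running server.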
The system supports various financial document formats including:
- PDF (text and scanned via OCR)
- Microsoft Word (DOCX)
- Microsoft Excel (XLSX, XLS)
- Microsoft PowerPoint (PPTX, PPT)
- CSV and TSV files
- Plain text files
- HTML and XML documents
- Markdown files
- JSON files
Released under the MIT License.