A Retrieval-Augmented Generation (RAG) system for analyzing, searching, and fixing code with LLMs.
This project provides a system to analyze a codebase by:
- Parsing Python code files to extract functions, classes, methods, and docstrings
- Generating embeddings using sentence transformers
- Storing the embeddings in a Qdrant vector database
- Retrieving relevant context so an LLM can answer code-related questions
- Parse Python files into semantic segments (functions, classes, methods)
- Generate embeddings using sentence-transformers
- Store embeddings in a Qdrant vector database
- Search code semantically by natural language
- Generate contextual information for LLM prompts
- Interact with the system through a command-line interface
- Python 3.8+
- Qdrant server (can be run locally or accessed remotely)
- Dependencies listed in requirements.txt
- Clone the repository
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the requirements:
pip install -r requirements.txt
- Start the Qdrant server:
docker run -p 6333:6333 qdrant/qdrant
The demo.py script provides a command-line interface:
# Embed code files or directories
python demo.py embed /path/to/code --force
# Search for relevant code
python demo.py search "how does the custom calculation work?"
# Generate context for LLM
python demo.py context "explain the database initialization"
# Run complete demo
python demo.py demo
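The subcommands above could be wired up with `argparse` roughly as follows. This is a minimal sketch, not the actual contents of demo.py: `build_parser` is a hypothetical helper, and only the subcommand names and the `--force` flag are taken from the commands shown above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the embed/search/context/demo subcommands (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="demo.py")
    sub = parser.add_subparsers(dest="command", required=True)

    embed = sub.add_parser("embed")
    embed.add_argument("path")
    # The README shows this flag but does not document its semantics.
    embed.add_argument("--force", action="store_true")

    search = sub.add_parser("search")
    search.add_argument("query")

    context = sub.add_parser("context")
    context.add_argument("query")

    sub.add_parser("demo")
    return parser


# Parse the same arguments as the first command shown above.
args = build_parser().parse_args(["embed", "/path/to/code", "--force"])
print(args.command, args.path, args.force)
```

Each subcommand gets its own sub-parser, so `search` and `context` can take a query string while `embed` takes a path.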
from app.main import CodeEmbedder
# Initialize with data directory
embedder = CodeEmbedder(data_dir="./data")
# Embed a directory of code
embedder.embed_directory("/path/to/code")
# Search for similar code
results = embedder.search("database initialization", top_k=5)
for code_id, similarity, code_text in results:
    print(f"{code_id}: {similarity}")
    print(code_text)
# Generate context for LLM
context = embedder.get_context_for_llm("how to parse Python code?")
# Get structured query results for programmatic use
structured_results = embedder.query_codebase("how to initialize the database?")
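To illustrate how search results might be turned into prompt context, here is a minimal sketch. It assumes results are `(code_id, similarity, code_text)` tuples as in the loop above; `format_context`, its `max_chars` budget, and the sample data are hypothetical and do not describe how `get_context_for_llm` is actually implemented.

```python
def format_context(results, max_chars=2000):
    """Join (code_id, similarity, code_text) tuples into one prompt-ready string."""
    parts = []
    used = 0
    # Put the most similar segments first.
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    for code_id, similarity, code_text in ranked:
        block = f"# {code_id} (similarity: {similarity:.2f})\n{code_text}\n"
        # Approximate character budget; stop once the next block would exceed it.
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n".join(parts)


# Made-up sample results, for illustration only.
sample = [
    ("utils.py:helper", 0.42, "def helper():\n    ..."),
    ("db.py:init_db", 0.91, "def init_db():\n    ..."),
]
context = format_context(sample)
prompt = f"Use the code below to answer.\n\n{context}\nQuestion: how is the database initialized?"
print(prompt)
```

Capping the context length keeps the prompt within the LLM's input limit while still favoring the best matches.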
- Parsing: Python files are parsed using AST (Abstract Syntax Tree) to extract meaningful code segments.
- Embeddings: The sentence-transformers library (all-MiniLM-L6-v2 model by default) creates an embedding for each code segment.
- Storage: Embeddings are stored in Qdrant, a vector database optimized for similarity search.
- Search: Natural language queries are converted to embeddings and matched against the stored code segments by vector similarity.
- Context: Relevant code is formatted into a context string that can be fed to an LLM.
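The parsing step can be sketched with Python's standard `ast` module. This is an illustrative example under stated assumptions, not the project's parser: `extract_segments` and the sample source are hypothetical, and the sketch skips functions nested inside other functions.

```python
import ast

# Made-up sample source, for illustration only.
SOURCE = '''
def add(a, b):
    """Add two numbers."""
    return a + b

class Greeter:
    """Says hello."""

    def greet(self, name):
        return f"Hello, {name}"
'''


def extract_segments(source):
    """Return (kind, qualified_name, docstring) tuples for functions, classes, and methods."""
    segments = []

    def visit(node, prefix, in_class):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                name = prefix + child.name
                segments.append(("class", name, ast.get_docstring(child)))
                # Descend into the class body to pick up its methods.
                visit(child, name + ".", True)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if in_class else "function"
                segments.append((kind, prefix + child.name, ast.get_docstring(child)))

    visit(ast.parse(source), "", False)
    return segments


segments = extract_segments(SOURCE)
for kind, name, doc in segments:
    print(kind, name, repr(doc))
```

Each tuple identifies one segment; in a pipeline like the one described above, the segment's source text would then be embedded and stored.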
- sentence-transformers
- numpy
- qdrant-client
MIT