sudipme/index_codebase

Code Embeddings for RAG

A Retrieval-Augmented Generation (RAG) system for analyzing, searching, and fixing code with LLMs.

Overview

This project provides a system to analyze a codebase by:

  1. Parsing Python code files to extract functions, classes, methods, and docstrings
  2. Generating embeddings using sentence transformers
  3. Storing embeddings in a Qdrant vector database
  4. Retrieving relevant context so an LLM can answer code-related questions
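
As a rough illustration of step 1, the sketch below uses Python's standard ast module to split a source string into functions, classes, and methods with their docstrings. The function name extract_segments and the dictionary layout are hypothetical and chosen for this example only; the project's own parser may be organized differently.

```python
import ast


def extract_segments(source):
    """Split Python source into function, class, and method segments.

    Hypothetical helper for illustration; not the project's actual API.
    Only module-level definitions and class members are collected;
    functions nested inside other functions are skipped.
    """
    tree = ast.parse(source)
    segments = []

    def add(node, kind, name):
        segments.append({
            "kind": kind,
            "name": name,
            "docstring": ast.get_docstring(node),  # None if absent
            "code": ast.get_source_segment(source, node),
            "lineno": node.lineno,
        })

    def visit(node, owner=None):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                add(child, "class", child.name)
                visit(child, owner=child.name)  # collect its methods
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                if owner:
                    add(child, "method", f"{owner}.{child.name}")
                else:
                    add(child, "function", child.name)

    visit(tree)
    return segments


SAMPLE = '''
def add(a, b):
    """Add two numbers."""
    return a + b

class Greeter:
    """Says hello."""
    def greet(self, name):
        return "Hello, " + name
'''

if __name__ == "__main__":
    for seg in extract_segments(SAMPLE):
        print(seg["kind"], seg["name"], seg["docstring"])
```

Each segment keeps its own source text and docstring, so it can be embedded and later shown to the user as a self-contained unit.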

Features

  • Parse Python files into semantic segments (functions, classes, methods)
  • Generate embeddings using sentence-transformers
  • Store embeddings in a Qdrant vector database
  • Search code semantically with natural-language queries
  • Generate contextual information for LLM prompts
  • Command-line interface for easy interactions

Requirements

  • Python 3.8+
  • Qdrant server (can be run locally or accessed remotely)
  • Dependencies listed in requirements.txt

Installation

  1. Clone the repository
  2. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install the requirements:
    pip install -r requirements.txt
    
  4. Start a Qdrant server:
    docker run -p 6333:6333 qdrant/qdrant
    

Usage

Using the CLI

The demo.py script provides a command-line interface:

# Embed code files or directories
python demo.py embed /path/to/code --force

# Search for relevant code
python demo.py search "how does the custom calculation work?"

# Generate context for an LLM
python demo.py context "explain the database initialization"

# Run complete demo
python demo.py demo
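
For readers who want to build a similar interface, here is a minimal sketch of how the four subcommands above could be wired with argparse. It is not the actual contents of demo.py: the function build_parser, the argument names path and query, and the meaning given to --force (re-embed files that are already indexed) are assumptions for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the commands shown above (hypothetical)."""
    parser = argparse.ArgumentParser(
        prog="demo.py",
        description="Embed and search a codebase",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    embed = sub.add_parser("embed", help="Embed code files or directories")
    embed.add_argument("path", help="File or directory to embed")
    # Assumed meaning: re-embed even if the code is already indexed.
    embed.add_argument("--force", action="store_true")

    search = sub.add_parser("search", help="Search for relevant code")
    search.add_argument("query", help="Natural-language query")

    context = sub.add_parser("context", help="Generate context for an LLM")
    context.add_argument("query", help="Natural-language question")

    sub.add_parser("demo", help="Run the complete demo")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["embed", "/path/to/code", "--force"])
    print(args.command, args.path, args.force)
```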

Using the library in your code

from app.main import CodeEmbedder

# Initialize with data directory
embedder = CodeEmbedder(data_dir="./data")

# Embed a directory of code
embedder.embed_directory("/path/to/code")

# Search for similar code
results = embedder.search("database initialization", top_k=5)
for code_id, similarity, code_text in results:
    print(f"{code_id}: {similarity}")
    print(code_text)

# Generate context for an LLM
context = embedder.get_context_for_llm("how to parse Python code?")

# Get structured query results for programmatic use
structured_results = embedder.query_codebase("how to initialize the database?")

How It Works

  1. Parsing: Python files are parsed into an AST (Abstract Syntax Tree) to extract meaningful code segments.
  2. Embeddings: A sentence-transformers model (all-MiniLM-L6-v2 by default) creates an embedding for each code segment.
  3. Storage: Embeddings are stored in Qdrant, a vector database optimized for similarity search.
  4. Search: Natural-language queries are converted to embeddings and matched against the stored embeddings by similarity.
  5. Context: Relevant code is formatted into a context string that can be fed to an LLM.
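
Steps 4 and 5 can be sketched without any external services. In the example below, a bag-of-words counter stands in for the sentence-transformers model and a plain Python list stands in for Qdrant, so the scores are not comparable to real embedding similarities. The names toy_embed, cosine, search, and format_context are hypothetical; only the (code_id, similarity, code_text) result shape follows the library usage shown earlier.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, index, top_k=5):
    """Rank (code_id, vector, code_text) entries by similarity to the query."""
    q = toy_embed(query)
    scored = [(cid, cosine(q, vec), text) for cid, vec, text in index]
    scored.sort(key=lambda row: row[1], reverse=True)
    return scored[:top_k]


def format_context(results):
    """Join ranked results into one string suitable for an LLM prompt."""
    parts = [
        f"# {cid} (similarity {score:.2f})\n{text}"
        for cid, score, text in results
    ]
    return "\n\n".join(parts)


# Tiny in-memory "collection" standing in for the vector database.
SEGMENTS = {
    "db.init_database": 'def init_database(path):\n    """Initialize the database."""',
    "calc.custom_sum": 'def custom_sum(values):\n    """Sum a list of numbers."""',
}
INDEX = [(cid, toy_embed(text), text) for cid, text in SEGMENTS.items()]

if __name__ == "__main__":
    print(format_context(search("initialize the database", INDEX, top_k=1)))
```

In the real pipeline the query would be encoded by the same model that produced the stored embeddings, since vectors from different models are not comparable.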

Dependencies

  • sentence-transformers
  • numpy
  • qdrant-client

License

MIT
