tacticaxyz/tactica.faq.similaritysearch

Chromium codebase FAQ Similarity Search Bot

An AI-powered chatbot that helps users quickly find answers to questions about the Chromium codebase. The bot uses vector embeddings and RAG (Retrieval-Augmented Generation) to provide citation-backed responses.

Play with it In Production

This bot DOES NOT replace ChatGPT, Gemini, Copilot or other super-powerful models. Instead, it can quickly retrieve relevant information from your internal documents.

Essentially, this is a showcase of what similarity search can do on your internal corporate data, though it may not be the best example.

Similarity Search Bot Preview .NET 9 License


🌟 Features

  • Smart FAQ Matching: Uses vector embeddings for intelligent question matching
  • RAG Fallback: Searches entire document corpus when FAQ doesn't match
  • Citation-Based Responses: Every answer includes sources and confidence scores
  • Static Deployment: Runs entirely in the browser - perfect for GitHub Pages
  • Responsive Design: Works on desktop and mobile devices
  • Real-time Processing: Fast in-memory vector search with cosine similarity
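
The cosine similarity behind the in-memory vector search can be sketched in a few lines of JavaScript. This is a minimal illustration and not the code shipped in js/chatbot.js; the function name `cosineSimilarity` is an assumption.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1]; returns 0 if either vector has zero length.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, which is why a single threshold can separate strong matches from weak ones.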

🏗️ Architecture

Phase 1: Context Database Preparation

  • Vector embeddings of Q&A pairs using sentence transformers
  • Fast in-memory vector index for similarity search
  • Configurable similarity threshold for FAQ matching
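
One way to picture the in-memory index is an array of records that each carry a unit-length vector, so that a plain dot product later equals cosine similarity. The sketch below assumes a hypothetical synchronous `embed` function supplied by the caller; a real sentence-transformer call would more likely be asynchronous, and the record shape is illustrative only.

```javascript
// Scale a vector to unit length so that a dot product equals cosine similarity.
function normalizeVector(v) {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Build an in-memory FAQ index from Q&A pairs.
// `embed` is a hypothetical function that maps text to a numeric vector.
function buildFaqIndex(pairs, embed) {
  return pairs.map(({ question, answer }) => ({
    question,
    answer,
    vector: normalizeVector(embed(question)),
  }));
}
```

Normalizing once at index time keeps the per-query work down to one dot product per entry.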

Phase 2: User Query Processing

  • FAQ matching with similarity threshold (default: 0.90)
  • RAG fallback for document retrieval when FAQ confidence is low
  • Citation-based responses with confidence scores
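
The threshold decision in this phase might look like the following sketch. It assumes index entries shaped like `{ answer, vector }` with unit-length vectors; the names `routeQuery` and `dotProduct` are hypothetical, and whether the real bot compares with `>=` or `>` is not stated here.

```javascript
// Dot product; equals cosine similarity when both vectors are unit length.
function dotProduct(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Decide between a direct FAQ answer and the RAG fallback.
// `threshold` mirrors the documented default of 0.90.
function routeQuery(queryVector, faqIndex, threshold = 0.9) {
  let best = null;
  let bestScore = -Infinity;
  for (const entry of faqIndex) {
    const score = dotProduct(queryVector, entry.vector);
    if (score > bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  if (best !== null && bestScore >= threshold) {
    return { source: "faq", answer: best.answer, confidence: bestScore };
  }
  // Below the threshold (or empty index): the caller should search document chunks.
  return { source: "rag", answer: null, confidence: best === null ? 0 : bestScore };
}
```

Returning the score alongside the answer is what makes it possible to show a confidence value next to each response.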

Phase 3: General Knowledge Retrieval

  • Full document corpus search for unmatched queries
  • Top-K retrieval with similarity ranking
  • Synthesized answers with source attribution
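
Top-K retrieval over document chunks can be sketched as a score, sort, and slice. The chunk shape `{ url, text, vector }` and the function name `topKChunks` are assumptions made for illustration; the similarity function is passed in so that any metric can be used.

```javascript
// Rank document chunks by similarity to the query and keep the best k.
// `similarity` is any function (a, b) => number, such as cosine similarity.
function topKChunks(queryVector, chunks, k, similarity) {
  return chunks
    .map((chunk) => ({
      url: chunk.url,
      text: chunk.text,
      score: similarity(queryVector, chunk.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Keeping the `url` on every result is what allows a synthesized answer to cite the documents it was built from.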

📁 Project Structure

tactica.faq.similaritysearch/
├── src/
│   ├── TacTicA.FaqSimilaritySearchBot.Training/
│   │   ├── Program.cs                      # C# training pipeline
│   │   ├── Services/                       # Vector embedding generation services
│   │   │   ├── SimpleEmbeddingService.cs      # Based on simplest hash-based vectors embeddings
│   │   │   ├── TfIdfEmbeddingService.cs       # Based on TF-IDF Vectorization and Keyword-Based Similarity
│   │   │   ├── OnnxEmbeddingService.cs        # Based on pre-trained Transformer Models (Customizable!)
│   │   │   └── DataProcessingService.cs       # Data processing & chunking
│   │   ├── models/                         # Input models files
│   │   │   ├── model_tokenizer.json           # https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
│   │   │   ├── model_vocab.txt                # https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
│   │   │   └── model.onnx                     # https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
│   │   └── data/                           # Input data files
│   │       ├── questions_answers.txt          # Q&A pairs (you provide)
│   │       ├── wiki_links.txt                 # Wiki URLs (you provide)
│   │       └── exported-content/              # Exported content to work in full isolation
│   ├── TacTicA.FaqSimilaritySearchBot.Web/    # Static web application for simple hash-based similarity search
│   │   ├── index.html                              # Main UI
│   │   ├── css/style.css                           # Styling
│   │   └── js/chatbot.js                           # Client-side logic
│   ├── TacTicA.FaqSimilaritySearchBot.WebOnnx/  # Static web application for AI based search
│   │   ├── index.html                                  # Main UI
│   │   ├── css/style.css                               # Styling
│   │   └── js/chatbot.js                               # Client-side logic
│   └── TacTicA.FaqSimilaritySearchBot.Shared/  # Shared models and utilities
│       ├── Models/Models.css
│       └── Utils/VectorUtils.cs
├── wwwroot/data/                               # Generated embeddings & data
├── build.ps1                                   # Build & Training script
└── TacTicA.FaqSimilaritySearchBot.sln

🚀 Quick Start

Prerequisites

1. Clone and Setup

git clone <your-repo-url>
cd tactica.faq.similaritysearch

2. Data

Data files in the TacTicA.FaqSimilaritySearchBot.Training/data/ directory:

data/questions_answers.txt (Q&A pairs, two lines per pair):

Q: Where is android webview build instructions?
A: https://source.chromium.org/chromium/chromium/src/+/main:android_webview/docs/build-instructions.md

Q: What are frame trees in Chromium?
A: https://source.chromium.org/chromium/chromium/src/+/main:docs/frame_trees.md
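
A reader for this format could be sketched as below. The training pipeline itself is written in C#, so this JavaScript function and its name `parseQuestionsAnswers` are purely illustrative; skipping a question that has no answer line is an assumption about how malformed input should be handled.

```javascript
// Parse "Q: ..." / "A: ..." line pairs into { question, answer } objects.
// Blank lines are ignored; a "Q:" line without a following "A:" line is skipped.
function parseQuestionsAnswers(text) {
  const pairs = [];
  let pendingQuestion = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith("Q:")) {
      pendingQuestion = line.slice(2).trim();
    } else if (line.startsWith("A:") && pendingQuestion !== null) {
      pairs.push({ question: pendingQuestion, answer: line.slice(2).trim() });
      pendingQuestion = null;
    }
  }
  return pairs;
}
```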

data/wiki_links.txt (one URL per line):

https://source.chromium.org/chromium/chromium/src/+/main:docs/documentation_best_practices.md
https://source.chromium.org/chromium/chromium/src/+/main:docs/fuchsia/gtests.md

data/exported-content

Put as many exported content files as you need into this directory, in a simple Markdown format. This lets the system work in full isolation, without even fetching the wiki links, which helps when the content sits behind authentication and you need to make part of it available to the bot without overcomplicating things.

The format is as follows (the URL line is a mandatory key field):

URL: https://source.chromium.org/chromium/chromium/src/+/main:docs/fuchsia/gtests.md
# Content in Markdown starts here

Your actual content goes here...
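
Reading such a file amounts to splitting off the URL line from the Markdown body. The sketch below assumes the URL line is the first non-empty line of the file, which the example above suggests but does not guarantee; `parseExportedContent` is a hypothetical name.

```javascript
// Split an exported content file into its URL key and Markdown body.
// Throws when the first non-empty line is not a "URL:" line, since the URL is mandatory.
function parseExportedContent(fileText) {
  const lines = fileText.split(/\r?\n/);
  const first = lines.findIndex((l) => l.trim() !== "");
  if (first === -1 || !lines[first].trim().startsWith("URL:")) {
    throw new Error("Missing mandatory URL line");
  }
  const url = lines[first].trim().slice("URL:".length).trim();
  if (url === "") {
    throw new Error("URL line is empty");
  }
  const content = lines.slice(first + 1).join("\n").trim();
  return { url, content };
}
```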

3. Build and Train

# Restore dependencies
dotnet restore
# Build the solution
dotnet build --configuration Release
# OR build and run the training in one step
cd "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Training" ; dotnet run --configuration Release

The process is straightforward and NOT optimized, so expect it to run slowly: a couple of hours.


4. Serve Locally

Depending on the similarity search you want to use, make sure $WebPath in build.ps1 points to the right web project:

  • TacTicA.FaqSimilaritySearchBot.WebOnnx - for AI-based similarity search
  • TacTicA.FaqSimilaritySearchBot.Web - for simplistic hash-based or TF-IDF similarity search

# Copy data to web server
Copy-Item -Path "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Training\wwwroot\data\*" -Destination "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.WebOnnx\data\" -Recurse -Force
# Run
cd "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.WebOnnx"; python -m http.server 8080

OR

# Copy data to web server
Copy-Item -Path "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Training\wwwroot\data\*" -Destination "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Web\data\" -Recurse -Force
# Run
cd "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Web"; python -m http.server 8080

🔧 Configuration

Training Configuration (src/TacTicA.FaqSimilaritySearchBot.Training/appsettings.json)

{
  "ProcessingSettings": {
    "MaxTokensPerChunk": 500,
    "SimilarityThreshold": 0.90,
    "TopKResults": 5
  }
}
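
To illustrate what a setting like MaxTokensPerChunk controls, here is a minimal chunking sketch. It treats a token as a whitespace-delimited word, which is an assumption: the real DataProcessingService may count tokens differently or split on sentence boundaries. The name `chunkText` is hypothetical.

```javascript
// Split text into chunks of at most `maxTokens` whitespace-separated tokens.
// Treating a "token" as a whitespace-delimited word is an assumption.
function chunkText(text, maxTokens = 500) {
  if (!Number.isInteger(maxTokens) || maxTokens <= 0) {
    throw new Error("maxTokens must be a positive integer");
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const chunks = [];
  for (let i = 0; i < tokens.length; i += maxTokens) {
    chunks.push(tokens.slice(i, i + maxTokens).join(" "));
  }
  return chunks;
}
```

Smaller chunks give more precise citations at the cost of more embeddings to compute and store.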

Customizing Embeddings

The current implementation ships three embedding services for demo purposes:

  • SimpleEmbeddingService - generates embeddings from simple hash-based vectors
  • TfIdfEmbeddingService - generates embeddings with TF-IDF vectorization and keyword-based similarity
  • OnnxEmbeddingService - generates embeddings with the pre-trained sentence transformer model all-MiniLM-L6-v2, converted to ONNX format
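
To give a feel for the simplest of these approaches, the sketch below shows one common form of hash-based embedding (the hashing trick). It is not taken from SimpleEmbeddingService, whose exact algorithm is not described here; the FNV-1a hash, the 64-dimension default, and the function names are all assumptions.

```javascript
// FNV-1a 32-bit hash of a string, returned as an unsigned integer.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Hashing-trick embedding: each lowercase word increments one bucket chosen
// by its hash, and the result is scaled to unit length.
function hashEmbedding(text, dimensions = 64) {
  const vector = new Array(dimensions).fill(0);
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  for (const word of words) {
    vector[fnv1a(word) % dimensions] += 1;
  }
  const norm = Math.sqrt(vector.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? vector : vector.map((x) => x / norm);
}
```

An embedding like this only captures word overlap, not meaning, which is the gap a transformer model such as all-MiniLM-L6-v2 is meant to close.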

📊 Data Flow

  1. Training Phase:

    Q&A Text → Embeddings → Vector Index
    Wiki URLs → Content → Chunks → Embeddings → Document Index

  2. Query Phase:

    User Question → Embedding → FAQ Search → High Score?
    ├─ Yes: FAQ Response + Related Docs
    └─ No:  RAG Search → Top Chunks → Generated Response
    

🔍 Known Issues

Low Accuracy Responses:

  • Increase the similarity threshold in the config
  • Add more Q&A pairs to the training data
  • Improve the quality of the existing wiki content

Built with ❤️ AI