An AI-powered chatbot designed to help users quickly find answers to questions about the Chromium codebase. The bot uses vector embeddings and RAG (Retrieval-Augmented Generation) to provide accurate, citation-backed responses.
This bot DOES NOT replace ChatGPT, Gemini, Copilot, or other large general-purpose models. Instead, it quickly retrieves relevant information from your internal documents.
Essentially, this is a showcase of what is possible with similarity search over your internal corporate data, though it might not be the best example.
- Smart FAQ Matching: Uses vector embeddings for intelligent question matching
- RAG Fallback: Searches entire document corpus when FAQ doesn't match
- Citation-Based Responses: Every answer includes sources and confidence scores
- Static Deployment: Runs entirely in the browser - perfect for GitHub Pages
- Responsive Design: Works on desktop and mobile devices
- Real-time Processing: Fast in-memory vector search with cosine similarity
- Vector embeddings of Q&A pairs using sentence transformers
- Fast in-memory vector index for similarity search
- Configurable similarity threshold for FAQ matching
- FAQ matching with similarity threshold (default: 0.90)
- RAG fallback for document retrieval when FAQ confidence is low
- Citation-based responses with confidence scores
- Full document corpus search for unmatched queries
- Top-K retrieval with similarity ranking
- Synthesized answers with source attribution
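The cosine-similarity ranking and Top-K retrieval listed above can be sketched as follows. This is a minimal illustration in Python, not the project's C# or JavaScript code; the function names, the tiny 3-dimensional vectors, and the labels are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, index, k=5):
    """Rank (label, vector) entries by similarity to the query and keep the best k."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings; a model such as all-MiniLM-L6-v2 emits 384 dimensions.
index = [
    ("frame_trees.md", [0.9, 0.1, 0.0]),
    ("build-instructions.md", [0.1, 0.9, 0.0]),
    ("gtests.md", [0.0, 0.2, 0.9]),
]
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

A linear scan like this is usually adequate for an in-memory index of a few thousand vectors; larger corpora would typically call for an approximate nearest-neighbour structure.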
tactica.faq.similaritysearch/
├── src/
│   ├── TacTicA.FaqSimilaritySearchBot.Training/
│   │   ├── Program.cs                        # C# training pipeline
│   │   ├── Services/                         # Vector embedding generation services
│   │   │   ├── SimpleEmbeddingService.cs     # Simple hash-based vector embeddings
│   │   │   ├── TfIdfEmbeddingService.cs      # TF-IDF vectorization and keyword-based similarity
│   │   │   ├── OnnxEmbeddingService.cs       # Pre-trained transformer models (customizable!)
│   │   │   └── DataProcessingService.cs      # Data processing & chunking
│   │   ├── models/                           # Input model files
│   │   │   ├── model_tokenizer.json          # https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
│   │   │   ├── model_vocab.txt               # https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
│   │   │   └── model.onnx                    # https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
│   │   └── data/                             # Input data files
│   │       ├── questions_answers.txt         # Q&A pairs (you provide)
│   │       ├── wiki_links.txt                # Wiki URLs (you provide)
│   │       └── exported-content/             # Exported content to work in full isolation
│   ├── TacTicA.FaqSimilaritySearchBot.Web/       # Static web application for simple hash-based similarity search
│   │   ├── index.html                        # Main UI
│   │   ├── css/style.css                     # Styling
│   │   └── js/chatbot.js                     # Client-side logic
│   ├── TacTicA.FaqSimilaritySearchBot.WebOnnx/   # Static web application for AI-based search
│   │   ├── index.html                        # Main UI
│   │   ├── css/style.css                     # Styling
│   │   └── js/chatbot.js                     # Client-side logic
│   └── TacTicA.FaqSimilaritySearchBot.Shared/    # Shared models and utilities
│       ├── Models/Models.css
│       └── Utils/VectorUtils.cs
├── wwwroot/data/                             # Generated embeddings & data
├── build.ps1                                 # Build & training script
└── TacTicA.FaqSimilaritySearchBot.sln
- .NET 9.0 SDK
- Python 3.x (optional, for local web server)
- Visual Studio Code or Visual Studio
git clone <your-repo-url>
cd tactica.faq.similaritysearch
Data files in the TacTicA.FaqSimilaritySearchBot.Training/data/ directory:
data/questions_answers.txt (Q&A pairs, one pair per two lines):
Q: Where is android webview build instructions?
A: https://source.chromium.org/chromium/chromium/src/+/main:android_webview/docs/build-instructions.md
Q: What are frame trees in Chromium?
A: https://source.chromium.org/chromium/chromium/src/+/main:docs/frame_trees.md
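For illustration, the two-line Q&A format above could be read with a small parser like the one below. This is a sketch in Python under the assumption that every question line starts with `Q:` and every answer line with `A:`; the function name `parse_qa_pairs` is hypothetical and is not part of the project's training pipeline.

```python
def parse_qa_pairs(text):
    """Parse alternating 'Q: ...' / 'A: ...' lines into (question, answer) tuples."""
    pairs = []
    question = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Q:"):
            # Remember the question until its answer line arrives.
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs

sample = (
    "Q: What are frame trees in Chromium?\n"
    "A: https://source.chromium.org/chromium/chromium/src/+/main:docs/frame_trees.md\n"
)
pairs = parse_qa_pairs(sample)
```

Blank lines and answers without a preceding question are silently skipped here; a stricter loader might prefer to report them.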
data/wiki_links.txt (one URL per line):
https://source.chromium.org/chromium/chromium/src/+/main:docs/documentation_best_practices.md
https://source.chromium.org/chromium/chromium/src/+/main:docs/fuchsia/gtests.md
data/exported-content
Put as many exported content files as you need into this directory, in a simple Markdown format. This lets the system work in full isolation, without reading the wiki links at all, which is very helpful when content is restricted behind any type of authentication and you need to make some portion of the data available to the bot without over-complicating things.
The format is as follows (the URL line is a mandatory key field):
URL: https://source.chromium.org/chromium/chromium/src/+/main:docs/fuchsia/gtests.md
# Content in Markdown starts here
Your actual content goes here...
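A loader for this exported-content format might look like the sketch below. It assumes the `URL:` line is the first non-blank line of the file and that everything after it is the Markdown body; `parse_exported_content` is a hypothetical name, and the real pipeline may handle the file differently.

```python
def parse_exported_content(text):
    """Split an exported file into its mandatory URL key and the Markdown body."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith("URL:"):
        raise ValueError("first non-blank line must start with 'URL:'")
    url = lines[0][len("URL:"):].strip()
    body = "\n".join(lines[1:]).strip()
    return url, body

sample = (
    "URL: https://source.chromium.org/chromium/chromium/src/+/main:docs/fuchsia/gtests.md\n"
    "# Content in Markdown starts here\n"
    "Your actual content goes here...\n"
)
url, body = parse_exported_content(sample)
```

Rejecting files without a URL line keeps every chunk traceable to a source, which citation-based responses depend on.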
# Restore dependencies
dotnet restore
# Build solution
dotnet build --configuration Release
# OR build and run training together
cd "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Training" ; dotnet run --configuration Release
The process is straightforward and NOT optimized, so it runs very slowly and can take a couple of hours.
Depending on the similarity search you want to use, make sure $WebPath in build.ps1 points to the right Web Project:
TacTicA.FaqSimilaritySearchBot.WebOnnx - for AI based similarity search
TacTicA.FaqSimilaritySearchBot.Web - for simplistic hash-based or TF-IDF similarity search
# Copy data to web server
Copy-Item -Path "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Training\wwwroot\data\*" -Destination "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.WebOnnx\data\" -Recurse -Force
# Run
cd "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.WebOnnx"; python -m http.server 8080
OR
# Copy data to web server
Copy-Item -Path "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Training\wwwroot\data\*" -Destination "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Web\data\" -Recurse -Force
# Run
cd "[PATH_TO_SOURCES]\src\TacTicA.FaqSimilaritySearchBot.Web"; python -m http.server 8080
{
  "ProcessingSettings": {
    "MaxTokensPerChunk": 500,
    "SimilarityThreshold": 0.90,
    "TopKResults": 5
  }
}
The current implementation uses three possible embedding services for demo purposes:
├─ SimpleEmbeddingService - generates vector embeddings with a simple hash-based scheme
├─ TfIdfEmbeddingService - generates vector embeddings with TF-IDF vectorization and keyword-based similarity
└─ OnnxEmbeddingService - generates vector embeddings with the pre-trained sentence transformer model all-MiniLM-L6-v2, converted to ONNX format
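To give a feel for the simplest of the three approaches, here is a generic feature-hashing sketch in Python. It is an assumption about what a hash-based embedding can look like, not a port of SimpleEmbeddingService; the function name, the 64-dimension default, and the use of MD5 as the stable hash are all illustrative choices.

```python
import hashlib
import math
import re

def hash_embedding(text, dims=64):
    """Map text to a fixed-size, L2-normalized bag-of-words vector via feature hashing."""
    vector = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable hash (unlike Python's salted built-in hash()) keeps vectors reproducible across runs.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dims
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    # An empty or punctuation-only text yields the all-zero vector.
    return [v / norm for v in vector] if norm else vector

vec = hash_embedding("What are frame trees in Chromium?")
```

Such vectors only capture word overlap, so two differently worded questions with the same meaning score low; that limitation is what the transformer-based service is meant to address.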
- Training Phase:
  Q&A Text → Embeddings → Vector Index
  Wiki URLs → Content → Chunks → Embeddings → Document Index
- Query Phase:
  User Question → Embedding → FAQ Search → High Score?
  ├─ Yes: FAQ Response + Related Docs
  └─ No: RAG Search → Top Chunks → Generated Response
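The query-phase decision can be sketched as below, assuming the FAQ index and document index are plain lists of (vector, text) pairs and that a score equal to the threshold counts as a match. The names `route_query`, `faq_index`, and `doc_index` are hypothetical, and the final response-generation step is left out.

```python
import math

SIMILARITY_THRESHOLD = 0.90  # default value quoted in the configuration
TOP_K_RESULTS = 5

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def route_query(query_vec, faq_index, doc_index,
                threshold=SIMILARITY_THRESHOLD, top_k=TOP_K_RESULTS):
    """Return the FAQ answer on a high-confidence match, else the top document chunks."""
    best_score, best_answer = max(
        ((cosine(query_vec, vec), answer) for vec, answer in faq_index),
        key=lambda pair: pair[0],
        default=(0.0, None),
    )
    if best_answer is not None and best_score >= threshold:
        return {"source": "faq", "score": best_score, "answer": best_answer}
    # RAG fallback: rank every document chunk and keep the best top_k.
    ranked = sorted(
        ((cosine(query_vec, vec), chunk) for vec, chunk in doc_index),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return {"source": "rag", "chunks": ranked[:top_k]}

# Hypothetical 2-dimensional toy data.
faq_index = [([1.0, 0.0], "See docs/frame_trees.md")]
doc_index = [([0.0, 1.0], "chunk about gtests"), ([1.0, 1.0], "chunk about both")]
hit = route_query([1.0, 0.0], faq_index, doc_index)
miss = route_query([0.0, 1.0], faq_index, doc_index)
```

Returning the scores alongside the text lets the caller surface them as the confidence values shown with each citation.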
Low Accuracy Responses:
- Increase the similarity threshold in the config
- Add more Q&A pairs to the training data
- Improve the quality of existing wiki content
Built with ❤️ AI

