A Python script for processing Markdown files, generating embeddings, and storing them in a vector store. This tool allows you to clean, split, and embed Markdown documents using various methods and embedding models.

## Features
- Data Cleaning: Removes duplicates and filters out unwanted content like '404' pages and lines containing the '©' symbol.
- Flexible Input: Supports input from JSON files containing URLs and Markdown data, folders of Markdown files, or single Markdown files.
- Document Splitting: Splits documents using Markdown headers or recursive character splitting.
- Embedding Options: Supports embedding using HuggingFace or Ollama embeddings.
- Vector Store Integration: Stores embeddings in a Chroma vector store for efficient retrieval and analysis.
- Customizable Filters: Option to disable the filters that remove specific content.
- Logging: Generates logs for duplicates and removed files for better traceability.
## Installation

### Prerequisites

- Python 3.7 or higher
- pip
- Git (optional, for cloning the repository)
### Clone the Repository

```bash
git clone https://github.com/GATERAGE/mdmbed.git
cd mdmbed
```
### Install Required Packages

Install the required Python packages using pip:

```bash
pip install -r requirements.txt
```
Note: The requirements.txt file should list all the dependencies, such as tqdm, langchain, chromadb, langchain-huggingface, etc.

## Usage
Run the script using Python:
```bash
python md-embed.py [--filters-off]
```

- `--filters-off`: Disable the filters that remove lines containing '©' and skip files containing both '404' and 'page not found'.
Upon running the script, you will be prompted to choose an input method:
1. JSON input file containing URLs and Markdown data
2. Folder of Markdown files
3. Single Markdown file
### JSON Input File
If you choose Option 1, you will be asked to provide:

- Path of the JSON input file: the file should be a JSON array of objects, each containing `url` and `markdown` keys.
- Path of the output folder: the folder where cleaned Markdown files and logs will be saved.
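For illustration, an input file in the expected shape (a JSON array of objects with `url` and `markdown` keys) can be generated like this; the filename `input.json` and the URLs are placeholders, not values the tool requires:

```python
# Build a minimal sample input file in the shape md-embed expects.
import json

entries = [
    {"url": "https://example.com/intro", "markdown": "# Intro\n\nSome text."},
    {"url": "https://example.com/api", "markdown": "# API\n\nMore text."},
]

with open("input.json", "w", encoding="utf-8") as f:
    json.dump(entries, f, ensure_ascii=False, indent=2)
```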
The script will:

1. Clean the data by removing duplicates.
2. Save the cleaned Markdown files to the specified output folder.
3. Generate a `file_to_url.json` mapping file.
4. Display a summary of the processing.
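The cleaning pass can be sketched in pure Python like this. It is not the script's actual code: the helper names and the filename scheme derived from URLs are illustrative assumptions.

```python
# Sketch only: de-duplicate entries by URL and build a file_to_url.json-style
# mapping. The real script's duplicate detection and naming may differ.
import re

def clean_entries(entries):
    seen, cleaned, duplicates = set(), [], []
    for entry in entries:
        url = entry["url"]
        if url in seen:
            duplicates.append(url)  # kept for the duplicates log
            continue
        seen.add(url)
        cleaned.append(entry)
    return cleaned, duplicates

def build_mapping(cleaned):
    # Hypothetical filename scheme: non-alphanumeric runs become underscores.
    file_to_url = {}
    for entry in cleaned:
        name = re.sub(r"[^A-Za-z0-9]+", "_", entry["url"]).strip("_") + ".md"
        file_to_url[name] = entry["url"]
    return file_to_url
```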
### Folder of Markdown Files
If you choose Option 2, you will be asked to provide:

- Path of the folder containing Markdown files.

The script will:

1. Load all `.md` files from the specified folder.
2. Optionally filter out unwanted content.
3. Proceed to document splitting.
### Single Markdown File
If you choose Option 3, you will be asked to provide:

- Path of the Markdown file.

The script will:

1. Load the specified Markdown file.
2. Optionally filter out unwanted content.
3. Proceed to document splitting.
### Document Splitting
After loading the documents, you will be prompted to split them:

- Split Method: Choose between `markdown` or `recursive` splitting.
- Remove Links: Optionally remove links from the Markdown content.
- Language: Specify the programming language or language of the content.
- Additional settings:
  - For Markdown splitting:
    - Header Levels: Specify which header levels (`#`, `##`, etc.) to split on.
  - For recursive splitting:
    - Chunk Size: Specify the maximum size of each chunk (in characters).
    - Chunk Overlap: Specify the number of overlapping characters between chunks.
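The two split modes behave roughly like the following pure-Python sketches. The script itself uses langchain splitters; these approximations only illustrate what header-based splitting and chunk size/overlap mean.

```python
# Behavioral sketches of the two split modes (not the actual splitters).
def split_on_headers(md_text, levels=(1, 2)):
    """Start a new section whenever a heading at one of `levels` appears."""
    sections, current = [], []
    for line in md_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            hashes = len(stripped) - len(stripped.lstrip("#"))
            if hashes in levels and current:
                sections.append("\n".join(current))
                current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

def split_recursive(text, chunk_size=500, chunk_overlap=50):
    """Fixed-size windows with overlap (ignores separators, unlike langchain)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With `chunk_size=4` and `chunk_overlap=2`, each chunk repeats the last two characters of the previous one, which is the trade-off the overlap prompt controls: more overlap preserves context across chunk boundaries at the cost of redundancy.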
You will have the option to preview the split data before proceeding.

### Embedding and Saving
After splitting, you will be prompted to embed and save the documents:

- Embedding Method: Choose between `huggingface` or `ollama`.
  - HuggingFace: Enter the embedding model name (default: `all-MiniLM-L6-v2`).
  - Ollama: Enter the Ollama model name (default: `nomic-embed-text`).
- Persist Directory: Specify the directory to save the vector store database.
- Collection Name: Enter a name for the Chroma collection.
The script will:

1. Embed the documents using the chosen embedding method.
2. Save the embeddings to a Chroma vector store.
3. Display information about the saved collections.
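The prompt logic for choosing a model can be sketched as below. Only the method names and defaults come from the tool's documentation; the function itself is hypothetical (the real script goes on to instantiate langchain embedding classes with the resolved name).

```python
# Sketch: map the embedding-method prompt to the advertised defaults.
DEFAULT_MODELS = {
    "huggingface": "all-MiniLM-L6-v2",
    "ollama": "nomic-embed-text",
}

def resolve_embedding(method, model_name=""):
    """Return the method plus either the user's model or the default."""
    method = method.strip().lower()
    if method not in DEFAULT_MODELS:
        raise ValueError(f"unknown embedding method: {method!r}")
    return {"method": method, "model": model_name or DEFAULT_MODELS[method]}
```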
## Examples

### Example 1: Process a JSON Input File
```bash
python md-embed.py
```

- Choose input method: 1
- Enter the path of the JSON input file: ./data/input.json
- Enter the path of the output folder: ./output
Proceed through the prompts to clean data, split documents, and embed them.

### Example 2: Process a Folder of Markdown Files with Filters Off
```bash
python md-embed.py --filters-off
```

- Choose input method: 2
- Enter the path of the folder containing Markdown files: ./markdown_files
Proceed through the prompts to load, split, and embed the documents.

## Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a new branch:

   ```bash
   git checkout -b feature/your-feature-name
   ```

3. Make your changes and commit them:

   ```bash
   git commit -m "Add your message"
   ```

4. Push to the branch:

   ```bash
   git push origin feature/your-feature-name
   ```

5. Open a Pull Request.
Please make sure your code adheres to the existing style and that all tests pass.

## License
This project is licensed under the MIT License.

## Acknowledgments

- web3dguy
- LangChain for text splitting and document handling.
- HuggingFace for embedding models.
- Chroma for the vector store.
- tqdm for progress bars.
- The open-source community for continuous support and contributions.
---

md-embed processes markdown files, cleans and prepares the data, splits the text into manageable chunks, and creates embeddings for use in vector databases (specifically ChromaDB). It supports multiple input methods and provides options for customizing the splitting and embedding process.
- Multiple Input Methods:
  - JSON file containing URLs and markdown data
  - Folder of markdown files
  - Single markdown file
- Data Cleaning:
  - Removes duplicate entries based on URL section titles
  - Handles encoding issues
  - Sanitizes filenames for safe saving
  - Optionally filters out files containing "404" and "page not found" (can be disabled)
  - Removes lines containing the copyright symbol "©"
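The filters and the filename sanitizer can be sketched as follows. The helper names are illustrative, not the script's, and the sanitizer's character set and length cap are assumptions:

```python
# Sketches of the content filters (both bypassed by --filters-off)
# and a filename sanitizer for safe saving.
import re

def should_skip_file(text):
    """Skip files that contain both '404' and 'page not found'."""
    lowered = text.lower()
    return "404" in lowered and "page not found" in lowered

def strip_copyright_lines(text):
    """Drop every line containing the copyright symbol."""
    return "\n".join(line for line in text.splitlines() if "©" not in line)

def sanitize_filename(name, max_len=100):
    """Replace characters unsafe in filenames with underscores."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name)[:max_len]
```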
- Text Splitting:
  - Markdown Header Splitting: Splits text based on specified markdown header levels (e.g., `#`, `##`), allows custom header level selection, and preserves the header hierarchy in metadata
  - Recursive Character Text Splitting: Splits text into chunks of specified size and overlap
  - Link Removal: Optionally removes markdown links, keeping only the link text
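Link removal amounts to replacing each markdown link with its link text. A regex sketch of the idea (the script's own implementation may differ, and image and reference-style links would need extra cases):

```python
# Replace [text](url) links with just their text.
import re

LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")

def remove_links(md_text):
    return LINK_RE.sub(r"\1", md_text)
```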
- Embedding Generation:
  - Supports Hugging Face embeddings (using `langchain_huggingface`); defaults to `all-MiniLM-L6-v2`
  - Supports Ollama embeddings (using `langchain_community`); defaults to `nomic-embed-text` and requires a local Ollama server running at http://localhost:11434
- Vector Database Integration:
  - Uses ChromaDB (`langchain_chroma`) to store embeddings and associated metadata
  - Allows specifying the collection name and persistence directory
  - Handles large datasets by processing in batches
- Logging:
  - Comprehensive logging through the `logging` module
- Duplicate Logs:
  - Writes URLs with duplicate sections to a log
- Removed Files Logs:
  - Writes files that have been removed due to filters to a log
Requirements:

- Python 3.7+
- `langchain` (various components; see the import statements)
- `chromadb`
- `tqdm`
- `beautifulsoup4` and `requests` are only needed for scraping; this script does not actually use them
To install the required packages, run:
```bash
pip install langchain langchain-chroma langchain-huggingface langchain-community tqdm
```
If you are planning to use Ollama, you need to:

1. Install Ollama by following the instructions provided at Ollama's official website.
2. Run an Ollama server locally on port 11434.
md-embed can be run from the command line. It provides a command-line interface using argparse with the following option:
- `--filters-off`: Disables the "404" and "©" filters
The script will then guide you through a series of interactive prompts to configure the processing:
1. Input Method Selection: Choose between JSON input, a folder of markdown files, or a single markdown file
2. Input File/Folder: Provide the path to the input file or folder, as appropriate
3. Output Folder (for JSON input): Specify the directory where cleaned markdown files will be saved
4. Data Cleaning Summary: The script will show total entries and total duplicates
5. Language: Specify the primary language of the input files (e.g., "TypeScript", "Python")
6. Splitting Method: Choose between "markdown" (header-based splitting) and "recursive" (chunk size and overlap)
7. Markdown Splitting Options (if applicable):
   - Remove Links: Choose whether to remove markdown links
   - Header Levels: Specify which header levels to split on (e.g., "1,2,3" for #, ##, and ###), or enter "all" for all header levels
8. Recursive Splitting Options (if applicable):
   - Remove Links: Choose whether to remove markdown links
   - Chunk Size: Specify the desired chunk size (in characters)
   - Chunk Overlap: Specify the desired chunk overlap (in characters)
9. Preview Splits: Choose whether to preview the split data ("yes", "full", or "no")
10. Split Again: You'll be prompted to continue or modify the settings
11. Embedding Method: Choose between "huggingface" and "ollama"
12. Embedding Model (Hugging Face): Enter the Hugging Face model name (defaults to `all-MiniLM-L6-v2`)
13. Embedding Model (Ollama): Enter the Ollama model name (defaults to `nomic-embed-text`)
14. Persistence Directory: Specify the directory where the ChromaDB database will be stored
15. Collection Name: Choose a name for the ChromaDB collection
Example (JSON input):

```bash
python md-embed.py
```

Follow the prompts, providing the necessary information (input file, output folder, embedding choices, etc.).

Example (disabling filters):

```bash
python md-embed.py --filters-off
```
Output:

- Cleaned Markdown Files (JSON input): If using JSON input, the script will save cleaned markdown files to the specified output folder
- ChromaDB Database: The script will create a ChromaDB database in the specified persistence directory, containing the embeddings and metadata
- Logs: The logs directory will contain logs of removed files (if any) and duplicate entries (if using JSON input)
- `file_to_url.json`: A JSON file that records the original URL of each document
## Error Handling

The script includes error handling for various scenarios, such as:

- Invalid input file/folder paths
- File I/O errors
- Exceptions during data cleaning, splitting, or embedding
- Invalid user input for prompts

Errors are logged using the `logging` module.
## Notes

- The script assumes that the input JSON data has `url` and `markdown` keys for each entry.
- The script uses `uuid4` to generate unique IDs for each document in the vector database.
- The script processes splits in batches to handle large numbers of documents.
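The last two notes can be sketched together: assign each split a `uuid4` id, then hand splits and ids to the vector store in fixed-size batches. The batch size and the `add_fn` callback are illustrative assumptions, not the script's actual values:

```python
# Sketch: uuid4 ids plus batched insertion into a vector store.
import uuid

def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def add_in_batches(splits, add_fn, batch_size=100):
    """Assign unique ids and call add_fn(docs, ids) once per batch."""
    ids = [str(uuid.uuid4()) for _ in splits]
    for batch_ids, batch_docs in zip(batched(ids, batch_size),
                                     batched(splits, batch_size)):
        add_fn(batch_docs, batch_ids)  # e.g. a Chroma add call
    return ids
```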
Disclaimer: This tool is provided "as is" without warranty of any kind. Use it at your own risk. Open source or go away.