GATERAGE/mdmbed (forked from Web3dGuy/md-embed)

RAGE ingest for digest as markdown embed
md-embed (c) 2024 web3dguy

A Python script for processing Markdown files, generating embeddings, and storing them in a vector store. This tool allows you to clean, split, and embed Markdown documents using various methods and embedding models.

Features

Data Cleaning: Removes duplicates and filters out unwanted content like '404' pages and lines containing the '©' symbol.
Flexible Input: Supports input from JSON files containing URLs and Markdown data, folders of Markdown files, or single Markdown files.
Document Splitting: Splits documents using Markdown headers or recursive character splitting.
Embedding Options: Supports embedding using HuggingFace or Ollama embeddings.
Vector Store Integration: Stores embeddings in a Chroma vector store for efficient retrieval and analysis.
Customizable Filters: Option to disable filters that remove specific content.
Logging: Generates logs for duplicates and removed files for better traceability.

Installation

Prerequisites

    Python 3.7 or higher
    pip
    Git (optional, for cloning the repository)

Clone the Repository

git clone https://github.com/GATERAGE/mdmbed.git
cd mdmbed

Install Required Packages

Install the required Python packages using pip:

pip install -r requirements.txt

Note: The requirements.txt file should list all the dependencies, such as tqdm, langchain, chromadb, huggingface, etc.

Usage

Run the script using Python:

python md-embed.py [--filters-off]

Command-Line Arguments

--filters-off: Disable filters that remove lines containing '©' and skip files containing both '404' and 'page not found'.

Upon running the script, you will be prompted to choose an input method:

JSON Input File Containing URLs and Markdown Data
Folder of Markdown Files
Single Markdown File

JSON Input File

If you choose Option 1, you will be asked to provide:

Path of the JSON input file: The file should be a JSON array of objects, each containing url and markdown keys.
Path of the output folder: The folder where cleaned Markdown files and logs will be saved.

The script will:

Clean the data by removing duplicates.
Save the cleaned Markdown files to the specified output folder.
Generate a file_to_url.json mapping file.
Display a summary of the processing.
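The cleaning step above can be sketched in a few lines of Python. The helper names, the exact-match dedup criterion, and the filename-sanitizing rule here are illustrative assumptions, not md-embed's actual internals; only the `url`/`markdown` keys and the `file_to_url.json` output come from the README.

```python
import json
import re
from pathlib import Path

def sanitize_filename(url):
    """Turn a URL into a filesystem-safe .md filename (illustrative rule)."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", url).strip("_") + ".md"

def clean_and_save(entries, out_dir):
    """Dedupe entries, save cleaned files, and write the file->URL map."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen, mapping = set(), {}
    for entry in entries:
        body = entry["markdown"]
        if body in seen:  # duplicate content: skip
            continue
        seen.add(body)
        name = sanitize_filename(entry["url"])
        (out / name).write_text(body, encoding="utf-8")
        mapping[name] = entry["url"]
    (out / "file_to_url.json").write_text(json.dumps(mapping, indent=2))
    return mapping
```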

Folder of Markdown Files

If you choose Option 2, you will be asked to provide:

Path of the folder containing Markdown files.

The script will:

Load all .md files from the specified folder.
Optionally filter out unwanted content.
Proceed to document splitting.
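A minimal sketch of the folder-loading step with the optional filters described above. The function name and return shape are illustrative; the "404"/"page not found" skip and the "©" line removal follow the filter behavior the README documents.

```python
from pathlib import Path

def load_markdown_folder(folder, filters_on=True):
    """Load .md files; optionally skip 404 pages and drop lines with '©'."""
    docs = []
    for path in sorted(Path(folder).glob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        low = text.lower()
        if filters_on and "404" in low and "page not found" in low:
            continue  # looks like a scraped 404 page: skip the whole file
        if filters_on:
            text = "\n".join(l for l in text.splitlines() if "©" not in l)
        docs.append((path.name, text))
    return docs
```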

Single Markdown File

If you choose Option 3, you will be asked to provide:

Path of the Markdown file.

The script will:

Load the specified Markdown file.
Optionally filter out unwanted content.
Proceed to document splitting.

Document Splitting

After loading the documents, you will be prompted to split them:

Split Method: Choose between markdown or recursive splitting.
Remove Links: Optionally remove links from the Markdown content.
Language: Specify the programming language or language of the content.
Additional Settings:
    For Markdown Splitting:
        Header Levels: Specify which header levels (#, ##, etc.) to split on.
    For Recursive Splitting:
        Chunk Size: Specify the maximum size of each chunk.
        Chunk Overlap: Specify the number of overlapping characters between chunks.
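The chunk-size/overlap behavior of recursive splitting can be illustrated in plain Python. md-embed itself delegates to LangChain's splitters; this standalone sketch only mirrors the idea that each chunk repeats the last `chunk_overlap` characters of the previous one.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size chars with overlap."""
    assert 0 <= chunk_overlap < chunk_size
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - chunk_overlap  # step forward, minus overlap
    return chunks
```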

You will have the option to preview the split data before proceeding.

Embedding and Saving

After splitting, you will be prompted to embed and save the documents:

Embedding Method: Choose between huggingface or ollama.
    HuggingFace: Enter the embedding model name (default: all-MiniLM-L6-v2).
    Ollama: Enter the Ollama model name (default: nomic-embed-text).
Persist Directory: Specify the directory to save the vector store database.
Collection Name: Enter a name for the Chroma collection.

The script will:

Embed the documents using the chosen embedding method.
Save the embeddings to a Chroma vector store.
Display information about the saved collections.
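The embed-and-save step can be sketched with the LangChain classes the README names (`langchain-huggingface`, `langchain_community`, `langchain-chroma`). The wrapper function itself is illustrative and not part of md-embed's API.

```python
def embed_and_save(docs, method, model_name, persist_dir, collection):
    """Embed docs with the chosen backend and persist them to Chroma."""
    if method == "huggingface":
        from langchain_huggingface import HuggingFaceEmbeddings
        embeddings = HuggingFaceEmbeddings(model_name=model_name)
    else:  # "ollama": needs a local server at http://localhost:11434
        from langchain_community.embeddings import OllamaEmbeddings
        embeddings = OllamaEmbeddings(model=model_name)
    from langchain_chroma import Chroma
    store = Chroma(collection_name=collection,
                   embedding_function=embeddings,
                   persist_directory=persist_dir)
    store.add_documents(docs)
    return store
```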

Examples

Example 1: Process JSON Input File

python md-embed.py

Choose Input Method: 1

Enter the path of the JSON input file: ./data/input.json
Enter the path of the output folder: ./output

Proceed through the prompts to clean data, split documents, and embed them.

Example 2: Process Folder of Markdown Files with Filters Off

python md-embed.py --filters-off

Choose Input Method: 2

Enter the path of the folder containing markdown files: ./markdown_files

Proceed through the prompts to load, split, and embed the documents.

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository.

Create a new branch:
git checkout -b feature/your-feature-name

Make your changes and commit them:

git commit -m "Add your message"

Push to the branch:

git push origin feature/your-feature-name
Open a Pull Request.

Please make sure your code adheres to the existing style and that all tests pass.

License

This project is licensed under the MIT License.

Acknowledgments

web3dguy
LangChain for text splitting and document handling.
HuggingFace for embedding models.
Chroma for the vector store.
TQDM for progress bars.
The open-source community for continuous support and contributions.

Markdown Processor and Embedder

md-embed processes markdown files, cleans and prepares the data, splits the text into manageable chunks, and creates embeddings for use in vector databases (specifically ChromaDB). It supports multiple input methods and provides options for customizing the splitting and embedding process.

Features

  • Multiple Input Methods:
    • JSON file containing URLs and markdown data
    • Folder of markdown files
    • Single markdown file
  • Data Cleaning:
    • Removes duplicate entries based on URL section titles
    • Handles encoding issues
    • Sanitizes filenames for safe saving
    • Optionally filters out files containing "404" and "page not found" (can be disabled)
    • Removes lines containing the copyright symbol "©"
  • Text Splitting:
    • Markdown Header Splitting: Splits text based on specified markdown header levels (e.g., #, ##). Allows for custom header level selection. Preserves header hierarchy in metadata
    • Recursive Character Text Splitting: Splits text into chunks of specified size and overlap
    • Link Removal: Optionally removes markdown links, keeping only the link text
  • Embedding Generation:
    • Supports Hugging Face embeddings (using langchain_huggingface). Defaults to all-MiniLM-L6-v2
    • Supports Ollama embeddings (using langchain_community). Defaults to nomic-embed-text, requires a local Ollama server running at http://localhost:11434
  • Vector Database Integration:
    • Uses ChromaDB (langchain_chroma) to store embeddings and associated metadata
    • Allows specifying the collection name and persistence directory
    • Handles large datasets by processing in batches
  • Logging:
    • Comprehensive logging through the logging module
  • Duplicate Logs:
    • Writes URLs with duplicate sections to a log
  • Removed Files Log:
    • Writes files that have been removed by the filters to a log
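The "remove links, keep link text" option listed above can be implemented with a single regex that rewrites `[text](url)` to just `text`. This pattern is one reasonable way to do it, not necessarily the exact one md-embed uses.

```python
import re

def strip_markdown_links(text):
    """Replace markdown links [text](url) with their link text."""
    return re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
```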

Requirements

  • Python 3.7+
  • langchain (various components - see import statements)
  • chromadb
  • tqdm
  • beautifulsoup4 (listed for scraping workflows; this script does not actually use it)
  • requests (likewise not used by this script)

To install the required packages, run:

pip install langchain langchain-chroma langchain-huggingface chromadb tqdm

If you are planning to use Ollama, you also need to:

Install Ollama by following the instructions provided at Ollama's official website.
Run an Ollama server locally on port 11434.
md-embed can be run from the command line. It provides a command-line interface using argparse with the following option:

--filters-off: Disables the "404" and "©" filters

The script will then guide you through a series of interactive prompts to configure the processing:

  • Input Method Selection: Choose between JSON input, a folder of markdown files, or a single markdown file
  • Input File/Folder: Provide the path to the input file or folder, as appropriate
  • Output Folder (for JSON input): Specify the directory where cleaned markdown files will be saved
  • Data Cleaning Summary: The script shows the total entries and total duplicates
  • Language: Specify the primary language of the input files (e.g., "TypeScript", "Python")
  • Splitting Method: Choose between "markdown" (header-based splitting) and "recursive" (chunk size and overlap)
  • Markdown Splitting Options (if applicable):
    • Remove Links: Choose whether to remove markdown links
    • Header Levels: Specify which header levels to split on (e.g., "1,2,3" for #, ##, and ###). Enter "all" for all header levels
  • Recursive Splitting Options (if applicable):
    • Remove Links: Choose whether to remove markdown links
    • Chunk Size: Specify the desired chunk size (in characters)
    • Chunk Overlap: Specify the desired chunk overlap (in characters)
  • Preview Splits: Choose whether to preview the split data ("yes", "full", or "no")
  • Split Again: You'll be prompted to continue or modify the settings
  • Embedding Method: Choose between "huggingface" and "ollama"
  • Embedding Model (Hugging Face): Enter the Hugging Face model name (defaults to all-MiniLM-L6-v2)
  • Embedding Model (Ollama): Enter the Ollama model name (defaults to nomic-embed-text)
  • Persistence Directory: Specify the directory where the ChromaDB database will be stored
  • Collection Name: Choose a name for the ChromaDB collection
Example (JSON Input):

python md-embed.py

Follow the prompts, providing the necessary information (input file, output folder, embedding choices, etc.)
Example (Disabling Filters):

python md-embed.py --filters-off

Output

  • Cleaned Markdown Files (JSON Input): If using JSON input, the script saves cleaned markdown files to the specified output folder
  • ChromaDB Database: The script creates a ChromaDB database in the specified persistence directory, containing the embeddings and metadata
  • Logs: The logs directory contains logs of removed files (if any) and duplicate entries (if using JSON input)
  • file_to_url.json: A JSON file that maps each saved document back to its original URL
Error Handling

The script includes error handling for various scenarios, such as:

  • Invalid input file/folder paths
  • File I/O errors
  • Exceptions during data cleaning, splitting, or embedding
  • Invalid user input for prompts

Errors are logged using the logging module.
Notes

  • The script assumes that the input JSON data has "url" and "markdown" keys for each entry
  • The script uses uuid4 to generate unique IDs for each document in the vector database
  • The script processes in batches to deal with a large number of splits
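The batching-with-uuid4 behavior noted above can be sketched as follows. The `store.add_documents(documents=..., ids=...)` call matches the LangChain Chroma wrapper's signature; the batch size and function name are assumptions for illustration.

```python
from uuid import uuid4

def add_in_batches(store, docs, batch_size=500):
    """Insert docs into the vector store in fixed-size batches."""
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        ids = [str(uuid4()) for _ in batch]  # one unique ID per document
        store.add_documents(documents=batch, ids=ids)
```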

Disclaimer: This tool is provided "as is" without warranty of any kind. Use it at your own risk. Open source or go away.
