The Knowledge Dumper is a Python-based tool designed to aggregate content from multiple sources into a single, comprehensive text file. This tool is particularly useful for preparing large knowledge bases for use with Large Language Models (LLMs) by consolidating data from local directories, Python files, and remote web pages.
- Merge content from websites, with conversion to Markdown.
- Include files from a selected directory, supporting filtering by patterns and file extensions.
- Selectively extract Python classes, methods, functions, and symbols.
- User-friendly graphical interface for easy project management.
- Ideal for researchers, developers, and data scientists building knowledge bases for LLMs.
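The pattern and extension filtering mentioned above could look roughly like the sketch below. It is not taken from the project's source: the function name `select_files` and its parameters are hypothetical, and it assumes filtering means "keep files with an allowed suffix unless their relative path matches an exclude pattern".

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Iterable, List


def select_files(
    root: Path,
    extensions: Iterable[str],
    exclude_patterns: Iterable[str],
) -> List[Path]:
    """Return files under `root` with an allowed suffix that match no exclude pattern.

    Hypothetical helper for illustration; the real tool may filter differently.
    """
    allowed = {ext.lower() for ext in extensions}
    excludes = list(exclude_patterns)
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        if path.suffix.lower() not in allowed:
            continue  # extension filter
        relative = path.relative_to(root).as_posix()
        if any(fnmatch(relative, pattern) for pattern in excludes):
            continue  # pattern filter, e.g. "venv/*"
        selected.append(path)
    return selected
```

A call such as `select_files(Path("."), [".py", ".md"], ["venv/*"])` would then return the Python and Markdown files outside the `venv` directory.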
- StartWindow: Interface for project selection and creation.
- MainWindow: Primary workspace for managing files, symbols, and remote content.
- DatabaseManager: Handles data storage and retrieval.
- ProjectManager: Manages the project lifecycle and recent project tracking.
- FileTreeBuilder: Creates a navigable structure of project files.
- FileMerger: Combines selected content into a single output file.
- SymbolExtractor: Analyzes Python files to extract relevant symbols.
- RemoteIndexer: Fetches and processes content from web pages.
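To illustrate what a component like SymbolExtractor does, here is a minimal sketch built on the standard-library `ast` module. It is an assumption about the general approach, not the project's actual implementation; the function name `extract_symbols` and the returned tuple format are made up for this example.

```python
import ast
from typing import List, Tuple


def extract_symbols(source: str) -> List[Tuple[str, str]]:
    """Return (kind, qualified_name) pairs for classes, functions, and methods.

    Hypothetical helper: only module-level and class-level definitions are
    collected; functions nested inside other functions are ignored.
    """
    symbols: List[Tuple[str, str]] = []

    def visit(node: ast.AST, prefix: str, in_class: bool) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                name = prefix + child.name
                symbols.append(("class", name))
                # Descend so that methods and nested classes are recorded too.
                visit(child, name + ".", True)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if in_class else "function"
                symbols.append((kind, prefix + child.name))

    visit(ast.parse(source), "", False)
    return symbols


sample = '''
class Greeter:
    def greet(self):
        return "hi"

def main():
    pass
'''

print(extract_symbols(sample))
# [('class', 'Greeter'), ('method', 'Greeter.greet'), ('function', 'main')]
```

Parsing with `ast` rather than regular expressions avoids false matches inside strings and comments, at the cost of requiring syntactically valid Python.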
This tool grew out of a personal need to gather diverse content sources into one cohesive document. Its applications are niche, but it is well suited to building knowledge bases that support LLM development and training.
- Python 3.7 or higher
- pip (Python package installer)
- Clone the repository or download the source code:
    git clone https://github.com/jacekjursza/dumper.git
    cd dumper
- Create a virtual environment (optional but recommended):
  - On Windows:
      python -m venv venv
      venv\Scripts\activate
  - On macOS and Linux:
      python3 -m venv venv
      source venv/bin/activate
- Install the required dependencies:
    pip install -r requirements.txt
- Run the application:
    python main.py
- Launch the application and create a new project or open an existing one.
- In the File Selection tab, choose the files to include in your merged output.
- Use the Symbol Index tab to view and filter extracted symbols from Python files.
- In the Remote Docs tab, add and manage remote web pages to include in your merged content.
- Click "Create output file" to generate your consolidated knowledge base file.
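The final step, producing one consolidated file, can be pictured with the sketch below. It is only an illustration under stated assumptions: the function name `merge_files` and the `=====` section header format are hypothetical, and the actual FileMerger may structure its output differently.

```python
from pathlib import Path
from typing import Iterable


def merge_files(paths: Iterable[Path], output: Path) -> int:
    """Write each file's text into `output` under a labeled header.

    Hypothetical helper for illustration. Returns the number of files merged.
    """
    count = 0
    with output.open("w", encoding="utf-8") as out:
        for path in paths:
            # Undecodable bytes are replaced rather than aborting the merge.
            text = path.read_text(encoding="utf-8", errors="replace")
            out.write(f"===== {path.as_posix()} =====\n")
            out.write(text.rstrip("\n") + "\n\n")
            count += 1
    return count
```

Labeling each section with its source path lets an LLM (or a human reader) trace any passage in the merged file back to the file it came from.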
The tool can be customized by modifying its source code: add new features, adjust the user interface, or change the merging logic to suit your needs, especially when preparing LLM knowledge bases.
Contributions to the project are welcome. Submit pull requests or open issues to suggest improvements or report bugs.
This project is licensed under the MIT License.