A robust command-line tool built in Rust that makes merging and deduplicating text files a breeze. Whether you're dealing with small files or massive datasets, this tool handles the heavy lifting with parallel processing and smart error handling.
- Smart File Merging: Feed it a list of file paths via `-i/--input-files`, and it'll combine them into a single output file (`-o/--output-files`).
- No More Duplicates: Uses a `HashSet` under the hood to ensure each line appears exactly once in your final output.
- Memory-Friendly: Processes files in 10MB chunks by default, so your RAM stays happy.
- Optimized I/O: Uses generous buffer sizes (32MB read, 16MB write) to keep things moving quickly.
- Parallel Processing: Spreads the work across 10 threads by default (but you can adjust this).
- Resource-Conscious: Chunks files to keep memory usage in check, even with large files.
- Know What's Happening: Shows you exactly where you are with progress bars for:
  - Overall progress
  - Current file
  - Deduplication status
- Your Tool, Your Rules: Tweak buffer sizes and other settings to match your needs.
- Keeps Going: Logs errors without stopping, because one bad file shouldn't ruin everything.
- UTF-8 Problems? No Problem: Skips problematic lines and keeps moving.
- Checks First: Makes sure all your input files exist and are readable before starting.
- Safe Writes: Uses atomic writing to protect your output file from corruption.
- Never Lose Progress: Creates checkpoint files as it works.
- Ctrl+C Friendly: Saves its state when interrupted so you can pick up where you left off.
- Easy Resumption: Just use `--resume <progress-file>` to continue an interrupted job.
- Knows Its Place: Keeps track of exactly where it stopped, down to the line.
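
The safe-write and checkpoint features above can be pictured with a small sketch. This is not the tool's actual code: the `Checkpoint` struct, its two-number text format, and the function names are all hypothetical, chosen only to show the common "write to a temp file, then rename" pattern applied to a progress file.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical checkpoint: which input file and which line the run had reached.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Checkpoint {
    file_index: usize,
    line_number: u64,
}

impl Checkpoint {
    /// Encode as two whitespace-separated integers (a format invented for this sketch).
    fn encode(&self) -> String {
        format!("{} {}\n", self.file_index, self.line_number)
    }

    /// Decode the format above; returns None if the text is malformed.
    fn decode(text: &str) -> Option<Checkpoint> {
        let mut parts = text.split_whitespace();
        let file_index = parts.next()?.parse().ok()?;
        let line_number = parts.next()?.parse().ok()?;
        Some(Checkpoint { file_index, line_number })
    }
}

/// Write `contents` to a sibling temp file, sync it to disk, then rename it
/// over `dest`, so readers see either the old file or the complete new one.
fn write_atomically(dest: &Path, contents: &[u8]) -> io::Result<()> {
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents)?;
    file.sync_all()?;
    fs::rename(&tmp, dest)
}

/// Persist the checkpoint using the atomic write helper.
fn save_checkpoint(path: &Path, checkpoint: &Checkpoint) -> io::Result<()> {
    write_atomically(path, checkpoint.encode().as_bytes())
}

/// Load a checkpoint; Ok(None) means the file exists but could not be parsed.
fn load_checkpoint(path: &Path) -> io::Result<Option<Checkpoint>> {
    Ok(Checkpoint::decode(&fs::read_to_string(path)?))
}
```

On resume, a run built this way would call `load_checkpoint`, skip ahead to `file_index` and `line_number`, and keep calling `save_checkpoint` as it goes. Keeping the temp file next to the destination matters, because a rename is only atomic within one filesystem.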
Robert Pimentel
- GitHub: @pr0b3r7
- LinkedIn: pimentelrobert1
- Website: hackerhermanos.com
This project relies on several high-quality Rust crates to provide its functionality:
- tokio (1.36) - Asynchronous runtime powering parallel processing
- clap (4.4) - Command-line argument parsing
- serde (1.0) - Serialization framework for configuration
- anyhow (1.0.91) - Error handling with context
- async-compression (0.4.17) - Handles various compression formats (bzip2, gzip, xz)
- zip (2.2.0) - ZIP archive support
- unrar (0.5.6) - RAR archive support
- sevenz-rust (0.6.1) - 7z archive support
- tar (0.4.42) - TAR archive support
- indicatif (0.17) - Progress bars and spinners
- dialoguer (0.11.0) - Interactive command prompts
- crossterm (0.28.1) - Terminal manipulation
- terminal_size (0.4.0) - Terminal dimensions detection
- chrono (0.4.38) - Date and time handling
- uuid (1.11.0) - Unique identifier generation
- sha2 (0.10.8) - Cryptographic hashing
- encoding_rs (0.8.35) - Character encoding support
- sys-info (0.9.1) - System information gathering
- reqwest (0.12.9) - HTTP client with streaming support
- url (2.5.2) - URL parsing and manipulation
- env_logger (0.11.5) - Environment-based logging
- log (0.4.22) - Logging framework
- thiserror (1.0.65) - Custom error types
- ctrlc (3.4.5) - Ctrl+C signal handling
- signal-hook (0.3.17) - OS signal handling
- Rust toolchain (1.70+)
- Cargo package manager
- Grab the code:
    git clone https://github.com/yourusername/file-merger-tool.git
    cd file-merger-tool
- Build it:
    cargo build --release
- Want it system-wide? (Optional):
    sudo cp target/release/file-merger-tool /usr/local/bin/
file-merger-tool merge -w input_list.txt --output-wordlist merged_output.txt
Usage: rustmerger [OPTIONS] <COMMAND>

Commands:
  merge            Merge wordlists and rules
  generate-config  Generate configuration file
  guided-setup     Run guided setup
  resume           Resume interrupted operation
  help             Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose...             Set verbosity level (-v: debug, -vv: trace)
      --log-level <LOG_LEVEL>  [default: info]
  -h, --help                   Print help
  -V, --version                Print version

Usage: rustmerger merge [OPTIONS]

Options:
  -v, --verbose...              Set verbosity level (-v: debug, -vv: trace)
  -w, --wordlists-file <FILE>   Text file containing one wordlist path per line
  -r, --rules-file <FILE>       Text file containing one rule path per line
      --output-wordlist <FILE>  Destination path for merged and deduplicated wordlist
      --output-rules <FILE>     Destination path for merged and deduplicated rules
  -c, --config <FILE>           JSON configuration file with default settings
      --progress-file <FILE>    Save progress state for resume capability
  -d, --debug                   Enable detailed progress output
  -h, --help                    Print help
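
Since `-w` and `-r` each take a text file with one path per line, and the tool checks that inputs exist and are readable before starting, here is a rough sketch of what that pre-flight step might look like. The function names are hypothetical, and trimming whitespace and skipping blank lines are assumptions rather than documented behavior.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Read a list file containing one path per line.
/// Assumption: surrounding whitespace is trimmed and blank lines are ignored.
fn read_path_list(list_file: &Path) -> io::Result<Vec<PathBuf>> {
    let text = fs::read_to_string(list_file)?;
    Ok(text
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(PathBuf::from)
        .collect())
}

/// Split paths into (readable, unreadable) by checking that each one is a
/// regular file that can actually be opened.
fn partition_readable(paths: Vec<PathBuf>) -> (Vec<PathBuf>, Vec<PathBuf>) {
    paths
        .into_iter()
        .partition(|p| p.is_file() && fs::File::open(p).is_ok())
}
```

A caller could log every entry in the second list and carry on with the first, which matches the "one bad file shouldn't ruin everything" approach described earlier.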

Usage: rustmerger generate-config [OPTIONS] <FILE>

Arguments:
  <FILE>  Destination path for configuration file

Options:
  -t, --template    Generate default configuration template
  -v, --verbose...  Set verbosity level (-v: debug, -vv: trace)
  -h, --help        Print help

Usage: rustmerger guided-setup [OPTIONS] <FILE>

Arguments:
  <FILE>  Destination path for interactive configuration

Options:
  -v, --verbose...  Set verbosity level (-v: debug, -vv: trace)
  -h, --help        Print help

{
  "input_files": "/tmp/wordlists_to_merge_dev.txt",
  "output_files": "/tmp/merged_wordlist.txt",
  "threads": 90,
  "verbose": true,
  "debug": true
}
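
To make the shape of that configuration concrete, here is a sketch of a Rust struct with the same five keys. It is an illustration only: the project lists serde for configuration, but this version stays std-only, and the `Config` name, the `validate` method, and the `false` defaults for `verbose` and `debug` are assumptions. The default of 10 threads comes from the feature list.

```rust
/// Settings mirroring the keys in the sample JSON configuration.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    input_files: String,
    output_files: String,
    threads: usize,
    verbose: bool,
    debug: bool,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            input_files: String::new(),
            output_files: String::new(),
            threads: 10, // documented default thread count
            verbose: false, // assumed default
            debug: false,   // assumed default
        }
    }
}

impl Config {
    /// Hypothetical sanity check to run before any file is touched.
    fn validate(&self) -> Result<(), String> {
        if self.threads == 0 {
            return Err("threads must be at least 1".to_string());
        }
        if self.input_files.is_empty() || self.output_files.is_empty() {
            return Err("input_files and output_files must both be set".to_string());
        }
        Ok(())
    }
}
```

Validating up front keeps a typo in the config from surfacing halfway through a long merge.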
The heavy lifting happens in the `FileProcessor` struct (`src/processing.rs`). Here's what makes it tick:
- Smart File Reading:
  - Uses async I/O with `tokio` for non-blocking file access
  - Buffers reads to minimize system calls
- Reliable Error Handling:
  - Logs issues but keeps going
  - Won't let one bad file stop the whole show
- Line-by-Line Processing:
  - Handles each line individually
  - Gracefully skips UTF-8 issues
- Progress Tracking:
  - Keeps tabs on processed files
  - Makes resuming interrupted jobs seamless
- Parallel Power:
  - Spreads work across multiple threads (default: 10)
  - Built on `tokio` for efficient async processing
- Smart Deduplication:
  - Uses `HashSet` for O(1) lookups
  - Keeps memory usage in check
- Visual Feedback:
  - Real-time progress bars
  - Shows you exactly what's happening
- Interruption-Proof:
  - Handles Ctrl+C gracefully
  - Saves progress for later
  - Managed by `AppState` in `src/app_state.rs`
- Flexible Configuration:
  - JSON config support via `--config <path>`
  - Interactive setup with the `guided-setup` command
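
The line-by-line processing and deduplication steps above can be sketched in a few lines of synchronous Rust. This is a simplified stand-in, not the real `FileProcessor`: the actual tool uses `tokio` and chunked reads, while `merge_unique` and `MergeStats` are hypothetical names invented here. The sketch reads raw bytes so that one invalid UTF-8 line can be skipped without aborting the file.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead, Write};

/// Counters reported after one input has been merged.
#[derive(Debug, Default, PartialEq, Eq)]
struct MergeStats {
    written: usize,
    duplicates: usize,
    invalid_utf8: usize,
}

/// Copy each previously unseen line from `input` to `out`.
/// `seen` is shared across calls so duplicates are caught across files too.
fn merge_unique<R: BufRead, W: Write>(
    input: R,
    seen: &mut HashSet<String>,
    out: &mut W,
) -> io::Result<MergeStats> {
    let mut stats = MergeStats::default();
    // Split on raw bytes so invalid UTF-8 affects only the line it is on.
    for raw in input.split(b'\n') {
        let mut bytes = raw?;
        // Drop a trailing carriage return from Windows-style line endings.
        if bytes.last() == Some(&b'\r') {
            bytes.pop();
        }
        match String::from_utf8(bytes) {
            // Skip the problematic line and keep going.
            Err(_) => stats.invalid_utf8 += 1,
            Ok(line) => {
                if seen.contains(&line) {
                    stats.duplicates += 1;
                } else {
                    out.write_all(line.as_bytes())?;
                    out.write_all(b"\n")?;
                    seen.insert(line);
                    stats.written += 1;
                }
            }
        }
    }
    Ok(stats)
}
```

Calling `merge_unique` once per input file with the same `seen` set gives a merged output in first-seen order. Note that the set holds every unique line in memory, so in this simple form memory use grows with the number of distinct lines rather than with the chunk size.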
This tool is built to be reliable, efficient, and adaptable to your needs. Whether you're merging a few files or processing thousands, it's got you covered.