This workflow uses Sourmash to generate MinHash signatures for sequence files and perform similarity searches between them. It's implemented in WDL (Workflow Description Language) and designed to run with Cromwell or other WDL-compatible workflow engines.
- Cromwell or another WDL-compatible workflow engine
- Docker (the workflow uses the getwilds/sourmash:4.8.2 container)
- Sufficient disk space for your sequence files and their signatures
The workflow consists of two main steps:
- SketchBothSequences: Generates MinHash signatures for both query and database sequences
- RunSourmashSearch: Performs similarity search between the generated signatures
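The two steps map roughly onto the `sourmash sketch` and `sourmash search` subcommands. The exact command lines inside the WDL tasks are not shown here, so the following is only a sketch that assumes DNA input, the default parameters listed in this README (k=31, scaled=1000, threshold 0.08), and placeholder file names.

```python
import shlex


def sketch_command(sequence_file, output_sig, ksize=31, scaled=1000):
    """Build a `sourmash sketch dna` command for one sequence file.

    Only the DNA case is covered; other molecule types use different
    subcommands and parameter strings.
    """
    params = f"k={ksize},scaled={scaled}"
    return ["sourmash", "sketch", "dna", "-p", params, sequence_file, "-o", output_sig]


def search_command(query_sig, database_sig, ksize=31, threshold=0.08,
                   output_csv="search_results.csv"):
    """Build a `sourmash search` command comparing two signature files."""
    return [
        "sourmash", "search", query_sig, database_sig,
        "-k", str(ksize),
        "--threshold", str(threshold),
        "-o", output_csv,
    ]


if __name__ == "__main__":
    # File names are placeholders, not names taken from the WDL.
    for cmd in (
        sketch_command("query.fastq.gz", "query.sig"),
        sketch_command("database.fa.gz", "database.sig"),
        search_command("query.sig", "database.sig"),
    ):
        print(shlex.join(cmd))
```

Printing the commands instead of running them makes it easy to compare them against what the tasks actually execute in the Cromwell logs.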
Create a JSON file (e.g., inputs.json) with the following structure:
{
  "SourmashSketchAndSearch.query_fastq": "/path/to/query.fastq.gz",
  "SourmashSketchAndSearch.database_fasta": "/path/to/database.fa.gz"
}
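If you prefer to generate the inputs file from a script, a minimal sketch could look like the following. The helper names (`build_inputs`, `missing_local_files`) are hypothetical, and the existence check only makes sense for local paths, not for remote locations a Cromwell backend might resolve itself.

```python
import json
from pathlib import Path

WORKFLOW = "SourmashSketchAndSearch"


def build_inputs(query_fastq, database_fasta):
    """Return the inputs mapping with fully qualified workflow keys."""
    return {
        f"{WORKFLOW}.query_fastq": str(query_fastq),
        f"{WORKFLOW}.database_fasta": str(database_fasta),
    }


def missing_local_files(inputs):
    """Return the input values that do not exist on the local filesystem."""
    return [value for value in inputs.values() if not Path(value).exists()]


if __name__ == "__main__":
    inputs = build_inputs("/path/to/query.fastq.gz", "/path/to/database.fa.gz")
    for path in missing_local_files(inputs):
        print(f"warning: {path} not found locally")
    # Redirect this output to inputs.json once the paths are correct.
    print(json.dumps(inputs, indent=2))
```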
Create a JSON file (e.g., options.json) to configure workflow execution:
{
  "workflow_failure_mode": "ContinueWhilePossible",
  "write_to_cache": true,
  "read_from_cache": true,
  "default_runtime_attributes": {
    "maxRetries": 1
  },
  "final_workflow_outputs_dir": "/path/to/outputs/",
  "use_relative_output_paths": true
}
- workflow_failure_mode: Determines how the workflow handles task failures
- write_to_cache: Enables caching of task outputs for future runs
- read_from_cache: Allows reuse of cached outputs from previous runs
- maxRetries: Number of times to retry failed tasks
- final_workflow_outputs_dir: Directory for final workflow outputs
- use_relative_output_paths: Maintains relative path structure in output directory
The workflow includes several default parameters that can be overridden in your inputs JSON:
- ksize: "31" (k-mer size for sketching and searching)
- threshold: 0.08 (minimum similarity threshold for reporting matches)
- sketch_type: "dna" (molecule type: dna, protein, dayhoff, hp, or nucleotide)
- scaled: true (use scaled MinHash)
- scale_factor: 1000 (scale factor for scaled MinHash)
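As an illustration of overriding these defaults, the sketch below merges overrides into an inputs mapping. It assumes the parameters are workflow-level inputs namespaced as `SourmashSketchAndSearch.<name>`; if the WDL declares them on a task instead, the key prefix will differ, so check the WDL before relying on this. The `with_overrides` helper is hypothetical.

```python
import json

WORKFLOW = "SourmashSketchAndSearch"

# Defaults as listed in this README; note that ksize is given as a string.
DEFAULTS = {
    "ksize": "31",
    "threshold": 0.08,
    "sketch_type": "dna",
    "scaled": True,
    "scale_factor": 1000,
}


def with_overrides(inputs, **overrides):
    """Return a copy of inputs with namespaced overrides for known parameters."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    merged = dict(inputs)
    for name, value in overrides.items():
        # Assumed key layout: <workflow name>.<parameter name>
        merged[f"{WORKFLOW}.{name}"] = value
    return merged


if __name__ == "__main__":
    base = {
        f"{WORKFLOW}.query_fastq": "/path/to/query.fastq.gz",
        f"{WORKFLOW}.database_fasta": "/path/to/database.fa.gz",
    }
    print(json.dumps(with_overrides(base, ksize="21", threshold=0.1), indent=2))
```

Rejecting unknown names catches typos early, since a misspelled key in the inputs JSON is otherwise easy to overlook.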
java -jar cromwell.jar run \
  sourmash-search-workflow.wdl \
  -i inputs.json \
  -o options.json
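When launching the workflow from a script, the same invocation can be assembled programmatically. This is a minimal sketch using the file names from the command above; the `run_workflow` helper is hypothetical and assumes `java` is on your PATH and `cromwell.jar` is in the working directory.

```python
import shlex
import shutil
import subprocess


def cromwell_command(wdl="sourmash-search-workflow.wdl", inputs="inputs.json",
                     options="options.json", jar="cromwell.jar"):
    """Build the Cromwell `run` invocation as an argument list."""
    return ["java", "-jar", jar, "run", wdl, "-i", inputs, "-o", options]


def run_workflow(**kwargs):
    """Run Cromwell and return its exit code."""
    if shutil.which("java") is None:
        raise RuntimeError("java not found on PATH")
    return subprocess.run(cromwell_command(**kwargs)).returncode


if __name__ == "__main__":
    # Print the command only; call run_workflow() to actually execute it.
    print(shlex.join(cromwell_command()))
```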
The workflow produces three main outputs:
- Query sequence signature file (.sig)
- Database sequence signature file (.sig)
- Search results file (search_results.csv)
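To inspect the search results programmatically, something like the following may help. It assumes the CSV has `similarity` and `name` columns, as `sourmash search` output normally does; verify the header of your own file first. The sample rows are made up for illustration and are not real workflow output.

```python
import csv
import io


def matches_above(csv_file, min_similarity=0.08):
    """Return (name, similarity) pairs at or above the cutoff, best first."""
    hits = []
    for row in csv.DictReader(csv_file):
        similarity = float(row["similarity"])
        if similarity >= min_similarity:
            hits.append((row.get("name", ""), similarity))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = io.StringIO(
        "similarity,name,filename,md5\n"
        "0.42,example_genome_A,database.sig,abc\n"
        "0.05,example_genome_B,database.sig,def\n"
    )
    for name, similarity in matches_above(sample, min_similarity=0.1):
        print(f"{name}\t{similarity:.2f}")
```

For a real run, pass an open file handle instead, e.g. `open("search_results.csv", newline="")`.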
Default resource allocations per task:
- SketchBothSequences:
  - Memory: 4GB
  - CPU: 2
  - Disk: 50GB SSD
- RunSourmashSearch:
  - Memory: 4GB
  - CPU: 1
  - Disk: 50GB SSD
Adjust these values in the WDL file based on your input data sizes and computational resources.
- Insufficient Disk Space: Increase disk_size_gb in task runtime sections
- Memory Errors: Adjust memory_gb based on input file sizes
- Docker Pulling Failures: Ensure access to Docker Hub and correct image version
- Tasks will retry once on failure (configurable via maxRetries)
- The workflow continues executing possible tasks if one fails (ContinueWhilePossible mode)
Feel free to submit issues and enhancement requests to improve this workflow.
Distributed under the MIT License. See LICENSE for details.