sash workflow details

Overview
HMFtools WiGiTs
Other Tools
Pipeline Inputs
Workflows
Common Reports
sash Module Outputs
Coverage
Reference Data
FAQ

Overview

The sash Workflow is a genomic analysis framework comprising three primary pipelines:

Somatic Small Variants (SNV somatic): Detects single nucleotide variants (SNVs) and indels in tumor samples, emphasizing clinical relevance.
Somatic Structural Variants (SV somatic): Identifies large-scale genomic alterations (deletions, duplications, etc.) and integrates copy number data.
Germline Variants (SNV germline): Focuses on inherited variants linked to cancer predisposition.

These pipelines utilize Bolt (a Python package designed for modular processing) and leverage outputs from the DRAGEN Variant Caller alongside the Hartwig Medical Foundation (HMF) tools integrated via Oncoanalyser. Each pipeline is tailored to a specific type of genomic variant, incorporating filtering, annotation and HTML reports for research and curation.

HMFtools

HMFtools is an open-source suite for cancer genomics developed by the Hartwig Medical Foundation. Key components used in sash include:

SAGE (Somatic Alterations in Genome): A tiered SNV/indel caller targeting cancer hotspots from databases including Cancer Genome Interpreter, CIViC, and OncoKB to recover low-frequency variants missed by DRAGEN. Outputs a VCF with confidence tiers (hotspot, panel, high/low confidence).
PURPLE: Estimates tumor purity (tumor cell fraction) and ploidy (average copy number), integrates copy number data, and calculates TMB (tumor mutation burden) and MSI (microsatellite instability).
Cobalt: Calculates read-depth ratios from sequencing data, providing essential input for copy number analysis. Its outputs are used by PURPLE to generate accurate copy number profiles across the genome.
Amber: Computes B-allele frequencies, which are critical for estimating tumor purity and ploidy. The Amber directory contains these measurements, supporting PURPLE's analysis.

Other Tools

SIGRAP

A framework for running PCGR and other genomic reporting tools.

Personal Cancer Genome Reporter (PCGR)

Tool for comprehensive clinical interpretation of somatic variants, providing tiered classifications and extensive annotation.

Cancer Predisposition Sequencing Report (CPSR)

Tool for predisposition variant analysis and reporting in germline samples.

Genomics Platform Group Reporting (GPGR)

UMCCR-developed R package for generating cancer genomics reports.

Linx

Tool for structural variant annotation and visualization to classify complex rearrangements.

ESVEE

Esvee is a structural variant caller optimised for short-read sequencing that identifies somatic and germline rearrangements.

VIRUSBreakend

Tool for detecting viral integration events in human genome sequencing data.

Pipeline Inputs

DRAGEN

${tumor_id}.hard-filtered.vcf.gz: Somatic variant calls from the DRAGEN pipeline.
Optional: ${tumor_id}.hrdscore.csv: homologous recombination deficiency (HRD) scores (surfaced in the cancer report when present).

Oncoanalyser

ESVEE

${tumor_id}.esvee.ref_depth.vcf.gz and the accompanying esvee/ directory: depth and preparation files used to seed eSVee structural variant calling.

SAGE

{tumor_id}.sage.somatic.vcf.gz: Somatic SNV/indel calls from SAGE.

VIRUSBreakend

Directory: virusbreakend/: Contains outputs from VIRUSBreakend, used for detecting viral integration events.

Cobalt

Directory: cobalt/: Contains read-depth ratio data required for copy number analysis by PURPLE.

Amber

Directory: amber/: Contains B-allele frequency measurements used by PURPLE to estimate tumor purity and ploidy.

CHORD

File: chord/{tumor_id}.chord.prediction.tsv (optional): HRD predictions generated by oncoanalyser; incorporated into the cancer report when present.

Workflows

Somatic Small Variants

General

In the Somatic Small Variants workflow, variant detection is performed using the DRAGEN Variant Caller and Oncoanalyser (relying on SAGE and PURPLE outputs). It's structured into four steps: Re-calling, Annotation, Filter, and Report. The final outputs include an HTML report summarizing the results.

Summary

Re-call SAGE variants to recover low-frequency mutations in hotspots.
Annotate variants with clinical and functional information using PCGR.
Filter variants based on quality and frequency criteria, while retaining those of potential clinical significance.
Generate comprehensive HTML reports (PCGR, Cancer Report, LINX, MultiQC).

Variant Calling Re-calling

The re-calling step uses variants from SAGE, which is more sensitive than DRAGEN in detecting variants, particularly those with low allele frequency. SAGE focuses on cancer hotspots, prioritizing predefined genomic regions of high clinical or biological relevance with its filtering system. This enables the re-calling of biologically significant variants that DRAGEN may otherwise have missed.

Inputs

From DRAGEN: Somatic small variant caller VCF
- ${tumor_id}.main.dragen.vcf.gz
From Oncoanalyser: SAGE VCF
- ${tumor_id}.main.sage.filtered.vcf.gz
Filtered on chromosomes 1-22, X, Y, and M.

Output

Re-calling: VCF
- ${tumor_id}.rescued.vcf.gz

Steps

Select High-Confidence SAGE Calls in Hotspot Regions:
- Filter the SAGE output to retain only variants that pass quality filters and overlap with known hotspot regions.
- Compare the input VCF and the SAGE VCF to identify overlapping and unique variants.
Annotate somatic variant calls in the input VCF that are also present in the SAGE calls:
- For each variant in the input VCF, check whether it also exists in the SAGE calls.
- For variants also called by SAGE:
  - If SAGE FILTER=PASS and input VCF FILTER=PASS:
    - Set INFO/SAGE_HOTSPOT to indicate the variant is called by SAGE in a hotspot.
  - If SAGE FILTER=PASS and input VCF FILTER is not PASS:
    - Set INFO/SAGE_HOTSPOT and INFO/SAGE_RESCUE to indicate the variant is re-called from SAGE.
    - Update FILTER=PASS to include the variant in the final analysis.
  - If SAGE FILTER is not PASS:
    - Append SAGE_lowconf to the FILTER field to flag low-confidence variants.
- Transfer SAGE FORMAT fields to the input VCF with a SAGE_ prefix.
Combine annotated input VCF with novel SAGE calls:
- Prepare novel SAGE calls. For each variant in the SAGE VCF missing from the input VCF:
  - Rename certain FORMAT fields in the novel SAGE VCF to avoid namespace collisions:
    - For example, FORMAT/SB is renamed to FORMAT/SAGE_SB.
  - Retain necessary INFO and FORMAT annotations while removing others to streamline the data.
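
The FILTER/INFO decision rules above can be sketched in a few lines of Python. This is a minimal illustration, not Bolt's implementation: the `Call` class and `annotate_with_sage` function are hypothetical names, and the sketch assumes that appending `SAGE_lowconf` replaces an existing `PASS` (since `PASS` cannot coexist with other FILTER values in a VCF).

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """Minimal stand-in for a VCF record (hypothetical, not Bolt's data model)."""
    filters: set = field(default_factory=lambda: {"PASS"})
    info: dict = field(default_factory=dict)


def annotate_with_sage(input_call: Call, sage_call: Call) -> Call:
    """Apply the SAGE rules to a variant present in both the input and SAGE VCFs."""
    sage_pass = sage_call.filters == {"PASS"}
    input_pass = input_call.filters == {"PASS"}

    if sage_pass:
        # SAGE called the variant confidently in a hotspot.
        input_call.info["SAGE_HOTSPOT"] = True
        if not input_pass:
            # The input caller filtered it: re-call it from SAGE.
            input_call.info["SAGE_RESCUE"] = True
            input_call.filters = {"PASS"}
    else:
        # Assumption: SAGE_lowconf replaces PASS rather than sitting beside it.
        input_call.filters = (input_call.filters - {"PASS"}) | {"SAGE_lowconf"}
    return input_call
```

For example, a call that the input VCF filtered as `LowDepth` but that SAGE passes ends up with `FILTER=PASS` and both `SAGE_HOTSPOT` and `SAGE_RESCUE` set.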

Annotation

The Annotation process employs Reference Sources (GA4GH/GIAB problem region stratifications, GIAB high confidence regions, gnomAD, Hartwig hotspots), UMCCR panel of normals (built from approximately 200 normal samples), and the PCGR tool to enrich variants with classification and clinical information. These annotations are used to decide which variants are retained or filtered in the next step.

Inputs

Small variant VCF
- ${tumor_id}.rescued.vcf.gz

Output

Annotated VCF
- ${tumor_id}.annotations.vcf.gz

Steps

Set FILTER to "PASS" for unfiltered variants:
- Iterate over the input VCF file and set the FILTER field to PASS for any variants that currently have no filter status (FILTER is . or None).
Annotate the VCF against reference sources:
- Use vcfanno to add annotations to the VCF file:
  - gnomAD (version 2.1)
  - Hartwig Hotspots
  - ENCODE Blacklist
  - Genome in a Bottle High-Confidence Regions (v4.2.1)
  - Low and High GC Regions (< 30% or > 65% GC content, compiled by GA4GH)
  - Bad Promoter Regions (compiled by GA4GH)
Annotate with UMCCR panel of normals counts:
- Use vcfanno and bcftools to annotate the VCF with counts from the UMCCR panel of normals.
Standardize the VCF fields:
- Add new INFO fields for use with PCGR:
  - TUMOR_AF, NORMAL_AF: Tumor and normal allele frequencies.
  - TUMOR_DP, NORMAL_DP: Tumor and normal read depths.
- Add the AD FORMAT field:
  - AD: Allelic depths for the reference and alternate alleles.
Prepare VCF for PCGR annotation:
- Create a minimal VCF header that keeps only the INFO AF/DP definitions and the contig sizes.
- Move tumor and normal FORMAT/AF and FORMAT/DP annotations to the INFO field as required by PCGR.
- Set FILTER to PASS and remove all FORMAT and sample columns.
Run PCGR (v1.4.1) to annotate VCF against external sources:
- Classify variants by tiers based on annotations and functional impact according to AMP/ASCO/CAP guidelines.
- Add INFO fields into the VCF: TIER, SYMBOL, CONSEQUENCE, MUTATION_HOTSPOT, TCGA_PANCANCER_COUNT, CLINVAR_CLNSIG, ICGC_PCAWG_HITS, COSMIC_CNT.
- External sources include VEP, ClinVar, COSMIC, TCGA, ICGC, Open Targets Platform, CancerMine, DoCM, CBMDB, DisGeNET, Cancer Hotspots, dbNSFP, UniProt/SwissProt, Pfam, DGIdb, and ChEMBL.
Transfer PCGR annotations to the full set of variants:
- Merge the PCGR annotations back into the original VCF file.
- Ensure that all variants, including those not selected for PCGR annotation, have relevant clinical annotations where available.
- Preserve the FILTER statuses and other annotations from the original VCF.
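
Two of the steps above (defaulting an empty FILTER to PASS, and moving per-sample AF/DP values into INFO for PCGR) are simple enough to sketch directly. The function names and the dictionary layout below are assumptions made for illustration; the pipeline itself does this with vcfanno, bcftools and Bolt.

```python
def normalise_filter(value):
    """Return PASS when a record has no filter status ('.' or None)."""
    return "PASS" if value in (None, ".") else value


def format_to_info(tumor: dict, normal: dict) -> dict:
    """Copy tumor/normal FORMAT AF and DP into the INFO keys named above."""
    return {
        "TUMOR_AF": tumor["AF"],
        "TUMOR_DP": tumor["DP"],
        "NORMAL_AF": normal["AF"],
        "NORMAL_DP": normal["DP"],
    }


def info_string(info: dict) -> str:
    """Serialise INFO key/value pairs in VCF style (KEY=VALUE;KEY=VALUE)."""
    return ";".join(f"{key}={value}" for key, value in info.items())
```

The resulting INFO string can then be written to a sample-free VCF, which matches the "remove all FORMAT and sample columns" requirement.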

Filter

The Filter step applies a series of stringent filters to somatic variant calls in the VCF file, ensuring the retention of high-confidence and biologically meaningful variants.

Inputs

Annotated VCF
- ${tumor_id}.annotations.vcf.gz

Output

Filtered VCF
- ${tumor_id}*filters_set.vcf.gz

Filters

Variants that match any of the following criteria are filtered out unless they qualify for a Clinical Significance Exception:

| Filter Type | Threshold/Criteria |
| --- | --- |
| Allele Frequency (AF) Filter | Tumor AF < 10% (0.10) |
| Allele Depth (AD) Filter | Fewer than 4 supporting reads (6 in low-complexity regions) |
| Non-GIAB AD Filter | Stricter thresholds outside GIAB high-confidence regions |
| Problematic Genomic Regions Filter | Overlap with ENCODE blacklist, bad promoter, or low-complexity regions |
| Population Frequency (gnomAD) Filter | gnomAD AF ≥ 1% (0.01) |
| Panel of Normals (PoN) Germline Filter | Present in ≥ 5 normal samples or PoN AF > 20% (0.20) |

Clinical Significance Exceptions

| Exception Category | Criteria |
| --- | --- |
| Reference Database Hit Count | COSMIC count ≥ 10 OR TCGA pan-cancer count ≥ 5 OR ICGC PCAWG count ≥ 5 |
| ClinVar Pathogenicity | ClinVar classification of `conflicting_interpretations_of_pathogenicity`, `likely_pathogenic`, `pathogenic`, or `uncertain_significance` |
| Mutation Hotspots | Annotated as `HMF_HOTSPOT`, `PCGR_MUTATION_HOTSPOT`, or SAGE Hotspots (CGI, CIViC, OncoKB) |
| PCGR Tier Exception | Classified as `TIER_1` OR `TIER_2` |
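
The interaction between the filters and the exceptions can be sketched as below. This is a simplified illustration under stated assumptions: the dictionary keys and filter labels are hypothetical, and the region-based filters (ENCODE blacklist, bad promoter, non-GIAB thresholds) are left out because their exact thresholds are not given here.

```python
CLINVAR_EXCEPTIONS = {
    "conflicting_interpretations_of_pathogenicity",
    "likely_pathogenic",
    "pathogenic",
    "uncertain_significance",
}


def has_clinical_exception(v: dict) -> bool:
    """True if any Clinical Significance Exception applies to the variant."""
    return bool(
        v.get("cosmic_count", 0) >= 10
        or v.get("tcga_pancancer_count", 0) >= 5
        or v.get("icgc_pcawg_count", 0) >= 5
        or v.get("clinvar") in CLINVAR_EXCEPTIONS
        or v.get("is_hotspot", False)
        or v.get("pcgr_tier") in {"TIER_1", "TIER_2"}
    )


def triggered_filters(v: dict) -> list:
    """Return the (hypothetical) labels of every filter the variant triggers."""
    failed = []
    if v["tumor_af"] < 0.10:
        failed.append("min_AF")
    # Low-complexity regions require more supporting reads.
    min_alt_reads = 6 if v.get("low_complexity", False) else 4
    if v["alt_reads"] < min_alt_reads:
        failed.append("min_AD")
    if v.get("gnomad_af", 0.0) >= 0.01:
        failed.append("gnomAD_common")
    if v.get("pon_count", 0) >= 5 or v.get("pon_af", 0.0) > 0.20:
        failed.append("PoN")
    return failed


def is_retained(v: dict) -> bool:
    """Keep a variant if it triggers no filter or qualifies for an exception."""
    return has_clinical_exception(v) or not triggered_filters(v)
```

Checking the exception first means a `TIER_1` variant with a tumor AF of 5% is still retained, which is the behaviour the two tables describe.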

Reports

The Report step utilizes the Personal Cancer Genome Reporter (PCGR) and other tools to generate comprehensive reports.

Inputs

Purple purity data
Filtered VCF
- ${tumor_id}*filters_set.vcf.gz
DRAGEN VCF
- ${tumor_id}.main.dragen.vcf.gz

Output

PCGR Cancer report
- ${tumor_id}.pcgr.grch38.html

Steps

Generate BCFtools Statistics on the Input VCF:
- Run bcftools stats to gather statistics on variant quality and distribution.
Calculate Allele Frequency Distributions:
- Filter and normalize variants according to high-confidence regions.
- Extract allele frequency data from tumor samples.
- Produce both a global allele frequency summary and a subset of allele frequencies restricted to key cancer genes.
Compare Variant Counts From Two Variant Sets (DRAGEN vs. BOLT):
- Count the total number and types of variants (SNPs, Indels, Others) passing filters in both the DRAGEN VCF and the Filtered BOLT VCF.
Count Variants by Processing Stage.
Parse Purity and Ploidy Information (Purple Data).
Run PCGR (GRCh38 VEP 113 / pcgr_ref_data.20250314) to generate the final report. If PCGR struggles with very large VCFs, tune chunking with --pcgr_variant_chunk_size to cap variants per batch.
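
The DRAGEN vs. BOLT comparison boils down to tallying passing variants by type in each VCF. The pipeline relies on bcftools for statistics; the sketch below only illustrates the tally in Python, with hypothetical function names and an assumed classification rule (single-base substitution is a SNP, differing allele lengths is an indel, anything else is "Other").

```python
from collections import Counter


def variant_type(ref: str, alt: str) -> str:
    """Classify a single REF/ALT pair (assumed rule, for illustration only)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "Indel"
    return "Other"  # e.g. multi-nucleotide substitutions of equal length


def count_passing(records) -> Counter:
    """Count PASS variants by type; records are (ref, alt, filter) tuples."""
    return Counter(
        variant_type(ref, alt) for ref, alt, flt in records if flt == "PASS"
    )
```

Running `count_passing` once on the DRAGEN records and once on the filtered BOLT records gives two counters that can be placed side by side in the report.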

VCF to MAF conversion

After filtering, the pipeline converts the somatic VCF to MAF using vcf2maf (v1.6.22) for downstream tools that expect MAF format.

Output

MAF file for the tumour/normal pair
- ${tumor_id}.maf

Somatic Structural Variants

The Somatic Structural Variants (SVs) pipeline identifies and annotates large-scale genomic alterations, including deletions, duplications, inversions, insertions, and translocations in tumor samples. Calls now come from eSVee (replacing GRIDSS/GRIPSS), but the downstream PURPLE/SnpEff/prioritisation steps remain unchanged.

Summary

eSVee filtering:
- Refines the structural variant calls using read counts, panel-of-normals, known fusion hotspots, and repeat masker annotations.
PURPLE:
- Combines the eSVee-filtered SV calls with copy number variation (CNV) data and tumor purity/ploidy estimates.
Annotation:
- Combines SV calls with CNV data and annotates using SnpEff.
Prioritization:
- Prioritizes SV annotations with rules derived from the AstraZeneca-NGS simple_sv_annotation tool, using curated reference data.
Report:
- Generates cancer report and MultiQC output.

Inputs

eSVee (GRIDSS/GRIPSS replacement)
- ${tumor_id}.esvee.somatic.vcf.gz

Steps

eSVee filtering:
- Evaluate split-read and paired-end support; discard variants with low support.
- Apply panel-of-normals filtering to remove artifacts observed in normal samples.
- Retain variants overlapping known oncogenic fusion hotspots (using UMCCR-curated lists).
- Exclude variants in repetitive regions based on Repeat Masker annotations.
PURPLE:
- Merge SV calls with CNV segmentation data.
- Estimate tumor purity and ploidy.
- Adjust SV breakpoints based on copy number transitions.
- Classify SVs as somatic or germline.
Annotation:
- Compile SV and CNV information into a unified VCF file.
- Extend the VCF header with PURPLE-related INFO fields (e.g., PURPLE_baf, PURPLE_copyNumber).
- Convert CNV records from TSV format into VCF records with appropriate SVTYPE tags (e.g., 'DUP' for duplications, 'DEL' for deletions).
- Run SnpEff to annotate the unified VCF with functional information such as gene names, transcript effects, and coding consequences.
Prioritization:
- Run the prioritization module (forked from the AstraZeneca simple_sv_annotation tool) using reference data files including known fusion pairs, known fusion 5′ and 3′ lists, key genes, and key tumor suppressor genes.
- Classify Variants:
  - Structural Variants (SVs): Variants labeled with the source sv_esvee.
  - Copy Number Variants (CNVs): Variants labeled with the source cnv_purple.
Prioritize variants on a 4-tier system using prioritize_sv:
- 1 (high) - 2 (moderate) - 3 (low) - 4 (no interest)
- Exon loss:
  - On cancer gene list (1)
  - Other (2)
- Gene fusion:
  - Paired (hits two genes):
    - On list of known pairs (1) (curated by HMF)
    - One gene is a known promiscuous fusion gene (1) (curated by HMF)
    - On list of FusionCatcher known pairs (2)
    - Other:
      - One or two genes on cancer gene list (2)
      - Neither gene on cancer gene list (3)
  - Unpaired (hits one gene):
    - On cancer gene list (2)
    - Others (3)
- Upstream or downstream: A specific type of fusion where one gene comes under the control of another gene's promoter, potentially leading to overexpression (oncogene) or underexpression (tumor suppressor gene):
  - On cancer gene list genes (2)
- LoF or HIGH impact in a tumor suppressor:
  - On cancer gene list (2)
  - Other TS gene (3)
- Other (4)
Filter Low-Quality Calls:
- Apply Quality Filters:
  - Keep variants with sufficient read support (e.g., split reads (SR) ≥ 5 and paired reads (PR) ≥ 5).
  - Exclude Tier 3 and Tier 4 variants where SR < 5 and PR < 5.
  - Exclude Tier 3 and Tier 4 variants where SR < 10, PR < 10, and allele frequencies (AF0 or AF1) are below 0.1.
Report:
- Generate MultiQC and cancer report outputs.
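
The gene-fusion tiering and the low-quality filter described above can be sketched as follows. This is an illustration under assumptions, not the `prioritize_sv` code: function and parameter names are hypothetical, gene pairs are treated as unordered (the real reference data distinguishes 5′ and 3′ partners), "AF0 or AF1 below 0.1" is read as "either value below 0.1", and Tier 1 and Tier 2 calls are assumed to be exempt from the read-support exclusions.

```python
def fusion_tier(genes, known_pairs, promiscuous, fusioncatcher_pairs, cancer_genes):
    """Return the tier (1-3) for a gene fusion hitting one or two genes."""
    genes = tuple(genes)
    if len(genes) == 2:  # paired fusion
        pair = frozenset(genes)
        if pair in known_pairs:
            return 1
        if any(g in promiscuous for g in genes):
            return 1
        if pair in fusioncatcher_pairs:
            return 2
        return 2 if any(g in cancer_genes for g in genes) else 3
    # Unpaired fusion: only one gene is hit.
    return 2 if genes[0] in cancer_genes else 3


def keep_sv(tier, sr, pr, af0, af1):
    """Apply the read-support exclusions to Tier 3 and Tier 4 calls."""
    if tier in (3, 4):
        if sr < 5 and pr < 5:
            return False
        if sr < 10 and pr < 10 and (af0 < 0.1 or af1 < 0.1):
            return False
    return True
```

Under this reading, a Tier 3 call with SR = 8, PR = 8 and both allele frequencies at 0.5 is kept, while the same call with AF0 = 0.05 is excluded.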

Germline Small Variants

The Germline Small Variants workflow selects passing variants that fall within the gene panel transcript regions (built from the PMCC familial cancer clinic list) and then generates a CPSR report.

Inputs

DRAGEN VCF
- ${normal_id}.hard-filtered.vcf.gz

Output

CPSR report
- ${normal_id}.cpsr.grch38.html

Steps

Prepare:
- Selection of Passing Variants:
  - Raw germline variant calls from DRAGEN are filtered to retain only those variants marked as PASS (or with no filter flag).
- Selection of Gene Panel Variants:
  - The filtered variants are further restricted to regions defined by the gene panel transcript regions file, based on the PMCC familial cancer clinic list.
Report:
- Generate CPSR (Cancer Predisposition Sequencing Report) summarizing germline findings.
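
The two Prepare selections can be sketched as a single pass over the records. The names below are hypothetical, the coordinates in any example are illustrative only, and panel regions are assumed to be BED-style intervals (0-based start, end-exclusive) compared against 1-based VCF positions.

```python
def in_panel(chrom, pos, regions):
    """True if a 1-based VCF position falls inside any BED-style region."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def select_germline(records, regions):
    """Keep PASS (or unfiltered) records that overlap the gene panel regions."""
    return [
        r
        for r in records
        if r["filter"] in ("PASS", ".", None)
        and in_panel(r["chrom"], r["pos"], regions)
    ]
```

With a single region `("chr1", 100, 200)`, a PASS record at position 150 is selected, whereas a record at position 100 is not, because the BED start is 0-based.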

Common Reports

Cancer Report

UMCCR cancer report containing:

Tumor Mutation Burden (TMB)

Data Source: filtered somatic VCF
Tool: PURPLE

Mutational Signatures

Data Source: filtered somatic SNV VCF (Sigrap MutationalPatterns output)
Tool: Sigrap (MutationalPatterns wrapper)

Contamination Score

Data Source: –
Note: No dedicated contamination metric is currently generated

Purity & Ploidy

Data Source: COBALT (providing read-depth ratios) and AMBER (providing B-allele frequency measurements)
Tool: PURPLE, which uses these inputs to compute sample purity (percentage of tumor cells) and overall ploidy (average copy number)

HRD Score

Data Source: optional DRAGEN HRD score (${tumor_id}.hrdscore.csv), Sigrap HRDetect JSON, and oncoanalyser CHORD predictions
Tool: DRAGEN HRD, Sigrap HRDetect, and CHORD

MSI (Microsatellite Instability)

Data Source: Indels in microsatellite regions from SNV/CNV
Tool: PURPLE

Structural Variant Metrics

Data Source: eSVee SV VCF and PURPLE CNV segmentation
Tools: eSVee, PURPLE, and the AstraZeneca simple_sv_annotation prioritisation rules

Copy Number Metrics (Segments, Deleted Genes, etc.)

Data Source: PURPLE CNV outputs (segmentation files, gene-level CNV TSV)
Tool: PURPLE

The LINX report includes the following:

Tables of Variants:
- Breakends
- Links
- Driver Catalog
Plots:
- Cluster-Level Plots

MultiQC

General Stats: Overview of QC metrics aggregated from all tools, providing high-level sample quality information.

DRAGEN: Mapping metrics (mapped reads, paired reads, duplicated alignments, secondary alignments), WGS coverage (average depth, cumulative coverage, per-contig coverage), fragment length distributions, trimming metrics, and time metrics for pipeline steps.

PURPLE: Sample QC status (PASS/FAIL), ploidy, tumor purity, polyclonality percentage, tumor mutational burden (TMB), microsatellite instability (MSI) status, and variant metrics for somatic and germline SNPs/indels.

BcfTools Stats: Variant substitution types, SNP and indel counts, quality scores, variant depth, and allele frequency metrics for both somatic and germline variants.

DRAGEN-FastQC: Per-base sequence quality, per-sequence quality scores, GC content (per-sequence and per-position), HRD score, sequence length distributions, adapter contamination, and sequence duplication levels.

PCGR

The Personal Cancer Genome Reporter (PCGR) generates a comprehensive, interactive HTML report that consolidates filtered and annotated variant data, providing detailed insights into the somatic variants identified.

Key Metrics:

Variant Classification and Tier Distribution: PCGR categorizes variants into tiers based on their clinical and biological significance. The report details the proportion of variants across different tiers, indicating their potential clinical relevance.
Mutational Signatures: The report includes analysis of mutational signatures, offering insights into the mutational processes active in the tumor.
Copy Number Alterations (CNAs): Visual representations of CNAs are provided, highlighting significant gains and losses across the genome. Genome-wide plots display regions of copy number gains and losses.
Tumor Mutational Burden (TMB): Calculations of TMB are included, which can have implications for immunotherapy eligibility. The report presents the TMB value, representing the number of mutations per megabase.
Microsatellite Instability (MSI) Status: Assessment of MSI status is performed, relevant for certain cancer types and treatment decisions.
Clinical Trials Information: Information on relevant clinical trials is incorporated, offering potential therapeutic options based on the identified variants.

Note: The PCGR tool is designed to process a maximum of 500,000 variants. If the input VCF file contains more than this limit, variants exceeding 500,000 will be filtered out.

CPSR Report

The CPSR (Cancer Predisposition Sequencing Report) includes the following:

Settings:

Sample metadata
Report configuration
Virtual gene panel

Summary of Findings:

Variant statistics

Variant Classification:

ClinVar and Non-ClinVar variants:

Class 5 - Pathogenic variants
Class 4 - Likely Pathogenic variants
Class 3 - Variants of Uncertain Significance (VUS)
Class 2 - Likely Benign variants
Class 1 - Benign variants
Biomarkers

PCGR TIER according to ACMG:

Tier 1 (High): Highest priority variants with strong clinical relevance.
Tier 2 (Moderate): Variants with potential clinical significance.
Tier 3 (Low): Variants with uncertain significance.
Tier 4 (No Interest): Variants unlikely to be clinically relevant.

Coverage

The sash workflow utilizes coverage metrics from DRAGEN to evaluate the sequencing quality and depth across target regions. Coverage analysis includes:

Mean coverage across targeted genomic regions
Percentage of target regions covered at various depth thresholds (10X, 20X, 50X, 100X)
Coverage uniformity metrics
Gap analysis for regions with insufficient coverage

These metrics are integrated into the MultiQC report, providing a comprehensive overview of sequencing quality and coverage.
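
DRAGEN computes these metrics itself; the sketch below only illustrates what "mean coverage" and "percentage covered at a depth threshold" mean for a list of per-position depths. The function name and the output keys are hypothetical.

```python
def coverage_summary(depths, thresholds=(10, 20, 50, 100)):
    """Summarise per-position depths as a mean and % covered at each threshold."""
    n = len(depths)
    if n == 0:
        raise ValueError("no positions supplied")
    summary = {"mean": sum(depths) / n}
    for t in thresholds:
        # Percentage of positions with depth at or above the threshold.
        summary[f"pct_{t}x"] = 100.0 * sum(d >= t for d in depths) / n
    return summary
```

For the depths `[5, 15, 25, 55, 120]`, the mean is 44.0 and 80% of positions are covered at 10X or more.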

Reference Data

UMCCR Gene Panels

Curated gene panels for specific analyses, including the germline cancer predisposition gene panel used in the Germline Small Variants workflow.

Genome Annotations

HMFtools Reference Data

Ensembl reference data (GRCh38)
Somatic driver catalogs
Known fusion gene pairs
Driver gene panels

Annotation Databases:

gnomAD (v2.1): Provides population allele frequencies to help distinguish common variants from rare ones.
ClinVar (20220103): Offers clinically curated variant information, aiding in the interpretation of potential pathogenicity.
COSMIC: Contains data on somatic mutations found in cancer, facilitating the identification of cancer-related variants.
Gene Panels: Focuses analysis on specific sets of genes relevant to particular conditions or research interests.

Structural Variant Data:

SnpEff Databases: Used for predicting the effects of variants on genes and proteins.
Panel of Normals (PON): Helps filter out technical artifacts by comparing against a set of normal samples.
RepeatMasker: Identifies repetitive genomic regions to prevent false-positive variant calls.

Databases/datasets PCGR Reference Data:

Version: pcgr_ref_data.20250314.grch38.tgz with GRCh38 VEP 113 cache (homo_sapiens_vep_113_GRCh38.tar.gz). Both archives are auto-extracted by the PREPARE_REFERENCE subworkflow.
Contents include refreshed ClinVar, COSMIC, dbNSFP, gnomAD, OncoKB/CGI biomarker sets, and PCGR/CPSR configuration files aligned with PCGR v2.x.

sash Module Outputs

Somatic SNVs

File: smlv_somatic/filter/{tumor_id}.pass.vcf.gz
Description: Contains somatic single nucleotide variants (SNVs) with filtering applied (VCF format).

Somatic SVs

File: sv_somatic/prioritise/{tumor_id}.sv.prioritised.vcf.gz
Description: Contains somatic structural variants (SVs) with prioritization applied (VCF format).

Somatic CNVs

File: cancer_report/cancer_report_tables/purple/{tumor_id}-purple_cnv_som.tsv.gz
Description: Contains somatic copy number variations (CNVs) data (TSV format).

Somatic Gene CNVs

File: cancer_report/cancer_report_tables/purple/{tumor_id}-purple_cnv_som_gene.tsv.gz
Description: Contains gene-level somatic copy number variations (CNVs) data (TSV format).

Germline SNVs

File: dragen_germline_output/{normal_id}.hard-filtered.vcf.gz
Description: Contains germline single nucleotide variants (SNVs) with hard filtering applied (VCF format).

Purple Purity, Ploidy, MS Status

File: purple/{tumor_id}.purple.purity.tsv
Description: Contains estimated tumor purity, ploidy, and microsatellite status (TSV format).

PCGR JSON with TMB

File: smlv_somatic/report/pcgr/{tumor_id}.pcgr.grch38.json.gz
Description: Contains PCGR annotations, including tumor mutational burden (TMB) (JSON format).

DRAGEN HRD Score (input)

File: ${tumor_id}.hrdscore.csv (from dragen_somatic_dir)
Description: Optional DRAGEN homologous recombination deficiency (HRD) score propagated into the cancer report when provided.

Sigrap HRDetect

File: sigrap/hrdetect/hrdetect.json.gz
Description: HRDetect JSON summarising HRD probability from combined SNV/SV/CNV signals.

Sigrap MutationalPatterns

Directory: sigrap/mutpat/
Description: Mutational signature TSVs/plots (SBS/DBS/indels) generated by Sigrap’s MutationalPatterns wrapper.

Somatic MAF export

File: vcf2maf/{tumor_id}.maf
Description: MAF representation of the filtered somatic VCF for downstream tools that prefer MAF input.

FAQ

Q: Do we use PCGR for the rescue of SAGE?

A: Rescue is performed by BOLT using SAGE hotspot calls layered onto the DRAGEN VCF. PCGR is only used later for reporting/annotation; it does not drive the rescue step.

Q: How are hypermutated samples handled in the current version, and is there any impact on derived metrics such as TMB or MSI?

A: In the current version of sash, hypermutated samples are identified based on a threshold of 500,000 total somatic variant counts. If the variant count exceeds this threshold, the sample is flagged as hypermutated. When this occurs, we will filter variants that: 1) don't have clinical impact, 2) aren't in hotspot regions, until we meet the threshold. This impacts the TMB and MSI calculations by PURPLE. Currently, we are using the TMB and MSI values from PURPLE in these edge cases. A future release will provide correct TMB and MSI calculations from PURPLE.

Q: How are we handling non-standard chromosomes if present in the input VCFs (ALTs, chrM, etc)?

A: We filter on chromosomes 1-22 and chromosomes X, Y, M. All other non-standard chromosomes and contigs are filtered out.
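
As a minimal sketch of this contig selection, assuming GRCh38-style names with a `chr` prefix (the constant and function names are hypothetical):

```python
# Standard contigs retained by the filter: chr1-chr22, chrX, chrY, chrM.
STANDARD_CONTIGS = {f"chr{c}" for c in [*range(1, 23), "X", "Y", "M"]}


def is_standard_contig(contig: str) -> bool:
    """True for chromosomes 1-22, X, Y and M; False for ALT/decoy/unplaced contigs."""
    return contig in STANDARD_CONTIGS
```

An exact-match set is used rather than a prefix test so that names such as `chr1_KI270706v1_random` are rejected.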

Q: Have the inputs for the cancer reporter changed (and what can we harmonize)? For example, where does the Circos plot come from at this point?

A: Circos plots are generated by PURPLE.

Q: We dropped the CACAO coverage reports. Can we discuss how to utilize DRAGEN or HMFtools coverage information instead?

A: DRAGEN coverage metrics are now integrated into the MultiQC report, providing a comprehensive overview of sequencing quality and coverage across the genome. We are exploring further integration of HMFtools coverage analysis for future releases.

Q: What TMB score is displayed in the cancer reporter?

A: The cancer report surfaces the PURPLE-derived TMB; the PCGR HTML also reports its own TMB estimate for comparison.

Q: What filtered VCF is the source for the mutational signatures?

A: Sigrap MutationalPatterns uses the filtered somatic VCF (post-rescue and filtering); its outputs are published under sigrap/mutpat/ and fed into the cancer report.

Q: Where is the contamination score coming from currently?

A: Currently, sash does not calculate a dedicated contamination metric. Tumor purity estimation from PURPLE serves as the primary indicator of sample quality.

Q: Do the SV steps do something more than what's happening in Oncoanalyser?

A: SASH reuses the WiGiTS export to re-run eSVee with UMCCR reference data and panel-of-normals, then applies PURPLE, SnpEff and simple_sv_annotation. GRIDSS/GRIPSS are no longer used.

Q: Does the data from Somatic Small Variants workflow get used for the SV analysis?

A: No, the somatic small variant workflow data is not used in the structural variant (SV) workflow. These are independent analyses that run in parallel.
