sash workflow details

Overview

Summary

The sash Workflow is a genomic analysis framework comprising three primary pipelines:

  • Somatic Small Variants (SNV somatic): Detects single nucleotide variants (SNVs) and indels in tumor samples, emphasizing clinical relevance.
  • Somatic Structural Variants (SV somatic): Identifies large-scale genomic alterations (deletions, duplications, etc.) and integrates copy number data.
  • Germline Variants (SNV germline): Focuses on inherited variants linked to cancer predisposition.

These pipelines utilize Bolt (a Python package designed for modular processing) and leverage outputs from the DRAGEN Variant Caller alongside the Hartwig Medical Foundation (HMF) tools integrated via Oncoanalyser. Each pipeline is tailored to a specific type of genomic variant, incorporating filtering, annotation and HTML reports for research and curation.


HMFtools

HMFtools is an open-source suite for cancer genomics developed by the Hartwig Medical Foundation. Key components used in sash include:

  • SAGE (Somatic Alterations in Genome): A tiered SNV/indel caller targeting cancer hotspots from databases including Cancer Genome Interpreter, CIViC, and OncoKB to recover low-frequency variants missed by DRAGEN. Outputs a VCF with confidence tiers (hotspot, panel, high/low confidence).

  • PURPLE: Estimates tumor purity (tumor cell fraction) and ploidy (average copy number), integrates copy number data, and calculates TMB (tumor mutation burden) and MSI (microsatellite instability).

  • Cobalt: Calculates read-depth ratios from sequencing data, providing essential input for copy number analysis. Its outputs are used by PURPLE to generate accurate copy number profiles across the genome.

  • Amber: Computes B-allele frequencies, which are critical for estimating tumor purity and ploidy. The Amber directory contains these measurements, supporting PURPLE's analysis.


Other Tools

A framework for running PCGR and other genomic reporting tools.

PCGR: Tool for comprehensive clinical interpretation of somatic variants, providing tiered classifications and extensive annotation.

CPSR: Tool for predisposition variant analysis and reporting in germline samples.

UMCCR-developed R package for generating cancer genomics reports.

LINX: Tool for structural variant annotation and visualization to classify complex rearrangements.

Esvee is a structural variant caller optimised for short-read sequencing that identifies somatic and germline rearrangements.

VIRUSBreakend: Tool for detecting viral integration events in human genome sequencing data.


Pipeline Inputs

DRAGEN

  • {tumor_id}.hard-filtered.vcf.gz: Somatic variant calls from DRAGEN pipeline.
  • ${tumor_id}.hrdscore.csv (optional): Homologous recombination deficiency (HRD) scores, surfaced in the cancer report when present.

Oncoanalyser

  • ${tumor_id}.esvee.ref_depth.vcf.gz and the accompanying esvee/ directory: depth and preparation files used to seed eSVee structural variant calling.
  • {tumor_id}.sage.somatic.vcf.gz: Somatic SNV/indel calls from SAGE.
  • Directory: virusbreakend/: Contains outputs from VIRUSBreakend, used for detecting viral integration events.
  • Directory: cobalt/: Contains read-depth ratio data required for copy number analysis by PURPLE.
  • Directory: amber/: Contains B-allele frequency measurements used by PURPLE to estimate tumor purity and ploidy.

CHORD

  • File: chord/{tumor_id}.chord.prediction.tsv (optional): HRD predictions generated by oncoanalyser; incorporated into the cancer report when present.

Workflows

Somatic Small Variants

General

In the Somatic Small Variants workflow, variant detection is performed using the DRAGEN Variant Caller and Oncoanalyser (relying on SAGE and PURPLE outputs). It's structured into four steps: Re-calling, Annotation, Filter, and Report. The final outputs include an HTML report summarizing the results.

Summary

  1. Re-calling SAGE variants to recover low-frequency mutations in hotspots.
  2. Annotate variants with clinical and functional information using PCGR.
  3. Filter variants based on quality and frequency criteria, while retaining those of potential clinical significance.
  4. Generate comprehensive HTML reports (PCGR, Cancer Report, LINX, MultiQC).

Variant Calling Re-calling

The re-calling step uses variants from SAGE, which is more sensitive than DRAGEN, particularly for variants with low allele frequency. SAGE prioritizes predefined cancer hotspot regions of high clinical or biological relevance through its tiered filtering system. This makes it possible to rescue biologically significant variants that DRAGEN either filtered out or did not call.

Inputs

  • From DRAGEN: Somatic small variant caller VCF

    • ${tumor_id}.main.dragen.vcf.gz
  • From Oncoanalyser: SAGE VCF

    • ${tumor_id}.main.sage.filtered.vcf.gz

    Filtered on chromosomes 1-22, X, Y, and M.

Output

  • Re-calling: VCF
    • ${tumor_id}.rescued.vcf.gz

Steps

  1. Select High-Confidence SAGE Calls in Hotspot Regions:
    • Filter the SAGE output to retain only variants that pass quality filters and overlap with known hotspot regions.
    • Compare the input VCF and the SAGE VCF to identify overlapping and unique variants.
  2. Annotate input VCF calls that are also present in the SAGE calls:
    • For each variant in the input VCF, check whether it also exists in the SAGE calls.
    • For variants also called by SAGE:
      • If SAGE FILTER=PASS and input VCF FILTER=PASS:
        • Set INFO/SAGE_HOTSPOT to indicate the variant is called by SAGE in a hotspot.
      • If SAGE FILTER=PASS and input VCF FILTER is not PASS:
        • Set INFO/SAGE_HOTSPOT and INFO/SAGE_RESCUE to indicate the variant is re-called from SAGE.
        • Update FILTER=PASS to include the variant in the final analysis.
      • If SAGE FILTER is not PASS:
        • Append SAGE_lowconf to the FILTER field to flag low-confidence variants.
    • Transfer SAGE FORMAT fields to the input VCF with a SAGE_ prefix.
  3. Combine annotated input VCF with novel SAGE calls:
    • Prepare novel SAGE calls. For each variant in the SAGE VCF missing from the input VCF:
      • Rename certain FORMAT fields in the novel SAGE VCF to avoid namespace collisions:
        • For example, FORMAT/SB is renamed to FORMAT/SAGE_SB.
      • Retain necessary INFO and FORMAT annotations while removing others to streamline the data.
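
The FILTER and INFO decisions in step 2 can be sketched as a small pure-Python function. This is an illustrative sketch only, not Bolt's implementation: the function name `apply_sage_rescue` and the list-based FILTER representation are hypothetical. Only the names SAGE_HOTSPOT, SAGE_RESCUE and SAGE_lowconf come from the steps above. Dropping an existing PASS before appending SAGE_lowconf is an assumption made so that the resulting FILTER stays valid.

```python
def apply_sage_rescue(input_filter, sage_filter):
    """Decide FILTER and INFO flags for an input-VCF variant that SAGE also called.

    Both arguments are lists of FILTER values, e.g. ["PASS"] or ["LowDepth"].
    Returns (new_filter, info_flags). Hypothetical helper for illustration.
    """
    info_flags = set()
    new_filter = list(input_filter)
    sage_pass = sage_filter == ["PASS"]
    input_pass = input_filter == ["PASS"]

    if sage_pass and input_pass:
        # Called by both: only mark the SAGE hotspot support.
        info_flags.add("SAGE_HOTSPOT")
    elif sage_pass:
        # Filtered in the input VCF but PASS in SAGE: rescue the variant.
        info_flags.update({"SAGE_HOTSPOT", "SAGE_RESCUE"})
        new_filter = ["PASS"]
    else:
        # SAGE itself is not confident: flag the variant as low confidence.
        # Assumption: PASS is dropped so FILTER never reads "PASS;SAGE_lowconf".
        new_filter = [f for f in new_filter if f != "PASS"] + ["SAGE_lowconf"]
    return new_filter, info_flags
```

For example, a variant with input FILTER `["LowDepth"]` and SAGE FILTER `["PASS"]` comes back as `["PASS"]` with both SAGE_HOTSPOT and SAGE_RESCUE set.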

Annotation

The Annotation process employs Reference Sources (GA4GH/GIAB problem region stratifications, GIAB high confidence regions, gnomAD, Hartwig hotspots), UMCCR panel of normals (built from approximately 200 normal samples), and the PCGR tool to enrich variants with classification and clinical information. These annotations are used to decide which variants are retained or filtered in the next step.

Inputs

  • Small variant VCF
    • ${tumor_id}.rescued.vcf.gz

Output

  • Annotated VCF
    • ${tumor_id}.annotations.vcf.gz

Steps

  1. Set FILTER to "PASS" for unfiltered variants:
    • Iterate over the input VCF file and set the FILTER field to PASS for any variants that currently have no filter status (FILTER is . or None).
  2. Annotate the VCF against reference sources:
    • Use vcfanno to add annotations to the VCF file:
      • gnomAD (version 2.1)
      • Hartwig Hotspots
      • ENCODE Blacklist
      • Genome in a Bottle High-Confidence Regions (v4.2.1)
      • Low and High GC Regions (< 30% or > 65% GC content, compiled by GA4GH)
      • Bad Promoter Regions (compiled by GA4GH)
  3. Annotate with UMCCR panel of normals counts:
    • Use vcfanno and bcftools to annotate the VCF with counts from the UMCCR panel of normals.
  4. Standardize the VCF fields:
    • Add new INFO fields for use with PCGR:
      • TUMOR_AF, NORMAL_AF: Tumor and normal allele frequencies.
      • TUMOR_DP, NORMAL_DP: Tumor and normal read depths.
    • Add the AD FORMAT field:
      • AD: Allelic depths for the reference and alternate alleles.
  5. Prepare VCF for PCGR annotation:
    • Create a minimal VCF header that keeps only the INFO AF/DP definitions and the contig sizes.
    • Move tumor and normal FORMAT/AF and FORMAT/DP annotations to the INFO field as required by PCGR.
    • Set FILTER to PASS and remove all FORMAT and sample columns.
  6. Run PCGR (v1.4.1) to annotate VCF against external sources:
    • Classify variants by tiers based on annotations and functional impact according to AMP/ASCO/CAP guidelines.
    • Add INFO fields into the VCF: TIER, SYMBOL, CONSEQUENCE, MUTATION_HOTSPOT, TCGA_PANCANCER_COUNT, CLINVAR_CLNSIG, ICGC_PCAWG_HITS, COSMIC_CNT.
    • External sources include VEP, ClinVar, COSMIC, TCGA, ICGC, Open Targets Platform, CancerMine, DoCM, CBMDB, DisGeNET, Cancer Hotspots, dbNSFP, UniProt/SwissProt, Pfam, DGIdb, and ChEMBL.
  7. Transfer PCGR annotations to the full set of variants:
    • Merge the PCGR annotations back into the original VCF file.
    • Ensure that all variants, including those not selected for PCGR annotation, have relevant clinical annotations where available.
    • Preserve the FILTER statuses and other annotations from the original VCF.
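
Step 5 (preparing a sample-free record for PCGR) can be illustrated on a single VCF data line. This is a minimal sketch under stated assumptions, not the pipeline's code: the function name `to_pcgr_minimal_record` is hypothetical, the tumor sample is assumed to be the first sample column and the normal the second, and both samples are assumed to carry FORMAT/AF and FORMAT/DP.

```python
def to_pcgr_minimal_record(line, tumor_idx=0, normal_idx=1):
    """Turn one VCF data line into an 8-column record with INFO AF/DP only.

    Assumptions (not from the source): sample column order, and that
    FORMAT/AF and FORMAT/DP are present for both samples.
    """
    cols = line.rstrip("\n").split("\t")
    fmt_keys = cols[8].split(":")
    # Map FORMAT keys to values for each sample column.
    samples = [dict(zip(fmt_keys, s.split(":"))) for s in cols[9:]]
    tumor, normal = samples[tumor_idx], samples[normal_idx]

    # Move tumor/normal AF and DP from FORMAT into INFO.
    info = [
        f"TUMOR_AF={tumor['AF']}",
        f"TUMOR_DP={tumor['DP']}",
        f"NORMAL_AF={normal['AF']}",
        f"NORMAL_DP={normal['DP']}",
    ]
    cols[6] = "PASS"           # FILTER is set to PASS for the PCGR input
    cols[7] = ";".join(info)   # original INFO is replaced by the AF/DP fields
    return "\t".join(cols[:8])  # drop FORMAT and sample columns
```

Because the original FILTER and annotations are discarded here, step 7 is what carries the PCGR results back onto the full VCF.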

Filter

The Filter step applies a series of stringent filters to somatic variant calls in the VCF file, ensuring the retention of high-confidence and biologically meaningful variants.

Inputs

  • Annotated VCF
    • ${tumor_id}.annotations.vcf.gz

Output

  • Filtered VCF
    • ${tumor_id}*filters_set.vcf.gz

Filters

Variants that match any of the following criteria are filtered out unless they qualify for a Clinical Significance Exception:

| Filter Type | Threshold/Criteria |
|---|---|
| Allele Frequency (AF) Filter | Tumor AF < 10% (0.10) |
| Allele Depth (AD) Filter | Fewer than 4 supporting reads (6 in low-complexity regions) |
| Non-GIAB AD Filter | Stricter thresholds outside GIAB high-confidence regions |
| Problematic Genomic Regions Filter | Overlap with ENCODE blacklist, bad promoter, or low-complexity regions |
| Population Frequency (gnomAD) Filter | gnomAD AF ≥ 1% (0.01) |
| Panel of Normals (PoN) Germline Filter | Present in ≥ 5 normal samples or PoN AF > 20% (0.20) |

Clinical Significance Exceptions

| Exception Category | Criteria |
|---|---|
| Reference Database Hit Count | COSMIC count ≥ 10 OR TCGA pan-cancer count ≥ 5 OR ICGC PCAWG count ≥ 5 |
| ClinVar Pathogenicity | ClinVar classification of conflicting_interpretations_of_pathogenicity, likely_pathogenic, pathogenic, or uncertain_significance |
| Mutation Hotspots | Annotated as HMF_HOTSPOT, PCGR_MUTATION_HOTSPOT, or SAGE Hotspots (CGI, CIViC, OncoKB) |
| PCGR Tier Exception | Classified as TIER_1 OR TIER_2 |
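
The interaction between the filters and the exceptions can be sketched as follows. This is a hedged illustration rather than the Bolt implementation: the dictionary keys, the filter labels, and the function names are all hypothetical, and the thresholds are taken from the criteria listed above. The Non-GIAB AD filter is omitted because its stricter thresholds are not stated in this document.

```python
CLINVAR_KEEP = {
    "conflicting_interpretations_of_pathogenicity",
    "likely_pathogenic",
    "pathogenic",
    "uncertain_significance",
}


def failed_filters(v):
    """Return the (hypothetical) labels of every filter the variant trips."""
    failed = []
    if v["tumor_af"] < 0.10:
        failed.append("min_AF")
    # 4 supporting reads are required, 6 in low-complexity regions.
    min_ad = 6 if v.get("low_complexity", False) else 4
    if v["tumor_ad"] < min_ad:
        failed.append("min_AD")
    # Caller-supplied flag for blacklist / bad promoter / low-complexity overlap.
    if v.get("problem_region", False):
        failed.append("problem_region")
    if v.get("gnomad_af", 0.0) >= 0.01:
        failed.append("gnomAD_common")
    if v.get("pon_count", 0) >= 5 or v.get("pon_af", 0.0) > 0.20:
        failed.append("PoN")
    return failed


def has_clinical_exception(v):
    """True if any Clinical Significance Exception applies."""
    return bool(
        v.get("cosmic_count", 0) >= 10
        or v.get("tcga_pancancer_count", 0) >= 5
        or v.get("icgc_pcawg_count", 0) >= 5
        or v.get("clinvar_clnsig") in CLINVAR_KEEP
        or v.get("hotspot", False)
        or v.get("pcgr_tier") in {"TIER_1", "TIER_2"}
    )


def keep_variant(v):
    """A variant is retained if it trips no filter or has an exception."""
    return (not failed_filters(v)) or has_clinical_exception(v)
```

With this structure a low-AF variant is dropped by default, but the same variant is kept once it carries, say, a TIER_1 classification.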

Reports

The Report step utilizes the Personal Cancer Genome Reporter (PCGR) and other tools to generate comprehensive reports.

Inputs

  • Purple purity data
  • Filtered VCF
    • ${tumor_id}*filters_set.vcf.gz
  • DRAGEN VCF
    • ${tumor_id}.main.dragen.vcf.gz

Output

  • PCGR Cancer report
    • ${tumor_id}.pcgr.grch38.html

Steps

  1. Generate BCFtools Statistics on the Input VCF:
    • Run bcftools stats to gather statistics on variant quality and distribution.
  2. Calculate Allele Frequency Distributions:
    • Filter and normalize variants according to high-confidence regions.
    • Extract allele frequency data from tumor samples.
    • Produce both a global allele frequency summary and a subset of allele frequencies restricted to key cancer genes.
  3. Compare Variant Counts From Two Variant Sets (DRAGEN vs. BOLT):
    • Count the total number and types of variants (SNPs, Indels, Others) passing filters in both the DRAGEN VCF and the Filtered BOLT VCF.
  4. Count Variants by Processing Stage.
  5. Parse Purity and Ploidy Information (Purple Data).
  6. Run PCGR (GRCh38 VEP 113 / pcgr_ref_data.20250314) to generate the final report. If PCGR struggles with very large VCFs, tune chunking with --pcgr_variant_chunk_size to cap variants per batch.
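
The variant-count comparison in step 3 amounts to tallying passing records by type in each VCF. The sketch below shows one way to do that on plain VCF text lines. It is a simplified illustration: the function names are hypothetical, only FILTER=PASS records are counted, and the SNP/Indel/Other rule (based on REF/ALT lengths) is an assumption that may not match how bcftools stats or Bolt classify variants.

```python
from collections import Counter


def classify(ref, alt):
    """Simplified typing by allele length (an assumption, see lead-in)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "Indel"
    return "Other"  # e.g. multi-nucleotide substitutions


def count_passing(vcf_lines):
    """Count PASS variants by type, one count per ALT allele."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        if cols[6] != "PASS":
            continue
        for alt in cols[4].split(","):
            counts[classify(cols[3], alt)] += 1
    return counts
```

Running `count_passing` once on the DRAGEN VCF and once on the filtered BOLT VCF gives the two sets of counts to compare.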

VCF to MAF conversion

After filtering, the pipeline converts the somatic VCF to MAF using vcf2maf (v1.6.22) for downstream tools that expect MAF format.

Output

  • MAF file for the tumour/normal pair
    • ${tumor_id}.maf

Somatic Structural Variants

The Somatic Structural Variants (SVs) pipeline identifies and annotates large-scale genomic alterations, including deletions, duplications, inversions, insertions, and translocations in tumor samples. Calls now come from eSVee (replacing GRIDSS/GRIPSS), but the downstream PURPLE/SnpEff/prioritisation steps remain unchanged.

Summary

  1. eSVee filtering:
    • Refines the structural variant calls using read counts, panel-of-normals, known fusion hotspots, and repeat masker annotations.
  2. PURPLE:
    • Combines the eSVee-filtered SV calls with copy number variation (CNV) data and tumor purity/ploidy estimates.
  3. Annotation:
    • Combines SV calls with CNV data and annotates using SnpEff.
  4. Prioritization:
    • Prioritizes SV annotations using a module forked from the AstraZeneca-NGS simple_sv_annotation tool, together with curated reference data.
  5. Report:
    • Generates cancer report and MultiQC output.

Inputs

  • eSVee (GRIDSS/GRIPSS replacement)
    • ${tumor_id}.esvee.somatic.vcf.gz

Steps

  1. eSVee filtering:
    • Evaluate split-read and paired-end support; discard variants with low support.
    • Apply panel-of-normals filtering to remove artifacts observed in normal samples.
    • Retain variants overlapping known oncogenic fusion hotspots (using UMCCR-curated lists).
    • Exclude variants in repetitive regions based on Repeat Masker annotations.
  2. PURPLE:
    • Merge SV calls with CNV segmentation data.
    • Estimate tumor purity and ploidy.
    • Adjust SV breakpoints based on copy number transitions.
    • Classify SVs as somatic or germline.
  3. Annotation:
    • Compile SV and CNV information into a unified VCF file.
    • Extend the VCF header with PURPLE-related INFO fields (e.g., PURPLE_baf, PURPLE_copyNumber).
    • Convert CNV records from TSV format into VCF records with appropriate SVTYPE tags (e.g., 'DUP' for duplications, 'DEL' for deletions).
    • Run SnpEff to annotate the unified VCF with functional information such as gene names, transcript effects, and coding consequences.
  4. Prioritization:
    • Run the prioritization module (forked from the AstraZeneca simple_sv_annotation tool) using reference data files including known fusion pairs, known fusion 5′ and 3′ lists, key genes, and key tumor suppressor genes.
    • Classify Variants:
      • Structural Variants (SVs): Variants labeled with the source sv_esvee.
      • Copy Number Variants (CNVs): Variants labeled with the source cnv_purple.
  5. Prioritize variants on a 4-tier system using prioritize_sv:
    • 1 (high) - 2 (moderate) - 3 (low) - 4 (no interest)
    • Exon loss:
      • On cancer gene list (1)
      • Other (2)
    • Gene fusion:
      • Paired (hits two genes):
        • On list of known pairs (1) (curated by HMF)
        • One gene is a known promiscuous fusion gene (1) (curated by HMF)
        • On list of FusionCatcher known pairs (2)
        • Other:
          • One or two genes on cancer gene list (2)
          • Neither gene on cancer gene list (3)
      • Unpaired (hits one gene):
        • On cancer gene list (2)
        • Others (3)
    • Upstream or downstream: A specific type of fusion where one gene comes under the control of another gene's promoter, potentially leading to overexpression (oncogene) or underexpression (tumor suppressor gene):
      • On cancer gene list genes (2)
    • LoF or HIGH impact in a tumor suppressor:
      • On cancer gene list (2)
      • Other TS gene (3)
    • Other (4)
  6. Filter Low-Quality Calls:
    • Apply Quality Filters:
      • Keep variants with sufficient read support (e.g., split reads (SR) ≥ 5 and paired reads (PR) ≥ 5).
      • Exclude Tier 3 and Tier 4 variants where SR < 5 and PR < 5.
      • Exclude Tier 3 and Tier 4 variants where SR < 10, PR < 10, and allele frequencies (AF0 or AF1) are below 0.1.
  7. Report:
    • Generate MultiQC and cancer report outputs.
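
Two parts of the steps above lend themselves to a short sketch: the gene-fusion branch of the tier system (step 5) and the read-support filter for low-priority calls (step 6). Both functions below are illustrative only. Their names and arguments are hypothetical, gene pairs are treated as orientation-agnostic sets (a simplification, since the reference data distinguishes 5′ and 3′ partners), and the AF condition is read as "either AF0 or AF1 is below 0.1", which is an assumption.

```python
def fusion_tier(genes, known_pairs, promiscuous, fusioncatcher_pairs, cancer_genes):
    """Tier a gene fusion: `genes` holds one (unpaired) or two (paired) symbols."""
    if len(genes) == 2:
        pair = frozenset(genes)
        if pair in known_pairs or any(g in promiscuous for g in genes):
            return 1  # known pair or known promiscuous fusion gene
        if pair in fusioncatcher_pairs:
            return 2  # FusionCatcher known pair
        # Other paired fusions: tier depends on cancer gene list membership.
        return 2 if any(g in cancer_genes for g in genes) else 3
    # Unpaired fusion: hits a single gene.
    return 2 if genes[0] in cancer_genes else 3


def passes_read_support(tier, sr, pr, af0, af1):
    """Apply the read-support rules; Tier 1 and Tier 2 calls are always kept."""
    if tier <= 2:
        return True
    if sr < 5 and pr < 5:
        return False
    if sr < 10 and pr < 10 and (af0 < 0.1 or af1 < 0.1):
        return False
    return True
```

For instance, a Tier 3 call with SR=8, PR=3 is kept when both allele frequencies are 0.5 but dropped when one of them is 0.05.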

Germline Small Variants

This workflow selects passing variants that fall within the transcript regions of the germline gene panel (built from the PMCC familial cancer clinic list) and then generates a CPSR report.

Inputs

  • DRAGEN VCF
    • ${normal_id}.hard-filtered.vcf.gz

Output

  • CPSR report
    • ${normal_id}.cpsr.grch38.html

Steps

  1. Prepare:
    • Selection of Passing Variants:
      • Raw germline variant calls from DRAGEN are filtered to retain only those variants marked as PASS (or with no filter flag).
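    • Selection of Gene Panel Variants:
      • Passing variants are restricted to the transcript regions of the gene panel built from the PMCC familial cancer clinic list.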
  2. Report:
    • Generate CPSR (Cancer Predisposition Sequencing Report) summarizing germline findings.
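
The Prepare step can be pictured as two successive selections: keep records that are PASS (or unfiltered), then keep those inside the gene panel regions. The sketch below is a minimal illustration on in-memory tuples. The function name, the record layout, and the region dictionary are hypothetical, and a production pipeline would more likely do this with an indexed VCF and a BED file.

```python
def select_germline(records, panel_regions):
    """Yield passing records that fall inside the gene panel regions.

    records: iterable of (chrom, pos, filter_value) tuples.
    panel_regions: dict mapping chrom -> list of (start, end), 1-based inclusive.
    Both layouts are assumptions made for this example.
    """
    for chrom, pos, flt in records:
        # Keep PASS records and records with no filter flag (".").
        if flt not in ("PASS", "."):
            continue
        regions = panel_regions.get(chrom, ())
        if any(start <= pos <= end for start, end in regions):
            yield (chrom, pos, flt)
```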

Common Reports

UMCCR cancer report containing:

Tumor Mutation Burden (TMB)

  • Data Source: filtered somatic VCF
  • Tool: PURPLE

Mutational Signatures

  • Data Source: filtered somatic SNV VCF (Sigrap MutationalPatterns output)
  • Tool: Sigrap (MutationalPatterns wrapper)

Contamination Score

  • Data Source: –
  • Note: No dedicated contamination metric is currently generated

Purity & Ploidy

  • Data Source: COBALT (providing read-depth ratios) and AMBER (providing B-allele frequency measurements)
  • Tool: PURPLE, which uses these inputs to compute sample purity (percentage of tumor cells) and overall ploidy (average copy number)

HRD Score

  • Data Source: optional DRAGEN HRD score (${tumor_id}.hrdscore.csv), Sigrap HRDetect JSON, and oncoanalyser CHORD predictions
  • Tool: DRAGEN HRD, Sigrap HRDetect, and CHORD

MSI (Microsatellite Instability)

  • Data Source: Indels in microsatellite regions from SNV/CNV
  • Tool: PURPLE

Structural Variant Metrics

  • Data Source: eSVee SV VCF and PURPLE CNV segmentation
  • Tools: eSVee, PURPLE, and the AstraZeneca simple_sv_annotation prioritisation rules

Copy Number Metrics (Segments, Deleted Genes, etc.)

  • Data Source: PURPLE CNV outputs (segmentation files, gene-level CNV TSV)
  • Tool: PURPLE

The LINX report includes the following:

  • Tables of Variants:
    • Breakends
    • Links
    • Driver Catalog
  • Plots:
    • Cluster-Level Plots

MultiQC

General Stats: Overview of QC metrics aggregated from all tools, providing high-level sample quality information.

DRAGEN: Mapping metrics (mapped reads, paired reads, duplicated alignments, secondary alignments), WGS coverage (average depth, cumulative coverage, per-contig coverage), fragment length distributions, trimming metrics, and time metrics for pipeline steps.

PURPLE: Sample QC status (PASS/FAIL), ploidy, tumor purity, polyclonality percentage, tumor mutational burden (TMB), microsatellite instability (MSI) status, and variant metrics for somatic and germline SNPs/indels.

BcfTools Stats: Variant substitution types, SNP and indel counts, quality scores, variant depth, and allele frequency metrics for both somatic and germline variants.

DRAGEN-FastQC: Per-base sequence quality, per-sequence quality scores, GC content (per-sequence and per-position), HRD score, sequence length distributions, adapter contamination, and sequence duplication levels.

PCGR

The Personal Cancer Genome Reporter (PCGR) generates a comprehensive, interactive HTML report that consolidates the filtered and annotated variant data and provides detailed insight into the somatic variants identified.

Key Metrics:

  • Variant Classification and Tier Distribution: PCGR categorizes variants into tiers based on their clinical and biological significance. The report details the proportion of variants across different tiers, indicating their potential clinical relevance.
  • Mutational Signatures: The report includes analysis of mutational signatures, offering insights into the mutational processes active in the tumor.
  • Copy Number Alterations (CNAs): Visual representations of CNAs are provided, highlighting significant gains and losses across the genome. Genome-wide plots display regions of copy number gains and losses.
  • Tumor Mutational Burden (TMB): Calculations of TMB are included, which can have implications for immunotherapy eligibility. The report presents the TMB value, representing the number of mutations per megabase.
  • Microsatellite Instability (MSI) Status: Assessment of MSI status is performed, relevant for certain cancer types and treatment decisions.
  • Clinical Trials Information: Information on relevant clinical trials is incorporated, offering potential therapeutic options based on the identified variants.

Note: The PCGR tool is designed to process a maximum of 500,000 variants. If the input VCF file contains more than this limit, variants exceeding 500,000 will be filtered out.

CPSR Report

The CPSR (Cancer Predisposition Sequencing Report) includes the following:

Settings:

  • Sample metadata
  • Report configuration
  • Virtual gene panel

Summary of Findings:

  • Variant statistics

Variant Classification:

ClinVar and Non-ClinVar variants:

  • Class 5 - Pathogenic variants
  • Class 4 - Likely Pathogenic variants
  • Class 3 - Variants of Uncertain Significance (VUS)
  • Class 2 - Likely Benign variants
  • Class 1 - Benign variants
  • Biomarkers

PCGR TIER according to AMP/ASCO/CAP guidelines:

  • Tier 1 (High): Highest priority variants with strong clinical relevance.
  • Tier 2 (Moderate): Variants with potential clinical significance.
  • Tier 3 (Low): Variants with uncertain significance.
  • Tier 4 (No Interest): Variants unlikely to be clinically relevant.

Coverage

The sash workflow utilizes coverage metrics from DRAGEN to evaluate the sequencing quality and depth across target regions. Coverage analysis includes:

  • Mean coverage across targeted genomic regions
  • Percentage of target regions covered at various depth thresholds (10X, 20X, 50X, 100X)
  • Coverage uniformity metrics
  • Gap analysis for regions with insufficient coverage

These metrics are integrated into the MultiQC report, providing a comprehensive overview of sequencing quality and coverage.


Reference Data

Curated gene panels for specific analyses, including the germline cancer predisposition gene panel used in the Germline Small Variants workflow.

Genome Annotations

HMFtools Reference Data

  • Ensembl reference data (GRCh38)
  • Somatic driver catalogs
  • Known fusion gene pairs
  • Driver gene panels

Annotation Databases:

  • gnomAD (v2.1): Provides population allele frequencies to help distinguish common variants from rare ones.
  • ClinVar (20220103): Offers clinically curated variant information, aiding in the interpretation of potential pathogenicity.
  • COSMIC: Contains data on somatic mutations found in cancer, facilitating the identification of cancer-related variants.
  • Gene Panels: Focuses analysis on specific sets of genes relevant to particular conditions or research interests.

Structural Variant Data:

  • SnpEff Databases: Used for predicting the effects of variants on genes and proteins.
  • Panel of Normals (PON): Helps filter out technical artifacts by comparing against a set of normal samples.
  • RepeatMasker: Identifies repetitive genomic regions to prevent false-positive variant calls.

PCGR Reference Data:

  • Version: pcgr_ref_data.20250314.grch38.tgz with GRCh38 VEP 113 cache (homo_sapiens_vep_113_GRCh38.tar.gz). Both archives are auto-extracted by the PREPARE_REFERENCE subworkflow.
  • Contents include refreshed ClinVar, COSMIC, dbNSFP, gnomAD, OncoKB/CGI biomarker sets, and PCGR/CPSR configuration files aligned with PCGR v2.x.

sash Module Outputs

Somatic SNVs

  • File: smlv_somatic/filter/{tumor_id}.pass.vcf.gz
  • Description: Contains somatic single nucleotide variants (SNVs) with filtering applied (VCF format).

Somatic SVs

  • File: sv_somatic/prioritise/{tumor_id}.sv.prioritised.vcf.gz
  • Description: Contains somatic structural variants (SVs) with prioritization applied (VCF format).

Somatic CNVs

  • File: cancer_report/cancer_report_tables/purple/{tumor_id}-purple_cnv_som.tsv.gz
  • Description: Contains somatic copy number variations (CNVs) data (TSV format).

Somatic Gene CNVs

  • File: cancer_report/cancer_report_tables/purple/{tumor_id}-purple_cnv_som_gene.tsv.gz
  • Description: Contains gene-level somatic copy number variations (CNVs) data (TSV format).

Germline SNVs

  • File: dragen_germline_output/{normal_id}.hard-filtered.vcf.gz
  • Description: Contains germline single nucleotide variants (SNVs) with hard filtering applied (VCF format).

Purple Purity, Ploidy, MS Status

  • File: purple/{tumor_id}.purple.purity.tsv
  • Description: Contains estimated tumor purity, ploidy, and microsatellite status (TSV format).

PCGR JSON with TMB

  • File: smlv_somatic/report/pcgr/{tumor_id}.pcgr.grch38.json.gz
  • Description: Contains PCGR annotations, including tumor mutational burden (TMB) (JSON format).

DRAGEN HRD Score (input)

  • File: ${tumor_id}.hrdscore.csv (from dragen_somatic_dir)
  • Description: Optional DRAGEN homologous recombination deficiency (HRD) score propagated into the cancer report when provided.

Sigrap HRDetect

  • File: sigrap/hrdetect/hrdetect.json.gz
  • Description: HRDetect JSON summarising HRD probability from combined SNV/SV/CNV signals.

Sigrap MutationalPatterns

  • Directory: sigrap/mutpat/
  • Description: Mutational signature TSVs/plots (SBS/DBS/indels) generated by Sigrap’s MutationalPatterns wrapper.

Somatic MAF export

  • File: vcf2maf/{tumor_id}.maf
  • Description: MAF representation of the filtered somatic VCF for downstream tools that prefer MAF input.

FAQ

Q: Do we use PCGR for the rescue of SAGE?

A: Rescue is performed by BOLT using SAGE hotspot calls layered onto the DRAGEN VCF. PCGR is only used later for reporting/annotation; it does not drive the rescue step.

Q: How are hypermutated samples handled in the current version, and is there any impact on derived metrics such as TMB or MSI?

A: In the current version of sash, a sample is flagged as hypermutated when its total somatic variant count exceeds 500,000. When this occurs, we filter out variants that 1) don't have clinical impact and 2) aren't in hotspot regions, until the variant count meets the threshold. This affects the TMB and MSI calculations by PURPLE. Currently, the TMB and MSI values from PURPLE are still used in these edge cases; a future release will provide correct TMB and MSI calculations from PURPLE.

Q: How are we handling non-standard chromosomes if present in the input VCFs (ALTs, chrM, etc)?

A: We filter on chromosomes 1-22 and chromosomes X, Y, M. All other non-standard chromosomes and contigs are filtered out.
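
As an illustration of this contig filter, the sketch below keeps header lines and records on chromosomes 1-22, X, Y and M, and drops everything else. It is not the pipeline's code: the function name is hypothetical, and the `chr`-prefixed contig names are an assumption based on common GRCh38 naming.

```python
# Assumption: contigs use the "chr" prefix (chr1..chr22, chrX, chrY, chrM).
STANDARD_CONTIGS = {f"chr{c}" for c in [*range(1, 23), "X", "Y", "M"]}


def keep_standard_contigs(vcf_lines):
    """Yield header lines and records located on standard chromosomes."""
    for line in vcf_lines:
        if line.startswith("#") or line.split("\t", 1)[0] in STANDARD_CONTIGS:
            yield line
```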

Q: Have the inputs for the cancer reporter changed (and what can we harmonize)? For example, where does the Circos plot come from at this point?

A: Circos plots are generated by PURPLE.

Q: We dropped the CACAO coverage reports. Can we discuss how to utilize DRAGEN or HMFtools coverage information instead?

A: DRAGEN coverage metrics are now integrated into the MultiQC report, providing a comprehensive overview of sequencing quality and coverage across the genome. We are exploring further integration of HMFtools coverage analysis for future releases.

Q: What TMB score is displayed in the cancer reporter?

A: The cancer report surfaces the PURPLE-derived TMB; the PCGR HTML also reports its own TMB estimate for comparison.

Q: What filtered VCF is the source for the mutational signatures?

A: Sigrap MutationalPatterns uses the filtered somatic VCF (post-rescue and filtering); its outputs are published under sigrap/mutpat/ and fed into the cancer report.

Q: Where is the contamination score coming from currently?

A: Currently, sash does not calculate a dedicated contamination metric. Tumor purity estimation from PURPLE serves as the primary indicator of sample quality.

Q: Do the SV steps do something more than what's happening in Oncoanalyser?

A: sash reuses the WiGiTS export to re-run eSVee with UMCCR reference data and panel-of-normals, then applies PURPLE, SnpEff and simple_sv_annotation. GRIDSS/GRIPSS are no longer used.

Q: Does the data from Somatic Small Variants workflow get used for the SV analysis?

A: No, the somatic small variant workflow data is not used in the structural variant (SV) workflow. These are independent analyses that run in parallel.