sash process OOM / timeout for hypermutated samples #42

@qclayssen

Description

Related discussion: nextflow-stack issue #133

SUMMARY

On ICA, UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE fails for hypermutated samples with an apparent OOM kill inside PCGR. The pod spec suggests ICA is applying a fixed memory preset (16 GiB) regardless of the Nextflow process.memory requested via process_low, which makes “retry with more memory” ineffective unless ICA also scales the preset.

Example failing sample

SBJ02862

Log from PCGR (chunk 1):
aws s3 cp s3://pipeline-dev-cache-503977275616-ap-southeast-2/byob-icav2/development/logs/sash/20251207361fe534/work/aa/7efa389de75e3287c3620a6ec9305a/output/pcgr/pcgr_1/run_somatic.log -

2025-12-08 02:47:42 - pcgr-writer - INFO - PCGR - STEP 6: Generation of output files - molecular interpretation report for precision cancer medicine
/bin/sh: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/sh)
/tmp/tmpdjqrrp0b: line 3: 7535 Killed pcgr --sample_id nosampleset --input_vcf output/vcf_chunks/L2201449.pcgr_prep.vcf_chunk1.vcf --vep_dir vep_dir --refdata_dir pcgr_dir --tumor_dp_tag TUMOR_DP --tumor_af_tag TUMOR_AF --control_dp_tag NORMAL_DP --control_af_tag NORMAL_AF --genome_assembly grch38 --assay WGS --estimate_signatures --estimate_msi --estimate_tmb --vcfanno_n_proc 2 --vep_n_forks 4 --vep_pick_order biotype,rank,appris,tsl,ccds,canonical,length,mane_plus_clinical,mane_select --pcgrr_conda pcgrr --output_dir output/pcgr/pcgr_1

ICA is enforcing a fixed memory preset

It looks like ICA assigns resources regardless of the Nextflow config:

aws s3 cp s3://pipeline-dev-cache-503977275616-ap-southeast-2/byob-icav2/development/logs/sash/20251207361fe534/work/aa/7efa389de75e3287c3620a6ec9305a/.command.tes.yml -

api_version: v1
kind: Pod
metadata:
  annotations:
    illumina.com/taskId: tes-ea12b03d5b6f4f61bb7f4f72e2b869ab-umccr-sash-sash-bolt-sm-0
    scheduler.illumina.com/presetSize: standard-medium
    tes-executor.illumina.com/serverTaskGuid: stg.b60b41a2-7f8e-464b-a67e-86e3ef34effa
  labels:
    nextflow.io/app: nextflow
    nextflow.io/processName: UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE
    nextflow.io/runName: silly_poisson
    nextflow.io/sessionId: uuid-7287b018-9fc7-4e97-bdcd-62c95e6b2c3d
    nextflow.io/taskName: UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE_L2201449__L2201450
  name: tes-ea12b03d5b6f4f61bb7f4f72e2b869ab-umccr-sash-sash-bolt-sm-0
  namespace: wf-d81225c7-a45d-4a2b-9666-1723e9f835c4
spec:
  containers:
  - args:
    - /bin/bash
    - -ue
    - /ces/scheduler/run/d81225c7-a45d-4a2b-9666-1723e9f835c4/data/work/aa/7efa389de75e3287c3620a6ec9305a/.command.run
    image: ghcr.io/umccr/bolt:0.3.0-dev-20-pcgr
    name: nf-aa7efa389de75e3287c3620a6ec9305a-951ce
    resources:
      limits:
        memory: 16384Mi
      requests:
        cpu: 4
        memory: 16384Mi
    volume_mounts:
    - mount_path: /ces
      name: vol-7
  restart_policy: Never
  service_account_name: task
  volumes:
  - name: vol-7
    persistent_volume_claim:
      claim_name: pvc-d81225c7-a45d-4a2b-9666-1723e9f835c4

So even though the process is labelled process_low, ICA is setting memory: 16384Mi and killing the job when PCGR spikes above that.

Current process config

UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE carries the process_low label:

    withLabel:process_low {
        cpus   = { check_max( 2     * task.attempt, 'cpus'    ) }
        memory = { check_max( 12.GB * task.attempt, 'memory'  ) }
        time   = { check_max( 4.h   * task.attempt, 'time'    ) }
    }

So I'm not sure ICA takes the label into account, which means a retry strategy that scales memory may not work.
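For reference, this is the retry pattern we would expect to scale memory on resubmission (a sketch in the nf-core style, using the same check_max helper as above; the exit codes and errorStrategy closure are illustrative, and the whole mechanism only helps if ICA maps the larger request to a bigger preset):

```groovy
// Sketch: on attempt 1 the process requests 12 GB; if the task is
// OOM-killed (exit 137) or terminated (143), errorStrategy 'retry'
// resubmits it and task.attempt becomes 2, doubling the request to
// 24 GB. This is ineffective if the executor pins a fixed preset.
process {
    withLabel:process_low {
        memory        = { check_max( 12.GB * task.attempt, 'memory' ) }
        errorStrategy = { task.exitStatus in [137, 143] ? 'retry' : 'finish' }
        maxRetries    = 2
    }
}
```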

What we already tried

  • Disabled PCGR parallelisation for chunks (run sequentially).
  • Tested --no_html in PCGR: it helped runtime and I/O, but did not materially reduce peak RAM.
  • Local tests show chunk size changes reduce memory but increase runtime.

Proposal

I was thinking of adding a dedicated config file (or a block in base.config) for processes affected by hypermutated samples:

withLabel:process_hypermutated_affected {
  cpus   = { check_max( 4     , 'cpus'   ) }
  memory = { check_max( 24.GB * task.attempt, 'memory' ) }
  time   = { check_max( 24.h  * task.attempt, 'time'   ) }
  errorStrategy = 'retry'
  maxRetries    = 3
}

A quicker fix: add extra labels to the processes affected by hypermutated samples, i.e. label 'process_low', 'error_retry', 'process_long'.

Here error_retry provides memory scaling on retry, and process_long covers long-running processes (the mutpat process takes more than 20 h).
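A minimal sketch of that quick fix in the process definition (the process body is elided; the label semantics assume the nf-core base.config convention, where error_retry sets a retry errorStrategy and process_long extends the time limit):

```groovy
process BOLT_SMLV_SOMATIC_ANNOTATE {
    label 'process_low'   // base cpus/memory from base.config
    label 'error_retry'   // retry on failure, scaling memory via task.attempt
    label 'process_long'  // extended walltime for >20 h tasks

    // ... existing input/output/script blocks unchanged ...
}
```

Stacking labels keeps the change local to the affected processes, at the cost of relying on ICA honouring the scaled requests.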

Labels: bug