Related discussion: nextflow-stack issue #133
SUMMARY
On ICA, UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE fails for hypermutated samples with an apparent OOM kill inside PCGR. The pod spec suggests ICA is applying a fixed memory preset (16 GiB) regardless of the Nextflow process.memory requested via process_low, which makes “retry with more memory” ineffective unless ICA also scales the preset.
Example failing sample
SBJ02862
Log from PCGR (chunk 1):
aws s3 cp s3://pipeline-dev-cache-503977275616-ap-southeast-2/byob-icav2/development/logs/sash/20251207361fe534/work/aa/7efa389de75e3287c3620a6ec9305a/output/pcgr/pcgr_1/run_somatic.log -
2025-12-08 02:47:42 - pcgr-writer - INFO - PCGR - STEP 6: Generation of output files - molecular interpretation report for precision cancer medicine /bin/sh: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/sh) /tmp/tmpdjqrrp0b: line 3: 7535 Killed pcgr --sample_id nosampleset --input_vcf output/vcf_chunks/L2201449.pcgr_prep.vcf_chunk1.vcf --vep_dir vep_dir --refdata_dir pcgr_dir --tumor_dp_tag TUMOR_DP --tumor_af_tag TUMOR_AF --control_dp_tag NORMAL_DP --control_af_tag NORMAL_AF --genome_assembly grch38 --assay WGS --estimate_signatures --estimate_msi --estimate_tmb --vcfanno_n_proc 2 --vep_n_forks 4 --vep_pick_order biotype,rank,appris,tsl,ccds,canonical,length,mane_plus_clinical,mane_select --pcgrr_conda pcgrr --output_dir output/pcgr/pcgr_1
ICA is enforcing a fixed memory preset
It looks like ICA assigns resources regardless of the Nextflow configuration:
aws s3 cp s3://pipeline-dev-cache-503977275616-ap-southeast-2/byob-icav2/development/logs/sash/20251207361fe534/work/aa/7efa389de75e3287c3620a6ec9305a/.command.tes.yml -
api_version: v1
kind: Pod
metadata:
  annotations:
    illumina.com/taskId: tes-ea12b03d5b6f4f61bb7f4f72e2b869ab-umccr-sash-sash-bolt-sm-0
    scheduler.illumina.com/presetSize: standard-medium
    tes-executor.illumina.com/serverTaskGuid: stg.b60b41a2-7f8e-464b-a67e-86e3ef34effa
  labels:
    nextflow.io/app: nextflow
    nextflow.io/processName: UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE
    nextflow.io/runName: silly_poisson
    nextflow.io/sessionId: uuid-7287b018-9fc7-4e97-bdcd-62c95e6b2c3d
    nextflow.io/taskName: UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE_L2201449__L2201450
  name: tes-ea12b03d5b6f4f61bb7f4f72e2b869ab-umccr-sash-sash-bolt-sm-0
  namespace: wf-d81225c7-a45d-4a2b-9666-1723e9f835c4
spec:
  containers:
  - args:
    - /bin/bash
    - -ue
    - /ces/scheduler/run/d81225c7-a45d-4a2b-9666-1723e9f835c4/data/work/aa/7efa389de75e3287c3620a6ec9305a/.command.run
    image: ghcr.io/umccr/bolt:0.3.0-dev-20-pcgr
    name: nf-aa7efa389de75e3287c3620a6ec9305a-951ce
    resources:
      limits:
        memory: 16384Mi
      requests:
        cpu: 4
        memory: 16384Mi
    volume_mounts:
    - mount_path: /ces
      name: vol-7
  restart_policy: Never
  service_account_name: task
  volumes:
  - name: vol-7
    persistent_volume_claim:
      claim_name: pvc-d81225c7-a45d-4a2b-9666-1723e9f835c4
So even though the process is labelled process_low, ICA is setting memory: 16384Mi and killing the job when PCGR spikes above that.
Current process config
UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE has the label process_low (sash/modules/local/bolt/smlv_somatic/annotate/main.nf, line 3 at 5f01cf1):
withLabel:process_low {
    cpus   = { check_max( 2 * task.attempt, 'cpus' ) }
    memory = { check_max( 12.GB * task.attempt, 'memory' ) }
    time   = { check_max( 4.h * task.attempt, 'time' ) }
}
So I'm not sure ICA takes the label into account, which means the usual retry-with-more-memory strategy may not work.
What we already tried
- Disabled PCGR parallelisation for chunks (run sequentially).
- Tested --no_html in PCGR; it helped runtime and I/O but did not materially reduce peak RAM.
- Local tests show chunk size changes reduce memory but increase runtime.
Proposal
I was thinking of either a dedicated config file, or an addition to base.config, with a label for the processes affected by hypermutated samples:
withLabel:process_hypermutated_affected {
    cpus          = { check_max( 4, 'cpus' ) }
    memory        = { check_max( 24.GB * task.attempt, 'memory' ) }
    time          = { check_max( 24.h * task.attempt, 'time' ) }
    errorStrategy = 'retry'
    maxRetries    = 3
}
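The new label would then replace process_low in the affected module, along these lines (sketch only; the process body is elided):

```nextflow
// sash/modules/local/bolt/smlv_somatic/annotate/main.nf (sketch)
process UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE {
    label 'process_hypermutated_affected'

    // ... existing container, input, output and script blocks unchanged ...
}
```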
One quick fix: add extra labels to the processes affected by hypermutated samples, i.e.
label 'process_low'
label 'error_retry'
label 'process_long'
Here error_retry provides memory scaling on retry, and process_long covers the longer-running processes (the mutpat step takes more than 20 h).
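For reference, the nf-core-style definitions these labels would rely on look roughly like this (sketch; the exact values in this pipeline's base.config may differ):

```nextflow
// Sketch of nf-core-style labels; check base.config for the actual values
withLabel:error_retry {
    errorStrategy = 'retry'
    maxRetries    = 2
}
withLabel:process_long {
    time = { check_max( 20.h * task.attempt, 'time' ) }
}
```

Note the same caveat as above: error_retry only helps with memory if ICA actually scales the preset on the retried attempt.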