sash process OOM / timeout for hypermutated samples #42

@qclayssen

Description

Related discussion: nextflow-stack issue #133

SUMMARY

On ICA, UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE fails for hypermutated samples with an apparent OOM kill inside PCGR. The pod spec suggests ICA is applying a fixed memory preset (16 GiB) regardless of the Nextflow process.memory requested via process_low, which makes “retry with more memory” ineffective unless ICA also scales the preset.

Example failing sample

SBJ02862

Log from PCGR (chunk 1):
aws s3 cp s3://pipeline-dev-cache-503977275616-ap-southeast-2/byob-icav2/development/logs/sash/20251207361fe534/work/aa/7efa389de75e3287c3620a6ec9305a/output/pcgr/pcgr_1/run_somatic.log -

2025-12-08 02:47:42 - pcgr-writer - INFO - PCGR - STEP 6: Generation of output files - molecular interpretation report for precision cancer medicine
/bin/sh: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/sh)
/tmp/tmpdjqrrp0b: line 3: 7535 Killed pcgr --sample_id nosampleset --input_vcf output/vcf_chunks/L2201449.pcgr_prep.vcf_chunk1.vcf --vep_dir vep_dir --refdata_dir pcgr_dir --tumor_dp_tag TUMOR_DP --tumor_af_tag TUMOR_AF --control_dp_tag NORMAL_DP --control_af_tag NORMAL_AF --genome_assembly grch38 --assay WGS --estimate_signatures --estimate_msi --estimate_tmb --vcfanno_n_proc 2 --vep_n_forks 4 --vep_pick_order biotype,rank,appris,tsl,ccds,canonical,length,mane_plus_clinical,mane_select --pcgrr_conda pcgrr --output_dir output/pcgr/pcgr_1

ICA is enforcing a fixed memory preset

It looks like ICA assigns resources regardless of the Nextflow config:

aws s3 cp s3://pipeline-dev-cache-503977275616-ap-southeast-2/byob-icav2/development/logs/sash/20251207361fe534/work/aa/7efa389de75e3287c3620a6ec9305a/.command.tes.yml -

api_version: v1
kind: Pod
metadata:
  annotations:
    illumina.com/taskId: tes-ea12b03d5b6f4f61bb7f4f72e2b869ab-umccr-sash-sash-bolt-sm-0
    scheduler.illumina.com/presetSize: standard-medium
    tes-executor.illumina.com/serverTaskGuid: stg.b60b41a2-7f8e-464b-a67e-86e3ef34effa
  labels:
    nextflow.io/app: nextflow
    nextflow.io/processName: UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE
    nextflow.io/runName: silly_poisson
    nextflow.io/sessionId: uuid-7287b018-9fc7-4e97-bdcd-62c95e6b2c3d
    nextflow.io/taskName: UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE_L2201449__L2201450
  name: tes-ea12b03d5b6f4f61bb7f4f72e2b869ab-umccr-sash-sash-bolt-sm-0
  namespace: wf-d81225c7-a45d-4a2b-9666-1723e9f835c4
spec:
  containers:
  - args:
    - /bin/bash
    - -ue
    - /ces/scheduler/run/d81225c7-a45d-4a2b-9666-1723e9f835c4/data/work/aa/7efa389de75e3287c3620a6ec9305a/.command.run
    image: ghcr.io/umccr/bolt:0.3.0-dev-20-pcgr
    name: nf-aa7efa389de75e3287c3620a6ec9305a-951ce
    resources:
      limits:
        memory: 16384Mi
      requests:
        cpu: 4
        memory: 16384Mi
    volume_mounts:
    - mount_path: /ces
      name: vol-7
  restart_policy: Never
  service_account_name: task
  volumes:
  - name: vol-7
    persistent_volume_claim:
      claim_name: pvc-d81225c7-a45d-4a2b-9666-1723e9f835c4

So even though the process is labelled process_low, ICA is setting memory: 16384Mi and killing the job when PCGR spikes above that.

Current process config

UMCCR_SASH_SASH_BOLT_SMLV_SOMATIC_ANNOTATE carries the process_low label:

    withLabel:process_low {
        cpus   = { check_max( 2     * task.attempt, 'cpus'    ) }
        memory = { check_max( 12.GB * task.attempt, 'memory'  ) }
        time   = { check_max( 4.h   * task.attempt, 'time'    ) }
    }

So I'm not sure ICA takes the label into account, which means a retry strategy that scales memory may not work.
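For reference, this is the retry pattern we would expect to scale memory on resubmission (a sketch in the nf-core style, using the same check_max helper as above; the exit codes and errorStrategy closure are illustrative, and the whole mechanism only helps if ICA maps the larger request to a bigger preset):

```groovy
// Sketch: on attempt 1 the process requests 12 GB; if the task is
// OOM-killed (exit 137) or terminated (143), errorStrategy 'retry'
// resubmits it and task.attempt becomes 2, doubling the request to
// 24 GB. This is ineffective if the executor pins a fixed preset.
process {
    withLabel:process_low {
        memory        = { check_max( 12.GB * task.attempt, 'memory' ) }
        errorStrategy = { task.exitStatus in [137, 143] ? 'retry' : 'finish' }
        maxRetries    = 2
    }
}
```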

What we already tried

  • Disabled PCGR parallelisation for chunks (run sequentially).
  • Tested --no_html in PCGR: it helped runtime and I/O, but did not materially reduce peak RAM.
  • Local tests show chunk size changes reduce memory but increase runtime.

Proposal

I was thinking of adding a dedicated config file (or a block in base.config) for processes affected by hypermutated samples:

withLabel:process_hypermutated_affected {
  cpus   = { check_max( 4     , 'cpus'   ) }
  memory = { check_max( 24.GB * task.attempt, 'memory' ) }
  time   = { check_max( 24.h  * task.attempt, 'time'   ) }
  errorStrategy = 'retry'
  maxRetries    = 3
}

A quicker fix: add extra labels to the processes affected by hypermutated samples, i.e. label 'process_low', 'error_retry', 'process_long'.

Here error_retry provides memory scaling on retry, and process_long covers long-running processes (the mutpat process takes more than 20 h).
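A minimal sketch of that quick fix in the process definition (the process body is elided; the label semantics assume the nf-core base.config convention, where error_retry sets a retry errorStrategy and process_long extends the time limit):

```groovy
process BOLT_SMLV_SOMATIC_ANNOTATE {
    label 'process_low'   // base cpus/memory from base.config
    label 'error_retry'   // retry on failure, scaling memory via task.attempt
    label 'process_long'  // extended walltime for >20 h tasks

    // ... existing input/output/script blocks unchanged ...
}
```

Stacking labels keeps the change local to the affected processes, at the cost of relying on ICA honouring the scaled requests.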

Labels: bug