Basecalling/demux+modbases: std::bad_alloc #1213
Comments
Hi @sklages, can you repro this with a small dataset, e.g. ~10k reads? Best regards,
@HalfPhoton - I tried two different 10k subsets and both succeeded. I will re-run the large dataset with both v0.8.3 and the current v0.9.0 in order to see if I can actually reproduce the issue and whether both versions behave differently.
@HalfPhoton - Using the version:
Any idea where to start looking for the problem? It may be dataset-specific or system-specific. Both runs had a small memory footprint (less than 20G) and plenty of free disk space. I will run a different dataset which worked before, to exclude the latter.
I ran both versions on another (smaller) dataset; both finished successfully, so it seems to be somehow dataset(-size)-related.
```
[2025-01-19 18:57:20.000] [info] Running: "basecaller" "/models/[email protected]" "." "--modified-bases-models" "/models/[email protected]_5mCG_5hmCG@v3" "--device" "cuda:all"
[2025-01-19 18:57:20.148] [info] > Creating basecall pipeline
```
I’m encountering the same issue on my end. Certain datasets cause a crash with `std::bad_alloc` (not enough memory), despite having over 500GB of free RAM and plenty of disk space available. This behavior only occurs with some datasets. I’m still investigating whether there’s a pattern.
@MueFab - that looks like a bug handling corrupt/invalid data. We're trying to allocate 18446744073709551552 bytes (~18500 PB!), which makes me think we've got a small negative number (that value is -64 reinterpreted as an unsigned 64-bit integer). Is your dataset both small and something you're able to share with us?
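As a quick sanity check on that reasoning (this sketch is not from the original thread), the reported allocation size is exactly what -64 looks like after wrapping around to an unsigned 64-bit integer:

```python
# Reinterpret -64 as an unsigned 64-bit integer (two's complement wraparound).
negative_size = -64
as_uint64 = negative_size & 0xFFFFFFFFFFFFFFFF  # equivalent to negative_size % 2**64

print(as_uint64)                  # 18446744073709551552
print(as_uint64 == 2**64 - 64)    # True: matches the bad_alloc request size
print(round(as_uint64 / 1e15))    # ~18447 PB, i.e. the "~18500 PB" mentioned above
```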
Hi, to get back to this issue: unfortunately I cannot share the dataset as it is sensitive patient data. However, I did some debugging and experimented with different sets of parameters. The pod5 files don't seem to be damaged; at least I am able to open them with the pod5 Python library without any issues. I also tried different settings for the batch size, without success. Currently, it seems to me that it might somehow be related to modified basecalling with the latest model. I was using [email protected] + [email protected]_5mCG_5hmCG@v3 when I experienced the crashes. After downgrading to [email protected] + [email protected][email protected] the issue doesn't seem to appear any longer.
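For reference, the kind of integrity check described above can be done with the pod5 Python package roughly as follows. This is a minimal sketch, not from the original thread; example.pod5 is a placeholder path, and attribute names may vary slightly between pod5 versions:

```python
from pathlib import Path

import pod5  # pip install pod5

path = Path("example.pod5")  # placeholder; point this at one of your files

# A damaged file typically fails while opening or while iterating its records.
with pod5.Reader(path) as reader:
    n_reads = 0
    for read in reader.reads():
        _ = read.read_id  # touch a field so the record actually gets decoded
        n_reads += 1

print(f"{path}: iterated {n_reads} read records without errors")
```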
@MueFab, |
I have a strange issue with `v0.8.3` when basecalling/demuxing in sup mode with modbases on an Nvidia A100/40G: it crashes directly after basecalling has finished (after approx. 53 h). That happened with two datasets, short insert libraries, many reads. Never seen this with `dorado` before. What could cause `dorado` to crash immediately after basecalling has finished? Result files seem to be complete though, e.g.:

Any idea what is going wrong here, what is causing the `std::bad_alloc`?