Add ONT BAM and bedMethyl test data: HG002 GIAB 10-read subset (PAW70337)#1969

Open
sahuno wants to merge 1 commit into nf-core:modules from sahuno:add-giab-ont-bam-testdata

Conversation

sahuno commented Apr 6, 2026

Summary

Adds 5 test data files derived from the GIAB HG002 ONT run PAW70337 (5kHz, R10.4.1), companion to the pod5 file merged in #1968.

New files

Unaligned BAM — raw dorado basecaller output, no reference:

  • data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.unaligned.bam (178 KB)

Aligned sorted BAM + index — coordinate-sorted alignment to hg38 via minimap2:

  • data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam (314 KB)
  • data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam.bai (566 KB)

bedMethyl + tabix index — modkit pileup output (5mCG+5hmCG), bgzipped:

  • data/genomics/homo_sapiens/nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz (69 KB)
  • data/genomics/homo_sapiens/nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz.tbi (1.7 KB)

Modules that will use these files

| File | Module(s) |
| --- | --- |
| unaligned.bam | dorado/summary, dorado/trim, dorado/correct |
| aligned.sorted.bam + .bai | modkit/pileup, dorado/basecaller functional test |
| bedmethyl.gz + .tbi | modkit/localize, modkit/localize/plot, modkit/pileup/plot |

Source

s3://ont-open-data/giab_2025.01/flowcells/HG002/PAW70337/pod5/PAW70337_66b2eea5_de8117b1_33.pod5
  • Sample: HG002 (NIST/Genome in a Bottle reference sample, public domain)
  • Chemistry: 5kHz, R10.4.1 (PAW70337, released 2025-01-14)
  • Subset: 10 reads — basecalled with dorado 1.4.0, aligned with minimap2, methylation called with modkit 0.6.1

Test plan

  • BAM opens with samtools view
  • BAI index validates with samtools idxstats
  • bedMethyl decompresses and tabix queries correctly
  • All files used in passing nf-test stub tests for the modules above
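
The first three checks above can also be approximated without samtools or tabix installed. The following is a minimal sketch, not the procedure used for this PR: it relies only on the Python standard library and on the magic bytes defined in the SAM/BAM and tabix specifications (`BAM\1`, `BAI\1`, `TBI\1`). The helper names (`looks_like_bam`, `looks_like_bai`, `looks_like_tbi`, `bedmethyl_rows_ok`) are hypothetical, and the checks are much weaker than `samtools view`, `samtools idxstats`, or a real tabix query.

```python
import gzip

# Magic bytes defined by the SAM/BAM and tabix specifications.
BAM_MAGIC = b"BAM\x01"  # first bytes of the decompressed BAM stream
BAI_MAGIC = b"BAI\x01"  # BAI indexes are stored uncompressed
TBI_MAGIC = b"TBI\x01"  # tabix indexes are BGZF-compressed


def read_head(path, n=4, compressed=True):
    """Return the first n bytes of a file, decompressing gzip/BGZF if asked."""
    opener = gzip.open if compressed else open
    with opener(path, "rb") as handle:
        return handle.read(n)


def looks_like_bam(path):
    """True if the decompressed stream starts with the BAM magic."""
    return read_head(path) == BAM_MAGIC


def looks_like_bai(path):
    """True if the raw file starts with the BAI magic."""
    return read_head(path, compressed=False) == BAI_MAGIC


def looks_like_tbi(path):
    """True if the decompressed stream starts with the tabix magic."""
    return read_head(path) == TBI_MAGIC


def bedmethyl_rows_ok(path):
    """True if the file has at least one data row and every data row
    carries chrom, start, end with start < end (the leading BED columns)."""
    seen = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header/comment lines
            fields = line.split()
            if len(fields) < 3 or not int(fields[1]) < int(fields[2]):
                return False
            seen += 1
    return seen > 0
```

Assuming a local checkout of the test-data branch, a call such as `looks_like_bam("data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam")` would be expected to return `True`; the same pattern applies to the `.bai`, `.tbi`, and `.bedmethyl.gz` files listed above.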

🤖 Generated with Claude Code

…337)

Adds 5 files derived from the GIAB HG002 ONT run PAW70337 (5kHz, R10.4.1),
companion to the pod5 file added in nf-core#1968.

New files:
- nanopore/bam/HG002_PAW70337_giab_10reads.unaligned.bam (178 KB)
  Raw dorado basecaller output, no reference alignment.
  Used by: dorado/summary, dorado/trim, dorado/correct

- nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam (314 KB)
- nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam.bai (566 KB)
  Coordinate-sorted alignment to hg38 via minimap2.
  Used by: modkit/pileup, modkit/dmr, dorado/basecaller functional test

- nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz (69 KB)
- nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz.tbi (1.7 KB)
  modkit pileup output (5mCG+5hmCG), bgzipped + tabix indexed.
  Used by: modkit/localize, modkit/localize/plot, modkit/pileup/plot

Source: s3://ont-open-data/giab_2025.01/flowcells/HG002/PAW70337/pod5/
Sample: HG002 (NIST/Genome in a Bottle), public domain data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sahuno commented Apr 18, 2026

Hey @dialvarezs 👋, when you have a moment, could you please review this test-data PR?

This PR adds the unaligned/aligned HG002 BAM + bedMethyl test data (same GIAB 10-read subset as #1968, which you approved). It's the test-data dependency for two downstream module PRs:

Thanks!
