Support holmes extract from fastq #44

Open
3 tasks
alexiswl opened this issue Dec 13, 2024 · 8 comments
@alexiswl (Member)

Shower thought:
Rather than running somalier extract on BAMs, which we only produce for some samples, we could support extraction from a FASTQ pair instead.

i.e.:

minimap2 -ax sr ref.fa read1.fastq read2.fastq | samtools view -b > output.bam
samtools index output.bam
somalier extract output.bam

Where ref.fa is built from a bedtools 'slop' of our sites file (say 100 bp either side of each site), with those padded regions then extracted from the genome.
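A minimal sketch of the padding step, using awk as a stand-in for bedtools slop so it runs anywhere (the real pipeline would use bedtools slop with a genome file, then bedtools getfasta or samtools faidx to produce ref.fa; the file names and 100 bp pad here are illustrative):

```shell
# Toy sites file (hypothetical coordinates), one SNP site per BED line.
printf 'chr1\t1000\t1001\nchr2\t5000\t5001\n' > sites.bed

# Pad each interval by 100 bp either side, clamping the start at 0.
# Stand-in for: bedtools slop -i sites.bed -g genome.txt -b 100
awk -v OFS='\t' -v pad=100 '{
    start = $2 - pad; if (start < 0) start = 0
    print $1, start, $3 + pad    # no chromosome-end clipping in this sketch
}' sites.bed > sites_slop.bed

cat sites_slop.bed
```

The padded intervals would then be pulled out of the reference FASTA to make the small ref.fa that minimap2 aligns against.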

We can (hypothetically) stream ora files in with FIFOs (I have not tested the below at all!)

mkfifo read1fifo
mkfifo read2fifo

(orad --raw --stdout "<r1_presigned_url>" | tee read1fifo 1>/dev/null ) & \
(orad --raw --stdout "<r2_presigned_url>" | tee read2fifo 1>/dev/null ) & \
(minimap2 -ax sr ref.fa read1fifo read2fifo | samtools view -S -b > output.bam ) & \
wait
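As a toy sanity check that named pipes stream like this at all (printf and cat as stand-ins for orad and minimap2; the real tools remain untested, as noted above):

```shell
# Two background producers each write into a FIFO; one consumer reads both.
mkfifo read1.fifo read2.fifo
( printf 'R1 data\n' > read1.fifo ) &
( printf 'R2 data\n' > read2.fifo ) &
# Opening each FIFO for reading unblocks the matching writer.
cat read1.fifo read2.fifo > merged.txt
wait
rm read1.fifo read2.fifo
cat merged.txt
```

The open-for-read/open-for-write handshake is the part that matters: each writer blocks until the consumer opens its FIFO, so nothing is buffered to disk.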

This would solve a few things:

  1. We can then run wgts-qc with --enable-map-align-output set to false; this would still generate alignment stats but save on storage.
  2. We can run this on every fastq pair we sequence, not just WGS samples or cttsov2 samples.

minimap2 can handle pipes / fifos given the reference is small enough, which it should be in this case. See lh3/minimap2#532

Things to test:

  • Speed (how long does alignment take - would fargate be appropriate for this?)
  • Storage size (Assuming a bam over a small reference would be smaller than the ephemeral storage on a fargate instance, but need to test)
  • Does orad / minimap2 actually support FIFOs, or do inputs need to be written to disk first?
@alexiswl alexiswl added the enhancement New feature or request label Dec 13, 2024
@alexiswl alexiswl self-assigned this Dec 13, 2024
@ohofmann (Member)

If we are going down that route, would it make sense to consider switching to (or at least exploring) Heng's ntsm, a k-mer based approach? Paper, repo

@alexiswl (Member, Author)

consider switching to (or at least exploring) Heng's ntsm,

Possibly, though it would be incompatible with our existing bam-derived fingerprints.

But the paper looks really impressive

We found that ntsm ran at an average of ∼8 minutes, orders of magnitude less than bwa mem and minimap2 at ∼1.9 CPU hours and 5.9 CPU hours, respectively. Memory usage is low because we are only counting a very small specific subset of k-mers. Note that we did not include sorting or indexing time in this analysis as we hoped to illustrate that even without this in our comparison, k-mer counting was still much less resource intensive. Also, sorting can partially be run in parallel with alignments as reads are streamed.

@alexiswl (Member, Author)

Comparison between samples, though, is much slower than somalier:

[image: sample-comparison timing figure]

@ohofmann (Member)

The loss of the fingerprint backlog is painful, agreed. A switch would have had to happen as part of the migration, or we need to be okay with just comparing within a run until we build up a new collection. The slower comparison time is a really good argument, though; somalier seems to be on a different scaling slope, and ntsm might not scale. The main benefit I can think of is not having to worry about what we align against, but pretty much all the for-service work is human data anyway.

@andrewpatto (Member)

We don't use somalier (or ntsm) in the mode where that slowness would affect things - that's for an all-pairs comparison as run by the tool itself. Our results are Nx1 comparisons, not NxN.

We scale out sideways with lambdas, using fixed-size jobs (roughly 10 samples per lambda). So even if ntsm takes longer and we need to drop to 5 samples per lambda, that is no problem.

@alexiswl (Member, Author) commented Dec 19, 2024

From the paper, it looks like 5x coverage is enough though!

[image: coverage figure from the ntsm paper]

Taking the first 50 million read pairs (200 million FASTQ lines per mate) for a WGS sample:

ntsmCount \
  --threads 8 \
  --output summary_5x.txt \
  --snp /opt/ntsm/human_sites_n10.fa \
  <( \
       icav2 projectdata view /ora-compression/240816_A01052_0220_AHM7VHDSXC/20241122a3712050/WGS_TsqNano/PRJ241420_L2401276_S8_L002_R1_001.fastq.ora | \
      orad --raw --ora-reference /opt/orad/oradata/ --stdout - | \
      head --lines 200000000 \
  ) \
  <( \
      icav2 projectdata view /ora-compression/240816_A01052_0220_AHM7VHDSXC/20241122a3712050/WGS_TsqNano/PRJ241420_L2401276_S8_L002_R2_001.fastq.ora | \
      orad --raw --ora-reference /opt/orad/oradata/ --stdout - | \
      head --lines 200000000 \
  ) \
  > stdout_5x.txt

Returns:

Total Bases Considered: 15086908204
Total k-mers Considered: 13286372081
Total k-mers Recorded: 2719778
Distinct k-mers in initial set: 1270317
Total Sites: 96287
Sites Covered by at least one k-mer: 93584

Time: 483.206 s Memory: 137508 kbytes

Landing around the ~8 minute mark, as expected.

We can keep the --lines parameter of head dynamic, so that if we come across two samples that are still ambiguous we can rerun ntsm at a greater depth. For these 80x WGS samples, running without 'head' took around 2 hours to complete.
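A sketch of deriving that --lines value from a target coverage (pure shell arithmetic; the 3.2 Gbp genome size and 151 bp read length are assumptions, and each FASTQ record is 4 lines):

```shell
# Hypothetical inputs: target depth, genome size, and read length.
TARGET_COV=5
GENOME_SIZE=3200000000   # ~human genome, bp
READ_LEN=151             # bp per read

# Read pairs needed = coverage * genome / (2 * read_length);
# FASTQ lines per mate = reads * 4 (integer division, so slightly under target).
READS_PER_MATE=$(( TARGET_COV * GENOME_SIZE / (2 * READ_LEN) ))
LINES=$(( READS_PER_MATE * 4 ))
echo "$LINES"
```

For 5x this comes out a little above the round 200,000,000 used above, which matches the observed ~4.5x in the ntsmEval output below.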

ntsmEval stdout_5x.txt
sample  cov     errorRate       miss    hom     het
stdout_5x.txt   4.536324        0.000048        12268   71980   12039

@ohofmann (Member)

Thanks for testing. Really tempting. Could we build up a small collection over time (i.e., run this in parallel to somalier)? I understand we lose the historic information, but at least it wouldn't be an abrupt switch.

@andrewpatto (Member) commented Dec 23, 2024

I can easily make an "ntsm" steps extract that sends things to a "new" fingerprint folder, and if Alexis can trigger it at the right spot we can start to build up an alternate fingerprint db (in parallel with the existing setup, so no other changes).
