Support holmes extract from fastq #44

Open
3 tasks
alexiswl opened this issue Dec 13, 2024 · 8 comments
@alexiswl (Member)

Shower thought:
Rather than running somalier extract on BAMs, which we only produce for some samples, we could support extraction from a FASTQ pair instead.

i.e.:

minimap2 -ax sr ref.fa read1.fastq read2.fastq | samtools view -b > output.bam
samtools index output.bam
somalier extract output.bam

Where ref.fa is built from a bedtools 'slop' of our sites file (say 100 bp either side of each site), with those padded regions then extracted from the genome.
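A minimal sketch of the padding step, using awk as a stand-in for bedtools slop so it runs anywhere (the real pipeline would use bedtools slop with a genome file, then bedtools getfasta or samtools faidx to produce ref.fa; the file names and 100 bp pad here are illustrative):

```shell
# Toy sites file (hypothetical coordinates), one SNP site per BED line.
printf 'chr1\t1000\t1001\nchr2\t5000\t5001\n' > sites.bed

# Pad each interval by 100 bp either side, clamping the start at 0.
# Stand-in for: bedtools slop -i sites.bed -g genome.txt -b 100
awk -v OFS='\t' -v pad=100 '{
    start = $2 - pad; if (start < 0) start = 0
    print $1, start, $3 + pad    # no chromosome-end clipping in this sketch
}' sites.bed > sites_slop.bed

cat sites_slop.bed
```

The padded intervals would then be pulled out of the reference FASTA to make the small ref.fa that minimap2 aligns against.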

We can (hypothetically) stream ora files in with FIFOs (I have not tested the below at all!)

mkfifo read1fifo
mkfifo read2fifo

(orad --raw --stdout "<r1_presigned_url>" | tee read1fifo 1>/dev/null ) & \
(orad --raw --stdout "<r2_presigned_url>" | tee read2fifo 1>/dev/null ) & \
(minimap2 -ax sr ref.fa read1fifo read2fifo | samtools view -S -b > output.bam ) & \
wait
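As a toy sanity check that named pipes stream like this at all (printf and cat as stand-ins for orad and minimap2; the real tools remain untested, as noted above):

```shell
# Two background producers each write into a FIFO; one consumer reads both.
mkfifo read1.fifo read2.fifo
( printf 'R1 data\n' > read1.fifo ) &
( printf 'R2 data\n' > read2.fifo ) &
# Opening each FIFO for reading unblocks the matching writer.
cat read1.fifo read2.fifo > merged.txt
wait
rm read1.fifo read2.fifo
cat merged.txt
```

The open-for-read/open-for-write handshake is the part that matters: each writer blocks until the consumer opens its FIFO, so nothing is buffered to disk.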

This would solve a few things:

  1. We can then run wgts-qc with --enable-map-align-output set to false; this would still generate alignment stats but save on storage.
  2. We can run this on every fastq pair we sequence, not just WGS samples or cttsov2 samples.

minimap2 can handle pipes / fifos given the reference is small enough, which it should be in this case. See lh3/minimap2#532

Things to test:

  • Speed (how long does alignment take - would fargate be appropriate for this?)
  • Storage size (Assuming a bam over a small reference would be smaller than the ephemeral storage on a fargate instance, but need to test)
  • Does orad / minimap2 actually support FIFOs, or do inputs need to be written to disk first?
@alexiswl alexiswl added the enhancement New feature or request label Dec 13, 2024
@alexiswl alexiswl self-assigned this Dec 13, 2024
@ohofmann (Member)

If we are going down that route, would it make sense to consider switching to (or at least exploring) Heng's ntsm, a k-mer based approach? Paper, repo

@alexiswl (Member, Author)

consider switching to (or at least exploring) Heng's ntsm,

Possibly, though it would be incompatible with our existing bam-derived fingerprints.

But the paper looks really impressive

We found that ntsm ran at an average of ∼8 minutes, orders of magnitude less than bwa mem and minimap2 at ∼1.9 CPU hours and 5.9 CPU hours, respectively. Memory usage is low because we are only counting a very small specific subset of k-mers. Note that we did not include sorting or indexing time in this analysis as we hoped to illustrate that even without this in our comparison, k-mer counting was still much less resource intensive. Also, sorting can partially be run in parallel with alignments as reads are streamed.

@alexiswl (Member, Author)

Comparison between samples, though, is much slower than somalier:

[image: sample-comparison timing figure]

@ohofmann (Member)

The loss of the fingerprint backlog is painful, agreed. A switch would have had to happen as part of the migration, or we need to be okay with just comparing within a run until we build up a new collection. The slower comparison time is a really good argument, though; somalier seems to be on a different scaling slope, and ntsm might not scale. The main benefit I can think of is not having to worry about what we align against, but pretty much all the for-service work is human data anyway.

@andrewpatto (Member)

We don't use somalier (or ntsm) in the mode where that slowness would affect things - that's for an all-pairs comparison as run by the tool itself. Our results are Nx1 comparisons, not NxN.

We scale out sideways with lambdas, using fixed-size jobs (roughly 10 samples per lambda). So even if ntsm takes longer and we need to drop to 5 samples per lambda, that is no problem.

@alexiswl (Member, Author) commented Dec 19, 2024

From the paper, it looks like 5x coverage is enough though!

[image: coverage figure from the ntsm paper]

Taking the first 50 million read pairs (200 million FASTQ lines per mate) for a WGS sample:

ntsmCount \
  --threads 8 \
  --output summary_5x.txt \
  --snp /opt/ntsm/human_sites_n10.fa \
  <( \
       icav2 projectdata view /ora-compression/240816_A01052_0220_AHM7VHDSXC/20241122a3712050/WGS_TsqNano/PRJ241420_L2401276_S8_L002_R1_001.fastq.ora | \
      orad --raw --ora-reference /opt/orad/oradata/ --stdout - | \
      head --lines 200000000 \
  ) \
  <( \
      icav2 projectdata view /ora-compression/240816_A01052_0220_AHM7VHDSXC/20241122a3712050/WGS_TsqNano/PRJ241420_L2401276_S8_L002_R2_001.fastq.ora | \
      orad --raw --ora-reference /opt/orad/oradata/ --stdout - | \
      head --lines 200000000 \
  ) \
  > stdout_5x.txt

Returns:

Total Bases Considered: 15086908204
Total k-mers Considered: 13286372081
Total k-mers Recorded: 2719778
Distinct k-mers in initial set: 1270317
Total Sites: 96287
Sites Covered by at least one k-mer: 93584

Time: 483.206 s Memory: 137508 kbytes

Landing around the ~8 minute mark, as expected.

We can keep the --lines parameter of head dynamic, so that if we come across two samples that are still ambiguous we can rerun ntsm at a greater depth. For these 80x WGS samples, running without 'head' took around 2 hours to complete.
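A sketch of deriving that --lines value from a target coverage (pure shell arithmetic; the 3.2 Gbp genome size and 151 bp read length are assumptions, and each FASTQ record is 4 lines):

```shell
# Hypothetical inputs: target depth, genome size, and read length.
TARGET_COV=5
GENOME_SIZE=3200000000   # ~human genome, bp
READ_LEN=151             # bp per read

# Read pairs needed = coverage * genome / (2 * read_length);
# FASTQ lines per mate = reads * 4 (integer division, so slightly under target).
READS_PER_MATE=$(( TARGET_COV * GENOME_SIZE / (2 * READ_LEN) ))
LINES=$(( READS_PER_MATE * 4 ))
echo "$LINES"
```

For 5x this comes out a little above the round 200,000,000 used above, which matches the observed ~4.5x in the ntsmEval output below.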

ntsmEval stdout_5x.txt
sample  cov     errorRate       miss    hom     het
stdout_5x.txt   4.536324        0.000048        12268   71980   12039

@ohofmann (Member)

Thanks for testing. Really tempting. Could we build up a small collection over time (i.e., run this in parallel to somalier)? I understand we lose the historic information, but at least it wouldn't be an abrupt switch.

@andrewpatto (Member) commented Dec 23, 2024

I can easily make an "ntsm" steps extract that sends things to a "new" fingerprint folder, and if Alexis can trigger it at the right spot we can start to build up an alternate fingerprint db (in parallel with the existing setup, so no other changes).
