Support holmes extract from fastq #44
Comments
Possibly, would be incompatible with existing bam data though. But the paper looks really impressive:
"We found that ntsm ran at an average of ∼8 minutes, orders of magnitude less than bwa mem and minimap2 at ∼1.9 CPU hours and 5.9 CPU hours, respectively. Memory usage is low because we are only counting a very small specific subset of k-mers. Note that we did not include sorting or indexing time in this analysis as we hoped to illustrate that even without this in our comparison, k-mer counting was still much less resource intensive. Also, sorting can partially be run in parallel with alignments as reads are streamed."
The loss of fingerprint backlogs is painful, agreed. A switch would have had to happen as part of the migration, or we need to be okay with just comparing within a run until we build up a new collection. The slower comparison time is a really good argument, though, and even if somalier seems to be on a different slope
We don't use somalier (or ntsm) in the mode where that slowness would affect things - that's for an all-pairs comparison as run by the tool itself. Our results are Nx1 comparisons, not NxN. We scale out sideways with lambdas - with fixed size jobs (of like 10 samples per lambda). So even if |
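As a rough illustration of the fixed-size fan-out described above, the sketch below splits a list of samples into batches of 10, one batch per job. The sample IDs are hypothetical, and `echo` stands in for whatever actually invokes a lambda; this is not the project's real dispatch code.

```shell
#!/usr/bin/env bash
set -euo pipefail

batch_size=10   # "like 10 samples per lambda", per the comment above

# HYPOTHETICAL sample IDs, for illustration only
samples="$(for i in $(seq 1 25); do printf 'sample_%02d\n' "$i"; done)"

# xargs -n passes at most batch_size arguments per invocation, so each
# output line is one batch; echo is a stand-in for the job submission.
batches="$(printf '%s\n' "$samples" | xargs -n "$batch_size" echo)"

batch_count="$(printf '%s\n' "$batches" | wc -l)"
last_batch_size="$(printf '%s\n' "$batches" | tail -n 1 | wc -w)"

printf '%s\n' "$batches"
echo "batches: $batch_count (last batch holds $last_batch_size samples)"
```

Each batch is compared against the one query sample independently, so the number of jobs grows linearly with the collection size while the work per job stays fixed.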
From the paper, it looks like 5x coverage is enough though! Taking the first 50 million reads for a WGS sample:
ntsmCount \
--threads 8 \
--output summary_5x.txt \
--snp /opt/ntsm/human_sites_n10.fa \
<( \
icav2 projectdata view /ora-compression/240816_A01052_0220_AHM7VHDSXC/20241122a3712050/WGS_TsqNano/PRJ241420_L2401276_S8_L002_R1_001.fastq.ora | \
orad --raw --ora-reference /opt/orad/oradata/ --stdout - | \
head --lines 200000000 \
) \
<( \
icav2 projectdata view /ora-compression/240816_A01052_0220_AHM7VHDSXC/20241122a3712050/WGS_TsqNano/PRJ241420_L2401276_S8_L002_R2_001.fastq.ora | \
orad --raw --ora-reference /opt/orad/oradata/ --stdout - | \
head --lines 200000000 \
) \
> stdout_5x.txt
Returns as
Leaving us with around the ~8 minute mark as expected. We can always leave the
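The `head --lines 200000000` value in the command above follows from four lines per FASTQ record. The arithmetic can be sketched as below; the read length and genome size are assumptions (neither is stated in the thread), so the coverage figure is only a ballpark check against the "5x" target.

```shell
#!/usr/bin/env bash
set -euo pipefail

reads_per_file=50000000     # first 50 million reads from each of R1 and R2
lines_per_record=4          # a FASTQ record is header, sequence, '+', qualities
read_length=150             # ASSUMPTION: read length is not given in the thread
genome_size=3100000000      # ASSUMPTION: roughly 3.1 Gbp human genome

# Lines to keep per FASTQ file
head_lines=$((reads_per_file * lines_per_record))

# Total bases across both mates, then approximate depth
total_bases=$((reads_per_file * 2 * read_length))
coverage="$(awk -v b="$total_bases" -v g="$genome_size" 'BEGIN { printf "%.1f", b / g }')"

echo "head --lines $head_lines"
echo "approx coverage: ${coverage}x"
```

With longer reads the same 200 million lines would give slightly higher depth, so the line count is the knob to adjust if a different coverage is wanted.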
Thanks for testing. Really tempting. Could we build up a small collection over time (i.e., run this in parallel to somalier)? I understand we lose the historic information but at least it wouldn't be an abrupt switch.
I can easily make an "ntsm" steps extract that sends things to a "new" fingerprint folder, and if Alexis can trigger it at the right spot we can start to build up an alternate fingerprint db? (In parallel with the existing setup, so no other changes.)
Shower thought:
Rather than running extract on bams from only some files, we could support extraction from a fastq pair instead.
i.e
Where ref.fa is a bedtools 'slop' of our sites file (say 100 bp either side of each value)
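To make the padding step concrete, here is a small stand-in for what a 100 bp slop does to each site interval, written in awk so it runs without bedtools. The two BED rows are made-up coordinates, and unlike `bedtools slop` (which also needs a genome file to clamp at chromosome ends) this sketch only clamps the start at zero.

```shell
#!/usr/bin/env bash
set -euo pipefail

pad=100   # "say 100 bp either side", per the description above

# HYPOTHETICAL sites in BED form (chrom, 0-based start, end)
sites="$(printf 'chr1\t50\t51\nchr1\t1000\t1001\n')"

# Widen each interval by pad on both sides, never going below 0.
# With real data the equivalent would be along the lines of:
#   bedtools slop -i sites.bed -g genome.txt -b 100
#   bedtools getfasta -fi genome.fa -bed padded.bed -fo ref.fa
# (file names here are placeholders, not the project's actual paths)
padded="$(printf '%s\n' "$sites" | awk -v OFS='\t' -v pad="$pad" '
  { start = $2 - pad; if (start < 0) start = 0; print $1, start, $3 + pad }')"

printf '%s\n' "$padded"
```

Note that a reference built from padded regions has its own contig names and coordinates, so the sites file would need to be expressed in that coordinate system before extraction; that step is not covered here.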
We can (hypothetically) stream ora files in with FIFOs (I have not tested the below at all!)
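A minimal sketch of the FIFO pattern only, using stand-in commands rather than the real tools: `make_fastq` is a hypothetical generator standing in for the `icav2 projectdata view ... | orad ...` decompression stream, and `paste` stands in for the aligner that would read both mates. It shows two truncated streams being written into named pipes and consumed together without touching disk.

```shell
#!/usr/bin/env bash
set -eu

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# One named pipe per mate; nothing is written to disk.
mkfifo "$workdir/r1.fq" "$workdir/r2.fq"

# HYPOTHETICAL stand-in producer: emits n four-line FASTQ records.
make_fastq() {
  local mate="$1" n="$2" i
  for ((i = 1; i <= n; i++)); do
    printf '@read%d/%s\nACGT\n+\nIIII\n' "$i" "$mate"
  done
}

max_reads=3   # keep only the first few records, as head does in the thread

# Writers run in the background; each blocks until a reader opens its FIFO.
make_fastq 1 10 | head -n "$((max_reads * 4))" > "$workdir/r1.fq" &
make_fastq 2 10 | head -n "$((max_reads * 4))" > "$workdir/r2.fq" &

# Stand-in consumer: reads both FIFOs in step, as a paired-end aligner would.
paired_lines="$(paste "$workdir/r1.fq" "$workdir/r2.fq" | wc -l)"
wait

echo "paired lines: $paired_lines"
```

The consumer has to read both pipes concurrently (or the streams must be small) for this to work; a tool that reads all of R1 before opening R2 could stall once the second writer fills its pipe buffer.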
This would solve a few things:
, this would generate alignment stats but save on storage.
minimap2 can handle pipes / FIFOs given the reference is small enough, which it should be in this case. See lh3/minimap2#532
Things to test: