
add clumpify-based dedup #970

Open · wants to merge 56 commits into master
Conversation

tomkinsc (Member) commented Jul 2, 2019:

add bbmap.BBMapTool().dedup_clumpify(), along with unit tests; pass JVMmemory to bbmap and clumpify; add rmdup_clumpify_bam to read_utils.py; change TestRmdupUnaligned unit tests for bbmap to use read_utils.py::rmdup_clumpify_bam; add dedup_bam WDL task to tasks_read_utils.wdl

Commits:
- add bbmap.BBMapTool().dedup_clumpify(), along with unit tests
- pass JVMmemory to bbmap and clumpify; add rmdup_clumpify_bam to read_utils.py; change TestRmdupUnaligned unit tests for bbmap to use read_utils.py::rmdup_clumpify_bam; add dedup_bam WDL task to tasks_read_utils.wdl
- replace mvicuna-based read deduplication in taxon_filter.py::deplete() with clumpify-based deduplication that occurs farther upstream, in advance of BWA-based depletion; add dedup_bam WDL workflow; in the dedup_bam WDL task, create and emit a FastQC report of only the de-duplicated reads; update unit test input to include duplicate reads, and update the expected output for the test_taxon_filter::TestDepleteHuman integration tests to reflect the difference between clumpify output and the previous mvicuna output
Review threads (outdated, resolved):
- pipes/WDL/workflows/tasks/tasks_read_utils.wdl (2 threads)
- read_utils.py (2 threads)
- tools/bbmap.py
tomkinsc (Member, Author):

Let's hold off on merging this for now. While the tests pass here, the slow tests on DNAnexus (whose status is not reported back to GitHub) are currently failing during dedup with a NullPointerException in Picard. For example: https://platform.dnanexus.com/projects/F8PQ6380xf5bK0Qk0YPjB17P/monitor/job/FZZbVxj0xf5jZ07ZJ71gpFJ6
It's possible PR #977 may resolve the issue, so the changes there should be merged first.

Allow containments (where one sequence is shorter) when using bbmap clumpify to deduplicate
tomkinsc (Member, Author) commented Aug 26, 2019:

We should either remove containment=t or wait to merge this PR until there's movement to address the apparent bug in bbmap clumpify identified by @yesimon: https://sourceforge.net/p/bbmap/tickets/18/

dpark01 (Member) left a comment:

Some specific comments in a few places above, mostly on the WDL, and a broader question as a line comment on the taxon_filter.deplete step.

One thing I can't quite tell from the code diffs: how well do clumpify.sh and our wrapper code handle the preservation of BAM headers? This is something we go to extreme lengths to preserve in all our other fastq-based tool invocations (to the extent of splitting out into separate files by RG, running things like novoalign separately on each set, and re-merging at the end, to maintain proper RG-to-read mappings).
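The split-by-RG / run-per-group / re-merge pattern dpark01 describes can be sketched abstractly (a stdlib-only illustration, not the actual viral-ngs code; reads are modeled as (read_group, name) pairs and the per-group tool as a plain function):

```python
from collections import defaultdict

def process_per_read_group(reads, per_group_fn):
    """Split reads by their RG tag, run a tool on each group
    independently, then re-merge, preserving the RG of every
    surviving read. `reads` is an iterable of (rg, read_name) pairs."""
    by_rg = defaultdict(list)
    for rg, name in reads:
        by_rg[rg].append(name)
    merged = []
    for rg, names in by_rg.items():
        # the per-group step (e.g. dedup) never sees reads from other
        # groups, so RG-to-read mappings cannot be scrambled
        for name in per_group_fn(names):
            merged.append((rg, name))
    return merged

# example: deduplicate read names within each read group
reads = [("rg1", "a"), ("rg1", "a"), ("rg2", "a"), ("rg2", "b")]
deduped = process_per_read_group(reads, lambda names: sorted(set(names)))
```

In the real pipeline the per-group function would be the external tool invocation, and the merge step would re-attach the original BAM header.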

}
}

scatter(reads_bam in dedup.dedup_bam) {
Member:

I wouldn't double-scatter. You can just keep this as a single scatter block on the raw_reads and put all the task calls together in that single scatter. WDL interpreters/compilers are smart enough to figure out the DAG and parallelization opportunities within the scatter based on the dependencies between their inputs and outputs.

Member:

So specifically:

  scatter(raw_reads in illumina_demux.raw_reads_unaligned_bams) {
    call reads.dedup_bam as dedup {
        input:
            in_bam = raw_reads        
    }
    call reports.spikein_report as spikein {
      input:
        reads_bam = dedup.dedup_bam
    }
    call taxon_filter.deplete_taxa as deplete {
      input:
        raw_reads_unmapped_bam = dedup.dedup_bam
    }
    call assembly.assemble as spades {
      input:
        assembler = "spades",
        reads_unmapped_bam = deplete.cleaned_bam
    }
  }

Member:

Two more thoughts on this topic:

  1. Importantly, merging them would allow the execution platform to keep going on the linear DAG portions of each sample as they become ready without waiting for all samples to complete dedup before proceeding to the next steps.
  2. I wonder if we should consider running the spikein counting step on raw, non-deduplicated reads. ERCCs are so short that we might quickly hit an artificial upper bound on the counts if we do it on dedup output.

@@ -5,18 +5,26 @@ import "tasks_metagenomics.wdl" as metagenomics
import "tasks_taxon_filter.wdl" as taxon_filter
Member:

If you think this is ready and want to try it out, shouldn't you remove #DX_SKIP_WORKFLOW?

@@ -39,6 +47,6 @@ workflow demux_metag {
}
call metagenomics.kaiju as kaiju {
input:
Member:

We haven't really used kaiju regularly via WDL yet, but I'm betting we may want to move it to a scatter-on-single-sample execution mode (like everything else in our WDLs except kraken). Its database is about 4x smaller (I'm guessing the localization time is just a few minutes) and the algorithm itself is much slower, so the cost efficiency (algorithmic compute time vs VM wall-clock time) of kaiju on a single sample is much better than kraken's; we might as well move it within the same scatter block as well.

@@ -27,7 +35,7 @@ workflow demux_metag {

call metagenomics.krakenuniq as kraken {
input:
reads_unmapped_bam = illumina_demux.raw_reads_unaligned_bams,
reads_unmapped_bam = dedup.dedup_bam
}
call reports.aggregate_metagenomics_reports as metag_summary_report {
Member:

Can we call aggregate_metagenomics_reports a second time on the kaiju outputs as well?

Member Author:

metagenomics.py::taxlevel_summary() hasn't been adapted/tested to read kaiju summary files yet. I'd like that to be a separate PR (this one is already way beyond its initial scope).

}
}

scatter(reads_bam in dedup.dedup_bam) {
Member:

See my comment in demux_metag about combining the scatter blocks.

taxon_filter.py
sanitize = not args.do_not_sanitize) as bamToDeplete:
sanitize = not args.do_not_sanitize) as bam_to_dedup:

read_utils.rmdup_mvicuna_bam(bam_to_dedup, args.rmdupBam, JVMmemory=args.JVMmemory)
Member:

In this new world, should we consider:

  1. dropping deduplication entirely from taxon_filter.deplete? Since you now include it in all the pipelines prior to depletion anyway, and since it never really fit the scope of the command's name. It was historically embedded in a funny place between depletion steps primarily because of its performance profile: it was slower than bmtagger (so we ran it after that) but faster than blastn (so we ran it before that).
  2. dropping mvicuna altogether, if we think bbmap is better?

fix bug in conda command quiet calling ('-q -y' must be after 'conda <command>')
for bbmap clumpify de-dup, merge like-library RGs and perform deduplication on each, then gather the IDs of kept reads, and filter the input SAM based on the list of IDs to keep so as to maintain header and RG information. Move most of the processing to bbmap.py::dedup_clumpify so it has a simpler interface that accepts one bam and emits one bam. ToDo: parallelize across LBs
input:
reads_unmapped_bam = illumina_demux.raw_reads_unaligned_bams
# classify de-duplicated reads to taxa via krakenuniq
call metagenomics.krakenuniq as krakenuniq {
Member:

Oh, I don't think I'd move this one inside the scatter. Unless maybe you want to take the opportunity to collect some data from this branch on cost-efficiency changes (I've always been curious). The kraken and krakenuniq dbs have always burned about 10-15 mins per job on db localization and unpacking, which, multiplied by the number of samples on highmem machines, adds up (that's what motivated the batched approach). On principle I've always wanted to improve our staging time and move it within the scatter, but I assumed we'd never get there until we move to kraken2 (much smaller databases). That said, what's the dollars-per-demux-plus on our test/CI flowcell comparing scattered kraken vs batched kraken? If it's only a couple bucks extra, I might be fine with moving it there now.

Member Author:

I was curious about the cost delta, but CI builds don't show us costs since they're billed to the DNAnexus Science Team. I'll just move it back outside the scatter...

change to clumpify for pre-depletion dedup; deduplication can likely be removed from depletion entirely in the future once all calls in the codebase have been updated to have one fewer arg
remove rmdup from depletion call, remove *.rmdup.bam from positional arguments for depletion CLI parser, remove *.rmdup.bam from inputs where depletion is called (test cases, WDL), remove *.rmdup.bam from expected depletion outputs. Change Snakemake merge_one_per_sample rule to call rmdup_clumpify_bam rather than rmdup_mvicuna_bam
tools/bbmap.py

for line in inf:
    if (line_num % 4) == 0:
        idVal = line.rstrip('\n')[1:]
        if idVal.endswith('/1'):
Member Author:

Do you have an example of what you mean? NovaSeq-like read IDs look the same in the few bam files I compared (NovaSeq vs HiSeq). Since we're doing the conversion to fastq files it should hopefully be fine? This is also the same code we've been using for mvicuna dedup, so the behavior should be the same (though that's not to say we haven't missed a code path to update for NovaSeq support).

yesimon (Contributor) commented Dec 3, 2019:

Read IDs that identify the pairs with a space and 1:N:0:barcodes instead of /1 or /2
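A normalizer covering both ID styles yesimon describes might look like this (a hedged sketch; `split_read_id` is a hypothetical helper, and the ' 1:N:0:barcode' form is the standard Casava 1.8+ header layout):

```python
def split_read_id(fastq_header):
    """Return (read_name, pair_number_or_None) for both ID styles:
    older '/1' '/2' suffixes and Casava 1.8+ ' 1:N:0:barcode'."""
    rid = fastq_header.lstrip("@").rstrip("\n")
    # older style: pair number appended to the name with a slash
    if rid.endswith("/1") or rid.endswith("/2"):
        return rid[:-2], int(rid[-1])
    # Casava 1.8+ style: space-separated description '1:N:0:ACGT...'
    name, _, desc = rid.partition(" ")
    if desc and desc[0] in "12" and desc[1:2] == ":":
        return name, int(desc[0])
    return name, None
```

Keying dedup bookkeeping on the returned name rather than the raw header line would make the /1-suffix check above agnostic to which instrument produced the fastq.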

tools/bbmap.py

for fq in (outFastq1, outFastq2):
    with util.file.open_or_gzopen(fq, 'rt') as inf:
        line_num = 0
        for line in inf:
Contributor:

for line_num, line in enumerate(inf)
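Applied to the loop above, the `enumerate` suggestion would look roughly like this (a sketch assuming the surrounding FASTQ-parsing context, not the actual patch):

```python
def read_ids_from_fastq_lines(lines):
    """Collect read IDs from FASTQ text: every 4th line starting at
    line 0 is a header; strip '@' and any '/1' or '/2' suffix."""
    ids = []
    # enumerate replaces the manual line_num counter from the diff
    for line_num, line in enumerate(lines):
        if line_num % 4 == 0:
            id_val = line.rstrip("\n")[1:]
            if id_val.endswith("/1") or id_val.endswith("/2"):
                id_val = id_val[:-2]
            ids.append(id_val)
    return ids

fastq = ["@r1/1\n", "ACGT\n", "+\n", "FFFF\n",
         "@r2/1\n", "ACGT\n", "+\n", "FFFF\n"]
```

This drops the separate `line_num = 0` initialization and any manual increment, so the counter can never drift out of sync with the iteration.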


per_lb_read_lists.append(per_lb_read_list)
# merge per-library read lists together
util.file.concat(per_lb_read_lists, readListAll)
Contributor:

Can't the read lists be written to a single file to start with instead of concat'ing later?

Member Author:

The read lists are separate in prep for parallelizing across libraries.
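That planned per-library parallelization, followed by the concat, could take roughly this shape (stdlib only; `dedup_one_library` is a hypothetical stand-in for the real per-LB clumpify invocation):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def dedup_one_library(lb_name, read_names):
    """Stand-in for a per-library clumpify run: write the kept
    (deduplicated) read IDs for one library to its own file."""
    fd, path = tempfile.mkstemp(prefix=lb_name + ".", suffix=".ids.txt")
    with os.fdopen(fd, "w") as outf:
        for name in sorted(set(read_names)):
            outf.write(name + "\n")
    return path

def dedup_all_libraries(libraries, read_list_all):
    """Run the per-library step in parallel, then concatenate the
    per-LB read lists into one combined list, as in the PR."""
    with ThreadPoolExecutor() as pool:
        per_lb_lists = list(pool.map(
            lambda item: dedup_one_library(*item), sorted(libraries.items())))
    with open(read_list_all, "w") as outf:
        for path in per_lb_lists:
            with open(path) as inf:
                outf.write(inf.read())
            os.remove(path)

libs = {"lb1": ["a", "a", "b"], "lb2": ["c"]}
```

Keeping one list file per library means the parallel step only ever appends to its own file, so no locking is needed before the final concat.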

tomkinsc (Member, Author):
Here's another one, time to port over to viral-core?
