Mapping samples to multiple reference fasta #181

rbpisupati · 2020-11-30T13:50:58Z

Hi,

I was thinking of adapting this pipeline to take care of multiple reference genomes. Something like each sample would be aligned to different reference fasta file. Any votes for such a feature?

The input if given would be a csv file with sample_id, path_to_reference_fasta, path_to_sample_fastq

Cheers,
Rahul

The text was updated successfully, but these errors were encountered:

phue · 2020-12-01T11:25:19Z

I think this would be a nice addition, particularly for population (epi)genetics.
Something similar to a design file as was done over at nf-core/chipseq by @drpatelh

We need to think about how this could work together with iGenomes though

drpatelh · 2020-12-01T19:56:06Z

I have done something similar in the nanoseq pipeline. Check out the format of the input samplesheet here. The important thing is to have a validation script for the samplesheet that checks the entries provided by users. I wrote this function to resolve the iGenomes per sample if provided. Note, you will also have to configure the channels to only build genome indices once if multiple samples share the same reference. It will be tricky but not impossible 🙂

this is inspired by the functionality in nf-core/nanoseq and nf-core/rnaseq The idea is to require a samplesheet to run the pipeline, which will allow for single/paired end auto-detection and mapping samples against different reference genomes. addresses nf-core#181

phue · 2021-03-25T10:50:23Z

There is now a dsl2 branch with this functionality.

Please give it a try 👍

ewels · 2022-11-03T23:37:26Z

NB: This was removed from the dsl2 branch so that we could work with centralised modules in nf-core/modules.

However, I then wrote nf-core modules patch to allow local modifications of remote DSL2 modules for exactly this use case ☝🏻

Now it just needs implementing again..

njspix · 2022-11-09T14:52:27Z

Hey all, thanks so much for all your hard work on the DSL2 pipeline!
Just wanted to note that an alternative to patching the DSL2 module could be a pattern something like the following:

input = Channel.of(
    ["homo_sapiens", [fastq_1, fastq_2]],
    ["homo_sapiens", [fastq_1, fastq_2]],
    ["mus_musculus", [fastq_1, fastq_2]]
)
refs = Channel.of(["homo_sapiens", path/to/ref], ["mus_musculus", path/to/ref])

input
    .combine(refs, by: 0) // "homo_sapiens", [fastq_1, fastq_2], path/to/ref
    .multiMap{ it ->
        fastqs: it[0..1]
        refs: it[2]
    }.set{ for_align }

aligner(for_align.fastqs, for_align.refs)

I've had succes using this in pipelines that I've written for personal use. The combine operation syncs everything up, and the multiMap operator kind of 'locks' the two output channels together so that they're always in sync.

mobilegenome · 2023-07-11T13:18:12Z

It would be great to get this issue going. For my population epigenomics project, I will have an individual genome for each sample, so I would priotize providing paths to FASTA files in the 'genome column. Also, I'm happy to chip in work but would appreciate if you could help with some starting points. If I understand correctly there has been some previous work, that's still in the history.

Second, do I understand correctly that would like to keep the nf-core modules untouched so they stay for general use?

@njspix do you have a full repository of the pipeline with your additions available and mind sharing it?

njspix · 2023-07-12T12:55:39Z

Hey @mobilegenome thanks for your interest! Unfortunately I don't have a lot to contribute - most of the previous work on this (I think) was done with bespoke/custom modules, which we prefer to avoid (especially for core modules like alignment, etc).

The core issue here is that the aligner modules take the fastq files as one channel and the index/reference in a separate channel. Nextflow ordinarily doesn't care much about the order of items in a channel, so it's difficult to sync up two channels (e.g. a 'fastq' channel with files from species a, b, and c; and a 'reference' channel with indexes for species a, b, and c).

To overcome this issue, there are at least 2 ideas -

patch the aligner so that it only takes one channel as input (fastq and reference)
use the multiMap approach above, which syncs everything up inside one channel, then 'splits' that channel into two sub-channels, which have everything in the same order.

Phil wrote the patch tool to facilitate solution (1) - where you want to take a 'stock' module and customize it for the pipeline. The patch tool makes changes to the module locally, but in an automated and reproducible way.

Solution (2) is really untested in a real-life pipeline, but would avoid having to patch a stock module.

That's all I have - sorry I can't devote more time to this right now!

edmundmiller · 2023-08-25T18:36:41Z

So is this:
1.

sample,fastq_1,fastq_2,genome
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz,hg19
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,,hg18

or

Just supporting something like nextflow run --fasta hg18,hg19 and run it on all the samples.

The first I think is going to be fundamentally incompatible with nf-core modules for now (See nextflow-io/nextflow#2085). Nextflow doesn't join across metas and there's just no way to link between tuples like you'd expect.

The second option might be pretty easy.

I think in either case, it might just be easier to run the pipeline twice, but I get the desire to want to do so.

njspix · 2023-08-25T18:41:33Z

Hey @emiller88 could you clarify why 1) is incompatible with nf-core modules?? I'm not following you 100%

edmundmiller · 2023-08-25T19:00:55Z

It's a rabbit hole, I'll try to keep this short 😆

So usually there's a patten like:

    input:
    tuple val(meta), path(reads)
    path index

So you'd think you could do:

    input:
    tuple val(meta), path(reads)
    tuple val(meta), path(index)

And it would give you a join for free on the meta. But they just silent overwrite the first one.

So you really need:

    input:
    tuple val(meta), path(reads), path(index)

To ensure you're matching the index with the proper reads.

But it starts to get really messy quickly.

I guess with could nf-core modules patch every module that needed it, and that wouldn't be too painful.

njspix · 2023-08-25T19:06:01Z

ok - thank you so much for clarifying!!
If i'm understanding correctly, this is the problem that my suggestion above was designed to address.

It's a bit magical (and I don't mean that in a good way), but it allows you to use the pattern

    input:
    tuple val(meta), path(reads)
    path index

as is by synchronizing the two input channels. in essence, you join the reads channel and the index channel, then multiMap them into two sub-channels. These can then be passed to the aligner separately (e.g. align(reads, indexes) but will stay in sync with each other.

let me know if i'm missing the bus here!

Thanks much for your work on this.

edmundmiller · 2023-08-25T19:18:21Z

If i'm understanding correctly, this is the problem that my suggestion #181 (comment) was designed to address.

Ah, I missed that, that's pretty cool!

I guess we can just add it to a meta map, but in the methylseq pipeline, I don't actually see any other use of the reference besides just alignment? I just searched for "fasta", "index" "genome" and it only looks like it's the first few steps. So that would be pretty easy to maintain!

njspix · 2023-08-25T19:24:36Z

Yeah it's a pretty nifty feature! I'm not sure about bismark and other aligners, but I know that biscuit utilizes the reference (not the index, though) for several steps, including bisulfite conversion filtering, pileup, and merging CpGs.

It's a bit inelegant but my first attempt would probably be something like this for each process requiring a reference:

process1.out.map{meta, bam -> [meta.genome, meta, bam]}
    .combine(refs, by: 0)
    .multiMap{ genome, meta, bam, ref ->
        bams: [meta, bam]
        refs: ref
    }.set{ for_process2 }
    
process2(for_process2.bams, for_process2.refs)

If that's too clunky I wonder if we could abstract out the repeated code into a function...

mobilegenome · 2023-09-04T15:12:55Z

Hi all, thanks for your work and sorry for not getting back to your earlier answer @njspix! Unfortunately, I couldn't spend any time on this so far. Regarding the use of the reference/index in steps following the alignment, it is also used in the methylationextractor and coverage2cytosine modules.

njspix · 2023-09-05T12:11:59Z

Good to know, thanks very much!

phue mentioned this issue Jan 19, 2021

dsl2 port #182

Closed

17 tasks

phue added dsl2 enhancement New feature or request labels Mar 25, 2021

phue mentioned this issue Nov 4, 2021

Template update, big DSL2 modules update #222

Closed

7 tasks

phue mentioned this issue May 25, 2022

dsl2 port #199

Merged

3 tasks

edmundmiller added this to the 2.0.0 milestone Nov 4, 2022

ewels modified the milestones: 2.0.0, 2.1 Nov 5, 2022

ewels modified the milestones: 2.1, 2.2 Nov 10, 2022

ewels modified the milestones: 2.2, 2.3 Nov 18, 2022

sateeshperi removed this from the 2.4 milestone Jun 2, 2023

edmundmiller added this to the 3.0.0 milestone Aug 25, 2023

edmundmiller changed the title ~~Mapping bs-seq samples to multiple specific reference fasta~~ Mapping samples to multiple reference fasta Aug 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mapping samples to multiple reference fasta #181

Mapping samples to multiple reference fasta #181

rbpisupati commented Nov 30, 2020

phue commented Dec 1, 2020

drpatelh commented Dec 1, 2020

phue commented Mar 25, 2021

ewels commented Nov 3, 2022

njspix commented Nov 9, 2022

mobilegenome commented Jul 11, 2023

njspix commented Jul 12, 2023

edmundmiller commented Aug 25, 2023

njspix commented Aug 25, 2023

edmundmiller commented Aug 25, 2023 •

edited

Loading

njspix commented Aug 25, 2023

edmundmiller commented Aug 25, 2023

njspix commented Aug 25, 2023

mobilegenome commented Sep 4, 2023

njspix commented Sep 5, 2023

Mapping samples to multiple reference fasta #181

Mapping samples to multiple reference fasta #181

Comments

rbpisupati commented Nov 30, 2020

phue commented Dec 1, 2020

drpatelh commented Dec 1, 2020

phue commented Mar 25, 2021

ewels commented Nov 3, 2022

njspix commented Nov 9, 2022

mobilegenome commented Jul 11, 2023

njspix commented Jul 12, 2023

edmundmiller commented Aug 25, 2023

njspix commented Aug 25, 2023

edmundmiller commented Aug 25, 2023 • edited Loading

njspix commented Aug 25, 2023

edmundmiller commented Aug 25, 2023

njspix commented Aug 25, 2023

mobilegenome commented Sep 4, 2023

njspix commented Sep 5, 2023

edmundmiller commented Aug 25, 2023 •

edited

Loading