Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mapping samples to multiple reference fasta #181

Open
rbpisupati opened this issue Nov 30, 2020 · 15 comments
Open

Mapping samples to multiple reference fasta #181

rbpisupati opened this issue Nov 30, 2020 · 15 comments
Labels
dsl2 enhancement New feature or request
Milestone

Comments

@rbpisupati
Copy link

Hi,

I was thinking of adapting this pipeline to take care of multiple reference genomes. Something like each sample would be aligned to different reference fasta file. Any votes for such a feature?

The input if given would be a csv file with sample_id, path_to_reference_fasta, path_to_sample_fastq

Cheers,
Rahul

@phue
Copy link
Member

phue commented Dec 1, 2020

I think this would be a nice addition, particularly for population (epi)genetics.
Something similar to a design file as was done over at nf-core/chipseq by @drpatelh

We need to think about how this could work together with iGenomes though

@drpatelh
Copy link
Member

drpatelh commented Dec 1, 2020

I have done something similar in the nanoseq pipeline. Check out the format of the input samplesheet here. The important thing is to have a validation script for the samplesheet that checks the entries provided by users. I wrote this function to resolve the iGenomes per sample if provided. Note, you will also have to configure the channels to only build genome indices once if multiple samples share the same reference. It will be tricky but not impossible 🙂

phue added a commit to phue/methylseq that referenced this issue Jan 19, 2021
this is inspired by the functionality in nf-core/nanoseq and
nf-core/rnaseq
The idea is to require a samplesheet to run the pipeline, which will
allow for single/paired end auto-detection and mapping samples against
different reference genomes.

addresses nf-core#181
@phue phue mentioned this issue Jan 19, 2021
17 tasks
phue added a commit to phue/methylseq that referenced this issue Mar 22, 2021
this is inspired by the functionality in nf-core/nanoseq and
nf-core/rnaseq
The idea is to require a samplesheet to run the pipeline, which will
allow for single/paired end auto-detection and mapping samples against
different reference genomes.

addresses nf-core#181
@phue phue added dsl2 enhancement New feature or request labels Mar 25, 2021
@phue
Copy link
Member

phue commented Mar 25, 2021

There is now a dsl2 branch with this functionality.

Please give it a try 👍

@phue phue mentioned this issue May 25, 2022
3 tasks
@ewels
Copy link
Member

ewels commented Nov 3, 2022

NB: This was removed from the dsl2 branch so that we could work with centralised modules in nf-core/modules.

However, I then wrote nf-core modules patch to allow local modifications of remote DSL2 modules for exactly this use case ☝🏻

Now it just needs implementing again..

@edmundmiller edmundmiller added this to the 2.0.0 milestone Nov 4, 2022
@ewels ewels modified the milestones: 2.0.0, 2.1 Nov 5, 2022
@njspix
Copy link
Contributor

njspix commented Nov 9, 2022

Hey all, thanks so much for all your hard work on the DSL2 pipeline!
Just wanted to note that an alternative to patching the DSL2 module could be a pattern something like the following:

input = Channel.of(
    ["homo_sapiens", [fastq_1, fastq_2]],
    ["homo_sapiens", [fastq_1, fastq_2]],
    ["mus_musculus", [fastq_1, fastq_2]]
)
refs = Channel.of(["homo_sapiens", path/to/ref], ["mus_musculus", path/to/ref])

input
    .combine(refs, by: 0) // "homo_sapiens", [fastq_1, fastq_2], path/to/ref
    .multiMap{ it ->
        fastqs: it[0..1]
        refs: it[2]
    }.set{ for_align }

aligner(for_align.fastqs, for_align.refs)

I've had succes using this in pipelines that I've written for personal use. The combine operation syncs everything up, and the multiMap operator kind of 'locks' the two output channels together so that they're always in sync.

@ewels ewels modified the milestones: 2.1, 2.2 Nov 10, 2022
@ewels ewels modified the milestones: 2.2, 2.3 Nov 18, 2022
@sateeshperi sateeshperi removed this from the 2.4 milestone Jun 2, 2023
@mobilegenome
Copy link

It would be great to get this issue going. For my population epigenomics project, I will have an individual genome for each sample, so I would priotize providing paths to FASTA files in the 'genome column. Also, I'm happy to chip in work but would appreciate if you could help with some starting points. If I understand correctly there has been some previous work, that's still in the history.

Second, do I understand correctly that would like to keep the nf-core modules untouched so they stay for general use?

@njspix do you have a full repository of the pipeline with your additions available and mind sharing it?

@njspix
Copy link
Contributor

njspix commented Jul 12, 2023

Hey @mobilegenome thanks for your interest! Unfortunately I don't have a lot to contribute - most of the previous work on this (I think) was done with bespoke/custom modules, which we prefer to avoid (especially for core modules like alignment, etc).

The core issue here is that the aligner modules take the fastq files as one channel and the index/reference in a separate channel. Nextflow ordinarily doesn't care much about the order of items in a channel, so it's difficult to sync up two channels (e.g. a 'fastq' channel with files from species a, b, and c; and a 'reference' channel with indexes for species a, b, and c).

To overcome this issue, there are at least 2 ideas -

  1. patch the aligner so that it only takes one channel as input (fastq and reference)
  2. use the multiMap approach above, which syncs everything up inside one channel, then 'splits' that channel into two sub-channels, which have everything in the same order.

Phil wrote the patch tool to facilitate solution (1) - where you want to take a 'stock' module and customize it for the pipeline. The patch tool makes changes to the module locally, but in an automated and reproducible way.

Solution (2) is really untested in a real-life pipeline, but would avoid having to patch a stock module.

That's all I have - sorry I can't devote more time to this right now!

@edmundmiller edmundmiller added this to the 3.0.0 milestone Aug 25, 2023
@edmundmiller edmundmiller changed the title Mapping bs-seq samples to multiple specific reference fasta Mapping samples to multiple reference fasta Aug 25, 2023
@edmundmiller
Copy link
Collaborator

So is this:
1.

sample,fastq_1,fastq_2,genome
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz,hg19
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,,hg18

or

  1. Just supporting something like nextflow run --fasta hg18,hg19 and run it on all the samples.

The first I think is going to be fundamentally incompatible with nf-core modules for now (See nextflow-io/nextflow#2085). Nextflow doesn't join across metas and there's just no way to link between tuples like you'd expect.

The second option might be pretty easy.

I think in either case, it might just be easier to run the pipeline twice, but I get the desire to want to do so.

@njspix
Copy link
Contributor

njspix commented Aug 25, 2023

Hey @emiller88 could you clarify why 1) is incompatible with nf-core modules?? I'm not following you 100%

@edmundmiller
Copy link
Collaborator

edmundmiller commented Aug 25, 2023

It's a rabbit hole, I'll try to keep this short 😆

So usually there's a patten like:

    input:
    tuple val(meta), path(reads)
    path index

So you'd think you could do:

    input:
    tuple val(meta), path(reads)
    tuple val(meta), path(index)

And it would give you a join for free on the meta. But they just silent overwrite the first one.

So you really need:

    input:
    tuple val(meta), path(reads), path(index)

To ensure you're matching the index with the proper reads.

But it starts to get really messy quickly.

I guess with could nf-core modules patch every module that needed it, and that wouldn't be too painful.

@njspix
Copy link
Contributor

njspix commented Aug 25, 2023

ok - thank you so much for clarifying!!
If i'm understanding correctly, this is the problem that my suggestion above was designed to address.

It's a bit magical (and I don't mean that in a good way), but it allows you to use the pattern

    input:
    tuple val(meta), path(reads)
    path index

as is by synchronizing the two input channels. in essence, you join the reads channel and the index channel, then multiMap them into two sub-channels. These can then be passed to the aligner separately (e.g. align(reads, indexes) but will stay in sync with each other.

let me know if i'm missing the bus here!

Thanks much for your work on this.

@edmundmiller
Copy link
Collaborator

If i'm understanding correctly, this is the problem that my suggestion #181 (comment) was designed to address.

Ah, I missed that, that's pretty cool!

I guess we can just add it to a meta map, but in the methylseq pipeline, I don't actually see any other use of the reference besides just alignment? I just searched for "fasta", "index" "genome" and it only looks like it's the first few steps. So that would be pretty easy to maintain!

@njspix
Copy link
Contributor

njspix commented Aug 25, 2023

Yeah it's a pretty nifty feature! I'm not sure about bismark and other aligners, but I know that biscuit utilizes the reference (not the index, though) for several steps, including bisulfite conversion filtering, pileup, and merging CpGs.

It's a bit inelegant but my first attempt would probably be something like this for each process requiring a reference:

process1.out.map{meta, bam -> [meta.genome, meta, bam]}
    .combine(refs, by: 0)
    .multiMap{ genome, meta, bam, ref ->
        bams: [meta, bam]
        refs: ref
    }.set{ for_process2 }
    
process2(for_process2.bams, for_process2.refs)

If that's too clunky I wonder if we could abstract out the repeated code into a function...

@mobilegenome
Copy link

Hi all, thanks for your work and sorry for not getting back to your earlier answer @njspix! Unfortunately, I couldn't spend any time on this so far. Regarding the use of the reference/index in steps following the alignment, it is also used in the methylationextractor and coverage2cytosine modules.

@njspix
Copy link
Contributor

njspix commented Sep 5, 2023

Good to know, thanks very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dsl2 enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

8 participants