Trimming #450

Open
milos7250 opened this issue Mar 18, 2025 · 17 comments
Labels
enhancement New feature or request

Comments

@milos7250

Description of feature

I think detecting doublets in the quantified cells would be a good addition to this pipeline. Using something like DoubletDetection, additional noise coming from doublets could be removed from the data.

I was thinking of adding this to the pipeline after the cellbender post-processing step, but I'd like to hear others' opinions on this first.

@milos7250 added the enhancement label Mar 18, 2025
@grst
Member

grst commented Mar 18, 2025

Hi, it's already part of the #scdownstream pipeline. I'm not sure how many features of scdownstream we'd want to incorporate here.

@nictru, what's your take here?

@nictru
Contributor

nictru commented Mar 25, 2025

I am generally fine with adding all the preprocessing steps of scdownstream as postprocessing to scrnaseq. Namely this would be the following:

  • Empty droplet detection (already added)
  • Doublet detection
  • Ambient RNA removal

The workflow for the latter two would be similar to what we did for empty droplet detection:

  1. Add the modules to the shared modules repository
  2. Create a shared subworkflow
  3. Import it into both pipelines

Efforts were started during the Barcelona hackathon 2024 but have slowed down since then. I hope to create some momentum over the next few weeks.

@milos7250
Author

Have you also thought about using fastp/sortmerna prior to quantification, or is that generally discouraged for scRNA?

@grst
Member

grst commented Mar 25, 2025

fastp for adapter trimming and sortmerna for rRNA detection, or what did you have in mind?

At least for the 10x protocols, I don't think this is typically done. Ribosomal reads are instead considered as a QC metric during downstream analysis.

@milos7250
Author

Yes, I meant fastp for adapter trimming and quality filtering, and sortmerna for rRNA filtering.

If you have the full R1 reads with 150bp, you can run fastp with --max_len1 28 --max_len2 9999 --trim_poly_x --correction --overrepresentation_analysis --length_required 28. This will trim adapters from read2, and any pair with read1 trimmed below 28bp will be discarded.
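As a rough sketch of the invocation described above, the flags can be assembled into an argument list in Python. The file names and the `build_fastp_cmd` helper are hypothetical; only the trimming flags come from this comment, combined with fastp's `--in1/--in2/--out1/--out2` input and output options.

```python
import shlex


def build_fastp_cmd(r1_in, r2_in, r1_out, r2_out, r1_len=28):
    """Assemble the fastp call from the comment as an argument list.

    The helper and its parameter names are illustrative, not part of any pipeline.
    """
    return [
        "fastp",
        "--in1", r1_in, "--in2", r2_in,
        "--out1", r1_out, "--out2", r2_out,
        "--max_len1", str(r1_len),        # cut R1 at its tail down to r1_len bases
        "--max_len2", "9999",             # large cap so R2 is not shortened
        "--trim_poly_x",
        "--correction",
        "--overrepresentation_analysis",
        "--length_required", str(r1_len),  # discard reads shorter than r1_len
    ]


# Hypothetical file names, for illustration only.
cmd = build_fastp_cmd(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` on a machine where fastp is installed; the sketch itself only prints the command.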

Could you elaborate for me on how ribosomal reads are used as a QC metric? I struggle to find any information on this. Thanks.

@grst
Member

grst commented Mar 26, 2025

> If you have the full R1 reads with 150bp, you can run fastp with --max_len1 28 --max_len2 9999 --trim_poly_x --correction --overrepresentation_analysis --length_required 28. This will trim adapters from read2, and any pair with read1 trimmed below 28bp will be discarded.

I don't think it is required. Afaik the tools just ignore excess nucleotides from R1, and 10x explicitly advises against trimming because apparently it can damage cell barcodes:

> Do not trim adapters during demultiplexing. Leave these settings blank. Trimming adapters from reads can potentially damage the 10x barcodes and the UMIs, resulting in pipeline failure or data loss. If you are using an Illumina sample sheet for demultiplexing with bcl2fastq, BCL Convert or our mkfastq pipeline, please remove these lines under the [Settings] section: Adapter or AdapterRead1 or AdapterRead2.

(https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials/inputs/cr-mkfastq)


> Could you elaborate for me on how ribosomal reads are used as a QC metric? I struggle to find any information on this. Thanks.

See, for instance, https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html#filtering-low-quality-cells

The fraction of ribosomal reads is one metric amongst several that are considered to identify outlier cells.
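To make the idea concrete, here is a minimal sketch of how such a metric could be computed and used to flag outlier cells. It assumes a small dense count matrix, ribosomal genes that can be recognised by the name prefixes `RPS`/`RPL` (a human/mouse naming convention that may not hold for other organisms), and a threshold of 5 median absolute deviations chosen purely for illustration. The gene names and counts are made up.

```python
from statistics import median


def ribosomal_fraction(counts, gene_names, prefixes=("RPS", "RPL")):
    """Per-cell fraction of counts assigned to genes with a ribosomal name prefix.

    counts: list of per-cell count lists, aligned with gene_names.
    prefixes: assumed naming convention; adjust for the annotation at hand.
    """
    ribo_idx = [i for i, g in enumerate(gene_names) if g.upper().startswith(prefixes)]
    fractions = []
    for cell in counts:
        total = sum(cell)
        ribo = sum(cell[i] for i in ribo_idx)
        fractions.append(ribo / total if total else 0.0)
    return fractions


def mad_outliers(values, n_mads=5.0):
    """Flag values further than n_mads median absolute deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [abs(v - med) > n_mads * mad for v in values]


# Toy data: four genes, six cells; the last cell is dominated by ribosomal counts.
gene_names = ["RPS3", "RPL5", "ACT1", "GAPC1"]
counts = [
    [9, 9, 41, 41],
    [10, 10, 40, 40],
    [11, 11, 39, 39],
    [10, 11, 40, 39],
    [10, 9, 41, 40],
    [45, 45, 5, 5],
]
fractions = ribosomal_fraction(counts, gene_names)
flags = mad_outliers(fractions)
print(flags)
```

In practice this would be one of several per-cell metrics looked at together, as the linked chapter describes, rather than a filter applied on its own.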

@milos7250
Author

Thanks for sharing those links with me. I am currently working with plant genomes where I unfortunately do not have easy access to annotated ribosomal/mitochondrial genes.

The reason I asked about trimming is that I have samples where 75% of the R2 reads have the Template Switching Oligo (TSO) sequence at the 5' end (AAGCAGTGGTATCAACGCAGAGTACATGGG) that I wanted to trim. The R2 reads also had poly-a tailing that I wanted to remove.

I ended up adapting the pipeline to use the cutadapt module. This gives more control over what adapters you trim, from which end and from which read. I used the sequences mentioned here (https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html) as the adapters and got where I wanted. This increased my simpleaf mapping rates by about 20%, and only dropped about 2% of reads that were too short after trimming.
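The exact cutadapt call is not given in this thread, so the following is only a sketch of what a 5' TSO trim restricted to R2 could look like, assembled as an argument list in Python. It relies on cutadapt's paired-end options (`-G` for a 5' adapter on R2, `-o`/`-p` for the two outputs, `--minimum-length` with a per-read `LEN:LEN2` value, `--pair-filter`). The file names, the helper name and the R2 minimum length of 20 are assumptions.

```python
import shlex

# TSO sequence quoted in the comment above.
TSO = "AAGCAGTGGTATCAACGCAGAGTACATGGG"


def build_cutadapt_cmd(r1_in, r2_in, r1_out, r2_out, min_len_r1=28, min_len_r2=20):
    """Assemble a paired-end cutadapt call that only modifies R2.

    Illustrative helper; min_len_r2 is an assumed value, not taken from the thread.
    """
    return [
        "cutadapt",
        "-G", TSO,                                         # 5' adapter, searched in R2 only
        "--minimum-length", f"{min_len_r1}:{min_len_r2}",  # separate limits for R1 and R2
        "--pair-filter", "any",                            # drop the pair if either read fails
        "-o", r1_out, "-p", r2_out,
        r1_in, r2_in,
    ]


# Hypothetical file names, for illustration only.
cutadapt_cmd = build_cutadapt_cmd(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
)
print(shlex.join(cutadapt_cmd))
```

Because no R1 adapter option is given, R1 passes through unmodified and is only subject to the length filter.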

If that is a desirable addition to this pipeline, I could create a draft PR with the changes I have made, but I am not sure that I would have the time to write proper tests.

@grst
Member

grst commented Apr 1, 2025

@DongzeHE, what's your recommendation for trimming for simpleaf?

@DongzeHE
Member

DongzeHE commented Apr 3, 2025

I think it really depends on the quality of the data. Usually, for high-quality datasets, the mapping rate is 90%-95%. Since simpleaf is a transcriptome-based tool, it means the majority of the reads come from transcripts.

> I have samples where 75% of the R2 reads have the Template Switching Oligo (TSO) sequence at the 5' end (AAGCAGTGGTATCAACGCAGAGTACATGGG) that I wanted to trim.

From the section "(5) Use Fragmentase to fragment cDNA and perform A-tailing:" in the link you shared, we know that the TSO is attached to the 5' end of the synthesized full-length cDNA. If you see it in your reads, which all come from the 3' end of the synthesized full-length cDNA, this means that 75% of the synthesized full-length cDNAs are shorter than the expected fragment length, usually 300 bp. In this case, rather than trimming these TSO sequences, I would worry about whether the RNAs were already fragmented before the cDNA library preparation step.
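A quick way to check how widespread this is in a sample is to count the R2 reads that begin with the TSO. The sketch below works on plain sequence strings and tolerates a couple of mismatches; the function names, the mismatch limit of 2 and the toy reads are assumptions for illustration.

```python
# TSO sequence quoted earlier in this thread.
TSO = "AAGCAGTGGTATCAACGCAGAGTACATGGG"


def starts_with_tso(seq, tso=TSO, max_mismatches=2):
    """True if the first len(tso) bases of seq match tso with few mismatches."""
    if len(seq) < len(tso):
        return False
    # zip stops at the end of tso, so only the read prefix is compared.
    mismatches = sum(a != b for a, b in zip(seq, tso))
    return mismatches <= max_mismatches


def tso_fraction(seqs, **kwargs):
    """Fraction of reads that start with the TSO."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    return sum(starts_with_tso(s, **kwargs) for s in seqs) / len(seqs)


# Toy reads: two carry the TSO at the 5' end (one with a single mismatch).
reads = [
    TSO + "ACGTACGTAC",
    TSO[:-1] + "A" + "TTTT",
    "ACGT" * 10,
    "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTTT",
]
fraction = tso_fraction(reads)
print(fraction)
```

On real data the sequences would come from a FASTQ reader, and a subsample of reads is usually enough to estimate the fraction.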

> The R2 reads also had poly-a tailing that I wanted to remove.

PolyA reads are frequent in single-cell data. This is mainly because most 3' assays, for example the 10x Chromium assays, use polyT primers to capture the polyA tail of polyadenylated transcripts. When a polyT primer binds in the middle of a polyA tail, the unbound As upstream of it will be sequenced.

PolyA reads will usually not be mapped by simpleaf. This is because, in contrast to the genome, there are usually no polyA sites in the transcriptome.
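For illustration, removing such a tail from a read amounts to cutting off a trailing run of As. The sketch below does this with exact matching only and an assumed minimum run length of 10; real trimmers also tolerate sequencing errors inside the tail, which this does not.

```python
def trim_poly_a_tail(seq, qual, min_run=10):
    """Remove a trailing run of 'A' if it is at least min_run bases long.

    Returns the (possibly shortened) sequence and quality strings.
    Exact matching only; min_run is an illustrative default.
    """
    end = len(seq)
    while end > 0 and seq[end - 1] == "A":
        end -= 1
    run = len(seq) - end
    if run >= min_run:
        return seq[:end], qual[:end]
    return seq, qual


# Toy read: 10 informative bases followed by a 15 bp polyA tail.
read = "GATTACAGGC" + "A" * 15
quals = "I" * len(read)
trimmed_seq, trimmed_qual = trim_poly_a_tail(read, quals)
print(trimmed_seq, len(trimmed_qual))
```

A read that consists only of As becomes empty after trimming, which is why a length filter is normally applied afterwards.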

As for my recommendation, I would say it would be great if we could add this module for the biological reads (read 2), because (1) cellranger has an internal trimming module, (2) it might help with processing low-quality data, and (3) there are existing nf-core modules for trimming.

The only caveat is that this will have a limited effect on good-quality data. So I am not sure whether we want to add it as an optional or a mandatory step, as trimming usually takes a while.

Best,
Dongze

@grst
Member

grst commented Apr 3, 2025

Thanks for your input! I'm open to adding trimming to the simpleaf, kallistobustools and starsolo workflows then. But I will have to insist on adding tests before this gets merged.

As for the trimming tool, do you have any preference? fastp is on the faster end in my experience.

@grst changed the title from Doublet Detection to Trimming on Apr 3, 2025
@DongzeHE
Member

DongzeHE commented Apr 3, 2025

I am not very experienced with this step so I actually don't have a preference. For me both fastp and TrimGalore look good. Maybe we can follow nf-core/rnaseq to combine the fastqc and trimming steps using TrimGalore?

@milos7250
Author

In my attempt to remove the adapters, neither fastp nor Trim Galore helped me. Trim Galore only trims adapters from the 3' end (the TSO was at the start of my reads), and with fastp I was not able to trim only read2 without touching read1. I also have a feeling that fastp only trims from the 3' end, as it detects adapters by overlap analysis, but I was not able to find this in the docs.
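The difference between the two trimming directions can be shown with a simplified model: a 3' adapter is removed together with everything after it, while a 5' adapter is removed together with everything before it. The two functions below are an exact-match toy version of that behaviour, not the algorithm of any particular tool.

```python
# TSO sequence quoted earlier in this thread.
TSO = "AAGCAGTGGTATCAACGCAGAGTACATGGG"


def trim_3prime(seq, adapter):
    """3' adapter model: drop the adapter and everything after it."""
    i = seq.find(adapter)
    return seq if i == -1 else seq[:i]


def trim_5prime(seq, adapter):
    """5' adapter model: drop the adapter and everything before it."""
    i = seq.find(adapter)
    return seq if i == -1 else seq[i + len(adapter):]


# Toy read with the TSO at the very start, as described above.
read = TSO + "CTGACTGA"
as_3prime = trim_3prime(read, TSO)
as_5prime = trim_5prime(read, TSO)
print(repr(as_3prime), repr(as_5prime))
```

Treating a leading TSO as a 3' adapter leaves an empty read, whereas treating it as a 5' adapter keeps the downstream bases, which is the behaviour wanted here.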

What I ended up doing was to use the fastq_trim_fastp_fastqc subworkflow as a template and swap the fastp module for cutadapt.

Cutadapt has very fine-tunable control of adapter trimming, but the downside is you need to specify the adapters you want to trim.

@DongzeHE
Member

DongzeHE commented Apr 4, 2025

Let's just ask our guru: @FelixKrueger, would you mind sharing some insights here?

@FelixKrueger
Contributor

I have to read this thread a bit more carefully, but it is true that Trim Galore, in its current form, removes adapters from the 3' end only, as this is usually the right thing to do for most applications. It also allows hard-trimming at the 5' end for known sequence contaminations.

I believe I have added an option to pass custom arguments to Cutadapt, which might allow the specification of 5' trimming sequences, but this would need some further looking at. I believe it is already working on the dev branch (see FelixKrueger/TrimGalore#184), but back then I didn't want to re-implement the polyA trimming for read 2, so I stopped working on that at some point last year. Let me know if you'd like me to revive this in some form....

@milos7250
Author

Since you mention polyA for read2, this is also something that cutadapt does not support out of the box. The docs currently say:

> On paired-end reads, --poly-a removes poly-A tails from R1 and poly-T “heads” from R2.

I've managed to cheat around this by swapping R1 and R2, but I don't particularly like that solution for the pipeline, as it makes handling filenames with the nf-core/cutadapt module difficult. An alternative option was to run cutadapt in single-end mode on R2 to trim the polyA, and then run a second pass of cutadapt in paired-end mode to filter out reads that were too short. But ideally all the trimming, polyA removal and length filtering would be done in one pass.
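The two-pass alternative could be sketched roughly as follows, again as argument lists in Python. Pass 1 runs single-end on R2 with `--poly-a` and no length filter, so the reads stay in the same order and number as R1; pass 2 runs paired-end and only applies the length filter. The file names, the helper name and the minimum lengths are assumptions, and `--poly-a` needs a cutadapt version that provides it.

```python
import shlex


def build_two_pass_cmds(r1_in, r2_in, r2_tmp, r1_out, r2_out,
                        min_len_r1=28, min_len_r2=20):
    """Return the two cutadapt calls of the two-pass workaround.

    Illustrative helper; minimum lengths are assumed values.
    """
    # Pass 1: single-end polyA trimming of R2, keeping every read (even empty ones)
    # so that R2 stays synchronised with the untouched R1 file.
    pass1 = ["cutadapt", "--poly-a", "-o", r2_tmp, r2_in]
    # Pass 2: paired-end length filtering only; a pair is dropped if either read
    # is shorter than its limit.
    pass2 = [
        "cutadapt",
        "--minimum-length", f"{min_len_r1}:{min_len_r2}",
        "--pair-filter", "any",
        "-o", r1_out, "-p", r2_out,
        r1_in, r2_tmp,
    ]
    return pass1, pass2


# Hypothetical file names, for illustration only.
pass1, pass2 = build_two_pass_cmds(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz", "polya_trimmed_R2.fastq.gz",
    "filtered_R1.fastq.gz", "filtered_R2.fastq.gz",
)
print(shlex.join(pass1))
print(shlex.join(pass2))
```

The cost of this approach is an extra intermediate R2 file and a second read through the data, which is what a single-pass solution would avoid.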

It just seemed that neither fastp nor cutadapt was particularly well suited for scRNA samples.

@FelixKrueger
Contributor

Hmm, if there is a niche to fill and it is useful, maybe we should give it a go?

@DongzeHE
Member

DongzeHE commented Apr 4, 2025


5 participants