Trimming #450

milos7250 · 2025-03-18T12:15:01Z

Description of feature

I think detecting doublets in the quantified cells would be a good addition to this pipeline. Using something like DoubletDetection, additional noise coming from doublets could be removed from the data.

I was thinking of adding this to the pipeline after the cellbender post-processing step, but I'd like to hear others' opinions on this first.

grst · 2025-03-18T12:25:02Z

Hi, it's already part of the #scdownstream pipeline. I'm not sure how many features of scdownstream we'd want to incorporate here.

@nictru, what's your take here?

nictru · 2025-03-25T09:40:12Z

I am generally fine with adding all the preprocessing steps of scdownstream as postprocessing to scrnaseq. Namely this would be the following:

- Empty droplet detection (already added)
- Doublet detection
- Ambient RNA removal

The workflow for the latter two would be similar to what we did for empty droplet detection:

1. Add the modules to the shared modules repository
2. Create a shared subworkflow
3. Import it into both pipelines

Efforts were already started during the Barcelona hackathon 2024, but have slowed down since then. I hope I can create some momentum over the next few weeks.

milos7250 · 2025-03-25T11:02:08Z

Have you also thought about using fastp/sortmerna prior to quantification, or is that generally discouraged for scRNA?

grst · 2025-03-25T13:23:14Z

fastp for adapter trimming and sortmerna for rRNA detection, or what did you have in mind?

At least for the 10x protocols, I don't think this is typically done. Ribosomal reads are instead considered as a QC metric during downstream analysis.

milos7250 · 2025-03-25T14:46:16Z

Yes, I meant fastp for adapter trimming and quality filtering, and sortmerna for rRNA filtering.

If you have the full R1 reads with 150bp, you can run fastp with --max_len1 28 --max_len2 9999 --trim_poly_x --correction --overrepresentation_analysis --length_required 28. This will trim adapters from read2, and any pair with read1 trimmed below 28bp will be discarded.
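For reference, the invocation described above could be assembled roughly as follows. This is only a sketch: the file names are hypothetical placeholders, and the `-i`/`-I`/`-o`/`-O` input and output options are standard fastp options that were not part of the comment itself.

```python
import shlex


def fastp_command(r1, r2, out_r1, out_r2):
    """Build a fastp argument list using the flags quoted in the comment.

    All file names passed in are placeholders chosen for illustration.
    """
    return [
        "fastp",
        "-i", r1, "-I", r2,            # paired input: R1 and R2
        "-o", out_r1, "-O", out_r2,    # paired output
        "--max_len1", "28",            # cut R1 at its tail down to 28 bp
        "--max_len2", "9999",          # effectively no length cap on R2
        "--trim_poly_x",               # trim poly-X tails
        "--correction",                # overlap-based base correction
        "--overrepresentation_analysis",
        "--length_required", "28",     # minimum read length of 28 bp
    ]


cmd = fastp_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and value separate, which makes it easy to pass to `subprocess.run` without shell quoting issues.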

Could you elaborate for me on how ribosomal reads are used as a QC metric? I struggle to find any information on this. Thanks.

grst · 2025-03-26T07:37:36Z

> If you have the full R1 reads with 150bp, you can run fastp with --max_len1 28 --max_len2 9999 --trim_poly_x --correction --overrepresentation_analysis --length_required 28. This will trim adapters from read2, and any pair with read1 trimmed below 28bp will be discarded.

I don't think it is required. Afaik the tools just ignore excess nucleotides from R1, and 10x explicitly advises against trimming because apparently it can damage cell barcodes:

> Do not trim adapters during demultiplexing. Leave these settings blank. Trimming adapters from reads can potentially damage the 10x barcodes and the UMIs, resulting in pipeline failure or data loss. If you are using an Illumina sample sheet for demultiplexing with bcl2fastq, BCL Convert or our mkfastq pipeline, please remove these lines under the [Settings] section: Adapter or AdapterRead1 or AdapterRead2.

(https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials/inputs/cr-mkfastq)
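As an illustration of the 10x advice quoted above, a small script could drop the adapter entries from the `[Settings]` section of a sample sheet. This is a minimal sketch: the function name, the example sheet, and its contents are made up for illustration, and real sample sheets may need more careful parsing.

```python
ADAPTER_KEYS = {"adapter", "adapterread1", "adapterread2"}


def strip_adapter_settings(sample_sheet_text):
    """Remove Adapter, AdapterRead1 and AdapterRead2 lines from [Settings]."""
    kept = []
    in_settings = False
    for line in sample_sheet_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Section header: remember whether we are now inside [Settings].
            in_settings = stripped.lower().startswith("[settings]")
        elif in_settings:
            key = stripped.split(",", 1)[0].strip().lower()
            if key in ADAPTER_KEYS:
                continue  # skip the line so no adapter trimming is requested
        kept.append(line)
    return "\n".join(kept)


# Hypothetical sample sheet, for illustration only.
example = """[Header]
IEMFileVersion,4
[Settings]
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
[Data]
Lane,Sample_ID,index
1,sample_a,SI-GA-A1
"""

cleaned = strip_adapter_settings(example)
print(cleaned)
```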

> Could you elaborate for me on how ribosomal reads are used as a QC metric? I struggle to find any information on this. Thanks.

See, for instance, https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html#filtering-low-quality-cells

The fraction of ribosomal reads is one metric among several that are considered to identify outlier cells.
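To make the idea concrete, here is a minimal sketch of computing a per-cell ribosomal fraction and flagging outliers by median absolute deviation (MAD). It works on gene counts rather than raw reads, and it assumes human-style gene symbols (`RPS`/`RPL` prefixes) and a threshold of 5 MADs; the function names and the toy counts are hypothetical.

```python
from statistics import median


def ribosomal_fraction(counts_by_gene, prefixes=("RPS", "RPL")):
    """Fraction of a cell's counts assigned to genes with ribosomal prefixes."""
    total = sum(counts_by_gene.values())
    if total == 0:
        return 0.0
    ribo = sum(
        count for gene, count in counts_by_gene.items()
        if gene.upper().startswith(prefixes)
    )
    return ribo / total


def mad_outliers(values, n_mads=5):
    """Flag values further than n_mads MADs away from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [abs(v - med) > n_mads * mad for v in values]


# Toy data: five cells, the last one dominated by ribosomal counts.
cells = [
    {"RPS3": 12, "RPL5": 8, "ACTB": 80},
    {"RPS3": 12, "RPL5": 10, "ACTB": 78},
    {"RPS3": 10, "RPL5": 8, "ACTB": 82},
    {"RPS3": 11, "RPL5": 10, "ACTB": 79},
    {"RPS3": 50, "RPL5": 40, "ACTB": 10},
]
fractions = [ribosomal_fraction(cell) for cell in cells]
flags = mad_outliers(fractions)
print(list(zip(fractions, flags)))
```

In practice this metric is looked at together with others (total counts, detected genes, mitochondrial fraction) rather than on its own.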

milos7250 · 2025-04-01T12:56:10Z

Thanks for sharing those links with me. I am currently working with plant genomes where I unfortunately do not have easy access to annotated ribosomal/mitochondrial genes.

The reason I asked about trimming is that I have samples where 75% of the R2 reads have the Template Switching Oligo (TSO) sequence at the 5' end (AAGCAGTGGTATCAACGCAGAGTACATGGG) that I wanted to trim. The R2 reads also had poly-a tailing that I wanted to remove.
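A rough way to estimate how many R2 reads start with the TSO might look like the sketch below. It only counts exact prefix matches (no mismatches or partial TSOs), so it would underestimate the true fraction; the helper names and the toy records are hypothetical.

```python
TSO = "AAGCAGTGGTATCAACGCAGAGTACATGGG"


def read_sequences(fastq_lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:
            yield line.strip()


def fraction_with_tso_prefix(sequences, tso=TSO):
    """Fraction of reads that begin with the full TSO sequence."""
    total = 0
    with_tso = 0
    for seq in sequences:
        total += 1
        if seq.startswith(tso):
            with_tso += 1
    return with_tso / total if total else 0.0


# Toy FASTQ content: three of four reads start with the TSO.
toy_fastq = []
for name, seq in [
    ("r1", TSO + "ACGTACGT"),
    ("r2", TSO + "TTGCA"),
    ("r3", "GGGCCCAAATTTGGGCCC"),
    ("r4", TSO + "CATG"),
]:
    toy_fastq += ["@" + name, seq, "+", "I" * len(seq)]

print(fraction_with_tso_prefix(read_sequences(toy_fastq)))
```

For a real file, the lines could come from `gzip.open(path, "rt")` instead of the toy list.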

I ended up adapting the pipeline to use the cutadapt module. This gives more control over what adapters you trim, from which end and from which read. I used the sequences mentioned here (https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html) as the adapters and got where I wanted. This increased my simpleaf mapping rates by about 20%, and only dropped about 2% of reads that were too short after trimming.
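The exact cutadapt command was not posted in the thread, so the following is only a guess at its general shape. It uses cutadapt's `-G` option (5' adapter on R2), `-m` with per-read minimum lengths, and `-o`/`-p` for paired output; the file names and the minimum lengths of 28 and 20 are assumptions chosen for illustration.

```python
import shlex

TSO = "AAGCAGTGGTATCAACGCAGAGTACATGGG"


def cutadapt_tso_command(r1, r2, out_r1, out_r2, tso=TSO, min_r1=28, min_r2=20):
    """Build a cutadapt argument list that trims a 5' TSO from R2 only.

    min_r1 and min_r2 are illustrative values, not taken from the thread.
    """
    return [
        "cutadapt",
        "-G", tso,                      # 5' adapter on R2; R1 is left untouched
        "-m", f"{min_r1}:{min_r2}",     # minimum lengths for R1 and R2
        "--pair-filter=any",            # drop the pair if either read is too short
        "-o", out_r1, "-p", out_r2,     # paired output files
        r1, r2,
    ]


cmd = cutadapt_tso_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
)
print(shlex.join(cmd))
```

Because no R1 adapter option is given, the barcode and UMI read passes through unchanged, apart from pairs removed by the length filter.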

If that is a desirable addition to this pipeline, I could create a draft PR with the changes I have made, but I am not sure that I would have the time to write proper tests.

grst · 2025-04-01T13:32:27Z

@DongzeHE, what's your recommendation for trimming for simpleaf?

DongzeHE · 2025-04-03T03:19:14Z

I think it really depends on the quality of the data. Usually, for high-quality datasets, the mapping rate is 90%-95%. Since simpleaf is a transcriptome-based tool, it means the majority of the reads come from transcripts.

> I have samples where 75% of the R2 reads have the Template Switching Oligo (TSO) sequence at the 5' end (AAGCAGTGGTATCAACGCAGAGTACATGGG) that I wanted to trim.

From the section "(5) Use Fragmentase to fragment cDNA and perform A-tailing:" in the link you shared, we know that TSOs are attached to the 5' end of the synthesized full-length cDNAs. If you see the TSO in your reads, which all come from the 3' end of the synthesized full-length cDNAs, this means that 75% of the synthesized full-length cDNAs are shorter than the expected fragment length, usually 300 bp. In this case, rather than trimming these TSO sequences, I would worry about whether the RNAs were fragmented even before the cDNA library preparation step.

> The R2 reads also had poly-a tailing that I wanted to remove.

PolyA reads are frequent in single-cell. This is mainly because most 3' assays, for example 10x Chromium assays, use polyT primers to capture the polyA tail of polyadenylated transcripts. When polyT primers bind middle bases of polyA tails, the upstream unbound As will be sequenced.

PolyA reads usually will not be mapped by simpleaf. This is because, in contrast to the genome, there are usually no polyA sites in the transcriptome.

As for my recommendation, I would say it would be great if we could add this module for the biological reads (read 2), because (1) cellranger has an internal trimming module, (2) it might help with processing low-quality data, and (3) there are existing nf-core modules for trimming.

The only caveat is that this will have a limited effect on good-quality data. So I am not sure whether we want to add it as an optional or a mandatory step, as trimming usually takes a while.

Best,
Dongze

grst · 2025-04-03T06:43:42Z

Thanks for your input! I'm open to adding trimming to the simpleaf, kallistobustools and starsolo workflows then. But I will have to insist on adding tests before this gets merged.

As for the trimming tool, do you have any preference? fastp is on the faster end in my experience.

DongzeHE · 2025-04-03T22:19:38Z

I am not very experienced with this step so I actually don't have a preference. For me both fastp and TrimGalore look good. Maybe we can follow nf-core/rnaseq to combine the fastqc and trimming steps using TrimGalore?

milos7250 · 2025-04-04T08:01:25Z

In my attempt to remove the adapters, neither fastp nor Trim Galore helped me. Trim Galore only trims adapters from the 3' end (the TSO was at the start of my reads), and with fastp I was not able to specify that only read2 should be trimmed and read1 left untouched. I also have a feeling that fastp only trims from the 3' end, as it detects adapters by overlap analysis, but I was not able to find this in the docs.

What I ended up doing was using the fastq_trim_fastp_fastqc subworkflow as a template and swapping the fastp module for cutadapt.

Cutadapt has very fine-tunable control of adapter trimming, but the downside is you need to specify the adapters you want to trim.

DongzeHE · 2025-04-04T14:51:53Z

Let's just ask our guru: @FelixKrueger, would you mind sharing some insights here?

FelixKrueger · 2025-04-04T15:04:24Z

I have to read this thread a bit more carefully, but it is true that Trim Galore, in its current form, removes adapters from the 3' end only, as this is usually the right thing to do for most applications, and allows hard-trimming on the 5' end for known sequence contaminations.

I believe I have added an option to pass custom arguments to Cutadapt, which might allow the specification of 5' trimming sequences, but this would need some further looking at. I believe it is already working on the dev branch (see FelixKrueger/TrimGalore#184), but back then I didn't want to re-implement the polyA trimming for read 2, so I stopped working on that at some point last year. Let me know if you'd like me to revive this in some form.

milos7250 · 2025-04-04T15:21:01Z

Since you mention polyA for read2, this is also something that cutadapt does not support out of the box. The docs currently say:

> On paired-end reads, --poly-a removes poly-A tails from R1 and poly-T “heads” from R2.

I've managed to cheat around this by swapping R1 and R2, but I don't particularly like that solution for the pipeline, as it makes handling filenames with the nf-core/cutadapt module difficult. An alternative option was to run cutadapt in single-end mode on R2 to trim the polyA, and then run a second pass of cutadapt in paired-end mode to filter out reads that were too short. But ideally all the adapter trimming, polyA trimming and length filtering would be done in one pass.
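The two-pass alternative described above could be sketched like this. It is untested and makes several assumptions: a cutadapt version that supports `--poly-a`, hypothetical file names, and illustrative minimum lengths of 28 and 20. The first pass must not discard any reads, otherwise R1 and R2 would no longer be in sync for the second pass.

```python
import shlex


def two_pass_cutadapt(r1, r2, tmp_r2, out_r1, out_r2, min_r1=28, min_r2=20):
    """Return the two cutadapt argument lists for the two-pass workaround."""
    # Pass 1: single-end mode on R2 only, so --poly-a trims poly-A tails
    # from R2. No length filter here, so every read is kept and the read
    # order still matches R1.
    pass1 = ["cutadapt", "--poly-a", "-o", tmp_r2, r2]
    # Pass 2: paired-end mode, no trimming, only length filtering of pairs.
    pass2 = [
        "cutadapt",
        "-m", f"{min_r1}:{min_r2}",
        "--pair-filter=any",
        "-o", out_r1, "-p", out_r2,
        r1, tmp_r2,
    ]
    return pass1, pass2


pass1, pass2 = two_pass_cutadapt(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz", "polya_trimmed_R2.fastq.gz",
    "filtered_R1.fastq.gz", "filtered_R2.fastq.gz",
)
for cmd in (pass1, pass2):
    print(shlex.join(cmd))
```

The intermediate R2 file is the main cost of this approach, which is one reason a single-pass solution would be preferable.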

It just seemed that neither fastp, nor cutadapt were particularly well-suited for scRNA samples.

FelixKrueger · 2025-04-04T22:00:27Z

Hmm, if there is a niche to fill and it is useful, maybe we should give it a go?

DongzeHE · 2025-04-04T23:11:40Z

10X's explanation of their trimming strategy: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/overview#read-trimming

milos7250 added the "enhancement" (New feature or request) label on Mar 18, 2025

grst changed the title from "Doublet Detection" to "Trimming" on Apr 3, 2025
