Trimming #450
Comments
Hi, it's already part of the #scdownstream pipeline. I'm not sure how many features of scdownstream we'd want to incorporate here. @nictru, what's your take here?
I am generally fine with adding all the preprocessing steps of scdownstream as postprocessing to scrnaseq. Namely this would be the following:
The workflow for the latter two would be similar to what we did for empty droplet detection:
Efforts were already started during the Barcelona hackathon 2024 but have slowed down since then. I hope to create some momentum over the next weeks.
Have you also thought about using fastp/sortmerna prior to quantification, or is that generally discouraged for scRNA?
fastp for adapter trimming and sortmerna for rRNA detection, or what did you have in mind? At least for the 10x protocols, I don't think this is typically done. Ribosomal reads are instead considered as a QC metric during downstream analysis.
Yes, I meant fastp for adapter trimming and quality filtering, and sortmerna for rRNA filtering. If you have the full R1 reads with 150 bp, you can run fastp with
Could you elaborate for me on how ribosomal reads are used as a QC metric? I struggle to find any information on this. Thanks.
I don't think it is required. Afaik the tools just ignore excess nucleotides from R1, and 10x explicitly advises against trimming because apparently it can damage cell barcodes:
(https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials/inputs/cr-mkfastq)
See, for instance, https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html#filtering-low-quality-cells The fraction of ribosomal reads is one metric amongst several that are considered to identify outlier cells.
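To make the ribosomal QC metric concrete, here is a minimal sketch of how a per-cell ribosomal fraction could be computed from raw counts. It assumes human-style gene symbols, where ribosomal protein genes start with `RPS` or `RPL`; the function name `ribosomal_fraction` and the toy data are made up for illustration, and real analyses would typically use a dedicated QC function from their single-cell toolkit instead.

```python
# Sketch: per-cell fraction of counts coming from ribosomal protein genes.
# Assumption: human-style gene symbols, where ribosomal protein genes
# start with "RPS" or "RPL"; other species need a different rule.

def ribosomal_fraction(gene_names, cell_counts, prefixes=("RPS", "RPL")):
    """Return the fraction of a cell's total counts assigned to ribosomal genes."""
    total = sum(cell_counts)
    if total == 0:
        return 0.0
    ribo = sum(
        count
        for gene, count in zip(gene_names, cell_counts)
        if gene.upper().startswith(prefixes)
    )
    return ribo / total


# Toy data (hypothetical): two cells, four genes.
genes = ["RPS3", "RPL13", "ACTB", "GAPDH"]
cells = {
    "cell_a": [10, 10, 40, 40],  # 20 of 100 counts are ribosomal
    "cell_b": [45, 45, 5, 5],    # 90 of 100 counts are ribosomal
}
fractions = {name: ribosomal_fraction(genes, counts) for name, counts in cells.items()}
print(fractions)
```

A cell such as `cell_b`, whose ribosomal fraction is far from the rest of the dataset, would be a candidate outlier, usually judged together with the other QC metrics rather than on its own.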
Thanks for sharing those links with me. I am currently working with plant genomes where I unfortunately do not have easy access to annotated ribosomal/mitochondrial genes. The reason I asked about trimming is that I have samples where 75% of the R2 reads have the Template Switching Oligo (TSO) sequence at the 5' end (AAGCAGTGGTATCAACGCAGAGTACATGGG), which I wanted to trim. The R2 reads also had poly-A tails that I wanted to remove. I ended up adapting the pipeline to use the cutadapt module. This gives more control over which adapters you trim, from which end, and from which read. I used the sequences mentioned here (https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html) as the adapters and got where I wanted to be. This increased my simpleaf mapping rates by about 20% and only dropped about 2% of reads that were too short after trimming. If that is a desirable addition to this pipeline, I could create a draft PR with the changes I have made, but I am not sure that I would have the time to write proper tests.
@DongzeHE, what's your recommendation for trimming for simpleaf?
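The R2 trimming described above can be illustrated with a small pure-Python sketch: strip the TSO if it sits at the 5' end, strip a trailing poly-A run, and discard the read if too little sequence remains. This is not how cutadapt works internally (it only matches exactly, whereas real trimmers tolerate mismatches and partial adapters), and the function name `trim_read2` and the thresholds `min_poly_a=10` and `min_length=20` are assumptions chosen for the example.

```python
# TSO sequence as quoted in the comment above.
TSO = "AAGCAGTGGTATCAACGCAGAGTACATGGG"


def trim_read2(seq, tso=TSO, min_poly_a=10, min_length=20):
    """Strip a leading TSO and a trailing poly-A run; return None if too short.

    Simplification: exact matching only, unlike real adapter trimmers.
    Thresholds are illustrative assumptions, not recommended values.
    """
    # 5' end: remove the TSO only when the read starts with it.
    if seq.startswith(tso):
        seq = seq[len(tso):]
    # 3' end: remove the trailing A run only when it is long enough
    # to look like a poly-A tail rather than ordinary sequence.
    stripped = seq.rstrip("A")
    if len(seq) - len(stripped) >= min_poly_a:
        seq = stripped
    return seq if len(seq) >= min_length else None


# Hypothetical 30 bp cDNA insert used to build example reads.
insert = "CGTACGTTAGCCGATTCGGATCCGTAGCTT"
print(trim_read2(TSO + insert + "A" * 15))   # TSO and poly-A removed
print(trim_read2(TSO + "ACGT" + "A" * 12))   # too short after trimming
```

The second read shows why a length filter is needed after trimming: once the TSO and poly-A tail are gone, only a few bases of actual cDNA are left.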
I think it really depends on the quality of the data. Usually, for high-quality datasets, the mapping rate is 90%-95%. Since simpleaf is a transcriptome-based tool, it means the majority of the reads come from transcripts.
From the section "(5) Use Fragmentase to fragment cDNA and perform A-tailing:" in the link you shared, we know that TSOs are attached to the 5' end of the synthesized full-length cDNAs. If you see the TSO in your reads, which all come from the 3' end of the synthesized full-length cDNAs, this means that 75% of the synthesized full-length cDNAs are shorter than the expected fragment length, usually 300 bp. In this case, rather than trimming these TSO sequences, I would worry about whether the RNAs were fragmented even before the cDNA library preparation step.
PolyA reads are frequent in single-cell data. This is mainly because most 3' assays, for example the 10x Chromium assays, use polyT primers to capture the polyA tail of polyadenylated transcripts. When a polyT primer binds in the middle of a polyA tail, the upstream unbound As will be sequenced. PolyA reads usually will not be mapped by simpleaf. This is because, in contrast to the genome, there are usually no polyA stretches in the transcriptome. As for my recommendation, I would say it would be great if we could add this module for the biological reads (read 2), because (1) cellranger has an internal trimming module, (2) it might help with processing low-quality data, and (3) there are existing nf-core modules for trimming. The only caveat here is that this will have limited effect for good-quality data. So I am not sure whether we want to add it as an optional or a mandatory step, as trimming usually takes a while. Best,
Thanks for your input! I'm open to adding trimming to the simpleaf, kallistobustools and starsolo workflows then. But I will have to insist on adding tests before this gets merged. As for the trimming tool, do you have any preference?
I am not very experienced with this step so I actually don't have a preference. For me both fastp and TrimGalore look good. Maybe we can follow nf-core/rnaseq to combine the fastqc and trimming steps using TrimGalore?
From my experience of trying to remove the adapters, neither fastp nor Trim Galore helped me. Trim Galore only trims adapters from the 3' end (the TSO was at the start of my reads), and with fastp I was not able to specify that only read 2 should be trimmed while leaving read 1 untouched. I also have a feeling that fastp only trims from the 3' end, as it detects adapters by overlap analysis, but I was not able to find this in the docs. What I ended up doing is using cutadapt. Cutadapt gives very fine-grained control over adapter trimming, but the downside is that you need to specify the adapters you want to trim.
Let's just ask our guru: @FelixKrueger, would you mind sharing some insights here?
I have to read this thread a bit more carefully, but it is true that Trim Galore, in its current form, removes adapters from the 3' end only, as this is usually the right thing to do for most applications, and allows hard-trimming on the 5' end for known sequence contaminations. I believe I have added an option to pass custom arguments to Cutadapt, which might allow the specification of 5' trimming sequences, but this would need some further looking at (I believe it is already working on the
Since you mention polyA for read2, this is also something that cutadapt does not allow for naturally. The docs currently say:
I've managed to cheat around this by swapping R1 and R2, but I don't particularly like that solution for the pipeline, as it makes handling filenames with the nf-core/cutadapt module difficult. An alternative option was to run cutadapt in single-end mode on R2 to trim the polyA, and then run a second pass of cutadapt in paired-end mode to filter out reads that were too short. But ideally all the trimming, polyA removal and length filtering would be done in one pass. It just seemed that neither fastp nor cutadapt was particularly well suited for scRNA samples.
Hmm, if there is a niche to fill and it is useful, maybe we should give it a go?
10X's explanation of their trimming strategy: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/overview#read-trimming
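The one-pass behaviour wished for here can be sketched in a few lines of Python: trim only R2, leave R1 (cell barcode and UMI) untouched, and drop the whole pair when the trimmed R2 is too short, so that both files stay in sync. This is only an illustration of the logic, not an existing tool; `filter_pairs`, `strip_poly_a`, the example sequences and the `min_length=20` threshold are all hypothetical.

```python
def filter_pairs(pairs, trim_r2, min_length=20):
    """Yield (r1, trimmed_r2) pairs, dropping pairs whose trimmed R2 is too short.

    R1 is passed through unchanged so the cell barcode and UMI stay intact.
    """
    for r1, r2 in pairs:
        trimmed = trim_r2(r2)
        if len(trimmed) >= min_length:
            yield r1, trimmed


def strip_poly_a(seq):
    # Hypothetical stand-in for a real trimmer: removes every trailing "A".
    return seq.rstrip("A")


# Made-up read pairs: R1 carries barcode + UMI, R2 carries the cDNA.
pairs = [
    ("ACGTACGTACGTACGTTTGCATGCATGC", "CGTACGTTAGCCGATTCGGATCCG" + "A" * 12),
    ("TTGCATGCATGCACGTACGTACGTACGT", "CGTA" + "A" * 30),
]
kept = list(filter_pairs(pairs, strip_poly_a))
print(kept)
```

Because the pair is the unit of filtering, a single pass suffices and R1 and R2 never go out of sync, which is the property the two-pass cutadapt workaround has to restore with its second run.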
Description of feature
I think detecting doublets in the quantified cells would be a good addition to this pipeline. Using something like DoubletDetection, additional noise coming from doublets could be removed from the data.
I was thinking of starting to add this to the pipeline after the cellbender post-processing step, but I'd like to hear others' opinions on this first.