
Preliminary benchmarking of counts output for scRNA-seq pre-processing tools #82

Merged: 20 commits, May 13, 2021

Conversation

allyhawkins
Member

Related to #63, this is a start at comparing the quantification output of 4 pre-processing tools: Cellranger 6.0, Alevin, Alevin-fry (with/without --sketch), and Kallisto. I used four 10Xv3 samples that have been processed with all five tool configurations.

I updated the 01-import-quant-data.Rmd notebook to import the output data from all of these tools for all 4 of these samples. That notebook is used to create a SingleCellExperiment for each sample, and those objects are the input for pre-processing-benchmarking-metrics.Rmd.

I started with some basic comparisons, including the number of cells detected and shared between tools, UMI/cell, genes/cell, and the correlation of these metrics between each tool and cellranger.

One thing missing from this that I think would be really nice is a knee plot, but we would need each tool to output information about every cell barcode, not just those that it considers "true" cells. I'm not sure that's entirely necessary, but I do think it would be nice to have.

Additionally, one of the samples, SCPCR000003, is a very poor sample and appears to consist mostly of dead cells, so I chose to remove it from the majority of the comparisons.

As we move forward, more samples will need to be added, since we have samples from snRNA-seq and from different 10X technologies.

Some questions to reviewers:

  • Are there metrics that you think are missing that you would like to see added to this comparison?
  • Do you have any suggestions for different ways to visualize these metrics that might help us better compare these tools?

@allyhawkins allyhawkins requested a review from jashapiro April 22, 2021 22:09
Member

@jashapiro jashapiro left a comment

This looks good, with a lot of interesting results!

I have a few suggested changes, mostly to eliminate some redundant plots in favor of just the ones that you already have split by samples that I think are more informative.

The other major suggestion is purely one of style: I find it much easier to review text if we follow a standard of one sentence per line (OSPL). (I got this from working here; it hadn't really occurred to me before.) I have had people suggest even more line breaks, separating clauses, but I think OSPL is usually good enough: it allows me to comment on a specific sentence and makes tracking changes easier, since they are more discrete.

ylab("Number of Cells Detected") +
xlab("")
```
Okay, now we can see that even within one sample there is quite a bit of variation across the tools. This seems a bit concerning... Cellranger looks like it detects the most cells for 2 of the 4 samples and Alevin for the other 2. But then in the 2 samples (126 and 127) where Cellranger detects the most, Alevin detects the least? Is there something potentially wrong with those alevin runs?
Member

One other thing to note is that the kallisto "cells detected" is actually being determined by the DropletUtils::barcodeRanks function in the previous script. Overall, it definitely seems like this cell count metric is a bit fraught. I have no particular answers.
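For reference, this is roughly what that knee-point filtering looks like (a minimal sketch, assuming an unfiltered gene-by-barcode matrix `counts`; the actual import script may use the inflection point or different parameters):

```r
library(DropletUtils)

# Rank barcodes by total counts and estimate the knee of the barcode-rank curve
br <- barcodeRanks(counts)

# Keep barcodes whose total UMI count falls above the knee
keep <- br$total > metadata(br)$knee
counts_filtered <- counts[, keep]
```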

```{r}
## mitochondrial content
cell_qc %>%
mutate(tool = fct_reorder(tool, subsets_mito_percent, .fun = 'mean')) %>%
Member

Any reason for this reordering? I feel like we should keep order consistent among graphs...it also looks like you are doing this twice?

ylab("% Mito /Cell") +
theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())
```
So this is really informative. It looks like SCPCR000003 is a very poor sample across all tools and has a lot of dead cells. It might be a good idea to remove that from our comparisons and not consider it for benchmarking. Additionally, it looks like we can see where our outlier in Alevin is: SCPCR000006 has a much higher mito content in the Alevin run than in the other tools. Let's see if other metrics are affected for that sample.
Member

I think we can probably start here; this seems more informative than the last plot, to be sure.



```{r}
ggplot(cell_qc_filter, aes(y = detected, fill = tool)) +
Member

Again here, I think the plots split by sample are probably all that we need. Aggregating across the samples hides info that I think we want to see, and having both plots seems redundant. If there are no obvious trends from the separated samples, I would probably argue we can't make a good decision about the merits of different tools.

cr_ka_cor = cor(cellranger, kallisto, method = "spearman")
)
```
There is some slight variation, but the correlations are still quite high. Again, it looks like Alevin-fry has the best, with Kallisto having the worst.
Member

Notably, alevin-fry sketch is still outperforming kallisto. This was something I wanted to know about...

```
When actually looking at each gene, you see that kallisto starts to show some variation, whereas alevin-fry has a much cleaner diagonal and is more similar to cellranger for the cells detected by both. Another thing to note, shown in the earlier boxplots, is that the ranges for these comparisons are not always the same. So yes, there is strong correlation, but cellranger still has a lower range of genes/cell than both Alevin-fry and kallisto.

### Per Gene QC
Member

It probably doesn't matter too much, but maybe we want to separate this into a different notebook? Obviously there is shared info here, but for clarity it might be worth doing. Maybe.

cr_ka_cor = cor(cellranger, kallisto, method = "spearman")
)
```
So here we have the correlation of mean expression/gene, and these correlations are much more varied than when looking at UMI/cell. We see that the highest correlation is between Alevin and Alevin-fry (without sketch), while Kallisto has the lowest correlation. It appears that the `--sketch` mode, although it increases time (shown elsewhere), doesn't appear to have an added advantage over Alevin-fry in terms of quantification.
Member

It appears that the --sketch mode, although it increases time (shown elsewhere), doesn't appear to have an added advantage over Alevin-fry in terms of quantification.

I'd say this the opposite way... I would have expected --sketch to have a worse correlation, and it doesn't! Sketch mode seems good!

```
Interestingly, although Alevin-fry has a higher correlation than Kallisto, the pattern is quite different. The mean gene expression tends to be higher in Alevin-fry than in Cellranger, while in kallisto there are some genes that are higher in kallisto and some that are higher in cellranger.

In conclusion, it appears that there is some variability between the tools in cell barcode assignment, with cellranger assigning what appears to be more cells but a lower number of UMI/cell and genes/cell compared to Kallisto and Alevin-fry. For the most part, there is a high correlation of each tool with cellranger, with Alevin-fry/Alevin the highest and Kallisto the lowest. Although Kallisto has the fastest compute time, Alevin-fry is still much faster than cellranger and uses a fraction of the memory while obtaining similar quantification information. The biggest discrepancy is in the number of cells that are identified, so perhaps it is worth looking into the other resolution methods used in Alevin-fry, as we are using the "full" resolution method here.
Member

I would phrase this probably more as a difference in cell filtering; I expect that cellranger is being more permissive. So cellranger is filtering fewer cells out, not that the barcode assignment is necessarily different.

I don't think the resolution methods for alevin-fry are at the cell level, but rather the read level, as to how a read gets counted if it maps to more than one gene/transcript.

@allyhawkins
Member Author

@jashapiro Thank you for all the great feedback! I went ahead and removed all of the summary data, and the plots are now separated to show each sample across each tool and, when applicable, the different indices used.

We have quite a few samples that have been run at this point with various configurations, and I have included them in one updated notebook that looks at only the cell-level metrics. I am going to split the gene-level metrics out into another notebook and PR (per your suggestion, and because the notebook was getting very long and overwhelming).

For your reference, there are 4 scRNA-seq samples that have been used up to this point: SCPCR000003, SCPCR000006, SCPCR000126, and SCPCR000127.
All of the scRNA-seq samples have gone through the following tool/index combinations:

  • Cellranger mRNA only: labeled as cellranger_cdna
  • Cellranger pre-mRNA: labeled as cellranger_pre_mRNA
  • Alevin with the ensembl generated index + full decoy: labeled as alevin
  • Kallisto with the ensembl generated index: labeled as kallisto
  • Alevin-fry with the ensembl generated index: labeled as alevin-fry
  • Alevin-fry with the ensembl generated index in --sketch mode: labeled as alevin-fry-sketch
  • Alevin-fry with the spliced index in --unfiltered mode: labeled as alevin-fry-unfiltered
  • Alevin-fry with the spliced index in --unfiltered mode and --sketch mode: labeled as alevin-fry-unfiltered-sketch

We also now have 2 snRNA-seq samples that have been added: SCPCR000118 and SCPCR000119.
They have been run through the following tool/index combinations:

  • Cellranger mRNA only: labeled as cellranger_cdna
  • Cellranger pre-mRNA: labeled as cellranger_pre_mRNA
  • Alevin with the spliced index + full decoy: labeled as alevin*
  • Kallisto with the spliced index: labeled as kallisto*
  • Alevin-fry with the spliced index: labeled as alevin-fry*
  • Alevin-fry with the spliced index in --sketch mode: labeled as alevin-fry-sketch*
  • Alevin-fry with the spliced index in --unfiltered mode: labeled as alevin-fry-unfiltered*
  • Alevin-fry with the spliced index in --unfiltered mode and --sketch mode: labeled as alevin-fry-unfiltered-sketch*

Those indicated with a * were also run with the spliced_intron index.

In general, I am a bit concerned with the number of cells that alevin-fry (regardless of mode) is detecting. It doesn't quite seem feasible for that to be happening. I'm going to go back and look at the workflow to make sure there aren't any issues with how we are implementing the --unfiltered workflow, but seeing ~50,000-60,000 cells in a sample doesn't seem ideal.

The other striking finding is in the single-nuclei samples. In general, the counts per cell are much lower, as expected, but interestingly it looks like kallisto and cellranger are consistently performing better than the other options. Also, using the pre-mRNA index definitely looks beneficial, although kallisto does have an increase in memory usage with snRNA-seq.

Questions I have for reviewers:

  • Any ideas on why we could be seeing such a large increase in the number of cells detected in alevin-fry --unfiltered?
  • Should we try other snRNA-seq samples to see if this same trend of low coverage is consistent?
  • Any thoughts on other metrics that would be good to include here?

@jashapiro
Member

  • Any ideas on why we could be seeing such a large increase in the number of cells detected in alevin-fry --unfiltered?

Yes, I have an idea! Caveat: I haven't actually looked at anything.

I think alevin-fry is probably now behaving like kallisto does, and returning everything that matches a known barcode. --unfiltered is really unfiltered. (What I think I would like to see is them combine the barcode list and knee filtering, so that they don't create cells that don't match the canonical barcodes, but I can see why they might not have done that yet. I may request it...)

In the meantime, we can probably pull a bit of the code I used to preprocess kallisto and filter with DropletUtils::barcodeRanks or similar.

@rob-p

rob-p commented May 6, 2021

@jashapiro --- you hit the nail directly on the head. The algorithm used in the unfiltered mode is described here. Basically, it's the following. Any barcode that appears in the provided 10x "whitelist" (I will henceforth use the term permit list instead) constitutes the "existence" or "presence" of a corresponding cell. Once all of the "present" 10x permit list barcodes (there are ~6.7M possible barcodes in the 10x v3 permit list) are found, the reads with barcodes that did not exactly match are checked against this list to see if they can be corrected with a single edit to any of the "present" barcodes. If they can be corrected uniquely to a single present barcode, the read is rescued; otherwise, if there is no "present" barcode within 1 edit (or if there are >= 2 barcodes within 1 edit), the read is discarded. By default, the only filtering applied is that a barcode must have 10 reads assigned to it in the first pass to be considered present (though this, too, can be controlled by the --min-reads parameter, and could be set to 1 to be most permissive, or higher if you really don't want to consider barcodes with fewer reads).
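As a toy illustration of that correction step (a sketch in R for intuition only, not the actual alevin-fry implementation, and it considers only substitutions):

```r
one_edit_neighbors <- function(bc, alphabet = c("A", "C", "G", "T")) {
  # All barcodes reachable from bc by a single base substitution
  unlist(lapply(seq_len(nchar(bc)), function(i) {
    vapply(alphabet, function(base) {
      paste0(substr(bc, 1, i - 1), base, substr(bc, i + 1, nchar(bc)))
    }, character(1))
  }))
}

correct_barcode <- function(bc, present) {
  # Exact matches to a "present" permit-list barcode are kept as-is
  if (bc %in% present) return(bc)
  hits <- intersect(one_edit_neighbors(bc), present)
  # Rescue only if the correction is unique; otherwise the read is discarded
  if (length(hits) == 1) hits else NA_character_
}
```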

To be clear, as is the case with other "unfiltered" output, this is not a prediction of all of the "high-quality" cells. This is a quantification of everything present (by means of intersection with the 10x permit list of possible barcodes). So, to determine the "high quality" cells (i.e. the barcodes that likely corresponded to live cells and not e.g. empty droplets), the quantified output in the --unfiltered mode should be combined with another method for estimating the true number of cells in the sample / experiment.

I'm happy to discuss further if you have any follow-up questions or suggestions here.

@rob-p

rob-p commented May 6, 2021

@allyhawkins, there are also two other things that I think are likely worth considering here (after deciding how to handle the filtering of the --unfiltered output ... since this is certainly leading to the quantification of a large number of "present" barcodes that do not correspond to true cells, which is also likely to drive up the mito. reads you're seeing in that pipeline).

The first is that the number of detected UMIs/cell is a bit of a problematic metric when the tools are not doing alignment filtering in the same way. Specifically, 10x has noted that in single-nucleus data, there tend to be a high number of reads aligning in the antisense orientation (upwards of 30%). These are being filtered out by orientation (at least under the current settings) by all of the alevin tools, and in cell ranger too I believe; so looking at UMIs/cell is a bit apples-to-oranges. If you want to see what the alevin-fry counts look like when including these reads, you can just pass -d both in the generate-permit-list step.

The second thought is a more general one, but may be particularly pertinent for the single-nucleus data and when looking at the number of UMIs/cell. That is, I think it's worth throwing the cr-like resolution method into the mix. Because of its simpler UMI deduplication approach, it will also yield more UMIs/cell, and seeing how that compares to the other results may provide a better perspective on the UMIs/cell metric.

@jashapiro
Member

A couple more quick comments on my first pass through:

  • It looks like there is only one sample where we have both the txome and the spliced txome files generated by us? Is that right? Either way, I think it would make sense to collapse all of the index-related categories to just cDNA (however generated) vs unspliced. Then all tools will be in the same panels, rather than spread out in ways that make it harder to compare.
  • I'm not sure how the spliced_intron results are being read in: are the spliced cDNA and intron counts being combined into single genes? Does cellranger keep them separate as well? If so, how is that affecting results?

@allyhawkins
Member Author

Thank you everyone for all the comments and feedback. I can't believe I didn't even think about the fact that we are now allowing any and every identified cell barcode through, which completely explains why we are seeing that increase in cell numbers. It also explains why we probably see a larger shift in the UMI/cell and genes/cell distributions when looking at only shared cells in the alevin-fry workflow. I will go back and add a pre-filtering step to this analysis, similar to what Josh had previously implemented for Kallisto, and start there.

It looks like there is only one sample where we have both the txome and the spliced txome files generated by us? Is that right? Either way, I think it would make sense to collapse all of the index-related categories to just cDNA (however generated) vs unspliced. Then all tools will be in the same panels, rather than spread out in ways that make it harder to compare.

Yes, that's correct. I wasn't sure if we should separate them out or not based on spliced vs. txome files, but I will go ahead and do some more data cleaning and collapse them to cDNA vs. pre-mRNA (unspliced) index. I'm also going to remove the one txome run from the sample that has both txome and spliced txome (119) for now.

I'm not sure how the spliced_intron results are being read in: are the spliced cDNA and intron counts being combined into single genes? Does cellranger keep them separate as well? If so, how is that affecting results?

Let me go back and actually look at this. I am now remembering us talking about having to combine the counts for spliced and unspliced genes for Alevin and Kallisto, based on our discussion here #68 (comment). I will go through this to make sure it is accurate, because the counts do seem very low.

@allyhawkins
Member Author

  • I'm not sure how the spliced_intron results are being read in: are the spliced cDNA and intron counts being combined into single genes? Does cellranger keep them separate as well? If so, how is that affecting results?

I've now included a step that collapses the counts matrix by gene for the snRNA-seq counts for alevin, alevin-fry, and kallisto so that it is comparable to cellranger.
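For reference, that collapsing step is essentially a row sum over the spliced and intron entries that belong to the same gene. A minimal sketch, assuming intron entries are marked with a "-I" suffix (the actual suffix depends on how the spliced_intron reference was built):

```r
# counts: gene-by-cell matrix with separate spliced and intron rows,
# e.g. "ENSG00000000003" and "ENSG00000000003-I"
gene_ids <- sub("-I$", "", rownames(counts))
collapsed <- rowsum(as.matrix(counts), group = gene_ids)  # sums spliced + intron counts per gene
```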

I also incorporated a cell filtering step for alevin-fry --unfiltered-pl, which now filters out any barcodes associated with empty droplets. Using just DropletUtils::barcodeRanks still left samples with ~50,000 cells, so per the suggestion in the DropletUtils vignette, I used DropletUtils::emptyDrops to identify the probability of droplets being empty and filtered those out. After doing this, the alevin-fry --unfiltered-pl runs look much more comparable to the other tools.
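A minimal sketch of that filtering step (the FDR threshold follows the DropletUtils vignette; the exact parameters in the notebook may differ):

```r
library(DropletUtils)
library(SingleCellExperiment)

set.seed(100)                      # emptyDrops uses Monte Carlo p-values
ed <- emptyDrops(counts(sce))      # run on the unfiltered counts from --unfiltered-pl
keep <- which(ed$FDR <= 0.01)      # barcodes unlikely to be empty droplets
sce_filtered <- sce[, keep]
```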

@allyhawkins allyhawkins requested a review from jashapiro May 10, 2021 20:18
@jashapiro
Member

Using just DropletUtils::barcodeRanks still left samples with ~50,000 cells, so per the suggestion on the vignette for DropletUtils, I used DropletUtils::emptyDrops to identify the probability of drops being empty and filtered those out.

How long did this take? I think I had avoided it because it seemed relatively slow, though I do like the idea behind it. Speed probably doesn't matter in the grand scheme of things, especially if we implement it as part of a workflow.

@allyhawkins
Member Author

How long did this take? I think I had avoided it because it seemed relatively slow, though I do like the idea behind it. Speed probably doesn't matter in the grand scheme of things, especially if we implement it as part of a workflow.

It definitely takes longer to use emptyDrops than barcodeRanks. Doing a time test with one of the snRNA-seq counts matrices, the elapsed time on the first line is for barcodeRanks and the second line is for emptyDrops. emptyDrops is orders of magnitude slower, but it will only add a little over 1 minute of processing time for each sample.
[Screenshot of timing output: 2021-05-10, 3:46 PM]
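For anyone reproducing the comparison, the timing was along these lines (a sketch; the matrix and settings are whatever the notebook uses):

```r
library(DropletUtils)

system.time(br <- barcodeRanks(counts))   # knee/inflection estimation: fast
system.time(ed <- emptyDrops(counts))     # Monte Carlo test per barcode: slower, ~1 min per sample here
```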

Member

@jashapiro jashapiro left a comment

This all looks good! If this were more of a workflow, I would probably comment more on some of the places where there is a lot of repeated code, but in this application I think it is fine.

The one thing I do think is worth a change is the fact that kallisto uses a different filter than alevin-fry-unfiltered. It would only seem fair to give it the same emptyDrops filtering that we are doing for alevin-fry.

This means that we are going to have cells that are not considered 'true' cells and we need to filter them out before doing any comparisons across the tools.
We could use `DropletUtils::barcodeRanks` but even after using that we still see some samples with 50,000 cells above the knee.
According to the documentation for DropletUtils, another recommendation is to use their method for detecting [empty droplets](https://www.bioconductor.org/packages/devel/bioc/vignettes/DropletUtils/inst/doc/DropletUtils.html#detecting-empty-droplets).
Here, I am implementing `DropletUtils::emptyDrops` for the alevin-fry samples used with the `--unfiltered` flag.
Member

For a fair comparison, we should probably do this for the kallisto samples as well.

```
It looks like SCPCR000003 is a very poor sample across all tools and has a lot of dead cells.
It might be a good idea to remove that from our comparisons and not consider this for benchmarking.
Additionally, it looks like we can see that SCPCR000118 (snRNAseq) also has some variation, but the other samples tend to be fairly consistent.
Member

This is super odd, and seems like a potential problem for alevin-fry with the default settings. Mito genes should be essentially nonexistent in nuclei, so I am surprised that we are seeing so much variability there! A question I have is whether those are the same cells, or if alevin/-fry are not throwing out some droplets that should have been discarded...

Member Author

It looks like even when filtering for only cells that are shared across all tools, alevin/alevin-fry is still giving higher coverage of the mito genes. That seems to decrease with use of --sketch and also use of the pre-mRNA index rather than cDNA index.
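For reference, the "shared cells" filtering here amounts to intersecting the barcode sets from each tool; a sketch, with `sces_by_tool` standing in as a hypothetical named list of the per-tool SingleCellExperiment objects:

```r
# Barcodes called as cells by every tool for this sample
shared_barcodes <- Reduce(intersect, lapply(sces_by_tool, colnames))

# Restrict each tool's object to the shared cells before comparing QC metrics
sces_shared <- lapply(sces_by_tool, function(sce) sce[, colnames(sce) %in% shared_barcodes])
```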


Really interesting analyses @allyhawkins! I'm tagging @k3yavi and @mohsenzakeri here to take a look at these samples from our end.


Hey guys, thanks for maintaining such a great resource. Just to add to Rob's point here, we also had a similar observation about the mito/ribo genes having high counts compared to other tools, for single-cell data (not single-nuclei). It has been a while, but if I remember correctly these genes share highly repetitive genomic sequences, which almost all tools ignore in their quantification methods (some tools have EM-related flags, but they are disabled by default), whereas the EM is enabled by default in alevin. I'd recommend checking the uniqueness ratio of these outlier genes to get a better sense of why alevin is showing more counts for them compared to other tools in non-ambiguous modes.

Member Author

Thanks for the insight on this topic. What's interesting is that when we look at the same sample run on alevin-fry with --sketch enabled, we see a decrease in the percentage of reads mapping to mito/ribo genes.
It seems like this is a consequence of using alevin/alevin-fry in its default mode with selective alignment rather than pseudoalignment? Does using the --sketch flag in alevin-fry change the default settings in alevin to not use the EM flags? What are the benefits to having that enabled vs. disabled?


Hi Ally. So, this is certainly an interesting observation. It's not the case that --sketch mode turns off the EM automatically or some such. The resolution mode is independent of how the records make their way into the RAD file. Here, my best guess is that selective-alignment is just more sensitive, and is somehow finding alignments to these genes that either (a) are good enough given the default filtering criterion for alignments (and no other equally good alignment is being found to other genes) or (b) are as good as the alignments that it is finding with respect to some other gene --- leading to this alignment being multi-mapping, and the corresponding counts being split during the full resolution. The selective-alignment algorithm verifies all of the alignments it finds with "normal" dynamic programming scoring.

So this means that it must be finding alignments for fragments to these ribo genes that are good enough to pass the filtering criteria (which are, themselves, modifiable). As @k3yavi points out, it's not immediately obvious whether reads are mapping to these genes in other methods or not, as they will simply discard gene-multimapping reads rather than try to resolve them. In the case of Cell Ranger, one could look at one of the BAM files and see if alignments exist to these genes but are simply not being counted. Likewise, one could run alevin-fry in --cr-like mode and see how that changes the abundance of those genes in these samples. Finally, I'll mention that the two mapping algorithms (with and without sketch) are certainly different. While selective alignment will validate the mapping with a score, and return all (and only) those places where the mapping yields an equally best score, the role of the "score" in --sketch mode is instead played by what set of k-mers are searched and found. As such, it's always possible that there may be different alignments with equally valid alignment scores, but only one (or a subset) of the mappings is found in sketch mode because of the heuristics used to extract the k-mer set that is eventually tested.

Comment on lines 261 to 269
cell_qc_filter %>%
filter(seq_unit == "nucleus") %>%
ggplot(aes(x = tool, y = sum, fill = tool)) +
geom_boxplot() +
facet_grid(sample ~ index_type) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("UMI/cell") +
xlab("")
Member

Since the outliers are messing with the scale here, it might be nice to zoom in with coord_cartesian(ylim = c(0,10000)) or something like that to see the differences here more clearly.
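That is, something like this applied to the chunk above (the ylim is just an example cutoff):

```r
cell_qc_filter %>%
  filter(seq_unit == "nucleus") %>%
  ggplot(aes(x = tool, y = sum, fill = tool)) +
  geom_boxplot() +
  facet_grid(sample ~ index_type) +
  coord_cartesian(ylim = c(0, 10000)) +  # zooms the axis without dropping points from the boxplot stats
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("UMI/cell") +
  xlab("")
```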

Generally, the coverage is still quite uniform.
However, it looks like cellranger has one of the lowest genes/cell means with alevin-fry `--sketch` and alevin-fry `--unfiltered` the highest.

Let's do the same thing, but only look at genes above a certain threshold.
Member

I'd make a new section here for navigation ease...

allyhawkins and others added 3 commits May 11, 2021 12:28
@allyhawkins
Member Author

The one thing I do think is worth a change is the fact that kallisto uses a different filter than alevin-fry-unfiltered. It would only seem fair to give it the same emptyDrops filtering that we are doing for alevin-fry.

This makes complete sense. I've now changed it so that both alevin and kallisto are filtered using emptyDrops, which solves the problem of the one sample having ~40,000 cells in Kallisto.

I also went ahead and got rid of some of the repetitive code, in case future me finds it beneficial. I added a function that imports quants data from a data directory and a list of quant ids, collapses any counts matrices aligned to the spliced_intron fasta, and then outputs an SCE object that can be used for cell filtering with emptyDrops, depending on the tool used.
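Roughly the shape of that helper (a sketch only, not the actual function added here; `read_quants()` is a hypothetical stand-in for the tool-specific reading code, and the "-I" intron suffix is an assumption):

```r
library(SingleCellExperiment)
library(DropletUtils)

import_quants <- function(data_dir, quant_id, tool, filter_empty = FALSE) {
  # read_quants(): hypothetical helper returning a gene-by-cell counts matrix for this tool
  counts <- read_quants(file.path(data_dir, quant_id), tool)

  # Collapse spliced + intron rows (e.g. "ENSG...-I") into one row per gene
  gene_ids <- sub("-I$", "", rownames(counts))
  counts <- rowsum(as.matrix(counts), group = gene_ids)

  sce <- SingleCellExperiment(assays = list(counts = counts))

  # Tools with unfiltered barcode output get emptyDrops-based cell calling
  if (filter_empty) {
    ed <- emptyDrops(counts(sce))
    sce <- sce[, which(ed$FDR <= 0.01)]
  }

  sce
}
```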

Member

@jashapiro jashapiro left a comment

LGTM! We may want to follow up on some of the points & comparisons raised by @rob-p and @k3yavi, but let's get this in for now.

One other thing I noted (and we discussed) is that the UMI per cell correlation for one sample (SCPCR000006) is poor with alevin-fry-sketch-unfiltered. I wonder if this may be a bug in the --unfiltered beta code, but I'm not sure how to track that down carefully, and it doesn't seem to affect any other samples or the alevin-fry-unfiltered run. We may want to check if this is resolved in 0.3.0 and/or with a clean rerun, but I don't think it is a major issue for the general pattern of comparisons we are trying to get at here.

[Screenshots: 2021-05-13, 9:39 AM and 9:36 AM]

@allyhawkins allyhawkins merged commit 8a1c87d into main May 13, 2021
@rob-p

rob-p commented May 13, 2021

Wow @jashapiro: what is up with that sample? Any idea what the unfiltered number of cells looks like, and what the overlap of the filtered set is? The unfiltered barcode collection algorithm is pretty simple, so I'd be curious if this may be something sneaking in during another step (of course, we'd be happy to take a look as well).

@jashapiro
Member

Wow @jashapiro : what is up with that sample. Any idea what the unfiltered number of cells looks like, and what the overlap of the filtered set is? The unfiltered barcode collection algorithm is pretty simple, so I'd be curious if this may be something sneaking in during another step (of course, we'd be happy to take a look as well).

No idea! And it absolutely could be (likely is?) something at our end as well! We will probably go back and investigate a bit more, but it is not today's problem. No correlation I could have understood, as with a random permutation, but a negative correlation? The pattern is strange too (plot from @allyhawkins).
[Screenshot: 2021-05-13, 9:25 AM]

The good part is we don't really care about cell ids at the moment, but we definitely will when we get to integrating some CITE-seq data.

@rob-p

rob-p commented May 13, 2021

Looks like we're directly visualizing the X chromosome 😆. But more seriously, if, after further investigation, it does look like something strange on our end, please do keep us posted.

@allyhawkins
Member Author

Any idea what the unfiltered number of cells looks like, and what the overlap of the filtered set is? The unfiltered barcode collection algorithm is pretty simple, so I'd be curious if this may be something sneaking in during another step (of course, we'd be happy to take a look as well).

@rob-p So before doing any filtering, we get around 69,000 cells identified for this sample. After filtering with DropletUtils::emptyDrops we get ~2,000 cells, and 1,500 of those cell barcodes are also identified when running this sample through cellranger. We will definitely dig around a bit more, see if we can come up with anything, and keep you posted.
