
N50 is not a median #194

Closed
Sebastien-Raguideau opened this issue Oct 3, 2024 · 8 comments · Fixed by #198

Comments

@Sebastien-Raguideau

Hello,

I am using sequali from quay.io/biocontainers/sequali:0.5.1--py310h4b81fae_0, which is about 6 months old, so it may be a bit outdated and this issue may already be fixed.

I have a gripe with the values reported for N50 and the other Ns: the reported values appear to be quantiles rather than N50 and the like.

Just to be clear, N50 should be a contig size: the contig size for which 50% of all nucleotides in the assembly are in smaller contigs. For instance, an N50 of 9 kb would mean that 50% of all nucleotides are in contigs smaller than 9 kb. This is quite different from the median and other quantiles.
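To make the difference concrete, here is a minimal sketch of the usual way N50 is computed, next to the median of the same lengths. The function name `nx` and the example lengths are made up for illustration; this is not Sequali's code.

```python
from statistics import median


def nx(lengths, fraction=0.5):
    """Return the length L such that sequences of length >= L
    together hold at least `fraction` of all bases (N50 for 0.5)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    threshold = sum(lengths) * fraction
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Hypothetical lengths: many short sequences, two long ones.
lengths = [2, 2, 2, 3, 3, 4, 8, 8]

n50 = nx(lengths, 0.5)          # weighted by bases
n90 = nx(lengths, 0.9)
median_length = median(lengths)  # weighted by sequence count
print(n50, n90, median_length)
```

With these lengths the two long sequences already hold half of the 32 bases, so N50 is 8 while the median length is 3.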

Best,
Seb

@rhpvorderman
Owner

You are correct, these are quantiles. I got the naming wrong. I will fix this. Thanks very much for reporting!

@Sebastien-Raguideau
Author

Thanks for the quick answer.
Cool.
Not sure if it is on your mind, but N50 would still be a cool stat to have for long reads.

@rhpvorderman
Owner

I will think about adding it, but currently I am unsure how the N50 value would translate into actionable QC steps. The quantile values translate more directly in this respect when it comes to length filtering. Feel free to convince me otherwise! I am always open to suggestions.
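As a rough illustration of why a quantile maps directly onto a length filter, the sketch below picks a cutoff from a read-length quantile and applies it. The helper `length_quantile`, its nearest-rank definition, and the read lengths are assumptions for the example, not how Sequali computes its values.

```python
import math


def length_quantile(lengths, q):
    """Nearest-rank quantile: the smallest length such that at least
    a fraction q of the reads are that length or shorter."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(lengths)
    rank = math.ceil(q * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical long-read lengths in bases.
read_lengths = [150, 400, 900, 1200, 2500, 4000, 6000, 9000, 12000, 20000]

# Dropping everything at or below the 20% quantile removes
# the shortest 20% of reads (here 2 of 10).
cutoff = length_quantile(read_lengths, 0.2)
kept = [n for n in read_lengths if n > cutoff]
print(cutoff, len(kept))
```

The cutoff says how many reads a filter removes; an N50 value says where half of the bases sit, which is a summary of the run rather than a filter setting.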

@Sebastien-Raguideau
Author

I think I see your point. Hmm, I suppose I just like to look at N50 to get a grasp of the length distribution, and it is a nice summary stat for a dataset. Your report does indeed help with actionable QC, but it could also give an overview of the sequencing run. I mean, I can run another tool to get that info, like seqkit stats, but what if I didn't have to?

@rhpvorderman
Owner

I took a look at https://bioinf.shenwei.me/seqkit/usage/#stats. N50 is the only thing missing from Sequali at the moment. Is that correct?

I can add N50 and N90 stats. That seems to be quite useful as a summary statistic.

@Sebastien-Raguideau
Author

Think so. Would be nice :)

@rhpvorderman
Owner

@Sebastien-Raguideau The latest release should have fixed the quantile issue.

@Sebastien-Raguideau
Author

Hey, thanks a lot, will have a go at it.
