Reference genome with or without unplaced contigs #83
Comments
Hi Shaun, In any case, please let us know how it is going. The experiment looks interesting in terms of testing |
I'm testing now without the unplaced contigs and comparing to with the unplaced contigs with |
--- quast-lg -es --fast --large -R fly.chr.fa assembly.fa
+++ quast-lg -es --fast --large --fragmented -R fly.all.fa assembly.fa
# contigs 1777 3307
Largest contig 27246440 831740
Total length 136770791 132088273
-Reference length 137567484 137567484
-Reference GC (%) 42.08 42.08
+Reference length 143726002 143726002
+Reference GC (%) 42.01 42.01
N50 20770625 152319
-NG50 20770625 147975
+NG50 20708867 141243
N75 20652883 64699
-NG75 20652883 53807
+NG75 20652883 43659
L50 3 245
-LG50 3 263
+LG50 4 284
L75 5 563
-LG75 5 631
-# misassemblies 453 377
-# misassembled contigs 209 328
-Misassembled contigs length 120125748 27027106
-# local misassemblies 1507 1135
-# scaffold gap ext. mis. 113 -
-# scaffold gap loc. mis. 394 -
-# possible TEs 72 50
-# unaligned mis. contigs 24 23
-# unaligned contigs 85 + 191 part 194 + 323 part
-Unaligned length 7031973 6529255
-Genome fraction (%) 89.194 89.173
+LG75 5 726
+# misassemblies 483 397
+# misassembled contigs 228 341
+Misassembled contigs length 120228985 27087838
+# local misassemblies 1516 1149
+# scaffold gap ext. mis. 116 -
+# scaffold gap loc. mis. 398 -
+# possible TEs 70 54
+# unaligned mis. contigs 13 12
+# unaligned contigs 54 + 164 part 157 + 291 part
+Unaligned length 6661039 6167582
+Genome fraction (%) 86.037 86.018
Duplication ratio 1.061 1.027
# N's per 100 kbp 3290.97 0.00
-# mismatches per 100 kbp 650.82 564.16
-# indels per 100 kbp 42.51 43.55
+# mismatches per 100 kbp 647.53 561.09
+# indels per 100 kbp 42.34 43.36
Largest alignment 3619434 734089
-Total aligned length 125169641 125408010
+Total aligned length 125526732 125755179
NA50 880995 129615
-NGA50 880995 122436
+NGA50 839777 111752
NA75 262872 47914
-NGA75 253454 38009
+NGA75 135651 27347
LA50 43 288
-LGA50 43 310
+LGA50 47 336
LA75 109 698
-LGA75 111 794
+LGA75 135 936 So |
Running
It looks as though |
Increasing --- quast-lg -es --fast --large -R fly.chr.fa assembly.fa
+++ quast-lg -es --fast --large --fragmented -x 7000 --fragmented-max-indent 7000 -R fly.all.fa assembly.fa
# contigs 1777 3307
Largest contig 27246440 831740
Total length 136770791 132088273
-Reference length 137567484 137567484
-Reference GC (%) 42.08 42.08
+Reference length 143726002 143726002
+Reference GC (%) 42.01 42.01
N50 20770625 152319
-NG50 20770625 147975
+NG50 20708867 141243
N75 20652883 64699
-NG75 20652883 53807
+NG75 20652883 43659
L50 3 245
-LG50 3 263
+LG50 4 284
L75 5 563
-LG75 5 631
-# misassemblies 453 377
-# misassembled contigs 209 328
-Misassembled contigs length 120125748 27027106
-# local misassemblies 1507 1135
-# scaffold gap ext. mis. 113 -
-# scaffold gap loc. mis. 394 -
-# possible TEs 72 50
-# unaligned mis. contigs 24 23
-# unaligned contigs 85 + 191 part 194 + 323 part
-Unaligned length 7031973 6529255
-Genome fraction (%) 89.194 89.173
+LG75 5 726
+# misassemblies 476 390
+# misassembled contigs 223 335
+Misassembled contigs length 120207681 27064051
+# local misassemblies 1529 1161
+# scaffold gap ext. mis. 115 -
+# scaffold gap loc. mis. 398 -
+# possible TEs 68 52
+# unaligned mis. contigs 13 12
+# unaligned contigs 54 + 164 part 157 + 291 part
+Unaligned length 6661084 6167617
+Genome fraction (%) 86.037 86.020
Duplication ratio 1.061 1.027
# N's per 100 kbp 3290.97 0.00
-# mismatches per 100 kbp 650.82 564.16
-# indels per 100 kbp 42.51 43.55
+# mismatches per 100 kbp 647.54 561.10
+# indels per 100 kbp 42.34 43.36
Largest alignment 3619434 734089
-Total aligned length 125169641 125408010
+Total aligned length 125527295 125755684
NA50 880995 129615
-NGA50 880995 122436
+NGA50 839777 111752
NA75 262872 47914
-NGA75 253454 38009
+NGA75 135651 27347
LA50 43 288
-LGA50 43 310
+LGA50 47 336
LA75 109 698
-LGA75 111 794
+LGA75 135 936
|
I feel like the behaviour that I would prefer would be to ignore all alignments to unplaced contigs (reference sequences shorter than 100 kbp) when computing misassemblies and NGA50 stats. |
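Until something like that exists inside QUAST, a rough approximation is to drop the short reference sequences before running it. The sketch below is only an illustration of that idea, not QUAST behaviour: the helper names (read_fasta, split_by_length, write_fasta) are made up here, and the 100 kbp cutoff is the threshold suggested in the comment above.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_length=100_000):
    """Separate records into (kept, dropped) by sequence length.

    min_length is an assumed cutoff: sequences shorter than it are
    treated as unplaced contigs and dropped.
    """
    kept, dropped = [], []
    for header, seq in records:
        (kept if len(seq) >= min_length else dropped).append((header, seq))
    return kept, dropped


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping lines at `width`."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


# Tiny in-memory demonstration with a toy cutoff of 5 bp.
demo = io.StringIO(">chrA\nACGT\nACGT\n>unplaced1\nAC\n")
kept, dropped = split_by_length(read_fasta(demo), min_length=5)
```

With real files, the same functions could read fly.all.fa and write the kept records to a chromosome-only reference. Note that this removes the unplaced sequence from every statistic, including genome fraction, which is not quite the same as ignoring it only for misassemblies and NGA50.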
Increasing both --- quast-lg -es --fast --large -R fly.chr.fa assembly.fa
+++ quast-lg -es --fast --large --fragmented --extensive-mis-size 50000 --fragmented-max-indent 50000 -R fly.all.fa assembly.fa
# contigs 1777 3307
Largest contig 27246440 831740
Total length 136770791 132088273
-Reference length 137567484 137567484
-Reference GC (%) 42.08 42.08
+Reference length 143726002 143726002
+Reference GC (%) 42.01 42.01
N50 20770625 152319
-NG50 20770625 147975
+NG50 20708867 141243
N75 20652883 64699
-NG75 20652883 53807
+NG75 20652883 43659
L50 3 245
-LG50 3 263
+LG50 4 284
L75 5 563
-LG75 5 631
-# misassemblies 453 377
-# misassembled contigs 209 328
-Misassembled contigs length 120125748 27027106
-# local misassemblies 1507 1135
-# scaffold gap ext. mis. 113 -
-# scaffold gap loc. mis. 394 -
-# possible TEs 72 50
-# unaligned mis. contigs 24 23
-# unaligned contigs 85 + 191 part 194 + 323 part
-Unaligned length 7031973 6529255
-Genome fraction (%) 89.194 89.173
+LG75 5 726
+# misassemblies 206 206
+# misassembled contigs 165 182
+Misassembled contigs length 89183619 4998964
+# local misassemblies 1780 1348
+# scaffold gap ext. mis. 8 -
+# scaffold gap loc. mis. 505 -
+# possible TEs 90 52
+# unaligned mis. contigs 13 12
+# unaligned contigs 54 + 162 part 157 + 289 part
+Unaligned length 6659706 6166247
+Genome fraction (%) 86.041 86.025
Duplication ratio 1.061 1.027
# N's per 100 kbp 3290.97 0.00
-# mismatches per 100 kbp 650.82 564.16
-# indels per 100 kbp 42.51 43.55
-Largest alignment 3619434 734089
-Total aligned length 125169641 125408010
-NA50 880995 129615
-NGA50 880995 122436
-NA75 262872 47914
-NGA75 253454 38009
-LA50 43 288
-LGA50 43 310
-LA75 109 698
-LGA75 111 794
+# mismatches per 100 kbp 647.55 561.14
+# indels per 100 kbp 42.33 43.36
+Largest alignment 26304422 831628
+Total aligned length 125528847 125757294
+NA50 17502836 144852
+NGA50 17502836 128811
+NA75 7124996 52395
+NGA75 963854 30445
+LA50 4 260
+LGA50 4 302
+LA75 6 625
+LGA75 9 840 |
I've thought about an alternative definition of scaffold NGA50: I think that's roughly equivalent to saying… remove all the incongruous alignment blocks from a scaffold, and what's the N50 of what remains? So a translocation scaffold The goal is to capture the fact that the scaffolds in the assembly are 💯 correct at the large scale. There is one scaffold per reference chromosome, and the scaffolds span from the start of each chromosome to the end, with some local assembly errors along the way. A reported NGA50 of < 1 Mbp doesn't capture that large scale correctness. |
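One possible reading of "remove the incongruous blocks and take the N50 of what remains" can be sketched as follows. This is an assumption-laden illustration, not QUAST's algorithm: each scaffold is represented as a hypothetical list of (aligned_length, is_congruous) blocks, what remains of a scaffold is counted as a single piece, and the lengths in the example are invented.

```python
def ng50(lengths, reference_length, fraction=0.5):
    """Return the length L of the piece at which the running total of
    pieces, sorted longest first, reaches `fraction` of the reference.
    Returns 0 if the pieces never cover that much."""
    target = fraction * reference_length
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return 0


def remaining_length(blocks):
    """Sum the congruous blocks of one scaffold (assumed representation:
    a list of (aligned_length, is_congruous) tuples)."""
    return sum(length for length, congruous in blocks if congruous)


# Invented example: one long scaffold with a small incongruous block
# in the middle, plus one short scaffold.
scaffolds = [
    [(900, True), (50, False), (850, True)],
    [(400, True)],
]
reference_length = 2400

# Proposed statistic: drop incongruous blocks, keep each scaffold whole.
proposed = ng50([remaining_length(s) for s in scaffolds], reference_length)

# Break-at-every-block statistic, for contrast: each block is its own piece.
blockwise = ng50([length for s in scaffolds for length, _ in s], reference_length)
```

In this toy case the proposed value stays at the scale of the whole scaffold (1750) while the block-wise value drops to 850, which is the large-scale correctness the comment is trying to capture.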
Wow, so many thoughts and suggestions! As for
it looks like an ad hoc behaviour, so probably should not be included in the release version. Maybe you just need to perform two separate Quast runs (against Anyway, thanks for sharing all this info! |
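If two separate runs are made, their reports can be compared programmatically rather than by eye. The sketch below assumes the tab-separated layout of a QUAST report (metric name in the first column, one column per assembly, assembly names in the first row); the function names and the single column name "assembly" are made up for illustration, and the sample values are copied from the first column of the tables earlier in this thread.

```python
def parse_report(text):
    """Parse tab-separated report text into {metric: {assembly: value}}.
    Values are kept as strings because some cells are not plain numbers
    (for example '85 + 191 part')."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    names = rows[0][1:]
    return {row[0]: dict(zip(names, row[1:])) for row in rows[1:]}


def changed_metrics(run_a, run_b):
    """List (metric, assembly, value_a, value_b) for every cell that
    is present in both runs and differs between them."""
    changes = []
    for metric, per_assembly in run_a.items():
        for assembly, value in per_assembly.items():
            other = run_b.get(metric, {}).get(assembly)
            if other is not None and other != value:
                changes.append((metric, assembly, value, other))
    return changes


# Sample text standing in for the two report files.
chr_only = "Assembly\tassembly\nN50\t20770625\nNG50\t20770625\n# misassemblies\t453\n"
with_unplaced = "Assembly\tassembly\nN50\t20770625\nNG50\t20708867\n# misassemblies\t483\n"

changes = changed_metrics(parse_report(chr_only), parse_report(with_unplaced))
```

Only metrics that differ are reported, so reference-independent values such as N50 drop out of the comparison.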
Indeed these are just ramblings and wild thoughts. Don't feel compelled to implement them unless you truly think they're awesome ideas, which is great of course if you do! Happy holidays! |
The Drosophila melanogaster reference genome has 5.5 Mbp of unplaced sequence in 1,862 unplaced contigs. Do you recommend running QUAST with only the chromosomes, or with the chromosomes and unplaced contigs (possibly with --fragmented)?
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz