Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chromosome order #146

Open
Phlya opened this issue Jan 8, 2021 · 11 comments
Open

chromosome order #146

Phlya opened this issue Jan 8, 2021 · 11 comments

Comments

@Phlya
Copy link

Phlya commented Jan 8, 2021

Is there a way to specify chromosome order when installing a genome? E.g., when downloading a human genome I want the chromosomes to be sorted in the "normal" way aschr1, chr2, ... chr10, ... chr22, chrX, chrY, chrM. But currently they are sorted as chr1, chr10, ..., chr2, ..., chr22, chr3, ... chr9, chrM, chrX, chrY. Is it just the same order as in UCSC (I downloaded from there)?

@siebrenf
Copy link
Member

siebrenf commented Jan 8, 2021

genomepy does not (currently) support this. This snippet should do the trick:

from pyfaidx import Fasta
import re 

oldfa = '/path/to/genome.fa'
newfa = '/path/to/sorted_genome.fa'

def sorted_nicely( l ): 
    """ 
    Sort the given iterable in the way that humans expect.
    source: https://stackoverflow.com/questions/2669059/how-to-sort-alpha-numeric-set-in-python
    """ 
    convert = lambda text: int(text) if text.isdigit() else text 
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
    return sorted(l, key = alphanum_key)

fa = Fasta(oldfa)
sorted_chr = sorted_nicely(fa.keys())
with open(newfa) as sorted_fa:
    for chr in sorted_chr:
        sorted_fa.write('>'+chr+'\n')
        sorted_fa.write(fa[chr][:])

@Phlya
Copy link
Author

Phlya commented Jan 8, 2021

Thank you. Is there a way to perform this, but still use genomepy to manage the new sorted genome?

@siebrenf
Copy link
Member

siebrenf commented Jan 8, 2021

Assuming that with manage, you mean also applies this sorting to the support files (fa.sizes, gaps.bed & fa.fai). If so, you can run this after the previous code:

from genomepy import Genome
import os

g = Genome(oldfa)
os.unlink(g.index_file)
os.unlink(g.sizes_file)
os.unlink(g.gaps_file)

g = Genome(newfa)

if oldfa and newfa were the same it should just overwrite the old stuff (check if this does what you intended though)

@Phlya
Copy link
Author

Phlya commented Jan 8, 2021

OK, thank you!
Would this feature be something you might be interested in? I might try to make a PR to add it then. The ordering of chromosomes is pretty important, and non-trivial in some organisms.

@siebrenf
Copy link
Member

Thank you for the offer, but I don't think this is a desired feature. Chromosomes are sorted by size, which all tools I know can use, and some tools might not work otherwise (due to assumptions). Natural sorting would also fail for certain genomes (such as those that use roman numerals).

@Phlya
Copy link
Author

Phlya commented Jan 11, 2021

Are chromosomes sorted by size? That's not what I experienced, for me they were sorted alphabetically, as I indicated in the first post.
I'm not suggesting necessarily natural sorting, but just arbitrary sorting. Natural sorting can be an option, but also the actual desired order of chromosomes could be provided.

@siebrenf
Copy link
Member

My bad, you're right about the default order.
Inserting a sorting function into genomepy (as user) would however be equally complex to running the sorting afterward. And since we can't add sorting methods for every genome, I feel this step is best left to the user.

@Phlya
Copy link
Author

Phlya commented Jan 11, 2021

Doesn't need to be a function, just the actual list of chromosome names in the right order would work.

@simonvh
Copy link
Member

simonvh commented Jan 11, 2021

Hi @Phlya, I think we're hesitant to add functionality to genomepy where the use-case is not clear as this is functionality that needs to be tested and maintained down the line and adds additional complexity to the command line tool.
However, maybe some further information would help. Can you provide a clear use-case for this, i.e. an example where the order of the chromosomes in a genome FASTA need to be changed? I would expect most tools to expect lexicographical order, but maybe I'm wrong here.

@Phlya
Copy link
Author

Phlya commented Jan 11, 2021

No worries, I understand if this functionality might be not the most important.

My main use-case is that I want to use genomepy in a pipeline to manage the genome reference. Different tools and steps rely on a chromsizes file, and that needs to align with how the data is processed, including the order of chromosomes, and some tools also need the genome sequence. I need to investigate if the order of chromosomes in the fasta file needs to match the order in the chromsizes file, but even if it's not strictly required for the tools I am using now, I am not keen on the idea of having a chromsizes file with a different order than the fasta - I feel that's a dangerous combination that can lead to errors in some cases.

And of course I'd rather have my chromosomes sorted in the logical order (which matches the natural sorting for human, but could be arbitrarily different in other organisms).

I hope this explains my use case and gives you some context. Thanks!

@antonylebechec
Copy link

Dear all,

Actually, I downloaded a genome on UCSC (hg19), and retrieve a fasta file (and associated fai) with this contig/chromosome order (I also used regex 'chr[0-9XYM]+$'):

$ cat hg19.fa.fai
chr4	191154276	6	50	51
chr18	78077248	194977375	50	51
chr1	249250621	274616174	50	51
chr5	180915260	528851814	50	51
chr19	59128983	713385387	50	51
chr12	133851895	773696957	50	51
chr16	90354753	910225897	50	51
chr22	51304566	1002387753	50	51
chrM	16571	1054718417	50	51
chr17	81195210	1054735327	50	51
chr13	115169878	1137554449	50	51
chr8	146364022	1255027731	50	51
chr14	107349540	1404319041	50	51
chr20	63025520	1513815579	50	51
chr10	135534747	1578101617	50	51
chr11	135006516	1716347066	50	51
chr9	141213431	1854053719	50	51
chr15	102531392	1998091426	50	51
chr21	48129895	2102673453	50	51
chr2	243199373	2151765952	50	51
chrY	59373566	2399829319	50	51
chr6	171115067	2460390363	50	51
chrX	155270560	2634927738	50	51
chr7	159138663	2793303716	50	51
chr3	198022430	2955625159	50	51

This order is the same without regex filter. I tried also with 1 thread and multiple threads.

As I use tools that need an ordered genome (such as GATK, Picard...), it lead to an error.
It could be very useful to provide a genome fasta in standard order, or add an option to sort chromosome during installation.

If there is a solution to reorder all files (fa, fai, size...), let me know!

Best,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants