Gene symbol assignment in the query species #139
Replies: 15 comments
-
Hi Dario, this file has the orthology type (1:1, 1:many, etc.), defined as the number of reference genes vs. the number of intact orthologs in the query. TOGA assigns each query gene an identifier reg_number. If you have a 1:many, this would also work. If you have many:1 or many:many, it is not obvious which symbol to assign. If you already have a query annotation from Ensembl, then yes, you could overlap the TOGA transcripts with the Ensembl annotation and keep Ensembl's gene symbols. Hope that helps.
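One way to read the reply above is that the reference gene symbol can be carried over to the query gene for 1:1 and 1:many, while many:1 and many:many need a manual decision. The following is a minimal sketch of that reading, not the TOGA authors' method. The class labels (`one2one`, `one2many`, ...), the row layout, and all gene names and `reg_` numbers are assumptions made up for illustration; check them against the actual file.

```python
from collections import defaultdict

# Assumed labels for classes where one reference symbol maps cleanly to a query gene.
UNAMBIGUOUS = {"one2one", "one2many"}


def assign_symbols(rows):
    """Split query genes into those with a single obvious symbol and the rest.

    rows: iterable of (reference_symbol, query_gene, orthology_class).
    Returns (assigned, ambiguous), where assigned maps query gene -> symbol
    and ambiguous maps query gene -> set of candidate symbols.
    """
    assigned = {}
    ambiguous = defaultdict(set)
    for ref_symbol, query_gene, orth_class in rows:
        if orth_class in UNAMBIGUOUS:
            assigned[query_gene] = ref_symbol
        else:
            ambiguous[query_gene].add(ref_symbol)
    return assigned, dict(ambiguous)


# Hypothetical rows, not taken from any real TOGA output.
rows = [
    ("GENE_A", "reg_1", "one2one"),
    ("GENE_B", "reg_2", "one2many"),
    ("GENE_B", "reg_3", "one2many"),
    ("GENE_C", "reg_4", "many2one"),
    ("GENE_D", "reg_4", "many2one"),
]
assigned, ambiguous = assign_symbols(rows)
```

Note that in the 1:many case several query genes end up sharing one symbol, so a suffix or the `reg_` identifier is still needed to keep them distinct.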
-
Thank you, this is very helpful. Could you expand on what overlapping the TOGA transcripts with the Ensembl annotation would look like? Would it be like comparing the GTF from ENSEMBL to the TOGA GTF, and then transferring the ENSEMBL identifiers onto the TOGA file? Or is it much simpler than that: using the ENSEMBL orthology table to go from human gene to the ortholog in query species, and ignoring ENSEMBL's orthology classification since we're using TOGA's instead?
-
Hmm, I would probably prefer the first option, as this makes sure that TOGA and Ensembl agree on the exact locus in the query genome.
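The first option (overlapping the two annotations at the locus level) could be sketched roughly as below. This is a simplified illustration under stated assumptions, not a TOGA or Ensembl tool: it compares whole transcript and gene spans rather than exons, it assumes both annotations use the same contig names and closed, 1-based coordinates as in GTF, and the function names, the `min_fraction` threshold, and all IDs are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two closed intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def transfer_symbols(toga, ensembl, min_fraction=0.5):
    """Assign each TOGA transcript the symbol of its best-overlapping gene.

    toga:    {transcript_id: (chrom, strand, start, end)}
    ensembl: list of (chrom, strand, start, end, symbol)
    A symbol is transferred only if the best overlap covers at least
    min_fraction of the TOGA transcript span on the same contig and strand.
    """
    result = {}
    for tid, (chrom, strand, start, end) in toga.items():
        best_symbol, best_bases = None, 0
        for g_chrom, g_strand, g_start, g_end, symbol in ensembl:
            if (g_chrom, g_strand) != (chrom, strand):
                continue
            bases = overlap(start, end, g_start, g_end)
            if bases > best_bases:
                best_symbol, best_bases = symbol, bases
        if best_symbol is not None and best_bases >= min_fraction * (end - start + 1):
            result[tid] = best_symbol
    return result


# Hypothetical coordinates for illustration only.
toga = {
    "tx1": ("chr1", "+", 100, 200),  # overlaps GENE_A on the same strand
    "tx2": ("chr1", "-", 100, 200),  # same locus but opposite strand
    "tx3": ("chr2", "+", 1, 50),     # overlaps GENE_B by only a few bases
}
ensembl = [
    ("chr1", "+", 90, 180, "GENE_A"),
    ("chr2", "+", 45, 300, "GENE_B"),
]
symbols = transfer_symbols(toga, ensembl)
```

For real annotations an interval-tree or a dedicated tool would scale better than this nested loop, and exon-level overlap would be a stricter test of agreement on the locus.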
-
Okay, thanks for the help!
-
Is there a quick way to identify the assembly used for each of the mammalian species in the human_hg38_reference directory? I couldn't find all of them in the mammalianDNAZooAssemblies, e.g. the Chinese treeshrew tupChi1. Also, can TOGA be used to retrieve orthologous sequences that are not necessarily genes, like introns and intergenic regions? I know it uses intronic and intergenic similarity in the model, but it seems like the main output is orthologous genes and protein alignments.
-
Please have a look at https://genome.senckenberg.de/download/TOGA/human_hg38_reference/overview.table.tsv
-
Thank you!
-
Harder ...
-
Okay, that makes sense. Thank you for the help, I'll definitely try using the TOGA ortholog annotations!
-
Hi again, I was trying to make a CellRanger reference package for tupChi1 using the TOGA annotation file and the NCBI source FASTA. According to the overview table, the NCBI accession is "GCF_000334495.1", so I downloaded that FASTA from NCBI. However, I'm encountering an issue because the contigs in the TOGA GTF do not match those in the FASTA:
Contigs in the FASTA look like "NW_006159706.1", whereas the TOGA contigs look like "KB320653". Do I need to convert the TOGA-generated contigs somehow?
-
Hi Dario, the tupChi1 assembly comes from UCSC, and according to UCSC it has this GCA ID. NCBI distinguishes between RefSeq and GenBank, and they have a habit of renaming scaffolds.
-
Thank you, that worked! The contig names in the GTF were still missing the ".1" suffix found in the FASTA, e.g. "KB320809.1", but that was easy enough to add to the GTF.
-
UCSC always strips the ".1" version suffix.
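Adding the suffix back to the GTF can be done with a few lines of Python. This is only a sketch: it assumes every contig in the FASTA really is version ".1" (worth checking against the FASTA headers first), and the function name and the example line, including the "TOGA" source column, are made up for illustration.

```python
def add_version_suffix(gtf_lines, suffix=".1"):
    """Yield GTF lines with the suffix appended to the contig name (column 1).

    Comment lines and blank lines are passed through unchanged, and contigs
    that already end with the suffix are left alone.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.split("\t")
        if not fields[0].endswith(suffix):
            fields[0] += suffix
        yield "\t".join(fields)


# Hypothetical GTF line; only the contig name comes from the thread.
example = ['KB320809\tTOGA\texon\t100\t200\t.\t+\t.\ttranscript_id "tx1";\n']
fixed = list(add_version_suffix(example))
```

To process a file, the same generator can be fed an open file handle and its output written with `writelines`.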
-
Hi again! Using the TOGA GTF for mapping reads with Cellranger leads to the following downstream issue: the top-level features in the GTF are transcripts, not genes.
This is problematic because STAR will then discard multi-mapping reads when they map to two similar transcripts of the same gene. I am considering adding a "gene" row for each gene so that STAR recognizes that it should lump all the transcripts of "APP" together. The gene ID would, I suppose, be the query gene identifier supplied in the orthologyAnnotations.tsv, i.e. "reg_14844" for APP in tupChi1.
Does this sound reasonable to you? I'm guessing there is a reason why your group decided to omit "gene" level information in the GTFs.
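For illustration, the idea of adding one "gene" row per query gene could look roughly like the sketch below. It is not an endorsed TOGA workflow. It assumes the GTF has "transcript" rows carrying a `transcript_id` attribute and that a transcript-to-query-gene mapping has already been built from the orthology table; the function name, contig name, transcript IDs, and coordinates are all hypothetical, and only "reg_14844" is taken from the question above.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gene_rows(gtf_lines, tx_to_gene):
    """Build one GTF 'gene' row per query gene from its transcript rows.

    Each gene row spans the minimum start to the maximum end of the
    transcripts assigned to that gene on a given contig and strand.
    """
    spans = {}  # (gene, chrom, strand) -> (source, start, end)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if f[2] != "transcript":
            continue
        match = TRANSCRIPT_ID.search(f[8])
        if not match or match.group(1) not in tx_to_gene:
            continue
        key = (tx_to_gene[match.group(1)], f[0], f[6])
        start, end = int(f[3]), int(f[4])
        if key in spans:
            source, old_start, old_end = spans[key]
            spans[key] = (source, min(old_start, start), max(old_end, end))
        else:
            spans[key] = (f[1], start, end)
    return [
        "\t".join([chrom, source, "gene", str(start), str(end),
                   ".", strand, ".", f'gene_id "{gene}";'])
        for (gene, chrom, strand), (source, start, end) in spans.items()
    ]


# Hypothetical input: two overlapping transcripts of the same query gene.
lines = [
    'scaffold_1\tTOGA\ttranscript\t100\t500\t.\t+\t.\ttranscript_id "tx_a";\n',
    'scaffold_1\tTOGA\texon\t100\t500\t.\t+\t.\ttranscript_id "tx_a";\n',
    'scaffold_1\tTOGA\ttranscript\t300\t900\t.\t+\t.\ttranscript_id "tx_b";\n',
]
tx_to_gene = {"tx_a": "reg_14844", "tx_b": "reg_14844"}
rows = gene_rows(lines, tx_to_gene)
```

The transcript and exon rows would likely also need a matching `gene_id` attribute so that the mapping tool can tie them to the new gene row; that step is left out here to keep the sketch short.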
-
Sorry, I don't have much experience with GTF/GFF (horrible formats; genePred or bed12 is much better). It would help to also generate RNA-seq and/or IsoSeq data from the same tissue to add UTRs to the TOGA annotations. Then you have proper gene models to map the scRNA-seq reads to.
-
I'd like to use the TOGA-derived orthologs from your paper. I'm confused as to how I would use the TOGA annotations found in orthologsClassification.tsv.gz because the gene symbol of the query species is not available. From my understanding, your group did not supply the ENSEMBL annotations of the query so I'm unsure of how to map the orthologs found by TOGA to the query species' gene symbol.
Perhaps the way I should be using TOGA is to simply identify the human orthologous genes and then use the ENSEMBL orthology tables to find the gene symbols in the query species.
Regardless, your help would be much appreciated!