Gene symbol assignment in the query species #139

dariotommasini · 2023-07-23T17:57:15Z

dariotommasini
Jul 23, 2023

I'd like to use the TOGA-derived orthologs from your paper. I'm confused as to how I would use the TOGA annotations found in orthologsClassification.tsv.gz because the gene symbol of the query species is not available. From my understanding, your group did not supply the ENSEMBL annotations of the query so I'm unsure of how to map the orthologs found by TOGA to the query species' gene symbol.

Perhaps the way I should be using TOGA is to simply identify the human orthologous genes and then use the ENSEMBL orthology tables to find the gene symbols in the query species.

Regardless, your help would be much appreciated!

MichaelHiller · 2023-07-23T18:21:09Z

MichaelHiller
Jul 23, 2023
Maintainer

Hi Dario,

This file has the orthology type (1:1, 1:many etc), defined as the number of reference genes vs. the number of intact orthologs in the query.

TOGA assigns each query gene a identifier reg_number.
If you have a 1:1, you can simply keep the gene symbol, e.g.
ENSG00000137672 ENST00000344327.TRPC6 reg_12370 ENST00000344327.TRPC6.22 one2one
ENSG00000137672 ENST00000348423.TRPC6 reg_12370 ENST00000348423.TRPC6.22 one2one
ENSG00000137672 ENST00000360497.TRPC6 reg_12370 ENST00000360497.TRPC6.22 one2one
ENSG00000137672 ENST00000532133.TRPC6 reg_12370 ENST00000532133.TRPC6.22 one2one
reg_12370 is TRPC6 and there are 4 transcripts.

If you have a 1:many, this would also work, e.g.
ENSG00000184840 ENST00000332598.TMED9 reg_4419 ENST00000332598.TMED9.56373 one2many
ENSG00000184840 ENST00000332598.TMED9 reg_4495 ENST00000332598.TMED9.127 one2many
--> two TMED9 genes in the query.

If you have many:1 or many:many, it it not obvious which symbol to assign, e.g.
ENSG00000172062 ENST00000625245.SMN1 reg_12660 ENST00000625245.SMN1.709 many2one
ENSG00000172062 ENST00000506163.SMN1 reg_12660 ENST00000506163.SMN1.709 many2one
ENSG00000172062 ENST00000380707.SMN1 reg_12660 ENST00000380707.SMN1.709 many2one
ENSG00000172062 ENST00000503079.SMN1 reg_12660 ENST00000503079.SMN1.709 many2one
ENSG00000205571 ENST00000628696.SMN2 reg_12660 ENST00000628696.SMN2.1564 many2one
ENSG00000205571 ENST00000380743.SMN2 reg_12660 ENST00000380743.SMN2.1564 many2one
ENSG00000205571 ENST00000380742.SMN2 reg_12660 ENST00000380742.SMN2.1564 many2one
ENSG00000205571 ENST00000638794.SMN2 reg_12660 ENST00000638794.SMN2.1564 many2one
ENSG00000205571 ENST00000626847.SMN2 reg_12660 ENST00000626847.SMN2.1564 many2one
SMN is a human duplication. You can call the single SMN gene in the query SMN1 or SMN2 (randomly pick a name), but it is not easy to correctly automate this.

If you already have a query annotation from Ensembl, then yes, you could overlap the TOGA transcripts with the Ensembl annotation and keep Ensembl's gene symbols.
Ensembl also provides orthology information.

Hope that helps

0 replies

dariotommasini · 2023-07-23T18:35:23Z

dariotommasini
Jul 23, 2023
Author

Thank you, this is very helpful.

Could you expand on what overlapping the TOGA transcripts with the Ensembl annotation would look like? Would it be like comparing the gtf from ENSEMBL to the TOGA gtf, and then transferring the ENSEMBL identifiers onto the TOGA file? Or is it much simpler than that: using the ENSEMBL orthology table to go from human gene to the ortholog in query species, and ignoring ENSEMBL's orthology classification since we're using TOGA's instead?

0 replies

MichaelHiller · 2023-07-24T05:28:25Z

MichaelHiller
Jul 24, 2023
Maintainer

Hmm, I would probably prefer the first option, as this makes sure that TOGA and Ensembl agree on the exact locus in the query genome.

0 replies

dariotommasini · 2023-07-25T19:59:52Z

dariotommasini
Jul 25, 2023
Author

Okay thanks for the help!

0 replies

dariotommasini · 2023-07-29T16:02:35Z

dariotommasini
Jul 29, 2023
Author

Is there a quick way to identify the assembly used for each of the mammalian species in the human_hg38_reference directory? I couldn't find all of them in the mammalianDNAZooAssemblies, ie the chinese treeshrew tupChi1.

Also, can TOGA be used to retrieve orthologous sequences that are not necessarily genes, like introns and intergenic regions. I know it uses intronic and intergenic similarity in the model but it seems like the main output is orthologous genes and protein alignments.

0 replies

MichaelHiller · 2023-07-29T18:34:43Z

MichaelHiller
Jul 29, 2023
Maintainer

Pls have a look at https://genome.senckenberg.de/download/TOGA/human_hg38_reference/overview.table.tsv
Every directory has the species name + a UCSC or our internal identifier (starting with HL). The tsv file lists the assembly accession and source, taxonomic lineage and more.

In principle yes, although this is not the primary output of TOGA. I guess the best is to use the query_annotation.bed file, which lists exons (and implicitly introns) in the query.

0 replies

dariotommasini · 2023-07-29T20:42:54Z

dariotommasini
Jul 29, 2023
Author

Thank you!

I see, I can get introns from inverting the exon coordinates. But intergenic regions are not recoverable?

0 replies

MichaelHiller · 2023-07-29T20:44:59Z

MichaelHiller
Jul 29, 2023
Maintainer

Harder ...
Why not use a whole genome alignment for that? E.g. alignment nets or Cactus?

0 replies

dariotommasini · 2023-07-29T20:51:45Z

dariotommasini
Jul 29, 2023
Author

Okay, that makes sense. Thank you for the help, I'll definitely try using the TOGA ortholog annotations!

0 replies

dariotommasini · 2023-08-14T17:30:28Z

dariotommasini
Aug 14, 2023
Author

Hi again,

I was trying to make a CellRanger reference package for tupChi1 using the TOGA annotation file and the ncbi source FASTA. According to the overview table, the NCBI accession is "GCF_000334495.1", so I downloaded that FASTA from NCBI. However, I'm encountering an issue because the contigs in the TOGA GTF do not match those in the FASTA:

mkref has failed: error building reference package
Error while parsing GTF file /clusterfs/kslab/REFERENCE/TreeShrew_TOGA/TreeShrew.gtf.gz
Invalid contig name encountered on GTF line 1: KB320809. The FASTA file has contigs:
['NW_006159427.1', 'NW_006159428.1', 'NW_006159429.1', 'NW_006159430.1', ...

Contigs in the FASTA look like "NW_006159706.1" whereas the TOGA contigs look like "KB320653". Do I need to convert the TOGA-generated contigs somehow?

0 replies

MichaelHiller · 2023-08-15T07:57:05Z

MichaelHiller
Aug 15, 2023
Maintainer

Hi Dario,

the tupChi1 assembly comes from UCSC and according to UCSC
UCSC Genome Browser assembly ID: tupChi1
Sequencing/Assembly provider ID: Beijing Genomics Institute TupChi_1.0
Assembly date: Jan. 2013
Accession ID: GCA_000334495.1
NCBI Genome ID: 13028 (Tupaia chinensis)
NCBI Assembly ID: 531728
NCBI BioProject ID: 221634
NCBI BioSample ID: SAMN01084043

It has this GCA ID.

NCBI distinguishes between RefSeq and Genbank and they have the habit of renaming scaffolds.
I assume the GCA and GCF are actually identical, but they just have different names.
Can you pls try to use the GCA assembly?

0 replies

dariotommasini · 2023-08-16T00:12:57Z

dariotommasini
Aug 16, 2023
Author

Thank you, that worked! The GTF was still missing the ".1" suffix, ie "KB320809.1", but that was easy enough to add to the GTF.

0 replies

MichaelHiller · 2023-08-16T06:09:41Z

MichaelHiller
Aug 16, 2023
Maintainer

UCSC is always stripping the .1
We typically do the same, as some tools have problems correctly processing these genomes.

0 replies

dariotommasini · 2023-08-18T01:01:52Z

dariotommasini
Aug 18, 2023
Author

Hi again!

Using the TOGA GTF for mapping reads using Cellranger leads to this following downstream issue:

The top-level of the GTF is transcripts, not genes, ie:

KB321016.1	stdin	transcript	1542828	1777300	.	-	.	gene_id "ENST00000359726.APP.304"; transcript_id "ENST00000359726.APP.304";

This is problematic because then STAR will toss the multi-mapping reads when they map to two similar transcripts of the same gene. I am considering adding a "gene" row for each gene so that STAR recognizes that it should lump all the transcripts of "APP" together. The gene I suppose would be the query gene identifiers supplied in the orthologyAnnotations.tsv ie "reg_14844" for APP in tupChi1:

ENSG00000142192 ENST00000346798.APP reg_14844 ENST00000346798.APP.304         one2one

Does this sound reasonable to you? I'm guessing there is a reason why your group decided to omit "gene" level information in the GTFs.

0 replies

MichaelHiller · 2023-08-18T06:38:47Z

MichaelHiller
Aug 18, 2023
Maintainer

Sorry, I don't have much experience with GTF/GFF (horrible formats, genePred or bed12 is much better).
But here is a consideration. TOGA does not give you UTRs. Most of your scRNA reads will map to UTRs and thus in a TOGA-only annotation downstream of a gene.

It would help to also generate RNA-seq and/or IsoSeq from the same tissue to add UTRs to the TOGAs. Then you have proper gene models to map the scRNA-seq reads to.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gene symbol assignment in the query species #139

{{title}}

Replies: 15 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Gene symbol assignment in the query species #139

dariotommasini Jul 23, 2023

Replies: 15 comments

MichaelHiller Jul 23, 2023 Maintainer

dariotommasini Jul 23, 2023 Author

MichaelHiller Jul 24, 2023 Maintainer

dariotommasini Jul 25, 2023 Author

dariotommasini Jul 29, 2023 Author

MichaelHiller Jul 29, 2023 Maintainer

dariotommasini Jul 29, 2023 Author

MichaelHiller Jul 29, 2023 Maintainer

dariotommasini Jul 29, 2023 Author

dariotommasini Aug 14, 2023 Author

MichaelHiller Aug 15, 2023 Maintainer

dariotommasini Aug 16, 2023 Author

MichaelHiller Aug 16, 2023 Maintainer

dariotommasini Aug 18, 2023 Author

MichaelHiller Aug 18, 2023 Maintainer

dariotommasini
Jul 23, 2023

MichaelHiller
Jul 23, 2023
Maintainer

dariotommasini
Jul 23, 2023
Author

MichaelHiller
Jul 24, 2023
Maintainer

dariotommasini
Jul 25, 2023
Author

dariotommasini
Jul 29, 2023
Author

MichaelHiller
Jul 29, 2023
Maintainer

dariotommasini
Jul 29, 2023
Author

MichaelHiller
Jul 29, 2023
Maintainer

dariotommasini
Jul 29, 2023
Author

dariotommasini
Aug 14, 2023
Author

MichaelHiller
Aug 15, 2023
Maintainer

dariotommasini
Aug 16, 2023
Author

MichaelHiller
Aug 16, 2023
Maintainer

dariotommasini
Aug 18, 2023
Author

MichaelHiller
Aug 18, 2023
Maintainer