This was used to produce all data in the original manuscript:
Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK, Zoonomia Consortium, Hiller M. Integrating gene annotation with orthology inference at scale. Science, in press, 2022
-
Improved handling of compensating frameshifts. If two or more frameshifts (such as +1, +1, +1) compensate each other and restore the reading frame without having a stop codon in the frame-shifted region, TOGA does not consider these frameshifts as inactivating. However, the region between these frameshifts is translated into a protein sequence that is likely entirely different from the reference proteins. We observed extreme cases where compensating frameshifts cover essentially the entire transcript, such that considering the transcript as intact is not correct. In TOGA 1.1, we now consider the region between compensating frameshifts as deleted. This improves the transcript classification step. For example, the case where compensating frameshifts cover the entire transcript is now classified as lost.
-
Improved classification of N-terminal mutations. In TOGA 1.0, we masked inactivating mutations if they are located in the first 10% of the reading frame. The reason was that often there is a downstream ATG in proximity, but this is not always the case. In TOGA 1.1, we now determine where the next inframe ATG codon is located that is downstream of a potential inactivating mutation. If this ATG is located in the first 10% of the reading frame, the mutation is not considered as inactivating (as in 1.0). However, if this ATG is located outside the first 10% of the reading frame, a truncated protein would lack at least 10% of the N-terminus and TOGA 1.1 now considers this mutation as inactivating in the transcript classification step.
-
Improved classification of borderline cases where a projection has considerable amount of both deleted and missing regions.
-
Fixed intron retention module bug. Now, TOGA should correctly identify them.
- The codonAlignment.fasta file previously contained the raw CESAR output, including exons that were classified as deleted or missing in later processing steps. TOGA 1.1 now generates a codonAlignment.fasta where such exons are excluded.
- TOGA 1.0 generated bed and tab files that could be loaded as a total of 4 SQL tables to display the query annotation and provide rich information about each transcript via the TOGA handler function upon clicking on a transcript in the UCSC genome browser. TOGA 1.1 utilizes UCSC's bigBed functionality and produces a single bigBed file that contains all that information, enabling a SQL-free TOGA display. The TOGA handler function was updated to work with bigBed. The bigBed TOGA track often displays detailed information faster than the SQL version.
- Longer lists of several compensating frameshifts (e.g. FS_1,2,3,4,5) could result in warnings when loading SQL tables if the string gets too long. To avoid this, we now write a condensed format FS_1-5.
- Avoiding overlapping exon annotations in fragmented query assemblies. If a gene is split into several query chains, there were rare cases where several query chains overlap in the query genome, resulting in exon annotations that overlap each other. This creates invalid bed or genePred records. In TOGA 1.1, we now classify such genes as missing and do not output a bed entry for this gene.
- Sometimes a gene had several orthologous chains and one of them is a 'runaway' chain that spans megabases in the query locus, which is not computable in the CESAR step. TOGA 1.1 now ignores the runaway chain but runs CESAR for the other (normal) orthologous chains. This adds a few orthologs to the annotation.
- Added TOGA_assemblyStats.py
- Bug fixes in merge_cesar_output.py and CESAR output parser.
- Minor changes in plot_mutations.py.
MIDDLE_80%_INTACT
projection's feature renamed toMIDDLE_IS_PRESENT
SSM
mutation type (splice site mutation) is split into more preciseSSMD
(donor) andSSMA
(acceptor)
- Fixed codon alignment generation procedure - all codons are 3bp long. Previous version could mistakenly include dinucleoties into the codon alignment.
- Fixed parsing the protein and codon alignment output. Previous version could add a sequence to protein alignment if its ID contained "PROT".
Documentation improvements.
- Added support for Nextflow DSL2.
- Enabled generation of the processed pseudogenes annotation track.
- Removed
-f/--fragmented_genome
flag from toga.py CLI - the "assemble transcript from fragments" feature is enabled by default. Please use--disable_fragments_joining/--dfj
option to disable this feature. - Fixed package versions in the
requirements.txt
- CESAR_wrapper.py does not create temporary input files for CESAR. Instead, passes the input directly to CESARs stdin.
- Fixed GCC flags for ARM architecture.
- A little improvement in the
run_test.sh
script. - Better versions management - added version.py
- extract_codon_alignment.py: "!" characters inserted my MACSE2.0 to compensate frameshifts are replaced with "N".
- CESAR_wrapper.py -> does not use
/dev/shm/
or/tmp/
partitions anymore. - Significantly improved logging - all logs are automatically stored in the output directory.
- Nomenclature update: identifiers of TOGA-annotated genes now start with "TOGA_" instead of potentially confusing "reg_"
- Nomenclature update: TRANS chain class (considered confusing) renamed to SPAN (better fits spanning chains)
- Added
bed2gtf
submodule to facilitate results post-processing. - Improved parallelization strategy: using abstract class to handle different ways to parallelize computations.
- (started): better code organisation - constants class
- Replaced CESAR2.0 submodule with Kirilenko's lightweight fork (without Kent, etc.)
- Removed obsolete optimised CESAR path.
- Fixed a minor bug with re-running CESAR jobs memory parameter if no buckets were specified.
- (by shjenkins94) added gene_prefix argument
- (by shjenkins94) fixed U12 argument check bug
- minor code style improvements
- Drafted sanity checker to ensure no temporary directories match necessary directories, see
modules.toga_sanity_checks.TogaSanityChecker.check_dir_args_safety
and issue 132 - Added logging for filtered out reference transcripts
- Updated library versions - now compatible with python3.11
- Updated configure.sh -> fixed inability to install CESAR if TOGA is installed from a zip archive + cleanup scenario
- Fixed bug with parsing intron retention in the CESAR output -> in the prev versions, query chars located vs > were skipped
- Utility class for TOGA sanity checks
- Added argument to control cluster queue name (before, it was just
batch
). GLP_values.py
content merged withconstants.py
- Added support for the Poetry package manager through the inclusion of a
pyproject.toml
file.
- Continue moving static functions of the Toga class to dedicated utility class.
- (planned) Step management - now user can rerun TOGA from any desired step.
- (planned) Using CESAR in the single exon mode to reduce memory requirements and increase runtime
- (planned) Improved prediction of splice sites
- (planned) Detection of introns present in the query but not the reference
- (planned) Updated set of reference introns with non-canonical splice sites