Replies: 5 comments
-
Hi Flora, if you are stuck with this, I would suggest running TOGA with the standard trained model (human-mouse), which we know does a great job for turtles, fish, and birds, and therefore likely reptiles as well.
-
Thanks! I have already been using the standard trained model - I was just curious whether retraining was worth it to improve results further, so I thought I'd at least give it a try.
-
Hi, if I may jump in: I'm working on a dataset of springtails - a much deeper phylogenetic framework than mammals, for example. I'd also be interested in training TOGA for this.
-
Hi all! Thanks for bringing up this point. The general idea was that the classifier module is customizable. However, in practice, it was written a few years ago and hasn't been revisited since. @fuesseler, answering your questions: I'm sorry - this notebook is really outdated. I'd be happy to look at the polished version.
@francicco,
The script creates two model files:
Then they are used in `Toga.__classify_chains`, which triggers the `modules/classify_chains.py` script - it takes paths to the two models and the data extracted from chains, and essentially applies the models to that data. The input data for classification is stored in `self.chain_results_df` -> `os.path.join(self.temp_wd, "chain_results_df.tsv")`, in the output directory.

The crucial part is the output of this script, which is saved as:
- `self.transcript_to_chain_classes = os.path.join(self.temp_wd, "trans_to_chain_classes.tsv")`
- `self.pred_scores = os.path.join(self.wd, "orthology_scores.tsv")`

The prediction scores are just a table with three columns: transcript (which is called "gene" in the code, but it's actually a transcript), chain, and score - a float from 0 to 1, where 1 indicates an ortholog and 0 indicates a paralog. This is just raw model output.

(!) However, TOGA distinguishes four classes of chains, and the ML models are used only to separate orthologs from paralogs! TOGA also separates spanning chains (their separation is quite trivial and does not require ML) and processed pseudogene chains. So the second output file, `trans_to_chain_classes.tsv`, contains information for these four classes. This is how the second file is created: see `TOGA/modules/classify_chains.py`, line 253 (commit `97eb5a1`). Basically, it contains a header: `f.write(f"GENE\t{ORTH}\t{PARA}\t{SPAN}\t{P_PGENES}\n")` (again, it should say "transcript" instead of "gene").

Ah yes, an important detail: in the raw output file, spanning chains have a score of "-1", and processed pseudogenes "-2", so they are not confused with truly classified transcript/chain pairs. Then each row contains tab-separated fields; if a group is empty, it is just 0.

To customize this part of TOGA deeply, I'd recommend patching and tuning the `__classify_chains` function. You could select a set of features that best fit your goal - using and combining the already existing features would be the simplest option.
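To make the score semantics above concrete, here is a minimal, hypothetical sketch of reading a raw `orthology_scores.tsv`-style file and mapping each score to one of the four chain classes. The score conventions (-1 for spanning, -2 for processed pseudogenes, 0..1 for paralog-to-ortholog probability) follow the explanation above; the 0.5 cutoff and the class labels are illustrative assumptions, not TOGA's exact internal rule.

```python
# Hypothetical sketch: interpret raw TOGA orthology scores.
# Assumed conventions (from the explanation above):
#   -1 -> spanning chain, -2 -> processed pseudogene,
#   otherwise a 0..1 probability (1 = ortholog, 0 = paralog).
# The 0.5 cutoff is an illustrative choice, not TOGA's exact rule.
import csv


def classify_score(score: float, cutoff: float = 0.5) -> str:
    """Map one raw model score to one of the four chain classes."""
    if score == -1.0:
        return "SPANNING"
    if score == -2.0:
        return "PROCESSED_PSEUDOGENE"
    return "ORTHOLOG" if score >= cutoff else "PARALOG"


def read_scores(path: str):
    """Yield (transcript, chain, class) triples from a 3-column TSV.

    Expects a header line, then rows: transcript <TAB> chain <TAB> score.
    """
    with open(path) as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip the header row
        for transcript, chain, score in reader:
            yield transcript, chain, classify_score(float(score))
```

For example, a row with score `0.97` would come back as `ORTHOLOG`, and a row with score `-1` as `SPANNING`.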
Create a training dataset with columns like transcript + chain + X features + y target. Then train a model (or models), write a script that applies it to the dataset stored in `self.chain_results_df`, and produce the two output files. This is a very general high-level overview; I'd be happy to answer more detailed questions :).
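The train-then-apply workflow described above can be sketched roughly as follows. This is an assumption-laden illustration, not TOGA's actual code: the column names (`transcript`, `chain`, `y`, and a placeholder feature `f1` in the usage note) are hypothetical, TOGA's real feature set lives in `chain_results_df.tsv`, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here (the `fit`/`predict_proba` interface is essentially the same).

```python
# Hypothetical sketch of the workflow described above:
# train a gradient-boosted classifier on a TSV with columns
# transcript + chain + feature columns + y (1 = ortholog, 0 = paralog),
# then apply it to a chain-features table and write a 3-column score TSV.
# Column names are placeholder assumptions, not TOGA's actual schema.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

NON_FEATURE_COLS = ("transcript", "chain", "y")


def feature_columns(df: pd.DataFrame) -> list:
    """Everything except the ID and target columns counts as a feature."""
    return [c for c in df.columns if c not in NON_FEATURE_COLS]


def train_chain_classifier(train_tsv: str) -> GradientBoostingClassifier:
    """Fit a boosted-tree model on the training table."""
    df = pd.read_csv(train_tsv, sep="\t")
    model = GradientBoostingClassifier(n_estimators=50)
    model.fit(df[feature_columns(df)], df["y"])
    return model


def score_chains(model, chain_results_tsv: str, out_tsv: str) -> None:
    """Apply the model and write transcript/chain/score, analogous in
    shape to the raw orthology scores file (score = P(ortholog))."""
    df = pd.read_csv(chain_results_tsv, sep="\t")
    scores = model.predict_proba(df[feature_columns(df)])[:, 1]
    df[["transcript", "chain"]].assign(score=scores).to_csv(
        out_tsv, sep="\t", index=False
    )
```

The key design point is keeping the feature-column selection identical between training and scoring, so the model always sees the same inputs in the same order.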
-
@kirilenkobm thanks for the explanation! I will give the models I trained at least a try then, but might stick to the standard TOGA models for my final analysis, depending on how that goes.
-
Dear authors,
As I am working with squamate genomes, I wanted to test retraining the TOGA models for this group, and I have some questions about it. Any help is much appreciated!
I was following your tutorial, using the "create_train.ipynb" and "train_model.py" scripts to train the SE and ME models.
I had to make several minor edits to the IPython notebook to produce a training dataset that is accepted by train_model.py, because running the notebook as-is was throwing errors about some required features missing.
While I am not particularly skilled with Python and my changes might not be the most elegant, I wanted to ask whether you would like to take a look at this edited version of create_train.ipynb? Other people attempting the training might find it useful not to have to troubleshoot this again, and I'd be happy to share it.
I was also wondering about the stats of my custom-trained XGBoost models. The accuracy seems quite a bit lower (and the training dataset is also smaller) than what you report in your paper. Of course, as far as I saw, the "artificially rearranged" gene-chain pairs from the publication are not included in the notebook, so that might explain it? Or do you think I should instead start from two more closely related Ensembl annotations (I could use the green anole plus a snake genome)?
This is the log I was getting:
Best regards,
Flora