Replies: 5 comments
-
Hi Flora, if you are stuck with this, I would suggest running TOGA with the standard trained model (human-mouse), which we know does a great job for turtles, fish, and birds, and therefore likely reptiles as well.
-
Thanks! I have already been using the standard trained model - I was just curious whether retraining was worth it to improve results further, so I thought I'd at least give it a try.
-
Hi, if I may jump in: I'm working on a dataset of springtails - a much deeper phylogenetic framework than mammals, for example. I'd also be interested in training TOGA for this.
-
Hi all! Thanks for bringing up this point. The general idea was that the classifier module is customizable. However, in practice, it was written a few years ago and hasn't been revisited since. @fuesseler, answering your questions: I'm sorry - this notebook is really outdated. I'd be happy to look at the polished version.
@francicco,
The script creates two model files:
Then they are used in `Toga.__classify_chains`, which triggers the `modules/classify_chains.py` script - it takes paths to the two models and the data extracted from chains, and essentially applies the models to that data. The input data for classification is stored in `self.chain_results_df` -> `os.path.join(self.temp_wd, "chain_results_df.tsv")`, in the output directory.

The crucial part is the output of this script, which is saved as:
- `self.transcript_to_chain_classes = os.path.join(self.temp_wd, "trans_to_chain_classes.tsv")`
- `self.pred_scores = os.path.join(self.wd, "orthology_scores.tsv")`

The prediction scores are just a table with three columns: transcript (which is called "gene" in the code, but it's actually a transcript), chain, and score - a float from 0 to 1, where 1 indicates an ortholog and 0 indicates a paralog. This is just raw model output.

(!) However, TOGA distinguishes four classes of chains, and the ML models are used only to separate orthologs from paralogs! TOGA also separates spanning chains (their separation is quite trivial and does not require ML) and processed pseudogene chains. So the second output file, `trans_to_chain_classes.tsv`, contains information for these four classes. This is how the second file is created: see `TOGA/modules/classify_chains.py`, line 253 (commit `97eb5a1`). Basically, it contains a header: `f.write(f"GENE\t{ORTH}\t{PARA}\t{SPAN}\t{P_PGENES}\n")` (again, it should say "transcript" instead of "gene").

Ah yes, an important detail: in the raw output file, spanning chains have a score of "-1", and processed pseudogenes "-2", so they are not confused with truly classified transcript/chain pairs. Then each row contains tab-separated fields; if a group is empty, it is just 0.

To customize this part of TOGA deeply, I'd recommend patching and tuning the `__classify_chains` function. You could select a set of features that best fit your goal - using and combining the already existing features would be the simplest option.
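To make the score semantics above concrete, here is a minimal, hypothetical sketch of reading a raw `orthology_scores.tsv`-style file and mapping each score to one of the four chain classes. The score conventions (-1 for spanning, -2 for processed pseudogenes, 0..1 for paralog-to-ortholog probability) follow the explanation above; the 0.5 cutoff and the class labels are illustrative assumptions, not TOGA's exact internal rule.

```python
# Hypothetical sketch: interpret raw TOGA orthology scores.
# Assumed conventions (from the explanation above):
#   -1 -> spanning chain, -2 -> processed pseudogene,
#   otherwise a 0..1 probability (1 = ortholog, 0 = paralog).
# The 0.5 cutoff is an illustrative choice, not TOGA's exact rule.
import csv


def classify_score(score: float, cutoff: float = 0.5) -> str:
    """Map one raw model score to one of the four chain classes."""
    if score == -1.0:
        return "SPANNING"
    if score == -2.0:
        return "PROCESSED_PSEUDOGENE"
    return "ORTHOLOG" if score >= cutoff else "PARALOG"


def read_scores(path: str):
    """Yield (transcript, chain, class) triples from a 3-column TSV.

    Expects a header line, then rows: transcript <TAB> chain <TAB> score.
    """
    with open(path) as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip the header row
        for transcript, chain, score in reader:
            yield transcript, chain, classify_score(float(score))
```

For example, a row with score `0.97` would come back as `ORTHOLOG`, and a row with score `-1` as `SPANNING`.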
Create a training dataset with columns like transcript + chain + X features + y target. Then train a model (or models), write a script that applies it to the dataset stored in `self.chain_results_df`, and produce the two output files. This is a very general high-level overview; I'd be happy to answer more detailed questions :).
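The train-then-apply workflow described above can be sketched roughly as follows. This is an assumption-laden illustration, not TOGA's actual code: the column names (`transcript`, `chain`, `y`, and a placeholder feature `f1` in the usage note) are hypothetical, TOGA's real feature set lives in `chain_results_df.tsv`, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here (the `fit`/`predict_proba` interface is essentially the same).

```python
# Hypothetical sketch of the workflow described above:
# train a gradient-boosted classifier on a TSV with columns
# transcript + chain + feature columns + y (1 = ortholog, 0 = paralog),
# then apply it to a chain-features table and write a 3-column score TSV.
# Column names are placeholder assumptions, not TOGA's actual schema.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

NON_FEATURE_COLS = ("transcript", "chain", "y")


def feature_columns(df: pd.DataFrame) -> list:
    """Everything except the ID and target columns counts as a feature."""
    return [c for c in df.columns if c not in NON_FEATURE_COLS]


def train_chain_classifier(train_tsv: str) -> GradientBoostingClassifier:
    """Fit a boosted-tree model on the training table."""
    df = pd.read_csv(train_tsv, sep="\t")
    model = GradientBoostingClassifier(n_estimators=50)
    model.fit(df[feature_columns(df)], df["y"])
    return model


def score_chains(model, chain_results_tsv: str, out_tsv: str) -> None:
    """Apply the model and write transcript/chain/score, analogous in
    shape to the raw orthology scores file (score = P(ortholog))."""
    df = pd.read_csv(chain_results_tsv, sep="\t")
    scores = model.predict_proba(df[feature_columns(df)])[:, 1]
    df[["transcript", "chain"]].assign(score=scores).to_csv(
        out_tsv, sep="\t", index=False
    )
```

The key design point is keeping the feature-column selection identical between training and scoring, so the model always sees the same inputs in the same order.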
-
@kirilenkobm thanks for the explanation! I will give the models I trained at least a try then, but might stick to the standard TOGA models for my final analysis, depending on how that goes.
-
Dear authors,
As I am working with squamate genomes, I wanted to test retraining the TOGA models for this group, and I have some questions about it. Any help is much appreciated!
I was following your tutorial, using the "create_train.ipynb" and "train_model.py" scripts to train the SE and ME models.
I had to make several minor edits to the IPython notebook to produce a training dataset that is accepted by train_model.py, because running the notebook as-is was throwing errors about some required features missing.
While I am not particularly skilled with Python and my changes might not be the most elegant, I wanted to ask whether you would like to take a look at this edited version of create_train.ipynb? Other people attempting the training might find it useful not to have to troubleshoot this again, and I'd be happy to share it.
I was also wondering about the stats of my custom-trained XGBoost models. The accuracy seems quite a bit lower (and the training dataset is also smaller) than what you report in your paper. Of course, as far as I saw, the "artificially rearranged" gene-chain pairs from the publication are not included in the notebook, so that might explain it? Or do you think I should instead start from two more closely related Ensembl annotations (I could use the green anole plus a snake genome)?
This is the log I was getting:
Best regards,
Flora