Almanac

Deep learning predictor for the outcomes of football (soccer) matches. The models are encoder-only transformers that use multi-head self-attention to capture how each team's previous matches, treated as a time series, affect its likelihood of winning a future match.

How to train the model

Run 'ScrapeData.py' to obtain the datasets of previous football matches and their statistics from the internet.
Run 'TransformData.py' to create the training and test datasets for the model.
Run 'EncoderPretraining.py'. This pretrains the transformer model and saves the learnt weights to a .pt file.
Run 'Training.py'. This fine-tunes the pretrained model to predict the percentage chance of each team winning a given match. The fine-tuned model is saved to a new .pt file.
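
The four steps above can also be chained from a single driver script. The following is only a minimal sketch: the script names come from the steps above, but `run_pipeline`, `build_commands` and the `script_dir` parameter are hypothetical helpers, and the assumption that all four scripts sit in one directory may not match the repository layout.

```python
import subprocess
import sys

# Script names are taken from the training steps; the order matters because
# each step consumes the output of the previous one.
PIPELINE = [
    "ScrapeData.py",
    "TransformData.py",
    "EncoderPretraining.py",
    "Training.py",
]


def build_commands(python_exe=sys.executable, script_dir="."):
    """Return one command per pipeline step (script_dir is an assumption)."""
    return [[python_exe, f"{script_dir}/{name}"] for name in PIPELINE]


def run_pipeline(script_dir=".", runner=subprocess.run):
    """Run every step in order, stopping at the first one that fails."""
    for command in build_commands(script_dir=script_dir):
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code.
        runner(command, check=True)
```

Calling `run_pipeline()` from the directory that holds the scripts would run the whole sequence; pass `script_dir` if they live elsewhere.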

Current Performance

The model's performance is limited by several factors, such as the number of features, dataset size and model scale. Even so, its predicted probabilities track real outcomes closely. The plot below compares the predicted probability of each outcome (Predicted Accuracy) with the percentage of those predictions that turned out to be correct (Model Accuracy), covering win, draw and loss, over an unseen test set of ~1000 matches. Both terms are defined below.

You can see that the predicted probabilities match the real-world results quite well, which suggests that the trained model is reasonably well calibrated.

Predicted Accuracy - For every match in the test set, the predicted probabilities of win, draw and loss are each rounded to the nearest 5% bucket (with a 2.5% offset, so the buckets are centred on 2.5%, 7.5% and so on up to 97.5%). All predictions that fall into the same bucket are grouped together and form the data for the corresponding bar.

Model Accuracy - For the predictions in each bar, the numbers of correct and incorrect predictions are used to calculate the mean accuracy for that bar. This value is the model accuracy. The closer it is to the bar's predicted accuracy (the value rounded to 5%), the better the model is performing.

NOTE: Some predicted values (such as 97.5) have no bar. This is because the model did not assign that probability to any outcome, so the number of samples for that bar is zero.
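
The bucketing described above can be sketched in a few lines of Python. This is an illustrative reading of the two definitions, not the code used to produce the plot: the function names `bucket_centre` and `calibration_table`, the input layout (one `(win, draw, loss)` tuple of percentages per match) and the encoding of the actual result as an index 0, 1 or 2 are all assumptions.

```python
from collections import defaultdict

BUCKET_WIDTH = 5.0  # percent, giving centres 2.5, 7.5, ..., 97.5


def bucket_centre(prob_percent):
    """Map a probability in [0, 100] to the centre of its 5% bucket."""
    # A probability of exactly 100 is placed in the top bucket (97.5).
    index = min(int(prob_percent // BUCKET_WIDTH), 19)
    return index * BUCKET_WIDTH + BUCKET_WIDTH / 2


def calibration_table(predictions, outcomes):
    """Compare predicted and observed frequencies per bucket.

    predictions: list of (p_win, p_draw, p_loss) tuples, in percent.
    outcomes: list of ints, the index (0, 1 or 2) of the actual result.
    Returns {bucket centre: (model accuracy in percent, sample count)}.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for probs, actual in zip(predictions, outcomes):
        # Every outcome of every match contributes one sample to a bucket.
        for outcome_index, prob in enumerate(probs):
            centre = bucket_centre(prob)
            counts[centre] += 1
            if outcome_index == actual:
                hits[centre] += 1
    # Buckets that received no predictions are simply absent from the result.
    return {c: (100.0 * hits[c] / counts[c], counts[c]) for c in sorted(counts)}
```

For a well-calibrated model, the accuracy stored for each bucket should be close to the bucket centre itself, which is what the bars in the plot compare.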


Sam-Armstrong/Almanac

Folders and files

Latest commit

History

Repository files navigation

Almanac

How to train the model

Current Performance

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages