About Premier League Prediction
This is a machine learning model that forecasts the final standings of the English Premier League based on historical data. The data is sourced from football-data.org. This project was inspired by the now-defunct FiveThirtyEight club soccer predictions.
Model
Methodology
The model uses an Elo-based system to predict the final league standings. The Elo rating system is a method for calculating the relative skill levels of players in two-player games such as chess. Here, it measures the relative strength of each Premier League club, and the model uses those ratings to assess how likely each result is in a given match.
Elo Calculation
Before the season, each club is assigned an Elo rating based on their performance in the previous season. The Elo ratings of all clubs are adjusted based on club value.
Newly promoted teams receive the maximum Elo of the relegated teams. These values are then adjusted based on club value.
Clubs retain 50% of their Elo rating from the previous season. This attempts to implicitly factor out the adjustment for club value in the previous season. 1500 is the average Elo rating in this model.
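The pre-season steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the half of the rating that is not retained is replaced by the 1500 average, and that promoted clubs are assigned the relegated clubs' maximum rating before that step. The function name and the club names are hypothetical.

```python
MEAN_ELO = 1500.0  # average Elo rating stated in the text
RETAIN = 0.5       # share of last season's rating that carries over


def preseason_ratings(final_elo, relegated, promoted):
    """Return starting ratings, before any club value adjustment."""
    # Promoted clubs receive the maximum Elo among the relegated clubs.
    promoted_elo = max(final_elo[club] for club in relegated)
    carried = {c: r for c, r in final_elo.items() if c not in relegated}
    carried.update({club: promoted_elo for club in promoted})
    # Assumption: the non-retained half is pulled to the league average.
    return {c: RETAIN * r + (1 - RETAIN) * MEAN_ELO for c, r in carried.items()}


ratings = preseason_ratings(
    {"Club A": 1700.0, "Club B": 1500.0, "Club C": 1400.0, "Club D": 1300.0},
    relegated={"Club C", "Club D"},
    promoted={"Club E"},
)
```

Under these assumptions a club that finished on 1700 starts the new season on 1600, and the promoted club starts from the better relegated club's 1400, which becomes 1450.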
Club values are then used to adjust the Elo rating. The club value adjustment is normalized for each club based on the maximum and minimum club values. As the best and richest teams win more often, the normalized club values are exponentially adjusted. Club values are factored into the Elo rating as follows:
The adjustment factor is currently set to 300, meaning the best clubs get a 300 point bump to their Elo rating.
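The exact formula is not reproduced here, so the following is only a sketch of one way a min-max normalized, exponentially shaped club value bump could be computed. The exponential curve and the `STEEPNESS` constant are assumptions chosen for illustration; only the adjustment factor of 300 comes from the text.

```python
import math

ADJUSTMENT_FACTOR = 300.0  # stated in the text
STEEPNESS = 2.0            # hypothetical: controls how sharply the curve bends


def club_value_bonus(values):
    """Map each club's market value to an Elo bump between 0 and 300."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    bonus = {}
    for club, value in values.items():
        # Min-max normalization to the range [0, 1].
        x = (value - lo) / span if span else 0.0
        # Assumed exponential shaping that keeps 0 at 0 and 1 at 1.
        curved = math.expm1(STEEPNESS * x) / math.expm1(STEEPNESS)
        bonus[club] = ADJUSTMENT_FACTOR * curved
    return bonus


bonus = club_value_bonus({"Club A": 1000.0, "Club B": 550.0, "Club C": 100.0})
```

With a convex curve like this one, a club halfway between the poorest and richest receives well under half of the maximum bump, which concentrates the adjustment on the most valuable clubs.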
Elo Updates
As the season progresses, the Elo rating of each club is updated based on the outcome of each match. The model uses the following formulas to calculate Elo rating for each match:
Win/Lose
Draw
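The formulas for these cases are not shown here. For reference, the standard Elo update handles a win, a loss, and a draw with a single expression by scoring them as 1, 0, and 0.5. The sketch below uses that textbook form, which may differ from this model's exact formulas; the `K_FACTOR` value is hypothetical and no home advantage term is included.

```python
K_FACTOR = 20.0  # hypothetical: maximum rating change from a single match


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B, between 0 and 1."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home, away, result):
    """Return new (home, away) ratings.

    result is the home club's score: 1.0 win, 0.5 draw, 0.0 loss.
    """
    delta = K_FACTOR * (result - expected_score(home, away))
    # Whatever one club gains, the other loses.
    return home + delta, away - delta


print(update_elo(1500.0, 1500.0, 1.0))  # (1510.0, 1490.0)
print(update_elo(1600.0, 1500.0, 0.5))  # the favourite loses points for a draw
```

In this form a draw between evenly rated clubs leaves both ratings unchanged, while a draw against a weaker opponent costs the stronger club points.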
Decay
The Elo rating of each club has a half-life of 1/4 of the season. This ensures that the most recent matches have the most impact on a team's Elo rating.
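One possible reading of this half-life, sketched below, is that the weight given to a match's rating change halves every quarter of a season. A Premier League club plays 38 matches, so a quarter of a season is 9.5 matches. How the decay is actually applied in the model is not described here, so the function names and the way the weights are combined are assumptions.

```python
MATCHES_PER_SEASON = 38
HALF_LIFE = MATCHES_PER_SEASON / 4  # 9.5 matches


def decay_weight(matches_ago):
    """Weight of a match played `matches_ago` matches in the past."""
    return 0.5 ** (matches_ago / HALF_LIFE)


def decayed_rating(base, changes):
    """Combine a base rating with per-match rating changes, oldest first.

    The most recent change has weight 1; older changes count for less.
    """
    n = len(changes)
    return base + sum(d * decay_weight(n - 1 - i) for i, d in enumerate(changes))
```

Under this reading, a result from 19 matches ago carries a quarter of the weight of the most recent result.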
Model Architecture
The model is trained on the following data:
- Elo
- Table Position
- Manager Games in Charge
- Recent Form
These inputs are compared to the actual outcome of each match. A Random Forest classifier is trained on the past two seasons and is used to predict the outcome of each match.
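As an illustration of the data described above, one training row per match could be assembled like this. The field names, the outcome encoding, and the definition of recent form are all hypothetical; rows of this shape could then be passed to a classifier such as scikit-learn's `RandomForestClassifier`.

```python
from dataclasses import dataclass


@dataclass
class ClubState:
    elo: float
    table_position: int
    manager_games: int   # games the current manager has been in charge
    recent_form: float   # assumption: points per game over recent matches


# Hypothetical label encoding for the three possible outcomes.
OUTCOMES = {"home_win": 0, "draw": 1, "away_win": 2}


def training_row(home, away, outcome):
    """Return (features, label) for one completed match."""
    features = [
        home.elo, home.table_position, home.manager_games, home.recent_form,
        away.elo, away.table_position, away.manager_games, away.recent_form,
    ]
    return features, OUTCOMES[outcome]


features, label = training_row(
    ClubState(1620.0, 2, 80, 2.2),
    ClubState(1480.0, 14, 12, 1.0),
    "home_win",
)
```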
Forecasting
The model generates a forecast before the start of each new match week, using the results from all previous match weeks. Before running the forecast, the model processes current form, manager tenure, and position in the league table for each Premier League club.
The forecast simulates the current season 10,000 times to determine a distribution of where each team is likely to finish in the final league table.
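A season simulation of this kind can be sketched as a simple Monte Carlo loop. This is a minimal version under stated assumptions: the match probabilities are supplied by the caller (in the model they would come from the classifier), ties on points are broken at random because the model has no notion of goals, and all names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def simulate_season(points, fixtures, probs, n_sims=10_000, seed=0):
    """Count how often each club finishes in each position.

    points:   current points per club
    fixtures: remaining matches as (home, away) pairs
    probs:    (home, away) -> (p_home_win, p_draw, p_away_win)
    """
    rng = random.Random(seed)
    finishes = defaultdict(Counter)
    for _ in range(n_sims):
        table = dict(points)
        for home, away in fixtures:
            outcome = rng.choices("HDA", weights=probs[(home, away)])[0]
            if outcome == "H":
                table[home] += 3
            elif outcome == "D":
                table[home] += 1
                table[away] += 1
            else:
                table[away] += 3
        # Assumption: clubs level on points are ordered at random.
        ranked = sorted(table, key=lambda c: (table[c], rng.random()), reverse=True)
        for position, club in enumerate(ranked, start=1):
            finishes[club][position] += 1
    return finishes


result = simulate_season(
    points={"Club A": 4, "Club B": 3, "Club C": 1},
    fixtures=[("Club A", "Club B"), ("Club B", "Club C")],
    probs={("Club A", "Club B"): (0.5, 0.3, 0.2),
           ("Club B", "Club C"): (0.6, 0.25, 0.15)},
    n_sims=1_000,
)
```

Dividing each count by `n_sims` gives the estimated probability that a club finishes in a given position.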
Computing Infrastructure
This model is deployed on the AWS cloud using ECS, Fargate, and S3. Docker containers are used to manage the model and the data pipeline. The model is run on a schedule based on the Premier League fixture list, which is updated every time the model is run. EventBridge is used to trigger the model runs.
Front End
Here are a few facts about how the data is displayed on this site:
- The pages are statically generated
- The underlying data updates every time the model runs
This is accomplished by caching the data and triggering a webhook to invalidate the cache after each model run, so only the affected pages are rebuilt.
Other Ideas
The next big step is incorporating goals into the model. Right now the model has no concept of the margin of victory. More seasons of data could also be used to train the model. At the beginning of the season, the model looks too much like a table of the clubs' market values.
Additionally, injuries and squad depth play a large role in the outcome of matches. However, data on lineups, injuries, and squads is not included in the tier I have access to from football-data.org, so these factors are not included in the current model.
Model Revisions
Version | Date | Changes |
---|---|---|
1.0 | 2024-08-16 | Initial Elo-based model |