About Premier League Prediction


This is a machine learning model that forecasts the final standings of the English Premier League based on historical data. The data is sourced from football-data.org. This project was inspired by the now-defunct FiveThirtyEight club soccer predictions.

Model

Methodology

The model uses an Elo-based system to predict the final league standings. The Elo rating system is a method for calculating the relative skill levels of players in two-player games such as chess. Here, Elo ratings measure the relative strength of each Premier League club, and the model uses those ratings to assess how likely each result is in a given match.

Elo Calculation

Before the season, each club is assigned an Elo rating based on their performance in the previous season. The Elo ratings of all clubs are adjusted based on club value.

Newly promoted teams receive the maximum Elo of the relegated teams. These values are then adjusted based on club value.

Clubs retain 50% of their Elo rating from the previous season; the other 50% is regressed to 1500, the average Elo rating in this model. This attempts to implicitly factor out the adjustment for club value in the previous season.

Elo_{base} = Elo_{previous season} \times \frac{1}{2} + 1500 \times \frac{1}{2}

Club values are then used to adjust the Elo rating. The club value adjustment is normalized for each club based on the maximum and minimum club values. As the best and richest teams win more often, the normalized club values are exponentially adjusted. Club values are factored into the Elo rating as follows:

Elo_{adjusted} = Elo_{base} + 300 \times Normalized Exponential Club Value

The adjustment factor is currently set to 300, meaning the most valuable club gets a 300-point bump to its Elo rating.
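The pre-season steps above can be sketched in a few lines of Python. This is only an illustration: the text does not give the exact form of the "normalized exponential" club value, so `normalized_exponential_value` and its `steepness` parameter are assumptions (a min-max normalization passed through an exponential curve rescaled to the range 0 to 1). The function names are hypothetical as well.

```python
import math

MEAN_ELO = 1500.0     # average Elo rating in the model (from the text)
VALUE_BONUS = 300.0   # club value adjustment factor (from the text)


def promoted_previous_elo(relegated_elos):
    """Promoted teams start from the maximum Elo of the relegated teams."""
    return max(relegated_elos)


def base_elo(previous_elo):
    """Keep 50% of last season's rating and regress the rest to the mean."""
    return 0.5 * previous_elo + 0.5 * MEAN_ELO


def normalized_exponential_value(value, min_value, max_value, steepness=2.0):
    """ASSUMPTION: min-max normalize, then bend with an exponential curve.

    Returns 0.0 for the least valuable club and 1.0 for the most valuable.
    The real model's curve may differ; `steepness` is a made-up knob.
    """
    if max_value == min_value:
        return 0.0
    x = (value - min_value) / (max_value - min_value)
    return (math.exp(steepness * x) - 1.0) / (math.exp(steepness) - 1.0)


def preseason_elo(previous_elo, value, min_value, max_value):
    """Elo_adjusted = Elo_base + 300 * normalized exponential club value."""
    bonus = VALUE_BONUS * normalized_exponential_value(value, min_value, max_value)
    return base_elo(previous_elo) + bonus
```

With this particular curve, clubs in the middle of the value range receive well under half of the 300 points, which matches the intent that the adjustment favors the richest clubs disproportionately.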

Elo Updates

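As the season progresses, the Elo rating of each club is updated based on the outcome of each match. The model uses the following formulas to calculate the Elo change for each match, where K is the K-factor that sets the maximum number of points a rating can move after a single match: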

Win/Lose

Expected Win = 1 \div (1 + 10^{\frac{Loser Elo - Winner Elo}{400}})

Change in Winner Elo = K \times (1 - Expected Win)

Change in Loser Elo = - Change in Winner Elo

Draw

Expected Home Win = 1 \div (1 + 10^{\frac{Away Elo - Home Elo}{400}})

Change in Home Elo = K \times (0.5 - Expected Home Win)

Change in Away Elo = - Change in Home Elo
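The win/lose and draw cases can be folded into a single update by scoring the match from the home side's point of view (1 for a home win, 0.5 for a draw, 0 for an away win). The sketch below does that. The value `k=20.0` is an assumption, since the K-factor used by the model is not stated here, and the function names are hypothetical.

```python
def expected_score(elo_a, elo_b):
    """Expected score of side A against side B under the Elo formula."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def update_elo(home_elo, away_elo, result, k=20.0):
    """Return (new_home_elo, new_away_elo) after one match.

    result: "H" (home win), "D" (draw) or "A" (away win).
    k: K-factor; 20.0 is a placeholder, not the model's actual value.
    """
    home_score = {"H": 1.0, "D": 0.5, "A": 0.0}[result]
    delta = k * (home_score - expected_score(home_elo, away_elo))
    # The update is zero-sum: whatever the home side gains, the away side loses.
    return home_elo + delta, away_elo - delta
```

For an away win this gives the same numbers as the winner/loser formulas, because the away side's expected win probability is one minus the home side's.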

Decay

Each club's Elo rating decays toward the 1500 average with a half-life of one quarter of the season (38 ÷ 4 = 9.5 matches). This ensures that the most recent matches have the most impact on a team's Elo rating.

Decay Factor = 0.5^{\frac{1}{38 \div 4}}

Elo_{decayed} = Elo \times Decay Factor + 1500 \times (1 - Decay Factor)
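As a rough illustration, the decay can be written as below. The decay factor works out to about 0.93, so each application removes roughly 7% of a club's distance from 1500. The sketch assumes the decay is applied once per match played, which is what the 38-match half-life formula suggests but is not stated explicitly; the names are hypothetical.

```python
MATCHES_PER_SEASON = 38
HALF_LIFE_MATCHES = MATCHES_PER_SEASON / 4       # 9.5 matches
DECAY_FACTOR = 0.5 ** (1 / HALF_LIFE_MATCHES)    # roughly 0.93
MEAN_ELO = 1500.0


def decay_elo(elo, mean=MEAN_ELO):
    """Pull a rating toward the mean by one decay step.

    ASSUMPTION: one step corresponds to one match played.
    """
    return elo * DECAY_FACTOR + mean * (1 - DECAY_FACTOR)


def decay_n(elo, steps):
    """Apply the decay step `steps` times in a row."""
    for _ in range(steps):
        elo = decay_elo(elo)
    return elo
```

After 19 steps (two half-lives), a club that started 200 points above the mean would sit 50 points above it if no match results moved its rating in the meantime.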

Model Architecture

The model is trained on the following features:

Elo
Table Position
Manager Games in Charge
Recent Form

These features are paired with the actual outcome of each match. A Random Forest Classifier is trained on the past two seasons and is used to predict the outcome of each match.

Forecasting

The model generates a forecast before the start of each new match week, making predictions based on the results from previous match weeks. Before running the forecast, the model processes current form, manager tenure, and position in the league table for each Premier League club.

The forecast simulates the current season 10,000 times to determine a distribution of where each team is likely to finish in the final league table.
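A minimal sketch of this kind of Monte Carlo simulation follows. It is not the site's actual code. The `predict` callback stands in for the trained classifier and is assumed to return home win, draw, and away win probabilities. Ties on points are broken at random here because the model has no concept of goals, and the sketch does not update features such as Elo or form between simulated matches.

```python
import random
from collections import Counter


def simulate_season(points, fixtures, predict, n_sims=10_000, seed=0):
    """Simulate the remaining fixtures many times.

    points:   dict of club -> points already earned
    fixtures: list of (home, away) tuples still to be played
    predict:  hypothetical callable (home, away) -> (p_home, p_draw, p_away)
    Returns a dict of club -> Counter mapping final position to a count.
    """
    rng = random.Random(seed)
    finishes = {club: Counter() for club in points}
    for _ in range(n_sims):
        table = dict(points)
        for home, away in fixtures:
            weights = predict(home, away)
            outcome = rng.choices(["H", "D", "A"], weights=weights)[0]
            if outcome == "H":
                table[home] += 3
            elif outcome == "A":
                table[away] += 3
            else:
                table[home] += 1
                table[away] += 1
        # ASSUMPTION: clubs level on points are ordered at random.
        ranked = sorted(table, key=lambda c: (table[c], rng.random()), reverse=True)
        for position, club in enumerate(ranked, start=1):
            finishes[club][position] += 1
    return finishes


def finish_probabilities(finishes, n_sims):
    """Convert position counts into probabilities per club."""
    return {
        club: {pos: count / n_sims for pos, count in sorted(counts.items())}
        for club, counts in finishes.items()
    }
```

Dividing each count by the number of simulations gives the probability of a club finishing in each position, which is the distribution described above.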

Computing Infrastructure

This model is deployed on the AWS cloud using ECS, Fargate, and S3. Docker containers are used to manage the model and the data pipeline. The model runs on a schedule derived from the Premier League fixture list, which is refreshed every time the model runs. EventBridge is used to trigger the model runs.

Front End

Here are a few facts about how the data is displayed on this site:

The pages are statically generated
The underlying data updates every time the model runs

This is accomplished by caching the data and triggering a webhook to invalidate the cache when the model is run. This means that only the affected pages are rebuilt when the model is run.

Other Ideas

The next big step is incorporating goals into the model. Right now the model has no concept of the margin of victory. More seasons of data could also be used to train the model. At the beginning of the season, the model looks too much like a table of the clubs' market values.

Additionally, injuries and squad depth play a large role in the outcome of matches. However, data on lineups, injuries, and squads is not available in the football-data.org tier I have access to, so these factors are not included in the current model.

Model Revisions

Version	Date	Changes
1.0	2024-08-16	Initial Elo-based model