Why this project?
Since the start of the course I knew my capstone would have to be inspired by what I'm most passionate about ... Football!
I've always felt that, although performances and results depend on many factors, there is a great deal that stats can explain, and insights to be found that can add real value to a team. I decided to collect as much detailed data as possible on each player, so that I could extract as many insights as possible and make the most accurate predictions.
( for the full PDF presentation - presentation )
( for the full code - code )
- Provide valuable insights that can be useful to agents, investors, betting companies, managers, scouts and football professionals in general.
- Predict season average ratings and most important stats for each position with regression models.
- Predict level of season performance with classification models.
- Find the best performing young stars. (in Tableau)
- Data source: Football API
I soon realized that collecting and cleaning the data would be the most challenging part of the project, due to the number of nested dictionaries and lists and the way the data was structured within the API.
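As a minimal sketch of the flattening step, pandas' json_normalize can unpack this kind of nesting into dotted column names. The record shape below is illustrative only, not the API's exact schema:

```python
import pandas as pd

# Hypothetical shape of one player record from the API:
# 'statistics' is a list of nested dicts, one entry per team/season.
record = {
    "player": {"id": 1, "name": "Example Player", "age": 24},
    "statistics": [
        {"team": {"name": "Example FC"},
         "games": {"rating": "7.1", "position": "Attacker"},
         "goals": {"total": 12, "assists": 4}},
    ],
}

# json_normalize flattens nested dicts into dotted column names,
# e.g. 'team.name', 'games.rating', 'goals.total'.
df = pd.json_normalize(record["statistics"])
df["player.name"] = record["player"]["name"]
print(df.columns.tolist())
```

Repeating this per player and concatenating the results yields one flat dataframe ready for cleaning.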
- Selecting the data and creating the final dataframe.
Once I had cleaned all the data and collected all the information from the top 5 leagues, I realized that only the last 3 seasons were accurate and complete, so I decided to use just those seasons for this project.
- Heatmap, pair plot, histograms and correlations.
Initially, my idea was to predict only the average rating players receive at the end of each season. After analysing the correlations via the heatmap, however, I was surprised that some features did not correlate as highly as I expected. This led me to predict other variables as well: given the nature of the sport, I selected the most important stat for each position (goals_scored for attackers and midfielders, assists for midfielders, duels_won for defenders and goals_conceded for goalkeepers). As the full heatmap is hard to read, I created separate heatmaps to analyse each target variable and its correlations.
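A per-target heatmap like the ones described can be sketched with seaborn. The dataframe and column names here are synthetic stand-ins, not the project's actual data:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns

# Toy frame standing in for the attackers' stats dataframe.
rng = np.random.default_rng(42)
shots = rng.poisson(30, 200)
df = pd.DataFrame({
    "shots_on_target": shots,
    "goals_scored": rng.binomial(shots, 0.3),
    "minutes_played": rng.integers(500, 3000, 200),
})

# Instead of the full matrix, plot one target's correlations only,
# sorted so the strongest relationships sit at the top.
corr = df.corr(numeric_only=True)
sns.heatmap(
    corr[["goals_scored"]].sort_values("goals_scored", ascending=False),
    annot=True, cmap="coolwarm",
)
plt.tight_layout()
plt.savefig("goals_scored_corr_heatmap.png")
```

Restricting the heatmap to a single target column keeps it readable even when the feature set is large.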
Models used: Linear Regression, LassoCV, RidgeCV, ElasticNetCV, KNN Regressor, Decision Tree Regressor (only some of the best results are shown below).
- Predicting rating:
- Linear Regression
| LR | Score |
|---|---|
| Mean cross validation score | 0.2169554443734551 |
- DecisionTreeRegressor(max_depth=8)
| Decision Tree Regressor | Score |
|---|---|
| Mean cross validation score | 0.3432106216819707 |
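The mean cross-validation scores above can be produced with scikit-learn's cross_val_score, whose default scorer for regressors is R^2. The data below is synthetic, just to show the mechanics:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the players' feature matrix and season rating.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
beta = np.array([1.5, -2.0, 0.5, 1.0, 0.0, 0.75])
y = X @ beta + rng.normal(scale=2.0, size=500)

# Default scoring for regressors is R^2, averaged over 5 folds.
lr_scores = cross_val_score(LinearRegression(), X, y, cv=5)
dt_scores = cross_val_score(
    DecisionTreeRegressor(max_depth=8, random_state=0), X, y, cv=5
)
print(f"Linear Regression mean CV R^2: {lr_scores.mean():.3f}")
print(f"Decision Tree (8) mean CV R^2: {dt_scores.mean():.3f}")
```

Averaging over folds, rather than reporting a single train/test split, gives a more stable estimate of how each model generalises.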
- Predicting goals_scored for attackers:
- LassoCV
| LassoCV | Score |
|---|---|
| Mean cross validation score | 0.8804026621170709 |
- Predicting duels_won for defenders:
- LassoCV
| LassoCV | Score |
|---|---|
| Mean cross validation score | 0.9895123559099694 |
As the score was very high, I also tried the model without the most correlated feature (total_duels).
| LassoCV | Score |
|---|---|
| Mean cross validation score | 0.7430688883066259 |
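That check, dropping the most correlated feature and re-scoring, can be sketched as below. The toy defenders' frame deliberately makes duels_won almost a function of total_duels, mimicking why the first score was so high:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

# Toy defenders' frame: duels_won is roughly half of total_duels,
# so total_duels alone nearly determines the target.
rng = np.random.default_rng(1)
total_duels = rng.integers(50, 400, 300)
df = pd.DataFrame({
    "total_duels": total_duels,
    "interceptions": rng.poisson(40, 300),
    "tackles": rng.poisson(60, 300),
    "duels_won": (total_duels * rng.uniform(0.45, 0.60, 300)).round(),
})

y = df["duels_won"]
X_full = df.drop(columns="duels_won")
X_reduced = X_full.drop(columns="total_duels")  # drop the dominant feature

full = cross_val_score(LassoCV(cv=5, max_iter=5000), X_full, y, cv=5).mean()
reduced = cross_val_score(LassoCV(cv=5, max_iter=5000), X_reduced, y, cv=5).mean()
print(f"with total_duels:    {full:.3f}")
print(f"without total_duels: {reduced:.3f}")
```

The drop in score after removing the feature shows how much of the original result came from that single near-deterministic relationship.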
- Predicting assists for midfielders:
- LassoCV
| LassoCV | Score |
|---|---|
| Mean cross validation score | 0.6845306409517158 |
- Predicting goals_scored for midfielders:
- LassoCV
| LassoCV | Score |
|---|---|
| Mean cross validation score | 0.7438977475673354 |
NOTE : Not as high as for the attackers.
- Predicting goals_conceded for goalkeepers:
- LassoCV
| LassoCV | Score |
|---|---|
| Mean cross validation score | 0.9242224609563434 |
Models used: Logistic Regression with GridSearchCV (L1 & L2 penalties), KNN Classifier, Decision Tree Classifier, Random Forest Classifier (only some of the best results are shown below).
- Predicting top performing players with classifiers
| Statistic | Rating |
|---|---|
| count | 17077.000000 |
| mean | 6.874672 |
| std | 0.381241 |
| min | 3.000000 |
| 25% | 6.680000 |
| 50% | 6.850000 |
| 75% | 7.062500 |
| max | 9.400000 |
As the mean rating is 6.87, I decided to create a binary variable where 'top performers' are all players performing above a threshold of 7.0.
| Class | Label | Proportion |
|---|---|---|
| 'normal performers' | 0 | 0.703636 |
| 'top performers' | 1 | 0.296364 |
baseline = 0.7036364701059905
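The binary target, class proportions and baseline can be derived in a few lines of pandas. The ratings below are simulated just to show the mechanics, so the numbers differ from the real ones above:

```python
import numpy as np
import pandas as pd

# Illustrative ratings; the real column comes from the cleaned API data.
rng = np.random.default_rng(7)
ratings = pd.Series(np.clip(rng.normal(6.87, 0.38, 17077), 3.0, 9.4))

# Binary target: 1 = 'top performer' (season rating above 7.0).
top = (ratings > 7.0).astype(int)

proportions = top.value_counts(normalize=True)
baseline = proportions.max()  # accuracy of always predicting the majority class
print(proportions)
print(f"baseline = {baseline:.4f}")
```

The baseline is simply the majority-class share: any classifier has to beat it to be worth keeping.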
Due to the unbalanced classes, class_weight = 'balanced' was used.
- GridSearchCV ( penalty = l2, class_weight = balanced, scoring = accuracy )
Best estimator mean cross validated training score: 0.7774645869511476
To reduce the number of false positives (cases where the model predicted players to be top performers when they weren't), I changed the scoring to 'precision'.
- GridSearchCV ( penalty = l2, class_weight = balanced, scoring = precision )
Best estimator mean cross validated training score: 0.9418300653594771
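A grid search with those settings can be sketched as follows; the data is generated to roughly mirror the 70/30 class split, and the C grid is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Imbalanced toy data roughly mirroring the 70/30 split above.
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3],
                           n_informative=5, random_state=0)

# scoring='precision' penalises false positives: players predicted
# as top performers who are not.
grid = GridSearchCV(
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="precision",
    cv=5,
)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
print(f"mean CV precision: {grid.best_score_:.3f}")
```

Swapping the scoring string is all it takes to retune the same pipeline for a different error trade-off.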
- Comparison of discipline, offensive and defensive stats between leagues.
- Understanding where goalscorers come from, to determine which nationalities have more scoring talent and adapt best to each league.
- Analysing the young stars that have been performing at a top level.
- Created a dataframe with only under-23 players.
- Filtered the dataframe to players who performed above average and played more than 9 matches.
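The two filtering steps above amount to a boolean mask in pandas. Column names here are illustrative stand-ins for the project's schema:

```python
import pandas as pd

# Illustrative columns; the real frame has many more stats per player.
players = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [21, 26, 19, 22],
    "rating": [7.2, 7.5, 6.5, 7.1],
    "matches_played": [20, 30, 15, 8],
})

mean_rating = players["rating"].mean()

# Under-23 players who performed above average in 10+ matches.
young_stars = players[
    (players["age"] < 23)
    & (players["rating"] > mean_rating)
    & (players["matches_played"] > 9)
]
print(young_stars["name"].tolist())  # → ['A']
```

The resulting frame can then be exported for the Tableau dashboards.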
(Tableau screenshots below)