GitHub - kkkarnav/BetterxG: For CS-IS-3066-1: Applied Machine Learning in Football

CS-IS-3066-1: Applied Machine Learning in Football

Karnav Popat, Pranav Koka, Suyog Joshi

This repository contains the entire pipeline needed to set up our replication of StatsBomb's xG models, including our own calculated features, data exploration, and hypothesis testing.

Note that some of the datasets we've used are too large to upload; however, they can be created by running the first two notebooks.

Run Through:

To get started, create the main data we'll be using, augmented_data.csv:

1_data_creation: This notebook uses the statsbombpy API to grab all events for all seasons of StatsBomb's open data. Note that this is a large volume of data (about 5 GB), so running the notebook may be slow or difficult depending on network speed and data transfer constraints on the API.
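As a rough illustration of what this step involves, the sketch below walks the open data with statsbombpy's `sb.competitions()`, `sb.matches()` and `sb.events()` calls. It is not the notebook's actual code: the helper names `unique_season_keys` and `download_events` and the `max_seasons` parameter are hypothetical, and the real notebook may structure the loop differently.

```python
def unique_season_keys(competitions):
    """Return unique (competition_id, season_id) pairs, preserving order.

    `competitions` is an iterable of dict-like rows, e.g. the records of the
    DataFrame returned by statsbombpy's sb.competitions().
    """
    seen = []
    for row in competitions:
        key = (row["competition_id"], row["season_id"])
        if key not in seen:
            seen.append(key)
    return seen


def download_events(max_seasons=None):
    """Download events for every match of the selected seasons (hypothetical helper).

    max_seasons=None means "all seasons", which is slow and large.
    """
    # Imported lazily so the helper above works without third-party packages.
    import pandas as pd
    from statsbombpy import sb

    competitions = sb.competitions().to_dict("records")
    keys = unique_season_keys(competitions)[:max_seasons]

    frames = []
    for competition_id, season_id in keys:
        matches = sb.matches(competition_id=competition_id, season_id=season_id)
        for match_id in matches["match_id"]:
            frames.append(sb.events(match_id=match_id))

    # One long table of events across all downloaded matches.
    return pd.concat(frames, ignore_index=True)
```

Calling `download_events(max_seasons=1)` would fetch a single season, which is a reasonable way to check the setup before committing to the full download.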

2_data_augmentation: This notebook uses StatsBomb's open data to calculate our main dataset. It adds many features that we later use in the baseline models. The output is stored in the ./data directory. The variant notebook, 2_data_augmentation_with_press_and_triangle, does the same thing but represents defenders as Gaussian influences instead.
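The exact features are defined in the notebooks themselves. As a sketch of the kind of calculation involved, the code below computes two features commonly used in xG models (distance and angle to goal) and a simple Gaussian "influence" of defenders on the shooter. It assumes StatsBomb's 120 x 80 pitch coordinates with the goal centred at (120, 40) and posts at y = 36 and y = 44. The function names and the `sigma` parameter are illustrative assumptions, not names taken from the repository.

```python
import math

# StatsBomb pitch coordinates: 120 x 80, attacking goal on the line x = 120.
GOAL_X = 120.0
GOAL_CENTRE_Y = 40.0
POST_LOW_Y = 36.0
POST_HIGH_Y = 44.0


def shot_distance(x, y):
    """Straight-line distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)


def shot_angle(x, y):
    """Angle (radians) subtended by the goal mouth at the shot location."""
    to_high_post = math.atan2(POST_HIGH_Y - y, GOAL_X - x)
    to_low_post = math.atan2(POST_LOW_Y - y, GOAL_X - x)
    return abs(to_high_post - to_low_post)


def gaussian_pressure(shooter, defenders, sigma=2.0):
    """Sum of Gaussian influences of defenders at the shooter's position.

    Each defender contributes exp(-d^2 / (2 * sigma^2)), where d is the
    distance to the shooter. sigma is an assumed, tunable width.
    """
    sx, sy = shooter
    return sum(
        math.exp(-((dx - sx) ** 2 + (dy - sy) ** 2) / (2 * sigma ** 2))
        for dx, dy in defenders
    )
```

For example, `shot_distance(108, 40)` gives the 12-unit distance from the penalty spot, and `gaussian_pressure((108, 40), [])` is 0 when no defenders are nearby.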

Once you've set up, you can proceed to explore the data and establish baseline results:

3_data_exploration: This notebook has some visualisations that we made along with an exploration of the dataset statistics and our line of thinking.

4_baseline_models: Based on our literature review, we ran five models. We did no fine-tuning, feature selection, or augmentation, so these are the basic results.

Once we've established a baseline, we can use our calculated features and fine-tuning to obtain the best results we can:

5_hypotheses: Some hypothesis testing we did based on our earlier line of thinking.

6_augmented_models: This notebook contains our final results, as well as the techniques we applied and features we used to achieve them.

Finally, if you're interested in some of the work we did to get to our final results, check out the notebooks below. Note that these can only be run when you have the full dataset, not just the demo dataset.

7_feature_selection: Some of the feature selection methods we considered for data augmentation and used to test our model's out-of-distribution prediction.

8_leagues_testing: We tested the model's performance on out-of-distribution data by training and testing on separate leagues, and compared this with the performance when training and testing within the same league. We also established our hypothesis that different leagues have different playstyles.
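The cross-league setup described above can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes each shot is a dict with a `league` key (a hypothetical column name) and yields the train and test rows for every ordered pair of distinct leagues.

```python
from collections import defaultdict


def cross_league_pairs(shots, league_key="league"):
    """Yield (train_league, test_league, train_rows, test_rows) for every
    ordered pair of distinct leagues found in `shots`."""
    by_league = defaultdict(list)
    for shot in shots:
        by_league[shot[league_key]].append(shot)

    leagues = sorted(by_league)
    for train_league in leagues:
        for test_league in leagues:
            if train_league != test_league:
                yield (
                    train_league,
                    test_league,
                    by_league[train_league],
                    by_league[test_league],
                )
```

The within-league comparison would additionally need a held-out split inside each league, which is left out of this sketch.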

Requirements: (all packages at the latest version unless specified otherwise)

Python 3.10
pandas
numpy
matplotlib
tqdm
statsbombpy
bokeh
scikit-learn
imbalanced-learn
dtreeviz
statsmodels
stargazer
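Before running the notebooks, it can help to check which of these packages are importable. The sketch below uses only the standard library. The mapping from package name to import name (for example scikit-learn to `sklearn`, imbalanced-learn to `imblearn`) and the helper name `missing_packages` are supplied here for illustration and are not part of the repository.

```python
import importlib.util

# Package name (as installed) -> top-level module name (as imported).
REQUIRED_MODULES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "tqdm": "tqdm",
    "statsbombpy": "statsbombpy",
    "bokeh": "bokeh",
    "scikit-learn": "sklearn",
    "imbalanced-learn": "imblearn",
    "dtreeviz": "dtreeviz",
    "statsmodels": "statsmodels",
    "stargazer": "stargazer",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the sorted names of packages whose module cannot be found."""
    return sorted(
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    )
```

Running `print(missing_packages())` lists whatever still needs to be installed; an empty list means every requirement was found.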

Using the full dataset

To replicate results on the entire StatsBomb open dataset instead of just the demo season, simply remove "[23:24]" from the third cell of 1_data_creation.ipynb. This lets it iteratively download every season in the dataset instead of just one. Note that this can take quite a long time (10-30 hours) and uses a lot of storage and memory (10 GB+).
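To see why removing the slice has this effect, recall that `[23:24]` selects only the element at index 23. The sketch below uses a plain list with hypothetical names (`seasons`, `select_seasons`); the object sliced in the notebook may be a different type, but the slicing semantics are the same.

```python
def select_seasons(seasons, demo=True):
    """Return the single demo season (index 23) or every season."""
    if demo:
        # A slice returns a one-element sequence, so a loop over it still works.
        return seasons[23:24]
    return seasons


# Hypothetical stand-in for the list of seasons in the open data.
seasons = [f"season_{i}" for i in range(30)]

demo_only = select_seasons(seasons)               # one season
everything = select_seasons(seasons, demo=False)  # all seasons
```

Because the slice yields a sequence rather than a single item, the same download loop runs unchanged in both cases; only the number of iterations differs.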

Folders and files:

data
0_lit_review.md
1_data_creation.ipynb
2_data_augmentation.ipynb
2_data_augmentation_with_press_and_triangle.ipynb
3_data_exploration.ipynb
4_baseline_models.ipynb
5_hypotheses.ipynb
6_augmented_models.ipynb
7_feature_selection_methods.ipynb
8_leagues_testing.ipynb
Football ISM Final Presentation.pdf
Readme.md
