VolleyML is a machine learning project designed to predict the winner of women's Volleyball Nations League (VNL) matches. It works by scraping match data, combining it with player statistics, processing both to create meaningful features, and then training a Random Forest model to predict outcomes. The project also includes a suite of data visualizations to analyze team and player performance.
- 🌐 Web Scraping: Automatically scrapes match results and schedules from the official VNL website using Selenium and BeautifulSoup.
- 🔧 Data Preprocessing: Cleans and merges separate datasets for match results and player statistics into a unified, model-ready format.
- 🤖 Machine Learning Model: Implements a `RandomForestClassifier` within a scikit-learn `Pipeline` to predict match winners based on the statistical differences between the two competing teams.
- 📊 Model Evaluation: Provides a complete evaluation of the model's performance, including accuracy, a classification report, and a confusion matrix.
- 🚀 Live Prediction: Includes a function to predict the outcome of a hypothetical match between any two teams with available data.
- 📈 Data Visualization: Generates and saves a variety of plots to explore the data, such as comparing team win rates against their average skill power.
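
To make the "statistical differences" idea concrete, here is a minimal sketch of how per-player rows might be averaged per team and turned into home-minus-away features for one hypothetical match. The column names (`team`, `attack`, `block`, `serve`), the sample numbers, and the helper names `team_averages` and `match_features` are assumptions for illustration; the actual columns in `vnl.csv` and the functions in the project may differ.

```python
from statistics import mean

# Made-up per-player rows; the real vnl.csv columns may differ.
PLAYERS = [
    {"team": "ITA", "attack": 12.0, "block": 2.0, "serve": 1.0},
    {"team": "ITA", "attack": 8.0, "block": 4.0, "serve": 3.0},
    {"team": "BRA", "attack": 10.0, "block": 1.0, "serve": 2.0},
    {"team": "BRA", "attack": 6.0, "block": 3.0, "serve": 2.0},
]
METRICS = ("attack", "block", "serve")


def team_averages(players):
    """Average each metric over the players of every team."""
    by_team = {}
    for row in players:
        by_team.setdefault(row["team"], []).append(row)
    return {
        team: {m: mean(r[m] for r in rows) for m in METRICS}
        for team, rows in by_team.items()
    }


def match_features(home, away, averages):
    """Return home-minus-away differences for one hypothetical match."""
    if home not in averages or away not in averages:
        raise KeyError(f"no data for {home!r} or {away!r}")
    return {
        f"{m.capitalize()}_Diff": averages[home][m] - averages[away][m]
        for m in METRICS
    }


averages = team_averages(PLAYERS)
features = match_features("ITA", "BRA", averages)
print(features)
```

A row built this way is what a trained model would receive when asked about a match that has not been played yet, which is why both teams need to have data available.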
VolleyML/
├── dataVisualizations/ # Output folder for generated plots
│ ├── team_win_rate_vs_attack_power.png
│ └── ...
├── preprocess.py # Handles data scraping and loading
├── train.py # Main script for model training, evaluation, and prediction
├── visualizeData.py # Generates and saves data visualizations
├── vnl.csv # Static CSV with player statistics (Required)
├── vnl_2024_matches_saved.csv # Cached data from scraping
└── chromedriver # Selenium WebDriver for Chrome (Required)
- Clone the Repository

      git clone <your-repository-url>
      cd VolleyML
- Create a Virtual Environment (Recommended)

      python3 -m venv venv
      source venv/bin/activate

- Install Dependencies

  Create a `requirements.txt` file with the following content:

      pandas
      scikit-learn
      selenium
      beautifulsoup4
      matplotlib
      seaborn
      torch
      numpy

  Then install them:

      pip install -r requirements.txt
- ChromeDriver
  - Download the `chromedriver` executable that matches your version of Google Chrome.
  - Place it in the root directory of the project.
  - Ensure the path in `preprocess.py` is correct: `CHROMEDRIVER_PATH = '/path/to/your/VolleyML/chromedriver'`
- Player Data
  - Ensure you have the player statistics file named `vnl.csv` in the root directory. This file is loaded by the `getData()` function in `preprocess.py`.
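
Before running anything, it can help to confirm that the two required files are where the scripts expect them. The following is a small, optional sketch and not part of the project; the helper name `missing_files` is hypothetical, and it assumes you run it from the project root.

```python
from pathlib import Path

# Files the project layout marks as required.
REQUIRED_FILES = ("chromedriver", "vnl.csv")


def missing_files(project_root):
    """Return the names of required files that are absent from project_root."""
    root = Path(project_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Check the current working directory (assumed to be the project root).
missing = missing_files(Path.cwd())
if missing:
    print("Missing required files:", ", ".join(missing))
else:
    print("All required files found.")
```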
The project is designed to be run in sequence.
- Scrape and Prepare Data

  First, run the preprocessing script. This will scrape the latest match data from the web and save it to `vnl_2024_matches_saved.csv` to avoid re-scraping every time.

      python3 preprocess.py
- Generate Visualizations (Optional)

  To explore the data and see relationships between team stats and performance, run the visualization script. The plots will be saved in the `dataVisualizations/` folder.

      python3 visualizeData.py
- Train and Evaluate the Model

  This is the main script. It will load the preprocessed data, train the model, print an evaluation report, and run a sample prediction for a hardcoded match (e.g., ITA vs. BRA).

      python3 train.py
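
The caching in the first step (scrape once, then reuse `vnl_2024_matches_saved.csv`) could be structured roughly as below. This is a sketch under assumptions, not the code in `preprocess.py`: `scrape_matches` is a stand-in for the real Selenium/BeautifulSoup scraper, `load_matches` is a hypothetical helper name, and the CSV columns shown are invented.

```python
import csv
from pathlib import Path

CACHE_FILE = Path("vnl_2024_matches_saved.csv")


def scrape_matches():
    """Stand-in for the real scraper; returns one invented row."""
    return [{"HomeTeam": "ITA", "AwayTeam": "BRA", "Winner": "ITA"}]


def load_matches(cache_file=CACHE_FILE, scraper=scrape_matches):
    """Read matches from the cache if it exists; otherwise scrape and save."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))

    rows = scraper()
    if rows:  # only write a cache when the scrape returned something
        with cache_file.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

With this shape, the first call to `load_matches()` invokes the scraper and writes the CSV, and later calls read the file instead. Deleting the CSV forces a fresh scrape.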
The prediction model is built on the idea that the difference in skill between two teams is a strong predictor of the outcome.
- Features: The model doesn't use raw team stats. Instead, its features are the differences between the home and away teams for key metrics:
  - `WinRate_Diff`
  - `Attack_Diff`
  - `Block_Diff`
  - `Serve_Diff`
  - `Dig_Diff`
  - `Receive_Diff`
- Preprocessing: A `ColumnTransformer` within a `Pipeline` handles all preprocessing automatically:
  - `StandardScaler`: Applied to all numerical `_Diff` features to normalize their scale.
  - `OneHotEncoder`: Applied to `HomeTeam_ID` and `AwayTeam_ID` to convert team names into a numerical format the model can understand.
- Algorithm: A `RandomForestClassifier` is used for the final classification task.
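
The three pieces above can be wired together roughly as follows. This is a minimal sketch rather than the contents of `train.py`: the data is randomly generated (not real VNL results), the label encoding (1 = home win), the toy labelling rule, the hyperparameters, the train/test split, and `handle_unknown="ignore"` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

DIFF_COLS = ["WinRate_Diff", "Attack_Diff", "Block_Diff",
             "Serve_Diff", "Dig_Diff", "Receive_Diff"]
TEAM_COLS = ["HomeTeam_ID", "AwayTeam_ID"]

# Synthetic matches for illustration only.
rng = np.random.default_rng(0)
teams = ["ITA", "BRA", "TUR", "POL"]
n = 80
X = pd.DataFrame(rng.normal(size=(n, len(DIFF_COLS))), columns=DIFF_COLS)
X["HomeTeam_ID"] = rng.choice(teams, size=n)
X["AwayTeam_ID"] = rng.choice(teams, size=n)
# Toy rule (assumption): the home team wins when its attack edge is positive.
y = (X["Attack_Diff"] > 0).astype(int)

# Scale the numeric differences and one-hot encode the team identifiers.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), DIFF_COLS),
    ("team", OneHotEncoder(handle_unknown="ignore"), TEAM_COLS),
])
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model.fit(X_train, y_train)

# Evaluation: accuracy, classification report, and confusion matrix.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# One hypothetical match, expressed as home-minus-away differences.
match = pd.DataFrame([{
    "WinRate_Diff": 0.10, "Attack_Diff": 1.5, "Block_Diff": 0.3,
    "Serve_Diff": -0.2, "Dig_Diff": 0.4, "Receive_Diff": 0.0,
    "HomeTeam_ID": "ITA", "AwayTeam_ID": "BRA",
}])
prediction = model.predict(match)[0]
print("Predicted home win" if prediction == 1 else "Predicted away win")
```

Keeping the scaler and encoder inside the `Pipeline` means they are fitted on the training split only and then applied unchanged to the test split and to any new match, which avoids leaking test information into preprocessing.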