This project aims to streamline the process of predicting NBA game outcomes by focusing on advanced AI prediction models rather than extensive data collection and management. Unlike my previous project, NBA Betting, which aimed to create a comprehensive feature set for predicting NBA games through extensive data collection, this project simplifies the process. While the previous approach benefited from various industry-derived metrics, the cost and complexity of managing the data collection were too high. This project focuses on a core data set, such as play-by-play data, and leverages deep learning and GenAI to predict game outcomes.
The project is in active development with a complete data collection pipeline and basic prediction engines. Recent infrastructure cleanup removed unnecessary complexity (Airflow orchestration, Wandb experiment tracking) to focus on the core GenAI prediction engine development.
The current system supports the 2023-2024 through 2025-2026 seasons with a complete PBP → GameStates → PlayerBox/TeamBox → Features → Predictions pipeline. The default installation includes only the current season (2025-2026); historical data is available separately. The web app provides a simple interface for displaying games with current scores and predictions.
The project is built around a few key components:
- Database Updater: This component is responsible for updating the database with the latest NBA game data. It fetches data from the NBA Stats API, performs ETL operations, generates features, creates predictions, and stores the data in a SQLite database. The pipeline includes:
  - database_update_manager.py: The main module that orchestrates the entire process.
  - schedule.py: Fetches the schedule from the NBA API and updates the database.
  - players.py: Fetches and updates player reference data.
  - nba_official_injuries.py: Fetches injury reports from NBA's official injury report PDFs.
  - betting.py: Fetches betting lines (spreads/totals) from ESPN API and Covers.com.
  - pbp.py: Fetches play-by-play data for games and updates the database.
  - game_states.py: Parses play-by-play data to generate game states and updates the database.
  - boxscores.py: Fetches traditional boxscore stats (PlayerBox and TeamBox).
  - prior_states.py: Determines prior final game states for teams.
  - features.py: Uses prior final game states to generate features for the prediction engine.
  - prediction_manager.py: Generates predictions for games using the chosen prediction engine.
- Games API: This component is responsible for updating predictions for ongoing or completed games and providing the data to the web app. It fetches data from the database, generates predictions, and serves the data to the web app.
  - games.py: Fetches game data from the database, manages prediction updating and data formatting.
  - api.py: Defines the API endpoints.
- Web App: This component is the front end of the project, providing a simple interface for users to view games and predictions. It is built using Flask.
  - start_app.py: The main entry point for the web app found in the root directory.
  - app.py: The main module that defines the Flask app and routes.
  - game_data_processor.py: Formats game data from the API for the web app.
  - templates/: Contains the HTML templates for the web app.
  - static/: Contains the CSS and JavaScript files for the web app.
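
The Database Updater runs its stages in a fixed order because each stage consumes what the previous one stored. A minimal sketch of that idea follows. The stage functions are hypothetical stand-ins for three of the modules above, the toy data is made up, and `run_pipeline` is not the actual interface of database_update_manager.py.

```python
from typing import Any, Callable

Context = dict[str, Any]

def fetch_pbp(ctx: Context) -> None:
    # Stand-in for pbp.py: the real module calls the NBA Stats API.
    ctx["pbp"] = [
        {"home_score": 2, "away_score": 0},
        {"home_score": 2, "away_score": 3},
    ]

def build_game_states(ctx: Context) -> None:
    # Stand-in for game_states.py: keep the last play as the final state.
    ctx["final_state"] = ctx["pbp"][-1]

def build_features(ctx: Context) -> None:
    # Stand-in for features.py: derive a single toy feature.
    state = ctx["final_state"]
    ctx["features"] = {"margin": state["home_score"] - state["away_score"]}

# Stages run strictly in order because each one reads what the previous wrote.
PIPELINE: list[tuple[str, Callable[[Context], None]]] = [
    ("pbp", fetch_pbp),
    ("game_states", build_game_states),
    ("features", build_features),
]

def run_pipeline() -> Context:
    """Run every stage in order and record which stages completed."""
    ctx: Context = {"completed": []}
    for name, stage in PIPELINE:
        stage(ctx)
        ctx["completed"].append(name)
    return ctx

if __name__ == "__main__":
    print(run_pipeline()["completed"])
```

Keeping the order in one list makes it easy to see, and to change, which stage depends on which.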
- Data Sourcing: Focus on a minimal number of data sources that fundamentally describe basketball. Currently, we use play-by-play data from the NBA API. In the future, incorporating video and tracking data would be interesting, though these require considerably more resources and access.
- Prediction Engine: This is the core of the project and the current development focus. The current prediction engine options will be replaced with a DL and GenAI-based engine, allowing for decreased data parsing and feature engineering while also scaling to predict more complex outcomes, including individual player performance.
- Data Storage: Future data storage will more seamlessly integrate with the prediction engine. The storage requirements will combine the current SQL-based data used for the API and web app with more advanced vector-based storage for RAG-based GenAI models.
- Web App: This is the project's front end, displaying the games for the selected date along with current scores and predictions. The interface will remain simple while usability is gradually improved. A separate GenAI chat will be added in the future to allow users to interact with the prediction engine and modify individual predictions based on their preferences.
- Time Series Data Inclusive: A focus on incorporating the sequential nature of events in games and across seasons, recognizing the significance of order and timing in the NBA.
- Minimal Data Collection: Streamlining data sourcing to the essentials, aiming for maximum impact with minimal data, thereby reducing time and resource investment.
- Wider Applicability: Extending the scope to cover more comprehensive outcomes, moving beyond standard predictions like point spreads or over/unders.
- Advanced Modeling System: Developing a system that is not only a learning tool but also potentially novel compared to the methods used by odds setters.
- Minimal Human Decisions: Reducing the reliance on human decision-making to minimize errors and the limitations of individual expertise.
Currently, there are a few basic prediction engines used to predict the outcomes of NBA games. These serve as placeholders for the more advanced DL and GenAI engines that will be implemented in the future. The current engines make pre-game predictions for home and away scores using ML models. These predictions are then used to calculate the home team's win percentage and margin. Once a game has started, updated predictions combine the current score, the time remaining, and the pre-game predictions.
- Baseline: A simple predictor that predicts scores based on teams' PPG and opponents' PPG.
- Linear: Ridge Regression model using 34 rolling average features from prior game states.
- Tree: XGBoost model using the same features as the Linear model (default, best performance).
- MLP (optional): PyTorch MLP model - requires uncommenting PyTorch in requirements.txt.
- Ensemble (optional): Weighted average of Linear (30%), Tree (40%), and MLP (30%) - requires PyTorch.
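
The Ensemble entry above describes a weighted average of three models. The sketch below shows that arithmetic for a single predicted score. The weights come from the list above; the function name and the dictionary-based interface are assumptions for illustration, not the project's actual code.

```python
# Weights taken from the Ensemble description above.
ENSEMBLE_WEIGHTS = {"Linear": 0.30, "Tree": 0.40, "MLP": 0.30}

def ensemble_score(
    predictions: dict[str, float],
    weights: dict[str, float] = ENSEMBLE_WEIGHTS,
) -> float:
    """Weighted average of one predicted score across the member models."""
    missing = set(weights) - set(predictions)
    if missing:
        raise ValueError(f"missing predictions for: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Dividing by the total keeps the result a true average
    # even if the weights do not sum to exactly 1.
    return sum(weights[m] * predictions[m] for m in weights) / total_weight

if __name__ == "__main__":
    home = ensemble_score({"Linear": 110.0, "Tree": 112.0, "MLP": 108.0})
    print(round(home, 1))  # roughly 110.2
```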
The current metrics are based on pre-game predictions for the home and away team scores, along with downstream metrics such as win percentage and margin. These simple predictors currently outperform the baseline predictor.
In the future, a more challenging baseline based on the Vegas spread will be added when the DL and GenAI models are implemented.
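
To make the downstream metrics concrete, here is one way margin, win percentage, and an in-game update could be derived from predicted scores. This README does not state the formulas the project uses, so everything below is an assumption: the normal-distribution model, the placeholder standard deviation of 12 points, the 48-minute pace-based update, and all function names.

```python
import math

# Assumption: the error in a predicted margin is roughly normal.
# The standard deviation is a placeholder, not a fitted value.
ASSUMED_MARGIN_STD = 12.0
REGULATION_MINUTES = 48.0

def home_margin(home_score: float, away_score: float) -> float:
    """Predicted home margin: positive means the home team is favored."""
    return home_score - away_score

def home_win_probability(margin: float, std: float = ASSUMED_MARGIN_STD) -> float:
    """P(home wins) as the normal CDF of the predicted margin."""
    return 0.5 * (1.0 + math.erf(margin / (std * math.sqrt(2.0))))

def live_final_score(
    current: float, pregame_final: float, minutes_remaining: float
) -> float:
    """Current points plus the pre-game scoring pace over the time remaining."""
    fraction_left = min(max(minutes_remaining / REGULATION_MINUTES, 0.0), 1.0)
    return current + pregame_final * fraction_left

if __name__ == "__main__":
    margin = home_margin(112.0, 108.0)
    print(margin, round(home_win_probability(margin), 3))
    print(live_final_score(current=60.0, pregame_final=110.0, minutes_remaining=24.0))
```

With this update rule the pre-game prediction dominates at tip-off and the actual score dominates near the final buzzer, which matches the behavior described for updated predictions.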
- Python 3.10+
- ~2GB disk space (database + models + dependencies)
# Clone the repository
git clone https://github.com/NBA-Betting/NBA_AI.git
cd NBA_AI
# Run automated setup
python setup.py

The setup script will:
- Create a virtual environment
- Install all dependencies
- Download the database and trained models from GitHub Releases
- Create your .env configuration file
- Verify the installation
# Activate the virtual environment
source venv/bin/activate
# Start the web app
python start_app.py

Visit http://localhost:5000 to view games and predictions.
# Use a specific predictor
python start_app.py --predictor=Tree
# Enable debug mode
python start_app.py --debug
# Set log level
python start_app.py --log_level=DEBUG

Available predictors: Baseline, Linear, Tree (default), MLP*, Ensemble*

*Requires PyTorch; uncomment it in requirements.txt.
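
For readers curious how options like these are typically handled, the sketch below parses the three flags shown above with the standard library's argparse. It is not the project's actual start_app.py. The flag names, predictor names, and the Tree default come from this README; the INFO default for the log level and the function name are assumptions.

```python
import argparse

# Predictor names as listed in this README.
PREDICTORS = ["Baseline", "Linear", "Tree", "MLP", "Ensemble"]

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Start the web app (sketch).")
    parser.add_argument("--predictor", choices=PREDICTORS, default="Tree")
    parser.add_argument("--debug", action="store_true")
    # Assumption: INFO as the default; the README does not state one.
    parser.add_argument(
        "--log_level",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        default="INFO",
    )
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.predictor, args.debug, args.log_level)
```

Restricting `--predictor` to a fixed set of choices means a typo fails immediately with a usage message rather than later, when the engine is loaded.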
This project is in active development.
The core data pipeline and prediction engines are functional. The focus is now on building advanced DL/GenAI prediction engines using play-by-play data.
This is a personal side project provided "as is" with no guarantees of quality, functionality, or ongoing maintenance. I've vibe-coded much of this release and while I'll try to address issues, I can't promise timely responses or fixes.
For production or commercial use: Consider using SportsRadar, the official NBA data partner. Their API would greatly simplify data management compared to scraping the NBA Stats API. I use this approach only because I can't justify the cost for a personal project.
The default setup downloads only the current season (2025-2026, ~1,300 games). A development database with 3 seasons (2023-2024 through 2025-2026, ~4,100 games total) is available from GitHub Releases.
To use it:
- Download NBA_AI_dev.zip from the latest release
- Extract to data/NBA_AI_dev.sqlite
- Update your .env:
  DATABASE_PATH=data/NBA_AI_dev.sqlite

- First run for a date: When viewing a date for the first time, the app fetches data from the NBA API. This initial update may take a few seconds per game. Subsequent views are instant since data is cached in the database.
- Season restrictions: By default, the web app allows seasons 2023-2024 through 2025-2026. To restrict or expand this, modify valid_seasons in config.yaml.
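
The first-run behavior described above is a cache-aside pattern: look in the database first, and call the API only when nothing is stored for that date. A minimal sketch with the standard library's sqlite3 follows. The `cached_games` table, the `get_games` function, and the `fetch_from_api` callback are hypothetical and do not reflect the project's real schema or code.

```python
import sqlite3
from typing import Callable

def get_games(
    conn: sqlite3.Connection,
    date: str,
    fetch_from_api: Callable[[str], list[str]],
) -> list[tuple[str, str]]:
    """Return (game_id, game_date) rows, fetching only on a cache miss."""
    # Hypothetical table used only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cached_games "
        "(game_id TEXT PRIMARY KEY, game_date TEXT)"
    )
    rows = conn.execute(
        "SELECT game_id, game_date FROM cached_games "
        "WHERE game_date = ? ORDER BY game_id",
        (date,),
    ).fetchall()
    if rows:
        return rows  # fast path: already stored

    # Slow path: fetch once, store, then serve from the database next time.
    game_ids = fetch_from_api(date)
    conn.executemany(
        "INSERT OR REPLACE INTO cached_games (game_id, game_date) VALUES (?, ?)",
        [(game_id, date) for game_id in game_ids],
    )
    conn.commit()
    return [(game_id, date) for game_id in sorted(game_ids)]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(get_games(conn, "2025-11-01", lambda d: ["0001", "0002"]))
```

One limitation of this simple version: a date with no games is never recorded as cached, so it triggers a fetch on every view.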
- Default focus: 2025-2026 season (current season for public release)
- Database: SQLite with complete pipeline (Schedule → Players → Injuries → Betting → PBP → GameStates → Boxscores → Features → Predictions)
- Built with Python, Flask, SQLite, scikit-learn, XGBoost, and nba_api (PyTorch optional for MLP/Ensemble)






