gmalbert/march-madness

march-madness

March Madness College Basketball Predictions

Overview

A March Madness betting prediction system that combines historical data, machine learning models, and real-time game schedules to generate betting predictions for tournament games. Version 2.0 adds KenPom and BartTorvik efficiency ratings, which lowered spread MAE by 8.4% relative to the original 3-feature models.

Features

  • Comprehensive Data Collection: 10 years of historical tournament data (2016-2025)
  • Advanced Efficiency Metrics: KenPom and BartTorvik ratings integration (364 teams)
  • Enhanced ML Models: 11-feature models with 8.4% spread accuracy improvement
  • Real-time Predictions: Live game schedules with betting predictions (includes live Vegas spreads/OUs via Odds API)
  • Underdog Value Bets: Automatic detection of profitable underdog opportunities
  • Kelly Criterion Betting: Optimal bet sizing recommendations
  • Multi-API Integration: CBBD + ESPN + KenPom + BartTorvik
  • Streamlit UI: Interactive web interface for predictions
  • Betting Models Framework: Complete implementation of docs/roadmap-betting-models.md

Recent Updates (v2.0)

🎯 Model Performance Improvements

  • Spread Accuracy: 8.4% improvement (-0.97 points MAE)
  • Moneyline Accuracy: +2.7 percentage points
  • Total Accuracy: 1.7% improvement
  • Feature Expansion: 3 → 11 features (CBBD + KenPom + BartTorvik)

🔧 New Data Sources

  • KenPom: Advanced efficiency ratings (NetRtg, ORtg, DRtg, AdjT, Luck, SOS)
  • BartTorvik: Adjusted offensive/defensive efficiency
  • Team Canonicalization: 99.7% mapping coverage (364/365 teams)
  • Automated Data Pipeline: Selenium scraping + data enrichment

📊 Enhanced Features

  • Extended Feature Set: 6 KenPom + 2 BartTorvik metrics
  • Canonical Team Names: Standardized naming across all sources
  • Data Validation: Complete feature coverage for 15,961 games
  • Model Retraining: Optimized for extended feature space
  • Betting Models: Complete implementation with evaluation and ensemble methods

API Information

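This project uses the College Basketball Data API (CBBD), the ESPN API, and the Odds API, along with ratings scraped from KenPom and BartTorvik.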

Environment variables

  • CBBD_API_KEY - College Basketball Data API key
  • ODDS_API_KEY - Odds API key (used by fetch_live_odds.py)
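A minimal sketch of how a script might read these two variables at startup. The helper name `load_api_keys` is hypothetical and not part of this repository, and treating the Odds API key as optional is an assumption of the sketch.

```python
import os


def load_api_keys(env=None):
    """Return the API keys from the environment, failing early if CBBD is missing.

    `load_api_keys` is a hypothetical helper, not a function from this repo.
    """
    env = os.environ if env is None else env
    cbbd_key = env.get("CBBD_API_KEY")
    if not cbbd_key:
        raise RuntimeError("CBBD_API_KEY is not set")
    # Assumption: live odds are optional, so a missing ODDS_API_KEY is allowed.
    return {"cbbd": cbbd_key, "odds": env.get("ODDS_API_KEY")}


# Example with an explicit mapping instead of the real environment
keys = load_api_keys({"CBBD_API_KEY": "demo-key"})
print(keys["odds"] is None)
```

Failing at startup keeps a missing key from surfacing later as an opaque HTTP error in the middle of a data collection run.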

Setup

Prerequisites

  • Python 3.8+
  • CBBD API key (set as CBBD_API_KEY environment variable)

Installation

pip install -r requirements.txt

Environment Setup

export CBBD_API_KEY="your_api_key_here"
export ODDS_API_KEY="your_odds_api_key_here"  # used by fetch_live_odds.py

Data Collection

Comprehensive Historical Data

Collect 10 years of betting data for model training:

python data_collection.py --collect

This collects:

  • Tournament games (2016-2025)
  • Betting lines and odds
  • Team season statistics
  • Efficiency ratings
  • Poll rankings

KenPom & BartTorvik Data

Automated collection of advanced efficiency metrics:

# Download KenPom ratings
python download_kenpom.py

# Download BartTorvik ratings (auto-renames and cleans duplicates)
python download_barttorvik.py

# Create canonical datasets
python data_tools/efficiency_loader.py

Output files:

  • data_files/kenpom_ratings.csv - Raw KenPom data (365 teams)
  • data_files/barttorvik_ratings.csv - Raw BartTorvik data (365 teams, auto-renamed from download)
  • data_files/kenpom_canonical.csv - Cleaned with canonical names (364 teams)
  • data_files/barttorvik_canonical.csv - Cleaned with canonical names (364 teams)

Team Name Canonicalization

Automatic mapping of team names across data sources:

# Match team names and create mappings
python scripts/match_teams.py

# Apply manual corrections
python scripts/apply_mappings.py

Output files:

  • data_files/kenpom_to_espn_matches.csv - KenPom → canonical mappings
  • data_files/bart_to_espn_matches.csv - BartTorvik → canonical mappings
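The matching step can be illustrated with a small sketch based on the standard library's `difflib`. This is not the logic in scripts/match_teams.py, which is not shown here; the function names, the normalization rules, and the 0.85 cutoff are all assumptions chosen for illustration.

```python
import difflib
import string


def normalize(name):
    """Lowercase the name, drop punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(name.lower().translate(table).split())


def match_team(source_name, canonical_names, cutoff=0.85):
    """Return the closest canonical name, or None if nothing is similar enough."""
    lookup = {normalize(c): c for c in canonical_names}
    hits = difflib.get_close_matches(
        normalize(source_name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


canonical = ["Kansas", "NC State", "Saint Mary's"]
print(match_team("N.C. State", canonical))  # NC State
print(match_team("Gonzaga", canonical))     # None
```

Names that return `None` are the ones a manual corrections pass, such as the one in scripts/apply_mappings.py, would need to cover.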

Data Enrichment

Enhance training data with advanced metrics:

# Add KenPom/BartTorvik features to historical games
python enrich_training_data.py

# Retrain models with extended features
python retrain_with_extended_features.py

Output files:

  • data_files/training_data_enriched.csv - Historical data with all features
  • data_files/training_data_complete_features.csv - Complete cases only (15,961 games)
  • data_files/models/ - Retrained models with 11 features
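The difference between the enriched file and the complete-features file can be sketched as a join followed by a complete-case filter. The row layout (a `team` column keyed by canonical name), the helper names, and the rating values below are assumptions for illustration; the actual logic lives in enrich_training_data.py.

```python
FEATURES = ["net_rtg", "adj_oe"]  # two of the 11 features, for brevity


def enrich(games, ratings):
    """Attach ratings to each game row; ratings are keyed by canonical team name."""
    enriched = []
    for game in games:
        row = dict(game)
        team_ratings = ratings.get(game["team"], {})
        for feature in FEATURES:
            # Stays None when the team has no canonical mapping
            row[feature] = team_ratings.get(feature)
        enriched.append(row)
    return enriched


def complete_cases(rows):
    """Keep only rows where every feature is present."""
    return [r for r in rows if all(r[f] is not None for f in FEATURES)]


games = [{"team": "Kansas", "margin": 7}, {"team": "Unmapped U", "margin": -3}]
ratings = {"Kansas": {"net_rtg": 25.1, "adj_oe": 118.4}}  # made-up values
rows = enrich(games, ratings)
complete = complete_cases(rows)
print(len(rows), len(complete))  # 2 1
```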

Individual Data Collection Functions

Tournament Games

from data_collection import fetch_all_tournament_games

# Get games for specific years
games = fetch_all_tournament_games(2023, 2024)
print(f"Found {len(games)} tournament games")

Betting Lines

from data_collection import fetch_historical_lines

# Get betting lines for tournaments
lines = fetch_historical_lines(2023, 2024)
print(f"Found {len(lines)} betting lines")

Rankings

from data_collection import fetch_rankings

# Get poll rankings
rankings = fetch_rankings(2024)
print(f"Found {len(rankings)} ranking entries")

Team Statistics

from data_collection import fetch_team_season_data, fetch_efficiency_ratings

# Get team stats and efficiency ratings
team_stats = fetch_team_season_data(2024)
efficiency = fetch_efficiency_ratings(2024)

Examples

See examples/data_collection_examples.py for complete usage examples.

Automated Data Collection

Automated Efficiency Data Updates

The system automatically fetches KenPom and BartTorvik efficiency ratings daily at 2 AM ET, but only during basketball season and only when games are scheduled.

When it runs:

  • Basketball season: October 15 - April 15
  • Games scheduled: Today or tomorrow have college basketball games
  • Time: Daily at 2 AM Eastern Time (with random 0-60 minute delay)
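The three conditions above can be sketched as follows. The function names and the `game_dates` input are hypothetical; the repository's own check is in check_efficiency_update_needed.py and may be structured differently.

```python
import random
from datetime import date, timedelta


def in_basketball_season(today):
    """True from October 15 through April 15 (the window wraps the new year)."""
    month_day = (today.month, today.day)
    return month_day >= (10, 15) or month_day <= (4, 15)


def should_update(today, game_dates):
    """True when in season and games are scheduled today or tomorrow.

    game_dates is an assumed input: a set of dates that have scheduled games.
    """
    tomorrow = today + timedelta(days=1)
    return in_basketball_season(today) and (
        today in game_dates or tomorrow in game_dates
    )


def startup_delay_seconds(rng=random):
    """Random delay of 0 to 60 minutes, expressed in seconds."""
    return rng.uniform(0, 60 * 60)


print(should_update(date(2025, 3, 20), {date(2025, 3, 21)}))  # True
print(should_update(date(2025, 7, 1), {date(2025, 7, 1)}))    # False
```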

Anti-detection measures:

  • Randomized timing to avoid consistent patterns
  • Headless Chrome with anti-bot countermeasures
  • Custom user agents and browser fingerprinting evasion
  • Random delays between site requests
  • Only runs when actually needed (season + games scheduled)

What gets updated:

  • data_files/kenpom_ratings.csv - Raw KenPom data
  • data_files/barttorvik_ratings.csv - Raw BartTorvik data
  • data_files/kenpom_canonical.csv - Cleaned KenPom with canonical names
  • data_files/barttorvik_canonical.csv - Cleaned BartTorvik with canonical names

Manual trigger: You can also manually run the efficiency data update:

python check_efficiency_update_needed.py && python download_kenpom.py && python download_barttorvik.py && python data_tools/efficiency_loader.py

GitHub Actions Workflow: The automated workflow (.github/workflows/update-efficiency-ratings.yml):

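  • Runs daily at 7 AM UTC with randomization (2 AM Eastern during standard time, 3 AM during daylight saving time, because GitHub Actions cron schedules use UTC)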
  • Only executes during basketball season with scheduled games
  • Includes anti-detection measures to avoid blocking
  • Commits changes back to the repository
  • Only commits if efficiency data has actually changed

Underdog Value Bets

Finding Profitable Opportunities

The system automatically identifies underdog betting opportunities where the model predicts a higher win probability than the betting odds suggest.

from underdog_value import identify_underdog_value

# Example: Model gives underdog 40% chance, but odds imply 27%
game = {
    'home_team': 'Kansas',
    'away_team': 'NC State',
    'home_moneyline': -350,
    'away_moneyline': +275
}

home_win_prob = 0.60  # Model prediction
value_bet = identify_underdog_value(game, home_win_prob, min_ev_threshold=5.0)

if value_bet:
    print(f"Value bet: {value_bet['team']}")
    print(f"Edge: {value_bet['edge']:.1%}")
    print(f"ROI: {value_bet['roi']:.1f}%")
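The numbers in the example comment can be reproduced by hand. The sketch below shows the standard conversion from American odds to implied probability and the resulting edge and expected return; the function names are hypothetical, and whether `identify_underdog_value` computes these quantities the same way internally is an assumption.

```python
def implied_probability(moneyline):
    """Break-even win probability for American odds (bookmaker margin included)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def profit_per_unit(moneyline):
    """Profit on a winning 1-unit stake."""
    return 100 / -moneyline if moneyline < 0 else moneyline / 100


def expected_roi_pct(win_prob, moneyline):
    """Expected profit per unit staked, as a percentage."""
    return 100 * (win_prob * profit_per_unit(moneyline) - (1 - win_prob))


away_prob = 1 - 0.60                     # model's underdog win probability
implied = implied_probability(275)       # 100 / 375
edge = away_prob - implied
roi = expected_roi_pct(away_prob, 275)   # 100 * (0.40 * 2.75 - 0.60)

print(f"Implied: {implied:.1%}")  # Implied: 26.7%
print(f"Edge: {edge:.1%}")        # Edge: 13.3%
print(f"ROI: {roi:.1f}%")         # ROI: 50.0%
```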

Kelly Criterion Bet Sizing

from underdog_value import get_betting_recommendation

recommendation = get_betting_recommendation(value_bet, bankroll=1000)
print(f"Recommended bet: ${recommendation['recommended_bet']:.2f}")
print(f"Kelly %: {recommendation['kelly_percentage']:.1f}%")
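For reference, the Kelly fraction for a bet with net odds b, win probability p, and loss probability q = 1 - p is f = (b·p - q) / b. The sketch below applies it to the Kansas vs. NC State example; the function names and the `fraction` parameter are hypothetical, and `get_betting_recommendation` may cap or scale the stake differently.

```python
def kelly_fraction(win_prob, moneyline):
    """Full-Kelly share of bankroll for American odds; 0 when there is no edge."""
    b = moneyline / 100 if moneyline > 0 else 100 / -moneyline  # net odds
    f = (b * win_prob - (1 - win_prob)) / b
    return max(f, 0.0)


def kelly_bet(bankroll, win_prob, moneyline, fraction=1.0):
    """Stake in dollars; fraction < 1 gives a more conservative fractional Kelly."""
    return bankroll * kelly_fraction(win_prob, moneyline) * fraction


bet = kelly_bet(1000, 0.40, 275)  # (2.75 * 0.40 - 0.60) / 2.75 of the bankroll
print(f"Full Kelly bet: ${bet:.2f}")  # Full Kelly bet: $181.82
```

Full Kelly is sensitive to errors in the estimated win probability, which is why a fractional multiplier is a common choice in practice.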

Examples

See examples/underdog_value_examples.py for detailed examples.

Running Predictions

Streamlit UI

streamlit run predictions.py

ESPN Data Integration

The system automatically fetches current season games from ESPN and enriches them with CBBD statistics for accurate predictions.

Machine Learning Models

Extended Feature Set (v2.0)

Models trained with 11 features (vs. original 3):

CBBD Features (3):

  • team_season_win_pct - Team's season win percentage
  • team_season_avg_mov - Average margin of victory
  • opponent_season_avg_mov - Opponent's average margin of victory

KenPom Features (6):

  • net_rtg - Net rating (offensive - defensive)
  • off_rtg - Offensive rating
  • def_rtg - Defensive rating
  • adj_tempo - Adjusted tempo
  • luck - Luck rating
  • sos - Strength of schedule

BartTorvik Features (2):

  • adj_oe - Adjusted offensive efficiency
  • adj_de - Adjusted defensive efficiency

Supported Models

  • XGBoost: Gradient boosting for complex patterns
  • Random Forest: Ensemble learning for robust predictions
  • Linear Regression: Spread and total predictions
  • Logistic Regression: Moneyline predictions

Performance Improvements (v2.0)

Significant accuracy gains with extended features:

Metric                Baseline (3 features)   Extended (11 features)   Improvement
Spread MAE (points)   12.8                    11.7                     8.4% better
Moneyline Accuracy    68.2%                   70.9%                    +2.7 pts
Total MAE (points)    14.2                    13.9                     1.7% better
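As a reminder of what the two metric types measure, here is a minimal sketch of mean absolute error for margins and of straight-up pick accuracy. The function names and the sample margins are made up for illustration and are not taken from the project's evaluation code or data.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values, in points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def winner_accuracy(actual_margins, predicted_margins):
    """Share of games where the predicted margin has the same sign as the result."""
    hits = sum((a > 0) == (p > 0) for a, p in zip(actual_margins, predicted_margins))
    return hits / len(actual_margins)


actual = [5, -3, 12, -2]     # home margin of victory, made-up games
predicted = [7, -1, 6, 4]

print(mean_absolute_error(actual, predicted))  # 4.0
print(winner_accuracy(actual, predicted))      # 0.75
```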

Training Data: 15,961 complete games (2016-2024 seasons)

Training Process

# Retrain with extended features
python retrain_with_extended_features.py

Training Details:

  • Cross-validation with 5 folds
  • Ensemble of XGBoost, Random Forest, Linear/Logistic Regression
  • Separate models for spread, total, and moneyline predictions
  • Feature scaling and preprocessing

Betting Models Framework

Implementation Overview

Complete implementation of docs/roadmap-betting-models.md with comprehensive betting prediction capabilities.

Model Types by Bet:

Bet Type     Model Type                    Target           Key Metric
Moneyline    Classification                Win (0/1)        Brier Score
Spread       Regression → Classification   Margin → Cover   ATS Accuracy
Over/Under   Regression → Classification   Total → Over     O/U Accuracy
Value        Probability Comparison        Edge             ROI
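The "Regression → Classification" rows mean that a regression output is compared against the posted line to produce a binary pick. A minimal sketch follows; the function names are hypothetical, and the sign conventions (margin as home minus away, spread quoted from the home side so a home favorite has a negative number) are assumptions that may differ from `predict_ats` and `predict_over_under`.

```python
def covers_spread(predicted_margin, home_spread):
    """True when the home team is predicted to cover.

    predicted_margin: predicted home points minus away points.
    home_spread: posted home line, e.g. -6.5 when the home team is favored.
    """
    return predicted_margin + home_spread > 0


def goes_over(predicted_total, total_line):
    """True when the predicted combined score exceeds the posted total."""
    return predicted_total > total_line


print(covers_spread(8.0, -6.5))   # True: predicted to win by more than 6.5
print(covers_spread(4.0, -6.5))   # False: predicted to win, but not by enough
print(goes_over(141.0, 145.5))    # False
```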

Core Functions

from betting_models import (
    train_win_probability_model,  # Calibrated classifier for win probs
    train_spread_model,           # XGBoost regression for margins
    train_total_model,            # XGBoost regression for totals
    predict_ats,                  # ATS outcome predictions
    predict_over_under,           # O/U outcome predictions
    evaluate_betting_roi,         # ROI calculation with American odds
    create_betting_ensemble       # Voting ensemble models
)

# Train models
moneyline_model = train_win_probability_model(X_train, y_win)
spread_model = train_spread_model(X_train, y_margin)
total_model = train_total_model(X_train, y_total)

# Evaluate betting performance
roi_results = evaluate_betting_roi(predictions, actuals, odds)
print(f"ROI: {roi_results['roi_pct']:.1f}%")
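To make the ROI figure concrete, the sketch below settles flat-stake bets at American odds and reports profit as a percentage of the amount wagered. It is an illustration under assumed inputs (a list of `(moneyline, won)` pairs), not the signature or internals of `evaluate_betting_roi`.

```python
def settle(stake, moneyline, won):
    """Profit (positive) or loss (negative) for one bet at American odds."""
    if not won:
        return -stake
    net_odds = moneyline / 100 if moneyline > 0 else 100 / -moneyline
    return stake * net_odds


def flat_bet_roi_pct(bets, stake=100.0):
    """ROI in percent for equal stakes; bets is a list of (moneyline, won) pairs."""
    bets = list(bets)
    if not bets:
        return 0.0
    profit = sum(settle(stake, moneyline, won) for moneyline, won in bets)
    return 100 * profit / (stake * len(bets))


# Made-up results: win at -110, loss at -110, win at +275
sample = [(-110, True), (-110, False), (275, True)]
print(f"ROI: {flat_bet_roi_pct(sample):.1f}%")  # ROI: 88.6%
```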

Advanced Features

  • Time-based Cross Validation: tournament_cv() for leave-one-tournament-out validation
  • Ensemble Methods: Voting classifiers/regressors combining multiple algorithms
  • Probability Calibration: Isotonic regression for well-calibrated win probabilities
  • Comprehensive Evaluation: Brier score, log loss, MAE, RMSE, and ROI metrics
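Leave-one-tournament-out validation holds out every game from one season at a time and trains on the rest. The generator below is a minimal sketch of that splitting scheme; its name and its input (one season label per game) are assumptions, and it is not the repository's `tournament_cv()`.

```python
def leave_one_tournament_out(seasons):
    """Yield (held_out_season, train_indices, test_indices) for each season."""
    for held_out in sorted(set(seasons)):
        train = [i for i, s in enumerate(seasons) if s != held_out]
        test = [i for i, s in enumerate(seasons) if s == held_out]
        yield held_out, train, test


seasons = [2022, 2022, 2023, 2024, 2024]  # season label for each game row
for held_out, train, test in leave_one_tournament_out(seasons):
    print(held_out, train, test)
# 2022 [2, 3, 4] [0, 1]
# 2023 [0, 1, 3, 4] [2]
# 2024 [0, 1, 2] [3, 4]
```

Splitting by season keeps games from the same tournament out of both the training and test sets. A stricter time-based variant would train only on seasons earlier than the held-out one.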

Testing & Validation

# Run comprehensive test suite
python test_betting_models.py

Test Coverage:

  • ✅ Model training (moneyline, spread, total)
  • ✅ Prediction functions (ATS, O/U)
  • ✅ Evaluation metrics (classification, regression, ROI)
  • ✅ Ensemble model creation
  • ✅ Cross validation
  • ✅ Dependency validation

Integration

The betting models framework plugs into the existing prediction system:

  • Feature Engineering: Compatible with feature_engineering.py output
  • Streamlit UI: Used in predictions.py for real-time betting predictions
  • Model Persistence: Models saved with joblib in data_files/models/
  • Evaluation: Performance tracking with comprehensive metrics

Data Sources

College Basketball Data API (CBBD)

  • Historical tournament games (2016-2025)
  • Betting lines and odds
  • Team season statistics
  • Efficiency ratings
  • Poll rankings

ESPN API

  • Current season game schedules
  • Real-time game information
  • Team name standardization

Project Structure

march-madness/
├── predictions.py                    # Streamlit UI and prediction logic
├── betting_models.py                 # Betting models framework implementation
├── test_betting_models.py            # Comprehensive test suite for betting models
├── betting_models_README.md          # Detailed betting models documentation
├── data_collection.py               # CBBD API integration and data collection
├── fetch_espn_cbb_scores.py         # ESPN API scraper
├── model_training.py                # Original model training (3 features)
├── retrain_with_extended_features.py # Extended model training (11 features)
├── enrich_training_data.py          # Add KenPom/BartTorvik features
├── download_kenpom.py               # KenPom data scraper
├── download_barttorvik.py           # BartTorvik data scraper
├── underdog_value.py                # Value bet detection and Kelly sizing
├── generate_predictions.py          # Generate predictions for upcoming games
├── display_predictions.py           # Display prediction results
├── data_tools/
│   ├── efficiency_loader.py         # Load and clean efficiency data
│   └── __init__.py
├── scripts/
│   ├── match_teams.py               # Team name canonicalization
│   ├── apply_mappings.py            # Apply manual team mappings
│   └── __init__.py
├── examples/                        # Usage examples
│   ├── data_collection_examples.py
│   └── underdog_value_examples.py
├── data_files/                      # Cached data and models
│   ├── cache/                      # API response cache
│   ├── models/                     # Trained ML models
│   ├── espn_cbb_current_season.csv # Current season games
│   ├── training_data_complete_features.csv # Training data (15,961 games)
│   ├── kenpom_canonical.csv        # Cleaned KenPom data (364 teams)
│   ├── barttorvik_canonical.csv    # Cleaned BartTorvik data (364 teams)
│   └── upcoming_game_predictions.json # AI predictions
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
└── copilot-instructions.md          # AI assistant guidelines

Roadmap Implementation

✅ Completed Features (v2.0)

  • Comprehensive data collection functions
  • Historical betting data (10 years)
  • ML model training pipeline
  • ESPN integration for current games
  • Streamlit prediction interface
  • Team name normalization
  • Data caching and compression
  • Underdog value bet detection
  • Kelly Criterion bet sizing
  • Expected value calculations
  • KenPom efficiency ratings integration (6 features)
  • BartTorvik advanced metrics integration (2 features)
  • Extended ML models with 11 features (8.4% spread improvement)
  • Automated team name canonicalization (99.7% coverage)
  • Enhanced data enrichment pipeline
  • Cross-validation training with performance metrics

🔄 Next Steps

  • Advanced betting features (road/neutral advantages)
  • Model evaluation and comparison
  • Real-time odds integration
  • Prediction confidence scoring
  • Live game tracking and in-game predictions

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

License

See LICENSE file for details.
