March Madness College Basketball Predictions
A March Madness betting prediction system that combines historical data, machine learning models, and real-time game schedules to produce betting predictions for tournament games. It also uses KenPom and BartTorvik efficiency ratings, which lowered spread MAE by 8.4% relative to the original 3-feature models.
- Comprehensive Data Collection: 10 years of historical tournament data (2016-2025)
- Advanced Efficiency Metrics: KenPom and BartTorvik ratings integration (364 teams)
- Enhanced ML Models: 11-feature models with 8.4% spread accuracy improvement
- Real-time Predictions: Live game schedules with betting predictions (includes live Vegas spreads/OUs via Odds API)
- Underdog Value Bets: Automatic detection of underdogs whose model win probability exceeds the probability implied by the betting odds
- Kelly Criterion Betting: Optimal bet sizing recommendations
- Multi-API Integration: CBBD + ESPN + KenPom + BartTorvik
- Streamlit UI: Interactive web interface for predictions
- Betting Models Framework: Complete implementation of roadmap-betting-models.md
- Spread MAE: 8.4% improvement (-0.97 points)
- Moneyline Accuracy: +2.7 percentage points
- Total MAE: 1.7% improvement
- Feature Expansion: 3 → 11 features (CBBD + KenPom + BartTorvik)
- KenPom: Advanced efficiency ratings (NetRtg, ORtg, DRtg, AdjT, Luck, SOS)
- BartTorvik: Adjusted offensive/defensive efficiency
- Team Canonicalization: 99.7% mapping coverage (364/365 teams)
- Automated Data Pipeline: Selenium scraping + data enrichment
- Extended Feature Set: 6 KenPom + 2 BartTorvik metrics
- Canonical Team Names: Standardized naming across all sources
- Data Validation: Complete feature coverage for 15,961 games
- Model Retraining: Optimized for extended feature space
- Betting Models: Complete implementation with evaluation and ensemble methods
This project uses:
- College Basketball Data API (CBBD) for game schedules and stats.
- The Odds API for live Vegas spreads / totals.
Environment variables:
- `CBBD_API_KEY` - College Basketball Data API key
- `ODDS_API_KEY` - Odds API key (used by `fetch_live_odds.py`)
- Python 3.8+
- CBBD API key (set as the `CBBD_API_KEY` environment variable)
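As a minimal sketch of how a script might read these keys at startup: `load_api_keys` is a hypothetical helper, not a function from this project, and treating `ODDS_API_KEY` as optional is an assumption based on it only being used for live odds.

```python
import os


def load_api_keys(env=None):
    """Read the API keys named in this README from the environment.

    Hypothetical helper for illustration; the project's scripts may load
    their keys differently.
    """
    env = os.environ if env is None else env
    cbbd_key = env.get("CBBD_API_KEY")
    if not cbbd_key:
        raise RuntimeError("CBBD_API_KEY is not set; export it before running")
    # Assumption: the Odds API key is only needed for live odds, so it may be absent.
    return {"CBBD_API_KEY": cbbd_key, "ODDS_API_KEY": env.get("ODDS_API_KEY")}
```

Failing early with a clear message tends to be easier to debug than an authentication error from the API later on.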
pip install -r requirements.txt
export CBBD_API_KEY="your_api_key_here"

Collect 10 years of betting data for model training:
python data_collection.py --collect

This collects:
- Tournament games (2016-2025)
- Betting lines and odds
- Team season statistics
- Efficiency ratings
- Poll rankings
Automated collection of advanced efficiency metrics:
# Download KenPom ratings
python download_kenpom.py
# Download BartTorvik ratings (auto-renames and cleans duplicates)
python download_barttorvik.py
# Create canonical datasets
python data_tools/efficiency_loader.py

Output files:
- `data_files/kenpom_ratings.csv` - Raw KenPom data (365 teams)
- `data_files/barttorvik_ratings.csv` - Raw BartTorvik data (365 teams, auto-renamed from download)
- `data_files/kenpom_canonical.csv` - Cleaned with canonical names (364 teams)
- `data_files/barttorvik_canonical.csv` - Cleaned with canonical names (364 teams)
Automatic mapping of team names across data sources:
# Match team names and create mappings
python scripts/match_teams.py
# Apply manual corrections
python scripts/apply_mappings.py

Output files:
- `data_files/kenpom_to_espn_matches.csv` - KenPom → canonical mappings
- `data_files/bart_to_espn_matches.csv` - BartTorvik → canonical mappings
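The general idea behind name matching can be sketched with the standard library. This is an illustration only: `normalize`, `match_team`, and the 0.85 cutoff are hypothetical and do not describe what `scripts/match_teams.py` actually does.

```python
import difflib
import re


def normalize(name):
    """Lowercase, spell out '&', and drop punctuation and extra spaces."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return " ".join(name.split())


def match_team(name, canonical_names, cutoff=0.85):
    """Return the closest canonical name, or None when nothing is close enough."""
    lookup = {normalize(c): c for c in canonical_names}
    key = normalize(name)
    if key in lookup:  # exact match after normalization
        return lookup[key]
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None
```

Names that return `None` are the ones a manual correction step would need to cover.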
Enhance training data with advanced metrics:
# Add KenPom/BartTorvik features to historical games
python enrich_training_data.py
# Retrain models with extended features
python retrain_with_extended_features.py

Output files:
- `data_files/training_data_enriched.csv` - Historical data with all features
- `data_files/training_data_complete_features.csv` - Complete cases only (15,961 games)
- `data_files/models/` - Retrained models with 11 features
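The difference between the enriched file and the complete-cases file can be illustrated with a small join. This is a sketch under assumptions: the `team` key and the dictionary layout are hypothetical, and `enrich_training_data.py` may work differently.

```python
KENPOM_FIELDS = ("net_rtg", "off_rtg", "def_rtg", "adj_tempo", "luck", "sos")


def enrich_games(games, ratings_by_team, fields=KENPOM_FIELDS):
    """Attach rating fields to each game row.

    Returns (enriched, complete): every row with ratings attached where
    available, and the subset of rows that have a value for every field.
    """
    enriched, complete = [], []
    for game in games:
        rating = ratings_by_team.get(game["team"], {})
        row = dict(game)  # copy so the input rows are left untouched
        for field in fields:
            row[field] = rating.get(field)  # None marks a missing value
        enriched.append(row)
        if all(row[field] is not None for field in fields):
            complete.append(row)
    return enriched, complete
```

Keeping both outputs makes it possible to see how many games are dropped for lack of ratings.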
from data_collection import fetch_all_tournament_games

# Get games for specific years
games = fetch_all_tournament_games(2023, 2024)
print(f"Found {len(games)} tournament games")

from data_collection import fetch_historical_lines

# Get betting lines for tournaments
lines = fetch_historical_lines(2023, 2024)
print(f"Found {len(lines)} betting lines")

from data_collection import fetch_rankings

# Get poll rankings
rankings = fetch_rankings(2024)
print(f"Found {len(rankings)} ranking entries")

from data_collection import fetch_team_season_data, fetch_efficiency_ratings

# Get team stats and efficiency ratings
team_stats = fetch_team_season_data(2024)
efficiency = fetch_efficiency_ratings(2024)

See `examples/data_collection_examples.py` for complete usage examples.
The system automatically fetches KenPom and BartTorvik efficiency ratings daily at 2 AM ET, but only during basketball season and only when games are scheduled.
When it runs:
- Basketball season: October 15 - April 15
- Games scheduled: College basketball games are scheduled for today or tomorrow
- Time: Daily at 2 AM Eastern Time (with random 0-60 minute delay)
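The two conditions above can be sketched as a single check. The function names are hypothetical, and treating October 15 and April 15 as inclusive is an assumption; `check_efficiency_update_needed.py` may implement this differently.

```python
from datetime import date, timedelta


def in_basketball_season(day):
    """True from October 15 through April 15 (endpoints assumed inclusive)."""
    month_day = (day.month, day.day)
    return month_day >= (10, 15) or month_day <= (4, 15)


def update_needed(today, game_dates):
    """True when it is basketball season and a game falls on today or tomorrow."""
    if not in_basketball_season(today):
        return False
    window = {today, today + timedelta(days=1)}
    return any(d in window for d in game_dates)


# Example: a game tomorrow during the season triggers an update.
print(update_needed(date(2025, 3, 20), [date(2025, 3, 21)]))
```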
Anti-detection measures:
- Randomized timing to avoid consistent patterns
- Headless Chrome with anti-bot countermeasures
- Custom user agents and browser fingerprinting evasion
- Random delays between site requests
- Only runs when actually needed (season + games scheduled)
What gets updated:
- `data_files/kenpom_ratings.csv` - Raw KenPom data
- `data_files/barttorvik_ratings.csv` - Raw BartTorvik data
- `data_files/kenpom_canonical.csv` - Cleaned KenPom with canonical names
- `data_files/barttorvik_canonical.csv` - Cleaned BartTorvik with canonical names
Manual trigger: You can also manually run the efficiency data update:
python check_efficiency_update_needed.py && python download_kenpom.py && python download_barttorvik.py && python data_tools/efficiency_loader.py

GitHub Actions Workflow:
The automated workflow (.github/workflows/update-efficiency-ratings.yml):
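- Runs daily at 7 AM UTC with randomization (2 AM Eastern Standard Time; 3 AM Eastern during daylight saving time, since GitHub Actions schedules are in UTC)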
- Only executes during basketball season with scheduled games
- Includes anti-detection measures to avoid blocking
- Commits changes back to the repository
- Only commits if efficiency data has actually changed
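One way to detect whether the data actually changed is to compare file hashes before and after the download. This is a hypothetical sketch; the workflow itself may rely on git to detect changes instead.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if it does not exist."""
    p = Path(path)
    return hashlib.sha256(p.read_bytes()).hexdigest() if p.exists() else None


def snapshot(paths):
    """Record the current digest of each path (call this before downloading)."""
    return {path: file_digest(path) for path in paths}


def changed_files(before, paths):
    """List the paths whose contents differ from the earlier snapshot."""
    return [path for path in paths if file_digest(path) != before.get(path)]
```

A commit step would then run only when `changed_files` returns a non-empty list.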
The system automatically identifies underdog betting opportunities where the model predicts a higher win probability than the betting odds suggest.
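The comparison rests on converting American odds into an implied probability. The sketch below shows the standard conversion and a simple expected-value calculation; the function names are illustrative, and how `underdog_value.py` defines edge and ROI internally is not shown here.

```python
def implied_probability(moneyline):
    """Break-even win probability for American odds (bookmaker margin included)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def expected_value_per_100(win_prob, moneyline):
    """Expected profit on a 100-unit stake, given a model win probability."""
    profit_if_win = moneyline if moneyline > 0 else 100 * 100 / -moneyline
    return win_prob * profit_if_win - (1 - win_prob) * 100


# A +275 underdog implies about a 27% chance; a 40% model estimate is a 13-point edge.
implied = implied_probability(275)
edge = 0.40 - implied
print(f"implied={implied:.3f} edge={edge:.3f} ev={expected_value_per_100(0.40, 275):.1f}")
```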
from underdog_value import identify_underdog_value

# Example: Model gives underdog 40% chance, but odds imply 27%
game = {
    'home_team': 'Kansas',
    'away_team': 'NC State',
    'home_moneyline': -350,
    'away_moneyline': +275
}
home_win_prob = 0.60  # Model prediction
value_bet = identify_underdog_value(game, home_win_prob, min_ev_threshold=5.0)
if value_bet:
    print(f"Value bet: {value_bet['team']}")
    print(f"Edge: {value_bet['edge']:.1%}")
    print(f"ROI: {value_bet['roi']:.1f}%")

from underdog_value import get_betting_recommendation

recommendation = get_betting_recommendation(value_bet, bankroll=1000)
print(f"Recommended bet: ${recommendation['recommended_bet']:.2f}")
print(f"Kelly %: {recommendation['kelly_percentage']:.1f}%")

See `examples/underdog_value_examples.py` for detailed examples.
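For reference, the Kelly criterion sizes a bet as f = (b·p − q) / b, where b is the net payout per unit staked, p the win probability, and q = 1 − p. The sketch below is a generic implementation; the `fraction` parameter is a hypothetical knob for fractional Kelly, and whether `get_betting_recommendation` scales its output this way is an assumption.

```python
def kelly_fraction(win_prob, moneyline):
    """Full-Kelly share of bankroll for a bet at American odds (never negative)."""
    b = moneyline / 100 if moneyline > 0 else 100 / -moneyline  # net payout per unit
    f = (b * win_prob - (1 - win_prob)) / b
    return max(f, 0.0)  # no bet when the expected value is negative


def recommended_bet(bankroll, win_prob, moneyline, fraction=1.0):
    """Stake in currency units; fraction < 1 gives a more conservative size."""
    return bankroll * fraction * kelly_fraction(win_prob, moneyline)


# 40% model probability at +275 gives a full-Kelly stake of about 18% of bankroll.
print(f"{kelly_fraction(0.40, 275):.3f}")
```

Full Kelly is sensitive to errors in the estimated probability, which is why many bettors stake only a fraction of it.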
streamlit run predictions.py

The system automatically fetches current season games from ESPN and enriches them with CBBD statistics before generating predictions.
Models trained with 11 features (vs. original 3):
CBBD Features (3):
- `team_season_win_pct` - Team's season win percentage
- `team_season_avg_mov` - Average margin of victory
- `opponent_season_avg_mov` - Opponent's average margin of victory
KenPom Features (6):
- `net_rtg` - Net rating (offensive - defensive)
- `off_rtg` - Offensive rating
- `def_rtg` - Defensive rating
- `adj_tempo` - Adjusted tempo
- `luck` - Luck rating
- `sos` - Strength of schedule
BartTorvik Features (2):
- `adj_oe` - Adjusted offensive efficiency
- `adj_de` - Adjusted defensive efficiency
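A fixed feature order matters because a trained model expects its inputs in the same columns it was trained on. The sketch below assembles one row from the 11 names listed above; the ordering and the `feature_row` helper are assumptions for illustration, not the project's code.

```python
FEATURES = [
    # CBBD
    "team_season_win_pct", "team_season_avg_mov", "opponent_season_avg_mov",
    # KenPom
    "net_rtg", "off_rtg", "def_rtg", "adj_tempo", "luck", "sos",
    # BartTorvik
    "adj_oe", "adj_de",
]


def feature_row(values):
    """Return the 11 feature values as floats in a fixed order.

    Raises ValueError when any feature is missing, mirroring the
    complete-cases requirement for training data.
    """
    missing = [name for name in FEATURES if values.get(name) is None]
    if missing:
        raise ValueError(f"missing features: {', '.join(missing)}")
    return [float(values[name]) for name in FEATURES]
```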
- XGBoost: Gradient boosting for complex patterns
- Random Forest: Ensemble learning for robust predictions
- Linear Regression: Spread and total predictions
- Logistic Regression: Moneyline predictions
Significant accuracy gains with extended features:
| Metric | Baseline (3 features) | Extended (11 features) | Improvement |
|---|---|---|---|
| Spread MAE | 12.8 | 11.7 | 8.4% better |
| Moneyline Accuracy | 68.2% | 70.9% | +2.7 pts |
| Total MAE | 14.2 | 13.9 | 1.7% better |
Training Data: 15,961 complete games (2016-2024 seasons)
# Retrain with extended features
python retrain_with_extended_features.py

Training Details:
- Cross-validation with 5 folds
- Ensemble of XGBoost, Random Forest, Linear/Logistic Regression
- Separate models for spread, total, and moneyline predictions
- Feature scaling and preprocessing
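To make the 5-fold setup concrete, here is a dependency-free sketch of how fold indices can be generated. It is an illustration only; the training script's own splitter (for example, one from a machine learning library) is not shown in this README.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Each sample appears in exactly one test fold.
for train_idx, test_idx in k_fold_indices(12, k=5):
    print(len(train_idx), len(test_idx))
```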
Complete implementation of docs/roadmap-betting-models.md with comprehensive betting prediction capabilities.
Model Types by Bet:
| Bet Type | Model Type | Target | Key Metric |
|---|---|---|---|
| Moneyline | Classification | Win (0/1) | Brier Score |
| Spread | Regression → Classification | Margin → Cover | ATS Accuracy |
| Over/Under | Regression → Classification | Total → Over | O/U Accuracy |
| Value | Probability Comparison | Edge | ROI |
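The "Regression → Classification" step in the table can be sketched as a comparison between a predicted number and the posted line. The sign convention is an assumption: `home_spread` is the home team's line (negative when the home team is favored) and `predicted_margin` is home score minus away score. These helpers are illustrative and are not the project's `predict_ats` or `predict_over_under`.

```python
def predict_cover(predicted_margin, home_spread):
    """Pick the side expected to cover: 'home', 'away', or 'push'."""
    adjusted = predicted_margin + home_spread
    if adjusted > 0:
        return "home"
    if adjusted < 0:
        return "away"
    return "push"


def predict_total_side(predicted_total, total_line):
    """Pick 'over', 'under', or 'push' from a predicted combined score."""
    if predicted_total > total_line:
        return "over"
    if predicted_total < total_line:
        return "under"
    return "push"


# Home favored by 6.5 and predicted to win by 8: the home side covers.
print(predict_cover(8.0, -6.5), predict_total_side(150.2, 145.5))
```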
from betting_models import (
    train_win_probability_model,  # Calibrated classifier for win probs
    train_spread_model,           # XGBoost regression for margins
    train_total_model,            # XGBoost regression for totals
    predict_ats,                  # ATS outcome predictions
    predict_over_under,           # O/U outcome predictions
    evaluate_betting_roi,         # ROI calculation with American odds
    create_betting_ensemble       # Voting ensemble models
)

# Train models
moneyline_model = train_win_probability_model(X_train, y_win)
spread_model = train_spread_model(X_train, y_margin)
total_model = train_total_model(X_train, y_total)

# Evaluate betting performance
roi_results = evaluate_betting_roi(predictions, actuals, odds)
print(f"ROI: {roi_results['roi_pct']:.1f}%")

- Time-based Cross Validation: `tournament_cv()` for leave-one-tournament-out validation
- Ensemble Methods: Voting classifiers/regressors combining multiple algorithms
- Probability Calibration: Isotonic regression for well-calibrated win probabilities
- Comprehensive Evaluation: Brier score, log loss, MAE, RMSE, and ROI metrics
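The validation and evaluation ideas in this list can be sketched without any dependencies. All three helpers are hypothetical illustrations: they are not `tournament_cv()` or `evaluate_betting_roi()`, whose signatures and internals may differ.

```python
from collections import defaultdict


def leave_one_tournament_out(seasons):
    """Yield (held_out, train_idx, test_idx), holding out one season at a time."""
    by_season = defaultdict(list)
    for idx, season in enumerate(seasons):
        by_season[season].append(idx)
    for held_out in sorted(by_season):
        train_idx = sorted(
            i for season, rows in by_season.items() if season != held_out for i in rows
        )
        yield held_out, train_idx, by_season[held_out]


def brier_score(probs, outcomes):
    """Mean squared error between predicted win probabilities and 0/1 results."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def flat_stake_roi(bets, stake=100.0):
    """ROI in percent for (won, moneyline) pairs, betting the same stake each time."""
    profit, count = 0.0, 0
    for won, moneyline in bets:
        count += 1
        if won:
            profit += stake * (moneyline / 100 if moneyline > 0 else 100 / -moneyline)
        else:
            profit -= stake
    return 100.0 * profit / (stake * count) if count else 0.0
```

Splitting 1-1 at -110 odds loses about 4.5% of the amount staked, which is the bookmaker margin a model has to overcome.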
# Run comprehensive test suite
python test_betting_models.py

Test Coverage:
- ✅ Model training (moneyline, spread, total)
- ✅ Prediction functions (ATS, O/U)
- ✅ Evaluation metrics (classification, regression, ROI)
- ✅ Ensemble model creation
- ✅ Cross validation
- ✅ Dependency validation
The betting models framework integrates seamlessly with the existing prediction system:
- Feature Engineering: Compatible with `feature_engineering.py` output
- Streamlit UI: Used in `predictions.py` for real-time betting predictions
- Model Persistence: Models saved with joblib in `data_files/models/`
- Evaluation: Performance tracking with comprehensive metrics
- Historical tournament games (2016-2025)
- Betting lines and odds
- Team season statistics
- Efficiency ratings
- Poll rankings
- Current season game schedules
- Real-time game information
- Team name standardization
march-madness/
├── predictions.py # Streamlit UI and prediction logic
├── betting_models.py # Betting models framework implementation
├── test_betting_models.py # Comprehensive test suite for betting models
├── betting_models_README.md # Detailed betting models documentation
├── data_collection.py # CBBD API integration and data collection
├── fetch_espn_cbb_scores.py # ESPN API scraper
├── model_training.py # Original model training (3 features)
├── retrain_with_extended_features.py # Extended model training (11 features)
├── enrich_training_data.py # Add KenPom/BartTorvik features
├── download_kenpom.py # KenPom data scraper
├── download_barttorvik.py # BartTorvik data scraper
├── underdog_value.py # Value bet detection and Kelly sizing
├── generate_predictions.py # Generate predictions for upcoming games
├── display_predictions.py # Display prediction results
├── data_tools/
│ ├── efficiency_loader.py # Load and clean efficiency data
│ └── __init__.py
├── scripts/
│ ├── match_teams.py # Team name canonicalization
│ ├── apply_mappings.py # Apply manual team mappings
│ └── __init__.py
├── examples/ # Usage examples
│ ├── data_collection_examples.py
│ └── underdog_value_examples.py
├── data_files/ # Cached data and models
│ ├── cache/ # API response cache
│ ├── models/ # Trained ML models
│ ├── espn_cbb_current_season.csv # Current season games
│ ├── training_data_complete_features.csv # Training data (15,961 games)
│ ├── kenpom_canonical.csv # Cleaned KenPom data (364 teams)
│ ├── barttorvik_canonical.csv # Cleaned BartTorvik data (364 teams)
│ └── upcoming_game_predictions.json # AI predictions
├── requirements.txt # Python dependencies
├── README.md # This file
└── copilot-instructions.md # AI assistant guidelines
- Comprehensive data collection functions
- Historical betting data (10 years)
- ML model training pipeline
- ESPN integration for current games
- Streamlit prediction interface
- Team name normalization
- Data caching and compression
- Underdog value bet detection
- Kelly Criterion bet sizing
- Expected value calculations
- KenPom efficiency ratings integration (6 features)
- BartTorvik advanced metrics integration (2 features)
- Extended ML models with 11 features (8.4% spread improvement)
- Automated team name canonicalization (99.7% coverage)
- Enhanced data enrichment pipeline
- Cross-validation training with performance metrics
- Advanced betting features (road/neutral advantages)
- Model evaluation and comparison
- Real-time odds integration
- Prediction confidence scoring
- Live game tracking and in-game predictions
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
See LICENSE file for details.