march-madness

March Madness College Basketball Predictions

Overview

A March Madness betting prediction system that combines historical data, machine learning models, and real-time game schedules to produce betting predictions for tournament games. Version 2.0 adds KenPom and BartTorvik efficiency ratings, which lowered spread and total error and raised moneyline accuracy relative to the original 3-feature models.

Features

Comprehensive Data Collection: 10 years of historical tournament data (2016-2025)
Advanced Efficiency Metrics: KenPom and BartTorvik ratings integration (364 teams)
Enhanced ML Models: 11-feature models with an 8.4% lower spread MAE than the 3-feature baseline
Real-time Predictions: Live game schedules with betting predictions (includes live Vegas spreads/OUs via Odds API)
Underdog Value Bets: Automatic detection of underdogs whose model win probability exceeds the probability implied by the betting odds
Kelly Criterion Betting: Optimal bet sizing recommendations
Multi-API Integration: CBBD + ESPN + KenPom + BartTorvik
Streamlit UI: Interactive web interface for predictions
Betting Models Framework: Complete implementation of roadmap-betting-models.md

Recent Updates (v2.0)

🎯 Model Performance Improvements

Spread Accuracy: 8.4% improvement (-0.97 points MAE)
Moneyline Accuracy: +2.7 percentage points
Total Accuracy: 1.7% improvement
Feature Expansion: 3 → 11 features (CBBD + KenPom + BartTorvik)

🔧 New Data Sources

KenPom: Advanced efficiency ratings (NetRtg, ORtg, DRtg, AdjT, Luck, SOS)
BartTorvik: Adjusted offensive/defensive efficiency
Team Canonicalization: 99.7% mapping coverage (364/365 teams)
Automated Data Pipeline: Selenium scraping + data enrichment

📊 Enhanced Features

Extended Feature Set: 6 KenPom + 2 BartTorvik metrics
Canonical Team Names: Standardized naming across all sources
Data Validation: Complete feature coverage for 15,961 games
Model Retraining: Optimized for extended feature space
Betting Models: Complete implementation with evaluation and ensemble methods

API Information

This project uses:

College Basketball Data API (CBBD) for game schedules and stats.
The Odds API for live Vegas spreads / totals.

Environment variables

CBBD_API_KEY - College Basketball Data API key
ODDS_API_KEY - Odds API key (used by fetch_live_odds.py)

Setup

Prerequisites

Python 3.8+
CBBD API key (set as CBBD_API_KEY environment variable)
Odds API key (set as ODDS_API_KEY environment variable; used by fetch_live_odds.py)

Installation

pip install -r requirements.txt

Environment Setup

export CBBD_API_KEY="your_api_key_here"
export ODDS_API_KEY="your_odds_api_key_here"  # used by fetch_live_odds.py
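
A script can check for its key up front instead of failing later with an opaque API error. The following is a minimal sketch; `require_env` is a hypothetical helper for illustration and is not part of this repository:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of this repository.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

A caller would then use `cbbd_key = require_env("CBBD_API_KEY")` before making any requests.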

Data Collection

Comprehensive Historical Data

Collect 10 years of betting data for model training:

python data_collection.py --collect

This collects:

Tournament games (2016-2025)
Betting lines and odds
Team season statistics
Efficiency ratings
Poll rankings

KenPom & BartTorvik Data

Automated collection of advanced efficiency metrics:

# Download KenPom ratings
python download_kenpom.py

# Download BartTorvik ratings (auto-renames and cleans duplicates)
python download_barttorvik.py

# Create canonical datasets
python data_tools/efficiency_loader.py

Output files:

data_files/kenpom_ratings.csv - Raw KenPom data (365 teams)
data_files/barttorvik_ratings.csv - Raw BartTorvik data (365 teams, auto-renamed from download)
data_files/kenpom_canonical.csv - Cleaned with canonical names (364 teams)
data_files/barttorvik_canonical.csv - Cleaned with canonical names (364 teams)

Team Name Canonicalization

Automatic mapping of team names across data sources:

# Match team names and create mappings
python scripts/match_teams.py

# Apply manual corrections
python scripts/apply_mappings.py

Output files:

data_files/kenpom_to_espn_matches.csv - KenPom → canonical mappings
data_files/bart_to_espn_matches.csv - BartTorvik → canonical mappings
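
The general idea behind this kind of mapping can be sketched as follows. This is an illustration only, not the logic in scripts/match_teams.py: the function names, the normalization rules, and the 0.85 similarity cutoff are all assumptions.

```python
import difflib
import re


def normalize_name(name: str) -> str:
    """Lowercase, expand a trailing 'St.' to 'state', and strip punctuation."""
    name = name.lower().strip()
    name = re.sub(r"\bst\.?$", "state", name)  # "Iowa St." -> "iowa state"
    name = re.sub(r"[^a-z0-9 ]", "", name)     # "n.c. state" -> "nc state"
    return re.sub(r"\s+", " ", name)


def match_teams(source_names, canonical_names, manual=None, cutoff=0.85):
    """Map each source name to a canonical name, or None if unmatched.

    Order of precedence: manual corrections, exact match after
    normalization, then the closest fuzzy match above `cutoff`.
    """
    manual = manual or {}
    by_norm = {normalize_name(c): c for c in canonical_names}
    mapping = {}
    for src in source_names:
        if src in manual:
            mapping[src] = manual[src]
            continue
        norm = normalize_name(src)
        if norm in by_norm:
            mapping[src] = by_norm[norm]
            continue
        close = difflib.get_close_matches(norm, list(by_norm), n=1, cutoff=cutoff)
        mapping[src] = by_norm[close[0]] if close else None
    return mapping
```

Names that come back as `None` are the ones that need a manual correction, which is the role the manual mapping step plays in the pipeline above.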

Data Enrichment

Enhance training data with advanced metrics:

# Add KenPom/BartTorvik features to historical games
python enrich_training_data.py

# Retrain models with extended features
python retrain_with_extended_features.py

Output files:

data_files/training_data_enriched.csv - Historical data with all features
data_files/training_data_complete_features.csv - Complete cases only (15,961 games)
data_files/models/ - Retrained models with 11 features
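
"Complete cases only" means dropping any game that is missing one of the 11 features. A minimal sketch of that filter is shown below. It assumes the CSV columns carry the same names as the features listed under Extended Feature Set, which may not match the actual file, and `is_complete` / `complete_cases` are hypothetical names.

```python
import math

# Assumption: column names match the feature names documented in this README.
FEATURES = [
    "team_season_win_pct", "team_season_avg_mov", "opponent_season_avg_mov",
    "net_rtg", "off_rtg", "def_rtg", "adj_tempo", "luck", "sos",
    "adj_oe", "adj_de",
]


def is_complete(row, features=FEATURES):
    """True if every feature is present and parses to a non-NaN number."""
    for name in features:
        try:
            value = float(row[name])
        except (KeyError, TypeError, ValueError):
            return False  # missing column, None, or non-numeric text
        if math.isnan(value):
            return False
    return True


def complete_cases(rows, features=FEATURES):
    """Keep only the rows (dicts) that have all features populated."""
    return [row for row in rows if is_complete(row, features)]
```

Rows read with `csv.DictReader` can be passed straight to `complete_cases`, since empty strings are treated as missing.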

Individual Data Collection Functions

Tournament Games

from data_collection import fetch_all_tournament_games

# Get games for specific years
games = fetch_all_tournament_games(2023, 2024)
print(f"Found {len(games)} tournament games")

Betting Lines

from data_collection import fetch_historical_lines

# Get betting lines for tournaments
lines = fetch_historical_lines(2023, 2024)
print(f"Found {len(lines)} betting lines")

Rankings

from data_collection import fetch_rankings

# Get poll rankings
rankings = fetch_rankings(2024)
print(f"Found {len(rankings)} ranking entries")

Team Statistics

from data_collection import fetch_team_season_data, fetch_efficiency_ratings

# Get team stats and efficiency ratings
team_stats = fetch_team_season_data(2024)
efficiency = fetch_efficiency_ratings(2024)

Examples

See examples/data_collection_examples.py for complete usage examples.

Automated Data Collection

Automated Efficiency Data Updates

The system automatically fetches KenPom and BartTorvik efficiency ratings daily at 2 AM ET, but only during basketball season and only when games are scheduled.

When it runs:

Basketball season: October 15 - April 15
Games scheduled: College basketball games are scheduled for today or tomorrow
Time: Daily at 2 AM Eastern Time (with random 0-60 minute delay)
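
The season and schedule conditions above can be sketched as a small gate function. This is not the code in check_efficiency_update_needed.py; the function names are hypothetical and the inclusive treatment of October 15 and April 15 is an assumption.

```python
from datetime import date, timedelta


def in_basketball_season(day: date) -> bool:
    """True from October 15 through April 15, a window that spans New Year.

    Assumption: both endpoints are inclusive.
    """
    month_day = (day.month, day.day)
    return month_day >= (10, 15) or month_day <= (4, 15)


def update_needed(day: date, game_dates) -> bool:
    """True only in season and when a game is scheduled today or tomorrow."""
    if not in_basketball_season(day):
        return False
    tomorrow = day + timedelta(days=1)
    return day in game_dates or tomorrow in game_dates
```

Because the season wraps around the calendar year, the check is an `or` of two month/day comparisons rather than a single range test.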

Anti-detection measures:

Randomized timing to avoid consistent patterns
Headless Chrome with anti-bot countermeasures
Custom user agents and browser fingerprinting evasion
Random delays between site requests
Only runs when actually needed (season + games scheduled)

What gets updated:

data_files/kenpom_ratings.csv - Raw KenPom data
data_files/barttorvik_ratings.csv - Raw BartTorvik data
data_files/kenpom_canonical.csv - Cleaned KenPom with canonical names
data_files/barttorvik_canonical.csv - Cleaned BartTorvik with canonical names

Manual trigger: You can also manually run the efficiency data update:

python check_efficiency_update_needed.py && python download_kenpom.py && python download_barttorvik.py && python data_tools/efficiency_loader.py

GitHub Actions Workflow: The automated workflow (.github/workflows/update-efficiency-ratings.yml):

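Runs daily at 07:00 UTC with randomization (2 AM Eastern Standard Time; GitHub Actions cron schedules are in UTC, so this is 3 AM Eastern while daylight saving time is in effect)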
Only executes during basketball season with scheduled games
Includes anti-detection measures to avoid blocking
Commits changes back to the repository, and only when the efficiency data has actually changed

Underdog Value Bets

Finding Profitable Opportunities

The system automatically identifies underdog betting opportunities where the model predicts a higher win probability than the betting odds suggest.

from underdog_value import identify_underdog_value

# Example: Model gives underdog 40% chance, but odds imply 27%
game = {
    'home_team': 'Kansas',
    'away_team': 'NC State',
    'home_moneyline': -350,
    'away_moneyline': +275
}

home_win_prob = 0.60  # Model prediction
value_bet = identify_underdog_value(game, home_win_prob, min_ev_threshold=5.0)

if value_bet:
    print(f"Value bet: {value_bet['team']}")
    print(f"Edge: {value_bet['edge']:.1%}")
    print(f"ROI: {value_bet['roi']:.1f}%")
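
The arithmetic behind a value bet is the standard conversion from American odds to implied probability, followed by an expected-value calculation. The sketch below shows that arithmetic only; the function names are hypothetical, and the exact definitions of `edge` and `roi` used inside underdog_value.py are not shown in this README.

```python
def implied_probability(moneyline: int) -> float:
    """Break-even win probability implied by American odds (vig included)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def profit_per_unit(moneyline: int) -> float:
    """Net profit on a winning 1-unit stake at American odds."""
    if moneyline < 0:
        return 100 / -moneyline
    return moneyline / 100


def expected_value_pct(win_prob: float, moneyline: int) -> float:
    """Expected return per unit staked, as a percentage of the stake."""
    profit = profit_per_unit(moneyline)
    return 100 * (win_prob * profit - (1 - win_prob))


def edge(win_prob: float, moneyline: int) -> float:
    """Model probability minus odds-implied probability."""
    return win_prob - implied_probability(moneyline)
```

For the Kansas / NC State example, +275 implies a win probability of about 26.7%, so a model probability of 40% is an edge of about 13.3 percentage points and an expected return of 50% of the stake. The two implied probabilities (-350 and +275) sum to more than 100%; the excess is the bookmaker's margin.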

Kelly Criterion Bet Sizing

from underdog_value import get_betting_recommendation

recommendation = get_betting_recommendation(value_bet, bankroll=1000)
print(f"Recommended bet: ${recommendation['recommended_bet']:.2f}")
print(f"Kelly %: {recommendation['kelly_percentage']:.1f}%")
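
For reference, the Kelly criterion sizes a bet as f = (b*p - q) / b, where p is the win probability, q = 1 - p, and b is the net profit per unit staked. A minimal sketch follows. The function names and the `multiplier` parameter for fractional Kelly are assumptions; this README does not state whether get_betting_recommendation uses full or fractional Kelly.

```python
def kelly_fraction(win_prob: float, moneyline: int) -> float:
    """Fraction of bankroll to stake under the Kelly criterion.

    Returns 0.0 when the bet has no positive expected value.
    """
    b = moneyline / 100 if moneyline > 0 else 100 / -moneyline
    fraction = (b * win_prob - (1 - win_prob)) / b
    return max(fraction, 0.0)


def kelly_bet(bankroll: float, win_prob: float, moneyline: int,
              multiplier: float = 1.0) -> float:
    """Stake in currency units; multiplier < 1 gives fractional Kelly."""
    return bankroll * kelly_fraction(win_prob, moneyline) * multiplier
```

With a 40% win probability at +275, full Kelly is about 18.2% of the bankroll, or roughly $181.82 on a $1,000 bankroll. Many bettors scale this down (for example `multiplier=0.5`) because the formula is sensitive to errors in the estimated probability.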

Examples

See examples/underdog_value_examples.py for detailed examples.

Running Predictions

Streamlit UI

streamlit run predictions.py

ESPN Data Integration

The system automatically fetches current season games from ESPN and enriches them with CBBD statistics for accurate predictions.

Machine Learning Models

Extended Feature Set (v2.0)

Models trained with 11 features (vs. original 3):

CBBD Features (3):

team_season_win_pct - Team's season win percentage
team_season_avg_mov - Average margin of victory
opponent_season_avg_mov - Opponent's average margin of victory

KenPom Features (6):

net_rtg - Net rating (offensive - defensive)
off_rtg - Offensive rating
def_rtg - Defensive rating
adj_tempo - Adjusted tempo
luck - Luck rating
sos - Strength of schedule

BartTorvik Features (2):

adj_oe - Adjusted offensive efficiency
adj_de - Adjusted defensive efficiency

Supported Models

XGBoost: Gradient boosting for complex patterns
Random Forest: Ensemble learning for robust predictions
Linear Regression: Spread and total predictions
Logistic Regression: Moneyline predictions

Performance Improvements (v2.0)

Accuracy gains from the extended feature set:

Metric	Baseline (3 features)	Extended (11 features)	Improvement
Spread MAE	12.8	11.7	8.4% better
Moneyline Accuracy	68.2%	70.9%	+2.7 pts
Total MAE	14.2	13.9	1.7% better

Training Data: 15,961 complete games (2016-2024 seasons)
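
The table mixes two kinds of improvement: MAE rows report a relative reduction in error (lower is better), while the moneyline row reports a difference in percentage points. The sketch below shows how such figures are typically computed, using toy inputs rather than project data; the function names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def accuracy(actual, predicted):
    """Share of predictions that match the actual outcome."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement_pct(baseline_error, new_error):
    """Percentage reduction in an error metric relative to the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy data only: four game margins and a model's predictions for them.
toy_mae = mean_absolute_error([5, -3, 12, 1], [7, -1, 9, 4])
```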

Training Process

# Retrain with extended features
python retrain_with_extended_features.py

Training Details:

Cross-validation with 5 folds
Ensemble of XGBoost, Random Forest, Linear/Logistic Regression
Separate models for spread, total, and moneyline predictions
Feature scaling and preprocessing

Betting Models Framework

Implementation Overview

Implements docs/roadmap-betting-models.md, covering moneyline, spread, over/under, and value bets.

Model Types by Bet:

Bet Type	Model Type	Target	Key Metric
Moneyline	Classification	Win (0/1)	Brier Score
Spread	Regression → Classification	Margin → Cover	ATS Accuracy
Over/Under	Regression → Classification	Total → Over	O/U Accuracy
Value	Probability Comparison	Edge	ROI
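
The "Regression → Classification" rows mean that a predicted number (margin or total) is compared against the posted line to produce a pick. A minimal sketch of that step is below. It is not the repository's predict_ats or predict_over_under; the function names are hypothetical, and it assumes the spread is quoted from the home team's side, with a negative number meaning the home team is favored.

```python
def ats_pick(predicted_home_margin: float, home_spread: float) -> str:
    """Pick a side against the spread.

    Assumption: home_spread is the home team's line (e.g. -6.5 when the
    home team is a 6.5-point favorite). The home team covers when its
    margin plus the spread is positive.
    """
    cover_margin = predicted_home_margin + home_spread
    if cover_margin > 0:
        return "home"
    if cover_margin < 0:
        return "away"
    return "push"


def over_under_pick(predicted_total: float, line: float) -> str:
    """Pick over or under by comparing the predicted total to the line."""
    if predicted_total > line:
        return "over"
    if predicted_total < line:
        return "under"
    return "push"
```

For example, a predicted home margin of 8 against a home spread of -6.5 yields a home pick, while a predicted margin of 3 against the same spread yields an away pick.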

Core Functions

from betting_models import (
    train_win_probability_model,  # Calibrated classifier for win probs
    train_spread_model,           # XGBoost regression for margins
    train_total_model,            # XGBoost regression for totals
    predict_ats,                  # ATS outcome predictions
    predict_over_under,           # O/U outcome predictions
    evaluate_betting_roi,         # ROI calculation with American odds
    create_betting_ensemble       # Voting ensemble models
)

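# Train models (X_train is the feature matrix; y_win, y_margin, and y_total
# are the win, margin, and total targets, all prepared beforehand)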
moneyline_model = train_win_probability_model(X_train, y_win)
spread_model = train_spread_model(X_train, y_margin)
total_model = train_total_model(X_train, y_total)

# Evaluate betting performance
roi_results = evaluate_betting_roi(predictions, actuals, odds)
print(f"ROI: {roi_results['roi_pct']:.1f}%")
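
As a point of reference for what an ROI figure at American odds means, the sketch below settles a list of flat-stake bets and divides total profit by total amount staked. It is an illustration, not the implementation of evaluate_betting_roi; the function names and the `(moneyline, won)` input shape are assumptions.

```python
def settle(stake: float, moneyline: int, won: bool) -> float:
    """Profit (positive) or loss (negative) for one bet at American odds."""
    if not won:
        return -stake
    if moneyline > 0:
        return stake * moneyline / 100
    return stake * 100 / -moneyline


def flat_stake_roi(bets, stake: float = 100.0) -> float:
    """ROI in percent for an iterable of (moneyline, won) pairs."""
    bets = list(bets)
    if not bets:
        return 0.0
    profit = sum(settle(stake, moneyline, won) for moneyline, won in bets)
    return 100 * profit / (stake * len(bets))
```

At the common -110 price a bettor must win 110/210, about 52.4% of bets, just to break even, which is why ROI is a stricter measure than raw pick accuracy.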

Advanced Features

Time-based Cross Validation: tournament_cv() for leave-one-tournament-out validation
Ensemble Methods: Voting classifiers/regressors combining multiple algorithms
Probability Calibration: Isotonic regression for well-calibrated win probabilities
Comprehensive Evaluation: Brier score, log loss, MAE, RMSE, and ROI metrics
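
Leave-one-tournament-out validation holds out every game from one season's tournament and trains on the rest. The sketch below shows one way to generate such splits, along with a forward-chaining variant that trains only on earlier seasons. Both function names are hypothetical; the signature and behavior of the repository's tournament_cv() are not shown in this README.

```python
def leave_one_tournament_out(seasons):
    """Yield (held_out_season, train_indices, test_indices) per season.

    `seasons` holds the season label of each game, in row order.
    """
    for held_out in sorted(set(seasons)):
        train = [i for i, s in enumerate(seasons) if s != held_out]
        test = [i for i, s in enumerate(seasons) if s == held_out]
        yield held_out, train, test


def forward_chaining(seasons):
    """Like the above, but train only on seasons before the held-out one.

    The earliest season is skipped because it has no prior training data.
    """
    for held_out in sorted(set(seasons))[1:]:
        train = [i for i, s in enumerate(seasons) if s < held_out]
        test = [i for i, s in enumerate(seasons) if s == held_out]
        yield held_out, train, test
```

Splitting by season rather than by random row keeps games from the same tournament out of both the training and test sets at once. The forward-chaining variant additionally avoids training on seasons that come after the one being predicted.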

Testing & Validation

# Run comprehensive test suite
python test_betting_models.py

Test Coverage:

✅ Model training (moneyline, spread, total)
✅ Prediction functions (ATS, O/U)
✅ Evaluation metrics (classification, regression, ROI)
✅ Ensemble model creation
✅ Cross validation
✅ Dependency validation

Integration

The betting models framework connects to the existing prediction system at these points:

Feature Engineering: Compatible with feature_engineering.py output
Streamlit UI: Used in predictions.py for real-time betting predictions
Model Persistence: Models saved with joblib in data_files/models/
Evaluation: Performance tracking with comprehensive metrics

Data Sources

College Basketball Data API (CBBD)

Historical tournament games (2016-2025)
Betting lines and odds
Team season statistics
Efficiency ratings
Poll rankings

ESPN API

Current season game schedules
Real-time game information
Team name standardization

Project Structure

march-madness/
├── predictions.py                    # Streamlit UI and prediction logic
├── betting_models.py                 # Betting models framework implementation
├── test_betting_models.py            # Comprehensive test suite for betting models
├── betting_models_README.md          # Detailed betting models documentation
├── data_collection.py               # CBBD API integration and data collection
├── fetch_espn_cbb_scores.py         # ESPN API scraper
├── model_training.py                # Original model training (3 features)
├── retrain_with_extended_features.py # Extended model training (11 features)
├── enrich_training_data.py          # Add KenPom/BartTorvik features
├── download_kenpom.py               # KenPom data scraper
├── download_barttorvik.py           # BartTorvik data scraper
├── underdog_value.py                # Value bet detection and Kelly sizing
├── generate_predictions.py          # Generate predictions for upcoming games
├── display_predictions.py           # Display prediction results
├── data_tools/
│   ├── efficiency_loader.py         # Load and clean efficiency data
│   └── __init__.py
├── scripts/
│   ├── match_teams.py               # Team name canonicalization
│   ├── apply_mappings.py            # Apply manual team mappings
│   └── __init__.py
├── examples/                        # Usage examples
│   ├── data_collection_examples.py
│   └── underdog_value_examples.py
├── data_files/                      # Cached data and models
│   ├── cache/                      # API response cache
│   ├── models/                     # Trained ML models
│   ├── espn_cbb_current_season.csv # Current season games
│   ├── training_data_complete_features.csv # Training data (15,961 games)
│   ├── kenpom_canonical.csv        # Cleaned KenPom data (364 teams)
│   ├── barttorvik_canonical.csv    # Cleaned BartTorvik data (364 teams)
│   └── upcoming_game_predictions.json # AI predictions
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
└── copilot-instructions.md          # AI assistant guidelines

Roadmap Implementation

✅ Completed Features (v2.0)

🔄 Next Steps

Advanced betting features (road/neutral advantages)
Model evaluation and comparison
Real-time odds integration
Prediction confidence scoring
Live game tracking and in-game predictions

Contributing

Fork the repository
Create a feature branch
Make your changes
Test thoroughly
Submit a pull request

License

See LICENSE file for details.
