Skip to content

gmalbert/nfl-predictions

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

278 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🏈 NFL Betting Analytics & Predictions Dashboard

NFL Predictions Logo

📍 View the Product Roadmap

A sophisticated NFL analytics platform that uses advanced machine learning to identify profitable betting opportunities. This system analyzes historical NFL data with 50+ enhanced features across multiple time scales to predict game outcomes and provides actionable betting insights with proven 60.9% ROI performance on data-leakage-free models.

Now featuring a multi-page Streamlit app with dedicated Historical Data page for advanced filtering and analysis of 196k+ play-by-play records.

🆕 PLAYER PROPS SYSTEM: Individual player performance predictions (Passing Yards, Rushing Yards, Receiving Yards, TDs) for DraftKings Pick 6 and similar prop betting markets. See Player Props Roadmap for details.

📊 DraftKings Pick 6 Calculator (Interactive)

A new interactive DraftKings Pick 6 line comparison tool is available in the Player Props page. Enter a player, choose a stat category (e.g., Passing Yards), and paste the DraftKings Pick 6 over/under line (e.g., 242.5). The tool will show:

  • A data-driven recommendation (OVER / UNDER)
  • Confidence and tier (🔥 ELITE, 💪 STRONG, ✅ GOOD, ⚠️ LEAN)
  • Prediction source: either the trained XGBoost model ("🤖 ML Model") or a historical hit-rate fallback ("📊 Historical")
  • Recent-game averages (L3, L5, L10), season average, and a recent game log showing OVER/UNDER hits

Notes for developers:

  • The feature uses the project models in player_props/models and the prediction pipeline in player_props/predict.py.
  • Models are loaded with caching to keep UI responsive; if a model for the selected stat/tier is missing, the code falls back to a simple Laplace-smoothed historical hit rate.
  • For full production predictions, regenerate player props with python player_props/predict.py.

📋 Table of Contents

🎯 Key Features

📊 Interactive Dashboard

  • 🔥 Next 10 Underdog Bets: Actionable moneyline betting opportunities with complete payout calculations
  • 🎯 Next 10 Spread Bets: NEW high-performance spread betting section with 91.9% historical win rate
  • Confidence Tiers: Visual indicators (🔥 Elite, ⭐ Strong, 📈 Good) for bet prioritization
  • Live Game Predictions: Real-time probability calculations for upcoming NFL games
  • Enhanced Betting Signals: Dual-strategy recommendations with proper spread/moneyline logic
  • Performance Tracking: Historical betting performance with ROI calculations and survivorship bias awareness
  • Edge Analysis: Compare model predictions against sportsbook odds to find value bets
  • Monte Carlo Feature Selection: Interactive experimentation with feature combinations for model optimization
  • Enhanced Reliability: Robust error handling and graceful fallbacks for uninterrupted analysis

💰 Proven Betting Strategy

  • Triple Strategy Success: Moneyline (65.4% ROI) + Spread (75.5% ROI) + Over/Under betting
  • Elite Spread Performance: 91.9% win rate on high-confidence spread bets (≥54% threshold)
  • Moneyline Strategy: 59.5% win rate on underdog picks with 28% F1-optimized threshold
  • Over/Under Model: NEW totals betting with F1-optimized thresholds and value edge calculations
  • Professional-Grade Validation: Data leakage eliminated, realistic performance metrics
  • Selective Betting: High-confidence filtering for maximum profitability

🤖 Advanced Machine Learning

  • Three Specialized Models: Separate XGBoost models for spread, moneyline, and over/under predictions
  • F1-Score Optimization: All models use F1-score maximization to find optimal betting thresholds
  • Optimized XGBoost Models with production-ready hyperparameters and probability calibration
  • Enhanced Monte Carlo Feature Selection testing 200 iterations with 15-feature subsets
  • Data Leakage Prevention: Strict temporal boundaries ensuring only pre-game information
  • Class Balancing with computed scale weights for imbalanced datasets
  • Multi-Target Prediction: Spread (56.3%), moneyline (64.2%), totals (56.2%) accuracy

⬆️ Back to Top

📈 Model Performance (Data Leakage Free)

Prediction Type Cross-Val Accuracy Betting Performance ROI Key Insight
Spread 58.9% 91.9% win rate (≥54% threshold) 75.5% Elite performance through selective betting
Moneyline 64.2% 59.5% win rate (≥28% threshold) 65.4% Strong underdog value identification
Over/Under 56.2% F1-optimized thresholds TBD NEW - Totals betting with value edge analysis

Major Performance Breakthrough (November 2025)

  • Spread Model Fixed: Corrected inverted predictions from 3.6% to 91.9% win rate
  • Dual Strategy Success: Both spread and moneyline betting now highly profitable
  • Data Leakage Free: Strict temporal boundaries ensure production reliability
  • Survivorship Bias Awareness: 91.9% rate is on selective 33% of games, not overall performance
  • Confidence Calibration: Model knows when it's likely to be right vs wrong

📊 Data Sources

Primary Data: NFLverse

  • Source: NFLverse Project - The most comprehensive NFL dataset available
  • Coverage: Play-by-play data from 2020-2024 seasons
  • Data Quality: Professional-grade data used by NFL analysts and researchers
  • Update Frequency: Updated weekly during NFL season

Betting Lines Data: ESPN

  • Source: ESPN NFL scores and betting data
  • Includes: Point spreads, moneylines, over/under totals, and odds
  • Coverage: Historical betting lines for model training and backtesting

Team Statistics & Features

  • All-Time Rolling Statistics: Win percentages, point differentials, scoring averages
  • Current Season Performance: Weekly updated team form and scoring trends
  • Historical Season Context: Prior season records for baseline team quality
  • Head-to-Head Matchups: Team-specific historical performance data
  • Situational Data: Home/away performance, division games, weather conditions
  • Advanced Metrics: Blowout rates, close game performance, coaching records

⬆️ Back to Top

🎮 How to Use

1. Running the System

# Build historical data and train models (single-step):
python build_and_train_pipeline.py

# Or run the steps separately:
# Train the models (build features and train)
python nfl-gather-data.py

# Launch the dashboard
streamlit run predictions.py

Developer scripts and checks

  • Place any new helper or diagnostic Python scripts in the scripts/ folder. Examples: scripts/check_moneyline_calibration.py, scripts/analyze_underdog_impact.py.
  • Scripts should be import-safe (no heavy data loads at module import time), include a short header comment describing purpose, and provide a if __name__ == '__main__': entrypoint so they can be run from CI or manually.

Automated Data Updates (GitHub Actions)

The repository includes a GitHub Actions workflow (.github/workflows/nightly-update.yml) that automatically updates predictions during football season:

  • Schedule: Runs nightly at 3:00 AM UTC (Sept 1 - Feb 15)
  • Season Detection: Automatically skips runs outside football season (March-August)
  • Update Process:
    1. Fetches latest ESPN scores and betting lines
    2. Smart update of NFLverse play-by-play data (only when new games detected)
    3. Runs the complete prediction pipeline (build_and_train_pipeline.py)
    4. Uploads updated predictions as artifacts
    5. Optionally commits changes back to the repository

Smart PBP Updates: The system uses intelligent detection to only download play-by-play data when new games are likely available, avoiding unnecessary bandwidth usage on non-game days.

Manual Trigger: You can manually run the workflow from the Actions tab in GitHub.

Data Sources Update Cadence:

  • ESPN API (scores/odds): Real-time, polled nightly at 3 AM
  • NFLverse PBP (play-by-play): Smart updates nightly - only downloads when new games detected
  • Predictions CSV: Regenerated after each pipeline run (~5 minutes)

On-UI Pipeline Run

  • The dashboard now includes a "🔄 Generate Predictions" button in the "Upcoming Games Schedule" expander when the app detects scheduled games without model outputs. Clicking this button runs python build_and_train_pipeline.py locally (shows a spinner and displays success/error output) and refreshes the page when finished. This is intended as a convenient local fallback to the nightly GitHub Actions workflow.

Local Alternative: For local development, run the update manually:

# Fetch latest ESPN data
python fetch_espn_weekly_scores.py

# Rebuild predictions
python build_and_train_pipeline.py

Environment variables & .env (local development)

  • The app reads environment variables for optional features such as emailing and RSS feed URL. For local development you can create a .env file in the project root and the app will load it automatically.
  • python-dotenv is recommended and included in requirements.txt. If python-dotenv is not installed, the app falls back to a minimal .env parser that will still read basic KEY=VALUE lines.

Example: copy .env.example to .env and update values:

EMAIL_FROM=you@example.com
EMAIL_TO=recipient@example.com
EMAIL_PASSWORD=your_app_password_here
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
ALERTS_SITE_URL=http://localhost:8501/
  • PowerShell example to set environment variables for a single session (optional):
$env:EMAIL_FROM = 'you@example.com'
$env:EMAIL_TO = 'recipient@example.com'
$env:EMAIL_PASSWORD = 'your_app_password_here'
  • Security note: Do NOT commit real secrets. Add .env to your .gitignore (the repo currently provides .env.example as a template).

  • Deployment tip: For Streamlit Cloud or other hosting, prefer the platform's secrets manager (e.g., st.secrets on Streamlit Cloud) rather than a .env file.

2. Dashboard Sections

The dashboard uses modern multi-page navigation with a main predictions page and dedicated historical data page. Each section displays data with professional formatting including percentages, proper date formats, and descriptive column labels.

🏈 Main Page: Predictions

The primary predictions interface with tab-based navigation for betting analysis.

📊 Tab: Model Predictions

  • Model Predictions vs Actual Results: Historical game outcomes with checkbox indicators
    • Formatted columns: Game Date (MM/DD/YYYY), Team names, Scores, Spread/O/U lines
    • Checkbox columns show predicted vs actual spread coverage and totals
    • Displays 50 most recent completed games with proper date formatting

🎯 Tab: Probabilities & Edges

  • Upcoming Game Probabilities: Shows betting opportunities with model confidence
    • Percentage Display: All probabilities shown as percentages (e.g., "45.6%" instead of "0.456")
    • Spread Probabilities: Model confidence underdog will cover spread
    • Moneyline Probabilities: Model confidence underdog wins outright
    • Over/Under Probabilities: Model confidence for totals betting
    • Edge Calculations: Value identification (model % - implied %)
    • Compact Labels: "Spread Prob", "ML Prob", "Over Edge" with helpful tooltips

💰 Tab: Betting Performance

  • Key Metrics Table: Clean, organized display of model performance
    • Spread, Moneyline, and Totals accuracy (formatted to 3 decimals)
    • Mean Absolute Error for each model
    • Optimal thresholds (28% for ML, 54% for spread)
    • Compact 600px width for easy scanning
  • Performance Tracking: Historical betting performance with ROI calculations
  • Survivorship Bias Awareness: Transparent about selective betting strategy

🔥 Tab: Underdog Bets

  • Next 10 Underdog Betting Opportunities: Moneyline strategy recommendations
  • Chronological order: Upcoming games where model has ≥28% confidence
  • Complete betting info: Favored team, underdog, spread, model confidence, expected payout
  • Real payout calculations: Exact profit amounts for $100 bets using live moneyline odds
  • Example: "Vikings (H) +180 ($180 profit on $100)" when Chargers are favored

📈 Tab: Spread BetsELITE PERFORMANCE

  • Next 15 Spread Betting Opportunities: High-confidence spread recommendations
  • Confidence Tiers: 🔥 Elite (75%+), ⭐ Strong (65-74%), 📈 Good (54-64%)
  • Historical Performance: 91.9% win rate on high-confidence bets, 75.5% ROI
  • Smart Sorting: Ordered by confidence level for optimal bet selection
  • Spread Explanation: Team can lose game but still "cover" (e.g., lose by less than spread)

🎯 Tab: Over/Under BetsNEW

  • Totals Betting Opportunities: Model predictions for over/under betting
  • Confidence Tiers: Elite (≥65%), Strong (60-65%), Good (55-60%), Standard (<55%)
  • Value Edge Calculation: Expected profit percentage based on model probability vs odds
  • Smart Bet Selection: Recommendations sorted by value edge for optimal selection
  • Complete Payout Info: Expected returns on $100 bets for both over and under options

📋 Tab: Betting Log

  • Automated Bet Tracking: Logs all betting recommendations with timestamps
  • Results Integration: Automatically updates with game outcomes
  • Performance Analysis: Win/loss tracking for accountability

📊 Historical Data PageNEW MULTI-PAGE APP

  • Dedicated Page: Separate navigation page for historical data analysis
  • 196k+ Play Records: Complete play-by-play data from 2020-2024 seasons
  • Single Authoritative Table: The page now displays one filter-driven table (the top snapshot table was removed). Use the sidebar filters to refine results; the table always applies a calendar-date guard (only games on or before today) and displays results sorted by game_date descending.
  • Quick Presets: One-click filter shortcuts at the top of the sidebar:
    • Red Zone: Automatically sets yardline filter to 0-20 yards from opponent endzone
    • 3rd & Short: Sets down filter to 3rd down and yards-to-go to 0-3 yards
    • Pass Attempts Only: Checks the pass-only filter for pass play analysis
  • Advanced Filtering: 12+ filter controls including:
    • Team filters (offense/defense)
    • Game context (down, quarter, play type)
    • Field position sliders (yards to go, yardline)
    • Score filters (differential, team scores)
    • Advanced metrics (EPA, win probability)
  • Reset Functionality: One-click filter reset with session state management
  • Paginated Display: 50-500 rows per page with navigation controls
  • Rich Column Configuration: Formatted dates, percentages, checkboxes for play outcomes
  • Back to Predictions: Easy navigation button to return to main page

⚙️ Additional Features

  • Feature Importances: Top model features with mean/std importance (3 decimals)
  • Monte Carlo Results: Feature selection testing with formatted metrics
  • Filtered Historical Data: Play-by-play data with percentage formatting for win probabilities
  • Error Metrics: Model accuracy and MAE displayed with consistent 3-decimal formatting

💡 Betting Strategy Explained

The Core Insight

Traditional betting advice says "always bet favorites" (70% win rate), but this system finds value in selective underdog betting:

  1. Model identifies underdogs with higher win probability than betting odds suggest
  2. 24% threshold optimization maximizes F1-score for better long-term profitability
  3. Risk management through selective betting (only ~72% of games get betting signals)

Example Betting Signal

Game: Chiefs @ Raiders
Model Probability (Underdog Win): 30%
Sportsbook Implied Probability: 22%
Betting Signal: 1 (BET ON RAIDERS)
Threshold: ≥28% (F1-optimized, leak-free)
Expected Value: Positive due to 8% edge
ROI Expectation: 60.9% based on historical performance

🔬 Technical Features

Model Architecture

  • Optimized XGBoost Classifiers with production-tuned hyperparameters:
    • n_estimators=300, learning_rate=0.05, max_depth=6
    • L1/L2 regularization (reg_alpha=0.1, reg_lambda=1.0)
    • Subsampling (subsample=0.8, colsample_bytree=0.8)
  • Calibrated Probability Outputs using sigmoid/isotonic scaling
  • Enhanced Feature Engineering: 50+ leak-free temporal features
  • Time-Series Aware Cross-Validation preventing future data leakage

🆕 Enhanced Predictive Features (Latest Update)

Current Season Performance Tracking

  • Real-time Team Form: homeTeamCurrentSeasonWinPct, awayTeamCurrentSeasonWinPct
  • Scoring Trends: homeTeamCurrentSeasonAvgScore, awayTeamCurrentSeasonAvgScore
  • Defensive Performance: homeTeamCurrentSeasonAvgScoreAllowed, awayTeamCurrentSeasonAvgScoreAllowed

Historical Season Context

  • Prior Season Records: homeTeamPriorSeasonRecord, awayTeamPriorSeasonRecord
  • Annual Stability Metrics: Complete previous season win percentages for baseline team quality

Head-to-Head Matchup History

  • Team-Specific Advantages: headToHeadHomeTeamWinPct
  • Historical Matchup Performance: How teams have performed against each other over time

Technical Implementation

  • Temporal Integrity: All features maintain strict data leakage prevention
  • Multi-Scale Analysis: Current season trends + historical season records + specific matchup history
  • Automated Integration: Features automatically included in Monte Carlo feature selection

Data Pipeline

  • Automated Feature Creation: Rolling averages, win percentages, trends with strict temporal controls
  • Data Leakage Prevention: All statistics use only prior games (season < current OR week < current)
  • Enhanced Validation: Real-time feature availability and XGBoost compatibility checks
  • Quality Assurance: Removes games with missing betting lines, validates data integrity
  • Threshold Optimization: F1-score maximization finds optimal decision boundary (28%)
  • Robust Error Handling: Graceful fallbacks for missing data and feature inconsistencies
  • Production Ready: No future information leakage, realistic performance expectations

⬆️ Back to Top

📁 Project Structure

nfl-predictions/
├── nfl-gather-data.py      # Main model training script
├── predictions.py          # Streamlit dashboard
├── data_files/            # Data storage directory

🔧 Technical Architecture

Machine Learning Pipeline

  • Primary Algorithm: XGBoost with production-optimized parameters
    • Training Parameters: 300 estimators, 0.05 learning rate, max depth 6
    • Regularization: L1=1, L2=1 for overfitting prevention
    • Monte Carlo Parameters: 100 estimators, 0.1 learning rate for feature selection
  • Model Calibration: CalibratedClassifierCV with sigmoid method for accurate probabilities
  • Feature Selection: Monte Carlo optimization (200 iterations, 15-feature subsets)
  • Cross-Validation: Time-series aware splits preventing data leakage
  • Class Handling: Weighted models addressing favorite/underdog imbalance (~70/30 split)

Data Engineering

  • Primary Source: NFLverse (nfl_data_py) - official NFL play-by-play data
  • Betting Data: ESPN odds and lines (2020-2024 seasons)
  • Feature Types: Rolling statistics, head-to-head records, seasonal performance
  • Data Integrity: Strict temporal boundaries prevent future data leakage
  • Data Quality: ~4,000+ games with complete betting line information
  • Storage: Git LFS integration for large datasets (>50MB)

Betting Strategy Architecture

  • Threshold Optimization: F1-score maximization finds 28% probability threshold (not 50%)
  • Selective Betting: ~72% of games generate betting signals (selective strategy)
  • Edge Calculation: model_prob - implied_odds_prob for value identification
  • ROI Focus: Optimizes for profit margin, not raw accuracy percentage

⬆️ Back to Top

📁 Recent Updates

🔧 Dec 13, 2025 — Critical Model Fix & New Features

  • Critical Model Fix: Resolved inverted spread predictions caused by a mislabeled target in the training pipeline; applied the correction prob_underdogCovered = 1 - prob_underdogCovered immediately after model prediction in the data pipeline.
  • Impact: Model betting ROI improved dramatically (from -90% → +60%), 62 of 63 remaining games flagged as profitable, maximum model confidence increased to 89.5%, and calibration error improved from 45% → 28%.
  • New Features (18 total): Added momentum (8), rest-advantage (5), and weather-impact (3) features. All features were engineered to avoid data leakage — see NEW_FEATURES_DEC13.md for details.
  • UI & Workflow Improvements: Added an EV explanation expander, changed spread-bet sorting to date-ascending, fixed unicode/icon issues, and improved PDF/CSV export UX.
  • Documentation: Full technical analysis and remediation plan available in MODEL_FIX_PLAN.md. The copilot instructions and README have been updated to reflect the fix and next steps.
  • Next Steps: Optional hyperparameter tuning and ensemble approaches are documented for incremental gains; momentum features will strengthen as the 2025 season progresses.

🔔 In-App Notification System (Latest)

  • Problem Solved: Users miss high-value betting opportunities when they open the app.
  • Solution: Automatic, actionable in-app notifications that highlight Elite and Strong opportunities:
    • Elite notifications (🔥): bets with ≥65% model confidence
    • Strong notifications (⭐): bets with 60–65% model confidence
    • Notifications are shown as toasts and are deduplicated using st.session_state so the same game doesn't re-notify in the same session.
    • Each alert links to a per-alert page (query param ?alert=<guid>) that shows friendly bet details, logos, bet type, gameday/time, and a small recommendation table.
    • The app also persists a detected public base URL (when available) into data_files/app_config.json so external tools (and the RSS generator) can create working per-alert links.
  • Impact: Improves engagement by surfacing high-value bets and providing one-click access to full alert details.
  • Technical: Uses st.toast() for toasts, session-state deduplication, per-alert pages rendered from data_files/betting_recommendations_log.csv, and persisted base-URL for external link generation.

🔔 RSS Feed & Rebuild Button

  • Purpose: Provide external notifications via RSS with per-alert links pointing back to the app.
  • Implementation:
    • scripts/generate_rss.py builds data_files/alerts_feed.xml. It prefers the persisted app_base_url in data_files/app_config.json (if present), otherwise falls back to the ALERTS_SITE_URL environment variable or http://localhost:8501/.
    • The running app exposes a sidebar "🔁 Rebuild RSS" button that runs the generator in-place and reports success/errors.
  • Files:
    • Persisted config: data_files/app_config.json (contains {"app_base_url": "https://..."} when detected)
    • RSS output: data_files/alerts_feed.xml

🎨 UI Layout Optimization (Latest)

  • Problem Solved: Excessive vertical space at the top of the main page reduced available screen real estate
  • Solutions Implemented:
    • Compact Header Layout: Logo and title now arranged in columns [1, 4] instead of stacked vertically
    • Logo Size Reduction: Logo reduced from 250px to 150px width for better proportions
    • Logo Positioning: Logo positioned at top-left of its column with minimal spacing
    • Loading Progress Optimization: Reduced verbose loading messages and progress indicators
    • Debug Message Cleanup: Removed unnecessary debug print statements from UI
  • Impact: Significantly more content visible above the fold, improved user experience
  • Technical: Uses st.columns() for responsive layout and optimized spacing

🧠 Memory Optimization & Streamlit Cloud Deployment (Latest)

  • Problem Solved: App exceeded Streamlit Cloud resource limits due to high memory usage (1.5GB+)
  • Solutions Implemented:
    • Data Type Optimization: float32 for numeric columns (50% memory reduction), Int8 for boolean columns
    • DataFrame Views: Replaced .copy() operations with views to eliminate memory duplication
    • Lazy Loading: All data loading uses @st.cache_data decorators with lazy initialization
    • Pagination: Added pagination for large datasets (>10k rows) with user warnings
    • Spinner Configuration: Suppress cache messages on technical pages, show on main UI
    • Variable Cleanup: Added del statements to clean up progress bars and temporary variables
  • Impact: Reduced memory usage while maintaining full functionality, enabled Streamlit Cloud deployment
  • Technical: Memory-efficient patterns prevent resource limit violations

⚡ Performance Improvements (Latest)

  • Startup Time: Reduced cold-start time by avoiding heavy import-time loads and using lazy, chunked CSV scanning; typical cold-starts are dramatically faster for end users.
  • Memory Footprint: Numeric dtype and view optimizations (e.g., float32 and Int8) cut memory usage substantially, enabling deployment on Streamlit Cloud where memory was previously constrained.
  • Rendering Responsiveness: Pagination, reduced preview sizes, and DataFrame view usage reduce UI thread lag when interacting with large tables and filters.
  • Deterministic Exports: On-demand PDF/CSV generation prevents pre-building large assets at startup, keeping initial memory and CPU usage low.
  • Validation & Monitoring: Added a lightweight smoke_test.py and CI-friendly lazy-loading checks to detect import-time data loading and preserve startup performance.

⚙️ Cache Management UI (Latest)

  • Problem Solved: Users couldn't manually refresh cached data or check cache status
  • Solution: Added Settings panel in sidebar with:
    • "🔄 Refresh Data" button to clear all cached data and reload
    • Helpful tooltips explaining functionality
  • Impact: Gives users control over data freshness
  • Technical: Uses st.cache_data.clear() for cache management

💰 Bankroll Management Tool (Latest)

  • Problem Solved: Users didn't know how much to bet on each recommendation for optimal risk management
  • Solution: Added dedicated "💰 Bankroll Management" tab with:
    • Bankroll input field with configurable amounts ($100-$1M+)
    • Risk tolerance selector (Conservative 1%, Moderate 2%, Aggressive 3%, Very Aggressive 5%)
    • Elite bet identification (≥65% confidence threshold)
    • Position sizing calculator showing recommended bet amounts
    • Expected payout calculations for each bet
    • Expected value analysis (positive EV bets only)
    • Bankroll impact tracking (total exposure percentage)
  • Impact: Enables responsible betting with Kelly Criterion-inspired position sizing
  • Technical: Filters predictions for elite opportunities and calculates optimal bet sizes

📈 Model Performance Dashboard (Latest)

  • Problem Solved: Users couldn't see if betting recommendations were actually accurate
  • Solution: Added dedicated "📈 Model Performance" tab with:
    • Overall metrics: Total bets, win rate, ROI, units won
    • Performance breakdown by confidence tier (Elite/Strong/Good/Standard)
    • Weekly performance tracking with line charts
    • Best performing bet types analysis
  • Impact: Builds trust and transparency in model performance
  • Technical: Reads from betting_recommendations_log.csv and calculates ROI metrics

⚡ Loading Progress Indicators (Latest)

  • Problem Solved: Users experienced 5-10 second load times with no feedback on progress
  • Solution: Added detailed progress bar showing specific loading steps:
    • 25%: "Loading historical games..." - loads game-level data with predictions
    • 50%: "Loading model predictions..." - loads betting predictions CSV
    • 75%: "Loading play-by-play data..." - loads historical play-by-play records
    • 100%: "Ready!" - shows completion with brief pause
  • Impact: Significantly improved perceived performance and user experience
  • Technical: Uses Streamlit progress bar with text updates and automatic cleanup

⚡ Performance Optimizations for Historical Data (Latest)

  • Memory-Efficient Data Types: Reduced memory usage by ~50% using float32 and Int8 instead of float64/int64
  • Eliminated Data Copying: Removed unnecessary .copy() operations on 196k row dataset
  • DataFrame Views: Filter operations now use views instead of copies for instant performance
  • Streamlit Cloud Config: Added .streamlit/config.toml with increased message size limits (500MB) and optimizations
  • Smart Warnings: Alert users when displaying large result sets to encourage filtering
  • Result: Historical Data page now loads significantly faster on Streamlit Cloud, avoiding timeout issues

🎯 NEW: Over/Under Betting Model

  • Three-Model System: Added dedicated over/under (totals) betting predictions alongside spread and moneyline
  • F1-Score Optimization: Optimal threshold calculation for over/under predictions using F1-score maximization
  • Value Edge Analysis: Calculates expected profit percentage based on model probability vs betting odds
  • Confidence Tiers: Elite/Strong/Good/Standard classification for bet prioritization
  • Complete Payout Calculations: Shows expected returns for both over and under bets on each game
  • Integrated Dashboard: New "🎯 Over/Under Bets" tab with top 15 opportunities sorted by value edge

📊 NEW: Multi-Page Streamlit App

  • Dedicated Historical Data Page: Separate navigation page for 196k+ play-by-play records with advanced filtering
  • Clean Interface: Main predictions page focuses on betting analysis, historical data on separate page
  • Easy Navigation: "🏈 Back to Predictions" button for quick return to main page
  • 12+ Filter Controls: Team filters, game context, field position, scores, and advanced metrics
  • Session State Reset: Reliable filter reset functionality using flag pattern
  • Pagination: Display 50-500 rows per page with navigation controls

🔧 Bug Fixes & System Improvements (Latest)

  • Fixed Over/Under Column Names: Corrected KeyError for pred_totalsProbprob_overHit
  • Added Moneyline Return Calculation: Implemented missing moneyline_bet_return column computation
  • Fixed Indentation Issues: Resolved Python syntax errors in complex nested tab structures
  • Improved Error Handling: Better validation for column existence before accessing dataframe columns
  • Type Safety: Enhanced Pylance compatibility with proper variable extraction patterns

🔥 CRITICAL: Data Leakage Elimination (October 2025)

  • Issue Discovered: Historical statistics were using ALL-TIME data (including future games during training)
  • Impact: Models appeared to have 70%+ accuracy but would fail in production
  • Solution: Implemented strict temporal boundaries - only prior games used for all statistics
  • Result: More realistic 56-64% accuracy BUT 60.9% ROI (up from 27.8%) due to higher quality signals
  • Production Ready: Models now perform consistently with live data

⚡ Optimal XGBoost Parameters Implementation (October 2025)

  • Upgraded: From basic eval_metric='logloss' to production-tuned hyperparameters
  • Parameters: 300 estimators, 0.05 learning rate, depth 6, with L1/L2 regularization
  • Benefits: Better generalization, reduced overfitting, more stable predictions
  • Monte Carlo: Separate lighter parameters for faster feature selection (100 estimators, depth 4)

🎯 Enhanced Monte Carlo Feature Selection (October 2025)

  • Upgraded: From 8-feature subsets to 15-feature subsets for better coverage
  • Iterations: Increased from 100 to 200 iterations for more thorough optimization
  • Results: Found optimal 7-8 feature models through comprehensive search space

🔥 Streamlit Dashboard Enhancements (October 2025)

  • Next 10 Underdog Bets: New prominent section showing actionable betting opportunities
    • Chronological order of next 10 recommended underdog bets
    • Complete betting info: favored team, underdog, spread, model confidence
    • Real payout calculations with exact profit amounts for $100 bets
  • Enhanced Recent Bets Display: Added "Favored" column showing who sportsbooks favored
  • Corrected Spread Logic: Fixed favorite/underdog identification (spread_line interpretation)
  • Improved User Experience: Clear explanations, better formatting, actionable insights
  • Quality: Better feature combinations leading to improved model stability

✅ Fixed Threshold Documentation (October 2025)

  • Discovered: Dashboard was showing inconsistent threshold information
  • Fixed: Updated to reflect actual F1-optimized threshold (now 28% after leakage fixes)
  • Impact: Users now see accurate betting strategy information

✅ Streamlit Compatibility Updates (October 2025)

  • Fixed: Deprecated use_container_width and width='stretch' parameters
  • Updated: All dataframe displays now use modern Streamlit best practices
  • Result: Dashboard works with latest Streamlit versions without errors

✅ Date Filtering Bug Fix (October 2025)

  • Issue: "Next 10 Underdog Bets" and "Next 10 Spread Bets" sections showing 2020 games instead of upcoming games
  • Root Cause: predictions_df variable was being modified by earlier dashboard sections (converting gameday to strings, filtering for past dates)
  • Solution: Each betting section now reloads fresh data from CSV and properly filters for future games only (gameday > today)
  • Impact: Betting recommendations now correctly display only upcoming games in chronological order
  • Technical: Fixed variable mutation issues across Streamlit sections through data isolation

✅ Git LFS Integration (October 2025)

  • Added: Large file support for nfl_play_by_play_historical.csv.gz (73.95MB)
  • Benefit: Enables deployment to Streamlit Cloud with access to large datasets
  • Setup: Properly configured .gitignore and .gitattributes for optimal repo management

🆕 Enhanced Feature Engineering (October 2025)

  • Current Season Performance: Added real-time team form tracking within current season
  • Prior Season Records: Incorporated previous season's final win percentages for baseline metrics
  • Head-to-Head History: Added historical matchup performance between specific teams
  • Technical Benefits: Multi-scale temporal analysis with strict data leakage prevention
  • Model Impact: Richer context for predictions across multiple time horizons

🔧 System Reliability & Error Resolution (October 2025)

  • Fixed All KeyError Issues: Resolved feature list mismatches between training and dashboard
  • Synchronized Feature Sets: Ensured consistency across all 50+ features in both nfl-gather-data.py and predictions.py
  • Enhanced Monte Carlo Selection: Fixed feature sampling to only use available numeric features
  • Robust Model Retraining: Updated dashboard to handle dynamic feature selection correctly
  • Data Pipeline Validation: Added comprehensive checks for feature availability and data integrity
  • Graceful Error Handling: System now handles missing features and data inconsistencies smoothly

⚡ Performance & Stability Improvements

  • Optimized Feature Loading: Streamlined data processing for faster dashboard startup times
  • Memory Management: Improved handling of large datasets with better resource utilization
  • Cross-Platform Compatibility: Enhanced PowerShell and terminal command support
  • Real-Time Validation: Added live feature availability checking during Monte Carlo experiments
  • Automated Fallbacks: System degrades gracefully when optional features are unavailable

⬆️ Back to Top

🎯 Getting Started

  1. Install Dependencies

    pip install streamlit pandas numpy xgboost scikit-learn
  2. Train Initial Models

# Single-step: build historical data and train models
python build_and_train_pipeline.py
  1. Launch Dashboard

    streamlit run predictions.py
  2. Deploy to Streamlit Cloud (Recommended)

    • All data files are committed to repository
    • Memory optimizations ensure compatibility with Streamlit Cloud resource limits
    • Python 3.12 required for deployment
  3. Start Analyzing

    • Check "Show Betting Analysis & Performance" for ROI metrics
    • Look for pred_underdogWon_optimal = 1 in upcoming games
    • Use the edge calculations to find the best value bets

🔧 Troubleshooting

Common Issues

"TypeError: 'str' object cannot be interpreted as an integer"

  • Cause: Old Streamlit version compatibility issue
  • Fix: Updated in recent version - use git pull to get latest fixes

"Missing file: nfl_play_by_play_historical.csv.gz"

  • Cause: Large file not available locally
  • Fix: Dashboard includes error handling and fallback data display

"Betting signals don't match 30% threshold"

  • Cause: Documentation was incorrect (now fixed)
  • Reality: Model uses 24% F1-optimized threshold, not 30%

Performance Notes

  • Training time: ~5-10 minutes with optimized parameters (300 estimators)
  • Monte Carlo time: ~3-5 minutes for feature selection (200 iterations, 15-feature subsets)
  • Dashboard load time: ~10-30 seconds for full data processing
  • Memory usage: ~1.5GB during data loading (optimized with float32/Int8 dtypes and DataFrame views)
  • Streamlit Cloud: Fully compatible with memory optimizations for cloud deployment

⬆️ Back to Top

🔧 Troubleshooting

Common Issues & Solutions

-#### KeyError: Features not in index

  • Cause: Feature list mismatch between training and dashboard files
  • Solution: Run python build_and_train_pipeline.py (or python nfl-gather-data.py to run only the training step) to regenerate feature files and ensure synchronization
  • Prevention: Both files now auto-sync feature lists to prevent future mismatches

"TypeError: 'str' object cannot be interpreted as an integer"

  • Cause: Old Streamlit version compatibility issue
  • Solution: System now includes modern Streamlit compatibility - use git pull for latest fixes

"Missing file: nfl_play_by_play_historical.csv.gz"

  • Cause: Large file not available locally (Git LFS)
  • Solution: Dashboard includes graceful error handling and fallback data display
  • Alternative: Download manually or use Git LFS: git lfs pull

Monte Carlo Feature Selection Errors

  • Cause: Trying to sample features not available in processed dataset
  • Solution: Fixed - system now validates feature availability before sampling
  • Features: Enhanced error handling with automatic feature filtering

ValueError: DataFrame.dtypes for data must be int, float, bool or category

  • Cause: XGBoost receiving non-numeric columns (object data types like team names, coaches)
  • Solution: Fixed - automatic filtering to numeric/boolean/categorical data types only
  • Prevention: All model inputs now validated for XGBoost compatibility before training

Dashboard Won't Load / Port Already in Use

  • Solution: Use different port: streamlit run predictions.py --server.port=XXXX
  • Default ports: Try 8501, 8502, 8503, etc.
  • Terminal: Check for running Streamlit processes with tasklist | findstr streamlit

Betting Sections Show Old Games (2020) Instead of Upcoming

  • Cause: Variable mutation across Streamlit sections - predictions_df modified by earlier filters
  • Solution: Fixed in latest version - sections now reload fresh data and filter properly
  • Verification: Betting recommendations should show games with future dates only
  • Technical: Each section uses isolated dataframe copy to prevent cross-section contamination

Performance Optimization Tips

  • Faster Loading: Use @st.cache_data decorator for expensive operations (already implemented)
  • Memory Management: Dashboard automatically handles large datasets with chunking
  • Feature Selection: Start with smaller Monte Carlo iterations (50-100) for faster testing

⚠️ Responsible Gambling Notice

This tool is for educational and analytical purposes. While our backtesting shows strong historical performance:

  • Past performance doesn't guarantee future results
  • Only bet what you can afford to lose
  • Consider this one factor in your betting decisions
  • Gambling involves risk - bet responsibly

🤝 Contributing

This project welcomes contributions! Areas for improvement:

  • Additional data sources (weather, injuries, etc.)
  • Enhanced feature engineering
  • Alternative modeling approaches
  • UI/UX improvements

⬆️ Back to Top

Built with: Python • Streamlit • XGBoost • Scikit-learn • NFLverse Data

📥 Export Downloads & Sidebar (Dec 11, 2025)

  • Where to find exports: The app exposes always-visible download controls in the sidebar for:
    • Predictions CSV — current predictions shown in the app (with embedded CSV icon)
    • Betting Log — the betting_recommendations_log.csv file (with embedded CSV icon)
    • Generate Predictions PDF — on-demand PDF generation with embedded PDF icon
  • Why you might not see them: The Streamlit sidebar can be collapsed by default. If you do not see the download buttons, click the small chevron in the top-left of the page to expand the sidebar.
  • Behavior: Download buttons are rendered after the app finishes loading data (the app reserves lightweight placeholders while data loads and populates the real download controls once predictions_df and the betting log are available).
  • Icon consistency: All download buttons and the PDF generate button use embedded icons (csv_icon.png, pdf_icon.png) with fallback to favicon.ico for a professional appearance.

✉️ Emailing Predictions (Updated Dec 29, 2025)

Automated HTML email notifications with clear, actionable betting recommendations. Emails show:

Enhanced Format Features:

  • Clear bet recommendations: "TEN +2.5 to cover (69.1%)" instead of cryptic "SP: 69.06%"
  • Individual confidence badges: Each bet shows its tier (🔥 ELITE ≥65%, ⭐ STRONG 60-65%, 📈 GOOD 55-60%)
  • Full bet names: "Money Line" instead of "ML" for clarity
  • Smart filtering: Only shows bets above thresholds (Spread ≥50%, Moneyline ≥28%, Totals ≥50%)
  • Team colors: Visual team markers for quick identification

Setup:

Environment variables (recommended):

  • EMAIL_FROM — sender email address (e.g. youremail@gmail.com)
  • EMAIL_TO — comma-separated recipient addresses (e.g. alice@example.com,bob@example.com)
  • EMAIL_PASSWORD — Gmail App Password (create in your Google Account's security settings)
  • SMTP_SERVER — SMTP server (default smtp.gmail.com)
  • SMTP_PORT — SMTP port (default 587)

Testing:

# Preview email format in browser without sending
python scripts/preview_email.py

# Send test email
python scripts/send_rich_email_now.py

Example PowerShell export (set for current session):

$env:EMAIL_FROM = "youremail@gmail.com"
$env:EMAIL_TO = "recipient@example.com"
$env:EMAIL_PASSWORD = "<your-app-password>"
$env:SMTP_SERVER = "smtp.gmail.com"
$env:SMTP_PORT = "587"

Notes:

  • Do NOT commit credentials to the repository. Use environment variables or a secure secrets store.
  • For production use, consider the Gmail API (OAuth2) or a transactional email provider (SendGrid, Mailgun) for better deliverability and key rotation.

🧪 Smoke Test & Quick Run

To validate the core imports and data-loading behavior without running the full Streamlit UI, use the included smoke test:

# Run a lightweight import-and-load smoke test
python smoke_test.py

This prints basic progress and a final SMOKE OK line when successful. For full interactive use:

python build_and_train_pipeline.py   # builds predictions/data if needed (single-step)
streamlit run predictions.py

📅 Recent Changes (Dec 11, 2025)

  • Icon Consistency: All download buttons and PDF generate button now use embedded icons (csv_icon.png, pdf_icon.png) with proper fallback to favicon.ico for professional appearance. Fixed incorrect fallback icon path from favicon.png to favicon.ico.
  • PDF Generate Button: Updated to use HTML with embedded PDF icon instead of emoji, matching the visual style of download buttons.
  • Download UX: Sidebar download controls are rendered from placeholders and populated only after predictions_df and the betting log finish loading to avoid rendering heavy widgets during initial load.

📅 Recent Changes (Nov 25, 2025)

  • Per-Game Detail Page: Added a dedicated per-game view reachable with ?game=<game_id> showing matchup summary, model predictions, and a shareable link. The per-game page uses lazy loading and intentionally avoids loading the full play-by-play dataset by default to reduce memory pressure.
  • Underdog Labeling: Per-game header now highlights the underdog in bold using spread-first logic and a moneyline fallback when spreads are unavailable.
  • Schedule Links: Schedule and table links now use path-relative ?game= query parameters and target="_self" so links open in the same tab and remain compatible with subpath deployments.
  • Season-aware Matching: Schedule→prediction matching was tightened to prefer predictions from the same season, preventing accidental linking to historical game IDs.
  • Download UX: Sidebar download buttons are rendered from lightweight placeholders and populate once predictions_df and related data finish loading to avoid early widget creation and reduce perceived load-time issues.
  • QB Names & Header Polish: Away/home QB names included in the per-game header; full team names now show before logos and gameday times set to 00:00:00 are hidden.

If you'd like, I can also add screenshots or a short demo GIF for the per-game page to the README.

📌 Recent Changes (Nov 26, 2025)

  • Per‑Game UI polish & layout fixes: The per‑game detail page (?game=<game_id>) was updated to improve readability and alignment:
    • Metrics and info now start at the left edge (removed the centered spacer that pushed content right).
    • Spread/Total and probability metrics were re-aligned for clearer visual grouping under the @ marker.
  • Team & QB presentation: Team names use a larger, bold display (approx. 30px) with responsive CSS; QB lines include extra vertical spacing to separate them from the first-level metrics.
  • Responsive CSS: Inline CSS classes (.team-name, .team-qb) and a mobile media query were added to keep the per‑game header tidy on small screens.
  • Memory & loading behavior: The per‑game page no longer loads the large play‑by‑play dataset by default and avoids reading the betting log CSV during initial per‑game rendering to reduce memory pressure.
  • Betting log removed from per‑game view: The betting-log table and per‑game CSV download were removed from the per‑game page UI by design; the Betting Performance / Performance dashboard still uses the centralized betting log for analytics.
  • Bug fix: Fixed a NameError by ensuring UI columns are always created before use (prevents occasional crashes when conditions changed).

⬆️ Back to Top

About

NFL predictions based on data modeling

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors