Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 34 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ This repository contains Airflow DAGs that automatically download sports data da
- Game events (shots, goals, hits, penalties)
- Shift data (time on ice)
- Player and team statistics
- **TrueSkill player ratings** ⭐ NEW!
- **Schedule:** Daily at 7:00 AM

### 🏇 Hong Kong Horse Racing
Expand All @@ -26,6 +27,9 @@ nhlstats/
│ └── hk_racing_daily_download.py # HK racing data collection
├── nhl_game_events.py # NHL API client
├── nhl_shifts.py # NHL shifts data
├── nhl_trueskill.py # TrueSkill rating system ⭐
├── calculate_trueskill_ratings.py # Calculate player ratings ⭐
├── query_trueskill_ratings.py # Query player ratings ⭐
├── hk_racing_scraper.py # HKJC web scraper
├── data/
│ ├── games/ # NHL game data
Expand All @@ -34,6 +38,7 @@ nhlstats/
│ └── nhlstats.duckdb # DuckDB database (future)
├── NORMALIZATION_PLAN.md # NHL schema design
├── HK_RACING_SCHEMA.md # Racing schema design
├── README_TRUESKILL.md # TrueSkill rating system guide ⭐
└── README_AIRFLOW.md # Airflow setup guide
```

Expand All @@ -44,20 +49,43 @@ nhlstats/
pip install -r requirements.txt
```

### 2. Initialize Airflow
### 2. TrueSkill Player Ratings (NEW!)

Calculate skill ratings for all NHL players:

```bash
# Calculate ratings for 2023-24 season
python calculate_trueskill_ratings.py --season 2023

# Or process all available seasons
python calculate_trueskill_ratings.py --all-seasons

# Query top players
python query_trueskill_ratings.py --top 50

# Search for specific player
python query_trueskill_ratings.py --player "McDavid"

# View ratings by position
python query_trueskill_ratings.py --by-position
```

**See [README_TRUESKILL.md](README_TRUESKILL.md) for detailed documentation.**

### 3. Initialize Airflow
```bash
export AIRFLOW_HOME=~/airflow
airflow db init
airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email admin@example.com
```

### 3. Configure DAGs
### 4. Configure DAGs
Copy DAGs to Airflow folder or configure `dags_folder` in `airflow.cfg`:
```bash
cp dags/*.py ~/airflow/dags/
```

### 4. Start Airflow
### 5. Start Airflow
```bash
# Terminal 1: Web server
airflow webserver --port 8080
Expand All @@ -66,7 +94,7 @@ airflow webserver --port 8080
airflow scheduler
```

### 5. Enable DAGs
### 6. Enable DAGs
Go to http://localhost:8080 and toggle on:
- `nhl_daily_download`
- `hk_racing_daily_download`
Expand Down Expand Up @@ -139,6 +167,7 @@ Both sports will be normalized into DuckDB for efficient querying:
- `games`, `teams`, `players`
- `game_events`, `shots`, `shifts`
- `player_game_stats`, `goalie_game_stats`
- `player_trueskill_ratings`, `player_trueskill_history` ⭐ NEW!

### Racing Tables
- `race_meetings`, `races`
Expand All @@ -147,6 +176,7 @@ Both sports will be normalized into DuckDB for efficient querying:

See detailed schemas in:
- [NORMALIZATION_PLAN.md](NORMALIZATION_PLAN.md) - NHL schema
- [README_TRUESKILL.md](README_TRUESKILL.md) - TrueSkill rating system ⭐
- [HK_RACING_SCHEMA.md](HK_RACING_SCHEMA.md) - Racing schema

## Configuration
Expand Down
288 changes: 288 additions & 0 deletions README_TRUESKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,288 @@
# NHL TrueSkill Rating System

## Overview

This repository now includes a TrueSkill-based rating system for NHL players. TrueSkill is a Bayesian skill rating system developed by Microsoft Research that represents each player's skill as a Gaussian probability distribution.

## What is TrueSkill?

TrueSkill is superior to traditional rating systems (like Elo) for team sports because:

1. **Handles Team Games**: Works with any number of players per team
2. **Uncertainty Tracking**: Maintains confidence intervals for each rating
3. **Partial Play**: Can weight contributions (e.g., by time on ice)
4. **Multiple Teams**: Supports more than 2 teams/players per match
5. **Draws**: Properly handles tied outcomes

### How It Works

Each player has a rating represented by:
- **μ (mu)**: Mean skill level (starts at 25.0)
- **σ (sigma)**: Uncertainty in skill estimate (starts at 8.33)

The **skill estimate** is calculated conservatively as: `μ - 3σ`

After each game:
- Winners gain rating points
- Losers lose rating points
- Uncertainty (σ) decreases over time
- Changes are weighted by time on ice (TOI)

## NHL-Specific Adaptations

Our implementation is customized for hockey:

```python
# TrueSkill parameters for NHL
mu = 25.0 # Initial mean skill
sigma = 25.0/3 # Initial uncertainty (8.33)
beta = 25.0/6 # Skill variance per level (4.17)
tau = 25.0/300 # Dynamics factor (0.083)
draw_probability = 0.10 # ~10% of NHL games go to OT/SO
```

### Time on Ice Weighting

Players are weighted by their time on ice in the game:
- More ice time = greater impact on team outcome
- Goalies weighted by their TOI (usually full game or backup)
- Skaters weighted by shifts and playing time

## Installation

The TrueSkill library is included in `requirements.txt`:

```bash
pip install -r requirements.txt
```

## Usage

### 1. Calculate Ratings for a Season

```bash
# Calculate ratings for 2023-24 season
python calculate_trueskill_ratings.py --season 2023

# Calculate ratings for all available seasons
python calculate_trueskill_ratings.py --all-seasons
```

This will:
1. Process all completed games in chronological order
2. Update player ratings after each game
3. Store ratings in the database
4. Export final ratings to `data/nhl_trueskill_ratings.json`

### 2. Query Player Ratings

```bash
# Top 50 players overall
python query_trueskill_ratings.py --top 50

# Top 20 goalies
python query_trueskill_ratings.py --top 20 --position G

# Top 10 centers
python query_trueskill_ratings.py --top 10 --position C

# Search for specific player
python query_trueskill_ratings.py --player "McDavid"

# Get rating for specific player with history
python query_trueskill_ratings.py --player-id 8478402 --history

# Show top 10 for each position
python query_trueskill_ratings.py --by-position
```

### 3. Use in Python Code

```python
from nhl_trueskill import NHLTrueSkillRatings

# Initialize rating system
with NHLTrueSkillRatings() as ratings:

# Process a season
stats = ratings.process_season(season=2023, game_type=2)

# Get top players
top_players = ratings.get_top_players(limit=50, min_games=20)

for player in top_players:
print(f"{player['first_name']} {player['last_name']}: {player['skill_estimate']:.2f}")

# Calculate team rating for a game
player_ids = [8478402, 8479318, 8477934] # Example: McDavid, Draisaitl, Nugent-Hopkins
weights = [1200, 1100, 1000] # TOI in seconds
team_rating = ratings.calculate_team_rating(player_ids, weights)

# Export ratings
ratings.export_ratings("data/nhl_trueskill_ratings.json")
```

## Database Schema

Two new tables are added to track TrueSkill ratings:

### `player_trueskill_ratings`
Current rating for each player:
- `player_id`: Player identifier
- `mu`: Mean skill level
- `sigma`: Uncertainty in skill
- `skill_estimate`: Conservative estimate (μ - 3σ)
- `games_played`: Number of games processed
- `last_updated`: Timestamp of last update

### `player_trueskill_history`
Historical ratings after each game:
- `player_id`: Player identifier
- `game_id`: Game identifier
- `game_date`: Date of game
- `mu_before`: Rating before game
- `sigma_before`: Uncertainty before game
- `mu_after`: Rating after game
- `sigma_after`: Uncertainty after game
- `toi_seconds`: Time on ice in game
- `team_won`: Whether player's team won

## Understanding the Ratings

### Rating Ranges

Typical NHL player skill estimates:
- **Elite (30+)**: Superstars (McDavid, Matthews, etc.)
- **High (25-30)**: All-stars and top-line players
- **Average (20-25)**: Regular NHL players
- **Below Average (15-20)**: Bottom-six/bottom-pair players
- **New/Uncertain (<15)**: Rookies or players with few games

### Uncertainty (σ)

- **High σ (>5)**: New players, few games, unreliable rating
- **Medium σ (3-5)**: Some games played, rating stabilizing
- **Low σ (<3)**: Veteran with many games, confident rating

The skill estimate `μ - 3σ` is conservative - it's 99.7% likely the player's true skill is at least this high.

### Rating Changes

After each game:
- **Wins**: Players gain rating points (amount depends on opponent strength and uncertainty)
- **Losses**: Players lose rating points
- **Upsets**: Beating stronger teams yields bigger gains
- **Expected Wins**: Beating weaker teams yields smaller gains

Uncertainty decreases with each game as we become more confident in the rating.

## Applications

### 1. Player Evaluation
- Compare players across teams and positions
- Identify undervalued players
- Track player development over time
- Scout rookies and prospects

### 2. Team Strength Calculation
- Aggregate player ratings to get team strength
- Weight by expected lineup/ice time
- Account for injuries and roster changes
- Predict game outcomes

### 3. Lineup Optimization
- Identify strongest line combinations
- Balance TOI distribution
- Optimize special teams units
- Roster construction for cap management

### 4. Predictive Modeling
- Use ratings as features in ML models
- Predict game winners
- Forecast player performance
- Expected goals models

### 5. Trade Analysis
- Evaluate trade value
- Compare players in trade scenarios
- Assess prospect value
- Long-term team building

## Example Output

```
================================================================================
TOP 50 NHL PLAYERS BY TRUESKILL RATING
================================================================================
Rank Name Pos Skill μ σ Games
--------------------------------------------------------------------------------
1 Connor McDavid C 31.45 32.67 0.41 82
2 Nathan MacKinnon C 30.89 32.01 0.37 78
3 Auston Matthews C 30.54 31.78 0.41 81
4 Leon Draisaitl C 30.12 31.45 0.44 80
5 Nikita Kucherov R 29.87 31.21 0.45 79
...
```

## Technical Details

### Algorithm

TrueSkill uses Bayesian inference to update ratings:

1. **Before game**: Each player has rating N(μ, σ²)
2. **Team rating**: Aggregate players weighted by TOI
3. **Match quality**: Calculate performance difference
4. **Outcome**: Observe which team won
5. **Update**: Adjust μ and σ using factor graphs and message passing
6. **After game**: Players have new ratings N(μ', σ'²)

### Implementation

- **Library**: `trueskill` Python package
- **Database**: DuckDB for efficient storage and querying
- **Processing**: Sequential by game date to maintain temporal ordering
- **Weighting**: Time on ice used as weight parameter
- **Ties**: Handled via draw probability parameter

### Performance

- Processing 1,000+ games takes ~30-60 seconds
- Ratings converge after ~20 games per player
- Database queries are sub-second with indexes
- Export to JSON for external use

## References

1. [TrueSkill Paper](https://www.microsoft.com/en-us/research/publication/trueskilltm-a-bayesian-skill-rating-system/)
2. [TrueSkill Python Library](https://trueskill.org/)
3. [Factor Graphs for Rating Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2007/01/NIPS2006_0688.pdf)
4. [Bayesian Skill Rating](https://en.wikipedia.org/wiki/TrueSkill)

## Future Enhancements

Potential improvements to the rating system:

- [ ] Position-specific ratings (forwards vs defensemen vs goalies)
- [ ] Home ice advantage factor
- [ ] Playoff vs regular season separate ratings
- [ ] Rating decay for injured/inactive players
- [ ] Line chemistry bonuses
- [ ] Special teams ratings (PP/PK)
- [ ] Situation-based ratings (score effects)
- [ ] Real-time rating updates during season
- [ ] API endpoint for rating queries
- [ ] Interactive visualization dashboard

## Contributing

The TrueSkill implementation is modular and extensible. To add features:

1. Modify `nhl_trueskill.py` for core rating logic
2. Update `calculate_trueskill_ratings.py` for batch processing
3. Extend `query_trueskill_ratings.py` for new queries
4. Update database schema in `nhl_db_schema.sql` if needed

## License

TrueSkill is a patented algorithm by Microsoft Research. This implementation uses the open-source Python library for non-commercial research and analysis purposes.
Loading