
feat(leaderboard): ghost performance leaderboard and A/B testing #79

Merged
Enreign merged 2 commits into main from feat/ghost-leaderboard on Mar 18, 2026

@Enreign commented on Mar 16, 2026

Description

Adds a performance leaderboard for ghost profiles with SQLite-backed tracking and A/B testing support.

Changes

  • src/leaderboard.rs: GhostLeaderboard with outcome recording, rankings, A/B routing, auto-promotion recommendations, and head-to-head comparison
  • src/config.rs: LeaderboardConfig with ab_test_ghost, ab_test_fraction, min_samples_for_recommendation, promotion_threshold
  • src/main.rs: sparks leaderboard [show|compare <a> <b>|reset]
  • config.example.toml: [leaderboard] section
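
For orientation, the configuration fields listed above could be modelled roughly as follows. This is a minimal sketch, not the code from src/config.rs: the field names and the default values (10%, 50, 10%) come from this description, while the field types, the `Option` wrapper, and the `ab_testing_enabled` helper are assumptions made for illustration.

```rust
/// Hypothetical mirror of the `[leaderboard]` config section.
/// Field names follow the PR description; the types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct LeaderboardConfig {
    /// Name of the challenger ghost; `None` is assumed to mean "no A/B test".
    pub ab_test_ghost: Option<String>,
    /// Fraction of requests routed to the challenger (0.0 to 1.0).
    pub ab_test_fraction: f64,
    /// Minimum number of recorded tasks before a recommendation is made.
    pub min_samples_for_recommendation: u64,
    /// Margin by which the challenger must beat the control.
    pub promotion_threshold: f64,
}

impl Default for LeaderboardConfig {
    fn default() -> Self {
        Self {
            ab_test_ghost: None,
            ab_test_fraction: 0.10,             // default: 10%
            min_samples_for_recommendation: 50, // default: 50
            promotion_threshold: 0.10,          // default: 10%
        }
    }
}

impl LeaderboardConfig {
    /// Hypothetical helper: A/B testing is active only when a challenger
    /// is named and a non-zero share of traffic is assigned to it.
    pub fn ab_testing_enabled(&self) -> bool {
        self.ab_test_ghost.is_some() && self.ab_test_fraction > 0.0
    }
}
```

With these assumed defaults, A/B testing stays off until a challenger ghost is named.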

Features

Leaderboard

Ranks all ghost profiles by a composite score: 60% success rate + 20% user rating + 20% token efficiency.
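
The weighting can be sketched as below. Only the 60/20/20 weights and the -1/0/1 rating scale come from this PR; the `GhostStats` struct, the mapping of the average rating onto [0, 1], and the token-efficiency formula with its `TOKEN_BUDGET` constant are assumptions chosen so that every component lies in [0, 1]. The actual rank_score in src/leaderboard.rs may normalise differently.

```rust
/// Hypothetical aggregate stats for one ghost profile.
pub struct GhostStats {
    pub total_tasks: u64,
    pub successful_tasks: u64,
    /// Sum of per-task user ratings, each -1, 0, or 1.
    pub rating_sum: i64,
    pub total_tokens: u64,
}

/// Assumed reference budget: a ghost averaging this many tokens per task
/// gets a token-efficiency component of 0.5.
const TOKEN_BUDGET: f64 = 4000.0;

/// Composite score in [0, 1]: 60% success rate, 20% user rating,
/// 20% token efficiency.
pub fn rank_score(s: &GhostStats) -> f64 {
    if s.total_tasks == 0 {
        return 0.0; // nothing recorded yet
    }
    let n = s.total_tasks as f64;
    let success_rate = s.successful_tasks as f64 / n;
    // Map the average rating from [-1, 1] onto [0, 1] (assumption).
    let rating = (s.rating_sum as f64 / n + 1.0) / 2.0;
    // Fewer tokens per task gives a higher score; zero tokens gives 1.0.
    let avg_tokens = s.total_tokens as f64 / n;
    let token_efficiency = TOKEN_BUDGET / (TOKEN_BUDGET + avg_tokens);
    0.6 * success_rate + 0.2 * rating + 0.2 * token_efficiency
}
```

Keeping each component in [0, 1] means the weights alone decide how much each signal contributes to the ranking.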

A/B Testing

Routes ab_test_fraction (default: 10%) of requests to a challenger ghost (ab_test_ghost), then recommends promotion once the challenger outperforms the control by promotion_threshold (default: 10%) over at least min_samples_for_recommendation (default: 50) tasks.
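
A rough sketch of the routing and promotion logic described above follows. The `Route` enum, the `Arm` struct, and both function signatures are hypothetical. The sketch also assumes that the threshold is a relative margin on the composite score and that both arms must reach the minimum sample count; the PR text does not settle either point. The random roll is passed in by the caller so the example needs no external crate.

```rust
#[derive(Debug, PartialEq)]
pub enum Route {
    Control,
    Challenger,
}

/// Picks an arm for one request. `roll` is a uniform sample in [0, 1)
/// supplied by the caller (for example from a random number generator).
pub fn ab_route(ab_test_fraction: f64, roll: f64) -> Route {
    if roll < ab_test_fraction {
        Route::Challenger
    } else {
        Route::Control
    }
}

/// Hypothetical per-arm summary: number of recorded tasks and score.
pub struct Arm {
    pub samples: u64,
    pub score: f64,
}

/// Recommends promotion when both arms have enough samples and the
/// challenger's score exceeds the control's by the relative threshold.
/// Both conditions are assumptions about the intended semantics.
pub fn should_recommend_promotion(
    control: &Arm,
    challenger: &Arm,
    promotion_threshold: f64,
    min_samples: u64,
) -> bool {
    if control.samples < min_samples || challenger.samples < min_samples {
        return false; // not enough data to compare yet
    }
    challenger.score >= control.score * (1.0 + promotion_threshold)
}
```

Under these assumptions, a fraction of 1.0 sends every request to the challenger and a fraction of 0.0 sends none, since the roll is always below 1.0 and never below 0.0.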

Comparison

sparks leaderboard compare coder architect

Type of Change

  • New feature

Pre-PR Checklist

  • cargo check -q passes
  • cargo test -q passes (398 tests, 0 failed)

@Enreign force-pushed the feat/ghost-leaderboard branch 2 times, most recently from 3d7ee0d to d6578ff, on March 18, 2026 22:11
Enreign and others added 2 commits March 18, 2026 23:14
- GhostLeaderboard: SQLite-backed outcome tracking per ghost profile
- TaskOutcome: records success, latency, token usage, user rating (-1/0/1)
- GhostMetrics: aggregate stats with composite rank_score (60% success,
  20% user rating, 20% token efficiency)
- A/B testing: route configurable fraction of requests to a challenger ghost
- Auto-promotion: recommend promoting challenger when it outperforms control
  by promotion_threshold (default: 10%) over min_samples (default: 50)
- ASCII leaderboard with star ratings and head-to-head comparison
- 'sparks leaderboard [show|compare <a> <b>|reset]' CLI subcommand
- 7 unit tests covering recording, ranking, A/B routing, promotion check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add TaskOutcome::new() convenience constructor to reduce struct literal verbosity in tests
- Fix compare() output: add column headers (ghost names + separator) so each column is identifiable
- Use successful_tasks in format_row() ("{success}/{total}") to eliminate dead-field warning
- Add four missing tests: ab_route with fraction=1.0, rank_score with zero tokens, reset(), and format_leaderboard with data
- Add [leaderboard] section to config.example.toml with all fields documented

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Enreign force-pushed the feat/ghost-leaderboard branch from d6578ff to 7f76e4f on March 18, 2026 22:14
@Enreign merged commit 0d066b4 into main on Mar 18, 2026
4 checks passed
@Enreign deleted the feat/ghost-leaderboard branch on March 18, 2026 22:15