dtsong (Owner) commented on Jan 24, 2026

Summary

  • CalibrationWeights dataclass for tunable severity penalties in confidence scoring
  • calculate_confidence now accepts an optional weights parameter for calibrated scoring (see the first sketch below)
  • evals/calibration.py with confidence bucketing, ECE (Expected Calibration Error) computation, and weight suggestion (second sketch below)
  • load_calibration_data reads human-labeled outcomes from YAML files
  • Sample calibration data with 5 labeled review outcomes
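
A minimal sketch of the scoring API described above. The field names, penalty defaults, and findings shape are assumptions for illustration; the actual definitions live in the diff, which this conversation doesn't show:

```python
from dataclasses import dataclass

# Hypothetical field names and defaults -- illustrative, not taken from the diff.
@dataclass(frozen=True)
class CalibrationWeights:
    critical_penalty: float = 0.4
    major_penalty: float = 0.2
    minor_penalty: float = 0.05

def calculate_confidence(
    findings: list[dict], weights: CalibrationWeights | None = None
) -> float:
    """Start at 1.0, subtract a per-severity penalty for each finding,
    and clamp the result to [0, 1]."""
    w = weights or CalibrationWeights()
    penalty = {
        "critical": w.critical_penalty,
        "major": w.major_penalty,
        "minor": w.minor_penalty,
    }
    score = 1.0 - sum(penalty.get(f["severity"], 0.0) for f in findings)
    return max(0.0, min(1.0, score))
```

With no weights argument the defaults apply, which is how existing callers keep their old scores.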

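For context, Expected Calibration Error buckets predictions by confidence and takes the sample-weighted gap between each bucket's mean confidence and its observed accuracy. A self-contained version of that calculation, independent of the actual evals/calibration.py internals and omitting the weight-suggestion step:

```python
def expected_calibration_error(
    samples: list[tuple[float, bool]], n_buckets: int = 10
) -> float:
    """ECE: weighted average gap between mean predicted confidence and
    observed accuracy per bucket. Each sample pairs a predicted confidence
    with the human-labeled outcome (True if the review finding was correct)."""
    buckets: list[list[tuple[float, bool]]] = [[] for _ in range(n_buckets)]
    for conf, outcome in samples:
        idx = min(int(conf * n_buckets), n_buckets - 1)
        buckets[idx].append((conf, outcome))
    total = len(samples)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated scorer drives this toward 0; one plausible weight-suggestion strategy is to nudge the severity penalties in whichever direction shrinks the per-bucket gaps.
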
Test plan

  • 16 new tests covering calibration samples, bucketing, analysis, weight suggestions, and data loading (assumed loader shape sketched below)
  • Existing confidence tests still pass with the new CalibrationWeights defaults
  • The full suite passes (405 tests, 95.77% coverage)
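
Since the sample YAML isn't reproduced in this conversation, here is an assumed shape for the labeled-outcome files that load_calibration_data reads; the schema is a guess, not taken from the diff:

```python
import yaml  # PyYAML

def load_calibration_data(path: str) -> list[tuple[float, bool]]:
    """Read human-labeled review outcomes from a YAML file.

    Assumed schema (illustrative only):
        samples:
          - confidence: 0.9   # predicted confidence for a review finding
            correct: true     # human label: was the finding right?
    """
    with open(path) as f:
        data = yaml.safe_load(f)
    return [(float(s["confidence"]), bool(s["correct"])) for s in data["samples"]]
```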

Closes #39 (Confidence calibration dataset)

🤖 Generated with Claude Code

dtsong merged commit 171e58b into main on Jan 24, 2026 (2 checks passed).
dtsong deleted the feat/39-confidence-calibration branch on January 24, 2026 at 20:12.