feat: active benchmark measurement (Story 20.1) by rajish · Pull Request #103 · rajish/cc-hdrm

rajish · 2026-03-27T23:07:49Z

Summary

- Adds a "Measure" button to the analytics window that sends controlled test requests per model and measures token efficiency (TPP)
- Implements BenchmarkService with Messages API integration, 3 benchmark variants (output-heavy, input-heavy, cache-heavy), and adaptive retry
- Creates the tpp_measurements database table (migration v6→v7) with TPPStorageService for persistence
- Adds pre-measurement validation (OAuth token, headroom check, activity detection)
- Adds benchmark configuration to Settings (enable toggle, model selector, variant selector)
- Adds BenchmarkSectionView with progress indication, result cards, and cancel support
- Adds 21 new unit tests across TPPMeasurement, BenchmarkService, and TPPStorageService
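
The pre-measurement validation listed above can be pictured roughly as follows. This is a minimal sketch, not the PR's code: `BenchmarkValidation`, `.ready`, and `.tokenExpired` appear in this PR, but the other cases, the `BenchmarkPreconditions` struct, and the threshold values are assumptions made for illustration.

```swift
// Outcome of the pre-measurement check. Only `.ready` and `.tokenExpired`
// are named in the PR; the remaining cases are hypothetical.
enum BenchmarkValidation: Equatable {
    case ready
    case tokenExpired
    case notConnected          // hypothetical
    case insufficientHeadroom  // hypothetical
    case recentActivity        // hypothetical
}

// Hypothetical snapshot of the state the check would read from the app.
struct BenchmarkPreconditions {
    var hasValidToken: Bool
    var isConnected: Bool
    var utilizationPercent: Double
    var secondsSinceLastActivity: Double
}

// Checks run in a fixed order; the first failing check decides the result.
// The default thresholds are placeholders, not values from the PR.
func validate(
    _ state: BenchmarkPreconditions,
    maxUtilization: Double = 90,
    quietPeriod: Double = 60
) -> BenchmarkValidation {
    guard state.hasValidToken else { return .tokenExpired }
    guard state.isConnected else { return .notConnected }
    guard state.utilizationPercent <= maxUtilization else { return .insufficientHeadroom }
    guard state.secondsSinceLastActivity >= quietPeriod else { return .recentActivity }
    return .ready
}

let idle = BenchmarkPreconditions(
    hasValidToken: true, isConnected: true,
    utilizationPercent: 40, secondsSinceLastActivity: 300
)
var busy = idle
busy.secondsSinceLastActivity = 5
```

With these placeholder thresholds, `validate(idle)` yields `.ready` and `validate(busy)` yields `.recentActivity`.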

Story

20.1: Active Benchmark Measurement ("Measure" Button)

Test plan

All XCTest unit tests pass
Code review findings addressed
Build succeeds with xcodebuild

Summary by CodeRabbit

New Features
- Added "Token Efficiency" benchmark measurement system to evaluate API performance across different model configurations (output-heavy, input-heavy, cache-heavy).
- Introduced benchmark settings to enable/disable the feature and select measurement variants.
- Added benchmark measurement UI with a "Measure" button, progress indicators, result cards, and cost ratio analysis.
- Implemented result persistence and latest measurement history retrieval.
Chores
- Updated database schema to support benchmark measurement storage.

Implement the "Measure" button feature for token efficiency measurement. Sends controlled test requests to the Anthropic Messages API, forces usage polls, and computes tokens-per-percent (TPP) from observed utilization deltas.

Key components:
- TPPMeasurement model with BenchmarkVariant and MeasurementSource enums
- Database migration v6->v7 with tpp_measurements table and indexes
- BenchmarkService: Messages API integration, 3 variants (output/input/cache-heavy), adaptive retry when delta is below detection threshold
- TPPStorageService: SQLite persistence for benchmark results
- BenchmarkSectionView: analytics UI with progress, result cards, weighting discovery
- Settings UI: Token Efficiency section with enable toggle and variant checkboxes
- Forced poll integration via PollingEngine.performForcedPoll()
- Full service wiring through AppDelegate -> AnalyticsWindow -> AnalyticsView
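
As a rough illustration of the TPP computation described above, the sketch below divides the tokens of one request by the utilization change observed across it. The `TPPSample` type, its field names, and the 1.0-point minimum delta are assumptions for this example; the PR's `TPPMeasurement` model may be shaped differently and may count tokens differently.

```swift
// Hypothetical stand-in for one benchmark observation.
struct TPPSample {
    let totalTokens: Int            // tokens attributed to the test request
    let utilizationBefore: Double   // utilization in percent before the request
    let utilizationAfter: Double    // utilization in percent after the forced poll

    // Observed change in utilization, in percentage points.
    var delta: Double { utilizationAfter - utilizationBefore }

    // Tokens per percent, or nil when the delta is below the detection
    // threshold and the ratio would be dominated by noise.
    func tokensPerPercent(minimumDelta: Double = 1.0) -> Double? {
        guard delta >= minimumDelta else { return nil }
        return Double(totalTokens) / delta
    }
}

let sample = TPPSample(totalTokens: 50_000, utilizationBefore: 12.0, utilizationAfter: 14.0)
let tooSmall = TPPSample(totalTokens: 2_000, utilizationBefore: 12.0, utilizationAfter: 12.0)
```

Here `sample.tokensPerPercent()` is 25,000 tokens per percentage point, while `tooSmall.tokensPerPercent()` is `nil`, which is the case the commit message says triggers the adaptive retry.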

- validatePreconditions: remove dead code (both else branches returned an identical .tokenExpired); require .connected status, not .disconnected
- BenchmarkSectionView: fix non-unique ForEach IDs when multiple variants run for the same model (was id: \.model, now uses the enumerated offset)
- SettingsView: call syncBenchmarkVariants() in the reset action so variant preference changes are actually persisted, not just reflected in the UI
- Story status: in-progress -> done; sprint-status synced
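
The ForEach fix above comes down to identity: keying rows by model collides once one model has several variant results, while keying by enumerated offset cannot. A small sketch in plain Swift, using a hypothetical `VariantResult` type and placeholder model names:

```swift
// Hypothetical result row; the PR's own result type may differ.
struct VariantResult {
    let model: String
    let variant: String
}

let results = [
    VariantResult(model: "model-a", variant: "output-heavy"),
    VariantResult(model: "model-a", variant: "input-heavy"),
    VariantResult(model: "model-b", variant: "output-heavy"),
]

// Keying by model alone: "model-a" appears twice, so the IDs are not unique.
let modelIDs = results.map { $0.model }
let modelIDsAreUnique = Set(modelIDs).count == modelIDs.count

// Keying by enumerated offset: each row gets its position, unique by construction.
let indexed = Array(results.enumerated())
let offsetIDs = indexed.map { $0.offset }
let offsetIDsAreUnique = Set(offsetIDs).count == offsetIDs.count

// In SwiftUI the same idea would read roughly:
// ForEach(Array(results.enumerated()), id: \.offset) { pair in
//     ResultCard(result: pair.element)   // ResultCard is a hypothetical view
// }
```

Offsets are only stable while the array is not reordered, which seems acceptable for a list that is rebuilt after each benchmark run.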

coderabbitai · 2026-03-27T23:08:07Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 844a83c9-2fd0-42df-919d-e89a7c46701a

📥 Commits

Reviewing files that changed from the base of the PR and between 2a9416c and 34b62bc.

📒 Files selected for processing (25)

_bmad-output/implementation-artifacts/20-1-active-benchmark-measurement.md
_bmad-output/implementation-artifacts/20-2-claude-code-log-parser-service.md
_bmad-output/implementation-artifacts/sprint-status.yaml
cc-hdrm/App/AppDelegate.swift
cc-hdrm/Models/TPPMeasurement.swift
cc-hdrm/Services/BenchmarkService.swift
cc-hdrm/Services/BenchmarkServiceProtocol.swift
cc-hdrm/Services/DatabaseManager.swift
cc-hdrm/Services/PollingEngine.swift
cc-hdrm/Services/PollingEngineProtocol.swift
cc-hdrm/Services/PreferencesManager.swift
cc-hdrm/Services/PreferencesManagerProtocol.swift
cc-hdrm/Services/TPPStorageService.swift
cc-hdrm/Services/TPPStorageServiceProtocol.swift
cc-hdrm/Views/AnalyticsView.swift
cc-hdrm/Views/AnalyticsWindow.swift
cc-hdrm/Views/BenchmarkSectionView.swift
cc-hdrm/Views/SettingsView.swift
cc-hdrmTests/App/AppDelegateTests.swift
cc-hdrmTests/Mocks/MockPreferencesManager.swift
cc-hdrmTests/Models/TPPMeasurementTests.swift
cc-hdrmTests/Services/BenchmarkServiceTests.swift
cc-hdrmTests/Services/DatabaseManagerTests.swift
cc-hdrmTests/Services/PreferencesManagerTests.swift
cc-hdrmTests/Services/TPPStorageServiceTests.swift

📝 Walkthrough

Implemented Story 20.1, a complete active-benchmark measurement system. It comprises tokens-per-percent (TPP) measurement models, a BenchmarkService that orchestrates API calls and utilization polling, a TPPStorageService for persistence, a SQLite schema migration (v6→v7) that adds a tpp_measurements table, forced-poll support in PollingEngine, preferences for enablement and variant selection, UI sections for benchmark configuration and measurement execution, and accompanying unit tests. Story 20.2 was reverted from done to ready-for-dev status.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Benchmark Domain Models `cc-hdrm/Models/TPPMeasurement.swift` | Added `TPPMeasurement` struct and enums (`BenchmarkVariant`, `MeasurementSource`, `MeasurementConfidence`) with computed TPP properties (`computedTppFiveHour`, `computedTppSevenDay`) and factory constructor `fromBenchmark(...)` for creating measurements from benchmark runs. |
| Benchmark Services (Protocol + Implementation) `cc-hdrm/Services/BenchmarkServiceProtocol.swift`, `cc-hdrm/Services/BenchmarkService.swift` | New protocol defining precondition validation, benchmark orchestration (with progress callbacks), and cancellation. Implementation validates OAuth/utilization state, iterates models/variants, POSTs to Messages API via `DataLoader`, forces polling via `PollingEngine.performForcedPoll()`, computes TPP, includes adaptive retry logic for low utilization delta, and persists results via `TPPStorageService`. |
| TPP Storage Services (Protocol + Implementation) `cc-hdrm/Services/TPPStorageServiceProtocol.swift`, `cc-hdrm/Services/TPPStorageService.swift` | New protocol for storing and retrieving `TPPMeasurement` instances. Implementation persists measurements to `tpp_measurements` SQLite table with graceful degradation when database unavailable; supports querying latest measurements by model/variant and retrieving last-benchmark timestamp. |
| Polling Engine Enhancement `cc-hdrm/Services/PollingEngineProtocol.swift`, `cc-hdrm/Services/PollingEngine.swift` | Added `performForcedPoll() async` method to trigger immediate single poll cycle, enabling benchmark measurement to refresh utilization mid-execution. |
| Preferences `cc-hdrm/Services/PreferencesManagerProtocol.swift`, `cc-hdrm/Services/PreferencesManager.swift` | Added three benchmark-related preference properties (`isBenchmarkEnabled`, `benchmarkModels`, `benchmarkVariants`), persisted in `UserDefaults`, with fallback defaults and reset-to-defaults support. |
| Database Schema Migration `cc-hdrm/Services/DatabaseManager.swift` | Bumped schema version from 6 to 7; added `tpp_measurements` table creation during fresh schema init and v6→v7 migration, with indexes on `timestamp` and `(model, source)`. |
| Service Wiring `cc-hdrm/App/AppDelegate.swift` | Conditionally instantiates `BenchmarkService` and `TPPStorageService` during app initialization; passes them through `AnalyticsWindow` configuration to `AnalyticsView`. |
| UI Integration `cc-hdrm/Views/AnalyticsView.swift`, `cc-hdrm/Views/AnalyticsWindow.swift`, `cc-hdrm/Views/BenchmarkSectionView.swift`, `cc-hdrm/Views/SettingsView.swift` | Extended `AnalyticsView` and `AnalyticsWindow` to accept and pass benchmark services. Added `BenchmarkSectionView` rendering Measure button, progress/results UI, weighting discovery, and soft rate-limit warnings. Updated `SettingsView` with benchmark toggle and variant selection controls synced to preferences. |
| Test Suite (Models) `cc-hdrmTests/Models/TPPMeasurementTests.swift` | Added comprehensive tests for `BenchmarkVariant`/`MeasurementSource` enums, `TPPMeasurement` computed properties, and `fromBenchmark(...)` factory constructor. |
| Test Suite (Services) `cc-hdrmTests/Services/BenchmarkServiceTests.swift`, `cc-hdrmTests/Services/TPPStorageServiceTests.swift`, `cc-hdrmTests/Services/DatabaseManagerTests.swift`, `cc-hdrmTests/Services/PreferencesManagerTests.swift` | Added precondition validation tests, benchmark execution tests (with progress/cancellation/decoding scenarios), storage/retrieval roundtrip tests, database schema migration tests (v6→v7 with table/index verification), and preference persistence tests. |
| Test Infrastructure `cc-hdrmTests/App/AppDelegateTests.swift`, `cc-hdrmTests/Mocks/MockPreferencesManager.swift` | Extended mock polling engine with `performForcedPoll()` tracking; extended `MockPreferencesManager` with benchmark preference properties and defaults. |
| Documentation & Status `_bmad-output/implementation-artifacts/20-1-active-benchmark-measurement.md`, `_bmad-output/implementation-artifacts/20-2-claude-code-log-parser-service.md`, `_bmad-output/implementation-artifacts/sprint-status.yaml` | Updated Story 20.1 status to done with code-review completion date (2026-03-27). Reverted Story 20.2 from done back to ready-for-dev, removing review findings and dev-agent records. |
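
The "adaptive retry logic for low utilization delta" mentioned in the table could look something like the sketch below. The doubling strategy, the attempt limit, and the function name `measureWithAdaptiveRetry` are assumptions for illustration; the PR does not state how the request size is adapted between attempts.

```swift
// Runs `attempt` with a growing token budget until the observed utilization
// delta reaches `threshold`, or gives up after `maxAttempts` tries.
// `attempt` takes a token budget and returns the delta in percentage points.
func measureWithAdaptiveRetry(
    initialTokens: Int,
    threshold: Double,
    maxAttempts: Int = 3,
    attempt: (Int) -> Double
) -> (tokens: Int, delta: Double)? {
    var tokens = initialTokens
    for _ in 0..<maxAttempts {
        let delta = attempt(tokens)
        if delta >= threshold {
            return (tokens: tokens, delta: delta)
        }
        tokens *= 2   // assumption: double the request size and try again
    }
    return nil        // the delta never became detectable
}

// Fake endpoint for demonstration: utilization only moves in whole
// percentage points, one point per 10,000 tokens.
let outcome = measureWithAdaptiveRetry(initialTokens: 4_000, threshold: 1.0) { tokens in
    Double(tokens / 10_000)
}
```

With this fake endpoint the 4,000 and 8,000 token attempts show no movement, and the third attempt at 16,000 tokens produces a one-point delta, so `outcome` holds `(tokens: 16000, delta: 1.0)`.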

Sequence Diagram

sequenceDiagram
    participant UI as BenchmarkSectionView
    participant BS as BenchmarkService
    participant PE as PollingEngine
    participant API as Messages API
    participant AppState
    participant Storage as TPPStorageService
    participant DB as DatabaseManager

    UI->>BS: validatePreconditions()
    BS->>AppState: check OAuth, utilization, recent activity
    AppState-->>BS: validation result
    BS-->>UI: BenchmarkValidation (e.g., ready)
    
    UI->>BS: runBenchmark(models, variants, onProgress)
    activate BS
    loop for each model & variant
        BS->>UI: onProgress(.sendingRequest)
        BS->>API: POST /messages with variant config
        API-->>BS: MessagesAPIResponse (tokens)
        BS->>UI: onProgress(.polling)
        BS->>PE: performForcedPoll()
        PE->>AppState: refresh utilization
        AppState-->>PE: updated delta
        PE-->>BS: poll complete
        BS->>UI: onProgress(.computingResult)
        BS->>BS: compute TPP from delta & tokens
        BS->>Storage: storeBenchmarkResult(TPPMeasurement)
        Storage->>DB: INSERT into tpp_measurements
        DB-->>Storage: success
        Storage-->>BS: result stored
        BS->>UI: onProgress(.completed) with result
    end
    deactivate BS
    BS-->>UI: [BenchmarkVariantResult]
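
The loop in the diagram can be sketched in a few lines. This is a simplified, synchronous outline under stated assumptions: the diagram suggests the real calls are asynchronous, and the closure parameters, the `VariantRun` type, and the baseline poll before the loop are all hypothetical stand-ins for the PR's services.

```swift
// Progress phases named in the diagram.
enum BenchmarkProgress: Equatable {
    case sendingRequest, polling, computingResult, completed
}

// Hypothetical result of one model/variant run.
struct VariantRun {
    let model: String
    let variant: String
    let tokensPerPercent: Double
}

func runBenchmark(
    models: [String],
    variants: [String],
    sendRequest: (String, String) -> Int,   // returns tokens used by the request
    forcedPoll: () -> Double,               // returns current utilization in percent
    store: (VariantRun) -> Void,
    onProgress: (BenchmarkProgress) -> Void
) -> [VariantRun] {
    var results: [VariantRun] = []
    var previous = forcedPoll()             // assumption: baseline taken up front
    for model in models {
        for variant in variants {
            onProgress(.sendingRequest)
            let tokens = sendRequest(model, variant)
            onProgress(.polling)
            let current = forcedPoll()
            onProgress(.computingResult)
            let delta = current - previous
            previous = current
            guard delta > 0 else { continue }   // nothing measurable for this run
            let run = VariantRun(model: model, variant: variant,
                                 tokensPerPercent: Double(tokens) / delta)
            store(run)
            results.append(run)
            onProgress(.completed)
        }
    }
    return results
}

// Stubbed usage: utilization readings are scripted instead of polled.
var readings = [10.0, 12.0, 16.0]
var stored: [VariantRun] = []
var events: [BenchmarkProgress] = []
let runs = runBenchmark(
    models: ["model-a"],
    variants: ["output-heavy", "input-heavy"],
    sendRequest: { _, _ in 20_000 },
    forcedPoll: { readings.removeFirst() },
    store: { stored.append($0) },
    onProgress: { events.append($0) }
)
```

Passing the collaborators as closures keeps the sketch runnable without a network or database; in the PR these roles are filled by the Messages API client, `PollingEngine`, and `TPPStorageService`.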

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feat: tier recommendation display card with billing cycle settings (Story 16.4) #54: Modifies AppDelegate, AnalyticsWindow, AnalyticsView, and PreferencesManagerProtocol to add optional service dependencies—overlaps with the UI wiring and preference interface extensions in this PR.
fix: poll interval hot reload (Story 2.5) #98: Modifies PollingEngine and PollingEngineProtocol to add new polling-control methods; this PR adds performForcedPoll() to both, making them directly related at the protocol/implementation level.
feat: extra usage state propagation and menu bar indicator (Story 17.1) #59: Modifies PollingEngine and AppDelegate implementations for polling control and extra-usage propagation; related through shared changes to polling orchestration and app initialization.

Poem

🐰 Hops with joy through token streams,
Measuring benchmarks, fulfilling dreams!
TPP flows, five hours to seven,
Our database schema reaches version heaven.
Progress callbacks and storage so bold,
The Measure button's story fully told! 🎯

rajish added 3 commits March 28, 2026 00:07

docs: add story files for Stories 20.1 and 20.2

e213b11

rajish merged commit 49ed179 into master Mar 27, 2026
0 of 3 checks passed

rajish deleted the feature/story-20-1-active-benchmark-measurement branch March 27, 2026 23:07
