feat: active benchmark measurement (Story 20.1)#103

Merged
rajish merged 3 commits into master from
feature/story-20-1-active-benchmark-measurement
Mar 27, 2026

rajish (Owner) commented Mar 27, 2026

Summary

  • Adds a "Measure" button to the analytics window that sends controlled test requests per model and measures token efficiency as tokens-per-percent (TPP)
  • Implements BenchmarkService with Messages API integration, 3 benchmark variants (output-heavy, input-heavy, cache-heavy), and adaptive retry
  • Creates tpp_measurements database table (migration v6→v7) with TPPStorageService for persistence
  • Adds pre-measurement validation (OAuth token, headroom check, activity detection)
  • Includes benchmark configuration in Settings (enable toggle, model selector, variant selector)
  • BenchmarkSectionView with progress indication, result cards, and cancel support
  • 21 new unit tests across TPPMeasurement, BenchmarkService, and TPPStorageService
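
The TPP figure mentioned above can be illustrated with a minimal sketch. It assumes TPP is the number of tokens consumed divided by the observed utilization delta in percentage points, and that a delta below a detection threshold is treated as unmeasurable. The function name, parameter names, and the default threshold of 1.0 are hypothetical and not taken from this PR.

```swift
import Foundation

/// Hypothetical helper (not from the PR): tokens consumed per one
/// percentage point of utilization. Returns nil when the observed delta
/// is below the assumed detection threshold.
func tokensPerPercent(totalTokens: Int,
                      utilizationBefore: Double,
                      utilizationAfter: Double,
                      detectionThreshold: Double = 1.0) -> Double? {
    let delta = utilizationAfter - utilizationBefore
    // A delta this small cannot be told apart from noise, so report nothing.
    guard delta >= detectionThreshold else { return nil }
    return Double(totalTokens) / delta
}

// 12,000 tokens moved utilization from 40% to 42%.
let tpp = tokensPerPercent(totalTokens: 12_000,
                           utilizationBefore: 40.0,
                           utilizationAfter: 42.0)
```

A nil result is the case in which a caller might retry with a larger request, which is roughly what the adaptive retry in the next bullet is described as handling.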

Story

20.1: Active Benchmark Measurement ("Measure" Button)

Test plan

  • All XCTest unit tests pass
  • Code review findings addressed
  • Build succeeds with xcodebuild

Summary by CodeRabbit

  • New Features

    • Added "Token Efficiency" benchmark measurement system to evaluate API performance across different model configurations (output-heavy, input-heavy, cache-heavy).
    • Introduced benchmark settings to enable/disable the feature and select measurement variants.
    • Added benchmark measurement UI with a "Measure" button, progress indicators, result cards, and cost ratio analysis.
    • Implemented result persistence and latest measurement history retrieval.
  • Chores

    • Updated database schema to support benchmark measurement storage.

rajish added 3 commits March 28, 2026 00:07
Implement the "Measure" button feature for token efficiency measurement.
Sends controlled test requests to the Anthropic Messages API, forces
usage polls, and computes tokens-per-percent (TPP) from observed
utilization deltas.

Key components:
- TPPMeasurement model with BenchmarkVariant and MeasurementSource enums
- Database migration v6->v7 with tpp_measurements table and indexes
- BenchmarkService: Messages API integration, 3 variants (output/input/cache-heavy),
  adaptive retry when delta is below detection threshold
- TPPStorageService: SQLite persistence for benchmark results
- BenchmarkSectionView: analytics UI with progress, result cards, weighting discovery
- Settings UI: Token Efficiency section with enable toggle and variant checkboxes
- Forced poll integration via PollingEngine.performForcedPoll()
- Full service wiring through AppDelegate -> AnalyticsWindow -> AnalyticsView
- validatePreconditions: remove dead code (both else branches returned
  an identical .tokenExpired); require .connected status, not .disconnected
- BenchmarkSectionView: fix ForEach non-unique IDs when multiple variants
  run for same model (was id: \.model, now uses enumerated offset)
- SettingsView: call syncBenchmarkVariants() in reset action so variant
  preference changes are actually persisted, not just reflected in UI
- Story status: in-progress -> done; sprint-status synced
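
The ForEach fix described in the commit notes above can be illustrated without SwiftUI. This is a sketch under assumptions: VariantResult and the model names are hypothetical stand-ins, and the snippet only shows why keying rows by model produces duplicate IDs when several variants run for one model, while the enumerated offset stays unique.

```swift
import Foundation

// Hypothetical stand-in for one benchmark result row (not the PR's type).
struct VariantResult {
    let model: String
    let variant: String
}

let results = [
    VariantResult(model: "model-a", variant: "output-heavy"),
    VariantResult(model: "model-a", variant: "input-heavy"),
    VariantResult(model: "model-b", variant: "output-heavy"),
]

// Keying by model repeats "model-a", so the IDs collide.
let modelIDs = results.map { $0.model }
// Keying by position in the array never collides.
let offsetIDs = results.enumerated().map { $0.offset }

let modelIDsAreUnique = Set(modelIDs).count == results.count
let offsetIDsAreUnique = Set(offsetIDs).count == results.count

// In a SwiftUI view the same idea would look roughly like:
// ForEach(Array(results.enumerated()), id: \.offset) { pair in ... }
```

Offsets are only stable while the array is not reordered, which is acceptable for a result list that is rebuilt on each run.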
rajish merged commit 49ed179 into master Mar 27, 2026
0 of 3 checks passed
rajish deleted the feature/story-20-1-active-benchmark-measurement branch March 27, 2026 23:07

coderabbitai Bot commented Mar 27, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 844a83c9-2fd0-42df-919d-e89a7c46701a

📥 Commits

Reviewing files that changed from the base of the PR and between 2a9416c and 34b62bc.

📒 Files selected for processing (25)
  • _bmad-output/implementation-artifacts/20-1-active-benchmark-measurement.md
  • _bmad-output/implementation-artifacts/20-2-claude-code-log-parser-service.md
  • _bmad-output/implementation-artifacts/sprint-status.yaml
  • cc-hdrm/App/AppDelegate.swift
  • cc-hdrm/Models/TPPMeasurement.swift
  • cc-hdrm/Services/BenchmarkService.swift
  • cc-hdrm/Services/BenchmarkServiceProtocol.swift
  • cc-hdrm/Services/DatabaseManager.swift
  • cc-hdrm/Services/PollingEngine.swift
  • cc-hdrm/Services/PollingEngineProtocol.swift
  • cc-hdrm/Services/PreferencesManager.swift
  • cc-hdrm/Services/PreferencesManagerProtocol.swift
  • cc-hdrm/Services/TPPStorageService.swift
  • cc-hdrm/Services/TPPStorageServiceProtocol.swift
  • cc-hdrm/Views/AnalyticsView.swift
  • cc-hdrm/Views/AnalyticsWindow.swift
  • cc-hdrm/Views/BenchmarkSectionView.swift
  • cc-hdrm/Views/SettingsView.swift
  • cc-hdrmTests/App/AppDelegateTests.swift
  • cc-hdrmTests/Mocks/MockPreferencesManager.swift
  • cc-hdrmTests/Models/TPPMeasurementTests.swift
  • cc-hdrmTests/Services/BenchmarkServiceTests.swift
  • cc-hdrmTests/Services/DatabaseManagerTests.swift
  • cc-hdrmTests/Services/PreferencesManagerTests.swift
  • cc-hdrmTests/Services/TPPStorageServiceTests.swift

Walkthrough

Implements Story 20.1: a complete active-benchmark measurement system. It adds tokens-per-percent (TPP) measurement models, a BenchmarkService that orchestrates API calls and utilization polling, a TPPStorageService for persistence, a SQLite schema migration (v6→v7) that creates the tpp_measurements table, and forced-poll support in PollingEngine. It also adds preferences for enablement and variant selection, UI sections for benchmark configuration and measurement execution, and unit tests for the new models and services. Story 20.2 was reverted from done to ready-for-dev status.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Benchmark Domain Models: cc-hdrm/Models/TPPMeasurement.swift | Added TPPMeasurement struct and enums (BenchmarkVariant, MeasurementSource, MeasurementConfidence) with computed TPP properties (computedTppFiveHour, computedTppSevenDay) and factory constructor fromBenchmark(...) for creating measurements from benchmark runs. |
| Benchmark Services (Protocol + Implementation): cc-hdrm/Services/BenchmarkServiceProtocol.swift, cc-hdrm/Services/BenchmarkService.swift | New protocol defining precondition validation, benchmark orchestration (with progress callbacks), and cancellation. Implementation validates OAuth/utilization state, iterates models/variants, POSTs to Messages API via DataLoader, forces polling via PollingEngine.performForcedPoll(), computes TPP, includes adaptive retry logic for low utilization delta, and persists results via TPPStorageService. |
| TPP Storage Services (Protocol + Implementation): cc-hdrm/Services/TPPStorageServiceProtocol.swift, cc-hdrm/Services/TPPStorageService.swift | New protocol for storing and retrieving TPPMeasurement instances. Implementation persists measurements to tpp_measurements SQLite table with graceful degradation when database unavailable; supports querying latest measurements by model/variant and retrieving last-benchmark timestamp. |
| Polling Engine Enhancement: cc-hdrm/Services/PollingEngineProtocol.swift, cc-hdrm/Services/PollingEngine.swift | Added performForcedPoll() async method to trigger immediate single poll cycle, enabling benchmark measurement to refresh utilization mid-execution. |
| Preferences: cc-hdrm/Services/PreferencesManagerProtocol.swift, cc-hdrm/Services/PreferencesManager.swift | Added three benchmark-related preference properties (isBenchmarkEnabled, benchmarkModels, benchmarkVariants), persisted in UserDefaults, with fallback defaults and reset-to-defaults support. |
| Database Schema Migration: cc-hdrm/Services/DatabaseManager.swift | Bumped schema version from 6 to 7; added tpp_measurements table creation during fresh schema init and v6→v7 migration, with indexes on timestamp and (model, source). |
| Service Wiring: cc-hdrm/App/AppDelegate.swift | Conditionally instantiates BenchmarkService and TPPStorageService during app initialization; passes them through AnalyticsWindow configuration to AnalyticsView. |
| UI Integration: cc-hdrm/Views/AnalyticsView.swift, cc-hdrm/Views/AnalyticsWindow.swift, cc-hdrm/Views/BenchmarkSectionView.swift, cc-hdrm/Views/SettingsView.swift | Extended AnalyticsView and AnalyticsWindow to accept and pass benchmark services. Added BenchmarkSectionView rendering Measure button, progress/results UI, weighting discovery, and soft rate-limit warnings. Updated SettingsView with benchmark toggle and variant selection controls synced to preferences. |
| Test Suite (Models): cc-hdrmTests/Models/TPPMeasurementTests.swift | Added comprehensive tests for BenchmarkVariant/MeasurementSource enums, TPPMeasurement computed properties, and fromBenchmark(...) factory constructor. |
| Test Suite (Services): cc-hdrmTests/Services/BenchmarkServiceTests.swift, cc-hdrmTests/Services/TPPStorageServiceTests.swift, cc-hdrmTests/Services/DatabaseManagerTests.swift, cc-hdrmTests/Services/PreferencesManagerTests.swift | Added precondition validation tests, benchmark execution tests (with progress/cancellation/decoding scenarios), storage/retrieval roundtrip tests, database schema migration tests (v6→v7 with table/index verification), and preference persistence tests. |
| Test Infrastructure: cc-hdrmTests/App/AppDelegateTests.swift, cc-hdrmTests/Mocks/MockPreferencesManager.swift | Extended mock polling engine with performForcedPoll() tracking; extended MockPreferencesManager with benchmark preference properties and defaults. |
| Documentation & Status: _bmad-output/implementation-artifacts/20-1-active-benchmark-measurement.md, _bmad-output/implementation-artifacts/20-2-claude-code-log-parser-service.md, _bmad-output/implementation-artifacts/sprint-status.yaml | Updated Story 20.1 status to done with code-review completion date (2026-03-27). Reverted Story 20.2 from done back to ready-for-dev, removing review findings and dev-agent records. |
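
The v6→v7 migration summarized in the table can be sketched as a list of SQL statements. Only the table name and the two indexed column sets (timestamp, and model with source) come from the summary above. The remaining columns, the index names, the function name, and the use of PRAGMA user_version to record the schema version are assumptions made for illustration.

```swift
import Foundation

/// Hypothetical sketch: SQL statements for upgrading a version-6 database
/// to version 7. Any other starting version yields no statements here.
func migrationStatements(from oldVersion: Int) -> [String] {
    guard oldVersion == 6 else { return [] }
    return [
        // Column list is illustrative; only timestamp, model, and source
        // are implied by the indexes named in the PR summary.
        """
        CREATE TABLE IF NOT EXISTS tpp_measurements (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp INTEGER NOT NULL,
            model TEXT NOT NULL,
            source TEXT NOT NULL,
            variant TEXT,
            tpp_five_hour REAL,
            tpp_seven_day REAL
        )
        """,
        "CREATE INDEX IF NOT EXISTS idx_tpp_timestamp ON tpp_measurements(timestamp)",
        "CREATE INDEX IF NOT EXISTS idx_tpp_model_source ON tpp_measurements(model, source)",
        // Assumes the schema version is tracked with SQLite's user_version.
        "PRAGMA user_version = 7",
    ]
}
```

Using IF NOT EXISTS keeps the statements safe to run both on a fresh schema and during the upgrade path, which matches the summary's note that the table is created in both places.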

Sequence Diagram

sequenceDiagram
    participant UI as BenchmarkSectionView
    participant BS as BenchmarkService
    participant PE as PollingEngine
    participant API as Messages API
    participant AppState
    participant Storage as TPPStorageService
    participant DB as DatabaseManager

    UI->>BS: validatePreconditions()
    BS->>AppState: check OAuth, utilization, recent activity
    AppState-->>BS: validation result
    BS-->>UI: BenchmarkValidation (e.g., ready)
    
    UI->>BS: runBenchmark(models, variants, onProgress)
    activate BS
    loop for each model & variant
        BS->>UI: onProgress(.sendingRequest)
        BS->>API: POST /messages with variant config
        API-->>BS: MessagesAPIResponse (tokens)
        BS->>UI: onProgress(.polling)
        BS->>PE: performForcedPoll()
        PE->>AppState: refresh utilization
        AppState-->>PE: updated delta
        PE-->>BS: poll complete
        BS->>UI: onProgress(.computingResult)
        BS->>BS: compute TPP from delta & tokens
        BS->>Storage: storeBenchmarkResult(TPPMeasurement)
        Storage->>DB: INSERT into tpp_measurements
        DB-->>Storage: success
        Storage-->>BS: result stored
        BS->>UI: onProgress(.completed) with result
    end
    deactivate BS
    BS-->>UI: [BenchmarkVariantResult]
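
The loop in the diagram above can be sketched in plain Swift. This is a synchronous simplification under stated assumptions: the PR describes performForcedPoll() as async, and every type, closure parameter, and function name below is hypothetical. Closures stand in for the Messages API call, the forced poll, and the storage service, and the progress enum omits the result payload that .completed carries in the diagram.

```swift
import Foundation

// Hypothetical progress phases mirroring the diagram's onProgress calls.
enum BenchmarkProgress: Equatable {
    case sendingRequest, polling, computingResult, completed
}

// Hypothetical result type (not the PR's BenchmarkVariantResult).
struct SketchResult: Equatable {
    let model: String
    let variant: String
    let tokensPerPercent: Double?
}

func runBenchmarkSketch(
    models: [String],
    variants: [String],
    sendRequest: (String, String) -> Int,  // returns tokens used by the request
    currentUtilization: () -> Double,      // percent, as seen after a poll
    store: (SketchResult) -> Void,
    onProgress: (BenchmarkProgress) -> Void
) -> [SketchResult] {
    var results: [SketchResult] = []
    for model in models {
        for variant in variants {
            let before = currentUtilization()
            onProgress(.sendingRequest)
            let tokens = sendRequest(model, variant)
            onProgress(.polling)
            let after = currentUtilization()
            onProgress(.computingResult)
            let delta = after - before
            // No measurable delta means no TPP value for this run.
            let tpp = delta > 0 ? Double(tokens) / delta : nil
            let result = SketchResult(model: model, variant: variant,
                                      tokensPerPercent: tpp)
            store(result)
            results.append(result)
            onProgress(.completed)
        }
    }
    return results
}
```

Passing the collaborators as closures keeps the sketch testable without a network or database, which is the same role the protocols and mocks play in the PR's test suite.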

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hops with joy through token streams,
Measuring benchmarks, fulfilling dreams!
TPP flows, five hours to seven,
Our database schema reaches version heaven.
Progress callbacks and storage so bold,
The Measure button's story fully told! 🎯
