## Summary
quickbench produces a cryptographically signed `quickbench-report.json` per evaluation run with rich metrics: `scores.accuracy`, `latency.mean`, `latency.p95`, `cost`, and `fairness.demographicParity`. This per-run fidelity is excellent.

**What's missing:** zero cross-run trend analysis. If accuracy drops from 0.90 → 0.84 → 0.77 → 0.70 across four runs, no signal fires. The signed `quickbench-report.json` files accumulate per run, but there is no `RunTrendAnalyzer` that reads them in order and computes a behavioral slope.
## Proposed Addition
```ts
// src/trend.ts
export interface RunPoint {
  timestamp: string;
  accuracy: number;
  latencyMean: number;
  latencyP95: number;
  cost: number;
  fairnessDemographicParity: number;
}

export interface MetricTrend {
  slope: number; // OLS per-run change
  direction: 'improving' | 'degrading' | 'stable';
  anyRegression: boolean;
}

export interface RunTrendReport {
  window: number;
  runs: RunPoint[];
  accuracyTrend: MetricTrend;
  latencyMeanTrend: MetricTrend;
  latencyP95Trend: MetricTrend;
  costTrend: MetricTrend;
  fairnessTrend: MetricTrend; // demographic parity; higher = less fair
  anyRegression: boolean;
}

export class RunTrendAnalyzer {
  constructor(private reportPaths: string[], private window: number = 10) {}
  analyze(): RunTrendReport { /* OLS via simple regression, pure Node.js stdlib */ }
}
```
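For orientation, a consumer (a CI script or a test) would drive the class roughly like this; the paths below are hypothetical:

```ts
import { RunTrendAnalyzer } from './trend';

// Hypothetical report paths; in practice these come from CLI args or a glob.
const paths = [
  'runs/2026-04-01/quickbench-report.json',
  'runs/2026-04-02/quickbench-report.json',
];
const report = new RunTrendAnalyzer(paths, 10).analyze();
if (report.anyRegression) {
  console.error('Behavioral regression detected over the last', report.window, 'runs');
  process.exit(1);
}
```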
OLS slope via pure Node.js (zero new dependencies):
```ts
// Slope of the least-squares line through (i, ys[i]), i.e. the change per run.
function olsSlope(ys: number[]): number {
  const n = ys.length;
  const xs = Array.from({ length: n }, (_, i) => i); // run index as x
  const xMean = (n - 1) / 2;
  const yMean = ys.reduce((a, b) => a + b, 0) / n;
  const num = xs.reduce((s, x, i) => s + (x - xMean) * (ys[i] - yMean), 0);
  const den = xs.reduce((s, x) => s + (x - xMean) ** 2, 0);
  return den === 0 ? 0 : num / den; // den === 0 when n < 2
}
```
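Applied to the accuracy series from the example above, `olsSlope([0.90, 0.84, 0.77, 0.70])` returns -0.067, i.e. accuracy is falling by roughly 0.067 per run. With that helper, `analyze()` reduces to parsing, sorting, windowing, and classifying each metric. Below is a minimal sketch filling in the stub from the skeleton; the nested JSON paths (`r.scores.accuracy`, `r.latency.mean`, and so on) are assumptions inferred from the metric names, not the actual `SignedReport` schema, and the stability threshold is illustrative:

```ts
import { readFileSync } from 'node:fs';

export class RunTrendAnalyzer {
  constructor(private reportPaths: string[], private window: number = 10) {}

  analyze(): RunTrendReport {
    // Parse each signed report, keep only the trended fields, sort by
    // timestamp, and restrict to the most recent `window` runs.
    const runs: RunPoint[] = this.reportPaths
      .map((p) => JSON.parse(readFileSync(p, 'utf8')))
      .map((r) => ({
        timestamp: r.timestamp,                                  // assumed field
        accuracy: r.scores.accuracy,                             // assumed path
        latencyMean: r.latency.mean,                             // assumed path
        latencyP95: r.latency.p95,                               // assumed path
        cost: r.cost,                                            // assumed path
        fairnessDemographicParity: r.fairness.demographicParity, // assumed path
      }))
      .sort((a, b) => a.timestamp.localeCompare(b.timestamp))
      .slice(-this.window);

    // Classify one metric's slope; `higherIsWorse` flips the regression test.
    const trend = (ys: number[], higherIsWorse: boolean): MetricTrend => {
      const slope = olsSlope(ys);
      const eps = 1e-9; // near-zero slopes count as stable (illustrative)
      const degrading = higherIsWorse ? slope > eps : slope < -eps;
      return {
        slope,
        direction: degrading ? 'degrading'
          : Math.abs(slope) <= eps ? 'stable' : 'improving',
        anyRegression: degrading,
      };
    };

    const accuracyTrend = trend(runs.map((r) => r.accuracy), false);
    const latencyMeanTrend = trend(runs.map((r) => r.latencyMean), true);
    const latencyP95Trend = trend(runs.map((r) => r.latencyP95), true);
    const costTrend = trend(runs.map((r) => r.cost), true);
    const fairnessTrend = trend(runs.map((r) => r.fairnessDemographicParity), true);
    const all = [accuracyTrend, latencyMeanTrend, latencyP95Trend, costTrend, fairnessTrend];

    return {
      window: this.window,
      runs,
      accuracyTrend,
      latencyMeanTrend,
      latencyP95Trend,
      costTrend,
      fairnessTrend,
      anyRegression: all.some((t) => t.anyRegression),
    };
  }
}
```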
CLI integration (optional src/index.ts subcommand):
```sh
# Reads existing quickbench-report.json files from dated run dirs
quickbench trend runs/2026-04-*/quickbench-report.json --window 10 --exit-on-regression
```
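One possible wiring for the subcommand follows. The flag parsing is a bare-bones sketch matching the flags above, and it assumes `src/index.ts` dispatches on `process.argv`; quickbench's real entry point may use an argument parser instead:

```ts
import { RunTrendAnalyzer } from './trend';

// Hypothetical dispatch; the shell expands the glob into individual paths.
const [cmd, ...rest] = process.argv.slice(2);
if (cmd === 'trend') {
  const exitOnRegression = rest.includes('--exit-on-regression');
  const wIdx = rest.indexOf('--window');
  const window = wIdx >= 0 ? Number(rest[wIdx + 1]) : 10;
  // Everything that is neither a flag nor the --window value is a report path.
  const paths = rest.filter((a, i) => !a.startsWith('--') && (wIdx < 0 || i !== wIdx + 1));
  const report = new RunTrendAnalyzer(paths, window).analyze();
  console.log(JSON.stringify(report, null, 2));
  if (exitOnRegression && report.anyRegression) process.exit(1);
}
```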
CI gate (exits 1 on regression, composable with existing signReport):
```sh
quickbench trend runs/2026-04-*/quickbench-report.json --exit-on-regression || (echo "Accuracy regression detected" && exit 1)
```
## Implementation Notes
- Reads the existing signed `quickbench-report.json` format; zero schema changes
- Runs are ordered by the `timestamp` field already present in `SignedReport`
- Complementary to cryptographic signing: signing guarantees run integrity; trend analysis detects behavioral degradation across verified runs
- The `anyRegression` flag covers: accuracy ↓, latency ↑, cost ↑, `fairnessDemographicParity` ↑ (higher = less fair)
- Natural home: `src/trend.ts` alongside `src/report.ts`
## Why This Matters
This pattern, per-run metric capture with no cross-run slope analysis, appears consistently across agent evaluation infrastructure. I've documented 98 independent instances of this architectural gap across Python, TypeScript, Go, Rust, and Shell repos in *PDR in Production* (DOI: 10.5281/zenodo.19382408, §7.6.8, "The Evaluator's Blind Spot"). Per-run evaluation tells you what happened; cross-run trend analysis tells you whether the agent is getting better or worse over time. Both layers are needed.
Happy to submit a PR if this direction fits the project.