
feat(report): RunTrendAnalyzer — cross-run accuracy and latency regression detection #1

@nanookclaw


Summary

quickbench produces a cryptographically signed quickbench-report.json per evaluation run with rich metrics: scores.accuracy, latency.mean, latency.p95, cost, and fairness.demographicParity. This per-run fidelity is excellent.

What's missing is any cross-run trend analysis. If accuracy drops from 0.90 → 0.84 → 0.77 → 0.70 across four runs, no signal fires. The signed quickbench-report.json files accumulate run by run, but there is no RunTrendAnalyzer that reads them in chronological order and computes each metric's behavioral slope.

Proposed Addition

// src/trend.ts

export interface RunPoint {
  timestamp: string;
  accuracy: number;
  latencyMean: number;
  latencyP95: number;
  cost: number;
  fairnessDemographicParity: number;
}

export interface MetricTrend {
  slope: number;           // OLS per-run change
  direction: 'improving' | 'degrading' | 'stable';
  anyRegression: boolean;
}

export interface RunTrendReport {
  window: number;
  runs: RunPoint[];
  accuracyTrend: MetricTrend;
  latencyMeanTrend: MetricTrend;
  latencyP95Trend: MetricTrend;
  costTrend: MetricTrend;
  fairnessTrend: MetricTrend;
  anyRegression: boolean;
}

export class RunTrendAnalyzer {
  constructor(private reportPaths: string[], private window: number = 10) {}
  analyze(): RunTrendReport { /* OLS via simple linear regression, pure Node.js stdlib; sketched below */ }
}
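
Slope-to-direction classification could look like the following minimal sketch; the epsilon stability threshold and the higherIsBetter flag are assumptions for illustration, not existing quickbench API:

// Hypothetical helper: map an OLS slope to a MetricTrend.
// higherIsBetter is true for accuracy, false for latency, cost, and the
// demographic-parity gap; epsilon is an assumed stability threshold.
function classify(slope: number, higherIsBetter: boolean, epsilon = 1e-3): MetricTrend {
  const stable = Math.abs(slope) < epsilon;
  const improving = higherIsBetter ? slope > 0 : slope < 0;
  return {
    slope,
    direction: stable ? 'stable' : improving ? 'improving' : 'degrading',
    anyRegression: !stable && !improving,
  };
}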

OLS slope via pure Node.js (zero new dependencies):

// OLS slope of ys against run index 0..n-1; returns 0 for fewer than two runs.
function olsSlope(ys: number[]): number {
  const n = ys.length;
  if (n < 2) return 0;
  const xs = Array.from({ length: n }, (_, i) => i);
  const xMean = (n - 1) / 2;
  const yMean = ys.reduce((a, b) => a + b, 0) / n;
  const num = xs.reduce((s, x, i) => s + (x - xMean) * (ys[i] - yMean), 0);
  const den = xs.reduce((s, x) => s + (x - xMean) ** 2, 0);
  return den === 0 ? 0 : num / den;
}
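
Putting the pieces together, a minimal analyze() sketch. The JSON field paths below assume the report shape listed in the Summary (timestamp, scores.accuracy, latency.mean, latency.p95, cost, fairness.demographicParity) and assume timestamp is an ISO-8601 string, so lexicographic sort equals chronological sort; none of this is confirmed against the actual report module:

import { readFileSync } from 'node:fs';

// Illustrative loader; field paths are assumed from the report shape above.
function loadRunPoint(path: string): RunPoint {
  const r = JSON.parse(readFileSync(path, 'utf8'));
  return {
    timestamp: r.timestamp,
    accuracy: r.scores.accuracy,
    latencyMean: r.latency.mean,
    latencyP95: r.latency.p95,
    cost: r.cost,
    fairnessDemographicParity: r.fairness.demographicParity,
  };
}

// Sketch of RunTrendAnalyzer.analyze(): sort by timestamp, keep the most
// recent `window` runs, regress each metric, and OR the per-metric flags.
function analyzeRuns(reportPaths: string[], window = 10): RunTrendReport {
  const runs = reportPaths
    .map(loadRunPoint)
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp))
    .slice(-window);
  const trend = (pick: (p: RunPoint) => number, higherIsBetter: boolean) =>
    classify(olsSlope(runs.map(pick)), higherIsBetter);
  const accuracyTrend = trend((p) => p.accuracy, true);
  const latencyMeanTrend = trend((p) => p.latencyMean, false);
  const latencyP95Trend = trend((p) => p.latencyP95, false);
  const costTrend = trend((p) => p.cost, false);
  const fairnessTrend = trend((p) => p.fairnessDemographicParity, false);
  const all = [accuracyTrend, latencyMeanTrend, latencyP95Trend, costTrend, fairnessTrend];
  return {
    window,
    runs,
    accuracyTrend,
    latencyMeanTrend,
    latencyP95Trend,
    costTrend,
    fairnessTrend,
    anyRegression: all.some((t) => t.anyRegression),
  };
}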

CLI integration (optional src/index.ts subcommand):

# Reads existing quickbench-report.json files from dated run dirs
quickbench trend runs/2026-04-*/quickbench-report.json --window 10 --exit-on-regression
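
A hypothetical wiring sketch for the subcommand; the flag parsing here is illustrative and assumes the shell expands the glob, so report paths arrive as plain argv entries:

// Hypothetical subcommand handler for `quickbench trend <paths...>`.
function runTrendCommand(argv: string[]): void {
  const exitOnRegression = argv.includes('--exit-on-regression');
  const wIdx = argv.indexOf('--window');
  const window = wIdx >= 0 ? Number(argv[wIdx + 1]) : 10;
  const paths = argv.filter(
    (a, i) => !a.startsWith('--') && (wIdx < 0 || i !== wIdx + 1),
  );
  const report = analyzeRuns(paths, window);
  console.log(JSON.stringify(report, null, 2));
  if (exitOnRegression && report.anyRegression) process.exit(1);
}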

CI gate (exits 1 on regression, composable with existing signReport):

quickbench trend runs/2026-04-*/quickbench-report.json --exit-on-regression \
  || (echo "Regression detected" && exit 1)

Implementation Notes

  • Reads existing signed quickbench-report.json format — zero schema changes
  • Sorted by timestamp field already present in SignedReport
  • Complementary to cryptographic signing: signing guarantees run integrity; trend analysis detects behavioral degradation across verified runs
  • anyRegression flag covers: accuracy ↓, latency ↑, cost ↑, fairnessDemographicParity ↑ (higher = less fair)
  • Natural home: src/trend.ts alongside src/report.ts

Why This Matters

This pattern — per-run metric capture with no cross-run slope analysis — appears consistently across agent evaluation infrastructure. I've documented 98 independent instances of this architectural gap across Python, TypeScript, Go, Rust, and Shell repos in PDR in Production (DOI: 10.5281/zenodo.19382408, §7.6.8 "The Evaluator's Blind Spot"). Per-run evaluation tells you what happened; cross-run trend analysis tells you whether the agent is getting better or worse over time. Both layers are needed.

Happy to submit a PR if this direction fits the project.
