## Summary
quickbench produces a cryptographically signed `quickbench-report.json` per evaluation run with rich metrics: `scores.accuracy`, `latency.mean`, `latency.p95`, `cost`, and `fairness.demographicParity`. This per-run fidelity is excellent.

**What's missing:** zero cross-run trend analysis. If accuracy drops from 0.90 → 0.84 → 0.77 → 0.70 across four runs, no signal fires. The signed `quickbench-report.json` files accumulate per run, but there is no `RunTrendAnalyzer` that reads them in order and computes a behavioral slope.
## Proposed Addition
```ts
// src/trend.ts
export interface RunPoint {
  timestamp: string;
  accuracy: number;
  latencyMean: number;
  latencyP95: number;
  cost: number;
  fairnessDemographicParity: number;
}

export interface MetricTrend {
  slope: number; // OLS per-run change
  direction: 'improving' | 'degrading' | 'stable';
  anyRegression: boolean;
}

export interface RunTrendReport {
  window: number;
  runs: RunPoint[];
  accuracyTrend: MetricTrend;
  latencyMeanTrend: MetricTrend;
  latencyP95Trend: MetricTrend;
  costTrend: MetricTrend;
  fairnessTrend: MetricTrend; // demographic parity; higher = less fair
  anyRegression: boolean;
}

export class RunTrendAnalyzer {
  constructor(private reportPaths: string[], private window: number = 10) {}
  analyze(): RunTrendReport { /* OLS via simple regression, pure Node.js stdlib */ }
}
```
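For orientation, a consumer (a CI script or a test) would drive the class roughly like this; the paths below are hypothetical:

```ts
import { RunTrendAnalyzer } from './trend';

// Hypothetical report paths; in practice these come from CLI args or a glob.
const paths = [
  'runs/2026-04-01/quickbench-report.json',
  'runs/2026-04-02/quickbench-report.json',
];
const report = new RunTrendAnalyzer(paths, 10).analyze();
if (report.anyRegression) {
  console.error('Behavioral regression detected over the last', report.window, 'runs');
  process.exit(1);
}
```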
OLS slope via pure Node.js (zero new dependencies):
```ts
// Slope of the least-squares line through (i, ys[i]), i.e. the change per run.
function olsSlope(ys: number[]): number {
  const n = ys.length;
  const xs = Array.from({ length: n }, (_, i) => i); // run index as x
  const xMean = (n - 1) / 2;
  const yMean = ys.reduce((a, b) => a + b, 0) / n;
  const num = xs.reduce((s, x, i) => s + (x - xMean) * (ys[i] - yMean), 0);
  const den = xs.reduce((s, x) => s + (x - xMean) ** 2, 0);
  return den === 0 ? 0 : num / den; // den === 0 when n < 2
}
```
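Applied to the accuracy series from the example above, `olsSlope([0.90, 0.84, 0.77, 0.70])` returns -0.067, i.e. accuracy is falling by roughly 0.067 per run. With that helper, `analyze()` reduces to parsing, sorting, windowing, and classifying each metric. Below is a minimal sketch filling in the stub from the skeleton; the nested JSON paths (`r.scores.accuracy`, `r.latency.mean`, and so on) are assumptions inferred from the metric names, not the actual `SignedReport` schema, and the stability threshold is illustrative:

```ts
import { readFileSync } from 'node:fs';

export class RunTrendAnalyzer {
  constructor(private reportPaths: string[], private window: number = 10) {}

  analyze(): RunTrendReport {
    // Parse each signed report, keep only the trended fields, sort by
    // timestamp, and restrict to the most recent `window` runs.
    const runs: RunPoint[] = this.reportPaths
      .map((p) => JSON.parse(readFileSync(p, 'utf8')))
      .map((r) => ({
        timestamp: r.timestamp,                                  // assumed field
        accuracy: r.scores.accuracy,                             // assumed path
        latencyMean: r.latency.mean,                             // assumed path
        latencyP95: r.latency.p95,                               // assumed path
        cost: r.cost,                                            // assumed path
        fairnessDemographicParity: r.fairness.demographicParity, // assumed path
      }))
      .sort((a, b) => a.timestamp.localeCompare(b.timestamp))
      .slice(-this.window);

    // Classify one metric's slope; `higherIsWorse` flips the regression test.
    const trend = (ys: number[], higherIsWorse: boolean): MetricTrend => {
      const slope = olsSlope(ys);
      const eps = 1e-9; // near-zero slopes count as stable (illustrative)
      const degrading = higherIsWorse ? slope > eps : slope < -eps;
      return {
        slope,
        direction: degrading ? 'degrading'
          : Math.abs(slope) <= eps ? 'stable' : 'improving',
        anyRegression: degrading,
      };
    };

    const accuracyTrend = trend(runs.map((r) => r.accuracy), false);
    const latencyMeanTrend = trend(runs.map((r) => r.latencyMean), true);
    const latencyP95Trend = trend(runs.map((r) => r.latencyP95), true);
    const costTrend = trend(runs.map((r) => r.cost), true);
    const fairnessTrend = trend(runs.map((r) => r.fairnessDemographicParity), true);
    const all = [accuracyTrend, latencyMeanTrend, latencyP95Trend, costTrend, fairnessTrend];

    return {
      window: this.window,
      runs,
      accuracyTrend,
      latencyMeanTrend,
      latencyP95Trend,
      costTrend,
      fairnessTrend,
      anyRegression: all.some((t) => t.anyRegression),
    };
  }
}
```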
CLI integration (optional src/index.ts subcommand):
```sh
# Reads existing quickbench-report.json files from dated run dirs
quickbench trend runs/2026-04-*/quickbench-report.json --window 10 --exit-on-regression
```
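One possible wiring for the subcommand follows. The flag parsing is a bare-bones sketch matching the flags above, and it assumes `src/index.ts` dispatches on `process.argv`; quickbench's real entry point may use an argument parser instead:

```ts
import { RunTrendAnalyzer } from './trend';

// Hypothetical dispatch; the shell expands the glob into individual paths.
const [cmd, ...rest] = process.argv.slice(2);
if (cmd === 'trend') {
  const exitOnRegression = rest.includes('--exit-on-regression');
  const wIdx = rest.indexOf('--window');
  const window = wIdx >= 0 ? Number(rest[wIdx + 1]) : 10;
  // Everything that is neither a flag nor the --window value is a report path.
  const paths = rest.filter((a, i) => !a.startsWith('--') && (wIdx < 0 || i !== wIdx + 1));
  const report = new RunTrendAnalyzer(paths, window).analyze();
  console.log(JSON.stringify(report, null, 2));
  if (exitOnRegression && report.anyRegression) process.exit(1);
}
```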
CI gate (exits 1 on regression, composable with existing signReport):
```sh
quickbench trend runs/2026-04-*/quickbench-report.json --exit-on-regression || (echo "Accuracy regression detected" && exit 1)
```
## Implementation Notes
- Reads the existing signed `quickbench-report.json` format; zero schema changes
- Runs are ordered by the `timestamp` field already present in `SignedReport`
- Complementary to cryptographic signing: signing guarantees run integrity; trend analysis detects behavioral degradation across verified runs
- The `anyRegression` flag covers: accuracy ↓, latency ↑, cost ↑, `fairnessDemographicParity` ↑ (higher = less fair)
- Natural home: `src/trend.ts` alongside `src/report.ts`
## Why This Matters
This pattern, per-run metric capture with no cross-run slope analysis, appears consistently across agent evaluation infrastructure. I've documented 98 independent instances of this architectural gap across Python, TypeScript, Go, Rust, and Shell repos in *PDR in Production* (DOI: 10.5281/zenodo.19382408, §7.6.8, "The Evaluator's Blind Spot"). Per-run evaluation tells you what happened; cross-run trend analysis tells you whether the agent is getting better or worse over time. Both layers are needed.
Happy to submit a PR if this direction fits the project.