Merged
26 commits
f703493
feat(evals): LLM-driven evaluation harness with todo-api iteration 1
zbigniewsobiecki Apr 7, 2026
f048df6
feat(evals): iteration 2 — symbols stage (LLM-driven definition_metad…
zbigniewsobiecki Apr 7, 2026
e251a8b
refactor(evals): extract iteration helper, split tables.ts, fix cost …
zbigniewsobiecki Apr 7, 2026
e4045d9
fix(evals): production-grade iteration 2 — deterministic across runs
zbigniewsobiecki Apr 7, 2026
91b583e
feat(evals): iteration 3 — relationships stage (LLM-driven relationsh…
zbigniewsobiecki Apr 8, 2026
b4cb8aa
feat(evals): iteration 4 — modules stage (LLM-driven module tree + me…
zbigniewsobiecki Apr 8, 2026
e0668ee
feat(evals): iteration 4.5 — modules-verify regression detector
zbigniewsobiecki Apr 8, 2026
ccd49d6
feat(evals): add themeReference + cohesionRubric strategies for LLM-j…
zbigniewsobiecki Apr 8, 2026
00608b6
refactor(evals): replace iter 2 domain vocabulary with themeReference
zbigniewsobiecki Apr 8, 2026
e7db336
feat(evals): migrate iter 4/4.5 to cohesionRubric (defeats LLM tree-s…
zbigniewsobiecki Apr 8, 2026
fdc6315
fix(evals): force-load .env via dotenv override in vitest setup
zbigniewsobiecki Apr 8, 2026
d18dc6a
fix(evals): stabilize cohesion rubric — boundary-inclusive majority +…
zbigniewsobiecki Apr 8, 2026
50d440a
feat(evals): iteration 3.5 — relationships-verify regression detector
zbigniewsobiecki Apr 8, 2026
3a3e336
feat(evals): iteration 5 contracts + interactionRubric framework
zbigniewsobiecki Apr 8, 2026
e42ca5d
feat(evals): iteration 6 interactions + flow/feature rubric framework
zbigniewsobiecki Apr 8, 2026
40a2895
feat(evals): iterations 6.5 + 6.6 — interactions-validate / interacti…
zbigniewsobiecki Apr 8, 2026
89ad9eb
feat(evals): iteration 7 — flows stage (theme-search flowRubric)
zbigniewsobiecki Apr 8, 2026
b1a526c
feat(evals): iteration 8 — features stage (theme-search featureCohesi…
zbigniewsobiecki Apr 8, 2026
fb67f4e
fix(evals): stabilize Phase 2 rubrics — self-loops minor, broader sta…
zbigniewsobiecki Apr 8, 2026
4d7ac1b
fix(db): write inheritance interaction symbols as JSON array (was bar…
zbigniewsobiecki Apr 10, 2026
8b7ad46
feat(evals): unblock iter 7.5 + iter 8 after squint flows-verify fix
zbigniewsobiecki Apr 10, 2026
90da3d1
feat(evals): add bookstore-api Ruby on Rails eval fixture (7 iterations)
zbigniewsobiecki Apr 10, 2026
aed3001
fix(parser): detect constant-receiver references in Ruby for Zeitwerk…
zbigniewsobiecki Apr 11, 2026
b8e0f70
feat(evals): unblock bookstore-api iters 6-6.6, calibrate GT after pa…
zbigniewsobiecki Apr 11, 2026
997b74b
fix(parser): populate call-site usages for Ruby constant-receiver ref…
zbigniewsobiecki Apr 11, 2026
0338d81
fix: address PR review — dotenv to devDeps, dedup refs, lastIndexOf a…
zbigniewsobiecki Apr 11, 2026
8 changes: 8 additions & 0 deletions .gitignore
@@ -33,3 +33,11 @@ npm-debug.log*

# CASCADE tooling metadata
.cascade-progress-comment-id

# Eval harness — per-run artifacts and judge cache (keep .gitkeep)
evals/results/*
!evals/results/.gitkeep
evals/.judge-cache.json
evals/fixtures/*/node_modules/
evals/fixtures/*/.squint.db
evals/fixtures/*/dist/
1 change: 1 addition & 0 deletions bin/dev.js
@@ -1,5 +1,6 @@
#!/usr/bin/env node

import 'dotenv/config';
import { execute } from '@oclif/core';

await execute({ development: true, dir: import.meta.url });
1 change: 1 addition & 0 deletions bin/run.js
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
#!/usr/bin/env node

import 'dotenv/config';
import { execute } from '@oclif/core';

try {
60 changes: 60 additions & 0 deletions evals/README.md
@@ -0,0 +1,60 @@
# Squint Evaluation Harness

End-to-end evaluation of the squint ingestion pipeline against hand-authored ground truth.

## How it works

1. **Fixture**: a small, real, runnable TypeScript repo at `evals/fixtures/<name>/`
2. **Ground truth**: typed declarative records at `evals/ground-truth/<name>/` describing what squint *should* produce
3. **Harness**: shared code at `evals/harness/` that builds, runs, compares, and reports
4. **Eval test**: `evals/<name>.eval.ts` — a Vitest test that wires it all together
5. **Baseline**: a committed scoreboard at `evals/baselines/<name>.json` tracking progress per stage

## Running

```bash
# Run all evals (costs LLM credits!)
npm run eval

# Run a specific eval
npm run eval -- todo-api.eval.ts

# Run a specific stage's tests within an eval
npm run eval -- todo-api.eval.ts -t "parse stage"

# Watch mode for harness development
npm run eval:watch
```

## Cost guardrails

- All LLM calls are scoped per-stage via `--from-stage`/`--to-stage` — never the full pipeline accidentally
- Per-run cost budget enforced via `EVAL_COST_BUDGET_USD` (default `0.50`)
- Prose-judge results cached at `evals/results/.judge-cache.json` (gitignored)

## Environment variables

| Var | Default | Purpose |
|---|---|---|
| `EVAL_JUDGE_MODEL` | `openrouter:anthropic/claude-haiku-4` | LLM used to score prose similarity |
| `EVAL_COST_BUDGET_USD` | `0.50` | Hard fail if a single run exceeds this |
| `EVAL_RUNS_PER_STAGE` | `1` | Re-run LLM stages N times to detect non-determinism |
| `EVAL_KEEP_ALL` | unset | Keep all historical results instead of rotating |

## Iteration plan

The harness is built up one pipeline stage at a time. Each iteration adds exactly one
LLM stage on top of a known-passing base, so when iteration N fails the bug is in stage N.

See `/home/zbigniew/.claude/plans/validated-sprouting-mochi.md` for the full plan.

| Iter | Stages | Cost/run |
|---|---|---|
| 1 | parse | $0 |
| 2 | + symbols | ~$0.05 |
| 3 | + relationships | ~$0.10 |
| 4 | + modules | ~$0.15 |
| 5 | + contracts | ~$0.20 |
| 6 | + interactions | ~$0.25 |
| 7 | + flows | ~$0.30 |
| 8 | + features | ~$0.35 |
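
The cost guardrails and environment variables described in the README above could be wired up roughly as follows. This is a minimal sketch, not the harness's actual code: `EvalConfig`, `loadEvalConfig`, and `CostTracker` are hypothetical names, and treating any non-empty `EVAL_KEEP_ALL` value as "on" is an assumption. Only the variable names and defaults come from the README table.

```typescript
// Hypothetical sketch: names and structure are illustrative, not from the harness.
interface EvalConfig {
  judgeModel: string;
  costBudgetUsd: number;
  runsPerStage: number;
  keepAll: boolean;
}

// Reads the documented variables from an env-like map, applying the README defaults.
function loadEvalConfig(env: Record<string, string | undefined>): EvalConfig {
  const costBudgetUsd = Number(env.EVAL_COST_BUDGET_USD ?? "0.50");
  const runsPerStage = Number(env.EVAL_RUNS_PER_STAGE ?? "1");

  if (!Number.isFinite(costBudgetUsd) || costBudgetUsd < 0) {
    throw new Error("EVAL_COST_BUDGET_USD must be a non-negative number");
  }
  if (!Number.isInteger(runsPerStage) || runsPerStage < 1) {
    throw new Error("EVAL_RUNS_PER_STAGE must be a positive integer");
  }

  return {
    judgeModel: env.EVAL_JUDGE_MODEL ?? "openrouter:anthropic/claude-haiku-4",
    costBudgetUsd,
    runsPerStage,
    // Assumption: any non-empty value enables keep-all.
    keepAll: env.EVAL_KEEP_ALL !== undefined && env.EVAL_KEEP_ALL !== "",
  };
}

// Accumulates per-call cost and fails hard once the run exceeds its budget.
class CostTracker {
  private spentUsd = 0;
  private readonly budgetUsd: number;

  constructor(budgetUsd: number) {
    this.budgetUsd = budgetUsd;
  }

  add(costUsd: number): void {
    this.spentUsd += costUsd;
    if (this.spentUsd > this.budgetUsd) {
      throw new Error(
        `Eval cost $${this.spentUsd.toFixed(2)} exceeds budget $${this.budgetUsd.toFixed(2)}`
      );
    }
  }

  get total(): number {
    return this.spentUsd;
  }
}
```

In a Node context the map would typically be `process.env`. Throwing from `add` keeps the budget a hard failure, matching the "Hard fail if a single run exceeds this" wording in the table.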
87 changes: 87 additions & 0 deletions evals/baselines/bookstore-api.json
@@ -0,0 +1,87 @@
{
"fixture": "bookstore-api",
"lastRun": "2026-04-11T12:04:05.560Z",
"squintCommit": "b8e0f70",
"tableScores": {
"files": {
"passed": true,
"expected": 18,
"produced": 18,
"critical": 0,
"major": 0,
"minor": 0
},
"definitions": {
"passed": true,
"expected": 97,
"produced": 97,
"critical": 0,
"major": 0,
"minor": 0
},
"imports": {
"passed": true,
"expected": 15,
"produced": 15,
"critical": 0,
"major": 0,
"minor": 0
},
"definition_metadata": {
"passed": true,
"expected": 95,
"produced": 305,
"critical": 0,
"major": 0,
"minor": 0
},
"relationship_annotations": {
"passed": true,
"expected": 9,
"produced": 89,
"critical": 0,
"major": 0,
"minor": 0
},
"module_cohesion": {
"passed": true,
"expected": 11,
"produced": 97,
"critical": 0,
"major": 0,
"minor": 0
},
"contracts": {
"passed": true,
"expected": 11,
"produced": 11,
"critical": 0,
"major": 0,
"minor": 0
},
"interaction_rubric": {
"passed": true,
"expected": 5,
"produced": 24,
"critical": 0,
"major": 0,
"minor": 1
},
"flow_rubric": {
"passed": true,
"expected": 2,
"produced": 19,
"critical": 0,
"major": 0,
"minor": 0
},
"feature_cohesion": {
"passed": true,
"expected": 2,
"produced": 5,
"critical": 0,
"major": 0,
"minor": 0
}
}
}
87 changes: 87 additions & 0 deletions evals/baselines/todo-api.json
@@ -0,0 +1,87 @@
{
"fixture": "todo-api",
"lastRun": "2026-04-10T17:44:42.211Z",
"squintCommit": "8b7ad46",
"tableScores": {
"files": {
"passed": true,
"expected": 14,
"produced": 14,
"critical": 0,
"major": 0,
"minor": 0
},
"definitions": {
"passed": true,
"expected": 50,
"produced": 50,
"critical": 0,
"major": 0,
"minor": 0
},
"imports": {
"passed": true,
"expected": 25,
"produced": 25,
"critical": 0,
"major": 0,
"minor": 0
},
"definition_metadata": {
"passed": true,
"expected": 122,
"produced": 161,
"critical": 0,
"major": 0,
"minor": 0
},
"relationship_annotations": {
"passed": true,
"expected": 35,
"produced": 69,
"critical": 0,
"major": 0,
"minor": 0
},
"module_cohesion": {
"passed": true,
"expected": 12,
"produced": 50,
"critical": 0,
"major": 0,
"minor": 0
},
"contracts": {
"passed": true,
"expected": 11,
"produced": 11,
"critical": 0,
"major": 0,
"minor": 0
},
"interaction_rubric": {
"passed": true,
"expected": 4,
"produced": 25,
"critical": 0,
"major": 0,
"minor": 0
},
"flow_rubric": {
"passed": true,
"expected": 2,
"produced": 14,
"critical": 0,
"major": 0,
"minor": 0
},
"feature_cohesion": {
"passed": true,
"expected": 2,
"produced": 4,
"critical": 0,
"major": 0,
"minor": 0
}
}
}
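
Both baseline files share the same shape, so a regression check against a committed scoreboard might look like the sketch below. The field names mirror the JSON above, but `TableScore`, `Baseline`, and `findRegressions` are hypothetical names, and the comparison rule (flag a table that flips from passed to failed, or whose severity counts grow) is an assumption about how a baseline could be used, not the harness's documented behavior. Note that `produced` may legitimately exceed `expected` while `passed` stays `true` (for example `definition_metadata` in `bookstore-api.json`), so the sketch does not compare those two counts.

```typescript
// Hypothetical types mirroring the baseline JSON files shown above.
interface TableScore {
  passed: boolean;
  expected: number;
  produced: number;
  critical: number;
  major: number;
  minor: number;
}

interface Baseline {
  fixture: string;
  lastRun: string;
  squintCommit: string;
  tableScores: Record<string, TableScore>;
}

const SEVERITIES = ["critical", "major", "minor"] as const;

// Returns human-readable regressions of `current` relative to `baseline`.
// Assumed rule: a table regresses if it disappears, flips from passed to
// failed, or reports more findings at any severity level.
function findRegressions(baseline: Baseline, current: Baseline): string[] {
  const problems: string[] = [];
  for (const [table, base] of Object.entries(baseline.tableScores)) {
    const now = current.tableScores[table];
    if (!now) {
      problems.push(`${table}: missing from current run`);
      continue;
    }
    if (base.passed && !now.passed) {
      problems.push(`${table}: passed -> failed`);
    }
    for (const severity of SEVERITIES) {
      if (now[severity] > base[severity]) {
        problems.push(`${table}: ${severity} ${base[severity]} -> ${now[severity]}`);
      }
    }
  }
  return problems;
}
```

An empty result means the current run is at least as clean as the committed baseline under these assumed rules.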