feat(evals): PR4 — property-based assertions, layer hints, drift→0 on both fixtures by zbigniewsobiecki · Pull Request #85 · zbigniewsobiecki/squint

zbigniewsobiecki · 2026-04-11T22:39:54Z

Summary

Replaces brittle prose-similarity grading with structural property checks that ask factual questions about LLM output instead of asking the judge to recognize the GT author's exact phrasing.

Result: both baseline eval fixtures pass cleanly across all 13 iterations:

| Fixture | critical | major | minor | prose checks |
| --- | --- | --- | --- | --- |
| bookstore-api | 0 | 0 | 0 | 80 / 80 |
| todo-api | 0 | 0 | 0 | 141 / 141 |

The canonical bug — Author.domain drifting to ["data-access","user-management"] because the LLM over-indexes on the symbol name — is fixed at the source: Author now produces ["book-management","data-access"] and stays that way.

What changed

PR2 — baseline schema (`evals/harness/reporter/baseline.ts`)

Added proseChecks: { passed, failed } to TableScore so prose drift is regression-tracked alongside structural diffs
updateBaseline reports prose-drift improvements/regressions per stage
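
A minimal sketch of how a per-stage prose-drift delta could be derived from two baseline scores. Only the `proseChecks: { passed, failed }` field on `TableScore` comes from this PR; the other fields, the optional marker, and the `proseDelta` helper are assumptions made for illustration, not the actual `updateBaseline` code.

```typescript
// Hypothetical shapes: only `proseChecks: { passed, failed }` is taken from the PR text.
interface ProseChecks {
  passed: number;
  failed: number;
}

interface TableScore {
  critical: number;
  major: number;
  minor: number;
  // Assumed optional so baselines written before this change still load.
  proseChecks?: ProseChecks;
}

type ProseDelta = 'improved' | 'regressed' | 'unchanged';

// Compare failed prose checks between the stored baseline and the current run.
function proseDelta(previous: TableScore, current: TableScore): ProseDelta {
  const before = previous.proseChecks?.failed ?? 0;
  const after = current.proseChecks?.failed ?? 0;
  if (after < before) return 'improved';
  if (after > before) return 'regressed';
  return 'unchanged';
}
```

Tracking the failed count rather than a pass ratio keeps the comparison stable when the total number of checks changes between runs; that is a design choice of this sketch, not something the PR states.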

PR3 — Ruby reference extractor + inflector (`src/parser/adapters/ruby/`)

Detect ActiveRecord association references (has_many / belongs_to / has_one / has_and_belongs_to_many) so Author.dependencies includes Book even though Zeitwerk autoload means there's no parse-time import
Detect constant-receiver call references (Klass.method) for Zeitwerk apps with zero explicit imports
New inflector.ts wraps the pluralize package
New fixtures: test/fixtures/ruby-rails/ and test/fixtures/ruby-rails-irregular-plurals/
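
As a rough illustration of the association detection described above, the sketch below scans Ruby source for the four ActiveRecord macros and maps each association name to a class-name reference. It is not the extractor from `src/parser/adapters/ruby/`: the regex, the function names, and the tiny `singularize` stand-in (used here instead of the `pluralize` package so the block stays self-contained) are all assumptions.

```typescript
// Stand-in for the `pluralize` package: a hypothetical, deliberately tiny singularizer.
const IRREGULAR_SINGULARS: Record<string, string> = { people: 'person', children: 'child' };

function singularize(word: string): string {
  const irregular = IRREGULAR_SINGULARS[word];
  if (irregular !== undefined) return irregular;
  if (word.endsWith('ies')) return word.slice(0, -3) + 'y';
  if (word.endsWith('s') && !word.endsWith('ss')) return word.slice(0, -1);
  return word;
}

// snake_case -> CamelCase, e.g. line_item -> LineItem.
function camelize(snake: string): string {
  return snake
    .split('_')
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join('');
}

// Matches lines such as `has_many :books` or `belongs_to :publisher`.
const ASSOCIATION = /^\s*(has_many|has_one|belongs_to|has_and_belongs_to_many)\s+:([a-z_]+)/;

// Returns the class names referenced through association macros, without duplicates.
function associationReferences(rubySource: string): string[] {
  const refs = new Set<string>();
  for (const line of rubySource.split('\n')) {
    const match = ASSOCIATION.exec(line);
    if (!match) continue;
    const macro = match[1] ?? '';
    const name = match[2] ?? '';
    // Collection macros take a plural name; the others are already singular.
    const isCollection = macro === 'has_many' || macro === 'has_and_belongs_to_many';
    refs.add(camelize(isCollection ? singularize(name) : name));
  }
  return Array.from(refs);
}
```

This sketch ignores options such as `class_name:`, which a real extractor would likely need to honor.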

PR4/1 — property-based metadata assertions (`evals/harness/comparator/tables/metadata-assertions.ts`)

MetadataAssertion discriminated union with 7 kinds: tag-any-of, tag-none-of, tag-floor, string-contains, string-forbid, concept-fit, regex
evaluateAssertions helper + 25 unit tests
Wired into compareDefinitionMetadata and compareRelationshipAnnotations
Helper builders (assertedDomain / assertedPurpose / assertedRelationship / exactPure) keep migrated GT files one-line per entry
Migrated all ~85 bookstore + ~120 todo definition_metadata entries from proseReference / themeReference / acceptableSet to assertions
Migrated all relationship_annotations entries to assertedRelationship

The Author.domain ground-truth is now:

assertedDomain('app/models/author.rb', 'Author', {
  anyOf: ['author', 'catalog', 'book', 'inventory', 'persistence', 'data', 'orm', 'active'],
  noneOf: ['user-management', 'authentication', 'login', 'password', 'session-management'],
})

This passes any defensible Author tag set and hard-fails the exact bug.
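
To make the property checks concrete, here is a minimal sketch of a discriminated union and evaluator covering four of the seven kinds. It is not the code in `metadata-assertions.ts`: the field names (`anyOf`, `noneOf`, `pattern`), the case-insensitive substring matching, and the name `evaluateAssertionsSketch` are assumptions chosen to be consistent with the Author example above.

```typescript
// Sketch only: field names and matching rules are assumptions, not the PR's schema.
type MetadataAssertion =
  | { kind: 'tag-any-of'; anyOf: string[] }
  | { kind: 'tag-none-of'; noneOf: string[] }
  | { kind: 'string-contains'; anyOf: string[] }
  | { kind: 'regex'; pattern: string };

interface AssertionOutcome {
  passed: boolean;
  failures: string[];
}

// Case-insensitive substring match (assumed semantics).
function matches(haystack: string, needle: string): boolean {
  return haystack.toLowerCase().includes(needle.toLowerCase());
}

function evaluateAssertionsSketch(
  value: string | string[],
  assertions: MetadataAssertion[],
): AssertionOutcome {
  const tags = Array.isArray(value) ? value : [value];
  const text = tags.join(' ');
  const failures: string[] = [];
  for (const a of assertions) {
    switch (a.kind) {
      case 'tag-any-of':
        if (!tags.some((t) => a.anyOf.some((n) => matches(t, n)))) {
          failures.push(`tag-any-of: no tag matched [${a.anyOf.join(', ')}]`);
        }
        break;
      case 'tag-none-of': {
        const banned = tags.filter((t) => a.noneOf.some((n) => matches(t, n)));
        if (banned.length > 0) failures.push(`tag-none-of: forbidden ${banned.join(', ')}`);
        break;
      }
      case 'string-contains':
        if (!a.anyOf.some((n) => matches(text, n))) {
          failures.push(`string-contains: none of [${a.anyOf.join(', ')}] found`);
        }
        break;
      case 'regex':
        if (!new RegExp(a.pattern, 'i').test(text)) {
          failures.push(`regex: /${a.pattern}/ did not match`);
        }
        break;
    }
  }
  return { passed: failures.length === 0, failures };
}

// The Author.domain ground truth from above, expressed in the sketch's shape.
const authorDomain: MetadataAssertion[] = [
  { kind: 'tag-any-of', anyOf: ['author', 'catalog', 'book', 'inventory', 'persistence', 'data', 'orm', 'active'] },
  { kind: 'tag-none-of', noneOf: ['user-management', 'authentication', 'login', 'password', 'session-management'] },
];
```

Under these assumptions, `["book-management","data-access"]` passes both assertions, while `["data-access","user-management"]` satisfies the any-of check but fails the none-of check, which is the behavior the ground truth is meant to enforce.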

PR4/2 — file-path-derived layer hint (`src/commands/llm/_shared/file-layer.ts`)

Maps app/models/, src/controllers/, etc. to short architectural-layer labels rendered in the symbols-stage prompt
Author.domain stops drifting to user-management because the Layer: Rails ActiveRecord model layer line anchors the symbol's identity in persistence rather than letting the LLM over-index on the name
Source code rendered LAST in the prompt so structural context lands first
10 new file-layer.test.ts cases
Pipe-table dependency rendering EXPERIMENT REVERTED — caused 11 regressions because the LLM treated the table as a "use these tags" template; bullet list is less prescriptive
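
The path-to-layer mapping and the prompt ordering can be sketched as follows. Only the `app/models/` and `src/controllers/` prefixes, the "Rails ActiveRecord model layer" label, and the `Layer:` line come from this PR; the rule table, the controller label, the function names, and the overall prompt layout are hypothetical.

```typescript
// Hypothetical rule table; the real file-layer.ts may use a different structure.
const LAYER_RULES: Array<{ pattern: RegExp; label: string }> = [
  { pattern: /(^|\/)app\/models\//, label: 'Rails ActiveRecord model layer' },
  { pattern: /(^|\/)src\/controllers\//, label: 'HTTP controller layer' }, // label is an assumption
];

// Returns a short architectural-layer label, or undefined when no rule applies.
function layerHintForPath(filePath: string): string | undefined {
  const normalized = filePath.replace(/\\/g, '/');
  return LAYER_RULES.find((rule) => rule.pattern.test(normalized))?.label;
}

// Structural context first, source code last (prompt layout is illustrative).
function renderSymbolPrompt(filePath: string, symbolName: string, source: string): string {
  const lines = [`Symbol: ${symbolName}`, `File: ${filePath}`];
  const layer = layerHintForPath(filePath);
  if (layer !== undefined) lines.push(`Layer: ${layer}`);
  lines.push('Source:', source);
  return lines.join('\n');
}
```

Returning `undefined` for unknown paths lets the prompt omit the `Layer:` line rather than emit a guess.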

PR4/1 v5 — calibration after iter-by-iter verification

TasksService.purpose anyOf uses verb stems (creat / updat / delet) plus broad nouns (manage / operation / business / logic) to escape the substring trap discovered when 'create' didn't match 'creating' (the trailing 'e' diverges from the 'i')
router-primitives expectedRole broadened to accept either narrow "HTTP routing primitives" framing or the broader "framework types" framing the LLM legitimately picks
New tripwire test in metadata-assertions.test.ts documenting the substring trap
Substring trap also documented in evals/ground-truth/_shared/assertion-builders.ts
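
The substring trap is easy to reproduce in isolation. The snippet below assumes plain case-insensitive substring matching; `containsAny` and the sample purpose string are hypothetical and only illustrate why verb stems match where full verbs do not.

```typescript
// Hypothetical helper: true when any needle occurs as a substring of the text.
const containsAny = (text: string, needles: string[]): boolean =>
  needles.some((needle) => text.toLowerCase().includes(needle.toLowerCase()));

// Sample LLM output using gerunds (illustrative, not taken from a real run).
const purpose = 'Handles creating, updating, and deleting tasks';

// Full verbs miss: "creating" does not contain "create".
const fullVerbsMatch = containsAny(purpose, ['create', 'update', 'delete']); // false

// Stems hit: "creating" contains "creat".
const stemsMatch = containsAny(purpose, ['creat', 'updat', 'delet']); // true
```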

Test plan

pnpm test — 2551 tests passing (2526 previously, plus 25 new metadata-assertion tests)
pnpm typecheck — clean
pnpm lint — clean
pnpm eval -- bookstore-api.eval.ts — 13/13 iters 0/0/0, 80/80 prose
pnpm eval -- todo-api.eval.ts — 13/13 iters 0/0/0, 141/141 prose
Author.domain produces ["book-management","data-access"] — verified in produced.db from latest run
Bookstore baseline JSON shows 0/0/0 across all 10 stages incl. feature_cohesion
Todo baseline JSON shows 0/0/0 across all 10 stages incl. feature_cohesion

🤖 Generated with Claude Code

… both fixtures

Big-picture: replaces brittle prose-similarity grading with structural property checks that ask factual questions about the LLM output instead of asking the judge to recognize the GT author's exact phrasing.

PR2 (baseline schema):
- Add proseChecks {passed, failed} to TableScore + baseline JSON so prose drift is regression-tracked alongside structural diffs
- 4 new baseline.test.ts cases covering proseChecks deltas

PR3 (Ruby reference extractor + inflector):
- Detect ActiveRecord association references (has_many/belongs_to/has_one/has_and_belongs_to_many) so Author.dependencies includes Book even though Zeitwerk autoload means there's no parse-time import
- Detect constant-receiver call references (Klass.method) for Zeitwerk apps with zero explicit imports
- Inflector wraps the `pluralize` package; tests cover irregular cases
- 5 new ruby-rails fixtures + 1 ruby-rails-irregular-plurals fixture

PR4/1 (property-based assertions + GT migration):
- MetadataAssertion discriminated union: tag-any-of, tag-none-of, tag-floor, string-contains, string-forbid, concept-fit, regex
- evaluateAssertions helper + 25 unit tests in metadata-assertions.test.ts
- Wired into compareDefinitionMetadata + compareRelationshipAnnotations
- assertion-builders.ts: assertedDomain/assertedPurpose/assertedRelationship/exactPure helpers, with the SUBSTRING TRAP documented (verb stems vs gerunds)
- Migrated ALL ~85 bookstore-api + ~120 todo-api definition_metadata entries from proseReference/themeReference/acceptableSet to assertions
- Migrated all relationship_annotations entries to assertedRelationship

PR4/2 (file-path-derived layer hint):
- file-layer.ts maps src/controllers/, app/models/, etc. to short architectural-layer labels rendered in the symbols-stage prompt
- Author.domain stops drifting to user-management because the "Rails ActiveRecord model layer" hint anchors the symbol's identity in persistence rather than letting the LLM over-index on the name
- 10 new file-layer.test.ts cases
- Source code rendered LAST in the prompt so structural context is in front of the model when it answers
- Pipe-table dependency rendering EXPERIMENT REVERTED — caused 11 regressions because the LLM treated the table as a "use these tags" template; bullet list is less prescriptive

PR4/1 v5 (calibration after iter-by-iter verification):
- TasksService.purpose anyOf uses verb stems (creat/updat/delet) plus broad nouns (manage/operation/business/logic) to escape the substring trap discovered when 'create' didn't match 'creating'
- router-primitives expectedRole broadened to accept either narrow "HTTP routing primitives" framing or the broader "framework types and utilities" framing the LLM legitimately picks
- 1 new metadata-assertions.test.ts tripwire documenting the trap

Result:
- bookstore-api 13/13 iterations: 0 critical, 0 major, 0 minor (80/80 prose)
- todo-api 13/13 iterations: 0 critical, 0 major, 0 minor (141/141 prose)
- 2551 unit tests passing, lint + typecheck clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-04-11T22:42:31Z

Codecov Report

❌ Patch coverage is 86.25430% with 40 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/commands/interactions/generate.ts | 0.00% | 40 Missing ⚠️ |

zbigniewsobiecki temporarily deployed to CI April 11, 2026 22:39 — with GitHub Actions Inactive

zbigniewsobiecki merged commit 4a22c7a into dev Apr 11, 2026
2 checks passed

zbigniewsobiecki merged 1 commit into dev from feat/eval-harness
