feat(evals): PR4 — property-based assertions, layer hints, drift→0 on both fixtures#85

Merged
zbigniewsobiecki merged 1 commit into dev from feat/eval-harness
Apr 11, 2026

@zbigniewsobiecki
Owner

Summary

Replaces brittle prose-similarity grading with structural property checks that ask factual questions about LLM output instead of asking the judge to recognize the GT author's exact phrasing.

Result: both baseline eval fixtures pass cleanly across all 13 iterations:

| Fixture       | critical | major | minor | prose checks |
|---------------|----------|-------|-------|--------------|
| bookstore-api | 0        | 0     | 0     | 80 / 80      |
| todo-api      | 0        | 0     | 0     | 141 / 141    |

The canonical bug — Author.domain drifting to ["data-access","user-management"] because the LLM over-indexes on the symbol name — is fixed at the source: Author now produces ["book-management","data-access"] and stays that way.

What changed

PR2 — baseline schema (evals/harness/reporter/baseline.ts)

  • Added proseChecks: { passed, failed } to TableScore so prose drift is regression-tracked alongside structural diffs
  • updateBaseline reports prose-drift improvements/regressions per stage
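
A minimal sketch of how per-stage prose-drift tracking could work. Only `proseChecks: { passed, failed }` on `TableScore` comes from this PR; the other `TableScore` fields, the `classifyProseDrift` and `reportProseDrift` helpers, and the stage-keyed record are assumptions for illustration, not the actual `updateBaseline` code.

```typescript
// Hypothetical shapes: only `proseChecks: { passed, failed }` is named in this PR.
interface ProseChecks {
  passed: number;
  failed: number;
}

interface TableScore {
  critical: number;
  major: number;
  minor: number;
  proseChecks: ProseChecks;
}

type ProseDrift = "improved" | "regressed" | "unchanged";

// Compare a stage's failed-check count against its stored baseline.
function classifyProseDrift(baseline: TableScore, current: TableScore): ProseDrift {
  const delta = current.proseChecks.failed - baseline.proseChecks.failed;
  if (delta < 0) return "improved";
  if (delta > 0) return "regressed";
  return "unchanged";
}

// Report drift per stage, skipping stages that have no baseline entry yet.
function reportProseDrift(
  baseline: Record<string, TableScore>,
  current: Record<string, TableScore>,
): Record<string, ProseDrift> {
  const report: Record<string, ProseDrift> = {};
  for (const stage of Object.keys(current)) {
    const previous = baseline[stage];
    const latest = current[stage];
    if (previous !== undefined && latest !== undefined) {
      report[stage] = classifyProseDrift(previous, latest);
    }
  }
  return report;
}
```

Keying the comparison on the failed count rather than the pass ratio means a stage that gains new checks, all passing, is reported as unchanged rather than improved.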

PR3 — Ruby reference extractor + inflector (src/parser/adapters/ruby/)

  • Detect ActiveRecord association references (has_many / belongs_to / has_one / has_and_belongs_to_many) so Author.dependencies includes Book even though Zeitwerk autoload means there's no parse-time import
  • Detect constant-receiver call references (Klass.method) for Zeitwerk apps with zero explicit imports
  • New inflector.ts wraps the pluralize package
  • New fixtures: test/fixtures/ruby-rails/ and test/fixtures/ruby-rails-irregular-plurals/
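
The two detection rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the adapter's code: it scans raw source with regular expressions, so it will also match inside comments and strings, and `singularize` here is a naive stand-in for the real `inflector.ts`, which wraps the `pluralize` package. The function names are hypothetical.

```typescript
// Naive stand-in for the PR's inflector.ts (which wraps `pluralize`).
const IRREGULAR_SINGULARS = new Map<string, string>([
  ["people", "person"],
  ["children", "child"],
]);

function singularize(word: string): string {
  const irregular = IRREGULAR_SINGULARS.get(word);
  if (irregular !== undefined) return irregular;
  if (word.endsWith("ies")) return word.slice(0, -3) + "y";
  if (word.endsWith("s") && !word.endsWith("ss")) return word.slice(0, -1);
  return word;
}

// snake_case -> CamelCase, e.g. "line_item" -> "LineItem".
function camelize(snake: string): string {
  return snake
    .split("_")
    .filter((part) => part.length > 0)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
}

// Collect class names referenced via association macros and Klass.method calls.
function extractRubyReferences(source: string): string[] {
  const refs: string[] = [];
  const add = (name: string): void => {
    if (name.length > 0 && refs.indexOf(name) === -1) refs.push(name);
  };

  // Rule 1: ActiveRecord association macros followed by a symbol argument.
  const association =
    /^\s*(has_many|belongs_to|has_one|has_and_belongs_to_many)\s+:([a-z_][a-z0-9_]*)/gm;
  let match: RegExpExecArray | null;
  while ((match = association.exec(source)) !== null) {
    const macro = match[1] ?? "";
    const name = match[2] ?? "";
    // Only the plural macros take a plural symbol that needs singularizing.
    const plural = macro === "has_many" || macro === "has_and_belongs_to_many";
    add(camelize(plural ? singularize(name) : name));
  }

  // Rule 2: constant-receiver calls such as `Book.where(...)`.
  const constantCall = /\b([A-Z][A-Za-z0-9]*)\.[a-z_]/g;
  while ((match = constantCall.exec(source)) !== null) {
    add(match[1] ?? "");
  }
  return refs;
}
```

For a model containing `has_many :books`, the sketch yields `Book` even though the file has no `require` or import for it, which is the gap that Zeitwerk autoloading leaves for a parse-time import scan.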

PR4/1 — property-based metadata assertions (evals/harness/comparator/tables/metadata-assertions.ts)

  • MetadataAssertion discriminated union with 7 kinds: tag-any-of, tag-none-of, tag-floor, string-contains, string-forbid, concept-fit, regex
  • evaluateAssertions helper + 25 unit tests
  • Wired into compareDefinitionMetadata and compareRelationshipAnnotations
  • Helper builders (assertedDomain / assertedPurpose / assertedRelationship / exactPure) keep migrated GT files one-line per entry
  • Migrated all ~85 bookstore + ~120 todo definition_metadata entries from proseReference / themeReference / acceptableSet to assertions
  • Migrated all relationship_annotations entries to assertedRelationship
  • The Author.domain ground-truth is now:
    assertedDomain('app/models/author.rb', 'Author', {
      anyOf: ['author', 'catalog', 'book', 'inventory', 'persistence', 'data', 'orm', 'active'],
      noneOf: ['user-management', 'authentication', 'login', 'password', 'session-management'],
    })
    This passes any defensible Author tag set and hard-fails the exact bug.
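
To make the evaluation model concrete, here is a minimal sketch of a discriminated union and evaluator covering five of the seven kinds (`tag-floor` and `concept-fit` are omitted because their semantics are not described here). The field names (`anyOf`, `noneOf`, `pattern`), the `AssertionFailure` shape, and the case-insensitive substring matching are assumptions; the real `evaluateAssertions` signature may differ.

```typescript
// Hypothetical field names; this PR lists the kinds but not their shapes.
type MetadataAssertion =
  | { kind: "tag-any-of"; anyOf: string[] }
  | { kind: "tag-none-of"; noneOf: string[] }
  | { kind: "string-contains"; anyOf: string[] }
  | { kind: "string-forbid"; noneOf: string[] }
  | { kind: "regex"; pattern: string };

interface AssertionFailure {
  kind: MetadataAssertion["kind"];
  reason: string;
}

// Return the first needle found as a case-insensitive substring of any value.
function firstMatch(values: string[], needles: string[]): string | undefined {
  for (const needle of needles) {
    const lowered = needle.toLowerCase();
    if (values.some((value) => value.toLowerCase().includes(lowered))) return needle;
  }
  return undefined;
}

// Evaluate every assertion; an empty result means the value passes.
function evaluateAssertions(
  value: string | string[],
  assertions: MetadataAssertion[],
): AssertionFailure[] {
  const values = Array.isArray(value) ? value : [value];
  const failures: AssertionFailure[] = [];
  for (const assertion of assertions) {
    switch (assertion.kind) {
      case "tag-any-of":
      case "string-contains":
        if (firstMatch(values, assertion.anyOf) === undefined) {
          failures.push({ kind: assertion.kind, reason: "no expected term found" });
        }
        break;
      case "tag-none-of":
      case "string-forbid": {
        const hit = firstMatch(values, assertion.noneOf);
        if (hit !== undefined) {
          failures.push({ kind: assertion.kind, reason: `forbidden term "${hit}" found` });
        }
        break;
      }
      case "regex": {
        const re = new RegExp(assertion.pattern);
        if (!values.some((candidate) => re.test(candidate))) {
          failures.push({ kind: assertion.kind, reason: `no value matches /${assertion.pattern}/` });
        }
        break;
      }
    }
  }
  return failures;
}

// The Author.domain ground truth from this PR, expressed in the sketch's shapes.
const authorDomain: MetadataAssertion[] = [
  { kind: "tag-any-of", anyOf: ["author", "catalog", "book", "inventory", "persistence", "data", "orm", "active"] },
  { kind: "tag-none-of", noneOf: ["user-management", "authentication", "login", "password", "session-management"] },
];

const fixedRun = evaluateAssertions(["book-management", "data-access"], authorDomain);
const driftedRun = evaluateAssertions(["data-access", "user-management"], authorDomain);
```

Under these assumptions `fixedRun` is empty, while `driftedRun` contains a single `tag-none-of` failure: the drifted tag set still satisfies `anyOf` through `data-access`, so it is the `noneOf` list that catches the bug.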

PR4/2 — file-path-derived layer hint (src/commands/llm/_shared/file-layer.ts)

  • Maps app/models/, src/controllers/, etc. to short architectural-layer labels rendered in the symbols-stage prompt
  • Author.domain stops drifting to user-management because the Layer: Rails ActiveRecord model layer line anchors the symbol's identity in persistence rather than letting the LLM over-index on the name
  • Source code rendered LAST in the prompt so structural context lands first
  • 10 new file-layer.test.ts cases
  • Pipe-table dependency rendering EXPERIMENT REVERTED — caused 11 regressions because the LLM treated the table as a "use these tags" template; bullet list is less prescriptive
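
A rough sketch of the layer hint and prompt ordering described above. Only the `Rails ActiveRecord model layer` label and the `Layer:` line come from this PR; the other labels, the `layerForPath` and `renderSymbolPrompt` names, and the prompt field layout are hypothetical.

```typescript
// Only the app/models/ label text appears in this PR; the rest are invented examples.
const LAYER_RULES: Array<{ segment: string; label: string }> = [
  { segment: "app/models/", label: "Rails ActiveRecord model layer" },
  { segment: "app/controllers/", label: "Rails controller layer" },
  { segment: "src/controllers/", label: "HTTP controller layer" },
];

// Map a file path to a short architectural-layer label, if any rule applies.
function layerForPath(filePath: string): string | undefined {
  const normalized = filePath.replace(/\\/g, "/");
  for (const rule of LAYER_RULES) {
    if (normalized.startsWith(rule.segment) || normalized.includes("/" + rule.segment)) {
      return rule.label;
    }
  }
  return undefined;
}

interface SymbolPromptInput {
  name: string;
  filePath: string;
  dependencies: string[];
  source: string;
}

// Structural context first, dependencies as a bullet list, source code last.
function renderSymbolPrompt(input: SymbolPromptInput): string {
  const lines: string[] = [`Symbol: ${input.name}`, `File: ${input.filePath}`];
  const layer = layerForPath(input.filePath);
  if (layer !== undefined) lines.push(`Layer: ${layer}`);
  if (input.dependencies.length > 0) {
    lines.push("Dependencies:");
    for (const dep of input.dependencies) lines.push(`- ${dep}`);
  }
  lines.push("Source:", input.source);
  return lines.join("\n");
}

const authorPrompt = renderSymbolPrompt({
  name: "Author",
  filePath: "app/models/author.rb",
  dependencies: ["Book"],
  source: "class Author < ApplicationRecord\n  has_many :books\nend",
});
```

In `authorPrompt` the `Layer:` line precedes both the dependency bullets and the source, so the model reads the persistence framing before it sees the symbol's code.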

PR4/1 v5 — calibration after iter-by-iter verification

  • TasksService.purpose anyOf uses verb stems (creat / updat / delet) plus broad nouns (manage / operation / business / logic) to escape the substring trap, discovered when the term 'create' failed to match 'creating': 'create' is not a substring of 'creating' because the gerund drops the trailing 'e'
  • router-primitives expectedRole broadened to accept either the narrow "HTTP routing primitives" framing or the broader "framework types and utilities" framing the LLM legitimately picks
  • New tripwire test in metadata-assertions.test.ts documenting the substring trap
  • Substring trap also documented in evals/ground-truth/_shared/assertion-builders.ts
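
The substring trap is easy to reproduce in isolation. The sketch below assumes `anyOf` terms are matched as case-insensitive substrings; `matchesAnyOf` and the sample purpose string are hypothetical, not taken from the tripwire test.

```typescript
// Assumption: anyOf terms are matched as case-insensitive substrings.
function matchesAnyOf(text: string, terms: string[]): boolean {
  const haystack = text.toLowerCase();
  return terms.some((term) => haystack.includes(term.toLowerCase()));
}

// Hypothetical LLM output that phrases the purpose with gerunds.
const purpose = "Handles creating, updating, and deleting tasks";

// Full verbs miss: "creating" does not contain "create", and so on.
const fullVerbsMatch = matchesAnyOf(purpose, ["create", "update", "delete"]); // false

// Verb stems hit: "creating" contains "creat".
const verbStemsMatch = matchesAnyOf(purpose, ["creat", "updat", "delet"]); // true
```

Stems trade a little precision for recall, which is why pairing them with a `noneOf` list matters: the stems keep legitimate phrasings passing, and the forbidden terms still catch the actual bug.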

Test plan

  • pnpm test — 2551 tests passing (2526 existing + 25 new metadata-assertion tests)
  • pnpm typecheck — clean
  • pnpm lint — clean
  • pnpm eval -- bookstore-api.eval.ts — 13/13 iters 0/0/0, 80/80 prose
  • pnpm eval -- todo-api.eval.ts — 13/13 iters 0/0/0, 141/141 prose
  • Author.domain produces ["book-management","data-access"] — verified in produced.db from latest run
  • Bookstore baseline JSON shows 0/0/0 across all 10 stages incl. feature_cohesion
  • Todo baseline JSON shows 0/0/0 across all 10 stages incl. feature_cohesion

🤖 Generated with Claude Code

… both fixtures

Big-picture: replaces brittle prose-similarity grading with structural
property checks that ask factual questions about the LLM output instead
of asking the judge to recognize the GT author's exact phrasing.

PR2 (baseline schema):
  - Add proseChecks {passed, failed} to TableScore + baseline JSON so
    prose drift is regression-tracked alongside structural diffs
  - 4 new baseline.test.ts cases covering proseChecks deltas

PR3 (Ruby reference extractor + inflector):
  - Detect ActiveRecord association references (has_many/belongs_to/
    has_one/has_and_belongs_to_many) so Author.dependencies includes
    Book even though Zeitwerk autoload means there's no parse-time import
  - Detect constant-receiver call references (Klass.method) for Zeitwerk
    apps with zero explicit imports
  - Inflector wraps the `pluralize` package; tests cover irregular cases
  - 5 new ruby-rails fixtures + 1 ruby-rails-irregular-plurals fixture

PR4/1 (property-based assertions + GT migration):
  - MetadataAssertion discriminated union: tag-any-of, tag-none-of,
    tag-floor, string-contains, string-forbid, concept-fit, regex
  - evaluateAssertions helper + 25 unit tests in metadata-assertions.test.ts
  - Wired into compareDefinitionMetadata + compareRelationshipAnnotations
  - assertion-builders.ts: assertedDomain/assertedPurpose/
    assertedRelationship/exactPure helpers, with the SUBSTRING TRAP
    documented (verb stems vs gerunds)
  - Migrated ALL ~85 bookstore-api + ~120 todo-api definition_metadata
    entries from proseReference/themeReference/acceptableSet to assertions
  - Migrated all relationship_annotations entries to assertedRelationship

PR4/2 (file-path-derived layer hint):
  - file-layer.ts maps src/controllers/, app/models/, etc. to short
    architectural-layer labels rendered in the symbols-stage prompt
  - Author.domain stops drifting to user-management because the
    "Rails ActiveRecord model layer" hint anchors the symbol's identity
    in persistence rather than letting the LLM over-index on the name
  - 10 new file-layer.test.ts cases
  - Source code rendered LAST in the prompt so structural context is
    in front of the model when it answers
  - Pipe-table dependency rendering EXPERIMENT REVERTED — caused 11
    regressions because the LLM treated the table as a "use these tags"
    template; bullet list is less prescriptive

PR4/1 v5 (calibration after iter-by-iter verification):
  - TasksService.purpose anyOf uses verb stems (creat/updat/delet)
    plus broad nouns (manage/operation/business/logic) to escape the
    substring trap discovered when 'create' didn't match 'creating'
  - router-primitives expectedRole broadened to accept either narrow
    "HTTP routing primitives" framing or the broader "framework types
    and utilities" framing the LLM legitimately picks
  - 1 new metadata-assertions.test.ts tripwire documenting the trap

Result:
  - bookstore-api 13/13 iterations: 0 critical, 0 major, 0 minor (80/80 prose)
  - todo-api    13/13 iterations: 0 critical, 0 major, 0 minor (141/141 prose)
  - 2551 unit tests passing, lint + typecheck clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter

Codecov Report

❌ Patch coverage is 86.25430% with 40 lines in your changes missing coverage. Please review.

| Files with missing lines              | Patch % | Lines         |
|---------------------------------------|---------|---------------|
| src/commands/interactions/generate.ts | 0.00%   | 40 Missing ⚠️ |

@zbigniewsobiecki merged commit 4a22c7a into dev on Apr 11, 2026
2 checks passed