feat(evals): PR4 — property-based assertions, layer hints, drift→0 on both fixtures#85
Merged
zbigniewsobiecki merged 1 commit into dev on Apr 11, 2026
… both fixtures
Big-picture: replaces brittle prose-similarity grading with structural
property checks that ask factual questions about the LLM output instead
of asking the judge to recognize the GT author's exact phrasing.
PR2 (baseline schema):
- Add proseChecks {passed, failed} to TableScore + baseline JSON so
prose drift is regression-tracked alongside structural diffs
- 4 new baseline.test.ts cases covering proseChecks deltas
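To make the regression tracking above concrete, here is a minimal sketch of how a prose-check delta between a stored baseline and a fresh run might be classified. Only the `proseChecks: { passed, failed }` shape comes from this PR; the `table` field, the `proseDrift` function, and the sample counts are hypothetical and are not taken from baseline.ts.

```typescript
// Hypothetical shapes: only `proseChecks: { passed, failed }` is named in the PR.
interface ProseChecks {
  passed: number;
  failed: number;
}

interface TableScore {
  table: string; // assumed field, for illustration only
  proseChecks: ProseChecks;
}

type ProseDrift = "improved" | "regressed" | "unchanged";

// Classify drift by comparing the number of failed prose checks.
function proseDrift(baseline: TableScore, current: TableScore): ProseDrift {
  const delta = current.proseChecks.failed - baseline.proseChecks.failed;
  if (delta > 0) return "regressed";
  if (delta < 0) return "improved";
  return "unchanged";
}

// Illustrative numbers, not results from the eval runs.
const storedBaseline: TableScore = {
  table: "definition_metadata",
  proseChecks: { passed: 78, failed: 2 },
};
const latestRun: TableScore = {
  table: "definition_metadata",
  proseChecks: { passed: 80, failed: 0 },
};

console.log(proseDrift(storedBaseline, latestRun)); // "improved"
```

Comparing only the `failed` count keeps the classification stable when the total number of checks grows between runs; a real reporter may well weigh `passed` too.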
PR3 (Ruby reference extractor + inflector):
- Detect ActiveRecord association references (has_many/belongs_to/
has_one/has_and_belongs_to_many) so Author.dependencies includes
Book even though Zeitwerk autoload means there's no parse-time import
- Detect constant-receiver call references (Klass.method) for Zeitwerk
apps with zero explicit imports
- Inflector wraps the `pluralize` package; tests cover irregular cases
- 5 new ruby-rails fixtures + 1 ruby-rails-irregular-plurals fixture
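The association detection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the extractor in this PR: it uses a regular expression instead of a real Ruby parser, the `singularize` function is a tiny hand-rolled stand-in for the `pluralize` package, and `associationReferences`, `camelize`, and the `Publisher` association are hypothetical names.

```typescript
// Stand-in for the `pluralize` package: a few irregulars plus naive suffix rules.
const IRREGULAR_SINGULARS: { [plural: string]: string } = {
  people: "person",
  children: "child",
};

function singularize(word: string): string {
  if (Object.prototype.hasOwnProperty.call(IRREGULAR_SINGULARS, word)) {
    return IRREGULAR_SINGULARS[word];
  }
  if (/ies$/.test(word)) return word.replace(/ies$/, "y");
  if (/s$/.test(word) && !/ss$/.test(word)) return word.replace(/s$/, "");
  return word;
}

// snake_case -> CamelCase, e.g. "line_item" -> "LineItem".
function camelize(snake: string): string {
  return snake
    .split("_")
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
}

// Collect class names referenced through ActiveRecord association macros.
function associationReferences(rubySource: string): string[] {
  const pattern =
    /^\s*(has_many|belongs_to|has_one|has_and_belongs_to_many)\s+:(\w+)/gm;
  const refs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(rubySource)) !== null) {
    const macro = match[1];
    const name = match[2];
    // Collection macros take a plural name; the others are already singular.
    const isCollection =
      macro === "has_many" || macro === "has_and_belongs_to_many";
    const className = camelize(isCollection ? singularize(name) : name);
    if (refs.indexOf(className) === -1) refs.push(className);
  }
  return refs;
}

const authorModel = [
  "class Author < ApplicationRecord",
  "  has_many :books",
  "  belongs_to :publisher",
  "end",
].join("\n");

console.log(associationReferences(authorModel)); // ["Book", "Publisher"]
```

The point of the inflection step is that `has_many :books` names the class only indirectly; without singularizing and camelizing, `Book` would never appear among the dependencies of `Author`.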
PR4/1 (property-based assertions + GT migration):
- MetadataAssertion discriminated union: tag-any-of, tag-none-of,
tag-floor, string-contains, string-forbid, concept-fit, regex
- evaluateAssertions helper + 25 unit tests in metadata-assertions.test.ts
- Wired into compareDefinitionMetadata + compareRelationshipAnnotations
- assertion-builders.ts: assertedDomain/assertedPurpose/
assertedRelationship/exactPure helpers, with the SUBSTRING TRAP
documented (verb stems vs gerunds)
- Migrated ALL ~85 bookstore-api + ~120 todo-api definition_metadata
entries from proseReference/themeReference/acceptableSet to assertions
- Migrated all relationship_annotations entries to assertedRelationship
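As an illustration of the property-based approach, the sketch below models a subset of the assertion kinds as a discriminated union and evaluates them against a produced value. Only the `kind` strings come from this PR; the field names (`tags`, `needles`, `pattern`), the function name `evaluateAssertionsSketch`, and the sample assertions are assumptions. `tag-floor` and `concept-fit` are left out because their semantics are not spelled out here.

```typescript
// Hypothetical field names; only the `kind` values are taken from the PR text.
type MetadataAssertion =
  | { kind: "tag-any-of"; tags: string[] }
  | { kind: "tag-none-of"; tags: string[] }
  | { kind: "string-contains"; needles: string[] }
  | { kind: "string-forbid"; needles: string[] }
  | { kind: "regex"; pattern: string };

interface AssertionResult {
  kind: MetadataAssertion["kind"];
  passed: boolean;
}

// Evaluate each assertion independently against a tag list or a prose string.
function evaluateAssertionsSketch(
  value: string | string[],
  assertions: MetadataAssertion[],
): AssertionResult[] {
  const tags = Array.isArray(value) ? value : [];
  const text = (Array.isArray(value) ? value.join(" ") : value).toLowerCase();
  const hasTag = (t: string) => tags.indexOf(t) !== -1;
  const hasText = (n: string) => text.indexOf(n.toLowerCase()) !== -1;

  return assertions.map((a): AssertionResult => {
    switch (a.kind) {
      case "tag-any-of":
        return { kind: a.kind, passed: a.tags.some(hasTag) };
      case "tag-none-of":
        return { kind: a.kind, passed: !a.tags.some(hasTag) };
      case "string-contains":
        return { kind: a.kind, passed: a.needles.some(hasText) };
      case "string-forbid":
        return { kind: a.kind, passed: !a.needles.some(hasText) };
      case "regex":
        return { kind: a.kind, passed: new RegExp(a.pattern, "i").test(text) };
    }
  });
}

// Illustrative assertions for a domain field; not the actual ground truth.
const domainAssertions: MetadataAssertion[] = [
  { kind: "tag-any-of", tags: ["data-access"] },
  { kind: "tag-none-of", tags: ["user-management"] },
];

const stable = evaluateAssertionsSketch(
  ["book-management", "data-access"],
  domainAssertions,
);
const drifted = evaluateAssertionsSketch(
  ["data-access", "user-management"],
  domainAssertions,
);
console.log(stable.every((r) => r.passed)); // true
console.log(drifted.every((r) => r.passed)); // false
```

Each assertion asks one factual question about the output, so a failure points at a specific property rather than at a vague similarity score.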
PR4/2 (file-path-derived layer hint):
- file-layer.ts maps src/controllers/, app/models/, etc. to short
architectural-layer labels rendered in the symbols-stage prompt
- Author.domain stops drifting to user-management because the
"Rails ActiveRecord model layer" hint anchors the symbol's identity
in persistence rather than letting the LLM over-index on the name
- 10 new file-layer.test.ts cases
- Source code rendered LAST in the prompt so structural context is
in front of the model when it answers
- Pipe-table dependency rendering EXPERIMENT REVERTED — caused 11
regressions because the LLM treated the table as a "use these tags"
template; bullet list is less prescriptive
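A path-to-layer mapping of the kind described above might look like the following sketch. The rule table is hypothetical: only the "Rails ActiveRecord model layer" label and the `Layer:` prompt line appear in this PR, and the other labels, the `fileLayer` and `layerLine` names, and the prefix-matching strategy are assumptions rather than the contents of file-layer.ts.

```typescript
interface LayerRule {
  prefix: string;
  label: string;
}

// Hypothetical rules; only the first label is quoted from the PR.
const LAYER_RULES: LayerRule[] = [
  { prefix: "app/models/", label: "Rails ActiveRecord model layer" },
  { prefix: "app/controllers/", label: "Rails controller layer" }, // assumed label
  { prefix: "src/controllers/", label: "HTTP controller layer" }, // assumed label
];

// Return the layer label for a repo-relative path, or undefined if none applies.
function fileLayer(path: string): string | undefined {
  const normalized = path.replace(/\\/g, "/").replace(/^\.\//, "");
  for (const rule of LAYER_RULES) {
    if (normalized.indexOf(rule.prefix) === 0) return rule.label;
  }
  return undefined;
}

// Render the hint line for the prompt; an empty string means "no hint".
function layerLine(path: string): string {
  const layer = fileLayer(path);
  return layer === undefined ? "" : "Layer: " + layer;
}

console.log(layerLine("app/models/author.rb"));
// "Layer: Rails ActiveRecord model layer"
console.log(layerLine("README.md"));
// ""
```

Returning no hint for unmatched paths avoids steering the model with a guessed label, which matters given how readily the pipe-table experiment was treated as a template.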
PR4/1 v5 (calibration after iter-by-iter verification):
- TasksService.purpose anyOf uses verb stems (creat/updat/delet)
plus broad nouns (manage/operation/business/logic) to escape the
substring trap discovered when 'create' didn't match 'creating'
- router-primitives expectedRole broadened to accept either narrow
"HTTP routing primitives" framing or the broader "framework types
and utilities" framing the LLM legitimately picks
- 1 new metadata-assertions.test.ts tripwire documenting the trap
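The substring trap is easy to reproduce in isolation. The sketch below assumes a case-insensitive "contains any of these needles" check; `containsAny` and the sample purpose string are hypothetical and stand in for whatever the real assertion helper does.

```typescript
// Case-insensitive "any needle is a substring of the text" check.
function containsAny(text: string, needles: string[]): boolean {
  const haystack = text.toLowerCase();
  return needles.some((n) => haystack.indexOf(n.toLowerCase()) !== -1);
}

// Hypothetical LLM output using gerunds.
const producedPurpose = "Handles creating, updating and deleting tasks";

// Full verbs miss: "creating" does not contain "create" (the final 'e' vs 'i').
console.log(containsAny(producedPurpose, ["create", "update", "delete"])); // false

// Verb stems match both "create" and "creating".
console.log(containsAny(producedPurpose, ["creat", "updat", "delet"])); // true
```

Stems trade precision for recall: "creat" would also match an unrelated word such as "creature", which is presumably acceptable here because the assertions are scoped to a single purpose string.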
Result:
- bookstore-api 13/13 iterations: 0 critical, 0 major, 0 minor (80/80 prose)
- todo-api 13/13 iterations: 0 critical, 0 major, 0 minor (141/141 prose)
- 2551 unit tests passing, lint + typecheck clean
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Replaces brittle prose-similarity grading with structural property checks that ask factual questions about LLM output instead of asking the judge to recognize the GT author's exact phrasing.
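Result: both baseline eval fixtures pass cleanly across all 13 iterations (bookstore-api: 0 critical, 0 major, 0 minor, 80/80 prose; todo-api: 0 critical, 0 major, 0 minor, 141/141 prose).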
The canonical bug, Author.domain drifting to ["data-access","user-management"] because the LLM over-indexes on the symbol name, is fixed at the source: Author now produces ["book-management","data-access"] and stays that way.
What changed
PR2: baseline schema (evals/harness/reporter/baseline.ts)
- Add proseChecks: { passed, failed } to TableScore so prose drift is regression-tracked alongside structural diffs
- updateBaseline reports prose-drift improvements/regressions per stage
PR3: Ruby reference extractor + inflector (src/parser/adapters/ruby/)
- Detect ActiveRecord association references (has_many/belongs_to/has_one/has_and_belongs_to_many) so Author.dependencies includes Book even though Zeitwerk autoload means there's no parse-time import
- Detect constant-receiver call references (Klass.method) for Zeitwerk apps with zero explicit imports
- inflector.ts wraps the pluralize package
- New fixtures under test/fixtures/ruby-rails/ and test/fixtures/ruby-rails-irregular-plurals/
PR4/1: property-based metadata assertions (evals/harness/comparator/tables/metadata-assertions.ts)
- MetadataAssertion discriminated union with 7 kinds: tag-any-of, tag-none-of, tag-floor, string-contains, string-forbid, concept-fit, regex
- evaluateAssertions helper + 25 unit tests
- Wired into compareDefinitionMetadata and compareRelationshipAnnotations
- Builder helpers (assertedDomain/assertedPurpose/assertedRelationship/exactPure) keep migrated GT files one-line per entry
- Migrated all ~85 bookstore-api + ~120 todo-api definition_metadata entries from proseReference/themeReference/acceptableSet to assertions
- Migrated all relationship_annotations entries to assertedRelationship
- Author.domain ground-truth is now:
PR4/2: file-path-derived layer hint (src/commands/llm/_shared/file-layer.ts)
- Maps app/models/, src/controllers/, etc. to short architectural-layer labels rendered in the symbols-stage prompt
- The "Layer: Rails ActiveRecord model layer" line anchors the symbol's identity in persistence rather than letting the LLM over-index on the name
PR4/1 v5: calibration after iter-by-iter verification
- TasksService.purpose anyOf uses verb stems (creat/updat/delet) plus broad nouns (manage/operation/business/logic) to escape the substring trap discovered when 'create' didn't match 'creating' (the trailing 'e' diverges from the 'i')
- router-primitives expectedRole broadened to accept either the narrow "HTTP routing primitives" framing or the broader "framework types" framing the LLM legitimately picks
- 1 new tripwire in metadata-assertions.test.ts documenting the substring trap
- Substring trap documented in evals/ground-truth/_shared/assertion-builders.ts
Test plan
- pnpm test: 2551 tests passing (was 2526 + 25 new metadata-assertion tests)
- pnpm typecheck: clean
- pnpm lint: clean
- pnpm eval -- bookstore-api.eval.ts: 13/13 iters 0/0/0, 80/80 prose
- pnpm eval -- todo-api.eval.ts: 13/13 iters 0/0/0, 141/141 prose
- Author.domain produces ["book-management","data-access"], verified in produced.db from latest run
🤖 Generated with Claude Code