feat: activate LLM judges for self-evolution engine #6
Merged
The self-evolution LLM judges were built and tested but never activated.
The heuristic regex fallback was running as the primary path, which
violates the Cardinal Rule (TypeScript doing reasoning work that should
be delegated to the LLM).
This change auto-detects ANTHROPIC_API_KEY at construction time and
enables Sonnet-powered judges when available. The heuristic path
remains as a fallback for environments without an API key.
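A minimal sketch of how the startup resolution described above might look. The names `JudgeMode`, `Env`, and `resolveUseLLMJudges` are hypothetical, and treating `always` without a key as an error is an assumption; the engine's real constructor may behave differently.

```typescript
// Hypothetical names: JudgeMode, Env, and resolveUseLLMJudges are illustrative,
// not the project's actual API.
type JudgeMode = "auto" | "always" | "never";
type Env = Record<string, string | undefined>;

// Decide once, at construction time, whether the LLM judges are used.
function resolveUseLLMJudges(mode: JudgeMode, env: Env): boolean {
  const key = env["ANTHROPIC_API_KEY"];
  const hasKey = typeof key === "string" && key.trim().length > 0;

  if (mode === "never") {
    return false; // heuristic path only
  }
  if (mode === "always") {
    // Assumption: forcing judges on without a key is a configuration error.
    if (!hasKey) {
      throw new Error("judges mode is 'always' but ANTHROPIC_API_KEY is not set");
    }
    return true;
  }
  // "auto": enable judges only when a key is present, else fall back to heuristics.
  return hasKey;
}
```

Passing the environment in as a parameter, rather than reading a global, keeps the decision easy to test and makes it explicit that the mode is fixed after startup.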
What changed:
- EvolutionEngine constructor resolves judge mode at startup via
config setting (auto/always/never) + API key detection
- Removed enableLLMJudges() and disableLLMJudges() runtime toggles
that were never called and could cause inconsistent state
- Added judges config section to evolution.yaml with daily cost cap
($50/day safety net) and golden suite size cap (50 entries)
- Upgraded memory consolidation to use LLM path when judges enabled,
with existingFacts from evolved config for contradiction detection
- Fixed Zod v3/v4 compatibility: judge schemas now import from
zod/v4 to match the Anthropic SDK's zodOutputFormat expectations
- Fixed model ID constants to use short aliases (claude-sonnet-4-6)
instead of dated versions that returned 404
- Golden suite pruning enforces the 50-entry cap
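The config caps and the pruning step from the list above could be sketched as follows. The field names (`dailyCostCapUsd`, `goldenSuiteMaxEntries`), the `GoldenEntry` shape, and the oldest-first eviction policy are all assumptions for illustration; only the `auto` default, the $50/day cap, and the 50-entry cap come from this PR.

```typescript
// Illustrative shape only: field names are assumptions, not the real
// evolution.yaml schema.
interface JudgesConfig {
  mode: "auto" | "always" | "never";
  dailyCostCapUsd: number;
  goldenSuiteMaxEntries: number;
}

// Values taken from the PR description: auto mode, $50/day, 50 entries.
const defaultJudgesConfig: JudgesConfig = {
  mode: "auto",
  dailyCostCapUsd: 50,
  goldenSuiteMaxEntries: 50,
};

// Hypothetical golden-suite entry; addedAt is a millisecond timestamp.
interface GoldenEntry {
  id: string;
  addedAt: number;
}

// Assumption: when over the cap, the oldest entries are dropped first.
// Returns a new array in chronological order and never mutates the input.
function pruneGoldenSuite(entries: readonly GoldenEntry[], maxEntries: number): GoldenEntry[] {
  const sorted = [...entries].sort((a, b) => a.addedAt - b.addedAt);
  return sorted.slice(Math.max(0, sorted.length - maxEntries));
}
```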
When judges are enabled, every session gets:
- Sonnet observation extraction (catches implicit corrections, inferred
preferences, sentiment signals that regex misses)
- Triple-judge constitution and safety gates with minority veto
- Cascaded Haiku-to-Sonnet regression gate
- Session quality assessment
- LLM-powered memory consolidation with structured fact extraction
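The cascaded regression gate listed above can be illustrated with a small sketch. The escalation rule here (accept the cheap judge only on a confident pass, otherwise ask the stronger judge) is an assumption, as are the `Verdict` and `Judge` types and the 0.8 threshold; the judges are injected as functions so no real model call is implied.

```typescript
// Hypothetical types: the real judges return richer structured output.
interface Verdict {
  pass: boolean;
  confidence: number; // 0..1
}
type Judge = (input: string) => Promise<Verdict>;

// Run the cheap judge first; escalate to the strong judge unless the cheap
// judge passed with enough confidence. The threshold is an assumed default.
async function cascadedRegressionGate(
  input: string,
  cheap: Judge,
  strong: Judge,
  minConfidence: number = 0.8,
): Promise<Verdict & { escalated: boolean }> {
  const first = await cheap(input);
  if (first.pass && first.confidence >= minConfidence) {
    return { ...first, escalated: false };
  }
  const second = await strong(input);
  return { ...second, escalated: true };
}
```

The point of the cascade is cost: most clear-cut cases stop at the cheaper model, and only uncertain or failing cases pay for the second call.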
Verified on two production VMs:
- cheema.ghostwright.dev: judges correctly rejected an unsafe evolution
change ("never suggest anything else") based on constitutional
analysis of the Honesty principle
- cheem.ghostwright.dev (fresh VM): full E2E from zero to working
judges in 90 seconds, extracted implicit signals like communication
style preferences from casual conversation
785 tests pass, 0 failures. Typecheck clean. Lint clean.
Replace `delete process.env.X` with `process.env.X = undefined` to satisfy Biome's noDelete rule, and fix import ordering. These were pre-existing lint failures unrelated to the judge activation work.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4a2d3be560
…ementally
Addresses two review findings:
1. Memory consolidation now checks the daily cost cap before invoking the LLM judge, and tracks the returned cost toward the daily total. Added isWithinCostCap() and trackExternalJudgeCost() to the engine.
2. Cost tracking within afterSession() is now incremental. Each LLM stage updates the daily counter immediately, so later stages see prior costs and fall back to heuristics when the cap is reached.
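The incremental cost tracking described in these two findings might be sketched like this. Only the method names `isWithinCostCap()` and `trackExternalJudgeCost()` come from the commit; their signatures, the `DailyCostTracker` class, the UTC day rollover, and the `runStage` helper are assumptions, and synchronous callbacks stand in for the real async LLM calls.

```typescript
// Hypothetical tracker: in the PR these methods live on the engine itself.
class DailyCostTracker {
  private readonly capUsd: number;
  private readonly now: () => Date;
  private day: string;
  private spentUsd = 0;

  constructor(capUsd: number, now: () => Date = () => new Date()) {
    this.capUsd = capUsd;
    this.now = now;
    this.day = this.dayKey();
  }

  // Assumption: the daily window is a UTC calendar day.
  private dayKey(): string {
    return this.now().toISOString().slice(0, 10);
  }

  private rollover(): void {
    const today = this.dayKey();
    if (today !== this.day) {
      this.day = today;
      this.spentUsd = 0;
    }
  }

  isWithinCostCap(): boolean {
    this.rollover();
    return this.spentUsd < this.capUsd;
  }

  trackExternalJudgeCost(costUsd: number): void {
    this.rollover();
    this.spentUsd += Math.max(0, costUsd);
  }
}

// Each stage checks the cap first and records its cost immediately, so later
// stages in the same session see what earlier stages already spent.
function runStage<T>(
  tracker: DailyCostTracker,
  llm: () => { result: T; costUsd: number },
  heuristic: () => T,
): T {
  if (!tracker.isWithinCostCap()) {
    return heuristic();
  }
  const { result, costUsd } = llm();
  tracker.trackExternalJudgeCost(costUsd);
  return result;
}
```

With a cap of 1.00 and stages costing 0.60 each, the first two stages run on the LLM path and the third falls back to the heuristic, because the counter is updated between stages rather than once at the end.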
Summary
The self-evolution LLM judges were built and tested (Phase 3.5) but never activated in production. The `EvolutionEngine` constructor defaulted `useLLMJudges = false` and `enableLLMJudges()` was never called anywhere in the 27K-line codebase. The heuristic regex fallback was running as the primary path, violating the Cardinal Rule.
This PR:
- Auto-detects `ANTHROPIC_API_KEY` at construction time and enables Sonnet-powered judges when available
- Removes the runtime toggle (`enableLLMJudges()`/`disableLLMJudges()`) that was never called and could cause inconsistent state
- Fixes Zod v3/v4 compatibility so judge schemas match the Anthropic SDK's `zodOutputFormat` expectations
- Fixes model ID constants to use short aliases (`claude-sonnet-4-6`) instead of dated versions that returned 404
What the LLM judges do vs the heuristic path
Example: heuristic extracted:
"always use Rust for CLIs. That's what I prefer." (raw text dump)
Example: LLM judges extracted:
"User communicates casually and informally ('Hey man'), suggesting they prefer a conversational tone over formal responses."
"User appears to be a developer comfortable with multiple languages and CLI tooling concepts."
The LLM catches implicit signals (tone, expertise level) that regex cannot detect.
Safety verification
On cheema.ghostwright.dev, the triple-judge constitution gate correctly rejected an unsafe evolution change. When told "always use Postgres, never suggest anything else", the Sonnet judges analyzed this against the constitution's Honesty principle and rejected it because forcing a single recommendation in all cases would mean giving dishonest technical guidance.
The heuristic path would have blindly appended the raw text to user-profile.md.
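To make the gate's decision rule concrete, here is a minimal sketch of a minority-veto aggregation, assuming that a single rejecting judge out of three is enough to block a change. The `JudgeVote` shape and the `applyMinorityVeto` name are hypothetical, and failing closed on an empty vote list is an added assumption.

```typescript
// Hypothetical vote shape; real judge output is structured by the judge schemas.
interface JudgeVote {
  judge: string;
  approve: boolean;
  reason: string;
}

interface GateResult {
  approved: boolean;
  vetoes: JudgeVote[];
}

// Minority veto: the change is approved only if no judge rejects it.
// Assumption: with no votes at all, the gate fails closed.
function applyMinorityVeto(votes: readonly JudgeVote[]): GateResult {
  if (votes.length === 0) {
    return { approved: false, vetoes: [] };
  }
  const vetoes = votes.filter((vote) => !vote.approve);
  return { approved: vetoes.length === 0, vetoes };
}
```

Under this rule the "never suggest anything else" change is blocked as soon as one judge flags a conflict with the Honesty principle, even if the other two approve.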
Test plan
judges section defaults to auto)