
fix: preserve structured judge inputs for E1 comparator #72

Merged
intertwine merged 2 commits into main from fix/e1-judge-structured-response on Mar 8, 2026

Conversation

@intertwine (Owner) commented Mar 8, 2026

Summary

  • feed the E1 judge prompt a canonical JSON rendering of the model response instead of only the parsed label
  • add a structured-response JudgeRubric wrapper so the judge can actually evaluate format + confidence + label
  • add regression tests for both judge environment entrypoints
  • bump env package versions (sv-netlogs-judge -> 0.2.18, sv-env-network-logs -> 0.2.15)

Why

The stopped Prime run i5fvjomdr3yjn3pg6y7x3nu8 showed repeated empty batches after the judge reward stayed pinned at 0. The root cause is that JudgeRubric was seeing parser.parse_answer(completion) (for example Benign) while the prompt asked it to judge JSON validity and confidence. That made the judge systematically reject otherwise-correct completions and collapse reward signal.
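
The shape of the fix can be sketched as follows. The function name `format_completion_for_judge` appears in the review comments below; the fallback behavior here (returning the raw text when the completion is not valid JSON, so the judge can still penalize malformed output) is an assumption about the implementation, which actually routes through the parser's own JSON handling:

```python
import json


def format_completion_for_judge(completion: str) -> str:
    """Render the raw model response as canonical JSON for the judge prompt.

    Falls back to the raw text when the completion is not valid JSON, so the
    judge can still see (and penalize) malformed output.
    """
    try:
        parsed = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return completion
    # sort_keys gives a stable rendering regardless of the model's key order
    return json.dumps(parsed, sort_keys=True)
```

The point is that the judge prompt now receives the full structured response (label, confidence, and format) rather than only the parsed label string.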

Validation

  • `source .venv/bin/activate && pytest environments/sv-env-netlogs-judge/sv_netlogs_judge_test.py environments/sv-env-network-logs/sv_env_network_logs_judge_test.py -q`
  • `source .venv/bin/activate && ruff check environments/sv-env-netlogs-judge/sv_netlogs_judge_impl.py environments/sv-env-network-logs/sv_env_network_logs_judge.py environments/sv-env-netlogs-judge/sv_netlogs_judge_test.py environments/sv-env-network-logs/sv_env_network_logs_judge_test.py`

Follow-up

  • rebuild and push the updated judge env to Prime Hub
  • relaunch the E1 judge comparator run

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc93731f19


Comment on lines +168 to +171

```python
original_parser = self.parser
try:
    self.parser = self._prompt_parser
    return await super().judge(prompt, completion, answer, state)
```

P1 — Avoid mutating shared parser inside async judge call

This method swaps self.parser and then awaits super().judge(...), which creates a race when two judge() coroutines run concurrently on the same rubric instance (a common batched-rollout pattern): the later call can capture _prompt_parser as original_parser and restore that last, leaving the rubric in the wrapped parser state. After that, any code path reading rubric.parser.parse_answer gets structured JSON instead of the normal label-only output, leading to inconsistent scoring/logging behavior across subsequent samples; the same pattern is duplicated in sv_netlogs_judge_impl.py.


@intertwine (Owner, Author) replied

Good catch — I removed the shared parser mutation and now compute the structured response directly inside judge(), so concurrent batched scoring no longer mutates rubric state. I also ran ruff format to address the CI format failure.

@intertwine (Owner, Author) commented

@claude review


claude Bot commented Mar 8, 2026

Claude finished @intertwine's task in 2m 9s


Code Review

  • Gather context
  • Read changed files
  • Analyze for bugs, security, correctness
  • Post review feedback

The fix correctly identifies and resolves the root cause: JudgeRubric was receiving only the parsed label string (e.g. "Benign") via parser.parse_answer(completion), but the judge prompt explicitly asks the judge to evaluate JSON validity and confidence. This made the judge systematically reject correct completions, collapsing the reward signal to zero.

The StructuredResponseJudgeRubric + _JudgePromptParser approach is a clean design: it intercepts only the parse_answer call used to render the judge prompt, leaving the environment's normal scoring path untouched. The regression tests are well-targeted, particularly test_judge_prompt_receives_structured_json_response which verifies the right invariant.

That said, there is one real concurrency bug and a few smaller issues worth addressing before merging.


P1 — Race condition: self.parser swap is not safe for concurrent async calls

This is the same issue flagged by the Codex review and deserves attention.

In both sv_env_network_logs_judge.py:167-173 and sv_netlogs_judge_impl.py:167-173:

```python
async def judge(self, prompt, completion, answer, state=None) -> str:
    original_parser = self.parser
    try:
        self.parser = self._prompt_parser  # mutates shared state
        return await super().judge(...)    # yields control here
    finally:
        self.parser = original_parser
```

Because await yields control to the event loop, two concurrent judge() coroutines (common in batched RL rollouts) can interleave like this:

| Time | Coroutine A | Coroutine B | `self.parser` |
|------|-------------|-------------|---------------|
| t1 | saves `original_parser = real_parser` | | `real_parser` |
| t2 | sets `self.parser = _prompt_parser` | | `_prompt_parser` |
| t3 | `await super().judge(...)` — yields | | `_prompt_parser` |
| t4 | | saves `original_parser = _prompt_parser` (wrong) | `_prompt_parser` |
| t5 | | sets `self.parser = _prompt_parser` | `_prompt_parser` |
| t6 | | `await super().judge(...)` — yields | `_prompt_parser` |
| t7 | finishes, restores `self.parser = real_parser` | | `real_parser` |
| t8 | | finishes, restores `self.parser = _prompt_parser` (stuck) | `_prompt_parser` |

After this, self.parser is permanently _prompt_parser, meaning all subsequent parse_answer() calls from the environment's scoring path will return structured JSON instead of labels, breaking reward calculation for the rest of the run.

Recommended fix — use asyncio.Lock to serialize the swap:

```python
def __init__(self, parser: NetworkLogParser, **kwargs) -> None:
    super().__init__(parser=parser, **kwargs)
    self._prompt_parser = _JudgePromptParser(parser)
    self._judge_lock = asyncio.Lock()

async def judge(self, prompt, completion, answer, state=None) -> str:
    async with self._judge_lock:
        original_parser = self.parser
        try:
            self.parser = self._prompt_parser
            return await super().judge(prompt, completion, answer, state)
        finally:
            self.parser = original_parser
```

This serializes concurrent judge calls, which is fine — the judge is an LLM call and network latency dominates anyway.


P2 — Private method access: parser._parse_json

format_completion_for_judge calls parser._parse_json(completion) (both files, line ~129). This is a protected method and could be renamed or removed in a future sv_shared version without a deprecation warning, silently breaking the fallback path. If JsonClassificationParser exposes a public parsing API that returns the raw dict, prefer that. Otherwise, add a comment documenting the dependency.


P3 — Missing concurrency test

The new test_judge_prompt_receives_structured_json_response test validates the happy path but not the concurrent case. A test that runs two judge() calls concurrently and verifies rubric.parser is real_parser after both complete would catch the P1 regression. Without it, the race condition could silently reappear.
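
To make the failure mode concrete, here is a self-contained sketch of the race such a test would catch. `RaceyRubric` is a stand-in with string-valued parsers, not the PR's class; a real regression test would run two `judge()` calls concurrently against the actual rubric and assert the parser is restored afterwards, which fails against the swap pattern:

```python
import asyncio


class RaceyRubric:
    """Minimal stand-in reproducing the buggy parser-swap pattern."""

    def __init__(self):
        self.parser = "real_parser"
        self._prompt_parser = "prompt_parser"

    async def judge(self):
        original_parser = self.parser
        try:
            self.parser = self._prompt_parser
            await asyncio.sleep(0)  # yield control, as the awaited LLM call would
        finally:
            self.parser = original_parser


async def run_two_concurrent_judges():
    rubric = RaceyRubric()
    await asyncio.gather(rubric.judge(), rubric.judge())
    return rubric.parser


# Under the buggy swap, the second coroutine saved "prompt_parser" as its
# "original" and restores it last, leaving the rubric permanently wrapped.
leaked = asyncio.run(run_two_concurrent_judges())
```

A fixed implementation would leave `leaked == "real_parser"` here, so a single `assert` distinguishes the two.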


P4 — Duplication between the two files

format_completion_for_judge, _JudgePromptParser, and StructuredResponseJudgeRubric are copy-pasted verbatim between sv_env_network_logs_judge.py and sv_netlogs_judge_impl.py. These are separate packages so some duplication is expected, but since both import from sv_shared, extracting these three into sv_shared would reduce drift risk. Not a blocker, but worth noting.


Minor

  • _JudgePromptParser.__getattr__ delegates all non-parse_answer attribute access to base_parser. If base_parser ever gets a _judge_lock attribute set (after the P1 fix), accessing _prompt_parser._judge_lock would unexpectedly delegate to base_parser. Consider making _JudgePromptParser a proper class with explicit delegation rather than __getattr__.

  • json.dumps(canonical, sort_keys=True) in format_completion_for_judge will reorder keys (e.g. confidence before label). The test correctly accounts for this. No issue, just worth documenting in the docstring.
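
For illustration, the reordering in question (standard library only; the dict literal is a hypothetical model response):

```python
import json

# sort_keys alphabetizes the output: "confidence" renders before "label"
canonical = json.dumps({"label": "Benign", "confidence": 0.87}, sort_keys=True)
```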


Summary

The core fix is correct and necessary — without it the judge environment is broken for RL training. The P1 concurrency bug is the main blocker; it would cause the same reward collapse symptom (judge stuck returning structured JSON as labels, mismatching the scoring path) but only manifests under concurrent load, which makes it easy to miss in single-example tests. The other items are lower priority. I'd suggest landing the lock fix before pushing to the Hub.
Branch: fix/e1-judge-structured-response

@intertwine (Owner, Author) commented

@claude re-review


claude Bot commented Mar 8, 2026

Claude finished @intertwine's task in 1m 27s


Re-Review

  • Gather context from previous review
  • Read updated files
  • Assess whether P1 race condition is resolved
  • Check remaining issues (P2, P3, P4, minor)
  • Post updated review

P1 — Race condition: RESOLVED

The follow-up commit (54db929) takes a better approach than the lock I suggested. Instead of serializing concurrent calls with a lock, it eliminates shared state mutation entirely: judge() now directly computes response = self._prompt_parser.parse_answer(completion) from an immutable instance attribute set at __init__, then formats and issues the LLM call inline. Since _prompt_parser is never mutated after construction, concurrent coroutines can safely read from it simultaneously. This is cleaner than a lock.
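
The lock-free shape described above can be sketched as follows. This is a simplified illustration, not the PR's implementation: the real `judge()` issues an LLM call (replaced here by a sleep), and `_JudgePromptParser` here is a minimal assumed wrapper rather than the actual one:

```python
import asyncio
import json


class _JudgePromptParser:
    """Assumed wrapper: renders the raw completion as canonical JSON."""

    def __init__(self, base_parser):
        self.base_parser = base_parser

    def parse_answer(self, completion: str) -> str:
        try:
            return json.dumps(json.loads(completion), sort_keys=True)
        except (json.JSONDecodeError, TypeError):
            return completion


class StructuredResponseJudgeRubric:
    """Sketch: _prompt_parser is set once in __init__ and never reassigned,
    so concurrent judge() coroutines can read it safely without a lock."""

    def __init__(self, parser):
        self.parser = parser  # normal scoring path, left untouched
        self._prompt_parser = _JudgePromptParser(parser)

    async def judge(self, prompt, completion, answer, state=None) -> str:
        # Compute the structured response locally; self.parser is never swapped.
        response = self._prompt_parser.parse_answer(completion)
        await asyncio.sleep(0)  # stands in for the real awaited LLM call
        return response
```

Because `judge()` only reads immutable instance state, interleaving coroutines cannot leave the rubric in a wrapped-parser state.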


New observation — judge() is a full reimplementation of the parent

The fix bypasses super().judge() and reimplements the LLM call directly (lines 167-239 in both files). This is necessary — the parent's judge() calls self.parser.parse_answer() internally, which the override needs to control — but it creates a maintenance surface: if vf.JudgeRubric.judge() gains retry logic, structured output handling, or token accounting in a future verifiers version, this override will silently diverge. A short comment explaining why super().judge() is intentionally bypassed would help future maintainers.


P2 — Private method parser._parse_json (still present, low risk)

format_completion_for_judge calls parser._parse_json(completion) (line 132 in both files). This is a protected method and could break silently if sv_shared is refactored. A short inline comment acknowledging the dependency would be sufficient, or expose a public helper in JsonClassificationParser if this pattern is reused elsewhere.


P3 — No concurrent judge test (still absent)

A test that runs two judge() calls concurrently and asserts rubric.parser is still the original parser after both complete would lock in the non-regression guarantee. Without it, a future refactor that re-introduces shared mutation won't be caught automatically. Not a blocker but recommended.


P4 — Duplication (unchanged, acknowledged)

format_completion_for_judge, _JudgePromptParser, and StructuredResponseJudgeRubric remain verbatim copies between the two packages. This is acceptable given the package separation, but tracking this in a comment (e.g. # keep in sync with sv_env_network_logs_judge.py) would reduce the chance of one drifting silently.


Summary

The only blocker from the previous review is resolved. The P1 fix is correct, idiomatic, and actually better than my suggested lock approach. The remaining items (P2, P3, P4) are all low-severity. This PR is ready to merge. The follow-up items to consider post-merge:

  • Add a # keep in sync comment or move shared helpers to sv_shared
  • Add a concurrency test for StructuredResponseJudgeRubric.judge()
  • Add a comment in judge() explaining why super().judge() is bypassed
Branch: fix/e1-judge-structured-response

intertwine merged commit 0e6ec71 into main on Mar 8, 2026
7 checks passed
intertwine deleted the fix/e1-judge-structured-response branch on March 8, 2026 at 04:56