
fix: preserve structured judge inputs for E1 comparator #72

Merged
intertwine merged 2 commits into main from fix/e1-judge-structured-response on Mar 8, 2026

Conversation

@intertwine (Owner) commented Mar 8, 2026

Summary

  • feed the E1 judge prompt a canonical JSON rendering of the model response instead of only the parsed label
  • add a structured-response JudgeRubric wrapper so the judge can actually evaluate format + confidence + label
  • add regression tests for both judge environment entrypoints
  • bump env package versions (sv-netlogs-judge -> 0.2.18, sv-env-network-logs -> 0.2.15)

Why

The stopped Prime run i5fvjomdr3yjn3pg6y7x3nu8 showed repeated empty batches after the judge reward stayed pinned at 0. The root cause is that JudgeRubric was seeing parser.parse_answer(completion) (for example Benign) while the prompt asked it to judge JSON validity and confidence. That made the judge systematically reject otherwise-correct completions and collapse reward signal.
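
The shape of the fix can be sketched as follows. The function name `format_completion_for_judge` appears in the review comments below; the fallback behavior here (returning the raw text when the completion is not valid JSON, so the judge can still penalize malformed output) is an assumption about the implementation, which actually routes through the parser's own JSON handling:

```python
import json


def format_completion_for_judge(completion: str) -> str:
    """Render the raw model response as canonical JSON for the judge prompt.

    Falls back to the raw text when the completion is not valid JSON, so the
    judge can still see (and penalize) malformed output.
    """
    try:
        parsed = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return completion
    # sort_keys gives a stable rendering regardless of the model's key order
    return json.dumps(parsed, sort_keys=True)
```

The point is that the judge prompt now receives the full structured response (label, confidence, and format) rather than only the parsed label string.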

Validation

  • `source .venv/bin/activate && pytest environments/sv-env-netlogs-judge/sv_netlogs_judge_test.py environments/sv-env-network-logs/sv_env_network_logs_judge_test.py -q`
  • `source .venv/bin/activate && ruff check environments/sv-env-netlogs-judge/sv_netlogs_judge_impl.py environments/sv-env-network-logs/sv_env_network_logs_judge.py environments/sv-env-netlogs-judge/sv_netlogs_judge_test.py environments/sv-env-network-logs/sv_env_network_logs_judge_test.py`

Follow-up

  • rebuild and push the updated judge env to Prime Hub
  • relaunch the E1 judge comparator run

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc93731f19


Comment on lines +168 to +171

```python
original_parser = self.parser
try:
    self.parser = self._prompt_parser
    return await super().judge(prompt, completion, answer, state)
```

P1 — Avoid mutating shared parser inside async judge call

This method swaps self.parser and then awaits super().judge(...), which creates a race when two judge() coroutines run concurrently on the same rubric instance (a common batched-rollout pattern): the later call can capture _prompt_parser as original_parser and restore that last, leaving the rubric in the wrapped parser state. After that, any code path reading rubric.parser.parse_answer gets structured JSON instead of the normal label-only output, leading to inconsistent scoring/logging behavior across subsequent samples; the same pattern is duplicated in sv_netlogs_judge_impl.py.


@intertwine (Owner, Author) replied

Good catch — I removed the shared parser mutation and now compute the structured response directly inside judge(), so concurrent batched scoring no longer mutates rubric state. I also ran ruff format to address the CI format failure.

@intertwine (Owner, Author) commented

@claude review


claude Bot commented Mar 8, 2026

Claude finished @intertwine's task in 2m 9s


Code Review

  • Gather context
  • Read changed files
  • Analyze for bugs, security, correctness
  • Post review feedback

The fix correctly identifies and resolves the root cause: JudgeRubric was receiving only the parsed label string (e.g. "Benign") via parser.parse_answer(completion), but the judge prompt explicitly asks the judge to evaluate JSON validity and confidence. This made the judge systematically reject correct completions, collapsing the reward signal to zero.

The StructuredResponseJudgeRubric + _JudgePromptParser approach is a clean design: it intercepts only the parse_answer call used to render the judge prompt, leaving the environment's normal scoring path untouched. The regression tests are well-targeted, particularly test_judge_prompt_receives_structured_json_response which verifies the right invariant.

That said, there is one real concurrency bug and a few smaller issues worth addressing before merging.


P1 — Race condition: self.parser swap is not safe for concurrent async calls

This is the same issue flagged by the Codex review and deserves attention.

In both sv_env_network_logs_judge.py:167-173 and sv_netlogs_judge_impl.py:167-173:

```python
async def judge(self, prompt, completion, answer, state=None) -> str:
    original_parser = self.parser
    try:
        self.parser = self._prompt_parser  # mutates shared state
        return await super().judge(...)    # yields control here
    finally:
        self.parser = original_parser
```

Because await yields control to the event loop, two concurrent judge() coroutines (common in batched RL rollouts) can interleave like this:

| Time | Coroutine A | Coroutine B | `self.parser` |
|------|-------------|-------------|---------------|
| t1 | saves `original_parser = real_parser` | | `real_parser` |
| t2 | sets `self.parser = _prompt_parser` | | `_prompt_parser` |
| t3 | `await super().judge(...)` — yields | | `_prompt_parser` |
| t4 | | saves `original_parser = _prompt_parser` (wrong) | `_prompt_parser` |
| t5 | | sets `self.parser = _prompt_parser` | `_prompt_parser` |
| t6 | | `await super().judge(...)` — yields | `_prompt_parser` |
| t7 | finishes, restores `self.parser = real_parser` | | `real_parser` |
| t8 | | finishes, restores `self.parser = _prompt_parser` (stuck) | `_prompt_parser` |

After this, self.parser is permanently _prompt_parser, meaning all subsequent parse_answer() calls from the environment's scoring path will return structured JSON instead of labels, breaking reward calculation for the rest of the run.

Recommended fix — use asyncio.Lock to serialize the swap:

```python
def __init__(self, parser: NetworkLogParser, **kwargs) -> None:
    super().__init__(parser=parser, **kwargs)
    self._prompt_parser = _JudgePromptParser(parser)
    self._judge_lock = asyncio.Lock()

async def judge(self, prompt, completion, answer, state=None) -> str:
    async with self._judge_lock:
        original_parser = self.parser
        try:
            self.parser = self._prompt_parser
            return await super().judge(prompt, completion, answer, state)
        finally:
            self.parser = original_parser
```

This serializes concurrent judge calls, which is fine — the judge is an LLM call and network latency dominates anyway.


P2 — Private method access: parser._parse_json

format_completion_for_judge calls parser._parse_json(completion) (both files, line ~129). This is a protected method and could be renamed or removed in a future sv_shared version without a deprecation warning, silently breaking the fallback path. If JsonClassificationParser exposes a public parsing API that returns the raw dict, prefer that. Otherwise, add a comment documenting the dependency.


P3 — Missing concurrency test

The new test_judge_prompt_receives_structured_json_response test validates the happy path but not the concurrent case. A test that runs two judge() calls concurrently and verifies rubric.parser is real_parser after both complete would catch the P1 regression. Without it, the race condition could silently reappear.
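
To make the failure mode concrete, here is a self-contained sketch of the race such a test would catch. `RaceyRubric` is a stand-in with string-valued parsers, not the PR's class; a real regression test would run two `judge()` calls concurrently against the actual rubric and assert the parser is restored afterwards, which fails against the swap pattern:

```python
import asyncio


class RaceyRubric:
    """Minimal stand-in reproducing the buggy parser-swap pattern."""

    def __init__(self):
        self.parser = "real_parser"
        self._prompt_parser = "prompt_parser"

    async def judge(self):
        original_parser = self.parser
        try:
            self.parser = self._prompt_parser
            await asyncio.sleep(0)  # yield control, as the awaited LLM call would
        finally:
            self.parser = original_parser


async def run_two_concurrent_judges():
    rubric = RaceyRubric()
    await asyncio.gather(rubric.judge(), rubric.judge())
    return rubric.parser


# Under the buggy swap, the second coroutine saved "prompt_parser" as its
# "original" and restores it last, leaving the rubric permanently wrapped.
leaked = asyncio.run(run_two_concurrent_judges())
```

A fixed implementation would leave `leaked == "real_parser"` here, so a single `assert` distinguishes the two.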


P4 — Duplication between the two files

format_completion_for_judge, _JudgePromptParser, and StructuredResponseJudgeRubric are copy-pasted verbatim between sv_env_network_logs_judge.py and sv_netlogs_judge_impl.py. These are separate packages so some duplication is expected, but since both import from sv_shared, extracting these three into sv_shared would reduce drift risk. Not a blocker, but worth noting.


Minor

  • _JudgePromptParser.__getattr__ delegates all non-parse_answer attribute access to base_parser. If base_parser ever gets a _judge_lock attribute set (after the P1 fix), accessing _prompt_parser._judge_lock would unexpectedly delegate to base_parser. Consider making _JudgePromptParser a proper class with explicit delegation rather than __getattr__.

  • json.dumps(canonical, sort_keys=True) in format_completion_for_judge will reorder keys (e.g. confidence before label). The test correctly accounts for this. No issue, just worth documenting in the docstring.
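
For illustration, the reordering in question (standard library only; the dict literal is a hypothetical model response):

```python
import json

# sort_keys alphabetizes the output: "confidence" renders before "label"
canonical = json.dumps({"label": "Benign", "confidence": 0.87}, sort_keys=True)
```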


Summary

The core fix is correct and necessary — without it the judge environment is broken for RL training. The P1 concurrency bug is the main blocker; it would cause the same reward collapse symptom (judge stuck returning structured JSON as labels, mismatching the scoring path) but only manifests under concurrent load, which makes it easy to miss in single-example tests. The other items are lower priority. I'd suggest landing the lock fix before pushing to the Hub.
Branch: fix/e1-judge-structured-response

@intertwine (Owner, Author) commented

@claude re-review


claude Bot commented Mar 8, 2026

Claude finished @intertwine's task in 1m 27s


Re-Review

  • Gather context from previous review
  • Read updated files
  • Assess whether P1 race condition is resolved
  • Check remaining issues (P2, P3, P4, minor)
  • Post updated review

P1 — Race condition: RESOLVED

The follow-up commit (54db929) takes a better approach than the lock I suggested. Instead of serializing concurrent calls with a lock, it eliminates shared state mutation entirely: judge() now directly computes response = self._prompt_parser.parse_answer(completion) from an immutable instance attribute set at __init__, then formats and issues the LLM call inline. Since _prompt_parser is never mutated after construction, concurrent coroutines can safely read from it simultaneously. This is cleaner than a lock.
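
The lock-free shape described above can be sketched as follows. This is a simplified illustration, not the PR's implementation: the real `judge()` issues an LLM call (replaced here by a sleep), and `_JudgePromptParser` here is a minimal assumed wrapper rather than the actual one:

```python
import asyncio
import json


class _JudgePromptParser:
    """Assumed wrapper: renders the raw completion as canonical JSON."""

    def __init__(self, base_parser):
        self.base_parser = base_parser

    def parse_answer(self, completion: str) -> str:
        try:
            return json.dumps(json.loads(completion), sort_keys=True)
        except (json.JSONDecodeError, TypeError):
            return completion


class StructuredResponseJudgeRubric:
    """Sketch: _prompt_parser is set once in __init__ and never reassigned,
    so concurrent judge() coroutines can read it safely without a lock."""

    def __init__(self, parser):
        self.parser = parser  # normal scoring path, left untouched
        self._prompt_parser = _JudgePromptParser(parser)

    async def judge(self, prompt, completion, answer, state=None) -> str:
        # Compute the structured response locally; self.parser is never swapped.
        response = self._prompt_parser.parse_answer(completion)
        await asyncio.sleep(0)  # stands in for the real awaited LLM call
        return response
```

Because `judge()` only reads immutable instance state, interleaving coroutines cannot leave the rubric in a wrapped-parser state.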


New observation — judge() is a full reimplementation of the parent

The fix bypasses super().judge() and reimplements the LLM call directly (lines 167-239 in both files). This is necessary — the parent's judge() calls self.parser.parse_answer() internally, which the override needs to control — but it creates a maintenance surface: if vf.JudgeRubric.judge() gains retry logic, structured output handling, or token accounting in a future verifiers version, this override will silently diverge. A short comment explaining why super().judge() is intentionally bypassed would help future maintainers.


P2 — Private method parser._parse_json (still present, low risk)

format_completion_for_judge calls parser._parse_json(completion) (line 132 in both files). This is a protected method and could break silently if sv_shared is refactored. A short inline comment acknowledging the dependency would be sufficient, or expose a public helper in JsonClassificationParser if this pattern is reused elsewhere.


P3 — No concurrent judge test (still absent)

A test that runs two judge() calls concurrently and asserts rubric.parser is still the original parser after both complete would lock in the non-regression guarantee. Without it, a future refactor that re-introduces shared mutation won't be caught automatically. Not a blocker but recommended.


P4 — Duplication (unchanged, acknowledged)

format_completion_for_judge, _JudgePromptParser, and StructuredResponseJudgeRubric remain verbatim copies between the two packages. This is acceptable given the package separation, but tracking this in a comment (e.g. # keep in sync with sv_env_network_logs_judge.py) would reduce the chance of one drifting silently.


Summary

The only blocker from the previous review is resolved. The P1 fix is correct, idiomatic, and actually better than my suggested lock approach. The remaining items (P2, P3, P4) are all low-severity. This PR is ready to merge. The follow-up items to consider post-merge:

  • Add a # keep in sync comment or move shared helpers to sv_shared
  • Add a concurrency test for StructuredResponseJudgeRubric.judge()
  • Add a comment in judge() explaining why super().judge() is bypassed
Branch: fix/e1-judge-structured-response

intertwine merged commit 0e6ec71 into main on Mar 8, 2026
7 checks passed
intertwine deleted the fix/e1-judge-structured-response branch on March 8, 2026 at 04:56