fix: preserve structured judge inputs for E1 comparator #72
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc93731f19
```python
original_parser = self.parser
try:
    self.parser = self._prompt_parser
    return await super().judge(prompt, completion, answer, state)
```
Avoid mutating shared parser inside async judge call
This method swaps `self.parser` and then awaits `super().judge(...)`, which creates a race when two `judge()` coroutines run concurrently on the same rubric instance (a common batched-rollout pattern): the later call can capture `_prompt_parser` as `original_parser` and restore it last, leaving the rubric in the wrapped-parser state. After that, any code path reading `rubric.parser.parse_answer` gets structured JSON instead of the normal label-only output, leading to inconsistent scoring/logging behavior across subsequent samples; the same pattern is duplicated in sv_netlogs_judge_impl.py.
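A minimal standalone repro of the hazard (hypothetical names; the real classes are the rubric and parsers from this PR):

```python
import asyncio

class Rubric:
    """Toy stand-in for the shared rubric instance."""
    parser = "real"

    async def judge(self) -> None:
        original = self.parser       # a second coroutine can capture "wrapped" here
        try:
            self.parser = "wrapped"
            await asyncio.sleep(0)   # stands in for the awaited LLM call
        finally:
            self.parser = original   # the late finisher restores "wrapped"

async def main() -> None:
    rubric = Rubric()
    await asyncio.gather(rubric.judge(), rubric.judge())
    print(rubric.parser)  # "wrapped" -- the real parser is never restored

asyncio.run(main())
```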
Good catch — I removed the shared parser mutation and now compute the structured response directly inside judge(), so concurrent batched scoring no longer mutates rubric state. I also ran ruff format to address the CI format failure.
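For reference, the shape of the race-free version (a sketch only; `format_completion_for_judge` is the helper named in this thread, and the exact signature of the base `judge()` may differ):

```python
async def judge(self, prompt, completion, answer, state=None) -> str:
    # Build the structured judge input locally; self.parser is never reassigned,
    # so concurrent judge() coroutines cannot observe each other's state.
    structured = format_completion_for_judge(self.parser, completion)
    return await super().judge(prompt, structured, answer, state)
```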
@claude review
Claude finished @intertwine's task in 2m 9s — Code Review
The fix correctly identifies and resolves the root cause. That said, there is one real concurrency bug and a few smaller issues worth addressing before merging.

P1 — Race condition

Two concurrent `judge()` calls on the same rubric instance can interleave like this:
| Time | Coroutine A | Coroutine B | `self.parser` |
|---|---|---|---|
| t1 | saves `original_parser = real_parser` | | `real_parser` |
| t2 | sets `self.parser = _prompt_parser` | | `_prompt_parser` |
| t3 | `await super().judge(...)` — yields | | `_prompt_parser` |
| t4 | | saves `original_parser = _prompt_parser` ← wrong | `_prompt_parser` |
| t5 | | sets `self.parser = _prompt_parser` | `_prompt_parser` |
| t6 | | `await super().judge(...)` — yields | `_prompt_parser` |
| t7 | finishes, restores `self.parser = real_parser` | | `real_parser` |
| t8 | | finishes, restores `self.parser = _prompt_parser` ← stuck | `_prompt_parser` |
After this, `self.parser` is permanently `_prompt_parser`, meaning all subsequent `parse_answer()` calls from the environment's scoring path will return structured JSON instead of labels, breaking reward calculation for the rest of the run.

Recommended fix — use `asyncio.Lock` to serialize the swap:
```python
def __init__(self, parser: NetworkLogParser, **kwargs) -> None:
    super().__init__(parser=parser, **kwargs)
    self._prompt_parser = _JudgePromptParser(parser)
    self._judge_lock = asyncio.Lock()

async def judge(self, prompt, completion, answer, state=None) -> str:
    async with self._judge_lock:
        original_parser = self.parser
        try:
            self.parser = self._prompt_parser
            return await super().judge(prompt, completion, answer, state)
        finally:
            self.parser = original_parser
```

This serializes concurrent `judge` calls, which is fine — the judge is an LLM call and network latency dominates anyway.
P2 — Private method access: `parser._parse_json`
`format_completion_for_judge` calls `parser._parse_json(completion)` (both files, line ~129). This is a protected method and could be renamed or removed in a future `sv_shared` version without a deprecation warning, silently breaking the fallback path. If `JsonClassificationParser` exposes a public parsing API that returns the raw dict, prefer that. Otherwise, add a comment documenting the dependency.
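One defensive shape for that comment-plus-fallback, sketched with a hypothetical helper name (`_completion_to_dict`) and a plain `json.loads` fallback:

```python
import json
from typing import Any

def _completion_to_dict(parser: Any, completion: str) -> dict | None:
    """Best-effort structured parse of a completion.

    NOTE: relies on JsonClassificationParser._parse_json, a protected method
    in sv_shared with no stability guarantee; revisit if sv_shared grows a
    public API that returns the raw dict.
    """
    parse_json = getattr(parser, "_parse_json", None)
    if callable(parse_json):
        try:
            return parse_json(completion)
        except Exception:
            pass
    try:
        return json.loads(completion)  # degraded fallback, not the parser's semantics
    except json.JSONDecodeError:
        return None
```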
P3 — Missing concurrency test
The new `test_judge_prompt_receives_structured_json_response` test validates the happy path but not the concurrent case. A test that runs two `judge()` calls concurrently and verifies `rubric.parser` is `real_parser` after both complete would catch the P1 regression. Without it, the race condition could silently reappear.
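A sketch of such a regression test, assuming `pytest-asyncio` and hypothetical fixtures (`rubric`, `real_parser`) wired to the environment under test:

```python
import asyncio
import pytest

@pytest.mark.asyncio
async def test_concurrent_judge_calls_do_not_corrupt_parser(rubric, real_parser):
    # Two overlapping judge() calls; the arguments are placeholders.
    await asyncio.gather(
        rubric.judge("prompt", "completion-a", "answer", None),
        rubric.judge("prompt", "completion-b", "answer", None),
    )
    # With the pre-fix swap pattern this assertion fails: the rubric is
    # left holding the wrapped _prompt_parser instead of the real parser.
    assert rubric.parser is real_parser
```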
P4 — Duplication between the two files
`format_completion_for_judge`, `_JudgePromptParser`, and `StructuredResponseJudgeRubric` are copy-pasted verbatim between `sv_env_network_logs_judge.py` and `sv_netlogs_judge_impl.py`. These are separate packages, so some duplication is expected, but since both import from `sv_shared`, extracting these three into `sv_shared` would reduce drift risk. Not a blocker, but worth noting.
Minor
- `_JudgePromptParser.__getattr__` delegates all non-`parse_answer` attribute access to `base_parser`. If `base_parser` ever gets a `_judge_lock` attribute set (after the P1 fix), accessing `_prompt_parser._judge_lock` would unexpectedly delegate to `base_parser`. Consider making `_JudgePromptParser` a proper class with explicit delegation rather than `__getattr__` (a sketch follows this list).
- `json.dumps(canonical, sort_keys=True)` in `format_completion_for_judge` will reorder keys (e.g. `confidence` before `label`). The test correctly accounts for this. No issue, just worth documenting in the docstring.
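On the first bullet, one way to tighten the delegation is an explicit allowlist, sketched below (the allowlist contents are hypothetical and would need to match the real parser API):

```python
class _JudgePromptParser:
    """Wraps a base parser but forwards only known attributes, so unrelated
    lookups (e.g. a future _judge_lock) raise AttributeError instead of
    silently delegating to the base parser."""

    _DELEGATED = frozenset({"parse"})  # hypothetical allowlist

    def __init__(self, base_parser):
        self.base_parser = base_parser

    def parse_answer(self, completion):
        ...  # structured-JSON parse_answer; body elided in this sketch

    def __getattr__(self, name):
        if name in self._DELEGATED:
            return getattr(self.base_parser, name)
        raise AttributeError(name)
```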
Summary
The core fix is correct and necessary — without it the judge environment is broken for RL training. The P1 concurrency bug is the main blocker; it would cause the same reward-collapse symptom (judge stuck returning structured JSON as labels, mismatching the scoring path), but it only manifests under concurrent load, which makes it easy to miss in single-example tests. The other items are lower priority. I'd suggest landing the lock fix before pushing to the Hub.
Branch: `fix/e1-judge-structured-response`

@claude re-review
Claude finished @intertwine's task in 1m 27s — Re-Review
P1 — Race condition: RESOLVED. The follow-up commit removes the shared parser swap and computes the structured judge input directly inside `judge()`, so concurrent calls no longer mutate rubric state; the fix bypasses the swap entirely rather than serializing it with a lock.

P2 — Private method access: unchanged. `format_completion_for_judge` still depends on `parser._parse_json`.
P3 — Concurrent judge test: still absent. A test that runs two `judge()` calls concurrently would guard against reintroducing P1.

P4 — Duplication: unchanged, acknowledged.
Summary

The only blocker from the previous review is resolved. The P1 fix is correct, idiomatic, and actually better than my suggested lock approach. The remaining items (P2, P3, P4) are all low-severity and can be handled as post-merge follow-ups. This PR is ready to merge.
Summary
- `JudgeRubric` wrapper so the judge can actually evaluate format + confidence + label
- Version bumps (`sv-netlogs-judge` -> 0.2.18, `sv-env-network-logs` -> 0.2.15)

Why
The stopped Prime run `i5fvjomdr3yjn3pg6y7x3nu8` showed repeated empty batches after the judge reward stayed pinned at 0. The root cause is that `JudgeRubric` was seeing `parser.parse_answer(completion)` (for example `Benign`) while the prompt asked it to judge JSON validity and confidence. That made the judge systematically reject otherwise-correct completions and collapse the reward signal.
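To make the failure concrete, a hypothetical before/after of what the judge receives (field names and values illustrative only, not the exact schema):

```python
# Before the fix: the judge saw only the parsed label...
judge_input_before = "Benign"

# ...while the prompt asked it to assess JSON validity and confidence.
# After the fix it sees the structured response; note that
# json.dumps(sort_keys=True) puts "confidence" before "label".
judge_input_after = '{"confidence": 0.9, "label": "Benign"}'
```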
Validation

```bash
source .venv/bin/activate && pytest environments/sv-env-netlogs-judge/sv_netlogs_judge_test.py environments/sv-env-network-logs/sv_env_network_logs_judge_test.py -q
source .venv/bin/activate && ruff check environments/sv-env-netlogs-judge/sv_netlogs_judge_impl.py environments/sv-env-network-logs/sv_env_network_logs_judge.py environments/sv-env-netlogs-judge/sv_netlogs_judge_test.py environments/sv-env-network-logs/sv_env_network_logs_judge_test.py
```

Follow-up