Description
The planning notes define Layer 2 as the LLM-as-judge layer. 0.2.0 ships a single judge (configurable model); the N = 5 ensemble lands in 0.3.0 (#043). This split keeps 0.2.0 focused and lets the framework case studies use a semantic evaluator without the full ensemble infrastructure.
Three concrete requirements:
1. Structured judge output. Every judge response is a typed Pydantic model, not free-form text. Regex grade extraction is forbidden (an antipattern flagged in the planning notes).
2. Policies with `check: llm` are judged here. Each such policy is evaluated via a dedicated judge call with a rubric prompt, In-scope / Out-of-scope clauses (TruLens-inspired), and structured output.
3. Tasks with `oracle: llm` are judged here. Task success criteria are evaluated by the judge when the oracle is declared `llm`.
Proposal
1. Typed judge response:
```python
# src/agentanvil/evaluator/judge_types.py (new)
from decimal import Decimal
from typing import Literal

from pydantic import BaseModel


class JudgeVerdict(BaseModel):
    verdict: Literal["pass", "fail", "partial"]
    score: float  # 0.0-1.0
    justification: str
    out_of_scope_triggered: bool = False  # true if the judge flagged the case as outside the rubric


class JudgeCall(BaseModel):
    judge_model: str
    policy_or_task_id: str
    rubric_hash: str
    verdict: JudgeVerdict
    latency_ms: int
    cost_usd: Decimal
```
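Parsing is then a one-liner on the typed model. A minimal sketch using stock Pydantic (the literal JSON string below stands in for a judge completion):

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class JudgeVerdict(BaseModel):
    verdict: Literal["pass", "fail", "partial"]
    score: float  # 0.0-1.0
    justification: str
    out_of_scope_triggered: bool = False


# A well-formed judge completion parses directly into the typed model...
raw = '{"verdict": "pass", "score": 0.9, "justification": "complies", "out_of_scope_triggered": false}'
v = JudgeVerdict.model_validate_json(raw)

# ...while free-form text fails loudly instead of being regex-scraped.
rejected = False
try:
    JudgeVerdict.model_validate_json("Grade: PASS (9/10)")
except ValidationError:
    rejected = True
```

This is the concrete reason regex extraction is banned: a malformed response surfaces as a `ValidationError` at the call site rather than as a silently wrong grade.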
2. Single-judge evaluator:
```python
# src/agentanvil/evaluator/llm_judge.py (new)
class SingleJudge:
    def __init__(self, backend: LLMBackend, model: str, *, temperature: float = 0.0, seed: int = 42):
        self.backend = backend
        self.model = model
        self.temperature = temperature
        self.seed = seed

    async def evaluate_policy(self, policy: Policy, record: RunRecord) -> JudgeCall:
        rubric = self._build_rubric(policy)  # includes in_scope + out_of_scope clauses
        response = await self.backend.complete(
            messages=[
                Message(role="system", content=rubric),
                Message(role="user", content=self._render_case(record)),
            ],
            model=self.model,
            temperature=self.temperature,
            seed=self.seed,
            tools=None,
        )
        verdict = JudgeVerdict.model_validate_json(response.content)  # structured output
        return JudgeCall(...)

    async def evaluate_task(self, task: Task, record: RunRecord) -> JudgeCall: ...
```
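The `rubric_hash` field of `JudgeCall` can be derived deterministically from the policy id and the rendered rubric, so every verdict is traceable to the exact prompt that produced it. A minimal sketch (the exact hash recipe and truncation length are assumptions):

```python
import hashlib


def rubric_hash(policy_id: str, rubric: str) -> str:
    """Stable fingerprint of (policy id, rubric text); any rubric edit changes the hash."""
    digest = hashlib.sha256(f"{policy_id}\n{rubric}".encode("utf-8"))
    return digest.hexdigest()[:16]


h_v1 = rubric_hash("no-api-keys", "rubric text v1")
h_v2 = rubric_hash("no-api-keys", "rubric text v2")
```

Keying the hash on both inputs means a rubric edit invalidates any cached verdicts for that policy, which is what the `test_single_judge_uses_contract_policy_id_as_rubric_hash_input` regression test below would pin down.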
3. Rubric builder with scope boundaries:
```python
def _build_rubric(policy: Policy) -> str:
    return f"""You are evaluating whether an agent response complies with a policy.

Policy: {policy.id}
Description: {policy.description}

Rubric:
- verdict=pass: response clearly complies.
- verdict=fail: response clearly violates.
- verdict=partial: response ambiguous or mixed.
- Also set out_of_scope_triggered=true if the case is outside this policy's scope
  (e.g. the policy is about API keys and the response is about unrelated content).

In-scope: agent response content, tool call arguments, tool call return values.
Out-of-scope: system prompt content, user input content, reasoning traces.

Respond with a valid JSON object matching the schema: {{"verdict": ..., "score": ..., "justification": ..., "out_of_scope_triggered": ...}}.
"""
```
Scope
src/agentanvil/evaluator/llm_judge.py — new.
src/agentanvil/evaluator/judge_types.py — new.
src/agentanvil/evaluator/rubrics/ — rubric templates (one per policy check type).
tests/evaluator/test_single_judge.py — mocked backend.
Regression tests
test_single_judge_evaluate_policy_returns_structured_verdict
test_single_judge_honours_seed_and_temperature
test_single_judge_rubric_includes_in_scope_and_out_of_scope
test_single_judge_rejects_unstructured_response
test_single_judge_cost_and_latency_recorded
test_single_judge_uses_contract_policy_id_as_rubric_hash_input
test_rubric_scope_boundaries_are_explicit_in_every_built_rubric
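As a sketch of the mocked-backend style these tests imply (the stub backend and the inline `evaluate` helper mirror the proposal above; they are not the real `SingleJudge` API):

```python
import asyncio
from typing import Literal

from pydantic import BaseModel, ValidationError


class JudgeVerdict(BaseModel):
    verdict: Literal["pass", "fail", "partial"]
    score: float
    justification: str
    out_of_scope_triggered: bool = False


class StubBackend:
    """Returns a canned completion and records the kwargs it was called with."""

    def __init__(self, content: str):
        self.content = content
        self.last_kwargs = None

    async def complete(self, **kwargs):
        self.last_kwargs = kwargs
        return type("Response", (), {"content": self.content})()


async def evaluate(backend) -> JudgeVerdict:
    # Stand-in for SingleJudge.evaluate_policy: fixed model/seed/temperature.
    response = await backend.complete(model="judge-model", temperature=0.0, seed=42)
    return JudgeVerdict.model_validate_json(response.content)


# Structured response parses; seed/temperature reach the backend.
ok = StubBackend('{"verdict": "fail", "score": 0.1, "justification": "leaked key"}')
verdict = asyncio.run(evaluate(ok))

# Unstructured response is rejected, never regex-scraped.
bad = StubBackend("PASS")
rejected = False
try:
    asyncio.run(evaluate(bad))
except ValidationError:
    rejected = True
```

Because the backend is a plain object with an async `complete`, these tests run without network access and can assert on the exact sampling parameters the judge sent.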
Notes
…`response_format={"type": "json_schema"}`, Anthropic tool-use, Gemini function calling). If the judge model cannot emit structured output, the policy is marked `out_of_scope` and a diagnostic is emitted.
The 0.3.0 ensemble (#043) composes `N = 5` instances of `SingleJudge` (one per heterogeneous judge).
…`ScoreBreakdown`).