Description
The planning notes define Layer 1 of the hybrid evaluator as the objective layer: everything computable from a trace without invoking an LLM. These checks are deterministic, fast, and cheap; they form the baseline of every evaluation, regardless of whether Layer 2 (single judge, 0.2.0) or Layers 2/3 (ensemble + active sampling, 0.3.0) is engaged.
Seven concrete metrics, all required in 0.2.0:
- `latency_ms` — from trace total duration.
- `cost_usd` — from trace per-step costs.
- `token_count` — input + output + reasoning.
- `tool_usage_correctness` — JSON-Schema validation of tool call args against the declared `ToolDef`.
- `format_compliance` — response matches the expected structured format (JSON Schema for structured-output tasks; regex / grammar for others).
- `policy_violations_rule` — regex / function verification of policies where `check: rule`.
- `constraint_compliance` — trace totals within `Contract.constraints` bounds.
A2A conformance (a Layer 1 item) and functional success (SWE-bench etc.) land in 0.3.0 (#033-#037); they require A2A spec pinning and benchmark ingestion.
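The three aggregation metrics above are pure sums over trace steps. A minimal sketch, assuming an illustrative step shape (`duration_ms`, `cost_usd`, and the three token fields are placeholder names, not the actual `RunRecord` schema):

```python
from dataclasses import dataclass


@dataclass
class Step:
    # Illustrative step shape; the real RunRecord schema may differ.
    duration_ms: float = 0.0
    cost_usd: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    reasoning_tokens: int = 0


def latency_ms(steps: list[Step]) -> float:
    # Total latency is the sum of per-step durations.
    return sum(s.duration_ms for s in steps)


def cost_usd(steps: list[Step]) -> float:
    # Total cost is the sum of per-step costs.
    return sum(s.cost_usd for s in steps)


def token_count(steps: list[Step]) -> int:
    # Input + output + reasoning tokens across all steps.
    return sum(s.input_tokens + s.output_tokens + s.reasoning_tokens for s in steps)
```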
Proposal
1. Evaluator entry point:
```python
# src/agentanvil/evaluator/objective.py (new)
from agentanvil.core.contracts import AgentContract
from agentanvil.core.models import ScoreBreakdown
from agentanvil.core.run_record import RunRecord


def evaluate_objective(record: RunRecord, contract: AgentContract) -> dict:
    return {
        "latency_ms": _latency_ms(record),
        "cost_usd": _cost_usd(record),
        "token_count": _token_count(record),
        "tool_usage_correctness": _tool_correctness(record, contract),
        "format_compliance": _format_compliance(record, contract),
        "policy_violations_rule": _policy_rule_violations(record, contract),
        "constraint_compliance": _constraint_compliance(record, contract),
    }
```
2. Per-check functions:
Each is testable in isolation, deterministic, and pure (no IO).
```python
def _tool_correctness(record, contract) -> dict:
    """For every tool_call step in the trace, validate args against the declared ToolDef schema.

    Returns: {"total": N, "passed": M, "failures": [{"step_id": str, "reason": str}, ...]}
    """
    import jsonschema
    ...
```
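One possible shape for the validation core, sketched against plain dicts rather than the real `RunRecord`/`ToolDef` types (the `calls`/`schemas` shapes below are illustrative assumptions):

```python
import jsonschema


def validate_tool_calls(calls: list[dict], schemas: dict[str, dict]) -> dict:
    """Validate each call's args against its tool's declared JSON Schema.

    `calls` items look like {"step_id": ..., "tool": ..., "args": {...}};
    `schemas` maps tool name -> JSON Schema (placeholder shapes only).
    """
    failures = []
    passed = 0
    for call in calls:
        schema = schemas.get(call["tool"])
        if schema is None:
            # A call to an undeclared tool is itself a correctness failure.
            failures.append({"step_id": call["step_id"], "reason": f"unknown tool {call['tool']!r}"})
            continue
        try:
            jsonschema.validate(call["args"], schema)
            passed += 1
        except jsonschema.ValidationError as exc:
            failures.append({"step_id": call["step_id"], "reason": exc.message})
    return {"total": len(calls), "passed": passed, "failures": failures}
```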
```python
def _policy_rule_violations(record, contract) -> list[dict]:
    """For every Policy with check='rule', apply its pattern (regex, callable, etc.)
    against the response content and each tool_call payload.

    Returns: list of {"policy_id": str, "step_id": str, "severity": str, "matched": str}.
    """
    ...
```
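For the regex case, the check reduces to scanning every (step, text) pair with each policy's pattern. A sketch under assumed dict shapes (the `policies`/`texts` structures are illustrative, not the real Policy model):

```python
import re


def rule_violations(policies: list[dict], texts: dict[str, str]) -> list[dict]:
    """Apply each rule-checked policy's regex to every step's text content.

    `policies` items look like {"id": ..., "pattern": ..., "severity": ...};
    `texts` maps step_id -> content (illustrative shapes only).
    """
    violations = []
    for policy in policies:
        pattern = re.compile(policy["pattern"])
        for step_id, text in texts.items():
            match = pattern.search(text)
            if match:
                violations.append({
                    "policy_id": policy["id"],
                    "step_id": step_id,
                    "severity": policy["severity"],
                    "matched": match.group(0),
                })
    return violations
```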
```python
def _constraint_compliance(record, contract) -> dict:
    """Check total latency, total cost, total tool calls, forbidden patterns.

    Returns: {"all_pass": bool, "violations": [{"constraint": str, "actual": Any, "limit": Any}, ...]}.
    """
    ...
```
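The numeric half of this check is a comparison of trace totals against contract limits. A minimal sketch, assuming all constraints are upper bounds expressed as plain numbers (the real `Contract.constraints` may carry richer types):

```python
def check_constraints(totals: dict, limits: dict) -> dict:
    """Compare trace totals against contract limits (treated as upper bounds).

    Both dicts map constraint name -> number; illustrative shapes only.
    """
    violations = [
        {"constraint": name, "actual": totals[name], "limit": limit}
        for name, limit in limits.items()
        if name in totals and totals[name] > limit
    ]
    return {"all_pass": not violations, "violations": violations}
```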
3. Integration with ScoreBreakdown:
Output populates `ScoreBreakdown.objective`. The full `ScoreBreakdown` is constructed at the top level (CLI `run`) once all enabled layers have run.
Scope
- `src/agentanvil/evaluator/__init__.py` — exports `evaluate_objective`.
- `src/agentanvil/evaluator/objective.py` — new.
- `src/agentanvil/evaluator/_checks/` — per-check modules.
- `tests/evaluator/test_objective.py` — per-check tests with synthetic traces.
- `tests/evaluator/fixtures/traces/` — trace fixtures for known-pass / known-fail scenarios.
Regression tests
- `test_latency_ms_sums_step_durations`
- `test_cost_usd_sums_llm_call_costs`
- `test_token_count_includes_reasoning_tokens`
- `test_tool_correctness_validates_args_against_schema`
- `test_tool_correctness_reports_schema_failures`
- `test_format_compliance_passes_on_valid_json`
- `test_format_compliance_fails_on_invalid_json_when_required`
- `test_policy_rule_violations_detected_in_response`
- `test_policy_rule_violations_detected_in_tool_args`
- `test_constraint_compliance_detects_latency_breach`
- `test_constraint_compliance_detects_cost_breach`
- `test_constraint_compliance_detects_forbidden_pattern`
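A possible shape for the first of these, built on a synthetic trace with known durations (`latency_ms_from_durations` is a placeholder for the real `_latency_ms`, which takes a `RunRecord` rather than a bare list):

```python
def latency_ms_from_durations(durations: list[float]) -> float:
    # Placeholder for _latency_ms: total latency is the sum of step durations.
    return sum(durations)


def test_latency_ms_sums_step_durations():
    # Synthetic trace: three steps with known durations summing to 200 ms.
    synthetic_durations = [120.0, 30.5, 49.5]
    assert latency_ms_from_durations(synthetic_durations) == 200.0
```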
Notes