
add evaluator Layer 1 (objective) #10

@cchinchilla-dev

Description


The planning notes define Layer 1 of the hybrid evaluator as the objective layer: everything computable from a trace without invoking an LLM. These checks are deterministic, fast, and cheap, and they form the baseline of every evaluation, whether or not Layer 2 (single judge) or the 0.3.0 Layer 2/3 (ensemble + active sampling) is engaged.

Seven concrete metrics, all required in 0.2.0:

  1. latency_ms — from trace total duration.
  2. cost_usd — from trace per-step costs.
  3. token_count — input + output + reasoning.
  4. tool_usage_correctness — JSON-Schema validation of tool call args against declared ToolDef.
  5. format_compliance — response matches expected structured format (JSON schema for structured-output tasks; regex / grammar for others).
  6. policy_violations_rule — regex / function verification of policies where check: rule.
  7. constraint_compliance — trace totals within Contract.constraints bounds.

A2A conformance (a Layer 1 item) and functional success (SWE-bench etc.) land in 0.3.0 (#033-#037), since they require A2A spec pinning and benchmark ingestion.

Proposal

1. Evaluator entry point:

# src/agentanvil/evaluator/objective.py (new)
from agentanvil.core.contracts import AgentContract
from agentanvil.core.run_record import RunRecord


def evaluate_objective(record: RunRecord, contract: AgentContract) -> dict:
    return {
        "latency_ms": _latency_ms(record),
        "cost_usd": _cost_usd(record),
        "token_count": _token_count(record),
        "tool_usage_correctness": _tool_correctness(record, contract),
        "format_compliance": _format_compliance(record, contract),
        "policy_violations_rule": _policy_rule_violations(record, contract),
        "constraint_compliance": _constraint_compliance(record, contract),
    }

2. Per-check functions:

Each is testable in isolation, deterministic, and pure (no IO).

def _tool_correctness(record, contract) -> dict:
    """For every tool_call step in the trace, validate args against declared ToolDef schema.

    Returns: {"total": N, "passed": M, "failures": [{"step_id": str, "reason": str}, ...]}
    """
    import jsonschema
    ...

def _policy_rule_violations(record, contract) -> list[dict]:
    """For every Policy with check='rule', apply its pattern (regex, callable, etc.)
    against the response content and each tool_call payload.

    Returns: list of {"policy_id": str, "step_id": str, "severity": str, "matched": str}.
    """
    ...
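For the regex case, the scan over response content and tool payloads could look like this sketch (the `(step_id, content)` input shape and the policy dict keys are assumptions; callable patterns are omitted for brevity):

```python
import re


def find_rule_violations(texts, policies) -> list[dict]:
    """texts: [(step_id, content)] drawn from the response and each
    tool_call payload; policies: the subset with check='rule', each as
    {"policy_id": ..., "pattern": ..., "severity": ...}.
    """
    violations = []
    for policy in policies:
        pattern = re.compile(policy["pattern"])
        for step_id, content in texts:
            match = pattern.search(content)
            if match:
                violations.append({
                    "policy_id": policy["policy_id"],
                    "step_id": step_id,
                    "severity": policy["severity"],
                    "matched": match.group(0),
                })
    return violations
```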

def _constraint_compliance(record, contract) -> dict:
    """Check total latency, total cost, total tool calls, forbidden patterns.

    Returns: {"all_pass": bool, "violations": [{"constraint": str, "actual": Any, "limit": Any}, ...]}.
    """
    ...
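The core of this check is a comparison of measured totals against upper bounds. A minimal sketch, assuming Contract.constraints can be flattened to a `{name: limit}` dict keyed the same way as the measured totals (key names are illustrative):

```python
def check_constraint_bounds(totals: dict, limits: dict) -> dict:
    """totals: measured trace totals, e.g. {"latency_ms": ..., "cost_usd": ...};
    limits: upper bounds flattened from Contract.constraints."""
    violations = [
        {"constraint": name, "actual": totals[name], "limit": limit}
        for name, limit in limits.items()
        if name in totals and totals[name] > limit
    ]
    return {"all_pass": not violations, "violations": violations}
```

Forbidden-pattern constraints would bolt on as a separate pass, since they compare text rather than numeric totals.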

3. Integration with ScoreBreakdown:

Output populates ScoreBreakdown.objective. The full ScoreBreakdown is constructed at the top level (CLI run) once all enabled layers have run.
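The dict returned by evaluate_objective slots straight into that field. A sketch of the top-level wiring, using an illustrative stand-in for ScoreBreakdown (the real model lives in agentanvil.core.models and its fields may differ):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ScoreBreakdown:  # illustrative stand-in; real model is in core.models
    objective: dict[str, Any]
    judge: Optional[dict[str, Any]] = None  # Layer 2 output, if that layer ran


def build_breakdown(objective_result: dict, judge_result=None) -> ScoreBreakdown:
    # Layer 1 always runs; judge_result stays None unless Layer 2 is enabled.
    return ScoreBreakdown(objective=objective_result, judge=judge_result)
```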

Scope

  • src/agentanvil/evaluator/__init__.py — exports evaluate_objective.
  • src/agentanvil/evaluator/objective.py — new.
  • src/agentanvil/evaluator/_checks/ — per-check modules.
  • tests/evaluator/test_objective.py — per-check tests with synthetic traces.
  • tests/evaluator/fixtures/traces/ — trace fixtures for known-pass / known-fail scenarios.

Regression tests

  • test_latency_ms_sums_step_durations
  • test_cost_usd_sums_llm_call_costs
  • test_token_count_includes_reasoning_tokens
  • test_tool_correctness_validates_args_against_schema
  • test_tool_correctness_reports_schema_failures
  • test_format_compliance_passes_on_valid_json
  • test_format_compliance_fails_on_invalid_json_when_required
  • test_policy_rule_violations_detected_in_response
  • test_policy_rule_violations_detected_in_tool_args
  • test_constraint_compliance_detects_latency_breach
  • test_constraint_compliance_detects_cost_breach
  • test_constraint_compliance_detects_forbidden_pattern
