Description
The planning notes define Layer 1 of the hybrid evaluator as the objective layer: everything computable from a trace without invoking an LLM. These checks are deterministic, fast, and cheap; they form the baseline of every evaluation, regardless of whether Layer 2 (single judge, 0.2.0) or Layers 2/3 (ensemble + active sampling, 0.3.0) is engaged.
Seven concrete metrics, all required in 0.2.0:
- `latency_ms` — from trace total duration.
- `cost_usd` — from trace per-step costs.
- `token_count` — input + output + reasoning.
- `tool_usage_correctness` — JSON-Schema validation of tool call args against the declared `ToolDef`.
- `format_compliance` — response matches the expected structured format (JSON Schema for structured-output tasks; regex / grammar for others).
- `policy_violations_rule` — regex / function verification of policies where `check: rule`.
- `constraint_compliance` — trace totals within `Contract.constraints` bounds.
A2A conformance (a Layer 1 item) and functional success (SWE-bench etc.) land in 0.3.0 (#033-#037); they require A2A spec pinning and benchmark ingestion.
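The three aggregation metrics above are pure sums over trace steps. A minimal sketch, assuming an illustrative step shape (`duration_ms`, `cost_usd`, and the three token fields are placeholder names, not the actual `RunRecord` schema):

```python
from dataclasses import dataclass


@dataclass
class Step:
    # Illustrative step shape; the real RunRecord schema may differ.
    duration_ms: float = 0.0
    cost_usd: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    reasoning_tokens: int = 0


def latency_ms(steps: list[Step]) -> float:
    # Total latency is the sum of per-step durations.
    return sum(s.duration_ms for s in steps)


def cost_usd(steps: list[Step]) -> float:
    # Total cost is the sum of per-step costs.
    return sum(s.cost_usd for s in steps)


def token_count(steps: list[Step]) -> int:
    # Input + output + reasoning tokens across all steps.
    return sum(s.input_tokens + s.output_tokens + s.reasoning_tokens for s in steps)
```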
Proposal
1. Evaluator entry point:
```python
# src/agentanvil/evaluator/objective.py (new)
from agentanvil.core.contracts import AgentContract
from agentanvil.core.models import ScoreBreakdown
from agentanvil.core.run_record import RunRecord


def evaluate_objective(record: RunRecord, contract: AgentContract) -> dict:
    return {
        "latency_ms": _latency_ms(record),
        "cost_usd": _cost_usd(record),
        "token_count": _token_count(record),
        "tool_usage_correctness": _tool_correctness(record, contract),
        "format_compliance": _format_compliance(record, contract),
        "policy_violations_rule": _policy_rule_violations(record, contract),
        "constraint_compliance": _constraint_compliance(record, contract),
    }
```
2. Per-check functions:
Each is testable in isolation, deterministic, and pure (no IO).
```python
def _tool_correctness(record, contract) -> dict:
    """For every tool_call step in the trace, validate args against the declared ToolDef schema.

    Returns: {"total": N, "passed": M, "failures": [{"step_id": str, "reason": str}, ...]}
    """
    import jsonschema
    ...
```
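One possible shape for the validation core, sketched against plain dicts rather than the real `RunRecord`/`ToolDef` types (the `calls`/`schemas` shapes below are illustrative assumptions):

```python
import jsonschema


def validate_tool_calls(calls: list[dict], schemas: dict[str, dict]) -> dict:
    """Validate each call's args against its tool's declared JSON Schema.

    `calls` items look like {"step_id": ..., "tool": ..., "args": {...}};
    `schemas` maps tool name -> JSON Schema (placeholder shapes only).
    """
    failures = []
    passed = 0
    for call in calls:
        schema = schemas.get(call["tool"])
        if schema is None:
            # A call to an undeclared tool is itself a correctness failure.
            failures.append({"step_id": call["step_id"], "reason": f"unknown tool {call['tool']!r}"})
            continue
        try:
            jsonschema.validate(call["args"], schema)
            passed += 1
        except jsonschema.ValidationError as exc:
            failures.append({"step_id": call["step_id"], "reason": exc.message})
    return {"total": len(calls), "passed": passed, "failures": failures}
```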
```python
def _policy_rule_violations(record, contract) -> list[dict]:
    """For every Policy with check='rule', apply its pattern (regex, callable, etc.)
    against the response content and each tool_call payload.

    Returns: list of {"policy_id": str, "step_id": str, "severity": str, "matched": str}.
    """
    ...
```
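For the regex case, the check reduces to scanning every (step, text) pair with each policy's pattern. A sketch under assumed dict shapes (the `policies`/`texts` structures are illustrative, not the real Policy model):

```python
import re


def rule_violations(policies: list[dict], texts: dict[str, str]) -> list[dict]:
    """Apply each rule-checked policy's regex to every step's text content.

    `policies` items look like {"id": ..., "pattern": ..., "severity": ...};
    `texts` maps step_id -> content (illustrative shapes only).
    """
    violations = []
    for policy in policies:
        pattern = re.compile(policy["pattern"])
        for step_id, text in texts.items():
            match = pattern.search(text)
            if match:
                violations.append({
                    "policy_id": policy["id"],
                    "step_id": step_id,
                    "severity": policy["severity"],
                    "matched": match.group(0),
                })
    return violations
```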
```python
def _constraint_compliance(record, contract) -> dict:
    """Check total latency, total cost, total tool calls, forbidden patterns.

    Returns: {"all_pass": bool, "violations": [{"constraint": str, "actual": Any, "limit": Any}, ...]}.
    """
    ...
```
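The numeric half of this check is a comparison of trace totals against contract limits. A minimal sketch, assuming all constraints are upper bounds expressed as plain numbers (the real `Contract.constraints` may carry richer types):

```python
def check_constraints(totals: dict, limits: dict) -> dict:
    """Compare trace totals against contract limits (treated as upper bounds).

    Both dicts map constraint name -> number; illustrative shapes only.
    """
    violations = [
        {"constraint": name, "actual": totals[name], "limit": limit}
        for name, limit in limits.items()
        if name in totals and totals[name] > limit
    ]
    return {"all_pass": not violations, "violations": violations}
```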
3. Integration with ScoreBreakdown:
Output populates `ScoreBreakdown.objective`. The full `ScoreBreakdown` is constructed at the top level (CLI `run`) once all enabled layers have run.
Scope
- `src/agentanvil/evaluator/__init__.py` — exports `evaluate_objective`.
- `src/agentanvil/evaluator/objective.py` — new.
- `src/agentanvil/evaluator/_checks/` — per-check modules.
- `tests/evaluator/test_objective.py` — per-check tests with synthetic traces.
- `tests/evaluator/fixtures/traces/` — trace fixtures for known-pass / known-fail scenarios.
Regression tests
- `test_latency_ms_sums_step_durations`
- `test_cost_usd_sums_llm_call_costs`
- `test_token_count_includes_reasoning_tokens`
- `test_tool_correctness_validates_args_against_schema`
- `test_tool_correctness_reports_schema_failures`
- `test_format_compliance_passes_on_valid_json`
- `test_format_compliance_fails_on_invalid_json_when_required`
- `test_policy_rule_violations_detected_in_response`
- `test_policy_rule_violations_detected_in_tool_args`
- `test_constraint_compliance_detects_latency_breach`
- `test_constraint_compliance_detects_cost_breach`
- `test_constraint_compliance_detects_forbidden_pattern`
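A possible shape for the first of these, built on a synthetic trace with known durations (`latency_ms_from_durations` is a placeholder for the real `_latency_ms`, which takes a `RunRecord` rather than a bare list):

```python
def latency_ms_from_durations(durations: list[float]) -> float:
    # Placeholder for _latency_ms: total latency is the sum of step durations.
    return sum(durations)


def test_latency_ms_sums_step_durations():
    # Synthetic trace: three steps with known durations summing to 200 ms.
    synthetic_durations = [120.0, 30.5, 49.5]
    assert latency_ms_from_durations(synthetic_durations) == 200.0
```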
Notes