[codex] Add Snowl-native agent benchmark adapters by ravenSanstete · Pull Request #6 · Qitor/snowl

ravenSanstete · 2026-04-28T01:00:27Z

What changed

- Added Snowl-native evaluator primitives for answer matching, function-call matching, trace policy, command checks, workspace diffs, canary leakage, state transitions, checkpoint aggregation, grouped metrics, and normalized trace extraction.
- Added benchmark adapters and scorers for agent_bench_os, bfcl, ipi_coding_agent, and agentdojo.
- Registered the previously added safety/MCQ adapters and expanded the benchmark docs so the README reflects the current built-in coverage.
- Added runtime support for per-sample dynamic tool schemas and a generic compose_terminal provider, selected by runtime_container.provider_name.
- Made toolemu scoring and examples native, so built-in code no longer bridges to an external runtime package.
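For readers unfamiliar with the primitives listed above, here is a minimal sketch of what function-call matching and canary-leakage checks could look like. The function names, the `{"name": ..., "arguments": ...}` call shape, and the comparison rules are assumptions made for illustration; they are not taken from the Snowl codebase and may differ from what this PR implements.

```python
import json
from typing import Any


def _canonical(args: Any) -> str:
    # Serialize with sorted keys so argument order does not affect equality.
    return json.dumps(args or {}, sort_keys=True)


def function_call_matches(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
    """Hypothetical matcher: same tool name and same arguments (order-insensitive)."""
    if expected.get("name") != actual.get("name"):
        return False
    return _canonical(expected.get("arguments")) == _canonical(actual.get("arguments"))


def leaks_canary(trace_text: str, canary: str) -> bool:
    """Hypothetical check: True when a non-empty canary string appears in the trace."""
    return bool(canary) and canary in trace_text


expected_call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}
actual_call = {"name": "get_weather", "arguments": {"unit": "C", "city": "Paris"}}
print(function_call_matches(expected_call, actual_call))
print(leaks_canary("tool output: SECRET-123", "SECRET-123"))
```

Exact matching after canonical serialization is the strictest option; a real scorer might also normalize numeric types or allow optional arguments, which this sketch does not attempt.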

Validation

- python -m pytest -q -> 347 passed, 1 skipped
- python -m snowl.cli bench list
- Real remote API smoke test with three models through Snowl:
  - deepseek-v3-ep
  - glm-5.1-w4a8
  - MiniMax-M2.7-w8a8
- End-to-end snowl bench run jsonl smoke test: 3 trials, all successful, remote_smoke_ok=1.000
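The PR does not say how remote_smoke_ok is computed. Assuming it is the mean of per-trial success flags, a value of 1.000 over 3 trials would follow from a calculation like the sketch below; `success_rate` is a hypothetical helper, not a Snowl function.

```python
from statistics import fmean


def success_rate(outcomes: list[bool]) -> float:
    """Mean of per-trial success flags (assumed definition of the metric)."""
    if not outcomes:
        raise ValueError("at least one trial outcome is required")
    return fmean(1.0 if ok else 0.0 for ok in outcomes)


# Three trials, all successful, as reported in the smoke run.
print(f"remote_smoke_ok={success_rate([True, True, True]):.3f}")
```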

The PR intentionally leaves unrelated local changes unstaged, including docs.zip, webui/tsconfig.tsbuildinfo, and CLAUDE.md.

ravenSanstete added 3 commits April 27, 2026 23:37

- 25b8768 Add safety benchmark adapters
- e38571d Add Snowl-native agent benchmark adapters
- 99bea68 Fix CI hermetic test setup

ravenSanstete marked this pull request as ready for review April 28, 2026 02:18

ravenSanstete merged commit 9a4d2ee into main Apr 28, 2026
2 checks passed