[codex] Add Snowl-native agent benchmark adapters #6

Merged
ravenSanstete merged 3 commits into main from codex/benchmark-agent-onboarding
Apr 28, 2026

@ravenSanstete
Contributor

What changed

  • Added Snowl-native evaluator primitives for answer matching, function-call matching, trace policy, command checks, workspace diffs, canary leakage, state transitions, checkpoint aggregation, grouped metrics, and normalized trace extraction.
  • Added benchmark adapters/scorers for agent_bench_os, bfcl, ipi_coding_agent, and agentdojo.
  • Registered the previously added safety/MCQ adapters and expanded benchmark docs so the README reflects the current built-in coverage.
  • Added runtime support for per-sample dynamic tool schemas and a generic compose_terminal provider selected by runtime_container.provider_name.
  • Made toolemu scoring and examples native, so built-in code no longer bridges to an external runtime package.
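
As an illustration of selecting a provider by runtime_container.provider_name, the lookup could be a small name-to-factory registry. This is a minimal sketch and not Snowl's actual implementation: `RuntimeContainer`, `PROVIDERS`, `register_provider`, `select_provider`, and the `compose_file` field are hypothetical names introduced here. Only `compose_terminal` and `provider_name` come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RuntimeContainer:
    # Hypothetical stand-in for a per-sample runtime_container config.
    provider_name: str
    compose_file: str = "docker-compose.yml"  # assumed field, for illustration


# Hypothetical registry mapping provider names to factory callables.
PROVIDERS: Dict[str, Callable[[RuntimeContainer], str]] = {}


def register_provider(name: str):
    """Return a decorator that records a provider factory under `name`."""
    def decorator(factory):
        PROVIDERS[name] = factory
        return factory
    return decorator


@register_provider("compose_terminal")
def make_compose_terminal(container: RuntimeContainer) -> str:
    # A real provider would start the compose stack; this one only describes it.
    return f"compose_terminal({container.compose_file})"


def select_provider(container: RuntimeContainer) -> str:
    """Look up the factory by provider_name and fail loudly if it is unknown."""
    try:
        factory = PROVIDERS[container.provider_name]
    except KeyError:
        known = ", ".join(sorted(PROVIDERS)) or "none"
        raise ValueError(
            f"unknown provider {container.provider_name!r}; known: {known}"
        ) from None
    return factory(container)
```

Keying the lookup on a plain string keeps the runtime generic: under this assumption, a benchmark adapter only has to set the provider name in its sample config, and an unknown name raises an error instead of silently falling back to a default.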

Validation

  • python -m pytest -q -> 347 passed, 1 skipped
  • python -m snowl.cli bench list
  • Real remote API smoke test with three models through Snowl:
    • deepseek-v3-ep
    • glm-5.1-w4a8
    • MiniMax-M2.7-w8a8
  • End-to-end snowl bench run jsonl smoke test: 3 trials, all successful, remote_smoke_ok=1.000
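
For readers unfamiliar with the reported figure, one plausible reading is that remote_smoke_ok is the mean of per-trial success flags. The sketch below assumes that interpretation. The `aggregate_metric` helper and the score list are hypothetical and are not taken from the Snowl codebase.

```python
from statistics import fmean
from typing import Sequence


def aggregate_metric(trial_scores: Sequence[float]) -> float:
    """Average per-trial scores; an empty run is treated as an error."""
    if not trial_scores:
        raise ValueError("no trials to aggregate")
    return fmean(trial_scores)


# Three trials, each scored 1.0 for success (assumed encoding).
scores = [1.0, 1.0, 1.0]
summary = f"remote_smoke_ok={aggregate_metric(scores):.3f}"
print(summary)  # prints "remote_smoke_ok=1.000"
```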

Notes

  • The PR intentionally leaves unrelated local changes unstaged, including docs.zip, webui/tsconfig.tsbuildinfo, and CLAUDE.md.

@ravenSanstete ravenSanstete marked this pull request as ready for review April 28, 2026 02:18
@ravenSanstete ravenSanstete merged commit 9a4d2ee into main Apr 28, 2026
2 checks passed