aoa-evals

Public library of portable evaluation bundles for agents and agent-shaped workflows.

aoa-evals is the proof layer in the AoA public surface. Where aoa-techniques stores reusable practice and aoa-skills stores bounded execution bundles, aoa-evals stores evaluation bundles. These bundles make claims about quality, boundaries, regressions, artifact output, and repeated-window movement reproducible, reviewable, and portable.

An eval here is not a random benchmark and not a pile of project-local tests. It is a bounded proof surface with clear claim limits, explicit scoring or verdict logic, defined fixtures or cases, known blind spots, and portable execution guidance.
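
The five parts named above can be read as a checklist. The sketch below is illustrative only: `EvalBundleSketch` and `missing_parts` are hypothetical names, not the repository's schema or tooling, and the example values are invented. It shows how a draft that lacks any of the five parts could be flagged before it is treated as a proof surface.

```python
from dataclasses import dataclass, field


@dataclass
class EvalBundleSketch:
    """Hypothetical in-memory view of one eval bundle (not the real schema)."""

    name: str
    claim_limits: list[str] = field(default_factory=list)
    verdict_logic: str = ""  # explicit scoring or verdict rule
    fixtures: list[str] = field(default_factory=list)
    blind_spots: list[str] = field(default_factory=list)
    execution_guidance: str = ""


def missing_parts(bundle: EvalBundleSketch) -> list[str]:
    """Return the names of the required parts that are still empty."""
    required = {
        "claim_limits": bundle.claim_limits,
        "verdict_logic": bundle.verdict_logic,
        "fixtures": bundle.fixtures,
        "blind_spots": bundle.blind_spots,
        "execution_guidance": bundle.execution_guidance,
    }
    return [part for part, value in required.items() if not value]


# Invented example: a draft with only two of the five parts filled in.
draft = EvalBundleSketch(
    name="example-eval",
    claim_limits=["covers small bounded changes only"],
    verdict_logic="pass when every case meets the rubric",
)
print(missing_parts(draft))  # ['fixtures', 'blind_spots', 'execution_guidance']
```

The authoritative shape of a bundle is the authoring template at templates/EVAL.template.md, not this sketch.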

Current release: v0.3.1. See CHANGELOG for release notes.

Start here

Use the shortest route by need:

first concrete source-owned proof surface: bundles/aoa-bounded-change-quality/EVAL.md
docs map: docs/README.md
current direction: ROADMAP.md
repeated-window stress recovery proof surface: bundles/aoa-stress-recovery-window/EVAL.md and docs/STRESS_RECOVERY_WINDOW_EVALS.md
self-agency continuity proof surfaces: bundles/aoa-continuity-anchor-integrity/EVAL.md, bundles/aoa-reflective-revision-boundedness/EVAL.md, and bundles/aoa-self-reanchor-correctness/EVAL.md
proof posture and limits: docs/EVAL_PHILOSOPHY.md
architecture: docs/ARCHITECTURE.md
current eval surface and chooser: EVAL_INDEX.md and EVAL_SELECTION.md
authoring template: templates/EVAL.template.md

Route by need

score semantics, verdict boundaries, and review posture: docs/SCORE_SEMANTICS_GUIDE.md, docs/VERDICT_INTERPRETATION_GUIDE.md, docs/EVAL_RUBRIC.md, and docs/EVAL_REVIEW_GUIDE.md
derived reader surfaces: generated/eval_catalog.min.json, generated/eval_capsules.json, and generated/eval_sections.full.json
regression, peer comparison, and repeated-window reading: docs/COMPARISON_SPINE_GUIDE.md, docs/REPEATED_WINDOW_DISCIPLINE_GUIDE.md, and generated/comparison_spine.json
bounded stress recovery longitudinal reading: docs/STRESS_RECOVERY_WINDOW_EVALS.md and bundles/aoa-stress-recovery-window/EVAL.md
bounded self-agency continuity reading: bundles/aoa-continuity-anchor-integrity/EVAL.md, bundles/aoa-reflective-revision-boundedness/EVAL.md, and bundles/aoa-self-reanchor-correctness/EVAL.md
via negativa pruning checklist: docs/VIA_NEGATIVA_CHECKLIST.md
artifact-side versus process-side reading: docs/ARTIFACT_PROCESS_SEPARATION_GUIDE.md
runtime-artifact to verdict bridge: docs/TRACE_EVAL_BRIDGE.md, docs/EVAL_RESULT_RECEIPT_GUIDE.md, and docs/RUNTIME_BENCH_PROMOTION_GUIDE.md
local shared-envelope mirror and eval publication seam: docs/EVAL_RESULT_RECEIPT_GUIDE.md
portability, fixtures, and baseline-boundary guides: docs/FIXTURE_SURFACE_GUIDE.md, docs/BLIND_SPOT_DISCLOSURE_GUIDE.md, docs/PORTABLE_EVAL_BOUNDARY_GUIDE.md, and docs/BASELINE_COMPARISON_GUIDE.md
shared proof infra and portable report surfaces: docs/SHARED_PROOF_INFRA_GUIDE.md, fixtures/, runners/, scorers/, reports/, and schemas/
adjunct progression, self-agent, and recurrence seams: docs/PROGRESSION_EVIDENCE_MODEL.md, docs/SELF_AGENT_CHECKPOINT_EVAL_POSTURE.md, and docs/RECURRENCE_PROOF_PROGRAM.md
selection and release surfaces: EVAL_SELECTION.md and docs/RELEASING.md
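
For a quick look at the derived reader surfaces without assuming anything about their internal layout, a small inspection sketch may help. It only reports the top-level JSON type and entry count of each file; `summarize` is a hypothetical helper, and the paths are the three generated files listed above, resolved relative to the repository root.

```python
import json
from pathlib import Path

# The three derived reader surfaces named in this README.
DERIVED = [
    "generated/eval_catalog.min.json",
    "generated/eval_capsules.json",
    "generated/eval_sections.full.json",
]


def summarize(path):
    """Describe a JSON file by top-level type and entry count, or None if absent."""
    file = Path(path)
    if not file.is_file():
        return None
    data = json.loads(file.read_text(encoding="utf-8"))
    # Scalars count as a single entry; dicts and lists report their length.
    size = len(data) if isinstance(data, (dict, list)) else 1
    return f"{file.name}: {type(data).__name__} with {size} top-level entries"


for path in DERIVED:
    print(summarize(path) or f"{path}: not found (run from the repository root)")
```

The sketch is read-only, so it does not interfere with the generated files or their `--check` validation.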

What belongs here

Good candidates:

portable eval bundles for agent behavior
bounded workflow evaluations
boundary and authority adherence evals
artifact-quality evals
regression and comparison evals
longitudinal movement evals
scorers, rubrics, and verdict schemas
shared fixture and runner contracts
schema-backed report examples

Bad candidates:

random tests with no portable evaluation contract
raw project-local QA that was not generalized
secret-bearing fixtures or logs
massive uncurated run dumps
techniques that belong in aoa-techniques
skills that belong in aoa-skills
undocumented scoring logic
metrics with no interpretation boundary
claims stronger than the evidence actually supports

Core distinction

aoa-techniques owns practice meaning
aoa-skills owns bounded workflow meaning
aoa-evals owns bounded proof meaning

In short:

practice canon -> workflow canon -> proof canon

The current public runtime path remains:

pick -> inspect -> expand -> object use

What this repository does not try to do

aoa-evals does not provide a final or total measure of intelligence. It does not reduce agent quality to one number. It does not treat any single eval as the whole truth.

This repository prefers explicit limits over false objectivity.

Local validation

Install local dependencies:

python -m pip install -r requirements-dev.txt

Run the minimum repository check:

python scripts/validate_repo.py

Run the current full non-mutating proof-surface battery when you need parity across authored, generated, and adjunct surfaces:

python scripts/build_catalog.py --check
python scripts/generate_runtime_candidate_template_index.py --check
python scripts/generate_runtime_candidate_intake.py --check
python scripts/generate_phase_alpha_eval_matrix.py --check
python -m pytest -q tests
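
If you prefer one entry point over five separate commands, the battery can be wrapped in a few lines of Python. This is a hypothetical convenience sketch, not a script shipped in this repository: `BATTERY` and `run_battery` are assumed names, and the command list simply mirrors the five commands above. It runs every check even after a failure, so one pass shows all failing surfaces.

```python
import subprocess
import sys

# The five non-mutating checks listed above, in the same order.
BATTERY = [
    ["scripts/build_catalog.py", "--check"],
    ["scripts/generate_runtime_candidate_template_index.py", "--check"],
    ["scripts/generate_runtime_candidate_intake.py", "--check"],
    ["scripts/generate_phase_alpha_eval_matrix.py", "--check"],
    ["-m", "pytest", "-q", "tests"],
]


def run_battery(battery=BATTERY, run=subprocess.run):
    """Run every check with the current interpreter; return the ones that failed."""
    failed = []
    for args in battery:
        command = [sys.executable, *args]
        result = run(command)
        if result.returncode != 0:
            failed.append(" ".join(args))
    return failed
```

Call `run_battery()` from the repository root and treat a non-empty return value as a failed battery. The `run` parameter exists only so the wrapper can be exercised with a stub instead of real subprocesses.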

Refresh the derived reader catalogs only when you intentionally need to rewrite them:

python scripts/build_catalog.py

If local sibling repositories are not checked out beside this repo, some dependency-target existence checks fall back to the looser local path. To run the stricter federated path, set the documented environment variables.
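
To see which sibling checkouts are present before running the validators, a small check can help. This sketch rests on two assumptions that this README does not confirm: that "beside this repo" means the same parent directory, and that the relevant siblings are the five repositories named in this README. `missing_siblings` is a hypothetical helper; it does not read or set the environment variables mentioned above.

```python
from pathlib import Path

# Sibling repositories named in this README (assumed to be the relevant set).
SIBLINGS = [
    "aoa-techniques",
    "aoa-skills",
    "aoa-routing",
    "aoa-agents",
    "aoa-playbooks",
]


def missing_siblings(repo_root, names=SIBLINGS):
    """Return the sibling names that have no directory next to repo_root."""
    parent = Path(repo_root).resolve().parent
    return [name for name in names if not (parent / name).is_dir()]
```

Calling `missing_siblings(".")` from the repository root returns the repositories that are not checked out alongside it; an empty list means every named sibling is present.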

Go elsewhere when...

you need reusable practice meaning: aoa-techniques
you need the bounded workflows under evaluation: aoa-skills
you need routing and dispatch: aoa-routing
you need role posture or handoff context: aoa-agents
you need scenario-level composition: aoa-playbooks

License

Apache-2.0
