Description
By 0.2.0 close, AgentAnvil must have at least one end-to-end case study committed under examples/case-studies/ — an actual OSS agent from the Tier A pool, wired through AgentAnvil with a contract, scenarios, recording, and report. This is both a dogfooding signal and a template for the ≥ 15 case studies that 0.4.0 executes.
Tier A criteria:
- OSS with public traction (GitHub stars ≥ 100).
- ≥ 12 months of history.
- Active issues.
- Recent commit activity.
- Single-agent for 0.2.0 (multi-agent lands in 0.3.0).
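The criteria above can be turned into a small screening helper when shortlisting finalists. This is only a sketch: `RepoStats` and its field names are hypothetical, 12 months is approximated as 365 days, and the thresholds for "active issues" (`min_open_issues`) and "recent" commits (`recent_days`) are assumptions, since the criteria do not give numbers for them.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RepoStats:
    """Hypothetical snapshot of a candidate repo; field names are illustrative."""
    stars: int
    first_commit: date
    last_commit: date
    open_issues: int
    single_agent: bool


def tier_a_failures(
    repo: RepoStats,
    today: date,
    recent_days: int = 90,      # assumption: the criteria only say "recent"
    min_open_issues: int = 1,   # assumption: the criteria only say "active issues"
) -> list[str]:
    """Return the criteria the repo fails; an empty list means it qualifies."""
    failures = []
    if repo.stars < 100:
        failures.append("stars < 100")
    if today - repo.first_commit < timedelta(days=365):
        failures.append("less than 12 months of history")
    if repo.open_issues < min_open_issues:
        failures.append("no active issues")
    if today - repo.last_commit > timedelta(days=recent_days):
        failures.append("no recent commit activity")
    if not repo.single_agent:
        failures.append("multi-agent (deferred to 0.3.0)")
    return failures
```

Returning the list of failed criteria, rather than a boolean, makes it easy to record in the README why a finalist was rejected.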
Candidates (pick one):
- LangChain SQL agent (common, small, well-documented).
- LangChain ReAct + web search.
- Open Interpreter (single-agent).
- Raw-Python agent for diversity (fallback if no OSS single-agent project cleanly meets the 12-month criterion).
Proposal
1. Case study directory structure:
examples/case-studies/
└── tier-a-01-<agent-name>/
    ├── README.md            # rationale, setup, notable findings
    ├── contract.yaml
    ├── Dockerfile
    ├── requirements.txt
    ├── scenarios.yaml       # hand-crafted + generated mix
    ├── recordings/
    │   └── <agent>.json
    ├── expected/
    │   └── <agent>.report.json
    └── run.sh               # reproducibility one-liner
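A layout check along these lines could back the merge gate for later case studies. It is a minimal sketch: `REQUIRED_ENTRIES` and `missing_entries` are illustrative names, not part of AgentAnvil, and the check only verifies that each entry exists, not that its contents are valid.

```python
from pathlib import Path

# Entries taken from the layout above; a trailing "/" marks a directory.
REQUIRED_ENTRIES = (
    "README.md",
    "contract.yaml",
    "Dockerfile",
    "requirements.txt",
    "scenarios.yaml",
    "recordings/",
    "expected/",
    "run.sh",
)


def missing_entries(case_study_dir: Path) -> list[str]:
    """List required entries that are absent from a case-study directory."""
    missing = []
    for entry in REQUIRED_ENTRIES:
        path = case_study_dir / entry.rstrip("/")
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```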
2. README.md template:
# Tier A Case Study 01: <agent-name>
## Target
- Repo: <URL>
- Commit pinned: `abc123...`
- Stars at time of study: <N>
- Framework: LangChain / raw / etc.
- Domain: SWE / QA / etc.
## Contract
<describes the 2-3 policies, 1-2 tasks, key constraints>
## Findings
<2-3 bullet points on what the run revealed. Objective-only is fine in 0.2.0.>
## Reproducibility
bash run.sh
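One possible shape for the README check is sketched below. It assumes the four `##` headings of the template are mandatory and that the pinned commit is a 7 to 40 character hex SHA in backticks, so the `abc123...` placeholder is rejected. The template has no dedicated rationale heading, so this sketch does not check for one; `readme_problems` is a hypothetical helper name.

```python
import re

# Section headings taken from the README template above.
REQUIRED_HEADINGS = ("Target", "Contract", "Findings", "Reproducibility")
# Assumption: the pinned commit is a 7-40 character hex SHA in backticks.
COMMIT_RE = re.compile(r"^- Commit pinned: `([0-9a-f]{7,40})`\s*$", re.MULTILINE)


def readme_problems(text: str) -> list[str]:
    """Return human-readable problems found in a case-study README."""
    problems = []
    # Collect level-2 headings only ("## Name").
    headings = set(re.findall(r"^## (.+?)\s*$", text, re.MULTILINE))
    for name in REQUIRED_HEADINGS:
        if name not in headings:
            problems.append(f"missing section: {name}")
    if not COMMIT_RE.search(text):
        problems.append("missing or placeholder pinned commit")
    return problems
```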
3. run.sh:
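#!/usr/bin/env bash
set -euo pipefail
# Run from the case-study directory so relative paths resolve
# even when the script is invoked from the repo root (as CI does).
cd "$(dirname "${BASH_SOURCE[0]}")"
pip install -r requirements.txt
agentanvil replay \
  --recording recordings/<agent>.json \
  --contract contract.yaml \
  --out-dir ./output
# Compare each expected report with the same-named file in output/
# (assumes replay writes reports under the file names used in expected/).
for expected in expected/*.json; do
  diff -q "output/$(basename "$expected")" "$expected"
done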
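`diff -q` is a byte-exact comparison, so it also fails on differences in key order or whitespace. If that proves too strict, a structural comparison of the parsed JSON is one alternative. The sketch below is not AgentAnvil's method; it assumes each report in `expected/` has a same-named counterpart in the output directory, and `mismatched_reports` is a hypothetical helper.

```python
import json
from pathlib import Path


def mismatched_reports(output_dir: Path, expected_dir: Path) -> list[str]:
    """Names of expected reports that are missing from, or differ in, output_dir."""
    mismatches = []
    for expected in sorted(expected_dir.glob("*.json")):
        actual = output_dir / expected.name
        if not actual.is_file():
            mismatches.append(expected.name)
            continue
        # Compare parsed JSON so key order and whitespace do not matter.
        if json.loads(actual.read_text()) != json.loads(expected.read_text()):
            mismatches.append(expected.name)
    return mismatches
```

An empty list means every expected report was reproduced; extra files in the output directory are ignored.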
4. Smoke test in CI:
# .github/workflows/ci.yml
case-studies-smoke:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
    - run: pip install -e .
    - run: |
        for cs in examples/case-studies/*/; do
          bash "$cs/run.sh"
        done
Any new case study added later must also pass its run.sh (gate for merge).
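For local runs, the same loop can be sketched in Python so that every case study runs and all failures are reported together, instead of stopping at the first one. `run_case_studies` is a hypothetical helper, not part of AgentAnvil; it assumes `bash` is on the PATH and runs each script from inside its own directory so relative paths resolve.

```python
import subprocess
from pathlib import Path


def run_case_studies(root: Path) -> dict[str, int]:
    """Run every <case-study>/run.sh under root; map directory name to exit code."""
    results = {}
    for script in sorted(root.glob("*/run.sh")):
        # Run from inside the case-study directory so relative paths resolve.
        proc = subprocess.run(["bash", script.name], cwd=script.parent)
        results[script.parent.name] = proc.returncode
    return results
```

A caller could treat any non-zero value in the returned mapping as a failed gate, for example `any(code != 0 for code in run_case_studies(Path("examples/case-studies")).values())`.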
Scope
examples/case-studies/tier-a-01-<name>/ — full directory.
.github/workflows/ci.yml — case-studies-smoke job.
docs/case-studies/tier-a-01.md — brief summary in docs site (or link to README.md).
Regression tests
case-studies-smoke CI job green on every PR.
test_tier_a_01_contract_validates
test_tier_a_01_replay_matches_expected
test_tier_a_01_readme_includes_required_fields (rationale, commit hash, findings)
Notes
- Tier A candidate selection is a research decision; list 3 finalists, pick one. The others stay on the 0.4.0 list.
- COI (conflict of interest): if the candidate is by any contributor the author knows personally, note it in a Caveats section of the README.
- Depends on: all of 0.2.0.
- Blocks: nothing in 0.2.0 — but sets the template for #071 (≥ 15 case studies in 0.4.0).