This directory contains working examples demonstrating AgentV's evaluation capabilities.
Examples are self-contained packages with their own dependencies. Before running any example, install dependencies from the repository root:
```bash
# From repository root
bun run examples:install
```

This installs dependencies for all examples. Alternatively, install dependencies for a single example:

```bash
cd examples/features/execution-metrics
bun install
```

Examples are organized into two categories:
```
examples/
├── features/    # Feature demonstrations (evaluators, metrics, SDK)
└── showcase/    # Real-world use cases and end-to-end demos
```
Focused demonstrations of specific AgentV capabilities. Each example includes its own README with details.
- basic - Core schema features
- rubric - Rubric-based evaluation
- tool-trajectory-simple - Tool trajectory validation
- tool-trajectory-advanced - Advanced tool trajectory with expected_output
- composite - Composite evaluator patterns
- weighted-evaluators - Weighted evaluators
- execution-metrics - Metrics tracking (tokens, cost, latency)
- code-grader-with-llm-calls - Code graders with target proxy for LLM calls
- batch-cli - Batch CLI evaluation
- document-extraction - Document data extraction
- local-cli - Local CLI targets
- compare - Baseline comparison
- deterministic-evaluators - Deterministic assertions (contains, regex, JSON validation)
- workspace-setup-script - Multi-step workspace setup with a `before_all` lifecycle hook
- code-grader-sdk - TypeScript SDK for code graders using `defineCodeGrader()`
- sdk-custom-assertion - Custom assertion types using `defineAssertion()`
- sdk-programmatic-api - Programmatic evaluation using `evaluate()`
- sdk-config-file - Typed configuration with `defineConfig()`
- prompt-template-sdk - Custom LLM grader prompts using `definePromptTemplate()`
Real-world evaluation scenarios. Each example includes its own README with setup instructions.
- export-screening - Export control risk classification
- tool-evaluation-plugins - Tool selection and efficiency patterns
- cw-incident-triage - Incident triage classification
- psychotherapy - Therapeutic dialogue evaluation
Each example follows this structure:
```
example-name/
├── evals/
│   ├── dataset.eval.yaml   # Primary eval file
│   ├── *.ts or *.py        # Code evaluators (optional)
│   └── *.md                # LLM grader prompts (optional)
├── scripts/                # Helper scripts (optional)
├── .agentv/
│   └── targets.yaml        # Target configuration (optional)
├── package.json            # Dependencies (if using @agentv/eval)
└── README.md               # Example documentation
```
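To scaffold a new example following this layout, a minimal shell sketch (directory and file names are taken from the tree above; `my-example` is a placeholder name):

```shell
# Create the directory skeleton for a new example
mkdir -p my-example/evals my-example/scripts my-example/.agentv

# Stub out the required and optional files
touch my-example/evals/dataset.eval.yaml
touch my-example/.agentv/targets.yaml
touch my-example/README.md
```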
For TypeScript code graders, add a package.json:
```json
{
  "name": "my-example",
  "private": true,
  "type": "module",
  "dependencies": {
    "@agentv/eval": "file:../../../packages/eval"
  }
}
```

Then write type-safe code graders:
```typescript
#!/usr/bin/env bun
import { defineCodeGrader } from '@agentv/eval';

export default defineCodeGrader(({ answer, criteria }) => ({
  score: answer.includes('expected') ? 1.0 : 0.0,
  hits: ['Found expected content'],
  misses: [],
}));
```
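The grading logic itself is plain TypeScript and can be unit-tested without the harness. A minimal sketch of that contract follows — the `{ answer }` input and `{ score, hits, misses }` result shape mirror the `defineCodeGrader` example above; the type names and the `misses` message are illustrative, not part of the `@agentv/eval` API:

```typescript
// Illustrative types mirroring the grader example above
// (names are assumptions, not exported by @agentv/eval).
type GraderInput = { answer: string; criteria?: string };
type GraderResult = { score: number; hits: string[]; misses: string[] };

// Same logic as the defineCodeGrader example, as a standalone function.
const grade = ({ answer }: GraderInput): GraderResult => {
  const found = answer.includes('expected');
  return {
    score: found ? 1.0 : 0.0,
    hits: found ? ['Found expected content'] : [],
    misses: found ? [] : ['Expected content not found'],
  };
};

console.log(grade({ answer: 'the expected string is here' }).score); // 1
console.log(grade({ answer: 'something else' }).score); // 0
```

Keeping the scoring function pure like this makes it easy to assert on edge cases before wiring it into an eval run.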