This directory contains working examples demonstrating AgentV's evaluation capabilities.
Examples are self-contained packages with their own dependencies. Before running any example, install dependencies from the repository root:
```bash
# From repository root
bun run examples:install
```

This installs dependencies for all examples. Alternatively, install dependencies for a single example:

```bash
cd examples/features/execution-metrics
bun install
```

Examples are organized into two categories:
```
examples/
├── features/    # Feature demonstrations (evaluators, metrics, SDK)
└── showcase/    # Real-world use cases and end-to-end demos
```
Focused demonstrations of specific AgentV capabilities. Each example includes its own README with details.
- basic - Core schema features
- rubric - Rubric-based evaluation
- tool-trajectory-simple - Tool trajectory validation
- tool-trajectory-advanced - Advanced tool trajectory with expected_output
- composite - Composite evaluator patterns
- weighted-evaluators - Weighted evaluators
- execution-metrics - Metrics tracking (tokens, cost, latency)
- code-grader-with-llm-calls - Code graders with target proxy for LLM calls
- batch-cli - Batch CLI evaluation
- document-extraction - Document data extraction
- local-cli - Local CLI targets
- compare - Baseline comparison
- deterministic-evaluators - Deterministic assertions (contains, regex, JSON validation)
- workspace-setup-script - Multi-step workspace setup with a `before_all` lifecycle hook
- code-grader-sdk - TypeScript SDK for code graders using `defineCodeGrader()`
- sdk-custom-assertion - Custom assertion types using `defineAssertion()`
- sdk-programmatic-api - Programmatic evaluation using `evaluate()`
- sdk-config-file - Typed configuration with `defineConfig()`
- prompt-template-sdk - Custom LLM grader prompts using `definePromptTemplate()`
Real-world evaluation scenarios. Each example includes its own README with setup instructions.
- export-screening - Export control risk classification
- tool-evaluation-plugins - Tool selection and efficiency patterns
- cw-incident-triage - Incident triage classification
- psychotherapy - Therapeutic dialogue evaluation
Each example follows this structure:
```
example-name/
├── evals/
│   ├── dataset.eval.yaml   # Primary eval file
│   ├── *.ts or *.py        # Code evaluators (optional)
│   └── *.md                # LLM grader prompts (optional)
├── scripts/                # Helper scripts (optional)
├── .agentv/
│   └── targets.yaml        # Target configuration (optional)
├── package.json            # Dependencies (if using @agentv/eval)
└── README.md               # Example documentation
```
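To scaffold a new example following this layout, a minimal shell sketch (directory and file names are taken from the tree above; `my-example` is a placeholder name):

```shell
# Create the directory skeleton for a new example
mkdir -p my-example/evals my-example/scripts my-example/.agentv

# Stub out the required and optional files
touch my-example/evals/dataset.eval.yaml
touch my-example/.agentv/targets.yaml
touch my-example/README.md
```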
For TypeScript code graders, add a package.json:
```json
{
  "name": "my-example",
  "private": true,
  "type": "module",
  "dependencies": {
    "@agentv/eval": "file:../../../packages/eval"
  }
}
```

Then write type-safe code graders:
```typescript
#!/usr/bin/env bun
import { defineCodeGrader } from '@agentv/eval';

export default defineCodeGrader(({ answer, criteria }) => ({
  score: answer.includes('expected') ? 1.0 : 0.0,
  hits: ['Found expected content'],
  misses: [],
}));
```
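The grading logic itself is plain TypeScript and can be unit-tested without the harness. A minimal sketch of that contract follows — the `{ answer }` input and `{ score, hits, misses }` result shape mirror the `defineCodeGrader` example above; the type names and the `misses` message are illustrative, not part of the `@agentv/eval` API:

```typescript
// Illustrative types mirroring the grader example above
// (names are assumptions, not exported by @agentv/eval).
type GraderInput = { answer: string; criteria?: string };
type GraderResult = { score: number; hits: string[]; misses: string[] };

// Same logic as the defineCodeGrader example, as a standalone function.
const grade = ({ answer }: GraderInput): GraderResult => {
  const found = answer.includes('expected');
  return {
    score: found ? 1.0 : 0.0,
    hits: found ? ['Found expected content'] : [],
    misses: found ? [] : ['Expected content not found'],
  };
};

console.log(grade({ answer: 'the expected string is here' }).score); // 1
console.log(grade({ answer: 'something else' }).score); // 0
```

Keeping the scoring function pure like this makes it easy to assert on edge cases before wiring it into an eval run.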