This guide explains how to create evaluation tasks for Anvil.
---
The wizard creates the following structure:
```
my-dataset/
├── Dockerfile              # Base environment
├── requirements.txt        # pytest dependencies
├── my-repo/                # Your repository
├── task-1/
│   ├── Dockerfile          # FROM {user}/anvil-images:{dataset}.base
│   ├── instance_info.txt   # Instance ID, FAIL_TO_PASS, PASS_TO_PASS
│   ├── run_script.sh       # Bash script with embedded tests
│   ├── task_tests.py       # Your pytest tests
│   ├── parser.py           # Parses pytest output to JSON
│   └── tasks.csv           # Full task specification
├── task-2/
└── ...
```
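The wizard generates `parser.py` for each task. Its exact contents depend on your Anvil version, but conceptually it maps pytest's verbose output to a JSON result. A minimal sketch of that idea (the output schema and line format handled here are assumptions, and parametrized test names are not handled):

```python
import json
import re


def parse_pytest_output(output: str) -> dict:
    """Map each test name to its outcome by scanning pytest's verbose output."""
    results = {}
    # Verbose pytest lines look like: "task_tests.py::test_name PASSED"
    pattern = re.compile(r"::(\w+)\s+(PASSED|FAILED|ERROR)")
    for line in output.splitlines():
        match = pattern.search(line)
        if match:
            name, outcome = match.groups()
            results[name] = outcome
    return results


if __name__ == "__main__":
    sample = (
        "task_tests.py::test_get_profile_in_interface PASSED\n"
        "task_tests.py::test_profile_route_exists FAILED\n"
    )
    print(json.dumps(parse_pytest_output(sample)))
```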
```bash
# Install anvil
uv sync

# Set up Docker Hub credentials (for publishing)
export REGISTRY_USERNAME=your-dockerhub-username
export REGISTRY_PASSWORD=your-dockerhub-password

# Or add to .env file
echo "REGISTRY_USERNAME=your-dockerhub-username" >> .env
echo "REGISTRY_PASSWORD=your-dockerhub-password" >> .env
```

**Important:** Your repository must be a git repo (contain a `.git` directory). Anvil uses `git reset --hard` to align to the `base_commit` before applying patches, so full git history is required. The wizard will reject repos without `.git`.
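You can sanity-check a repository before running the wizard. This is a hedged sketch, not part of Anvil itself: `check_repo` is a hypothetical helper that verifies the `.git` directory exists and that a given commit resolves in the repo's history:

```python
import subprocess
from pathlib import Path


def check_repo(repo_path: str, base_commit: str) -> None:
    """Fail fast if the repo lacks git history or the base commit is unknown."""
    if not (Path(repo_path) / ".git").exists():
        raise SystemExit(f"{repo_path} has no .git directory")
    # git rev-parse --verify exits non-zero if the commit is not in history
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "--verify", f"{base_commit}^{{commit}}"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise SystemExit(f"{base_commit} not found in {repo_path}")
```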
```bash
anvil init-dataset \
  --dataset my-dataset \
  --repo-path /path/to/your/repo \
  --base-image golang:1.22
```

| Option | Description |
|---|---|
| `--dataset`, `-d` | Dataset name (alphanumeric + hyphens) |
| `--repo-path` | Path to your git repository |
| `--base-image` | Docker base image (`golang:1.22`, `python:3.12`, `node:20`, etc.) |
`problem.md` - Describe what needs to be implemented:

```markdown
## Task: Add User Profile Endpoint

Implement a GET /api/profile endpoint that returns the authenticated user's profile.

Requirements:
1. Add GetProfile method to UserService interface
2. Implement the method in the service
3. Add controller handler
4. Register the route with authentication middleware

The endpoint should return 401 if not authenticated, 404 if user not found.
```

`tests.py` - Pytest tests that verify the implementation:
```python
from pathlib import Path


def test_get_profile_in_interface():
    """Test that GetProfile is defined in the interface."""
    content = Path("/app/my-repo/internal/service/user.go").read_text()
    assert "GetProfile" in content, "GetProfile not in interface"


def test_get_profile_implemented():
    """Test that GetProfile is implemented."""
    content = Path("/app/my-repo/internal/service/user.go").read_text()
    assert "func (s *userService) GetProfile" in content


def test_profile_route_exists():
    """Test that /profile route is registered."""
    content = Path("/app/my-repo/routes/routes.go").read_text()
    assert "/profile" in content and "GetProfile" in content
```

```bash
anvil add-task -d my-dataset \
  --problem-file problem.md \
  --tests-file tests.py \
  --capture-diff
```

What happens:
```
=== Capture Diff Mode ===
Repository: /path/to/my-dataset/my-repo
Base commit: abc123def456

Make your changes to the repository now.
Edit files in: /path/to/my-dataset/my-repo

Type 'done' when done making changes...
```

- **You edit the repo** - Make the changes that solve the task
- **Type "done"** - When you're done editing
- **Diff is captured** - The wizard runs `git diff` and shows a preview
- **Confirm** - "Use this diff?"
- **Repo resets** - "Reset repo for next task?" - the repo returns to a clean state
```bash
# Task 2
anvil add-task -d my-dataset \
  --problem-file task2/problem.md \
  --tests-file task2/tests.py \
  --capture-diff

# Task 3
anvil add-task -d my-dataset \
  --problem-file task3/problem.md \
  --tests-file task3/tests.py \
  --capture-diff
```

Each time, the repo starts clean (from the reset) so you can make fresh changes.
```bash
anvil validate-dataset -d my-dataset
```

```bash
anvil convert-dataset -d my-dataset -u your-username --dockerhub-repo anvil-images
```

This generates `instances.yaml`, `gold_patches.json`, and the directory structure needed for evaluation.
```bash
anvil publish-images -d my-dataset -u your-username --repo anvil-images
```

After publishing images, verify your tasks using the oracle agent:

```bash
# Oracle agent: applies gold patches, all tests should PASS
anvil run-evals -d my-dataset --agent oracle -u your-username --dockerhub-repo anvil-images
```

The oracle agent applies your gold patches and runs the tests. All tests should pass if your solution is correct.
Prerequisites:

- Modal account configured (`modal setup`)
- Docker Hub credentials in `.env` or environment
- Images published to Docker Hub (step 7)

If tests fail unexpectedly:

- **Oracle fails** - Your gold patch doesn't satisfy the tests, or the patch doesn't apply cleanly
```bash
anvil run-evals -d my-dataset \
  --agent mini-swe-agent \
  --model anthropic/claude-sonnet-4-20250514 \
  -u your-username \
  --dockerhub-repo anvil-images
```

Files are mounted at `/app/{repo-name}/`:

```python
# If your repo is "my-api", files are at:
Path("/app/my-api/internal/service/user.go")
Path("/app/my-api/routes/routes.go")
```

Tests are classified into two categories:
| Category | Before Patch | After Patch | Purpose |
|---|---|---|---|
| FAIL_TO_PASS | FAIL | PASS | Tests the new functionality being added |
| PASS_TO_PASS | PASS | PASS | Regression tests (existing functionality) |
FAIL_TO_PASS (most common):
- Tests that verify the new feature or bug fix
- These tests FAIL on the original code and PASS after applying the patch
- If not specified, all detected tests are assumed to be FAIL_TO_PASS
```bash
# Explicitly specify which tests are FAIL_TO_PASS
anvil add-task -d my-dataset \
  --tests-file tests.py \
  --fail-to-pass "test_new_feature,test_edge_case"
```

PASS_TO_PASS (optional):
- Regression tests that ensure existing functionality isn't broken
- These tests PASS both before and after the patch
- Use when you want to verify the patch doesn't break existing behavior
```bash
# Include regression tests
anvil add-task -d my-dataset \
  --tests-file tests.py \
  --fail-to-pass "test_new_feature" \
  --pass-to-pass "test_existing_works,test_other_feature"
```

When to use PASS_TO_PASS:
- When your patch touches code that has existing functionality
- When you want to ensure backwards compatibility
- When testing a bug fix that shouldn't affect other features
If you already have solution diffs, you can skip `--capture-diff`:

```bash
anvil add-task -d my-dataset \
  --problem-file problem.md \
  --patch-file solution.diff \
  --tests-file tests.py \
  --fail-to-pass "test_a,test_b,test_c"
```

To create a diff manually:

```bash
cd my-repo
# Make changes
git diff > ../solution.diff
# Reset
git checkout .
```
- **"No .git directory found"** - Your repo ZIP must include the `.git` directory. Re-zip from within the repo root (e.g. `cd my-repo && zip -r ../my-repo.zip .`)
- **"base_commit not found"** - The `base_commit` in your task doesn't exist in the repo's git history. Verify with `git rev-parse --verify <commit>`
- **"Patch failed to apply"** - The patch context lines don't match the file contents at `base_commit`. Regenerate the patch against the correct commit.
- **Tests can't find files** - Check that paths match `/app/{repo-name}/...`
- **Test names mismatch** - Ensure `fail_to_pass` matches test function names exactly
- Check Docker is running
- Verify `REGISTRY_USERNAME` and `REGISTRY_PASSWORD`
- Check Dockerfile syntax
- **DockerHub username issues** - If you encounter image pull errors after submission, try using `afterquery` as the Docker Hub username when creating tasks
```
Instance ID: my-dataset.task-1
Test Files: tasks/task-1/task_tests.py
FAIL_TO_PASS: ['test_get_profile_in_interface', 'test_get_profile_implemented']
PASS_TO_PASS: []
```

```dockerfile
FROM afterquery/anvil-images:my-dataset.base
WORKDIR /app
```

`tasks.csv` columns: `repo`, `instance_id`, `base_commit`, `patch`, `test_patch`, `problem_statement`, `requirements`, `interface`, `repo_language`, `fail_to_pass`, `pass_to_pass`, `issue_specificity`, `issue_categories`, `before_repo_set_cmd`, `selected_test_files_to_run`
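A generated `tasks.csv` can be inspected with Python's standard `csv` module. This sketch uses a hypothetical one-row file covering a few of the columns listed above:

```python
import csv
from io import StringIO

# A one-row stand-in for a generated tasks.csv (values are illustrative)
sample = StringIO(
    "repo,instance_id,base_commit,fail_to_pass\n"
    "my-repo,my-dataset.task-1,abc123def456,\"['test_get_profile_in_interface']\"\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    print(row["instance_id"], row["base_commit"])
```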
| Command | Purpose |
|---|---|
| `anvil init-dataset -d NAME --repo-path PATH` | Create new dataset |
| `anvil add-task -d NAME --problem-file F --tests-file F -c` | Add task with diff capture |
| `anvil add-task -d NAME --problem-file F --patch-file F --tests-file F` | Add task with pre-made patch |
| `anvil validate-dataset -d NAME` | Check structure |
| `anvil convert-dataset -d NAME -u USER` | Generate Anvil files |
| `anvil publish-images -d NAME -u USER --repo REPO` | Build & push images |
| `anvil run-evals -d NAME --agent oracle -u USER --dockerhub-repo REPO` | Verify gold patches pass all tests |
| `anvil run-evals -d NAME --agent mini-swe-agent --model M -u USER --dockerhub-repo REPO` | Run evaluation with AI agent |