This guide explains how to create evaluation tasks for Anvil.
---
The wizard creates the following structure:
```
my-dataset/
├── Dockerfile              # Base environment
├── requirements.txt        # pytest dependencies
├── my-repo/                # Your repository
├── task-1/
│   ├── Dockerfile          # FROM {user}/anvil-images:{dataset}.base
│   ├── instance_info.txt   # Instance ID, FAIL_TO_PASS, PASS_TO_PASS
│   ├── run_script.sh       # Bash script with embedded tests
│   ├── task_tests.py       # Your pytest tests
│   ├── parser.py           # Parses pytest output to JSON
│   └── tasks.csv           # Full task specification
├── task-2/
└── ...
```
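The wizard generates `parser.py` for each task. Its exact contents depend on your Anvil version, but conceptually it maps pytest's verbose output to a JSON result. A minimal sketch of that idea (the output schema and line format handled here are assumptions, and parametrized test names are not handled):

```python
import json
import re


def parse_pytest_output(output: str) -> dict:
    """Map each test name to its outcome by scanning pytest's verbose output."""
    results = {}
    # Verbose pytest lines look like: "task_tests.py::test_name PASSED"
    pattern = re.compile(r"::(\w+)\s+(PASSED|FAILED|ERROR)")
    for line in output.splitlines():
        match = pattern.search(line)
        if match:
            name, outcome = match.groups()
            results[name] = outcome
    return results


if __name__ == "__main__":
    sample = (
        "task_tests.py::test_get_profile_in_interface PASSED\n"
        "task_tests.py::test_profile_route_exists FAILED\n"
    )
    print(json.dumps(parse_pytest_output(sample)))
```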
```bash
# Install anvil
uv sync

# Set up Docker Hub credentials (for publishing)
export REGISTRY_USERNAME=your-dockerhub-username
export REGISTRY_PASSWORD=your-dockerhub-password

# Or add to .env file
echo "REGISTRY_USERNAME=your-dockerhub-username" >> .env
echo "REGISTRY_PASSWORD=your-dockerhub-password" >> .env
```

**Important:** Your repository must be a git repo (contain a `.git` directory). Anvil uses `git reset --hard` to align to the `base_commit` before applying patches, so full git history is required. The wizard will reject repos without `.git`.
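You can sanity-check a repository before running the wizard. This is a hedged sketch, not part of Anvil itself: `check_repo` is a hypothetical helper that verifies the `.git` directory exists and that a given commit resolves in the repo's history:

```python
import subprocess
from pathlib import Path


def check_repo(repo_path: str, base_commit: str) -> None:
    """Fail fast if the repo lacks git history or the base commit is unknown."""
    if not (Path(repo_path) / ".git").exists():
        raise SystemExit(f"{repo_path} has no .git directory")
    # git rev-parse --verify exits non-zero if the commit is not in history
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "--verify", f"{base_commit}^{{commit}}"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise SystemExit(f"{base_commit} not found in {repo_path}")
```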
```bash
anvil init-dataset \
  --dataset my-dataset \
  --repo-path /path/to/your/repo \
  --base-image golang:1.22
```

| Option | Description |
|---|---|
| `--dataset`, `-d` | Dataset name (alphanumeric + hyphens) |
| `--repo-path` | Path to your git repository |
| `--base-image` | Docker base image (`golang:1.22`, `python:3.12`, `node:20`, etc.) |
`problem.md` - Describe what needs to be implemented:

```markdown
## Task: Add User Profile Endpoint

Implement a GET /api/profile endpoint that returns the authenticated user's profile.

Requirements:
1. Add GetProfile method to UserService interface
2. Implement the method in the service
3. Add controller handler
4. Register the route with authentication middleware

The endpoint should return 401 if not authenticated, 404 if user not found.
```

`tests.py` - Pytest tests that verify the implementation:
```python
from pathlib import Path


def test_get_profile_in_interface():
    """Test that GetProfile is defined in the interface."""
    content = Path("/app/my-repo/internal/service/user.go").read_text()
    assert "GetProfile" in content, "GetProfile not in interface"


def test_get_profile_implemented():
    """Test that GetProfile is implemented."""
    content = Path("/app/my-repo/internal/service/user.go").read_text()
    assert "func (s *userService) GetProfile" in content


def test_profile_route_exists():
    """Test that /profile route is registered."""
    content = Path("/app/my-repo/routes/routes.go").read_text()
    assert "/profile" in content and "GetProfile" in content
```

```bash
anvil add-task -d my-dataset \
  --problem-file problem.md \
  --tests-file tests.py \
  --capture-diff
```

What happens:
```
=== Capture Diff Mode ===
Repository: /path/to/my-dataset/my-repo
Base commit: abc123def456

Make your changes to the repository now.
Edit files in: /path/to/my-dataset/my-repo

Type 'done' when done making changes...
```

- **You edit the repo** - Make the changes that solve the task
- **Type "done"** - When you're done editing
- **Diff is captured** - The wizard runs `git diff` and shows a preview
- **Confirm** - "Use this diff?"
- **Repo resets** - "Reset repo for next task?" - the repo returns to a clean state
```bash
# Task 2
anvil add-task -d my-dataset \
  --problem-file task2/problem.md \
  --tests-file task2/tests.py \
  --capture-diff

# Task 3
anvil add-task -d my-dataset \
  --problem-file task3/problem.md \
  --tests-file task3/tests.py \
  --capture-diff
```

Each time, the repo starts clean (from the reset) so you can make fresh changes.
```bash
anvil validate-dataset -d my-dataset
```

```bash
anvil convert-dataset -d my-dataset -u your-username --dockerhub-repo anvil-images
```

This generates `instances.yaml`, `gold_patches.json`, and the directory structure needed for evaluation.
```bash
anvil publish-images -d my-dataset -u your-username --repo anvil-images
```

After publishing images, verify your tasks using the oracle agent:

```bash
# Oracle agent: applies gold patches, all tests should PASS
anvil run-evals -d my-dataset --agent oracle -u your-username --dockerhub-repo anvil-images
```

The oracle agent applies your gold patches and runs the tests. All tests should pass if your solution is correct.
Prerequisites:

- Modal account configured (`modal setup`)
- Docker Hub credentials in `.env` or environment
- Images published to Docker Hub (step 7)

If tests fail unexpectedly:

- **Oracle fails** - Your gold patch doesn't satisfy the tests, or the patch doesn't apply cleanly
```bash
anvil run-evals -d my-dataset \
  --agent mini-swe-agent \
  --model anthropic/claude-sonnet-4-20250514 \
  -u your-username \
  --dockerhub-repo anvil-images
```

Files are mounted at `/app/{repo-name}/`:

```python
# If your repo is "my-api", files are at:
Path("/app/my-api/internal/service/user.go")
Path("/app/my-api/routes/routes.go")
```

Tests are classified into two categories:
| Category | Before Patch | After Patch | Purpose |
|---|---|---|---|
| FAIL_TO_PASS | FAIL | PASS | Tests the new functionality being added |
| PASS_TO_PASS | PASS | PASS | Regression tests (existing functionality) |
FAIL_TO_PASS (most common):
- Tests that verify the new feature or bug fix
- These tests FAIL on the original code and PASS after applying the patch
- If not specified, all detected tests are assumed to be FAIL_TO_PASS
```bash
# Explicitly specify which tests are FAIL_TO_PASS
anvil add-task -d my-dataset \
  --tests-file tests.py \
  --fail-to-pass "test_new_feature,test_edge_case"
```

PASS_TO_PASS (optional):
- Regression tests that ensure existing functionality isn't broken
- These tests PASS both before and after the patch
- Use when you want to verify the patch doesn't break existing behavior
```bash
# Include regression tests
anvil add-task -d my-dataset \
  --tests-file tests.py \
  --fail-to-pass "test_new_feature" \
  --pass-to-pass "test_existing_works,test_other_feature"
```

When to use PASS_TO_PASS:
- When your patch touches code that has existing functionality
- When you want to ensure backwards compatibility
- When testing a bug fix that shouldn't affect other features
If you already have solution diffs, you can skip `--capture-diff`:

```bash
anvil add-task -d my-dataset \
  --problem-file problem.md \
  --patch-file solution.diff \
  --tests-file tests.py \
  --fail-to-pass "test_a,test_b,test_c"
```

To create a diff manually:

```bash
cd my-repo
# Make changes
git diff > ../solution.diff
# Reset
git checkout .
```
- **"No .git directory found"** - Your repo ZIP must include the `.git` directory. Re-zip from within the repo root (e.g. `cd my-repo && zip -r ../my-repo.zip .`)
- **"base_commit not found"** - The `base_commit` in your task doesn't exist in the repo's git history. Verify with `git rev-parse --verify <commit>`
- **"Patch failed to apply"** - The patch context lines don't match the file contents at `base_commit`. Regenerate the patch against the correct commit.
- **Tests can't find files** - Check that paths match `/app/{repo-name}/...`
- **Test names mismatch** - Ensure `fail_to_pass` matches test function names exactly
- Check Docker is running
- Verify `REGISTRY_USERNAME` and `REGISTRY_PASSWORD`
- Check Dockerfile syntax
- **DockerHub username issues** - If you encounter image pull errors after submission, try using `afterquery` as the Docker Hub username when creating tasks
```
Instance ID: my-dataset.task-1
Test Files: tasks/task-1/task_tests.py
FAIL_TO_PASS: ['test_get_profile_in_interface', 'test_get_profile_implemented']
PASS_TO_PASS: []
```

```dockerfile
FROM afterquery/anvil-images:my-dataset.base
WORKDIR /app
```

`tasks.csv` columns: `repo`, `instance_id`, `base_commit`, `patch`, `test_patch`, `problem_statement`, `requirements`, `interface`, `repo_language`, `fail_to_pass`, `pass_to_pass`, `issue_specificity`, `issue_categories`, `before_repo_set_cmd`, `selected_test_files_to_run`
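A generated `tasks.csv` can be inspected with Python's standard `csv` module. This sketch uses a hypothetical one-row file covering a few of the columns listed above:

```python
import csv
from io import StringIO

# A one-row stand-in for a generated tasks.csv (values are illustrative)
sample = StringIO(
    "repo,instance_id,base_commit,fail_to_pass\n"
    "my-repo,my-dataset.task-1,abc123def456,\"['test_get_profile_in_interface']\"\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    print(row["instance_id"], row["base_commit"])
```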
| Command | Purpose |
|---|---|
| `anvil init-dataset -d NAME --repo-path PATH` | Create new dataset |
| `anvil add-task -d NAME --problem-file F --tests-file F -c` | Add task with diff capture |
| `anvil add-task -d NAME --problem-file F --patch-file F --tests-file F` | Add task with pre-made patch |
| `anvil validate-dataset -d NAME` | Check structure |
| `anvil convert-dataset -d NAME -u USER` | Generate Anvil files |
| `anvil publish-images -d NAME -u USER --repo REPO` | Build & push images |
| `anvil run-evals -d NAME --agent oracle -u USER --dockerhub-repo REPO` | Verify gold patches pass all tests |
| `anvil run-evals -d NAME --agent mini-swe-agent --model M -u USER --dockerhub-repo REPO` | Run evaluation with AI agent |