Anvil Swift

A benchmark for evaluating LLM coding agents on real-world Swift/iOS tasks. Agents receive a problem statement, generate a patch, and are evaluated by compiling the project and running XCTest unit tests.

How it works

Agent phase: Each task runs in a Modal sandbox using a pre-built Docker image. The agent receives the problem statement and generates a patch.
Eval phase: Patches are applied to a local worktree with cached DerivedData. xcodebuild compiles the patched project and runs unit tests. Each worker gets its own simulator clone to avoid boot conflicts during parallel evaluation.
Output: Trajectories, patches, stdout/stderr, and eval results are saved per-task. A summary with pass@k metrics is printed at the end.
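
The eval phase described above can be sketched as two command builders. This is a minimal illustration, not Anvil's actual implementation: the function names, the clone naming scheme, and the parameters are hypothetical. `xcrun simctl clone` and the `xcodebuild` flags shown are standard Xcode tooling.

```python
from typing import List


def clone_simulator_cmd(base_device: str, worker_id: int) -> List[str]:
    """Build the simctl command that clones a simulator for one worker.

    The "<base>-worker-<n>" naming scheme is an assumption for illustration.
    """
    clone_name = f"{base_device}-worker-{worker_id}"
    return ["xcrun", "simctl", "clone", base_device, clone_name]


def xcodebuild_test_cmd(project: str, scheme: str, simulator_udid: str,
                        derived_data: str) -> List[str]:
    """Build an xcodebuild invocation that compiles the project and runs tests.

    Pointing -derivedDataPath at a cached directory lets xcodebuild reuse
    earlier build products instead of compiling from scratch.
    """
    return [
        "xcodebuild", "test",
        "-project", project,
        "-scheme", scheme,
        "-destination", f"platform=iOS Simulator,id={simulator_udid}",
        "-derivedDataPath", derived_data,
    ]
```

Giving each worker its own clone means no two parallel `xcodebuild` runs target the same simulator device.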

Setup

Install dependencies and Xcode prerequisites

make setup

Configure environment

make setup copies .env.example to .env automatically. Open .env and fill in:

OPENROUTER_API_KEY (or whichever provider you're using)
REGISTRY_USERNAME - your Docker Hub username
REGISTRY_PASSWORD - a Docker Hub access token
REGISTRY_REPO - the name of your private Docker Hub repository (read by publish-images)
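
Before running anything that costs money, it can help to confirm that .env is filled in. The following is a minimal sketch, assuming .env holds plain KEY=VALUE lines. The helper names are hypothetical and not part of Anvil, and the key list covers only the API key and Docker Hub credentials.

```python
from typing import Dict, List

# Assumed key list; swap the provider key if you are not using OpenRouter.
REQUIRED_KEYS = ["OPENROUTER_API_KEY", "REGISTRY_USERNAME", "REGISTRY_PASSWORD"]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required: List[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or set to an empty value."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(open(".env").read())` returns an empty list once every key has a value.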

Authenticate services

Make sure Docker is running locally, then:

modal setup          # Modal account for sandboxed agent execution
docker login         # Docker Hub for image pulls

Create a private Docker Hub repository

Go to hub.docker.com and create a new private repository (e.g., anvil-images).

⚠️ Public repos will not work. Anvil refuses to push task images to public repositories to prevent data leakage.

Local Usage

Set up a new repo (clone + scaffold task structure)

anvil setup-repo <github_url>

This clones the repo into repos/ and creates tasks/<repo_name>/ with template repo.md and xcode_config.yaml. Fill those in before proceeding.

Create tasks from GitHub PRs

Add PR URLs (one per line) to src/anvil/commands/github_prs/<repo_name>.txt, then:

anvil create-tasks <repo_name>

This fetches each PR, downloads the diff, and uses an LLM to generate task.md, tests.swift, and uitests.swift for each task. Instead of a repo name, the command also accepts a single PR URL or a path to a text file of PR URLs. Use --model to change the LLM (default: openrouter/anthropic/claude-sonnet-4.6).
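
The PR list is one URL per line, so a typo in it is easy to catch up front. Here is a small sketch of a validator, assuming standard https://github.com/<owner>/<repo>/pull/<number> URLs. `PullRequestRef` and `parse_pr_list` are hypothetical names, not Anvil's own parser.

```python
import re
from typing import List, NamedTuple


class PullRequestRef(NamedTuple):
    owner: str
    repo: str
    number: int


_PR_URL = re.compile(r"^https://github\.com/([^/\s]+)/([^/\s]+)/pull/(\d+)/?$")


def parse_pr_list(text: str) -> List[PullRequestRef]:
    """Parse one GitHub PR URL per line; blank lines are ignored."""
    refs = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        match = _PR_URL.match(line)
        if match is None:
            raise ValueError(f"line {line_no}: not a GitHub PR URL: {line!r}")
        owner, repo, number = match.groups()
        refs.append(PullRequestRef(owner, repo, int(number)))
    return refs
```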

Convert the dataset (also warms the Xcode build cache)

anvil convert-dataset --dataset tasks/<repo_name>

Verify gold patches compile and pass unit tests

anvil run-evals --dataset datasets/<repo_name> --agent oracle

The oracle agent applies gold patches from gold_patches.json directly, so all tests should pass if your harness is correct. Each dataset needs an xcode_config.yaml in tasks/<repo_name>/ specifying the Xcode project, scheme, and build destination.

Publish Docker images (required for LLM agent runs via Modal)

anvil publish-images --dataset datasets/<repo_name>

The username and repo are read from REGISTRY_USERNAME and REGISTRY_REPO in .env (or pass -u <username> / --repo <name> to override).

Run evaluations

anvil run-evals \
  --dataset datasets/<repo_name> \
  --model openrouter/anthropic/claude-sonnet-4.5 \
  --agent mini-swe-agent \
  --n-attempts 4

By default each eval runs unit tests (tests.swift) and then UI tests (uitests.swift) when the task provides them. Pass --no-ui-tests to evaluate with unit tests only (skips copying and running UI tests). The run directory name gains a _unit-only suffix when --no-ui-tests is set (e.g. mini-swe-agent_claude-sonnet-4.6_unit-only), so full vs unit-only runs do not share the same folder.
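
The naming rule above can be sketched as follows. This is an illustration only: `run_dir_name` is a hypothetical helper, and shortening the model ID to its final path segment is an assumption inferred from the example directory name, not confirmed behavior.

```python
def run_dir_name(agent: str, model: str, no_ui_tests: bool = False) -> str:
    """Return "<agent>_<model>", with "_unit-only" appended for unit-only runs.

    Assumption: only the last "/"-separated segment of the model ID is used.
    """
    short_model = model.rsplit("/", 1)[-1]
    suffix = "_unit-only" if no_ui_tests else ""
    return f"{agent}_{short_model}{suffix}"
```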

Use --n-attempts to set the number of runs per task (useful for pass@k metrics). Results are saved to <dataset>/runs/<agent>_<model>/ (or …_unit-only).
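
For reference, pass@k is commonly computed per task with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of attempts and c the number that passed. The sketch below implements that estimator. Whether Anvil's summary uses exactly this formula is an assumption, and `pass_at_k` is a hypothetical name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: 1 - C(n - c, k) / C(n, k).

    n: attempts made, c: attempts that passed, k: sample budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, so pass@k is 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With --n-attempts 4 and two passing attempts, `pass_at_k(4, 2, 1)` gives 0.5. The benchmark-level figure is then the mean over tasks.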

💡 Progress is saved automatically to minimize costs. If you re-run the same command, completed tasks are skipped. Use --no-continue to start fresh.

Options

Flag	Default	Description
`--model`	—	Model ID (required for agents, optional for oracle)
`--dataset`	—	Dataset ID or path
`--agent`	mini-swe-agent	Agent to use (`mini-swe-agent` or `oracle`)
`--n-attempts`	1	Attempts per task (for pass@k)
`--no-ui-tests`	false	Unit tests only; skip `uitests.swift` UI tests
`--compile-only`	false	Only check compilation, skip unit tests
`--no-continue`	false	Start fresh, ignore previous results
`--max-parallel`	20	Concurrent agent runs
`--max-wait`	auto	Minutes to wait for Modal rate limits

Docker Hub auth for Modal uses REGISTRY_USERNAME and REGISTRY_PASSWORD in .env (see Setup). Image names come from instances.yaml (set when you run convert-dataset / publish-images).

Writing Tasks

Each task lives under tasks/<repo_name>/task-N/ and requires three files:

task.md — problem statement shown to the agent. Include:

What to build and why
Acceptance criteria
A "Required API Surface" section listing exact type/method names the tests depend on (so the agent knows what names to expose)

solution.diff — the gold patch. Used by the oracle agent to verify the harness is correct before running LLM agents.

tests.swift — XCTest unit tests. The harness auto-routes based on imports:

import <SPMPackage> only → copied into the SPM package test target (test_files_dest in xcode_config.yaml)
@testable import <AppModule> → injected into the app test target
Use uitests.swift instead for XCUIApplication UI tests (auto-routed to the UI test target)
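
The routing rules above can be sketched as a small classifier. This is a simplified illustration under stated assumptions, not the harness's real logic: the function name and the returned labels are hypothetical, and the app module name is passed in explicitly.

```python
import re


def route_test_file(filename: str, source: str, app_module: str) -> str:
    """Choose a destination for a test file based on its name and imports."""
    if filename == "uitests.swift":
        return "ui-test-target"
    # A line such as "@testable import MyApp" marks an app-level test.
    pattern = rf"^\s*@testable\s+import\s+{re.escape(app_module)}\b"
    if re.search(pattern, source, re.MULTILINE):
        return "app-test-target"
    # Otherwise the file only imports the SPM package.
    return "spm-package-test-target"
```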

metadata.yaml (dataset-level, not per-task) — maps each task to its base commit SHA in the source repo:

base_commits:
  task-1: <sha>
  task-2: <sha>

xcode_config.yaml (dataset-level) — configures the Xcode build. At minimum:

project: <RepoName>/<RepoName>.xcodeproj
scheme: <SchemeName>
test_package_path: <RepoName>/Packages/<PackageName>
test_files_dest: Tests/<TargetName>
test_scheme: <PackageScheme>
test_destination: "platform=iOS Simulator,name=iPhone 16,OS=latest"

For app-level tests (@testable import), add app_test_target and app_test_files_dest. Only set app_test_module if the Swift module name differs from scheme.

After writing tasks, run the oracle to verify all gold patches pass before running LLM agents:

anvil run-evals --dataset datasets/<repo_name> --agent oracle --no-continue

GitHub Actions

GitHub Actions workflows are included in the repo under .github/workflows/. The full eval pipeline can be run directly on GitHub.

Configure the following repository secrets under Settings → Secrets and variables → Actions:

- OPENROUTER_API_KEY
- MODAL_TOKEN_ID
- MODAL_TOKEN_SECRET
- REGISTRY_USERNAME
- REGISTRY_PASSWORD

Go to Actions → Anvil Eval, click Run workflow, and pick a dataset, model, agent, and number of attempts from the dropdowns.

Results are committed back to the repo under gha_runs/ and are also available as workflow artifacts.
