Proof of Concept - The core architecture has been validated and includes Terminal Bench integration. It's also designed to be extendable - plug in your own agents and evals. Extended testing (continuous optimisation over many iterations) wasn't completed due to time and cost constraints. See Status for details.
An AI agent that improves other AI agents by:
- Modifying system messages
- Tweaking tool implementations
- Running evaluations to measure performance
- Analysing failed evaluations (using subagents)
- Iterating until optimisation goals are met
- Input: Provide a project breakdown (describing the target system) and an eval callback (see the usage sketch after this list)
- Optimise: The optimiser agent reads target system files, suggests and applies improvements
- Evaluate: Each iteration is tested against the eval suite via your callback
- Iterate: Process repeats until performance targets are reached or the agent finishes
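A minimal usage sketch of this flow, assuming a Python API; the import path, constructor arguments, and `optimise()` method are illustrative assumptions, not the project's documented interface:

```python
# Hypothetical usage sketch - names and signatures are assumptions, not the real API.
from optimiser_agent import OptimiserAgent  # assumed import path

def run_evals(eval_ids=None):
    """Eval callback: run the target agent's eval suite (all evals or a subset)
    and return per-eval pass/fail results for the optimiser to act on."""
    # ... run the target agent against its evals here ...
    return {"eval_001": {"passed": False, "trajectory_path": "runs/eval_001.json"}}

agent = OptimiserAgent(
    project_breakdown="project_breakdown.yaml",  # YAML description of the target system
    eval_callback=run_evals,                     # measures the impact of each change
)
agent.optimise()  # iterates until performance targets are met or the agent finishes
```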
The system has three main components:
┌─────────────────────────────────────────────────────────────────┐
│ Client Code │
│ ┌──────────────────────┐ ┌──────────────────────────────┐ │
│ │ Project Breakdown │ │ Eval Callback │ │
│ │ (YAML config) │ │ (runs target agent evals) │ │
│ └──────────┬───────────┘ └──────────────┬───────────────┘ │
└─────────────┼───────────────────────────────┼───────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ OptimiserAgent │
│ │
│ • Reads/writes target system files │
│ • Analyses eval results │
│ • Makes iterative improvements │
│ • Spawns trajectory analysis subagents │
│ │
└─────────────────────────────────────────────────────────────────┘
│
│ modifies
▼
┌─────────────────────────────────────────────────────────────────┐
│ Target System (Client Code) │
│ │
│ The AI agent being optimised (system message, tools, etc.) │
│ │
└─────────────────────────────────────────────────────────────────┘
The client provides two key inputs:

- Project Breakdown - A YAML configuration describing the target system (an example is sketched below):
  - `key_files`: Important files and their purposes (system messages, tool implementations)
  - `available_actions`: What the target agent can do
  - `editing_guidelines`: Constraints for the optimiser (optional)
  - `known_limitations`: Any known limitations of this agent (optional)
- Eval Callback - A function that runs evaluations against the target system and returns results. The optimiser calls this to measure the impact of its changes.
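A purely illustrative example of such a breakdown; the field names follow the list above, but the file paths, descriptions, and exact schema are made up:

```yaml
# project_breakdown.yaml - hypothetical example values
key_files:
  agent/system_message.txt: "System message for the target coding agent"
  agent/tools.py: "Tool implementations (file read/write/edit, bash)"
available_actions:
  - read
  - write
  - edit
  - bash
editing_guidelines: "Keep the agent's public interface unchanged"  # optional
known_limitations: []                                              # optional
```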
The core agent that performs the optimisation. It operates in iterations:
- Analyse - Review eval results, identify failing evals
- Investigate - Read target system files, optionally spawn trajectory analysis subagents
- Modify - Make targeted changes to improve the target system
- Test - Run specific evals to validate changes
- End Iteration - Run full eval suite, collapse context, proceed to next iteration
At the end of each iteration, message history is reset to prevent unbounded context growth. Only essential state is preserved:
- Optimisation history (what changed, what improved)
- Current project breakdown (updated by the optimisation agent)
- Known limitations (evals deemed not worth pursuing)
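A rough sketch of the state listed above, assuming a Python representation; the class and field names are illustrative, not the project's actual data model:

```python
# Hypothetical sketch of the state preserved across iterations - names are assumptions.
from dataclasses import dataclass, field

@dataclass
class IterationState:
    optimisation_history: list[str] = field(default_factory=list)  # what changed, what improved
    project_breakdown: dict = field(default_factory=dict)          # kept up to date by the optimiser
    known_limitations: list[str] = field(default_factory=list)     # evals deemed not worth pursuing

# At the end of each iteration the message history is discarded; only an
# IterationState-like summary is carried into the next iteration's context.
```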
For individual trajectory analysis, the optimiser can spawn trajectory analysis subagents that have access to the full eval trajectory (the target agent's conversation history during the eval). This helps diagnose why the target agent behaved incorrectly without overloading the context window of the main agent.
The OptimiserAgent communicates exclusively via JSON actions:
| Action | Purpose |
|---|---|
| `read` | Read target system files |
| `write` / `edit` / `multi_edit` | Modify target system files |
| `bash` | Execute shell commands |
| `run_eval_suite` | Run specific evals to test changes |
| `end_iteration` | Run the full suite, collapse context |
| `reset_to_iteration` | Reset files and the project breakdown to a previous iteration's state |
| `dispatch_traj_analysis_agent` | Analyse a failed eval's trajectory |
| `send_subagent_message` | Continue the conversation with a subagent for follow-up questions after analysis |
| `update_project_breakdown` | Update file/action descriptions |
| `finish` | Complete optimisation |
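For illustration, a single emitted action might look like the JSON below; the field names are an assumption, since the action schema is not spelled out here:

```json
{
  "action": "read",
  "path": "agent/system_message.txt"
}
```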
The system is designed to improve general capability, not maximise test scores:
- Prefer changes to core reasoning over adding eval-specific rules
- Add or enhance tools when the target agent lacks a capability
- Mark evals as "known limitations" after 2-3 failed attempts
- Accept non-determinism (most attempts passing is often sufficient)
Proof of Concept - Development paused.
This project was built to explore whether an AI agent could systematically improve other AI agents through iterative optimisation. The core architecture works, and simple scenarios have been validated:
- Real LLM tests demonstrate that Claude Sonnet 4.5 can identify and fix obvious bugs in target agents
- The iteration loop, context collapse, trajectory analysis subagents, and file modification pipeline all function as designed
The scenarios in this repo test relatively simple cases: fixing an obvious bug in tool code, correcting a broken system message, and adding a missing tool. These validate that the core mechanics work.
The original vision was more ambitious:
- Continuous Terminal Bench optimisation - Start with a simple agent (file read/write/edit, bash, maybe subagent dispatch) and run the optimiser against Terminal Bench evaluations over many iterations to see if accuracy improves over time.
- Improving an existing benchmark agent - Take Stanford's Terminus 2 agent and see if the optimiser could improve its Terminal Bench scores, effectively creating a "Terminus 2.1".
- A CLI tool installable via uv - So users can run it on their own projects with a single uv command.

The first two of these require significant compute time and API costs - running the optimiser agent plus hundreds of Terminal Bench evaluations per iteration adds up quickly.
Time and resource constraints. These experiments would require sustained investment in API costs and debugging time that isn't available right now.
If you're interested in continuing this work, the infrastructure is here:
- The Terminal Bench integration via Harbor is implemented
- Trajectory analysis subagents can diagnose why target agents fail specific evals
- The iteration loop handles context collapse and state preservation
The main work would be running extended optimisation sessions against real benchmarks and refining the optimiser's strategy based on what works.
Some directions this project could go:
- Parallel experiment rollout - Instead of making one change at a time, the optimiser could spawn multiple subagents to try different approaches concurrently (using git branches or similar), then pick the best result based on eval scores, similar to how GRPO works in RL.
- User-in-the-loop CLI - Let users approve changes before they're applied, offer suggestions to the optimiser, and use git integration so changes can be monitored and manually reverted.
- Configurable stopping conditions - Let users specify when to finish (e.g., "stop when >90% of evals pass" or "stop after 5 iterations without progress").
The following environment variables are used to configure LLM access:
| Variable | Description | Required |
|---|---|---|
| `LLM_MODEL` | Model identifier for the optimiser agent (e.g., `openrouter/anthropic/claude-sonnet-4.5`) | Yes |
| `LLM_API_KEY` | API key for the optimiser agent's LLM provider | Yes |
| `TARGET_LLM_MODEL` | Model identifier for the target agent being optimised (used in scenario tests) | For tests |
| `TARGET_LLM_API_KEY` | API key for the target agent's LLM provider | For tests |
Scenario tests validate that the optimiser agent can successfully improve target agents in different situations. Each scenario provides a deliberately flawed target agent and verifies the optimiser can identify and fix the issue.
Test Types:
- Oracle: Fully simulated scenarios with scripted LLM responses. Tests end-to-end flow without API calls.
- Real LLM: Uses actual LLM API calls. Tests that a model can perform the optimisation.
```bash
# Oracle test (no API key needed)
uv run pytest -m oracle tests/scenarios/test_fix_obvious_bug_in_tool.py -v

# Real LLM test
export LLM_MODEL="anthropic/claude-sonnet-4.5"
export LLM_API_KEY="..."
uv run pytest -m real_llm tests/scenarios/test_fix_obvious_bug_in_tool.py -v --log-cli-level=INFO
```

| Scenario | What it tests | Eval method |
|---|---|---|
| `test_fix_obvious_bug_in_tool` | Fixing a bug in a tool implementation | In-memory (exec) |
| `test_coding_agent_bad_system_message` | Fixing flawed reasoning in a system message | Terminal Bench via Harbor |
| `test_coding_agent_missing_bash_tool` | Adding a missing tool to an agent | Terminal Bench via Harbor |
1. Fix Obvious Bug in Tool (test_fix_obvious_bug_in_tool.py)
Tests whether the optimiser can identify and fix an obvious bug in tool code. The target is a calculator agent whose add operation incorrectly multiplies the result by 9 (`return (a + b) * 9` instead of `return a + b`). The optimiser should read the code, spot the bug, and fix it.
- Target agent: Simple calculator with add/subtract/multiply/divide
- Eval: In-memory execution - the fixed `calculate()` function is exec'd and tested directly (sketched after this list)
- Success criteria: `5 + 7 = 12` (not 108)
- Supports: Oracle and Real LLM tests
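A rough sketch of how such an in-memory eval could work; the file name, `calculate()` signature, and pass/fail convention below are assumptions:

```python
# Hypothetical sketch of an in-memory eval - the file name, calculate() signature,
# and pass/fail convention are assumptions.
def eval_calculator(tool_file: str = "target_agent/tools.py") -> bool:
    source = open(tool_file).read()
    namespace: dict = {}
    exec(source, namespace)                       # load the (possibly fixed) calculate()
    result = namespace["calculate"]("add", 5, 7)  # assumed signature: (operation, a, b)
    return result == 12                           # fails while the bug (108) is present
```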
2. Coding Agent with Bad System Message (test_coding_agent_bad_system_message.py)
Tests whether the optimiser can identify and fix completely broken logic in a system message. The target coding agent's system message instructs it to "call the finish action as your first action" before doing any work - causing immediate task termination.
- Target agent: A coding agent with file read/write/edit tools
- Eval: Terminal Bench 2.0 via Harbor (see below)
- Success criteria: System message no longer contains "call finish first" instructions
- Supports: Real LLM tests only
3. Coding Agent Missing Bash Tool (test_coding_agent_missing_bash_tool.py)
Tests whether the optimiser can add a missing tool when the agent lacks capability. The target coding agent has file operations but no bash/shell execution tool, which is required for many Terminal Bench tasks.
- Target agent: A coding agent with file read/write/edit tools but no bash tool
- Eval: Terminal Bench 2.0 via Harbor (see below)
- Success criteria: Agent code contains a bash tool class and parser mapping
- Supports: Real LLM tests only
Two scenarios (test_coding_agent_bad_system_message and test_coding_agent_missing_bash_tool) use Terminal Bench 2.0 for evaluation. Terminal Bench provides coding tasks where an agent must interact with a Linux environment to complete file manipulation, git operations, and other terminal tasks.
How it works:
- The scenario sets up a temp directory with the target agent's code
- For each eval iteration, the `TBenchEvalRunner` copies the agent code to `./eval_runs/iter_N/`
- It validates the agent can be imported as a Python module
- It runs the Harbor CLI (`harbor run --dataset terminal-bench@2.0 --agent-import-path ...`)
- Harbor runs the agent against Terminal Bench tasks in Docker containers
- Results are parsed from the Harbor jobs directory (CTRF format + trajectories)
- Trajectories are available for the optimiser's trajectory analysis subagents
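A hedged sketch of how an eval callback might wrap this pipeline; `TBenchEvalRunner` is the project's class, but the `run()` method and result shape shown here are assumptions for illustration:

```python
# Hypothetical glue code - the run() method and result structure are assumptions.
def make_eval_callback(runner):
    """Wrap a TBenchEvalRunner-like object as the optimiser's eval callback."""
    def run_evals(eval_ids=None):
        # The runner copies the agent code to ./eval_runs/iter_N/, invokes the
        # Harbor CLI, and parses CTRF results + trajectories from the jobs directory.
        results = runner.run(eval_ids)  # assumed method name
        return results                  # e.g. {eval_id: {"passed": bool, "trajectory_path": str}}
    return run_evals
```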
Harbor configuration:
```python
TBenchConfig(
    coding_llm_model="openrouter/qwen/qwen3-30b-a3b",  # or via TARGET_LLM_MODEL env var
    coding_llm_api_key="...",                          # via TARGET_LLM_API_KEY env var
    n_concurrent=8,
    dataset="terminal-bench@2.0",
)
```

Running Terminal Bench scenarios:
```bash
export LLM_MODEL="anthropic/claude-sonnet-4.5"
export LLM_API_KEY="..."
export TARGET_LLM_MODEL="openrouter/qwen/qwen3-30b-a3b"
export TARGET_LLM_API_KEY="..."
uv run pytest -m real_llm tests/scenarios/test_coding_agent_missing_bash_tool.py -v --log-cli-level=INFO
```

Note: These tests require Harbor to be installed and configured, and will make real API calls to both the optimiser LLM and the target agent LLM.
