Thank you for your interest in contributing a new task to ToolArena! This guide walks you through the steps to add a new task (tool) to this benchmark repository by opening a pull request.
ToolArena is a benchmark for LLM-based agentic "tool creation."
Specifically, the purpose of this benchmark is to assess how well LLM agents can create LLM-compatible "tools" from GitHub repositories. You can think of a tool as a Python function that performs a given task: it has input arguments, and returns some sort of output.
Note
The terms "task" and "tool" are used interchangeably in this guide.
So, what makes a good task? A good task for ToolArena is a well-scoped, functional wrapper around an open-source GitHub repository that is associated with a research paper. The task should be meaningful for LLM-based tool generation -- simple enough to be learnable from documentation and repo code, yet non-trivial enough to require some reasoning and code synthesis. Importantly, it must also be testable, i.e. you can write code to assess whether the task was successful.
When you create a new task, you place these files in a directory under tasks/. For example, if your task is named my_nifty_task, your files would live under tasks/my_nifty_task/.
Each task in ToolArena consists of:
- A task definition (task.yaml) containing basic metadata, inputs/outputs, test invocations, etc.
- A reference solution (i.e. implementation), consisting of:
  - An installation script (install.sh) that installs all necessary dependencies to run the tool.
  - A code implementation (implementation.py) containing the Python implementation for the tool itself.
- Any data files needed to invoke the tool (data/ and data/tests).
- Several unit tests to assess the correctness of a candidate implementation (tests.py).
This guide will walk you through creating all the necessary files to define a task. By the end of this guide, you will have produced the following directory structure:
tasks/
└── my_nifty_task/
    ├── __init__.py
    ├── task.yaml
    ├── install.sh
    ├── implementation.py
    ├── tests.py
    └── data/
        ├── .gitignore
        ├── download.sh   # (optional script to download external data)
        ├── ...           # (optional) (your data files)
        └── tests/
            └── ...       # (optional) (additional data files for tests)
- Fork the repository
- Install ToolArena
- Create a new task
- Fill in your task definition (task.yaml)
- Add data required to run the task
- Define the example invocation
- Generate install.sh and implementation.py
- Write the install.sh script
- Write implementation.py
- Check that the example invocation works
- Write tests
- Run your tests
- Submit a Pull Request!
- Click the "Fork" button in the GitHub UI to create your personal copy of ToolArena.
- Clone your fork locally:
  git clone https://github.com/<YOUR_USERNAME>/ToolArena.git
  cd ToolArena
- Install uv.
- Install this project in a virtual environment:
  uv sync --all-groups
  This will create a virtual environment at .venv/.
- Activate the virtual environment:
  source .venv/bin/activate
- Check that the installation succeeded by verifying that the toolarena command exists:
  toolarena --help
- Install Docker and pull the latest ToolArena image:
  docker pull ghcr.io/katherlab/toolarena:cpu
  docker pull ghcr.io/katherlab/toolarena:cuda  # (only required if your task requires GPU)
- Create a .env file at the root of the repository (it can be empty for now):
  touch .env
Use the following command to create a new task:
toolarena init my_nifty_task
This will create a new task at tasks/my_nifty_task with a task.yaml file to get you started, alongside a data/ directory which you will later use to store data files necessary for your task.
Your directory structure will now look as follows:
tasks/
└── my_nifty_task/
    ├── __init__.py   # (empty file)
    ├── task.yaml
    └── data/
        ├── .gitignore
        └── download.sh
You will find an automatically generated task.yaml file in your task folder (tasks/my_nifty_task/task.yaml).
Edit the task.yaml file to define your task.
Take special care to write a clear task description, and make sure all arguments and returns are specified.
Note
At any point, you can run the following command to see what the Python function signature of your task will look like:
toolarena signature my_nifty_task
Tip
You can use an existing task in tasks/ as a reference (for instance, tasks/conch_extract_features/task.yaml).
At this point, you should at least define the following parts (the task.yaml file includes extensive documentation in the form of comments to help you):
- name: matches the folder name (e.g., my_nifty_task).
- description: one or two sentences describing what your tool does.
- repo: your tool should wrap functionality from a GitHub repository, so please specify:
  repo:
    name: MyRepo
    url: "https://github.com/username/MyRepo"
    commit: abc123
    branch: main
    env:  # (optional)
      - name: MY_TOKEN
        value: "${env:MY_TOKEN}"
- arguments: inputs your tool expects.
- returns: outputs your tool produces.
Important
If your task requires a GPU, ensure you set requires: cuda in the task.yaml file.
In the env section of repo, you may optionally define environment variables that should be set for the tool.
For example, to set the environment variable MY_ENV_VAR to the value "abc", you would write:
repo:
  env:
    - name: MY_ENV_VAR
      value: "abc"
The main use case for environment variables is to supply secret tokens such as API keys. For this, you may use a special syntax ("${env:MY_TOKEN}") to insert environment variables from your local environment, defined in the top-level .env file.
Important
The .env file should be placed at the top level of the ToolArena repository, not within your task folder tasks/my_nifty_task/.env!
For example, if you set the following in your .env file:
.env:
HF_TOKEN=my_huggingface_token
Then you can tell ToolArena that the local HF_TOKEN environment variable defined in .env should be available to the tool as an environment variable, also named HF_TOKEN:
tasks/my_nifty_task/task.yaml:
repo:
  env:
    - name: HF_TOKEN
      value: "${env:HF_TOKEN}"
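Inside the tool, the forwarded variable can then be read like any other environment variable. The following is a minimal sketch and not part of ToolArena: the function name my_token_task and its return value are hypothetical, and only the variable name HF_TOKEN is taken from the example above.

```python
def my_token_task() -> dict:
    """Hypothetical tool that needs the HF_TOKEN secret (illustration only)."""
    import os  # imported inside the function body rather than globally

    # HF_TOKEN is the variable forwarded through the env section of task.yaml.
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; check the top-level .env file "
            "and the env section of task.yaml"
        )
    # A real tool would now pass `token` to the library that needs it.
    return {"token_available": True}  # hypothetical return value
```

Failing early with a clear message makes a missing or misplaced .env file easier to diagnose than an authentication error raised deep inside a third-party library.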
If your task requires external data as input, add these files to the data/ folder within your task. You may place any files / datasets in there that your tool requires as input for the example invocation or for the tests later on.
If your task requires large files (e.g. larger than a few MB), please add these files to the data/.gitignore file and add commands in the data/download.sh script so they can be downloaded by the user of the benchmark.
Your directory structure will now look as follows:
tasks/
└── my_nifty_task/
    ├── __init__.py
    ├── task.yaml
    └── data/
        ├── .gitignore
        ├── download.sh
        └── ...   # your data files
In the example section of task.yaml, define an example invocation by providing a concrete value for each input argument.
Later on, you will verify that your tool implementation works by calling your tool using this example invocation.
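Conceptually, an example invocation is one concrete set of keyword arguments for the tool function. The sketch below illustrates that idea in plain Python; it does not show the task.yaml syntax or how ToolArena runs the invocation. The function is a stand-in modeled on the rounding example used in this guide, and the argument value 2.7 is an arbitrary assumption.

```python
import inspect


def my_nifty_task(a: float) -> dict:
    """Stand-in tool: round the input to the nearest integer."""
    import math

    return {"rounded": int(math.floor(a + 0.5))}


# One concrete value per input argument (hypothetical example values).
example_arguments = {"a": 2.7}

# Verify that every argument of the tool has been given a value.
expected = set(inspect.signature(my_nifty_task).parameters)
missing = expected - set(example_arguments)
assert not missing, f"example invocation is missing arguments: {missing}"

result = my_nifty_task(**example_arguments)
print(result)  # {'rounded': 3}
```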
ToolArena provides an automated mechanism to generate starter code in the files needed for the implementation:
toolarena generate my_nifty_task
Now, your folder structure will look like this:
tasks/
└── my_nifty_task/
    ├── __init__.py
    ├── task.yaml
    ├── implementation.py
    ├── install.sh
    ├── tests.py
    └── data/
        ├── .gitignore
        ├── download.sh
        └── ...   # your data files
Importantly, ToolArena will automatically place starter code in the implementation.py file.
This code contains the Python function signature of the task you defined in task.yaml.
Caution
If you need to change the description, arguments or returns in your task.yaml, be sure to re-generate the Python function signature and update it in implementation.py:
toolarena signature my_nifty_task
Now, write the install.sh script to install all necessary dependencies for your tool.
The toolarena generate command you ran earlier has already added the git clone command to your install script to get you started.
To try out the install script, simply run:
toolarena build my_nifty_task
This will create a Docker image using your install.sh script to install all dependencies.
If this succeeds without errors, you can start an interactive shell in a container based on this image using:
docker run -it --rm --env-file .env toolarena-tool:my_nifty_task /bin/bash
# Or, if your task requires CUDA:
docker run -it --rm --env-file .env --gpus all toolarena-tool:my_nifty_task /bin/bash
Tip
You can start with a minimal install.sh file containing only the git clone command, and then run:
toolarena build my_nifty_task
docker run -it --rm --env-file .env toolarena-tool:my_nifty_task /bin/bash
In the interactive shell, run the commands you need to install all the dependencies.
Take note of the commands you ran and put them in the install.sh file!
Example:
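#!/bin/bash
set -e
git clone https://github.com/username/MyRepo /workspace/MyRepo
cd /workspace/MyRepo && git checkout abc123
# Refresh the package index and install non-interactively so the build cannot stall on a prompt
apt-get update && apt-get install -y somepackage
pip install -e /workspace/MyRepo
pip install someotherpackage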
The implementation.py file must define one function whose name matches the name field in task.yaml and whose arguments match those defined there.
Example:
def my_nifty_task(a: float) -> dict:
    """
    Round the input to the nearest integer.

    Args:
        a: the number to be rounded

    Returns:
        dict with the following structure:
        {
          'rounded': int  # The rounded number
        }
    """
    import math  # Note: we're importing modules inside the function, not globally!

    # Round half up: floor(a + 0.5) gives the nearest integer
    rounded = int(math.floor(a + 0.5))
    return {"rounded": rounded}
Note
Do not define imports globally; instead put your import statements in the body of the function, as shown in the example above.
To check that the example invocation runs as expected, run the following command:
toolarena run my_nifty_task example
Check that the result and standard output are as expected for your task.
If there is an error, modify the install.sh and/or implementation.py files to work correctly, and then re-run the toolarena run my_nifty_task example command.
Tip
If your tool takes a long time to run, you can inspect the output by running the following in a separate shell:
docker logs -f my_nifty_task
This will show you the standard output of your tool while it is running.
Tip
You can also attach VS Code to your task's Docker container by running:
toolarena debug my_nifty_task example
This will start a Docker container for the example invocation (specified in the example section of task.yaml). This command will provide instructions on how to attach VS Code to the container.
Define 2-3 test invocations in the invocation section of task.yaml.
These invocations are defined similarly to how you defined the example invocation.
Then, write unit tests for these invocations in the tests.py file.
You can use the unit tests at tasks/conch_extract_features/tests.py (or any of the other tasks) as an example of how to define these tests.
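To give a feel for the kind of assertions such tests typically make, here is a minimal pytest-style sketch. It is an assumption-laden illustration: it calls a stand-in function directly, whereas the real tests.py files check the results of the invocations defined in task.yaml using ToolArena's own helpers, which are not reproduced here. Refer to the existing tasks for the actual pattern.

```python
def my_nifty_task(a: float) -> dict:
    """Stand-in for the tool in implementation.py (rounds to the nearest integer)."""
    import math

    return {"rounded": int(math.floor(a + 0.5))}


def test_returns_expected_keys() -> None:
    # The output structure should match what task.yaml declares under returns.
    result = my_nifty_task(a=1.2)
    assert set(result) == {"rounded"}
    assert isinstance(result["rounded"], int)


def test_rounds_to_nearest_integer() -> None:
    # Check concrete values, not only the shape of the output.
    assert my_nifty_task(a=1.2)["rounded"] == 1
    assert my_nifty_task(a=2.7)["rounded"] == 3
```

Testing both the structure and the concrete values helps distinguish a candidate implementation that merely returns the right keys from one that actually solves the task.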
Tip
You can see the outputs that your tool implementation produces for each of your test invocations by running:
toolarena run my_nifty_task my_invocation  # (replace `my_invocation` with the name of the invocation which you provided in `task.yaml`)
Run the tests for your tool using the following command, and check that they all pass for your implementation:
pytest tasks/my_nifty_task
Nice work, you have successfully created a new task for the ToolArena benchmark. Now, open a new Pull Request on GitHub to submit this task!
Check out this PR as an example.
- Pin your dependencies to ensure reproducibility
- Use specific commits for GitHub repos instead of floating branches
- Minimize storage: Avoid checking in large data files
- Use env vars like HF_TOKEN for secure model downloading
- Refer to existing tasks as examples if unsure
- Try to write at least 2-3 test cases to validate outputs
Thank you for contributing to ToolArena! Feel free to reach out via GitHub Issues if you have any questions.