Merged
49 commits
5e07364
Thinking... to Processing... for agnosticism
dwash96 Dec 21, 2025
c06989a
tweak readme
ErichBSchulz Dec 22, 2025
ea14ba7
tweaks
ErichBSchulz Dec 23, 2025
22fe4ab
begin cleanup
ErichBSchulz Dec 23, 2025
06b5b04
feat: Add --dry mode to skip docker check and tests
ErichBSchulz Dec 23, 2025
1738021
chore: Run linter and format code
ErichBSchulz Dec 23, 2025
5eaf450
feat: Add dry run option to benchmark
ErichBSchulz Dec 23, 2025
1b0d525
chore: Run linter on benchmark files
ErichBSchulz Dec 23, 2025
c685caf
feat: Replace print with logging and add verbose/quiet flags
ErichBSchulz Dec 23, 2025
c9d0f06
chore: Run linter
ErichBSchulz Dec 23, 2025
a8e8a1b
fix: Change default benchmark exercises directory
ErichBSchulz Dec 23, 2025
0121aeb
refactor: Add logging and comments to resolve_dirname
ErichBSchulz Dec 23, 2025
c70e766
feat: Rename dirname to results_dir for clarity
ErichBSchulz Dec 23, 2025
8430e02
fix: Replace asserts with explicit error logging and exit
ErichBSchulz Dec 23, 2025
601e8c3
fix: Resolve linter errors in benchmark script
ErichBSchulz Dec 23, 2025
bd5afe2
hacking
ErichBSchulz Dec 23, 2025
85e1556
feat: Add support for new cat exercise structure
ErichBSchulz Dec 23, 2025
14cb852
fix: Run linter and fix formatting issues
ErichBSchulz Dec 23, 2025
7df0b0f
chore: Add logging for found exercises and metadata
ErichBSchulz Dec 23, 2025
f24d56d
fix: Import importlib_resources at the top level
ErichBSchulz Dec 23, 2025
b021795
fix: Move models import to top level in benchmark script
ErichBSchulz Dec 23, 2025
a3dc824
refactor: Dry out run_test code for single and multi-threaded execution
ErichBSchulz Dec 23, 2025
3cc0011
fix: Correct indentation in benchmark script
ErichBSchulz Dec 23, 2025
f50685c
fix: Set commit hash to '???????' when dry run
ErichBSchulz Dec 23, 2025
1969da4
docs: Update --hash-re help text for fractional set division
ErichBSchulz Dec 23, 2025
02164b6
fix: Update placeholder commit hash in benchmark script
ErichBSchulz Dec 23, 2025
9fe5f24
docs: Update hash-re help text with nth char and hex range examples
ErichBSchulz Dec 23, 2025
3f25430
docs: Add enhancements section to benchmark README
ErichBSchulz Dec 23, 2025
a1c011f
chore: Update benchmark README with linting fixes
ErichBSchulz Dec 23, 2025
c9b13bb
fix: Add debug logging for metadata parsing
ErichBSchulz Dec 23, 2025
ea4df0f
refactor: Support new dir structure and dry run
ErichBSchulz Dec 23, 2025
c29a88e
chore: Fix linter warnings in benchmark.py
ErichBSchulz Dec 23, 2025
2190623
docs: Add debug plan for benchmark script
ErichBSchulz Dec 23, 2025
c7f89c9
chore: Update benchmark debug plan for async/await
ErichBSchulz Dec 23, 2025
8390a20
feat: Update benchmark script for async Coder
ErichBSchulz Dec 23, 2025
e7e56f0
fix: Handle None results_dir in benchmark script
ErichBSchulz Dec 23, 2025
86a2b20
fix: Add helpful error message with paths to resolve_dirname
ErichBSchulz Dec 23, 2025
c79c7bf
fix: Add debug logging for directory resolution and result loading
ErichBSchulz Dec 23, 2025
28dfc22
fix: Exit when prior unnamed runs exist
ErichBSchulz Dec 23, 2025
fa13a81
refactor: Simplify benchmark result directory structure
ErichBSchulz Dec 23, 2025
b568600
fix: Improve readability of language filtering in benchmark
ErichBSchulz Dec 23, 2025
54442fe
change to gem
ErichBSchulz Dec 23, 2025
8150fef
refactor: Copy new exercises when continuing benchmark runs
ErichBSchulz Dec 23, 2025
29e3c5d
fix: Format benchmark.py to conform to linter
ErichBSchulz Dec 23, 2025
9c2359e
refactor: Update benchmark README with Cecli Cats details
ErichBSchulz Dec 23, 2025
663cba4
fix: Update benchmark README with minor formatting changes
ErichBSchulz Dec 23, 2025
65bdb1d
yada
ErichBSchulz Dec 23, 2025
68c00af
Merge remote-tracking branch 'dwash/main' into cat
ErichBSchulz Dec 23, 2025
fde8469
yada
ErichBSchulz Dec 23, 2025
8 changes: 4 additions & 4 deletions benchmark/Dockerfile
@@ -57,8 +57,8 @@ RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
core-js@3.37.1 \
eslint@8.49.0

COPY . /aider
RUN pip3 install --no-cache-dir --upgrade pip uv
RUN uv pip install --system --no-cache-dir -e /aider[dev]
RUN git config --global --add safe.directory /aider
WORKDIR /aider
COPY . /cecli
RUN uv pip install --system --no-cache-dir -e /cecli[dev]
RUN git config --global --add safe.directory /cecli
WORKDIR /cecli
223 changes: 128 additions & 95 deletions benchmark/README.md
@@ -1,147 +1,180 @@

# Aider benchmark harness

Aider uses benchmarks to quantitatively measure how well it works
with various LLMs.
Before `cecli` was born, the old `aider` used benchmarks to quantitatively
measure how well it worked with various LLMs.

This directory holds the harness and tools needed to run the benchmarking suite.

If you're familiar with the old `aider` benchmarking setup, see the "What's new..."
section below.

## Background

The benchmark is based on the [Exercism](https://github.com/exercism/python) coding exercises.
This
benchmark evaluates how effectively aider and LLMs can translate a
natural language coding request into executable code saved into
files that pass unit tests.
It provides an end-to-end evaluation of not just
the LLM's coding ability, but also its capacity to *edit existing code*
and *format those code edits* so that aider can save the
edits to the local source files.

See [this writeup for a longer discussion about the benchmark](https://aider.chat/2024/12/21/polyglot.html).

The benchmark is intended to be run *inside a docker container*.
This is because the benchmarking harness will be
taking code written by an LLM
and executing it without any human review or supervision!
The LLM could generate dangerous python that harms your system, like this: `import os; os.system("sudo rm -rf /")`.
The benchmark was based on the [Exercism](https://github.com/exercism/python)
coding exercises. This benchmark evaluates how effectively aider and LLMs can
translate a natural language coding request into executable code saved into
files that pass unit tests. It provides an end-to-end evaluation of not just the
LLM's coding ability, but also its capacity to _edit existing code_ and _format
those code edits_ so that aider can save the edits to the local source files.

See
[this writeup for a longer discussion about the benchmark](https://aider.chat/2024/12/21/polyglot.html).

The benchmark is intended to be run _inside a docker container_. This is because
the benchmarking harness will be taking code written by an LLM and executing it
without any human review or supervision! The LLM could generate dangerous python
that harms your system, like this: `import os; os.system("sudo rm -rf /")`.
Running inside a docker container helps limit the damage that could be done.

## Usage

There are 3 main tasks involved in benchmarking aider:
There are 3 main tasks involved in benchmarking:

1. Install and setup for benchmarking.
1. Install and setup.

2. Run the benchmark to measure performance across all the exercises.
2. Run the benchmark.

3. Generate a summary report of how many of the exercises succeeded or failed.
3. Analysis.

### Setup for benchmarking
### Setup

First, prepare all the groundwork for running the benchmarks.
These steps only need to be done once.

```
# Clone the aider repo
git clone https://github.com/Aider-AI/aider.git
ORG=Aider-AI
REPO=aider
# Clone the main repo
git clone https://github.com/$ORG/$REPO.git

# Create the scratch dir to hold benchmarking results inside the main aider dir:
cd aider
# Create the scratch dir to hold benchmarking results inside the main repo:
cd $REPO
mkdir tmp.benchmarks

# Clone the repo with the exercises
git clone https://github.com/Aider-AI/polyglot-benchmark tmp.benchmarks/polyglot-benchmark
git clone https://github.com/$ORG/polyglot-benchmark tmp.benchmarks/polyglot-benchmark

# Build the docker container
./benchmark/docker_build.sh
```

### Running the benchmark
### Running the benchmarks

Launch the docker container and run the benchmark inside it:

```
# Launch the docker container
# You probably want to tweak this script to import your service keys.
# It's currently configured to import GEMINI_API_KEY only.
# PRs welcome to grab the keys more effectively without causing anxiety.
./benchmark/docker.sh

# Inside the container, install aider as a development build.
# This way you're running the code that you cloned above, including any local changes.
# TODO: this step should be included in the Dockerfile
pip install -e .[dev]

# Run the benchmark:
./benchmark/benchmark.py a-helpful-name-for-this-run --model gpt-3.5-turbo --edit-format whole --threads 10 --exercises-dir polyglot-benchmark
```

The above will create a folder `tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run` with benchmarking results.
Run like this, the script will run all the exercises in a random order.

You can run `./benchmark/benchmark.py --help` for a list of all the arguments, but here are the most useful to keep in mind:

- `--model` is the name of the model, same as you would pass directly to `aider`.
- `--edit-format` is the name of the edit format, same as you would pass directly to `aider`. When working with an experimental LLM, I recommend starting with `whole`
- `--threads` specifies how many exercises to benchmark in parallel. Start with a single thread if you are working out the kinks on your benchmarking setup or working with a new model, etc. Once you are getting reliable results, you can speed up the process by running with more threads. 10 works well against the OpenAI APIs.
- `--num-tests` specifies how many of the tests to run before stopping. This is another way to start gently as you debug your benchmarking setup.
- `--keywords` filters the tests to run to only the ones whose name match the supplied argument (similar to `pytest -k xxxx`).
- `--read-model-settings=<filename.yml>` specify model settings, see here: https://aider.chat/docs/config/adv-model-settings.html#model-settings
- `--map-tokens` sets a token budget for the repo map sent with each request. Set `0` to disable the repo map. This lets you enable repo map usage for any model (e.g., `--map-tokens 1024`).
The above will create a folder
`tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run` with
benchmarking results. Run like this, the script will run all the exercises in a
random order.

You can run `./benchmark/benchmark.py --help` for a list of all the arguments,
but here are the most useful to keep in mind:

- `--model` is the name of the model, same as you would pass directly to
`aider`.
- `--edit-format` is the name of the edit format, same as you would pass
directly to `aider`. When working with an experimental LLM, I recommend
starting with `whole`.
- `--sets` runs specific groups of tests using the `sets` in the `cat.yaml`.
(Hopefully, the sets will grow with time but currently it just bookmarks
the classic "polyglot" test battery.)
- `--hash-re` allows for deterministic slicing of the exercise set based on the
exercise hash. This is useful for quickly grabbing a consistent subset or for
k-fold cross-validation (a rough sketch of the idea follows this list). For example:
- `^0`: 1/16 of the set.
- `^[01]`: 1/8 of the set.
- `^[0-3]`: 1/4 of the set.
- `^.{2}[4-7]`: 1/4 of the set, using the 3rd character of the hash.
- `--threads` specifies how many exercises to benchmark in parallel. Start with
a single thread if you are working out the kinks on your benchmarking setup or
working with a new model, etc. Once you are getting reliable results, you can
speed up the process by running with more threads. 10 works well against the
OpenAI APIs.
- `--num-tests` specifies how many of the tests to run before stopping. This is
another way to start gently as you debug your benchmarking setup.
- `--keywords` filters the tests to run to only the ones whose name match the
supplied argument (similar to `pytest -k xxxx`).
- `--read-model-settings=<filename.yml>` specifies model settings; see here:
https://aider.chat/docs/config/adv-model-settings.html#model-settings
- `--map-tokens` sets a token budget for the repo map sent with each request.
Set `0` to disable the repo map. This lets you enable repo map usage for any
model (e.g., `--map-tokens 1024`).
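
A rough sketch of the deterministic slicing behind `--hash-re` is below. It hashes
each exercise name and keeps only the ones whose hex digest matches the supplied
pattern; the specific hashing scheme (SHA-1 of the exercise name) is an assumption
for illustration and may differ from what `benchmark.py` actually does.

```python
# Illustrative sketch only -- the real benchmark.py may hash exercises differently.
import hashlib
import re


def slice_exercises(exercise_names, hash_re):
    """Keep exercises whose hex hash matches the given regex, e.g. '^[0-3]'."""
    pattern = re.compile(hash_re)
    selected = []
    for name in exercise_names:
        digest = hashlib.sha1(name.encode("utf-8")).hexdigest()  # assumed hash
        if pattern.search(digest):
            selected.append(name)
    return selected


# '^[0-3]' keeps roughly 1/4 of the exercises, and always the same 1/4.
print(slice_exercises(["two-fer", "anagram", "bowling", "forth"], "^[0-3]"))
```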

### Benchmark report

You can generate stats about any benchmark, including ones which are still running.
You don't need to run this inside the docker container, as it is just
You can generate stats about any benchmark, including ones which are still
running. You don't need to run this inside the docker container, as it is just
collecting stats, not executing unsafe python.

```
# Generate stats for a specific benchmarking directory
./benchmark/benchmark.py --stats tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run
```

The benchmark report is a yaml record with statistics about the run:

```yaml
- dirname: 2024-07-04-14-32-08--claude-3.5-sonnet-diff-continue
test_cases: 225
model: claude-3.5-sonnet
edit_format: diff
commit_hash: 35f21b5
pass_rate_1: 57.1
pass_rate_2: 77.4
percent_cases_well_formed: 99.2
error_outputs: 23
num_malformed_responses: 4
num_with_malformed_responses: 1
user_asks: 2
lazy_comments: 0
syntax_errors: 1
indentation_errors: 0
exhausted_context_windows: 0
test_timeouts: 1
command: aider --sonnet
date: 2024-07-04
versions: 0.42.1-dev
seconds_per_case: 17.6
total_cost: 3.6346
```

The key statistics are the `pass_rate_#` entries, which report the
percent of the tasks which had all tests passing.
There will be multiple of these pass rate stats,
depending on the value of the `--tries` parameter.

The yaml also includes all the settings which were in effect for the benchmark run.
It also reports the git hash of the repo at the time that the benchmark was
run, with `(dirty)` if there were uncommitted changes.
It's good practice to commit the repo before starting a benchmark run.
This way the `model`, `edit_format` and `commit_hash`
should be enough to reliably reproduce any benchmark run.

You can see examples of the benchmark report yaml in the
[aider leaderboard data files](https://github.com/Aider-AI/aider/blob/main/aider/website/_data/).


## Limitations, notes

- Contributions of benchmark results are welcome! Submit results by opening a PR with edits to the
[aider leaderboard data files](https://github.com/Aider-AI/aider/blob/main/aider/website/_data/).
- These scripts are not intended for use by typical aider end users.
- Some of these tools are written as `bash` scripts, so it will be hard to use them on Windows.
The benchmark report is a yaml record with statistics about the run.

The key statistics are the `pass_rate_#` entries, which report the percent of
the tasks which had all tests passing. There will be multiple of these pass rate
stats, depending on the value of the `--tries` parameter.

The yaml also includes all the settings which were in effect for the benchmark
run. It also reports the git hash of the repo at the time that the benchmark was
run, with `(dirty)` if there were uncommitted changes. It's good practice to
commit the repo before starting a benchmark run. This way the `model`,
`edit_format` and `commit_hash` should be enough to reliably reproduce any
benchmark run.
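
If you want to skim a report programmatically, a minimal sketch along these lines
works, assuming you have saved the `--stats` output to a yaml file (the file name
below is a placeholder, not something the script produces):

```python
# Minimal sketch for skimming a saved benchmark report.
# "benchmark-report.yaml" is a placeholder; point it at wherever you saved the stats.
import yaml

with open("benchmark-report.yaml") as f:
    runs = yaml.safe_load(f)  # the report is a list of per-run records

for run in runs:
    pass_rates = {k: v for k, v in run.items() if k.startswith("pass_rate_")}
    print(run["dirname"], run["model"], run["edit_format"], pass_rates)
```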

## Contributing

Contributions of benchmark results and tests are welcome! Submit results by opening a PR.

Note the roadmap priorities:

1. Complete 'set up records' to support smart caching.
2. Atomic data collection. Most of the data is already saved, but protocols for sharing it are still needed.
3. **Dimensional Parameter Walking**, allowing n-dimensional parameter tuning and
facilitating a "gradient descent" approach to optimisation across multiple parameters.
The test runner should accept n lists of options, e.g. ["thinking: 100", "thinking: 200", "thinking: 400"], ["optionA: B", "optionD: C"]; see the sketch after this list.
4. Smart Caching, so the runner can optionally skip any tests for which "similar" result data
is already available, based on fuzzy metadata matching. This aids iterative testing:
when adding a new option to a list of permutations, only the new permutations need to
be run, and when new Cats join the collection the data can be collected incrementally.
5. Data aggregation and analysis. These will be separate specialised tools.
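
The sketch below illustrates the parameter-walking idea from item 3: expand n lists
of options into every combination and run the benchmark once per combination. The
function and option names are hypothetical, not part of the current runner.

```python
# Hypothetical sketch of "dimensional parameter walking": the runner would
# accept n lists of options and execute one benchmark run per combination.
from itertools import product


def walk_parameters(*option_lists):
    """Yield one run configuration per combination of the supplied option lists."""
    for combo in product(*option_lists):
        yield list(combo)


for run_options in walk_parameters(
    ["thinking: 100", "thinking: 200", "thinking: 400"],
    ["optionA: B", "optionD: C"],
):
    print(run_options)  # 3 x 2 = 6 combinations, e.g. ['thinking: 100', 'optionA: B']
```

With smart caching (item 4), any combination whose results are already on disk could simply be skipped.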

## Limitations

- These scripts are not intended for use by typical `cecli` end users.
- Some of the old (possibly deprecated) tools are written as `bash` scripts, so they will be hard
to use on Windows.
- Currently the JS and C++ tests appear broken.

## What's new with Cecli Cats?

The benchmark has evolved into a collection of **Cecli Atomic Tests (Cats)**.

- **YAML Metadata**: Every Cat has its own `cat.yaml` file containing metadata,
including a unique UUID that may or may not be useful later (a hypothetical
example appears at the end of this section).
- **Evolving Collection**: The directory structure of the Cats is laid out to
facilitate the growth and evolution of the collection. As the benchmark
matures, Cats will come and go.
- **Simplified Runner**: The test runner is being simplified to focus on its
core job: executing tests and recording results. Downstream aggregation and
analysis of results will be shifted to other tools and projects.
- **Subset Filtering**: see `--sets` above.
- **K-fold Evaluation Slicing**: The `--hash-re` option allows for deterministic
slicing of the exercise set (now `cats`) based on the exercise hash.
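
For a concrete picture, here is a hypothetical `cat.yaml`. Only the UUID and the
`sets` key are taken from the description above; the other fields are invented for
illustration and the real schema may differ.

```yaml
# Hypothetical cat.yaml sketch -- the real schema may differ.
uuid: 1b4e28ba-2fa1-11d2-883f-0016d3cca427   # unique id for this Cat
name: two-fer                                # invented field, for illustration
language: python                             # invented field, for illustration
sets:
  - polyglot   # the classic "polyglot" battery, selectable via --sets
```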