From 5e073642fc46894c552439011937a2aa4266aea6 Mon Sep 17 00:00:00 2001
From: Dustin Washington
Date: Sun, 21 Dec 2025 15:53:05 -0500
Subject: [PATCH 01/48] Thinking... to Processing... for agnosticism

---
 aider/tui/app.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/aider/tui/app.py b/aider/tui/app.py
index b4201275144..c16e34b94aa 100644
--- a/aider/tui/app.py
+++ b/aider/tui/app.py
@@ -450,7 +450,7 @@ def on_input_area_submit(self, message: InputArea.Submit):

         # Update footer to show processing
         footer = self.query_one(AiderFooter)
-        footer.start_spinner("Thinking...")
+        footer.start_spinner("Processing...")

         self.update_key_hints(generating=True)

From c06989ad6a75ccdfaf07021c49d4653453db37d1 Mon Sep 17 00:00:00 2001
From: Erich Schulz
Date: Mon, 22 Dec 2025 10:08:02 +1000
Subject: [PATCH 02/48] tweak readme

---
 benchmark/README.md | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/benchmark/README.md b/benchmark/README.md
index 988406de687..4207b8a24ae 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -1,13 +1,14 @@
 # Aider benchmark harness

-Aider uses benchmarks to quantitatively measure how well it works
+Before `cecli` was born, the old `aider` used benchmarks to quantitatively measure how well it worked
 with various LLMs.
+

 This directory holds the harness and tools needed to run the benchmarking suite.

 ## Background

-The benchmark is based on the [Exercism](https://github.com/exercism/python) coding exercises.
+The benchmark was based on the [Exercism](https://github.com/exercism/python) coding exercises.

 This benchmark evaluates how effectively aider and LLMs can
 translate a natural language coding request into executable code saved into
@@ -42,15 +43,17 @@ First, prepare all the groundwork for running the benchmarks.
 These steps only need to be done once.
 ```
-# Clone the aider repo
-git clone https://github.com/Aider-AI/aider.git
+ORG=Aider-AI
+REPO=aider
+# Clone the main repo
+git clone https://github.com/$ORG/$REPO.git

-# Create the scratch dir to hold benchmarking results inside the main aider dir:
-cd aider
+# Create the scratch dir to hold benchmarking results inside the main repo:
+cd $REPO
 mkdir tmp.benchmarks

 # Clone the repo with the exercises
-git clone https://github.com/Aider-AI/polyglot-benchmark tmp.benchmarks/polyglot-benchmark
+git clone https://github.com/$ORG/polyglot-benchmark tmp.benchmarks/polyglot-benchmark

 # Build the docker container
 ./benchmark/docker_build.sh
 ```
@@ -66,6 +69,7 @@ Launch the docker container and run the benchmark inside it:

 # Inside the container, install aider as a development build.
 # This way you're running the code that you cloned above, including any local changes.
+# TODO: this step should be included in the Dockerfile
 pip install -e .[dev]

 # Run the benchmark:
@@ -136,12 +140,12 @@ This way the `model`, `edit_format` and `commit_hash` should be enough
 to reliably reproduce any benchmark run.

 You can see examples of the benchmark report yaml in the
-[aider leaderboard data files](https://github.com/Aider-AI/aider/blob/main/aider/website/_data/).
+[aider leaderboard data files](https://github.com/$ORG/aider/blob/main/aider/website/_data/).

 ## Limitations, notes

 - Contributions of benchmark results are welcome! Submit results by opening a PR
 with edits to the
-[aider leaderboard data files](https://github.com/Aider-AI/aider/blob/main/aider/website/_data/).
+[aider leaderboard data files](https://github.com/$ORG/aider/blob/main/aider/website/_data/).
 - These scripts are not intended for use by typical aider end users.
 - Some of these tools are written as `bash` scripts, so it will be hard to use them on Windows.
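An editorial aside, not part of the patch series: the one-time setup that patch 02's README documents can be collapsed into a single idempotent script. This is a sketch under assumptions — the `clone_url` and `setup` helpers are hypothetical, the `ORG`/`REPO` defaults mirror the README, and `setup` is meant to run from the directory that should hold the clone.

```shell
#!/bin/bash
# Sketch of the one-time benchmark setup described in the README above.
set -u

ORG=${ORG:-Aider-AI}
REPO=${REPO:-aider}

# Hypothetical helper: build a GitHub clone URL from an org and a repo name.
clone_url() {
  echo "https://github.com/$1/$2.git"
}

# Hypothetical helper: run the one-time setup, skipping steps already done.
setup() {
  [ -d "$REPO" ] || git clone "$(clone_url "$ORG" "$REPO")"
  cd "$REPO" || return 1
  # Scratch dir that holds benchmarking results
  mkdir -p tmp.benchmarks
  # Repo with the exercises
  [ -d tmp.benchmarks/polyglot-benchmark ] ||
    git clone "$(clone_url "$ORG" polyglot-benchmark)" tmp.benchmarks/polyglot-benchmark
  # Build the docker container
  ./benchmark/docker_build.sh
}
```

Calling `setup` performs the clones and the docker build; re-running it skips work already done. The README clones `polyglot-benchmark` without a `.git` suffix — `git clone` treats both forms the same.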
From ea14ba730c8b7acb9f9a6f8653114148fe19f163 Mon Sep 17 00:00:00 2001
From: Erich Schulz
Date: Tue, 23 Dec 2025 10:46:13 +1000
Subject: [PATCH 03/48] tweaks

---
 benchmark/Dockerfile      |  8 ++++----
 benchmark/README.md       | 14 ++++++--------
 benchmark/docker.sh       | 33 +++++++++++++++++----------------
 benchmark/docker_build.sh |  6 +++---
 4 files changed, 30 insertions(+), 31 deletions(-)

diff --git a/benchmark/Dockerfile b/benchmark/Dockerfile
index a5926dab744..a210915e29e 100644
--- a/benchmark/Dockerfile
+++ b/benchmark/Dockerfile
@@ -57,8 +57,8 @@ RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
     core-js@3.37.1 \
     eslint@8.49.0

-COPY . /aider
 RUN pip3 install --no-cache-dir --upgrade pip uv
-RUN uv pip install --system --no-cache-dir -e /aider[dev]
-RUN git config --global --add safe.directory /aider
-WORKDIR /aider
+COPY . /cecli
+RUN uv pip install --system --no-cache-dir -e /cecli[dev]
+RUN git config --global --add safe.directory /cecli
+WORKDIR /cecli

diff --git a/benchmark/README.md b/benchmark/README.md
index 4207b8a24ae..4425d0e1deb 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -1,4 +1,3 @@
-
 # Aider benchmark harness

 Before `cecli` was born, the old `aider` used benchmarks to quantitatively measure how well it worked
 with various LLMs.
@@ -29,17 +28,16 @@ Running inside a docker container helps limit the damage that could be done.

 ## Usage

-There are 3 main tasks involved in benchmarking aider:
+There are 3 main tasks involved in benchmarking:

-1. Install and setup for benchmarking.
+1. Install and set up.

-2. Run the benchmark to measure performance across all the exercises.
+2. Run the benchmark.

-3. Generate a summary report of how many of the exercises succeeded or failed.
+3. Analyze the results.

-### Setup for benchmarking
+### Setup

-First, prepare all the groundwork for running the benchmarks.
 These steps only need to be done once.
 ```
@@ -59,7 +57,7 @@ git clone https://github.com/$ORG/polyglot-benchmark tmp.benchmarks/polyglot-ben
 ./benchmark/docker_build.sh
 ```

-### Running the benchmark
+### Running the benchmarks

 Launch the docker container and run the benchmark inside it:

diff --git a/benchmark/docker.sh b/benchmark/docker.sh
index 6f97b865e19..b4265a69401 100755
--- a/benchmark/docker.sh
+++ b/benchmark/docker.sh
@@ -1,19 +1,20 @@
 #!/bin/bash

+# FIXME - should be able to choose the keys to pass in
+#
 docker run \
-    -it --rm \
-    --memory=12g \
-    --memory-swap=12g \
-    --add-host=host.docker.internal:host-gateway \
-    -v `pwd`:/aider \
-    -v `pwd`/tmp.benchmarks/.:/benchmarks \
-    -e OPENAI_API_KEY=$OPENAI_API_KEY \
-    -e HISTFILE=/aider/.bash_history \
-    -e PROMPT_COMMAND='history -a' \
-    -e HISTCONTROL=ignoredups \
-    -e HISTSIZE=10000 \
-    -e HISTFILESIZE=20000 \
-    -e AIDER_DOCKER=1 \
-    -e AIDER_BENCHMARK_DIR=/benchmarks \
-    aider-benchmark \
-    bash
+  -it --rm \
+  --memory=12g \
+  --memory-swap=12g \
+  --add-host=host.docker.internal:host-gateway \
+  -v $(pwd):/cecli \
+  -v $(pwd)/tmp.benchmarks/.:/benchmarks \
+  -e GEMINI_API_KEY=$GEMINI_API_KEY \
+  -e PROMPT_COMMAND='history -a' \
+  -e HISTCONTROL=ignoredups \
+  -e HISTSIZE=10000 \
+  -e HISTFILESIZE=20000 \
+  -e AIDER_DOCKER=1 \
+  -e AIDER_BENCHMARK_DIR=/benchmarks \
+  cecli-cat \
+  bash

diff --git a/benchmark/docker_build.sh b/benchmark/docker_build.sh
index a6619bb5ce1..a132463ef17 100755
--- a/benchmark/docker_build.sh
+++ b/benchmark/docker_build.sh
@@ -3,6 +3,6 @@
 set -e

 docker build \
-    --file benchmark/Dockerfile \
-    -t aider-benchmark \
-    .
+  --file benchmark/Dockerfile \
+  -t cecli-cat \
+  .
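An editorial aside, not part of the patch series: the `FIXME` added to `docker.sh` in patch 03 notes that the keys passed into the container should be configurable. One way to do that is sketched below; the `forward_env_flags` helper and the list of key names are hypothetical, not part of the patch.

```shell
#!/bin/bash
# Build `docker run -e` flags for whichever of the named variables are
# actually set, so only the keys you have are forwarded into the container.
forward_env_flags() {
  local flags="" key
  for key in "$@"; do
    # ${!key} is bash indirect expansion: the value of the variable named by $key
    if [ -n "${!key:-}" ]; then
      flags="$flags -e $key=${!key}"
    fi
  done
  echo "$flags"
}

# Hypothetical usage inside docker.sh:
#   docker run $(forward_env_flags GEMINI_API_KEY OPENAI_API_KEY ANTHROPIC_API_KEY) ...
```

This string-building version relies on word splitting, so it breaks on values containing spaces; an array-based variant (`flags+=(-e "$key=${!key}")`) would be more robust at the cost of a longer script.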
From 22fe4abcf5b711934c23ef3eb7cea487f07d7baf Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 10:47:58 +1000 Subject: [PATCH 04/48] begin cleanup --- benchmark/benchmark.py | 163 +--- benchmark/benchmark_classic.py | 1265 ++++++++++++++++++++++++++++++++ 2 files changed, 1271 insertions(+), 157 deletions(-) create mode 100755 benchmark/benchmark_classic.py diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 02117242742..2a50e1d7146 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -19,7 +19,6 @@ Performance-oriented refactors: - Avoid heavy imports unless needed for a given code path. - Fast path for `--stats` to skip GitPython and benchmarking deps. -- Build DataFrame / import plotting only when `--graphs` is true. - Use json.load for result file parsing to reduce memory churn. - Cache git version lookups across a single invocation. """ @@ -43,101 +42,6 @@ load_dotenv(override=True) - -def find_latest_benchmark_dir(): - benchmark_dirs = [d for d in BENCHMARK_DNAME.iterdir() if d.is_dir()] - if not benchmark_dirs: - print("Error: No benchmark directories found under tmp.benchmarks.") - sys.exit(1) - - # Get current time and 24 hours ago - now = datetime.datetime.now() - day_ago = now - datetime.timedelta(days=1) - - # Filter directories by name pattern YYYY-MM-DD-HH-MM-SS-- - recent_dirs = [] - for d in benchmark_dirs: - try: - # Extract datetime from directory name - date_str = d.name[:19] # Takes YYYY-MM-DD-HH-MM-SS - dir_date = datetime.datetime.strptime(date_str, "%Y-%m-%d-%H-%M-%S") - if dir_date >= day_ago: - recent_dirs.append(d) - except ValueError: - # Skip directories that don't match the expected format - continue - - if not recent_dirs: - print("Error: No benchmark directories found from the last 24 hours.") - sys.exit(1) - - # Find directory with most recently modified .md file - latest_dir = None - latest_time = 0 - - for d in recent_dirs: - # Look for .md files in subdirectories - for md_file in 
d.glob("*/exercises/practice/*/.*.md"): - if md_file.is_file(): - mtime = md_file.stat().st_mtime - if mtime > latest_time: - latest_time = mtime - latest_dir = d - - if not latest_dir: - print("Error: No .md files found in recent benchmark directories.") - sys.exit(1) - - print(f"Using the most recently updated benchmark directory: {latest_dir.name}") - return latest_dir - - -def show_stats(dirnames, graphs, verbose, stats_languages=None): - raw_rows = [] - for dirname in dirnames: - row = summarize_results(dirname, verbose, stats_languages) - raw_rows.append(row) - - # return - - seen = dict() - rows = [] - for row in raw_rows: - if not row: - continue - - if row.completed_tests != row.total_tests: - print( - f"Warning: {row.dir_name} is incomplete: {row.completed_tests} of {row.total_tests}" - ) - - try: - kind = (row.model, row.edit_format) - except AttributeError: - return - - if kind in seen: - dump(row.dir_name) - dump(seen[kind]) - return - - seen[kind] = row.dir_name - rows.append(vars(row)) - - repeat_hi = repeat_lo = repeat_avg = None # noqa: F841 - - # Only build a DataFrame and import plotting libs when graphs are requested - if graphs: - import pandas as pd # Lazy import - from plots import plot_refactoring # Lazy import - - df = pd.DataFrame.from_records(rows) - # plot_timing(df) - # plot_outcomes(df, repeats, repeat_hi, repeat_lo, repeat_avg) - # plot_outcomes_claude(df) - plot_refactoring(df) - - def resolve_dirname(dirname, use_single_prior, make_new): if len(dirname.parts) > 1: return dirname @@ -166,7 +70,6 @@ def resolve_dirname(dirname, use_single_prior, make_new): @app.command() def main( dirnames: Optional[List[str]] = typer.Argument(None, help="Directory names"), - graphs: bool = typer.Option(False, "--graphs", help="Generate graphs"), model: str = typer.Option("gpt-3.5-turbo", "--model", "-m", help="Model name"), sleep: float = typer.Option( 0, "--sleep", help="Sleep seconds between tests when single threaded" @@ -193,15 +96,6 @@ def main( 
no_unit_tests: bool = typer.Option(False, "--no-unit-tests", help="Do not run unit tests"), no_aider: bool = typer.Option(False, "--no-aider", help="Do not run aider"), verbose: bool = typer.Option(False, "--verbose", "-v", help="Verbose output"), - stats_only: bool = typer.Option( - False, "--stats", "-s", help="Do not run tests, just collect stats on completed tests" - ), - stats_languages: str = typer.Option( - None, - "--stats-languages", - help="Only include stats for specific languages (comma separated)", - ), - diffs_only: bool = typer.Option(False, "--diffs", help="Just diff the provided stats dirs"), tries: int = typer.Option(2, "--tries", "-r", help="Number of tries for running tests"), threads: int = typer.Option(1, "--threads", "-t", help="Number of threads to run in parallel"), num_tests: int = typer.Option(-1, "--num-tests", "-n", help="Number of tests to run"), @@ -226,36 +120,26 @@ def main( EXERCISES_DIR_DEFAULT, "--exercises-dir", help="Directory with exercise files" ), ): - if stats_only and not dirnames: - latest_dir = find_latest_benchmark_dir() - dirnames = [str(latest_dir)] - if dirnames is None: dirnames = [] - if len(dirnames) > 1 and not (stats_only or diffs_only): - print("Only provide 1 dirname unless running with --stats or --diffs") + if len(dirnames) > 1: + print("Only provide 1 dirname") return 1 updated_dirnames = [] for dirname in dirnames: dirname = Path(dirname) - dirname = resolve_dirname(dirname, stats_only or cont, make_new) + dirname = resolve_dirname(dirname, cont, make_new) if not dirname: return 1 updated_dirnames.append(dirname) - if stats_only: - return show_stats(updated_dirnames, graphs, verbose, stats_languages) - - if diffs_only: - return show_diffs(updated_dirnames) - assert len(updated_dirnames) == 1, updated_dirnames dirname = updated_dirnames[0] # Lazy imports for the actual benchmark run - import git # Heavy; avoid for --stats/--diffs + import git # Heavy import importlib_resources # Used for model metadata 
registration import lox # Only needed for threaded runs @@ -268,7 +152,8 @@ def main( commit_hash += "-dirty" if "AIDER_DOCKER" not in os.environ: - print("Warning: benchmarking runs unvetted code from GPT, run in a docker container") + print("Warning: Benchmarking runs unvetted code. Run in a docker container.") + print("Set AIDER_DOCKER in the environment to by-pass this check at your own risk.") return assert BENCHMARK_DNAME.exists() and BENCHMARK_DNAME.is_dir(), BENCHMARK_DNAME @@ -432,42 +317,6 @@ def get_exercise_dirs(base_dir, languages=None): return 0 -def show_diffs(dirnames): - dirnames = sorted(dirnames) - - all_results = dict((dirname, load_results(dirname)) for dirname in dirnames) - testcases = set() - for results in all_results.values(): - testcases.update(result["testcase"] for result in results) - - testcases = sorted(testcases) - - unchanged = set() - - for testcase in testcases: - all_outcomes = [] - for dirname in dirnames: - results = all_results[dirname] - result = [r for r in results if r["testcase"] == testcase][0] - - outcomes = tuple(result["tests_outcomes"]) - all_outcomes.append(True in outcomes) - - if len(set(all_outcomes)) == 1: - unchanged.add(testcase) - continue - - print() - print(testcase) - for outcome, dirname in zip(all_outcomes, dirnames): - print(outcome, f"{dirname}/{testcase}/.aider.chat.history.md") - - changed = set(testcases) - unchanged - print() - print("changed:", len(changed), ",".join(sorted(changed))) - print() - print("unchanged:", len(unchanged), ",".join(sorted(unchanged))) - def load_results(dirname, stats_languages=None): dirname = Path(dirname) diff --git a/benchmark/benchmark_classic.py b/benchmark/benchmark_classic.py new file mode 100755 index 00000000000..02117242742 --- /dev/null +++ b/benchmark/benchmark_classic.py @@ -0,0 +1,1265 @@ +#!/usr/bin/env python3 +import datetime +import json +import os +import random +import re +import shutil +import subprocess +import sys +import time +import traceback 
+from collections import defaultdict +from json.decoder import JSONDecodeError +from pathlib import Path +from types import SimpleNamespace +from typing import List, Optional + +""" +Performance-oriented refactors: +- Avoid heavy imports unless needed for a given code path. +- Fast path for `--stats` to skip GitPython and benchmarking deps. +- Build DataFrame / import plotting only when `--graphs` is true. +- Use json.load for result file parsing to reduce memory churn. +- Cache git version lookups across a single invocation. +""" + +# Heavy modules are lazily imported within the code paths that need them. +import typer +from dotenv import load_dotenv +from rich.console import Console + +from aider.dump import dump # noqa: F401 + +# Cache for commit-hash -> version lookup +_VERSION_CACHE = {} + +BENCHMARK_DNAME = Path(os.environ.get("AIDER_BENCHMARK_DIR", "tmp.benchmarks")) + +EXERCISES_DIR_DEFAULT = "polyglot-benchmark" + +app = typer.Typer(add_completion=False, pretty_exceptions_enable=False) + + +load_dotenv(override=True) + + +def find_latest_benchmark_dir(): + benchmark_dirs = [d for d in BENCHMARK_DNAME.iterdir() if d.is_dir()] + if not benchmark_dirs: + print("Error: No benchmark directories found under tmp.benchmarks.") + sys.exit(1) + + # Get current time and 24 hours ago + now = datetime.datetime.now() + day_ago = now - datetime.timedelta(days=1) + + # Filter directories by name pattern YYYY-MM-DD-HH-MM-SS-- + recent_dirs = [] + for d in benchmark_dirs: + try: + # Extract datetime from directory name + date_str = d.name[:19] # Takes YYYY-MM-DD-HH-MM-SS + dir_date = datetime.datetime.strptime(date_str, "%Y-%m-%d-%H-%M-%S") + if dir_date >= day_ago: + recent_dirs.append(d) + except ValueError: + # Skip directories that don't match the expected format + continue + + if not recent_dirs: + print("Error: No benchmark directories found from the last 24 hours.") + sys.exit(1) + + # Find directory with most recently modified .md file + latest_dir = None + 
latest_time = 0 + + for d in recent_dirs: + # Look for .md files in subdirectories + for md_file in d.glob("*/exercises/practice/*/.*.md"): + if md_file.is_file(): + mtime = md_file.stat().st_mtime + if mtime > latest_time: + latest_time = mtime + latest_dir = d + + if not latest_dir: + print("Error: No .md files found in recent benchmark directories.") + sys.exit(1) + + print(f"Using the most recently updated benchmark directory: {latest_dir.name}") + return latest_dir + + +def show_stats(dirnames, graphs, verbose, stats_languages=None): + raw_rows = [] + for dirname in dirnames: + row = summarize_results(dirname, verbose, stats_languages) + raw_rows.append(row) + + # return + + seen = dict() + rows = [] + for row in raw_rows: + if not row: + continue + + if row.completed_tests != row.total_tests: + print( + f"Warning: {row.dir_name} is incomplete: {row.completed_tests} of {row.total_tests}" + ) + + try: + kind = (row.model, row.edit_format) + except AttributeError: + return + + if kind in seen: + dump(row.dir_name) + dump(seen[kind]) + return + + seen[kind] = row.dir_name + rows.append(vars(row)) + + repeat_hi = repeat_lo = repeat_avg = None # noqa: F841 + + # Only build a DataFrame and import plotting libs when graphs are requested + if graphs: + import pandas as pd # Lazy import + from plots import plot_refactoring # Lazy import + + df = pd.DataFrame.from_records(rows) + # plot_timing(df) + # plot_outcomes(df, repeats, repeat_hi, repeat_lo, repeat_avg) + # plot_outcomes_claude(df) + plot_refactoring(df) + + +def resolve_dirname(dirname, use_single_prior, make_new): + if len(dirname.parts) > 1: + return dirname + + priors = list(BENCHMARK_DNAME.glob(f"*--{dirname}")) + if len(priors) == 1 and use_single_prior: + dirname = priors[0].name + print(f"Using pre-existing {dirname}") + elif len(priors): + if not make_new: + print(f"Prior runs of {dirname} exist, use --new or name one explicitly") + print() + for prior in priors: + print(prior) + return + + if not 
re.match(r"\d\d\d\d-\d\d-\d\d-", str(dirname)): + now = datetime.datetime.now() + now = now.strftime("%Y-%m-%d-%H-%M-%S--") + dirname = now + dirname.name + + dirname = BENCHMARK_DNAME / dirname + return dirname + + +@app.command() +def main( + dirnames: Optional[List[str]] = typer.Argument(None, help="Directory names"), + graphs: bool = typer.Option(False, "--graphs", help="Generate graphs"), + model: str = typer.Option("gpt-3.5-turbo", "--model", "-m", help="Model name"), + sleep: float = typer.Option( + 0, "--sleep", help="Sleep seconds between tests when single threaded" + ), + languages: str = typer.Option( + None, "--languages", "-l", help="Only run tests for specific languages (comma separated)" + ), + edit_format: str = typer.Option(None, "--edit-format", "-e", help="Edit format"), + editor_model: str = typer.Option(None, "--editor-model", help="Editor model name"), + editor_edit_format: str = typer.Option(None, "--editor-edit-format", help="Editor edit format"), + replay: str = typer.Option( + None, + "--replay", + help="Replay previous .aider.chat.history.md responses from previous benchmark run", + ), + keywords: str = typer.Option( + None, "--keywords", "-k", help="Only run tests that contain keywords (comma sep)" + ), + clean: bool = typer.Option( + False, "--clean", "-c", help="Discard the existing testdir and make a clean copy" + ), + cont: bool = typer.Option(False, "--cont", help="Continue the (single) matching testdir"), + make_new: bool = typer.Option(False, "--new", help="Make a new dated testdir"), + no_unit_tests: bool = typer.Option(False, "--no-unit-tests", help="Do not run unit tests"), + no_aider: bool = typer.Option(False, "--no-aider", help="Do not run aider"), + verbose: bool = typer.Option(False, "--verbose", "-v", help="Verbose output"), + stats_only: bool = typer.Option( + False, "--stats", "-s", help="Do not run tests, just collect stats on completed tests" + ), + stats_languages: str = typer.Option( + None, + "--stats-languages", + 
help="Only include stats for specific languages (comma separated)", + ), + diffs_only: bool = typer.Option(False, "--diffs", help="Just diff the provided stats dirs"), + tries: int = typer.Option(2, "--tries", "-r", help="Number of tries for running tests"), + threads: int = typer.Option(1, "--threads", "-t", help="Number of threads to run in parallel"), + num_tests: int = typer.Option(-1, "--num-tests", "-n", help="Number of tests to run"), + num_ctx: Optional[int] = typer.Option( + None, "--num-ctx", help="Override model context window size" + ), + read_model_settings: str = typer.Option( + None, "--read-model-settings", help="Load aider model settings from YAML file" + ), + reasoning_effort: Optional[str] = typer.Option( + None, "--reasoning-effort", help="Set reasoning effort for models that support it" + ), + thinking_tokens: Optional[int] = typer.Option( + None, "--thinking-tokens", help="Set thinking tokens for models that support it" + ), + map_tokens: Optional[int] = typer.Option( + None, + "--map-tokens", + help="Suggested number of tokens for repo map (0 to disable)", + ), + exercises_dir: str = typer.Option( + EXERCISES_DIR_DEFAULT, "--exercises-dir", help="Directory with exercise files" + ), +): + if stats_only and not dirnames: + latest_dir = find_latest_benchmark_dir() + dirnames = [str(latest_dir)] + + if dirnames is None: + dirnames = [] + + if len(dirnames) > 1 and not (stats_only or diffs_only): + print("Only provide 1 dirname unless running with --stats or --diffs") + return 1 + + updated_dirnames = [] + for dirname in dirnames: + dirname = Path(dirname) + dirname = resolve_dirname(dirname, stats_only or cont, make_new) + if not dirname: + return 1 + updated_dirnames.append(dirname) + + if stats_only: + return show_stats(updated_dirnames, graphs, verbose, stats_languages) + + if diffs_only: + return show_diffs(updated_dirnames) + + assert len(updated_dirnames) == 1, updated_dirnames + dirname = updated_dirnames[0] + + # Lazy imports for the 
actual benchmark run + import git # Heavy; avoid for --stats/--diffs + import importlib_resources # Used for model metadata registration + import lox # Only needed for threaded runs + + from aider import models, sendchat + from aider.coders import base_coder + + repo = git.Repo(search_parent_directories=True) + commit_hash = repo.head.object.hexsha[:7] + if repo.is_dirty(): + commit_hash += "-dirty" + + if "AIDER_DOCKER" not in os.environ: + print("Warning: benchmarking runs unvetted code from GPT, run in a docker container") + return + + assert BENCHMARK_DNAME.exists() and BENCHMARK_DNAME.is_dir(), BENCHMARK_DNAME + + def get_exercise_dirs(base_dir, languages=None): + """Get all exercise directories for specified languages (or all if none specified)""" + base_dir = Path(base_dir) + + # Get available language dirs + lang_dirs = [d for d in base_dir.iterdir() if d.is_dir()] + + # Filter to requested languages if specified + if languages: + requested = set(lang.strip().lower() for lang in languages.split(",")) + lang_dirs = [d for d in lang_dirs if d.name.lower() in requested] + dump(lang_dirs) + if not lang_dirs: + print(f"No matching language directories found for: {languages}") + return [] + + # Get all exercise dirs under exercises/practice for each language + exercise_dirs = [] + for lang_dir in lang_dirs: + practice_dir = lang_dir / "exercises" / "practice" + if practice_dir.exists(): + exercise_dirs.extend(d for d in practice_dir.iterdir() if d.is_dir()) + + return exercise_dirs + + original_dname = BENCHMARK_DNAME / exercises_dir + assert original_dname.exists() and original_dname.is_dir(), original_dname + + exercise_dirs = get_exercise_dirs(original_dname, languages) + + if not exercise_dirs: + print("No exercise directories found") + return 1 + + if clean and dirname.exists(): + print("Cleaning up and replacing", dirname) + dir_files = set(fn.name for fn in dirname.glob("*")) + original_files = set(fn.name for fn in original_dname.glob("*")) + if dir_files 
!= original_files: + print("ERROR: will not delete dir that does not look like original tests", dirname) + return + + dest = dirname.parent / "OLD" / dirname.name + if dest.exists(): + old_now = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S") + dest = dirname.parent / "OLD" / (old_now + dirname.name) + + dirname.rename(dest) + + if not dirname.exists(): + print(f"Copying {original_dname} -> {dirname} ...") + # Only copy the practice subdirs with exercises + os.makedirs(dirname, exist_ok=True) + for lang_dir in original_dname.iterdir(): + if not lang_dir.is_dir(): + continue + practice_dir = lang_dir / "exercises" / "practice" + if practice_dir.exists(): + dest_lang_dir = dirname / lang_dir.name / "exercises" / "practice" + os.makedirs(dest_lang_dir.parent, exist_ok=True) + shutil.copytree(practice_dir, dest_lang_dir) + print("...done") + + test_dnames = sorted(str(d.relative_to(original_dname)) for d in exercise_dirs) + + resource_metadata = importlib_resources.files("aider.resources").joinpath("model-metadata.json") + model_metadata_files_loaded = models.register_litellm_models([resource_metadata]) + dump(model_metadata_files_loaded) + + if read_model_settings: + try: + files_loaded = models.register_models([read_model_settings]) + if verbose: + if files_loaded: + print(f"Loaded model settings from: {files_loaded[0]}") + else: + print(f"No model settings loaded from: {read_model_settings}") + except Exception as e: + print(f"Error loading model settings: {e}") + return 1 + + if keywords: + keywords = keywords.split(",") + test_dnames = [dn for dn in test_dnames for keyword in keywords if keyword in dn] + + random.shuffle(test_dnames) + if num_tests > 0: + test_dnames = test_dnames[:num_tests] + + # Don't give up when benchmarking + LONG_TIMEOUT = 24 * 60 * 60 + sendchat.RETRY_TIMEOUT = LONG_TIMEOUT + base_coder.RETRY_TIMEOUT = LONG_TIMEOUT + models.RETRY_TIMEOUT = LONG_TIMEOUT + + # Enable in-memory RepoMap cache when running multiple threads to avoid SQLite 
contention + repomap_in_memory = threads > 1 + + if threads == 1: + all_results = [] + for test_path in test_dnames: + results = run_test( + original_dname, + dirname / test_path, + model, + edit_format, + tries, + no_unit_tests, + no_aider, + verbose, + commit_hash, + replay, + editor_model, + editor_edit_format, + num_ctx, + sleep, + reasoning_effort, + thinking_tokens, + map_tokens, + repomap_in_memory, + ) + + all_results.append(results) + summarize_results(dirname, verbose) + if sleep: + time.sleep(sleep) + else: + run_test_threaded = lox.thread(threads)(run_test) + for test_path in test_dnames: + run_test_threaded.scatter( + original_dname, + dirname / test_path, + model, + edit_format, + tries, + no_unit_tests, + no_aider, + verbose, + commit_hash, + replay, + editor_model, + editor_edit_format, + num_ctx, + sleep, + reasoning_effort, + thinking_tokens, + map_tokens, + repomap_in_memory, + ) + all_results = run_test_threaded.gather(tqdm=True) + + print() + print() + print() + summarize_results(dirname, verbose) + + return 0 + + +def show_diffs(dirnames): + dirnames = sorted(dirnames) + + all_results = dict((dirname, load_results(dirname)) for dirname in dirnames) + testcases = set() + for results in all_results.values(): + testcases.update(result["testcase"] for result in results) + + testcases = sorted(testcases) + + unchanged = set() + + for testcase in testcases: + all_outcomes = [] + for dirname in dirnames: + results = all_results[dirname] + result = [r for r in results if r["testcase"] == testcase][0] + + outcomes = tuple(result["tests_outcomes"]) + all_outcomes.append(True in outcomes) + + if len(set(all_outcomes)) == 1: + unchanged.add(testcase) + continue + + print() + print(testcase) + for outcome, dirname in zip(all_outcomes, dirnames): + print(outcome, f"{dirname}/{testcase}/.aider.chat.history.md") + + changed = set(testcases) - unchanged + print() + print("changed:", len(changed), ",".join(sorted(changed))) + print() + print("unchanged:", 
len(unchanged), ",".join(sorted(unchanged)))
+
+
+def load_results(dirname, stats_languages=None):
+    dirname = Path(dirname)
+    lang_to_results = {}
+
+    if stats_languages:
+        languages = [lang.strip().lower() for lang in stats_languages.split(",")]
+        glob_patterns = [f"{lang}/exercises/practice/*/.aider.results.json" for lang in languages]
+    else:
+        glob_patterns = ["*/exercises/practice/*/.aider.results.json"]
+
+    for pattern in glob_patterns:
+        for fname in dirname.glob(pattern):
+            try:
+                results = json.loads(fname.read_text())
+                # json / test / prac / exer / lang
+                lang = fname.parent.parent.parent.parent.name
+                lang_to_results.setdefault(lang, []).append(results)
+            except json.JSONDecodeError:
+                print("json.JSONDecodeError", fname)
+                continue
+    return lang_to_results
+
+
+def summarize_results(dirname, verbose, stats_languages=None):
+    lang_to_results = load_results(dirname, stats_languages)
+
+    res = SimpleNamespace()
+    res.total_tests = len(list(Path(dirname).glob("*/exercises/practice/*")))
+
+    try:
+        tries = max(
+            len(results.get("tests_outcomes", []))
+            for results_list in lang_to_results.values()
+            for results in results_list
+            if results
+        )
+    except ValueError:
+        tries = 0
+
+    res.dir_name = str(dirname)
+
+    passed_tests = [0] * tries
+
+    res.completed_tests = 0
+    res.duration = 0
+    res.cost = 0
+    res.error_outputs = 0
+    res.user_asks = 0
+    res.test_timeouts = 0
+    res.exhausted_context_windows = 0
+    res.num_malformed_responses = 0
+    res.num_with_malformed_responses = 0
+    res.syntax_errors = 0
+    res.indentation_errors = 0
+    res.lazy_comments = 0
+    res.prompt_tokens = 0
+    res.completion_tokens = 0
+
+    res.reasoning_effort = None
+    res.thinking_tokens = None
+    res.map_tokens = None
+    variants = defaultdict(set)
+
+    def add(attr_name, increment, global_stats, lang_stats):
+        global_prev = getattr(global_stats, attr_name)
+        setattr(global_stats, attr_name, global_prev + increment)
+
+        lang_prev = getattr(lang_stats, attr_name)
+        setattr(lang_stats, attr_name, lang_prev + increment)
+
+    lang_to_stats = {}
+    lang_to_passed_tests = {}
+    for lang, results_list in lang_to_results.items():
+        lang_stats = SimpleNamespace()
+        lang_stats.completed_tests = 0
+        lang_stats.duration = 0
+        lang_stats.avg_duration_per_test = 0
+        lang_stats.cost = 0
+        for i in range(tries):
+            setattr(lang_stats, f"pass_rate_{i + 1}", 0)
+        for i in range(tries):
+            setattr(lang_stats, f"pass_num_{i + 1}", 0)
+        lang_stats.error_outputs = 0
+        lang_stats.user_asks = 0
+        lang_stats.test_timeouts = 0
+        lang_stats.exhausted_context_windows = 0
+        lang_stats.num_malformed_responses = 0
+        lang_stats.num_with_malformed_responses = 0
+        lang_stats.syntax_errors = 0
+        lang_stats.indentation_errors = 0
+        lang_stats.lazy_comments = 0
+        lang_stats.prompt_tokens = 0
+        lang_stats.completion_tokens = 0
+        lang_to_stats[lang] = lang_stats
+        lang_to_passed_tests[lang] = [0] * tries
+
+        for results in results_list:
+            if not results:
+                continue
+
+            add("completed_tests", 1, res, lang_stats)
+            tests_outcomes = results.get("tests_outcomes", [])
+            passed = tests_outcomes and tests_outcomes[-1]
+            if passed:
+                for i in range(len(tests_outcomes) - 1, tries):
+                    passed_tests[i] += 1
+                    lang_to_passed_tests[lang][i] += 1
+
+            add("cost", results.get("cost", 0), res, lang_stats)
+            add("duration", results.get("duration", 0), res, lang_stats)
+            add("test_timeouts", results.get("test_timeouts", 0), res, lang_stats)
+
+            add("error_outputs", results.get("num_error_outputs", 0), res, lang_stats)
+            add("user_asks", results.get("num_user_asks", 0), res, lang_stats)
+            add(
+                "exhausted_context_windows",
+                results.get("num_exhausted_context_windows", 0),
+                res,
+                lang_stats,
+            )
+            add(
+                "num_malformed_responses",
+                results.get("num_malformed_responses", 0),
+                res,
+                lang_stats,
+            )
+            if results.get("num_malformed_responses"):
+                add("num_with_malformed_responses", 1, res, lang_stats)
+            add("lazy_comments", results.get("lazy_comments", 0), res, lang_stats)
+
+            add("syntax_errors", results.get("syntax_errors", 0), res, lang_stats)
+            add("indentation_errors", results.get("indentation_errors", 0), res, lang_stats)
+
+            add("prompt_tokens", results.get("prompt_tokens", 0), res, lang_stats)
+            add("completion_tokens", results.get("completion_tokens", 0), res, lang_stats)
+
+            res.reasoning_effort = results.get("reasoning_effort")
+            res.thinking_tokens = results.get("thinking_tokens")
+            res.map_tokens = results.get("map_tokens")
+
+            for key in "model edit_format commit_hash editor_model editor_edit_format".split():
+                val = results.get(key)
+                if val:
+                    variants[key].add(val)
+
+    if not res.completed_tests:
+        return
+
+    # if res.completed_tests < 133:
+    #     return
+
+    console = Console(highlight=False)
+    console.rule(title=str(dirname))
+
+    commit_hashes = variants["commit_hash"]
+    versions = get_versions(commit_hashes)
+    date = dirname.name[:10]
+
+    def show(stat, red="red"):
+        val = getattr(res, stat)
+        style = red if val else None
+        console.print(f"  {stat}: {val}", style=style)
+
+    percents = dict()
+    for i in range(tries):
+        pass_rate = 100 * passed_tests[i] / res.completed_tests
+        percents[i] = pass_rate
+        # console.print(f"{pass_rate:.1f}% correct after try {i + 1}")
+        setattr(res, f"pass_rate_{i + 1}", f"{pass_rate:.1f}")
+        setattr(res, f"pass_num_{i + 1}", passed_tests[i])
+
+    print(f"- dirname: {dirname.name}")
+    style = None if res.completed_tests == res.total_tests else "red"
+    console.print(f"  test_cases: {res.completed_tests}", style=style)
+    for key, val in variants.items():
+        if len(val) > 1:
+            style = "red"
+        else:
+            style = None
+        val = ", ".join(map(str, val))
+        setattr(res, key, val)
+        console.print(f"  {key}: {val}", style=style)
+
+    if res.reasoning_effort is not None:
+        print(f"  reasoning_effort: {res.reasoning_effort}")
+    if res.thinking_tokens is not None:
+        print(f"  thinking_tokens: {res.thinking_tokens}")
+    if res.map_tokens is not None:
+        print(f"  map_tokens: {res.map_tokens}")
+
+    for i in range(tries):
+        print(f"  pass_rate_{i + 1}: {percents[i]:.1f}")
+    for i in range(tries):
+        print(f"  pass_num_{i + 1}: {passed_tests[i]}")
+
+    pct_well_formed = 1.0 - res.num_with_malformed_responses / res.completed_tests
+    print(f"  percent_cases_well_formed: {pct_well_formed * 100:.1f}")
+
+    show("error_outputs")
+    show("num_malformed_responses")
+    show("num_with_malformed_responses")
+    show("user_asks")
+    show("lazy_comments")
+    show("syntax_errors")
+    show("indentation_errors")
+    show("exhausted_context_windows")
+    show("prompt_tokens", red=None)
+    show("completion_tokens", red=None)
+    show("test_timeouts")
+    print(f"  total_tests: {res.total_tests}")
+
+    if variants["model"]:
+        a_model = set(variants["model"]).pop()
+        command = f"aider-ce --model {a_model}"
+        print(f"  command: {command}")
+
+    print(f"  date: {date}")
+    print("  versions:", ",".join(versions))
+
+    res.avg_duration = res.duration / res.completed_tests
+    print(f"  seconds_per_case: {res.avg_duration:.1f}")
+
+    print(f"  total_cost: {res.cost:.4f}")
+
+    res.avg_cost = res.cost / res.completed_tests
+
+    projected_cost = res.avg_cost * res.total_tests
+
+    print()
+    print(
+        f"costs: ${res.avg_cost:.4f}/test-case, ${res.cost:.2f} total,"
+        f" ${projected_cost:.2f} projected"
+    )
+
+    if verbose and len(lang_to_stats) > 0:
+
+        def format_lang_stats(lang, lang_stats):
+            # First, postprocess attributes for easier printing
+            if lang_stats.completed_tests > 0:
+                lang_stats.avg_duration_per_test = lang_stats.duration / float(
+                    lang_stats.completed_tests
+                )
+                for i in range(tries):
+                    num_passed = lang_to_passed_tests[lang][i]
+                    setattr(lang_stats, f"pass_num_{i + 1}", num_passed)
+                    pass_rate = 100 * num_passed / float(lang_stats.completed_tests)
+                    setattr(lang_stats, f"pass_rate_{i + 1}", pass_rate)
+
+            # Then format attributes into ready-to-print strings
+            for attr in lang_stats.__dict__:
+                val = getattr(lang_stats, attr)
+                if val == 0:
+                    val = "-"
+                elif isinstance(val, float):
+                    val = f"{val:,.2f}"
+                else:
+                    val = f"{val:,}"
+
+                setattr(lang_stats, attr, val)
+
+        def compute_lang_to_col_widths(lang_to_stats):
+            lang_to_col_widths = {}
+            for lang, lang_stats in lang_to_stats.items():
+                lang_stat_attrs = [getattr(lang_stats, attr) for attr in lang_stats.__dict__]
+                lang_col_width = max(len(lang), len(max(lang_stat_attrs, key=len)))
+                lang_to_col_widths[lang] = lang_col_width
+
+            return lang_to_col_widths
+
+        print()
+        print("======== Stats by language ========")
+        print()
+
+        [format_lang_stats(lang, lang_stats) for lang, lang_stats in lang_to_stats.items()]
+        lang_to_col_widths = compute_lang_to_col_widths(lang_to_stats)
+
+        any_stats = list(lang_to_stats.values())[0]
+        attrs = list(any_stats.__dict__)
+        attr_col_width = len(max(["language"] + attrs, key=len))
+        langs = list(lang_to_stats.keys())
+
+        print("| " + ("-" * attr_col_width), end="")
+        for lang in langs:
+            col_width = lang_to_col_widths[lang]
+            print(" | " + ("-" * col_width), end="")
+        print(" |")
+
+        print(f"| {' '.center(attr_col_width)}", end="")
+        for lang in langs:
+            col_width = lang_to_col_widths[lang]
+            print(f" | {lang.center(col_width)}", end="")
+        print(" |")
+
+        print("| " + ("-" * attr_col_width), end="")
+        for lang in langs:
+            col_width = lang_to_col_widths[lang]
+            print(" | " + ("-" * col_width), end="")
+        print(" |")
+
+        for attr in attrs:
+            print(f"| {attr:<{attr_col_width}}", end="")
+            for lang in langs:
+                lang_stats = lang_to_stats[lang]
+                col_width = lang_to_col_widths[lang]
+                print(f" | {getattr(lang_stats, attr):>{col_width}}", end="")
+            print(" |")
+
+        print("| " + ("-" * attr_col_width), end="")
+        for lang in langs:
+            col_width = lang_to_col_widths[lang]
+            print(" | " + ("-" * col_width), end="")
+        print(" |")
+        print()
+
+    console.rule()
+
+    # print(json.dumps(vars(res), indent=4, sort_keys=True))
+    return res
+
+
+def get_versions(commit_hashes):
+    versions = set()
+    for hsh in commit_hashes:
+        if not hsh:
+            continue
+        short = hsh.split("-")[0]
+        if short in _VERSION_CACHE:
+            ver = _VERSION_CACHE.get(short)
+            if ver:
+                versions.add(ver)
+            continue
+
+        try:
+            version_src = subprocess.check_output(
+                ["git", "show", f"{short}:aider/__init__.py"], universal_newlines=True
+            )
+            match = re.search(r'__version__ = "(.*)"', version_src)
+            ver = match.group(1) if match else None
+            _VERSION_CACHE[short] = ver
+            if ver:
+                versions.add(ver)
+        except subprocess.CalledProcessError:
+            _VERSION_CACHE[short] = None
+            pass
+    return versions
+
+
+def get_replayed_content(replay_dname, test_dname):
+    replay_dname = Path(replay_dname)
+    test_dname = Path(test_dname)
+    dump(replay_dname, test_dname)
+
+    test_name = test_dname.name
+    replay_fname = replay_dname / test_name / ".aider.chat.history.md"
+    dump(replay_fname)
+
+    res = replay_fname.read_text()
+    return res
+
+    res = res.splitlines(keepends=True)
+    res = [line for line in res if not line.startswith("> ") and not line.startswith("#### ")]
+    return "".join(res)
+
+
+def run_test(original_dname, testdir, *args, **kwargs):
+    try:
+        return run_test_real(original_dname, testdir, *args, **kwargs)
+    except Exception:
+        print("=" * 40)
+        print("Test failed")
+        traceback.print_exc()
+
+        testdir = Path(testdir)
+        results_fname = testdir / ".aider.results.json"
+        results_fname.write_text(json.dumps(dict(exception=traceback.format_exc())))
+
+
+def run_test_real(
+    original_dname,
+    testdir,
+    model_name,
+    edit_format,
+    tries,
+    no_unit_tests,
+    no_aider,
+    verbose,
+    commit_hash,
+    replay,
+    editor_model,
+    editor_edit_format,
+    num_ctx=None,
+    sleep=0,
+    reasoning_effort: Optional[str] = None,
+    thinking_tokens: Optional[int] = None,
+    map_tokens: Optional[int] = None,
+    read_model_settings=None,
+    repomap_in_memory: bool = False,
+):
+    # Lazy imports: only needed in the actual benchmark execution path
+    import git
+    import prompts
+
+    from aider import models
+    from aider.coders import Coder
+    from aider.io import InputOutput
+
+    if not os.path.isdir(testdir):
+        print("Not a dir:", testdir)
+        return
+
+    testdir = Path(testdir)
+
+    history_fname = testdir / ".aider.chat.history.md"
+
+    results_fname = testdir / ".aider.results.json"
+    if results_fname.exists():
+        try:
+            res = json.loads(results_fname.read_text())
+            # if res.get("test_timeouts", 0) > 0:
+            #     print(f"{results_fname} test timeouts, redoing...")
+            # else:
+            return res
+        except JSONDecodeError:
+            print(f"{results_fname} failed to parse, redoing...")
+
+    # Read solution and test files from config
+    fnames = []
+    config_file = testdir / ".meta/config.json"
+    if not config_file.exists():
+        raise ValueError(f"No config file found: {config_file}")
+
+    with open(config_file) as f:
+        config = json.loads(f.read())
+
+    # Get file sets from config
+    test_files = config.get("files", {}).get("test", [])
+    example_files = config.get("files", {}).get("example", [])
+    solution_files = set(config.get("files", {}).get("solution", []))
+
+    # Forcibly ignore certain files not covered by test_files and example_files
+    ignore_files = set(
+        [
+            "CMakeLists.txt",
+            "Cargo.toml",
+        ]
+    )
+
+    # Add all files under .meta and .docs directories
+    ignore_files.update(str(p.relative_to(testdir)) for p in testdir.glob(".meta/**/*"))
+    ignore_files.update(str(p.relative_to(testdir)) for p in testdir.glob(".docs/**/*"))
+
+    # Also ignore test & example files
+    ignore_files.update(test_files)
+    ignore_files.update(example_files)
+
+    # Remove any ignore files from the solution set that LLM will edit
+    solution_files.difference_update(ignore_files)
+
+    # Copy all solution files
+    for file_path in solution_files:
+        src = testdir / Path(file_path)
+        if src.exists():
+            fnames.append(src)
+            # restore the original file, in case we interrupted a prev run
+            # Find the original file in the language-specific practice dir
+            lang_part = str(testdir).split("/exercises/practice/")[0]
+            original_fname = (
+                original_dname
+                / Path(lang_part).name
+                / "exercises"
+                / "practice"
+                / testdir.name
+                / file_path
+            )
+            if original_fname.exists():
+                os.makedirs(src.parent, exist_ok=True)
+                shutil.copy(original_fname, src)
+        else:
+            print(f"Warning: Solution file not found: {src}")
+
+    file_list = " ".join(fname.name for fname in fnames)
+
+    instructions = ""
+
+    introduction = testdir / ".docs/introduction.md"
+    if introduction.exists():
+        instructions += introduction.read_text()
+    instructions += (testdir / ".docs/instructions.md").read_text()
+    instructions_append = testdir / ".docs/instructions.append.md"
+    if instructions_append.exists():
+        instructions += instructions_append.read_text()
+
+    instructions += prompts.instructions_addendum.format(file_list=file_list)
+
+    io = InputOutput(
+        pretty=False,
+        yes=True,
+        chat_history_file=history_fname,
+    )
+
+    # weak_model_name = model_name
+    weak_model_name = None
+
+    main_model = models.Model(
+        model_name,
+        weak_model=weak_model_name,
+        editor_model=editor_model,
+        editor_edit_format=editor_edit_format,
+        verbose=verbose,
+    )
+
+    if reasoning_effort is not None:
+        main_model.set_reasoning_effort(reasoning_effort)
+
+    if thinking_tokens is not None:
+        main_model.set_thinking_tokens(thinking_tokens)
+
+    dump(main_model.max_chat_history_tokens)
+
+    if num_ctx:
+        if not main_model.extra_params:
+            main_model.extra_params = {}
+        main_model.extra_params["num_ctx"] = num_ctx
+    edit_format = edit_format or main_model.edit_format
+
+    dump(main_model)
+    dump(edit_format)
+    show_fnames = ",".join(map(str, fnames))
+    print("fnames:", show_fnames)
+    # Ensure this test directory is a standalone git repo so RepoMap can be used
+    try:
+        git_dir = testdir / ".git"
+        if not git_dir.exists():
+            r = git.Repo.init(testdir)
+            # Set a local identity to avoid commit failures in clean containers
+            with r.config_writer() as cw:
+                cw.set_value("user", "name", "aider-benchmark")
+                cw.set_value("user", "email", "aider-benchmark@example.com")
+            # Add existing files (solution set and any current files)
+            r.index.add([str(p.relative_to(testdir)) for p in testdir.rglob("*") if p.is_file()])
+            r.index.commit("Initial commit for aider benchmark")
+    except Exception as e:
+        if verbose:
+            print(f"Warning: failed to initialize git repo in {testdir}: {e}")
+
+    coder_kwargs = dict(
+        main_model=main_model,
+        edit_format=edit_format,
+        io=io,
+        fnames=fnames,
+        use_git=True,
+        auto_commits=False,
+        dirty_commits=False,
+        stream=False,
+        verbose=verbose,
+        # auto_lint=False,  # disabled for code-in-json experiments
+        cache_prompts=True,
+        suggest_shell_commands=False,
+        ignore_mentions=ignore_files,
+        # Reduce repo map contention and size for benchmarks
+        map_cache_dir=str(testdir),
+        repomap_in_memory=repomap_in_memory,
+        map_mul_no_files=4,
+    )
+    if map_tokens is not None:
+        coder_kwargs["map_tokens"] = map_tokens
+
+    coder = Coder.create(**coder_kwargs)
+    dump(coder.ignore_mentions)
+
+    coder.show_announcements()
+    coder.get_file_mentions = lambda x: set()  # No loading of any other files
+
+    timeouts = 0
+
+    syntax_errors = 0
+    indentation_errors = 0
+    lazy_comments = 0
+
+    dur = 0
+    test_outcomes = []
+    for i in range(tries):
+        start = time.time()
+
+        if no_aider:
+            pass
+        elif replay:
+            response = get_replayed_content(replay, testdir)
+            coder.partial_response_content = response
+
+            show = response.splitlines(keepends=True)
+            show = [">> " + line for line in show]
+            io.append_chat_history("".join(show))
+
+            coder.apply_updates()
+        else:
+            response = coder.run(with_message=instructions, preproc=False)
+
+        dur += time.time() - start
+
+        if not no_aider:
+            pat = r"^[+]? *[#].* [.][.][.] "
+            # Count the number of lines that match pat in response
+            dump(response)
+            lazy_comments += len(re.findall(pat, response, re.MULTILINE))
+            dump(lazy_comments)
+
+        if coder.last_keyboard_interrupt:
+            raise KeyboardInterrupt
+
+        if no_unit_tests:
+            break
+
+        try:
+            errors = run_unit_tests(original_dname, testdir, history_fname, test_files)
+        except subprocess.TimeoutExpired:
+            # try:
+            #     errors = run_unit_tests(original_dname, testdir, history_fname, test_files)
+            # except subprocess.TimeoutExpired:
+            errors = "Tests timed out!"
+            timeouts += 1
+
+        if errors:
+            test_outcomes.append(False)
+        else:
+            test_outcomes.append(True)
+            break
+
+        if replay:
+            io.append_chat_history(errors)
+
+        errors = errors.splitlines()
+
+        syntax_errors += sum(1 for line in errors if line.startswith("SyntaxError"))
+        indentation_errors += sum(1 for line in errors if line.startswith("IndentationError"))
+
+        print(errors[-1])
+        errors = "\n".join(errors)
+        instructions = errors
+        instructions += prompts.test_failures.format(file_list=file_list)
+
+    # Clean up build directories after all attempts
+    # Rust target/debug
+    target_dir = testdir / "target" / "debug"
+    if target_dir.exists():
+        try:
+            shutil.rmtree(target_dir)
+            if verbose:
+                print(f"Cleaned up Rust target/debug directory: {target_dir}")
+        except (OSError, shutil.Error, PermissionError) as e:
+            if verbose:
+                print(f"Failed to clean up Rust target/debug directory: {e}")
+
+    # Java build directories
+    java_build_dir = testdir / "build"
+    if java_build_dir.exists():
+        try:
+            shutil.rmtree(java_build_dir)
+            if verbose:
+                print(f"Cleaned up Java build directory: {java_build_dir}")
+        except (OSError, shutil.Error, PermissionError) as e:
+            if verbose:
+                print(f"Failed to clean up Java build directory: {e}")
+
+    # Node.js node_modules directories
+    node_modules_dir = testdir / "node_modules"
+    if node_modules_dir.exists():
+        try:
+            shutil.rmtree(node_modules_dir)
+            if verbose:
+                print(f"Cleaned up Node.js node_modules directory: {node_modules_dir}")
+        except (OSError, shutil.Error, PermissionError) as e:
+            if verbose:
+                print(f"Failed to clean up Node.js node_modules directory: {e}")
+
+    results = dict(
+        testdir=str(testdir),
+        testcase=testdir.name,
+        model=main_model.name,
+        edit_format=edit_format,
+        tests_outcomes=test_outcomes,
+        cost=coder.total_cost,
+        duration=dur,
+        test_timeouts=timeouts,
+        commit_hash=commit_hash,
+        num_error_outputs=io.num_error_outputs,
+        num_user_asks=io.num_user_asks,
+        num_exhausted_context_windows=coder.num_exhausted_context_windows,
+        num_malformed_responses=coder.num_malformed_responses,
+        syntax_errors=syntax_errors,
+        indentation_errors=indentation_errors,
+        lazy_comments=lazy_comments,  # Add the count of pattern matches to the results
+        reasoning_effort=reasoning_effort,
+        prompt_tokens=coder.total_tokens_sent,
+        completion_tokens=coder.total_tokens_received,
+        thinking_tokens=thinking_tokens,
+        map_tokens=map_tokens,
+        chat_hashes=list(
+            zip(
+                coder.chat_completion_call_hashes,
+                coder.chat_completion_response_hashes,
+            )
+        ),
+    )
+
+    if edit_format == "architect":
+        results["editor_model"] = main_model.editor_model.name if main_model.editor_model else None
+        results["editor_edit_format"] = main_model.editor_edit_format
+    dump(results)
+
+    results_fname.write_text(json.dumps(results, indent=4))
+
+    return results
+
+
+def run_unit_tests(original_dname, testdir, history_fname, test_files):
+    timeout = 60 * 3
+
+    # Map of file extensions to test commands
+    TEST_COMMANDS = {
+        ".py": ["pytest"],
+        ".rs": ["cargo", "test", "--", "--include-ignored"],
+        ".go": ["go", "test", "./..."],
+        ".js": ["/aider/benchmark/npm-test.sh"],
+        ".cpp": ["/aider/benchmark/cpp-test.sh"],
+        ".java": ["./gradlew", "test"],
+    }
+
+    # Get unique file extensions from test files
+    extensions = {Path(f).suffix for f in test_files}
+
+    # Find matching test command
+    command = None
+    for ext in extensions:
+        if ext in TEST_COMMANDS:
+            command = TEST_COMMANDS[ext]
+            break
+
+    if not command:
+        raise ValueError(f"No test command found for files with extensions: {extensions}")
+
+    # Copy test files from original directory
+    for file_path in test_files:
+        src = original_dname / Path(*testdir.parts[-4:]) / file_path
+        dst = testdir / file_path
+        if src.exists():
+            print("copying", src, dst)
+            os.makedirs(dst.parent, exist_ok=True)
+            shutil.copy(src, dst)
+
+    # Remove @Disabled annotations from Java test files
+    for file_path in test_files:
+        if file_path.endswith(".java"):
+            test_file = testdir / file_path
+            if test_file.exists():
+                content = test_file.read_text()
+                content = re.sub(r"@Disabled\([^)]*\)\s*\n", "", content)
+                test_file.write_text(content)
+
+    print(" ".join(command))
+
+    result = subprocess.run(
+        command,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        timeout=timeout,
+        cwd=testdir,
+        encoding="utf-8",
+        errors="replace",
+    )
+
+    success = result.returncode == 0
+    res = result.stdout
+    res = cleanup_test_output(res, testdir)
+    dump(res)
+
+    with history_fname.open("a") as fh:
+        fh.write(f"```\n{res}\n```")
+
+    if not success:
+        print(f"Tests failed: {testdir}")
+    return res
+
+
+def cleanup_test_output(output, testdir):
+    # remove timing info, to avoid randomizing the response to GPT
+    res = re.sub(r"\bin \d+\.\d+s\b", "", output)
+    res = res.replace(str(testdir), str(testdir.name))
+    return res
+
+
+if __name__ == "__main__":
+    app()

From 06b5b04f0442dbda98ce340c96efcb0b5c90c36c Mon Sep 17 00:00:00 2001
From: Erich Schulz
Date: Tue, 23 Dec 2025 10:51:53 +1000
Subject: [PATCH 05/48] feat: Add --dry mode to skip docker check and tests

Co-authored-by: aider-ce (gemini/gemini-3-pro-preview)
---
 benchmark/benchmark.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py
index 2a50e1d7146..ebfe4d4e2aa 100755
--- a/benchmark/benchmark.py
+++ b/benchmark/benchmark.py
@@ -119,7 +119,12 @@ def main(
     exercises_dir: str = typer.Option(
         EXERCISES_DIR_DEFAULT, "--exercises-dir", help="Directory with exercise files"
     ),
+    dry: bool = typer.Option(False, "--dry", help="Run in dry mode (no aider, no tests)"),
 ):
+    if dry:
+        no_aider = True
+        no_unit_tests = True
+
     if dirnames is None:
         dirnames = []
 
@@ -151,7 +156,7 @@ def main(
     if repo.is_dirty():
         commit_hash += "-dirty"
 
-    if "AIDER_DOCKER" not in os.environ:
+    if not dry and "AIDER_DOCKER" not in os.environ:
         print("Warning: Benchmarking runs unvetted code. Run in a docker container.")
         print("Set AIDER_DOCKER in the environment to by-pass this check at your own risk.")
         return

From 17380212f367e1450ecb46fe614755f04b7e06fa Mon Sep 17 00:00:00 2001
From: Erich Schulz
Date: Tue, 23 Dec 2025 10:51:56 +1000
Subject: [PATCH 06/48] chore: Run linter and format code

Co-authored-by: aider-ce (gemini/gemini-3-pro-preview)
---
 benchmark/benchmark.py | 121 ++++++++++++++++++++++++++++++++---------
 1 file changed, 94 insertions(+), 27 deletions(-)

diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py
index ebfe4d4e2aa..c375154a357 100755
--- a/benchmark/benchmark.py
+++ b/benchmark/benchmark.py
@@ -42,6 +42,7 @@
 
 load_dotenv(override=True)
 
+
 def resolve_dirname(dirname, use_single_prior, make_new):
     if len(dirname.parts) > 1:
         return dirname
@@ -75,30 +76,51 @@ def main(
         0, "--sleep", help="Sleep seconds between tests when single threaded"
     ),
     languages: str = typer.Option(
-        None, "--languages", "-l", help="Only run tests for specific languages (comma separated)"
+        None,
+        "--languages",
+        "-l",
+        help="Only run tests for specific languages (comma separated)",
    ),
     edit_format: str = typer.Option(None, "--edit-format", "-e", help="Edit format"),
     editor_model: str = typer.Option(None, "--editor-model", help="Editor model name"),
-    editor_edit_format: str = typer.Option(None, "--editor-edit-format", help="Editor edit format"),
+    editor_edit_format: str = typer.Option(
+        None, "--editor-edit-format", help="Editor edit format"
+    ),
     replay: str = typer.Option(
         None,
         "--replay",
         help="Replay previous .aider.chat.history.md responses from previous benchmark run",
     ),
     keywords: str = typer.Option(
-        None, "--keywords", "-k", help="Only run tests that contain keywords (comma sep)"
+        None,
+        "--keywords",
+        "-k",
+        help="Only run tests that contain keywords (comma sep)",
     ),
     clean: bool = typer.Option(
-        False, "--clean", "-c", help="Discard the existing testdir and make a clean copy"
+        False,
+        "--clean",
+        "-c",
+        help="Discard the existing testdir and make a clean copy",
+    ),
+    cont: bool = typer.Option(
+        False, "--cont", help="Continue the (single) matching testdir"
     ),
-    cont: bool = typer.Option(False, "--cont", help="Continue the (single) matching testdir"),
     make_new: bool = typer.Option(False, "--new", help="Make a new dated testdir"),
-    no_unit_tests: bool = typer.Option(False, "--no-unit-tests", help="Do not run unit tests"),
+    no_unit_tests: bool = typer.Option(
+        False, "--no-unit-tests", help="Do not run unit tests"
+    ),
     no_aider: bool = typer.Option(False, "--no-aider", help="Do not run aider"),
     verbose: bool = typer.Option(False, "--verbose", "-v", help="Verbose output"),
-    tries: int = typer.Option(2, "--tries", "-r", help="Number of tries for running tests"),
-    threads: int = typer.Option(1, "--threads", "-t", help="Number of threads to run in parallel"),
-    num_tests: int = typer.Option(-1, "--num-tests", "-n", help="Number of tests to run"),
+    tries: int = typer.Option(
+        2, "--tries", "-r", help="Number of tries for running tests"
+    ),
+    threads: int = typer.Option(
+        1, "--threads", "-t", help="Number of threads to run in parallel"
+    ),
+    num_tests: int = typer.Option(
+        -1, "--num-tests", "-n", help="Number of tests to run"
+    ),
     num_ctx: Optional[int] = typer.Option(
         None, "--num-ctx", help="Override model context window size"
     ),
@@ -106,7 +128,9 @@ def main(
         None, "--read-model-settings", help="Load aider model settings from YAML file"
     ),
     reasoning_effort: Optional[str] = typer.Option(
-        None, "--reasoning-effort", help="Set reasoning effort for models that support it"
+        None,
+        "--reasoning-effort",
+        help="Set reasoning effort for models that support it",
     ),
     thinking_tokens: Optional[int] = typer.Option(
         None, "--thinking-tokens", help="Set thinking tokens for models that support it"
@@ -119,7 +143,9 @@ def main(
     exercises_dir: str = typer.Option(
         EXERCISES_DIR_DEFAULT, "--exercises-dir", help="Directory with exercise files"
     ),
-    dry: bool = typer.Option(False, "--dry", help="Run in dry mode (no aider, no tests)"),
+    dry: bool = typer.Option(
+        False, "--dry", help="Run in dry mode (no aider, no tests)"
+    ),
 ):
     if dry:
         no_aider = True
@@ -158,7 +184,9 @@ def main(
 
     if not dry and "AIDER_DOCKER" not in os.environ:
         print("Warning: Benchmarking runs unvetted code. Run in a docker container.")
-        print("Set AIDER_DOCKER in the environment to by-pass this check at your own risk.")
+        print(
+            "Set AIDER_DOCKER in the environment to by-pass this check at your own risk."
+        )
         return
 
     assert BENCHMARK_DNAME.exists() and BENCHMARK_DNAME.is_dir(), BENCHMARK_DNAME
@@ -202,7 +230,10 @@ def get_exercise_dirs(base_dir, languages=None):
         dir_files = set(fn.name for fn in dirname.glob("*"))
         original_files = set(fn.name for fn in original_dname.glob("*"))
         if dir_files != original_files:
-            print("ERROR: will not delete dir that does not look like original tests", dirname)
+            print(
+                "ERROR: will not delete dir that does not look like original tests",
+                dirname,
+            )
             return
 
         dest = dirname.parent / "OLD" / dirname.name
@@ -228,7 +259,9 @@ def get_exercise_dirs(base_dir, languages=None):
 
     test_dnames = sorted(str(d.relative_to(original_dname)) for d in exercise_dirs)
 
-    resource_metadata = importlib_resources.files("aider.resources").joinpath("model-metadata.json")
+    resource_metadata = importlib_resources.files("aider.resources").joinpath(
+        "model-metadata.json"
+    )
     model_metadata_files_loaded = models.register_litellm_models([resource_metadata])
     dump(model_metadata_files_loaded)
 
@@ -246,7 +279,9 @@ def get_exercise_dirs(base_dir, languages=None):
 
     if keywords:
         keywords = keywords.split(",")
-        test_dnames = [dn for dn in test_dnames for keyword in keywords if keyword in dn]
+        test_dnames = [
+            dn for dn in test_dnames for keyword in keywords if keyword in dn
+        ]
 
     random.shuffle(test_dnames)
     if num_tests > 0:
@@ -322,14 +357,15 @@ def get_exercise_dirs(base_dir, languages=None):
 
     return 0
 
-
 def load_results(dirname, stats_languages=None):
     dirname = Path(dirname)
     lang_to_results = {}
 
     if stats_languages:
         languages = [lang.strip().lower() for lang in stats_languages.split(",")]
-        glob_patterns = [f"{lang}/exercises/practice/*/.aider.results.json" for lang in languages]
+        glob_patterns = [
+            f"{lang}/exercises/practice/*/.aider.results.json" for lang in languages
+        ]
     else:
         glob_patterns = ["*/exercises/practice/*/.aider.results.json"]
 
@@ -454,16 +490,30 @@ def add(attr_name, increment, global_stats, lang_stats):
             add("lazy_comments", results.get("lazy_comments", 0), res, lang_stats)
 
             add("syntax_errors", results.get("syntax_errors", 0), res, lang_stats)
-            add("indentation_errors", results.get("indentation_errors", 0), res, lang_stats)
+            add(
+                "indentation_errors",
+                results.get("indentation_errors", 0),
+                res,
+                lang_stats,
+            )
 
             add("prompt_tokens", results.get("prompt_tokens", 0), res, lang_stats)
-            add("completion_tokens", results.get("completion_tokens", 0), res, lang_stats)
+            add(
+                "completion_tokens",
+                results.get("completion_tokens", 0),
+                res,
+                lang_stats,
+            )
 
             res.reasoning_effort = results.get("reasoning_effort")
             res.thinking_tokens = results.get("thinking_tokens")
             res.map_tokens = results.get("map_tokens")
 
-            for key in "model edit_format commit_hash editor_model editor_edit_format".split():
+            for (
+                key
+            ) in (
+                "model edit_format commit_hash editor_model editor_edit_format".split()
+            ):
                 val = results.get(key)
                 if val:
                     variants[key].add(val)
@@ -586,7 +636,9 @@ def format_lang_stats(lang, lang_stats):
         def compute_lang_to_col_widths(lang_to_stats):
             lang_to_col_widths = {}
             for lang, lang_stats in lang_to_stats.items():
-                lang_stat_attrs = [getattr(lang_stats, attr) for attr in lang_stats.__dict__]
+                lang_stat_attrs = [
+                    getattr(lang_stats, attr) for attr in lang_stats.__dict__
+                ]
                 lang_col_width = max(len(lang), len(max(lang_stat_attrs, key=len)))
                 lang_to_col_widths[lang] = lang_col_width
 
@@ -596,7 +648,10 @@ def compute_lang_to_col_widths(lang_to_stats):
         print()
         print("======== Stats by language ========")
         print()
 
-        [format_lang_stats(lang, lang_stats) for lang, lang_stats in lang_to_stats.items()]
+        [
+            format_lang_stats(lang, lang_stats)
+            for lang, lang_stats in lang_to_stats.items()
+        ]
         lang_to_col_widths = compute_lang_to_col_widths(lang_to_stats)
 
         any_stats = list(lang_to_stats.values())[0]
@@ -683,7 +738,11 @@ def get_replayed_content(replay_dname, test_dname):
     return res
 
     res = res.splitlines(keepends=True)
-    res = [line for line in res if not line.startswith("> ") and not line.startswith("#### ")]
+    res = [
+        line
+        for line in res
+        if not line.startswith("> ") and not line.startswith("#### ")
+    ]
     return "".join(res)
 
@@ -862,7 +921,9 @@ def run_test_real(
                 cw.set_value("user", "name", "aider-benchmark")
                 cw.set_value("user", "email", "aider-benchmark@example.com")
             # Add existing files (solution set and any current files)
-            r.index.add([str(p.relative_to(testdir)) for p in testdir.rglob("*") if p.is_file()])
+            r.index.add(
+                [str(p.relative_to(testdir)) for p in testdir.rglob("*") if p.is_file()]
+            )
             r.index.commit("Initial commit for aider benchmark")
     except Exception as e:
         if verbose:
@@ -957,7 +1018,9 @@ def run_test_real(
         errors = errors.splitlines()
 
         syntax_errors += sum(1 for line in errors if line.startswith("SyntaxError"))
-        indentation_errors += sum(1 for line in errors if line.startswith("IndentationError"))
+        indentation_errors += sum(
+            1 for line in errors if line.startswith("IndentationError")
+        )
 
         print(errors[-1])
         errors = "\n".join(errors)
@@ -1029,7 +1092,9 @@ def run_test_real(
     )
 
     if edit_format == "architect":
-        results["editor_model"] = main_model.editor_model.name if main_model.editor_model else None
+        results["editor_model"] = (
+            main_model.editor_model.name if main_model.editor_model else None
+        )
         results["editor_edit_format"] = main_model.editor_edit_format
     dump(results)
 
@@ -1062,7 +1127,9 @@ def run_unit_tests(original_dname, testdir, history_fname, test_files):
             break
 
     if not command:
-        raise ValueError(f"No test command found for files with extensions: {extensions}")
+        raise ValueError(
+            f"No test command found for files with extensions: {extensions}"
+        )
 
     # Copy test files from original directory
     for file_path in test_files:

From 5eaf450adf0dfe38c9fb0d1ba0509a465f87a90e Mon Sep 17 00:00:00 2001
From: Erich Schulz
Date: Tue, 23 Dec 2025 10:56:10 +1000
Subject: [PATCH 07/48] feat: Add dry run option to benchmark

Co-authored-by: aider-ce (gemini/gemini-3-pro-preview)
---
 benchmark/benchmark.py | 132 +++++++++++++++++++++--------------------
 1 file changed, 69 insertions(+), 63 deletions(-)

diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py
index c375154a357..e7e0fdb3efa 100755
--- a/benchmark/benchmark.py
+++ b/benchmark/benchmark.py
@@ -225,7 +225,7 @@ def get_exercise_dirs(base_dir, languages=None):
         print("No exercise directories found")
         return 1
 
-    if clean and dirname.exists():
+    if clean and dirname.exists() and not dry:
         print("Cleaning up and replacing", dirname)
         dir_files = set(fn.name for fn in dirname.glob("*"))
         original_files = set(fn.name for fn in original_dname.glob("*"))
@@ -243,7 +243,7 @@ def get_exercise_dirs(base_dir, languages=None):
 
         dirname.rename(dest)
 
-    if not dirname.exists():
+    if not dirname.exists() and not dry:
         print(f"Copying {original_dname} -> {dirname} ...")
         # Only copy the practice subdirs with exercises
         os.makedirs(dirname, exist_ok=True)
@@ -318,6 +318,7 @@ def get_exercise_dirs(base_dir, languages=None):
                 thinking_tokens,
                 map_tokens,
                 repomap_in_memory,
+                dry,
             )
 
             all_results.append(results)
@@ -346,6 +347,7 @@ def get_exercise_dirs(base_dir, languages=None):
             thinking_tokens,
             map_tokens,
             repomap_in_memory,
+            dry,
         )
 
     all_results = run_test_threaded.gather(tqdm=True)
@@ -779,6 +781,7 @@ def run_test_real(
     map_tokens: Optional[int] = None,
     read_model_settings=None,
     repomap_in_memory: bool = False,
+    dry: bool = False,
 ):
     # Lazy imports: only needed in the actual benchmark execution path
     import git
@@ -847,18 +850,19 @@ def run_test_real(
             fnames.append(src)
             # restore the original file, in case we interrupted a prev run
             # Find the original file in the language-specific practice dir
-            lang_part = str(testdir).split("/exercises/practice/")[0]
-            original_fname = (
-                original_dname
-                / Path(lang_part).name
-                / "exercises"
-                / "practice"
-                / testdir.name
-                / file_path
-            )
-            if original_fname.exists():
-                os.makedirs(src.parent, exist_ok=True)
-                shutil.copy(original_fname, src)
+            if not dry:
+                lang_part = str(testdir).split("/exercises/practice/")[0]
+                original_fname = (
+                    original_dname
+                    / Path(lang_part).name
+                    / "exercises"
+                    / "practice"
+                    / testdir.name
+                    / file_path
+                )
+                if original_fname.exists():
+                    os.makedirs(src.parent, exist_ok=True)
+                    shutil.copy(original_fname, src)
         else:
             print(f"Warning: Solution file not found: {src}")
 
@@ -912,22 +916,23 @@ def run_test_real(
    show_fnames = ",".join(map(str, fnames))
     print("fnames:", show_fnames)
     # Ensure this test directory is a standalone git repo so RepoMap can be used
-    try:
-        git_dir = testdir / ".git"
-        if not git_dir.exists():
-            r = git.Repo.init(testdir)
-            # Set a local identity to avoid commit failures in clean containers
-            with r.config_writer() as cw:
-                cw.set_value("user", "name", "aider-benchmark")
-                cw.set_value("user", "email", "aider-benchmark@example.com")
-            # Add existing files (solution set and any current files)
-            r.index.add(
-                [str(p.relative_to(testdir)) for p in testdir.rglob("*") if p.is_file()]
-            )
-            r.index.commit("Initial commit for aider benchmark")
-    except Exception as e:
-        if verbose:
-            print(f"Warning: failed to initialize git repo in {testdir}: {e}")
+    if not dry:
+        try:
+            git_dir = testdir / ".git"
+            if not git_dir.exists():
+                r = git.Repo.init(testdir)
+                # Set a local identity to avoid commit failures in clean containers
+                with r.config_writer() as cw:
+                    cw.set_value("user", "name", "aider-benchmark")
+                    cw.set_value("user", "email", "aider-benchmark@example.com")
+                # Add existing files (solution set and any current files)
+                r.index.add(
+                    [str(p.relative_to(testdir)) for p in testdir.rglob("*") if p.is_file()]
+                )
+                r.index.commit("Initial commit for aider benchmark")
+        except Exception as e:
+            if verbose:
+                print(f"Warning: failed to initialize git repo in {testdir}: {e}")
 
     coder_kwargs = dict(
         main_model=main_model,
@@ -1027,39 +1032,40 @@ def run_test_real(
         instructions = errors
         instructions += prompts.test_failures.format(file_list=file_list)
 
-    # Clean up build directories after all attempts
-    # Rust target/debug
-    target_dir = testdir / "target" / "debug"
-    if target_dir.exists():
-        try:
-            shutil.rmtree(target_dir)
-            if verbose:
-                print(f"Cleaned up Rust target/debug directory: {target_dir}")
-        except (OSError, shutil.Error, PermissionError) as e:
-            if verbose:
-                print(f"Failed to clean up Rust target/debug directory: {e}")
-
-    # Java build directories
-    java_build_dir = testdir / "build"
-    if java_build_dir.exists():
-        try:
-            shutil.rmtree(java_build_dir)
-            if verbose:
-                print(f"Cleaned up Java build directory: {java_build_dir}")
-        except (OSError, shutil.Error, PermissionError) as e:
-            if verbose:
-                print(f"Failed to clean up Java build directory: {e}")
-
-    # Node.js node_modules directories
-    node_modules_dir = testdir / "node_modules"
-    if node_modules_dir.exists():
-        try:
-            shutil.rmtree(node_modules_dir)
-            if verbose:
-                print(f"Cleaned up Node.js node_modules directory: {node_modules_dir}")
-        except (OSError, shutil.Error, PermissionError) as e:
-            if verbose:
-                print(f"Failed to clean up Node.js node_modules directory: {e}")
+    if not dry:
+        # Clean up build directories after all attempts
+        # Rust target/debug
+        target_dir = testdir / "target" / "debug"
+        if target_dir.exists():
+            try:
+                shutil.rmtree(target_dir)
+                if verbose:
+                    print(f"Cleaned up Rust target/debug directory: {target_dir}")
+            except (OSError, shutil.Error, PermissionError) as e:
+                if verbose:
+                    print(f"Failed to clean up Rust target/debug directory: {e}")
+
+        # Java build directories
+        java_build_dir = testdir / "build"
+        if java_build_dir.exists():
+            try:
+                shutil.rmtree(java_build_dir)
+                if verbose:
+                    print(f"Cleaned up Java build directory: {java_build_dir}")
+            except (OSError, shutil.Error, PermissionError) as e:
+                if verbose:
+                    print(f"Failed to clean up Java build directory: {e}")
+
+        # Node.js node_modules directories
+        node_modules_dir = testdir / "node_modules"
+        if node_modules_dir.exists():
+            try:
+                shutil.rmtree(node_modules_dir)
+                if verbose:
+                    print(f"Cleaned up Node.js node_modules directory: {node_modules_dir}")
+            except (OSError, shutil.Error, PermissionError) as e:
+                if verbose:
+                    print(f"Failed to clean up Node.js node_modules directory: {e}")
 
     results = dict(
         testdir=str(testdir),

From 1b0d525570ec0257e27d6be05cb3c8d34f2296e1 Mon Sep 17 00:00:00 2001
From: Erich Schulz
Date: Tue, 23 Dec 2025 10:56:13 +1000
Subject: [PATCH 08/48] chore: Run linter on benchmark files

Co-authored-by: aider-ce (gemini/gemini-3-pro-preview)
---
 benchmark/benchmark.py | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py
index e7e0fdb3efa..43505334ca3 100755
--- a/benchmark/benchmark.py
+++ b/benchmark/benchmark.py
@@ -927,7 +927,11 @@ def run_test_real(
                     cw.set_value("user", "email", "aider-benchmark@example.com")
                 # Add existing files (solution set and any current files)
                 r.index.add(
-                    [str(p.relative_to(testdir)) for p in testdir.rglob("*") if p.is_file()]
+                    [
+                        str(p.relative_to(testdir))
+                        for p in testdir.rglob("*")
+                        if
p.is_file() + ] ) r.index.commit("Initial commit for aider benchmark") except Exception as e: @@ -1062,7 +1066,9 @@ def run_test_real( try: shutil.rmtree(node_modules_dir) if verbose: - print(f"Cleaned up Node.js node_modules directory: {node_modules_dir}") + print( + f"Cleaned up Node.js node_modules directory: {node_modules_dir}" + ) except (OSError, shutil.Error, PermissionError) as e: if verbose: print(f"Failed to clean up Node.js node_modules directory: {e}") From c685caff8b4938a7503e828731c5e77699529c7e Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 11:15:36 +1000 Subject: [PATCH 09/48] feat: Replace print with logging and add verbose/quiet flags Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 109 ++++++++++++++++++++++------------------- 1 file changed, 59 insertions(+), 50 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 43505334ca3..e246bedf730 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -14,6 +14,7 @@ from pathlib import Path from types import SimpleNamespace from typing import List, Optional +import logging """ Performance-oriented refactors: @@ -30,6 +31,8 @@ from aider.dump import dump # noqa: F401 +logger = logging.getLogger("aider.benchmark") + # Cache for commit-hash -> version lookup _VERSION_CACHE = {} @@ -50,13 +53,14 @@ def resolve_dirname(dirname, use_single_prior, make_new): priors = list(BENCHMARK_DNAME.glob(f"*--{dirname}")) if len(priors) == 1 and use_single_prior: dirname = priors[0].name - print(f"Using pre-existing {dirname}") + logger.info(f"Using pre-existing {dirname}") elif len(priors): if not make_new: - print(f"Prior runs of {dirname} exist, use --new or name one explicitly") - print() + logger.warning( + f"Prior runs of {dirname} exist, use --new or name one explicitly" + ) for prior in priors: - print(prior) + logger.warning(prior) return if not re.match(r"\d\d\d\d-\d\d-\d\d-", str(dirname)): @@ -111,7 +115,10 @@ def 
main( False, "--no-unit-tests", help="Do not run unit tests" ), no_aider: bool = typer.Option(False, "--no-aider", help="Do not run aider"), - verbose: bool = typer.Option(False, "--verbose", "-v", help="Verbose output"), + verbose: int = typer.Option( + 0, "--verbose", "-v", count=True, help="Verbose output" + ), + quiet: bool = typer.Option(False, "--quiet", "-q", help="Quiet output"), tries: int = typer.Option( 2, "--tries", "-r", help="Number of tries for running tests" ), @@ -147,6 +154,15 @@ def main( False, "--dry", help="Run in dry mode (no aider, no tests)" ), ): + if quiet: + log_level = logging.WARNING + elif verbose > 0: + log_level = logging.DEBUG + else: + log_level = logging.INFO + + logging.basicConfig(level=log_level, format="%(message)s") + if dry: no_aider = True no_unit_tests = True @@ -155,7 +171,7 @@ def main( dirnames = [] if len(dirnames) > 1: - print("Only provide 1 dirname") + logger.error("Only provide 1 dirname") return 1 updated_dirnames = [] @@ -183,8 +199,10 @@ def main( commit_hash += "-dirty" if not dry and "AIDER_DOCKER" not in os.environ: - print("Warning: Benchmarking runs unvetted code. Run in a docker container.") - print( + logger.warning( + "Warning: Benchmarking runs unvetted code. Run in a docker container." + ) + logger.warning( "Set AIDER_DOCKER in the environment to by-pass this check at your own risk." 
) return @@ -204,7 +222,7 @@ def get_exercise_dirs(base_dir, languages=None): lang_dirs = [d for d in lang_dirs if d.name.lower() in requested] dump(lang_dirs) if not lang_dirs: - print(f"No matching language directories found for: {languages}") + logger.warning(f"No matching language directories found for: {languages}") return [] # Get all exercise dirs under exercises/practice for each language @@ -222,17 +240,16 @@ def get_exercise_dirs(base_dir, languages=None): exercise_dirs = get_exercise_dirs(original_dname, languages) if not exercise_dirs: - print("No exercise directories found") + logger.error("No exercise directories found") return 1 if clean and dirname.exists() and not dry: - print("Cleaning up and replacing", dirname) + logger.info(f"Cleaning up and replacing {dirname}") dir_files = set(fn.name for fn in dirname.glob("*")) original_files = set(fn.name for fn in original_dname.glob("*")) if dir_files != original_files: - print( - "ERROR: will not delete dir that does not look like original tests", - dirname, + logger.error( + f"ERROR: will not delete dir that does not look like original tests {dirname}" ) return @@ -244,7 +261,7 @@ def get_exercise_dirs(base_dir, languages=None): dirname.rename(dest) if not dirname.exists() and not dry: - print(f"Copying {original_dname} -> {dirname} ...") + logger.info(f"Copying {original_dname} -> {dirname} ...") # Only copy the practice subdirs with exercises os.makedirs(dirname, exist_ok=True) for lang_dir in original_dname.iterdir(): @@ -255,7 +272,7 @@ def get_exercise_dirs(base_dir, languages=None): dest_lang_dir = dirname / lang_dir.name / "exercises" / "practice" os.makedirs(dest_lang_dir.parent, exist_ok=True) shutil.copytree(practice_dir, dest_lang_dir) - print("...done") + logger.info("...done") test_dnames = sorted(str(d.relative_to(original_dname)) for d in exercise_dirs) @@ -268,13 +285,12 @@ def get_exercise_dirs(base_dir, languages=None): if read_model_settings: try: files_loaded = 
models.register_models([read_model_settings]) - if verbose: - if files_loaded: - print(f"Loaded model settings from: {files_loaded[0]}") - else: - print(f"No model settings loaded from: {read_model_settings}") + if files_loaded: + logger.debug(f"Loaded model settings from: {files_loaded[0]}") + else: + logger.debug(f"No model settings loaded from: {read_model_settings}") except Exception as e: - print(f"Error loading model settings: {e}") + logger.error(f"Error loading model settings: {e}") return 1 if keywords: @@ -379,7 +395,7 @@ def load_results(dirname, stats_languages=None): lang = fname.parent.parent.parent.parent.name lang_to_results.setdefault(lang, []).append(results) except json.JSONDecodeError: - print("json.JSONDecodeError", fname) + logger.warning(f"json.JSONDecodeError {fname}") continue return lang_to_results @@ -752,9 +768,9 @@ def run_test(original_dname, testdir, *args, **kwargs): try: return run_test_real(original_dname, testdir, *args, **kwargs) except Exception: - print("=" * 40) - print("Test failed") - traceback.print_exc() + logger.error("=" * 40) + logger.error("Test failed") + logger.error(traceback.format_exc()) testdir = Path(testdir) results_fname = testdir / ".aider.results.json" @@ -792,7 +808,7 @@ def run_test_real( from aider.io import InputOutput if not os.path.isdir(testdir): - print("Not a dir:", testdir) + logger.error(f"Not a dir: {testdir}") return testdir = Path(testdir) @@ -808,7 +824,7 @@ def run_test_real( # else: return res except JSONDecodeError: - print(f"{results_fname} failed to parse, redoing...") + logger.warning(f"{results_fname} failed to parse, redoing...") # Read solution and test files from config fnames = [] @@ -864,7 +880,7 @@ def run_test_real( os.makedirs(src.parent, exist_ok=True) shutil.copy(original_fname, src) else: - print(f"Warning: Solution file not found: {src}") + logger.warning(f"Warning: Solution file not found: {src}") file_list = " ".join(fname.name for fname in fnames) @@ -914,7 +930,7 @@ def 
run_test_real( dump(main_model) dump(edit_format) show_fnames = ",".join(map(str, fnames)) - print("fnames:", show_fnames) + logger.info(f"fnames: {show_fnames}") # Ensure this test directory is a standalone git repo so RepoMap can be used if not dry: try: @@ -935,8 +951,7 @@ def run_test_real( ) r.index.commit("Initial commit for aider benchmark") except Exception as e: - if verbose: - print(f"Warning: failed to initialize git repo in {testdir}: {e}") + logger.debug(f"Warning: failed to initialize git repo in {testdir}: {e}") coder_kwargs = dict( main_model=main_model, @@ -1031,7 +1046,7 @@ def run_test_real( 1 for line in errors if line.startswith("IndentationError") ) - print(errors[-1]) + logger.info(errors[-1]) errors = "\n".join(errors) instructions = errors instructions += prompts.test_failures.format(file_list=file_list) @@ -1043,35 +1058,29 @@ def run_test_real( if target_dir.exists(): try: shutil.rmtree(target_dir) - if verbose: - print(f"Cleaned up Rust target/debug directory: {target_dir}") + logger.debug(f"Cleaned up Rust target/debug directory: {target_dir}") except (OSError, shutil.Error, PermissionError) as e: - if verbose: - print(f"Failed to clean up Rust target/debug directory: {e}") + logger.debug(f"Failed to clean up Rust target/debug directory: {e}") # Java build directories java_build_dir = testdir / "build" if java_build_dir.exists(): try: shutil.rmtree(java_build_dir) - if verbose: - print(f"Cleaned up Java build directory: {java_build_dir}") + logger.debug(f"Cleaned up Java build directory: {java_build_dir}") except (OSError, shutil.Error, PermissionError) as e: - if verbose: - print(f"Failed to clean up Java build directory: {e}") + logger.debug(f"Failed to clean up Java build directory: {e}") # Node.js node_modules directories node_modules_dir = testdir / "node_modules" if node_modules_dir.exists(): try: shutil.rmtree(node_modules_dir) - if verbose: - print( - f"Cleaned up Node.js node_modules directory: {node_modules_dir}" - ) + 
logger.debug( + f"Cleaned up Node.js node_modules directory: {node_modules_dir}" + ) except (OSError, shutil.Error, PermissionError) as e: - if verbose: - print(f"Failed to clean up Node.js node_modules directory: {e}") + logger.debug(f"Failed to clean up Node.js node_modules directory: {e}") results = dict( testdir=str(testdir), @@ -1148,7 +1157,7 @@ def run_unit_tests(original_dname, testdir, history_fname, test_files): src = original_dname / Path(*testdir.parts[-4:]) / file_path dst = testdir / file_path if src.exists(): - print("copying", src, dst) + logger.info(f"copying {src} {dst}") os.makedirs(dst.parent, exist_ok=True) shutil.copy(src, dst) @@ -1161,7 +1170,7 @@ def run_unit_tests(original_dname, testdir, history_fname, test_files): content = re.sub(r"@Disabled\([^)]*\)\s*\n", "", content) test_file.write_text(content) - print(" ".join(command)) + logger.info(" ".join(command)) result = subprocess.run( command, @@ -1183,7 +1192,7 @@ def run_unit_tests(original_dname, testdir, history_fname, test_files): fh.write(f"```\n{res}\n```") if not success: - print(f"Tests failed: {testdir}") + logger.info(f"Tests failed: {testdir}") return res From c9d0f0694828a4dd05a113362b3b3526b5d55cbe Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 11:15:39 +1000 Subject: [PATCH 10/48] chore: Run linter Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index e246bedf730..0cb8d977445 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -222,7 +222,9 @@ def get_exercise_dirs(base_dir, languages=None): lang_dirs = [d for d in lang_dirs if d.name.lower() in requested] dump(lang_dirs) if not lang_dirs: - logger.warning(f"No matching language directories found for: {languages}") + logger.warning( + f"No matching language directories found for: {languages}" + ) return [] # Get all exercise dirs under 
exercises/practice for each language From a8e8a1bc2b916cd4c316297d9c14de0568004cda Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 11:29:23 +1000 Subject: [PATCH 11/48] fix: Change default benchmark exercises directory --- benchmark/benchmark.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 0cb8d977445..3514fb7543a 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -37,8 +37,7 @@ _VERSION_CACHE = {} BENCHMARK_DNAME = Path(os.environ.get("AIDER_BENCHMARK_DIR", "tmp.benchmarks")) - -EXERCISES_DIR_DEFAULT = "polyglot-benchmark" +EXERCISES_DIR_DEFAULT = "cecli-cat" app = typer.Typer(add_completion=False, pretty_exceptions_enable=False) @@ -68,6 +67,7 @@ def resolve_dirname(dirname, use_single_prior, make_new): now = now.strftime("%Y-%m-%d-%H-%M-%S--") dirname = now + dirname.name + logger.debug(f"resolved {dirname}") dirname = BENCHMARK_DNAME / dirname return dirname From 0121aeba553b90ee2fca535577d1f87ae9cd3622 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 11:29:24 +1000 Subject: [PATCH 12/48] refactor: Add logging and comments to resolve_dirname Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 3514fb7543a..b0d817be4fc 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -46,6 +46,13 @@ def resolve_dirname(dirname, use_single_prior, make_new): + """ + Determines the actual directory path used for storing benchmark results. + + 1. Resuming a previous run: If the --cont flag is used and exactly one matching previous run exists, it selects that existing directory. + 2. Safety check: If previous runs exist but the user didn't specify --new or --cont, it warns the user and aborts to prevent accidental overwrites or confusion. + 3. 
Creating a new run: If no prior run exists (or --new is used), it prepends the current timestamp to the directory name to ensure a unique workspace. + """ if len(dirname.parts) > 1: return dirname @@ -174,6 +181,8 @@ def main( logger.error("Only provide 1 dirname") return 1 + logger.info(f"dirnames: {dirnames}") + updated_dirnames = [] for dirname in dirnames: dirname = Path(dirname) @@ -182,6 +191,7 @@ def main( return 1 updated_dirnames.append(dirname) + logger.info(f"updated_dirnames: {updated_dirnames}") assert len(updated_dirnames) == 1, updated_dirnames dirname = updated_dirnames[0] From c70e766b952d89c332aaa3bcc5162ed78b4d75c0 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 11:58:42 +1000 Subject: [PATCH 13/48] feat: Rename dirname to results_dir for clarity --- benchmark/benchmark.py | 103 ++++++++++++++++++----------------------- 1 file changed, 45 insertions(+), 58 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index b0d817be4fc..44c8a4f53c2 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -38,6 +38,7 @@ BENCHMARK_DNAME = Path(os.environ.get("AIDER_BENCHMARK_DIR", "tmp.benchmarks")) EXERCISES_DIR_DEFAULT = "cecli-cat" +RESULTS_DIR_DEFAULT = "cat-results" app = typer.Typer(add_completion=False, pretty_exceptions_enable=False) @@ -45,7 +46,7 @@ load_dotenv(override=True) -def resolve_dirname(dirname, use_single_prior, make_new): +def resolve_dirname(results_dir, use_single_prior, make_new): """ Determines the actual directory path used for storing benchmark results. @@ -53,35 +54,39 @@ def resolve_dirname(dirname, use_single_prior, make_new): 2. Safety check: If previous runs exist but the user didn't specify --new or --cont, it warns the user and aborts to prevent accidental overwrites or confusion. 3. Creating a new run: If no prior run exists (or --new is used), it prepends the current timestamp to the directory name to ensure a unique workspace. 
""" - if len(dirname.parts) > 1: - return dirname + logger.debug(f"initial results_dir: {results_dir}") + results_dir = Path(results_dir) + logger.debug(f"dirname1: {results_dir}") + if len(results_dir.parts) > 1: + return results_dir - priors = list(BENCHMARK_DNAME.glob(f"*--{dirname}")) + priors = list(BENCHMARK_DNAME.glob(f"*--{results_dir}")) if len(priors) == 1 and use_single_prior: - dirname = priors[0].name - logger.info(f"Using pre-existing {dirname}") + results_dir = priors[0].name + logger.info(f"Using pre-existing {results_dir}") elif len(priors): if not make_new: logger.warning( - f"Prior runs of {dirname} exist, use --new or name one explicitly" + f"Prior runs of {results_dir} exist, use --new or name one explicitly" ) for prior in priors: logger.warning(prior) return - if not re.match(r"\d\d\d\d-\d\d-\d\d-", str(dirname)): + if not re.match(r"\d\d\d\d-\d\d-\d\d-", str(results_dir)): now = datetime.datetime.now() now = now.strftime("%Y-%m-%d-%H-%M-%S--") - dirname = now + dirname.name + results_dir = now + results_dir.name - logger.debug(f"resolved {dirname}") - dirname = BENCHMARK_DNAME / dirname - return dirname + logger.debug(f"resolved {results_dir}") + results_dir = BENCHMARK_DNAME / results_dir + logger.info(f"updated results_dir: {results_dir}") + return results_dir @app.command() def main( - dirnames: Optional[List[str]] = typer.Argument(None, help="Directory names"), + results_dir: Optional[str] = typer.Argument(RESULTS_DIR_DEFAULT, help="Results directory"), model: str = typer.Option("gpt-3.5-turbo", "--model", "-m", help="Model name"), sleep: float = typer.Option( 0, "--sleep", help="Sleep seconds between tests when single threaded" @@ -161,6 +166,7 @@ def main( False, "--dry", help="Run in dry mode (no aider, no tests)" ), ): + # setup logging and verbosity if quiet: log_level = logging.WARNING elif verbose > 0: @@ -174,26 +180,7 @@ def main( no_aider = True no_unit_tests = True - if dirnames is None: - dirnames = [] - - if len(dirnames) > 
1: - logger.error("Only provide 1 dirname") - return 1 - - logger.info(f"dirnames: {dirnames}") - - updated_dirnames = [] - for dirname in dirnames: - dirname = Path(dirname) - dirname = resolve_dirname(dirname, cont, make_new) - if not dirname: - return 1 - updated_dirnames.append(dirname) - - logger.info(f"updated_dirnames: {updated_dirnames}") - assert len(updated_dirnames) == 1, updated_dirnames - dirname = updated_dirnames[0] + results_dir = resolve_dirname(results_dir, cont, make_new) # Lazy imports for the actual benchmark run import git # Heavy @@ -255,33 +242,33 @@ def get_exercise_dirs(base_dir, languages=None): logger.error("No exercise directories found") return 1 - if clean and dirname.exists() and not dry: - logger.info(f"Cleaning up and replacing {dirname}") - dir_files = set(fn.name for fn in dirname.glob("*")) + if clean and results_dir.exists() and not dry: + logger.info(f"Cleaning up and replacing {results_dir}") + dir_files = set(fn.name for fn in results_dir.glob("*")) original_files = set(fn.name for fn in original_dname.glob("*")) if dir_files != original_files: logger.error( - f"ERROR: will not delete dir that does not look like original tests {dirname}" + f"ERROR: will not delete dir that does not look like original tests {results_dir}" ) return - dest = dirname.parent / "OLD" / dirname.name + dest = results_dir.parent / "OLD" / results_dir.name if dest.exists(): old_now = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S") - dest = dirname.parent / "OLD" / (old_now + dirname.name) + dest = results_dir.parent / "OLD" / (old_now + results_dir.name) - dirname.rename(dest) + results_dir.rename(dest) - if not dirname.exists() and not dry: - logger.info(f"Copying {original_dname} -> {dirname} ...") + if not results_dir.exists() and not dry: + logger.info(f"Copying {original_dname} -> {results_dir} ...") # Only copy the practice subdirs with exercises - os.makedirs(dirname, exist_ok=True) + os.makedirs(results_dir, exist_ok=True) for lang_dir 
in original_dname.iterdir(): if not lang_dir.is_dir(): continue practice_dir = lang_dir / "exercises" / "practice" if practice_dir.exists(): - dest_lang_dir = dirname / lang_dir.name / "exercises" / "practice" + dest_lang_dir = results_dir / lang_dir.name / "exercises" / "practice" os.makedirs(dest_lang_dir.parent, exist_ok=True) shutil.copytree(practice_dir, dest_lang_dir) logger.info("...done") @@ -329,7 +316,7 @@ def get_exercise_dirs(base_dir, languages=None): for test_path in test_dnames: results = run_test( original_dname, - dirname / test_path, + results_dir / test_path, model, edit_format, tries, @@ -350,7 +337,7 @@ def get_exercise_dirs(base_dir, languages=None): ) all_results.append(results) - summarize_results(dirname, verbose) + summarize_results(results_dir, verbose) if sleep: time.sleep(sleep) else: @@ -358,7 +345,7 @@ def get_exercise_dirs(base_dir, languages=None): for test_path in test_dnames: run_test_threaded.scatter( original_dname, - dirname / test_path, + results_dir / test_path, model, edit_format, tries, @@ -382,13 +369,13 @@ def get_exercise_dirs(base_dir, languages=None): print() print() print() - summarize_results(dirname, verbose) + summarize_results(results_dir, verbose) return 0 -def load_results(dirname, stats_languages=None): - dirname = Path(dirname) +def load_results(results_dir, stats_languages=None): + results_dir = Path(results_dir) lang_to_results = {} if stats_languages: @@ -400,7 +387,7 @@ def load_results(dirname, stats_languages=None): glob_patterns = ["*/exercises/practice/*/.aider.results.json"] for pattern in glob_patterns: - for fname in dirname.glob(pattern): + for fname in results_dir.glob(pattern): try: results = json.loads(fname.read_text()) # json / test / prac / exer / lang @@ -412,11 +399,11 @@ def load_results(dirname, stats_languages=None): return lang_to_results -def summarize_results(dirname, verbose, stats_languages=None): - lang_to_results = load_results(dirname, stats_languages) +def 
summarize_results(results_dir, verbose, stats_languages=None): + lang_to_results = load_results(results_dir, stats_languages) res = SimpleNamespace() - res.total_tests = len(list(Path(dirname).glob("*/exercises/practice/*"))) + res.total_tests = len(list(Path(results_dir).glob("*/exercises/practice/*"))) try: tries = max( @@ -428,7 +415,7 @@ def summarize_results(dirname, verbose, stats_languages=None): except ValueError: tries = 0 - res.dir_name = str(dirname) + res.dir_name = str(results_dir) passed_tests = [0] * tries @@ -555,11 +542,11 @@ def add(attr_name, increment, global_stats, lang_stats): # return console = Console(highlight=False) - console.rule(title=str(dirname)) + console.rule(title=str(results_dir)) commit_hashes = variants["commit_hash"] versions = get_versions(commit_hashes) - date = dirname.name[:10] + date = results_dir.name[:10] def show(stat, red="red"): val = getattr(res, stat) @@ -574,7 +561,7 @@ def show(stat, red="red"): setattr(res, f"pass_rate_{i + 1}", f"{pass_rate:.1f}") setattr(res, f"pass_num_{i + 1}", passed_tests[i]) - print(f"- dirname: {dirname.name}") + print(f"- results_dir: {results_dir.name}") style = None if res.completed_tests == res.total_tests else "red" console.print(f" test_cases: {res.completed_tests}", style=style) for key, val in variants.items(): From 8430e0255986b31dfce28440e0fbca0c41f4d19f Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 11:58:43 +1000 Subject: [PATCH 14/48] fix: Replace asserts with explicit error logging and exit Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 44c8a4f53c2..b06aa6f5cfd 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -204,7 +204,9 @@ def main( ) return - assert BENCHMARK_DNAME.exists() and BENCHMARK_DNAME.is_dir(), BENCHMARK_DNAME + if not (BENCHMARK_DNAME.exists() and 
BENCHMARK_DNAME.is_dir()): + logger.error(f"Benchmark directory not found: {BENCHMARK_DNAME}") + sys.exit(1) def get_exercise_dirs(base_dir, languages=None): """Get all exercise directories for specified languages (or all if none specified)""" @@ -234,7 +236,9 @@ def get_exercise_dirs(base_dir, languages=None): return exercise_dirs original_dname = BENCHMARK_DNAME / exercises_dir - assert original_dname.exists() and original_dname.is_dir(), original_dname + if not (original_dname.exists() and original_dname.is_dir()): + logger.error(f"Exercises directory not found: {original_dname}") + sys.exit(1) exercise_dirs = get_exercise_dirs(original_dname, languages) From 601e8c3fe7d2f95cfd86e7587c86f505d383c4c2 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 11:58:45 +1000 Subject: [PATCH 15/48] fix: Resolve linter errors in benchmark script Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index b06aa6f5cfd..bfd915abd83 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -86,7 +86,9 @@ def resolve_dirname(results_dir, use_single_prior, make_new): @app.command() def main( - results_dir: Optional[str] = typer.Argument(RESULTS_DIR_DEFAULT, help="Results directory"), + results_dir: Optional[str] = typer.Argument( + RESULTS_DIR_DEFAULT, help="Results directory" + ), model: str = typer.Option("gpt-3.5-turbo", "--model", "-m", help="Model name"), sleep: float = typer.Option( 0, "--sleep", help="Sleep seconds between tests when single threaded" From bd5afe255531f5d8014c7b241517c543e9a37618 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 13:02:34 +1000 Subject: [PATCH 16/48] hacking --- benchmark/benchmark.py | 42 +++++++++++++++++++++++------------------- 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 
bfd915abd83..bc070dbd45e 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -87,7 +87,7 @@ def resolve_dirname(results_dir, use_single_prior, make_new): @app.command() def main( results_dir: Optional[str] = typer.Argument( - RESULTS_DIR_DEFAULT, help="Results directory" + "unnamed", help="Results directory slug" ), model: str = typer.Option("gpt-3.5-turbo", "--model", "-m", help="Model name"), sleep: float = typer.Option( @@ -181,21 +181,20 @@ def main( if dry: no_aider = True no_unit_tests = True + else: + # Lazy imports for the actual benchmark run + import git # Heavy + import importlib_resources # Used for model metadata registration + import lox # Only needed for threaded runs + from aider import models, sendchat + from aider.coders import base_coder + repo = git.Repo(search_parent_directories=True) + commit_hash = repo.head.object.hexsha[:7] + if repo.is_dirty(): + commit_hash += "-dirty" results_dir = resolve_dirname(results_dir, cont, make_new) - # Lazy imports for the actual benchmark run - import git # Heavy - import importlib_resources # Used for model metadata registration - import lox # Only needed for threaded runs - - from aider import models, sendchat - from aider.coders import base_coder - - repo = git.Repo(search_parent_directories=True) - commit_hash = repo.head.object.hexsha[:7] - if repo.is_dirty(): - commit_hash += "-dirty" if not dry and "AIDER_DOCKER" not in os.environ: logger.warning( @@ -206,13 +205,21 @@ def main( ) return + # Check dirs exist if not (BENCHMARK_DNAME.exists() and BENCHMARK_DNAME.is_dir()): logger.error(f"Benchmark directory not found: {BENCHMARK_DNAME}") sys.exit(1) + original_dname = BENCHMARK_DNAME / exercises_dir + if not (original_dname.exists() and original_dname.is_dir()): + logger.error(f"Exercises directory not found: {original_dname}") + sys.exit(1) - def get_exercise_dirs(base_dir, languages=None): - """Get all exercise directories for specified languages (or all if none specified)""" + def 
legacy_get_exercise_dirs(base_dir, languages=None): + """Get all exercise directories for specified languages (or all if none specified). + Uses the legacy `exercises/practice` pattern. + """ base_dir = Path(base_dir) + logger.info(f"Looking for exercises in {base_dir}") # Get available language dirs lang_dirs = [d for d in base_dir.iterdir() if d.is_dir()] @@ -237,10 +244,7 @@ def get_exercise_dirs(base_dir, languages=None): return exercise_dirs - original_dname = BENCHMARK_DNAME / exercises_dir - if not (original_dname.exists() and original_dname.is_dir()): - logger.error(f"Exercises directory not found: {original_dname}") - sys.exit(1) + def get_exercise_dirs(base_dir, languages=None): exercise_dirs = get_exercise_dirs(original_dname, languages) From 85e15564c58922d746fc90d062b31377afb1fb42 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 13:20:14 +1000 Subject: [PATCH 17/48] feat: Add support for new cat exercise structure Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 50 ++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 48 insertions(+), 2 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index bc070dbd45e..b40769aef3d 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -9,6 +9,7 @@ import sys import time import traceback +import yaml from collections import defaultdict from json.decoder import JSONDecodeError from pathlib import Path @@ -164,6 +165,15 @@ def main( exercises_dir: str = typer.Option( EXERCISES_DIR_DEFAULT, "--exercises-dir", help="Directory with exercise files" ), + legacy: bool = typer.Option( + False, "--legacy", help="Use legacy exercise directory structure" + ), + sets: Optional[str] = typer.Option( + None, "--sets", help="Only run tests for specific sets (comma separated)" + ), + hash_re: Optional[str] = typer.Option( + None, "--hash-re", help="Regex to filter exercise hashes" + ), dry: bool = typer.Option( False, "--dry", help="Run in 
dry mode (no aider, no tests)" ), @@ -244,9 +254,45 @@ def legacy_get_exercise_dirs(base_dir, languages=None): return exercise_dirs - def get_exercise_dirs(base_dir, languages=None): + def get_exercise_dirs(base_dir, languages=None, sets=None, hash_re=None, legacy=False): + if legacy: + return legacy_get_exercise_dirs(base_dir, languages) + + base_dir = Path(base_dir) + logger.info(f"Scanning for cat.yaml in {base_dir}") + + lang_filter = ( + set(l.strip().lower() for l in languages.split(",")) if languages else None + ) + set_filter = set(s.strip().lower() for s in sets.split(",")) if sets else None + + exercise_dirs = [] + for cat_file in base_dir.rglob("cat.yaml"): + try: + with open(cat_file, "r") as f: + metadata = yaml.safe_load(f) + except Exception as e: + logger.warning(f"Failed to parse {cat_file}: {e}") + continue + + if lang_filter and metadata.get("language", "").lower() not in lang_filter: + continue + + if set_filter: + cat_sets = set(s.lower() for s in metadata.get("sets", [])) + if not (set_filter & cat_sets): + continue + + if hash_re and not re.search(hash_re, metadata.get("hash", "")): + continue + + exercise_dirs.append(cat_file.parent) - exercise_dirs = get_exercise_dirs(original_dname, languages) + return exercise_dirs + + exercise_dirs = get_exercise_dirs( + original_dname, languages, sets, hash_re, legacy=legacy + ) if not exercise_dirs: logger.error("No exercise directories found") From 14cb852f6c7df4785fc9ac307ddd6be229cd65af Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 13:20:17 +1000 Subject: [PATCH 18/48] fix: Run linter and fix formatting issues Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index b40769aef3d..50067acc666 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -198,6 +198,7 @@ def main( import lox # Only needed for threaded runs 
from aider import models, sendchat from aider.coders import base_coder + repo = git.Repo(search_parent_directories=True) commit_hash = repo.head.object.hexsha[:7] if repo.is_dirty(): @@ -205,7 +206,6 @@ def main( results_dir = resolve_dirname(results_dir, cont, make_new) - if not dry and "AIDER_DOCKER" not in os.environ: logger.warning( "Warning: Benchmarking runs unvetted code. Run in a docker container." @@ -254,7 +254,9 @@ def legacy_get_exercise_dirs(base_dir, languages=None): return exercise_dirs - def get_exercise_dirs(base_dir, languages=None, sets=None, hash_re=None, legacy=False): + def get_exercise_dirs( + base_dir, languages=None, sets=None, hash_re=None, legacy=False + ): if legacy: return legacy_get_exercise_dirs(base_dir, languages) From 7df0b0f2db28a6d3e8473ae3ef954c440ba0787f Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 13:43:04 +1000 Subject: [PATCH 19/48] chore: Add logging for found exercises and metadata --- benchmark/benchmark.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 50067acc666..11a5839a026 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -273,6 +273,7 @@ def get_exercise_dirs( try: with open(cat_file, "r") as f: metadata = yaml.safe_load(f) + logger.info(f"found {metadata['name']} ({metadata['language']})") except Exception as e: logger.warning(f"Failed to parse {cat_file}: {e}") continue @@ -290,6 +291,7 @@ def get_exercise_dirs( exercise_dirs.append(cat_file.parent) + logger.info(f"Found {len(exercise_dirs)} cats") return exercise_dirs exercise_dirs = get_exercise_dirs( From f24d56dd275d70ccc316d9c0b2b5c04cbd1585d7 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 13:43:06 +1000 Subject: [PATCH 20/48] fix: Import importlib_resources at the top level Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git 
a/benchmark/benchmark.py b/benchmark/benchmark.py index 11a5839a026..80d3dbdee7b 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -1,5 +1,6 @@ #!/usr/bin/env python3 import datetime +import importlib_resources import json import os import random @@ -194,7 +195,6 @@ def main( else: # Lazy imports for the actual benchmark run import git # Heavy - import importlib_resources # Used for model metadata registration import lox # Only needed for threaded runs from aider import models, sendchat from aider.coders import base_coder From b021795d57d7f675491a6322eec8d0bc8e0e65f3 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 13:45:50 +1000 Subject: [PATCH 21/48] fix: Move models import to top level in benchmark script Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 80d3dbdee7b..e5c5ed6684d 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -189,6 +189,8 @@ def main( logging.basicConfig(level=log_level, format="%(message)s") + from aider import models + if dry: no_aider = True no_unit_tests = True @@ -196,7 +198,7 @@ def main( # Lazy imports for the actual benchmark run import git # Heavy import lox # Only needed for threaded runs - from aider import models, sendchat + from aider import sendchat from aider.coders import base_coder repo = git.Repo(search_parent_directories=True) From a3dc824d795c06aec3484beace26f2fa1ab935ba Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 13:52:25 +1000 Subject: [PATCH 22/48] refactor: Dry out run_test code for single and multi-threaded execution Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 75 +++++++++++++++--------------------------- 1 file changed, 27 insertions(+), 48 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index e5c5ed6684d..ef70702e412 
100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -373,60 +373,39 @@ def get_exercise_dirs( # Enable in-memory RepoMap cache when running multiple threads to avoid SQLite contention repomap_in_memory = threads > 1 - if threads == 1: + test_args = dict( + model_name=model, + edit_format=edit_format, + tries=tries, + no_unit_tests=no_unit_tests, + no_aider=no_aider, + verbose=verbose, + commit_hash=commit_hash, + replay=replay, + editor_model=editor_model, + editor_edit_format=editor_edit_format, + num_ctx=num_ctx, + sleep=sleep, + reasoning_effort=reasoning_effort, + thinking_tokens=thinking_tokens, + map_tokens=map_tokens, + repomap_in_memory=repomap_in_memory, + dry=dry, + ) + + if threads > 1: + run_test_threaded = lox.thread(threads)(run_test) + for test_path in test_dnames: + run_test_threaded.scatter(original_dname, results_dir / test_path, **test_args) + all_results = run_test_threaded.gather(tqdm=True) + else: all_results = [] for test_path in test_dnames: - results = run_test( - original_dname, - results_dir / test_path, - model, - edit_format, - tries, - no_unit_tests, - no_aider, - verbose, - commit_hash, - replay, - editor_model, - editor_edit_format, - num_ctx, - sleep, - reasoning_effort, - thinking_tokens, - map_tokens, - repomap_in_memory, - dry, - ) - + results = run_test(original_dname, results_dir / test_path, **test_args) all_results.append(results) summarize_results(results_dir, verbose) if sleep: time.sleep(sleep) - else: - run_test_threaded = lox.thread(threads)(run_test) - for test_path in test_dnames: - run_test_threaded.scatter( - original_dname, - results_dir / test_path, - model, - edit_format, - tries, - no_unit_tests, - no_aider, - verbose, - commit_hash, - replay, - editor_model, - editor_edit_format, - num_ctx, - sleep, - reasoning_effort, - thinking_tokens, - map_tokens, - repomap_in_memory, - dry, - ) - all_results = run_test_threaded.gather(tqdm=True) print() print() From 3cc00118bb75d227265cabaf57104eec18afe44c 
Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 13:52:27 +1000 Subject: [PATCH 23/48] fix: Correct indentation in benchmark script Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index ef70702e412..4391a572ae0 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -396,7 +396,9 @@ def get_exercise_dirs( if threads > 1: run_test_threaded = lox.thread(threads)(run_test) for test_path in test_dnames: - run_test_threaded.scatter(original_dname, results_dir / test_path, **test_args) + run_test_threaded.scatter( + original_dname, results_dir / test_path, **test_args + ) all_results = run_test_threaded.gather(tqdm=True) else: all_results = [] From f50685c37e41c56260f4d62d0590a960b2745254 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:10:29 +1000 Subject: [PATCH 24/48] fix: Set commit hash to '???????' when dry run --- benchmark/benchmark.py | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 4391a572ae0..7be52cfa995 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -194,6 +194,7 @@ def main( if dry: no_aider = True no_unit_tests = True + commit_hash = '???????' 
else: # Lazy imports for the actual benchmark run import git # Heavy @@ -364,11 +365,12 @@ def get_exercise_dirs( if num_tests > 0: test_dnames = test_dnames[:num_tests] - # Don't give up when benchmarking - LONG_TIMEOUT = 24 * 60 * 60 - sendchat.RETRY_TIMEOUT = LONG_TIMEOUT - base_coder.RETRY_TIMEOUT = LONG_TIMEOUT - models.RETRY_TIMEOUT = LONG_TIMEOUT + if not no_aider: + # Don't give up when benchmarking + LONG_TIMEOUT = 24 * 60 * 60 + sendchat.RETRY_TIMEOUT = LONG_TIMEOUT + base_coder.RETRY_TIMEOUT = LONG_TIMEOUT + models.RETRY_TIMEOUT = LONG_TIMEOUT # Enable in-memory RepoMap cache when running multiple threads to avoid SQLite contention repomap_in_memory = threads > 1 From 1969da46f97e776f06155bb7e13bc55116f94f20 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:10:31 +1000 Subject: [PATCH 25/48] docs: Update --hash-re help text for fractional set division Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 7be52cfa995..f4bd7544dbf 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -173,7 +173,12 @@ def main( None, "--sets", help="Only run tests for specific sets (comma separated)" ), hash_re: Optional[str] = typer.Option( - None, "--hash-re", help="Regex to filter exercise hashes" + None, + "--hash-re", + help=( + "Regex to filter exercise hashes. Useful for dividing the set into fractions using" + " hex chars: '^0' for 1/16, '^[01]' for 1/8, '^[0-3]' for 1/4." 
+ ), ), dry: bool = typer.Option( False, "--dry", help="Run in dry mode (no aider, no tests)" From 02164b6b6f204eddb64afa22e2350e25893babca Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:10:33 +1000 Subject: [PATCH 26/48] fix: Update placeholder commit hash in benchmark script Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index f4bd7544dbf..6550ab4adb5 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -199,7 +199,7 @@ def main( if dry: no_aider = True no_unit_tests = True - commit_hash = '???????' + commit_hash = "???????" else: # Lazy imports for the actual benchmark run import git # Heavy From 9fe5f245f92266c7031be2abe6779f1fa6236e8a Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:12:44 +1000 Subject: [PATCH 27/48] docs: Update hash-re help text with nth char and hex range examples Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 6550ab4adb5..81b0162f957 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -177,7 +177,8 @@ def main( "--hash-re", help=( "Regex to filter exercise hashes. Useful for dividing the set into fractions using" - " hex chars: '^0' for 1/16, '^[01]' for 1/8, '^[0-3]' for 1/4." + " hex chars: '^0' for 1/16, '^[01]' for 1/8, '^[0-3]' for 1/4. Use '^.{n}x' to" + " match the nth character (e.g., '^.{2}[4-7]' for the 3rd char in range 4-7)." 
), ), dry: bool = typer.Option( From 3f25430e8ae07a6afd9e9f0aafdde4cc83df42d9 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:19:54 +1000 Subject: [PATCH 28/48] docs: Add enhancements section to benchmark README Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/README.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/benchmark/README.md b/benchmark/README.md index 4425d0e1deb..0fea152b829 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -147,3 +147,15 @@ You can see examples of the benchmark report yaml in the [aider leaderboard data files](https://github.com/$ORG/aider/blob/main/aider/website/_data/). - These scripts are not intended for use by typical aider end users. - Some of these tools are written as `bash` scripts, so it will be hard to use them on Windows. + +## Enhancements + +The `aider-ce` benchmark harness includes several enhancements over the original `aider` benchmark: + +- **YAML Metadata**: Exercises now use `cat.yaml` files for metadata, allowing for richer categorization and filtering. +- **Subset Filtering**: Use the `--sets` option to run specific groups of tests (e.g., `--sets core,strings`). +- **K-fold Evaluation Slicing**: The `--hash-re` option allows for deterministic slicing of the exercise set based on the exercise hash. This is useful for parallelizing runs or performing k-fold cross-validation. + - `^0`: 1/16 of the set. + - `^[01]`: 1/8 of the set. + - `^[0-3]`: 1/4 of the set. + - `^.{2}[4-7]`: Targets the 3rd character of the hash for more granular slicing. 
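The deterministic hash slicing that `--hash-re` enables (documented in the patch above) can be sketched in isolation. This is a minimal illustration with made-up hex hashes standing in for the `hash` field read from each `cat.yaml`; the filter mirrors the `re.search(hash_re, ...)` check in `get_exercise_dirs`:

```python
import re

# Made-up hex hashes, standing in for the `hash` field of each cat.yaml.
hashes = ["0a1f", "13c2", "2b9d", "47e0", "8f3a", "b60c", "e77d", "f901"]

def select(hashes, hash_re):
    """Keep hashes matching the slicing regex, mirroring the --hash-re filter."""
    return [h for h in hashes if re.search(hash_re, h)]

print(select(hashes, "^0"))          # ['0a1f'] -- roughly 1/16 of a uniform set
print(select(hashes, "^[01]"))       # ['0a1f', '13c2'] -- roughly 1/8
print(select(hashes, "^[0-3]"))      # ['0a1f', '13c2', '2b9d'] -- roughly 1/4
print(select(hashes, "^.{2}[4-7]"))  # ['e77d'] -- slice on the 3rd hex char
```

Because the slices are anchored to the exercise hash rather than to file order, the same regex always selects the same exercises across runs and machines, which is what makes parallel or k-fold runs reproducible.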
From a1c011fa73e0720f654a466ccca488a53f7b197d Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:19:57 +1000 Subject: [PATCH 29/48] chore: Update benchmark README with linting fixes Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/README.md | 130 +++++++++++++++++++++++++------------------- 1 file changed, 74 insertions(+), 56 deletions(-) diff --git a/benchmark/README.md b/benchmark/README.md index 0fea152b829..e15ebb3c91a 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -1,29 +1,26 @@ # Aider benchmark harness -Before `cecli` was born, the old `aider` used benchmarks to quantitatively measure how well it works -with various LLMs. +Before `cecli` was born, the old `aider` used benchmarks to quantitatively +measure how well it works with various LLMs. This directory holds the harness and tools needed to run the benchmarking suite. ## Background -The benchmark was based on the [Exercism](https://github.com/exercism/python) coding exercises. -This -benchmark evaluates how effectively aider and LLMs can translate a -natural language coding request into executable code saved into -files that pass unit tests. -It provides an end-to-end evaluation of not just -the LLM's coding ability, but also its capacity to *edit existing code* -and *format those code edits* so that aider can save the -edits to the local source files. - -See [this writeup for a longer discussion about the benchmark](https://aider.chat/2024/12/21/polyglot.html). - -The benchmark is intended to be run *inside a docker container*. -This is because the benchmarking harness will be -taking code written by an LLM -and executing it without any human review or supervision! -The LLM could generate dangerous python that harms your system, like this: `import os; os.system("sudo rm -rf /")`. +The benchmark was based on the [Exercism](https://github.com/exercism/python) +coding exercises. 
This benchmark evaluates how effectively aider and LLMs can +translate a natural language coding request into executable code saved into +files that pass unit tests. It provides an end-to-end evaluation of not just the +LLM's coding ability, but also its capacity to _edit existing code_ and _format +those code edits_ so that aider can save the edits to the local source files. + +See +[this writeup for a longer discussion about the benchmark](https://aider.chat/2024/12/21/polyglot.html). + +The benchmark is intended to be run _inside a docker container_. This is because +the benchmarking harness will be taking code written by an LLM and executing it +without any human review or supervision! The LLM could generate dangerous python +that harms your system, like this: `import os; os.system("sudo rm -rf /")`. Running inside a docker container helps limit the damage that could be done. ## Usage @@ -74,23 +71,38 @@ pip install -e .[dev] ./benchmark/benchmark.py a-helpful-name-for-this-run --model gpt-3.5-turbo --edit-format whole --threads 10 --exercises-dir polyglot-benchmark ``` -The above will create a folder `tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run` with benchmarking results. -Run like this, the script will run all the exercises in a random order. - -You can run `./benchmark/benchmark.py --help` for a list of all the arguments, but here are the most useful to keep in mind: - -- `--model` is the name of the model, same as you would pass directly to `aider`. -- `--edit-format` is the name of the edit format, same as you would pass directly to `aider`. When working with an experimental LLM, I recommend starting with `whole` -- `--threads` specifies how many exercises to benchmark in parallel. Start with a single thread if you are working out the kinks on your benchmarking setup or working with a new model, etc. Once you are getting reliable results, you can speed up the process by running with more threads. 10 works well against the OpenAI APIs. 
-- `--num-tests` specifies how many of the tests to run before stopping. This is another way to start gently as you debug your benchmarking setup. -- `--keywords` filters the tests to run to only the ones whose name match the supplied argument (similar to `pytest -k xxxx`). -- `--read-model-settings=` specify model settings, see here: https://aider.chat/docs/config/adv-model-settings.html#model-settings -- `--map-tokens` sets a token budget for the repo map sent with each request. Set `0` to disable the repo map. This lets you enable repo map usage for any model (e.g., `--map-tokens 1024`). +The above will create a folder +`tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run` with +benchmarking results. Run like this, the script will run all the exercises in a +random order. + +You can run `./benchmark/benchmark.py --help` for a list of all the arguments, +but here are the most useful to keep in mind: + +- `--model` is the name of the model, same as you would pass directly to + `aider`. +- `--edit-format` is the name of the edit format, same as you would pass + directly to `aider`. When working with an experimental LLM, I recommend + starting with `whole` +- `--threads` specifies how many exercises to benchmark in parallel. Start with + a single thread if you are working out the kinks on your benchmarking setup or + working with a new model, etc. Once you are getting reliable results, you can + speed up the process by running with more threads. 10 works well against the + OpenAI APIs. +- `--num-tests` specifies how many of the tests to run before stopping. This is + another way to start gently as you debug your benchmarking setup. +- `--keywords` filters the tests to run to only the ones whose name match the + supplied argument (similar to `pytest -k xxxx`). 
+- `--read-model-settings=` specify model settings, see here: + https://aider.chat/docs/config/adv-model-settings.html#model-settings +- `--map-tokens` sets a token budget for the repo map sent with each request. + Set `0` to disable the repo map. This lets you enable repo map usage for any + model (e.g., `--map-tokens 1024`). ### Benchmark report -You can generate stats about any benchmark, including ones which are still running. -You don't need to run this inside the docker container, as it is just +You can generate stats about any benchmark, including ones which are still +running. You don't need to run this inside the docker container, as it is just collecting stats not executing unsafe python. ``` @@ -125,37 +137,43 @@ The benchmark report is a yaml record with statistics about the run: total_cost: 3.6346 ``` -The key statistics are the `pass_rate_#` entries, which report the -percent of the tasks which had all tests passing. -There will be multiple of these pass rate stats, -depending on the value of the `--tries` parameter. +The key statistics are the `pass_rate_#` entries, which report the percent of +the tasks which had all tests passing. There will be multiple of these pass rate +stats, depending on the value of the `--tries` parameter. -The yaml also includes all the settings which were in effect for the benchmark run. -It also reports the git hash of the repo at the time that the benchmark was -run, with `(dirty)` if there were uncommitted changes. -It's good practice to commit the repo before starting a benchmark run. -This way the `model`, `edit_format` and `commit_hash` -should be enough to reliably reproduce any benchmark run. +The yaml also includes all the settings which were in effect for the benchmark +run. It also reports the git hash of the repo at the time that the benchmark was +run, with `(dirty)` if there were uncommitted changes. It's good practice to +commit the repo before starting a benchmark run. 
This way the `model`, +`edit_format` and `commit_hash` should be enough to reliably reproduce any +benchmark run. You can see examples of the benchmark report yaml in the [aider leaderboard data files](https://github.com/$ORG/aider/blob/main/aider/website/_data/). - ## Limitations, notes -- Contributions of benchmark results are welcome! Submit results by opening a PR with edits to the -[aider leaderboard data files](https://github.com/$ORG/aider/blob/main/aider/website/_data/). +- Contributions of benchmark results are welcome! Submit results by opening a PR + with edits to the + [aider leaderboard data files](https://github.com/$ORG/aider/blob/main/aider/website/_data/). - These scripts are not intended for use by typical aider end users. -- Some of these tools are written as `bash` scripts, so it will be hard to use them on Windows. +- Some of these tools are written as `bash` scripts, so it will be hard to use + them on Windows. ## Enhancements -The `aider-ce` benchmark harness includes several enhancements over the original `aider` benchmark: - -- **YAML Metadata**: Exercises now use `cat.yaml` files for metadata, allowing for richer categorization and filtering. -- **Subset Filtering**: Use the `--sets` option to run specific groups of tests (e.g., `--sets core,strings`). -- **K-fold Evaluation Slicing**: The `--hash-re` option allows for deterministic slicing of the exercise set based on the exercise hash. This is useful for parallelizing runs or performing k-fold cross-validation. - - `^0`: 1/16 of the set. - - `^[01]`: 1/8 of the set. - - `^[0-3]`: 1/4 of the set. - - `^.{2}[4-7]`: Targets the 3rd character of the hash for more granular slicing. +The `aider-ce` benchmark harness includes several enhancements over the original +`aider` benchmark: + +- **YAML Metadata**: Exercises now use `cat.yaml` files for metadata, allowing + for richer categorization and filtering. 
+- **Subset Filtering**: Use the `--sets` option to run specific groups of tests + (e.g., `--sets core,strings`). +- **K-fold Evaluation Slicing**: The `--hash-re` option allows for deterministic + slicing of the exercise set based on the exercise hash. This is useful for + parallelizing runs or performing k-fold cross-validation. + - `^0`: 1/16 of the set. + - `^[01]`: 1/8 of the set. + - `^[0-3]`: 1/4 of the set. + - `^.{2}[4-7]`: Targets the 3rd character of the hash for more granular + slicing. From c9b13bbe2c8a63ddda01ba074bd02743196ea970 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:31:36 +1000 Subject: [PATCH 30/48] fix: Add debug logging for metadata parsing --- benchmark/benchmark.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 81b0162f957..bb02233c8e1 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -282,7 +282,8 @@ def get_exercise_dirs( try: with open(cat_file, "r") as f: metadata = yaml.safe_load(f) - logger.info(f"found {metadata['name']} ({metadata['language']})") + if verbose > 1: + logger.debug(f"found {metadata['name']} ({metadata['language']})") except Exception as e: logger.warning(f"Failed to parse {cat_file}: {e}") continue From ea4df0fe229bd041571b02fb52cb88bcb8f1292c Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:31:37 +1000 Subject: [PATCH 31/48] refactor: Support new dir structure and dry run Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 41 +++++++++++++++++++---------------------- 1 file changed, 19 insertions(+), 22 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index bb02233c8e1..9cc48910746 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -331,16 +331,13 @@ def get_exercise_dirs( if not results_dir.exists() and not dry: logger.info(f"Copying {original_dname} -> {results_dir} ...") - # Only copy the practice 
subdirs with exercises os.makedirs(results_dir, exist_ok=True) - for lang_dir in original_dname.iterdir(): - if not lang_dir.is_dir(): - continue - practice_dir = lang_dir / "exercises" / "practice" - if practice_dir.exists(): - dest_lang_dir = results_dir / lang_dir.name / "exercises" / "practice" - os.makedirs(dest_lang_dir.parent, exist_ok=True) - shutil.copytree(practice_dir, dest_lang_dir) + for exercise_dir in exercise_dirs: + rel_path = exercise_dir.relative_to(original_dname) + dest_dir = results_dir / rel_path + os.makedirs(dest_dir.parent, exist_ok=True) + if not dest_dir.exists(): + shutil.copytree(exercise_dir, dest_dir) logger.info("...done") test_dnames = sorted(str(d.relative_to(original_dname)) for d in exercise_dirs) @@ -400,6 +397,7 @@ def get_exercise_dirs( map_tokens=map_tokens, repomap_in_memory=repomap_in_memory, dry=dry, + results_dir=results_dir, ) if threads > 1: @@ -849,6 +847,7 @@ def run_test_real( read_model_settings=None, repomap_in_memory: bool = False, dry: bool = False, + results_dir=None, ): # Lazy imports: only needed in the actual benchmark execution path import git @@ -859,6 +858,8 @@ def run_test_real( from aider.io import InputOutput if not os.path.isdir(testdir): + if dry: + return logger.error(f"Not a dir: {testdir}") return @@ -917,19 +918,15 @@ def run_test_real( fnames.append(src) # restore the original file, in case we interrupted a prev run # Find the original file in the language-specific practice dir - if not dry: - lang_part = str(testdir).split("/exercises/practice/")[0] - original_fname = ( - original_dname - / Path(lang_part).name - / "exercises" - / "practice" - / testdir.name - / file_path - ) - if original_fname.exists(): - os.makedirs(src.parent, exist_ok=True) - shutil.copy(original_fname, src) + if not dry and results_dir: + try: + rel_path = testdir.relative_to(results_dir) + original_fname = original_dname / rel_path / file_path + if original_fname.exists(): + os.makedirs(src.parent, exist_ok=True) + 
shutil.copy(original_fname, src) + except ValueError: + pass else: logger.warning(f"Warning: Solution file not found: {src}") From c29a88e814b40b5a3f5e03589679ea1884064c6f Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:31:40 +1000 Subject: [PATCH 32/48] chore: Fix linter warnings in benchmark.py Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 9cc48910746..2da9065b79d 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -283,7 +283,9 @@ def get_exercise_dirs( with open(cat_file, "r") as f: metadata = yaml.safe_load(f) if verbose > 1: - logger.debug(f"found {metadata['name']} ({metadata['language']})") + logger.debug( + f"found {metadata['name']} ({metadata['language']})" + ) except Exception as e: logger.warning(f"Failed to parse {cat_file}: {e}") continue From 2190623a12da699d455d5cd677b3eab6e4f9ef1a Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:50:43 +1000 Subject: [PATCH 33/48] docs: Add debug plan for benchmark script Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/debug_plan.md | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) create mode 100644 benchmark/debug_plan.md diff --git a/benchmark/debug_plan.md b/benchmark/debug_plan.md new file mode 100644 index 00000000000..03841a41ff2 --- /dev/null +++ b/benchmark/debug_plan.md @@ -0,0 +1,32 @@ +# Debug Plan + +The benchmark script is failing because `Coder` has been converted to use `async/await`, but `benchmark.py` is still synchronous. + +## Symptom + +`AttributeError: 'coroutine' object has no attribute 'ignore_mentions'` when accessing properties of the result of `Coder.create()`. + +## Diagnosis + +1. `Coder.create()` is `async def` and returns a coroutine. +2. `benchmark.py` calls it as `coder = Coder.create(...)` without awaiting. +3. 
`coder.run()` is also `async def` and needs to be awaited. +4. `coder.apply_updates()` is also `async def` and needs to be awaited (used in replay mode). + +## Plan + +We need to bridge the synchronous benchmark runner with the async `Coder`. + +1. Modify `benchmark/benchmark.py`. +2. Import `asyncio`. +3. Wrap the coder creation and execution in an async function. +4. Use `asyncio.run()` to execute that function within `run_test_real`. + +The async function needs to handle: +- `coder = await Coder.create(...)` +- `response = await coder.run(...)` +- `await coder.apply_updates()` + +## Files to Edit + +- `benchmark/benchmark.py` From c7f89c9a6916073f286498b6fd9609744adc698c Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:50:44 +1000 Subject: [PATCH 34/48] chore: Update benchmark debug plan for async/await Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/debug_plan.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/benchmark/debug_plan.md b/benchmark/debug_plan.md index 03841a41ff2..b1ceced44a2 100644 --- a/benchmark/debug_plan.md +++ b/benchmark/debug_plan.md @@ -1,17 +1,20 @@ # Debug Plan -The benchmark script is failing because `Coder` has been converted to use `async/await`, but `benchmark.py` is still synchronous. +The benchmark script is failing because `Coder` has been converted to use +`async/await`, but `benchmark.py` is still synchronous. ## Symptom -`AttributeError: 'coroutine' object has no attribute 'ignore_mentions'` when accessing properties of the result of `Coder.create()`. +`AttributeError: 'coroutine' object has no attribute 'ignore_mentions'` when +accessing properties of the result of `Coder.create()`. ## Diagnosis 1. `Coder.create()` is `async def` and returns a coroutine. 2. `benchmark.py` calls it as `coder = Coder.create(...)` without awaiting. 3. `coder.run()` is also `async def` and needs to be awaited. -4. 
`coder.apply_updates()` is also `async def` and needs to be awaited (used in replay mode). +4. `coder.apply_updates()` is also `async def` and needs to be awaited (used in + replay mode). ## Plan @@ -23,6 +26,7 @@ We need to bridge the synchronous benchmark runner with the async `Coder`. 4. Use `asyncio.run()` to execute that function within `run_test_real`. The async function needs to handle: + - `coder = await Coder.create(...)` - `response = await coder.run(...)` - `await coder.apply_updates()` From 8390a203db4338d191cab24819ea33850df7d7aa Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:54:43 +1000 Subject: [PATCH 35/48] feat: Update benchmark script for async Coder Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 2da9065b79d..233a0bd7929 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -1,4 +1,5 @@ #!/usr/bin/env python3 +import asyncio import datetime import importlib_resources import json @@ -817,7 +818,7 @@ def get_replayed_content(replay_dname, test_dname): def run_test(original_dname, testdir, *args, **kwargs): try: - return run_test_real(original_dname, testdir, *args, **kwargs) + return asyncio.run(run_test_real(original_dname, testdir, *args, **kwargs)) except Exception: logger.error("=" * 40) logger.error("Test failed") @@ -828,7 +829,7 @@ def run_test(original_dname, testdir, *args, **kwargs): results_fname.write_text(json.dumps(dict(exception=traceback.format_exc()))) -def run_test_real( +async def run_test_real( original_dname, testdir, model_name, @@ -1025,7 +1026,7 @@ def run_test_real( if map_tokens is not None: coder_kwargs["map_tokens"] = map_tokens - coder = Coder.create(**coder_kwargs) + coder = await Coder.create(**coder_kwargs) dump(coder.ignore_mentions) coder.show_announcements() @@ -1052,9 +1053,9 @@ def run_test_real( show = [">> " + 
line for line in show] io.append_chat_history("".join(show)) - coder.apply_updates() + await coder.apply_updates() else: - response = coder.run(with_message=instructions, preproc=False) + response = await coder.run(with_message=instructions, preproc=False) dur += time.time() - start From e7e56f0a08407e393a94eb7e7588370511f6cffd Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 14:59:28 +1000 Subject: [PATCH 36/48] fix: Handle None results_dir in benchmark script Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 233a0bd7929..f8459d6b8d0 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -216,6 +216,9 @@ def main( results_dir = resolve_dirname(results_dir, cont, make_new) + if not results_dir: + return 1 + if not dry and "AIDER_DOCKER" not in os.environ: logger.warning( "Warning: Benchmarking runs unvetted code. Run in a docker container." 
From 86a2b207c903d8379984069d8ec6332ae12019c6 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 15:03:44 +1000 Subject: [PATCH 37/48] fix: Add helpful error message with paths to resolve_dirname Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index f8459d6b8d0..530a34c0863 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -214,10 +214,13 @@ def main( if repo.is_dirty(): commit_hash += "-dirty" - results_dir = resolve_dirname(results_dir, cont, make_new) + resolved_results_dir = resolve_dirname(results_dir, cont, make_new) - if not results_dir: + if not resolved_results_dir: + logger.error(f"Could not resolve results directory from slug: {results_dir}") + logger.error(f"Checked in {BENCHMARK_DNAME}") return 1 + results_dir = resolved_results_dir if not dry and "AIDER_DOCKER" not in os.environ: logger.warning( From c79c7bf418b387818843c6be56c3b4f8ae0125ec Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 15:10:11 +1000 Subject: [PATCH 38/48] fix: Add debug logging for directory resolution and result loading Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 530a34c0863..2e18f1d54d7 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -64,6 +64,10 @@ def resolve_dirname(results_dir, use_single_prior, make_new): return results_dir priors = list(BENCHMARK_DNAME.glob(f"*--{results_dir}")) + # BUG20251223 + logger.debug(f"Found priors: {priors}") + logger.debug(f"use_single_prior: {use_single_prior}, make_new: {make_new}") + if len(priors) == 1 and use_single_prior: results_dir = priors[0].name logger.info(f"Using pre-existing {results_dir}") @@ -446,11 +450,23 @@ def 
load_results(results_dir, stats_languages=None): glob_patterns = ["*/exercises/practice/*/.aider.results.json"] for pattern in glob_patterns: - for fname in results_dir.glob(pattern): + # BUG20251223 + logger.debug(f"Globbing {results_dir} with {pattern}") + files = list(results_dir.glob(pattern)) + logger.debug(f"Found {len(files)} files") + + for fname in files: try: results = json.loads(fname.read_text()) # json / test / prac / exer / lang - lang = fname.parent.parent.parent.parent.name + # BUG20251223 + logger.debug(f"Processing result file: {fname}") + if len(fname.parts) > 4: + lang = fname.parent.parent.parent.parent.name + else: + lang = "unknown" + logger.debug(f"Derived lang: {lang}") + lang_to_results.setdefault(lang, []).append(results) except json.JSONDecodeError: logger.warning(f"json.JSONDecodeError {fname}") From 28dfc228eead132e29521df44924e741e94f3f83 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 15:18:34 +1000 Subject: [PATCH 39/48] fix: Exit when prior unnamed runs exist Co-authored-by: aider-ce (gemini/gemini-3-pro-preview) --- benchmark/benchmark.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 2e18f1d54d7..53040525aea 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -78,7 +78,7 @@ def resolve_dirname(results_dir, use_single_prior, make_new): ) for prior in priors: logger.warning(prior) - return + sys.exit(1) if not re.match(r"\d\d\d\d-\d\d-\d\d-", str(results_dir)): now = datetime.datetime.now() From fa13a8124a75655aeeb749cee64eb694e4827549 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 15:53:09 +1000 Subject: [PATCH 40/48] refactor: Simplify benchmark result directory structure Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 107 +++++++++++++++++++++++++---------------- 1 file changed, 65 insertions(+), 42 deletions(-) diff --git a/benchmark/benchmark.py 
b/benchmark/benchmark.py index 53040525aea..b76a2125f57 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -346,14 +346,12 @@ def get_exercise_dirs( logger.info(f"Copying {original_dname} -> {results_dir} ...") os.makedirs(results_dir, exist_ok=True) for exercise_dir in exercise_dirs: - rel_path = exercise_dir.relative_to(original_dname) - dest_dir = results_dir / rel_path - os.makedirs(dest_dir.parent, exist_ok=True) + dest_dir = results_dir / exercise_dir.name if not dest_dir.exists(): shutil.copytree(exercise_dir, dest_dir) logger.info("...done") - test_dnames = sorted(str(d.relative_to(original_dname)) for d in exercise_dirs) + test_dnames = sorted(d.name for d in exercise_dirs) resource_metadata = importlib_resources.files("aider.resources").joinpath( "model-metadata.json" @@ -441,36 +439,38 @@ def load_results(results_dir, stats_languages=None): results_dir = Path(results_dir) lang_to_results = {} - if stats_languages: - languages = [lang.strip().lower() for lang in stats_languages.split(",")] - glob_patterns = [ - f"{lang}/exercises/practice/*/.aider.results.json" for lang in languages - ] - else: - glob_patterns = ["*/exercises/practice/*/.aider.results.json"] + # BUG20251223 + logger.debug(f"Globbing {results_dir} for results") + files = list(results_dir.glob("*/.aider.results.json")) + logger.debug(f"Found {len(files)} files") - for pattern in glob_patterns: - # BUG20251223 - logger.debug(f"Globbing {results_dir} with {pattern}") - files = list(results_dir.glob(pattern)) - logger.debug(f"Found {len(files)} files") + for fname in files: + try: + results = json.loads(fname.read_text()) + # BUG20251223 + logger.debug(f"Processing result file: {fname}") + + # Try to get language from cat.yaml if it exists in the same dir + lang = "unknown" + cat_yaml = fname.parent / "cat.yaml" + if cat_yaml.exists(): + try: + with open(cat_yaml, "r") as f: + metadata = yaml.safe_load(f) + lang = metadata.get("language", "unknown") + except Exception: + pass 
- for fname in files: - try: - results = json.loads(fname.read_text()) - # json / test / prac / exer / lang - # BUG20251223 - logger.debug(f"Processing result file: {fname}") - if len(fname.parts) > 4: - lang = fname.parent.parent.parent.parent.name - else: - lang = "unknown" - logger.debug(f"Derived lang: {lang}") + if stats_languages: + languages = [lang.strip().lower() for lang in stats_languages.split(",")] + if lang.lower() not in languages: + continue - lang_to_results.setdefault(lang, []).append(results) - except json.JSONDecodeError: - logger.warning(f"json.JSONDecodeError {fname}") - continue + logger.debug(f"Derived lang: {lang}") + lang_to_results.setdefault(lang, []).append(results) + except json.JSONDecodeError: + logger.warning(f"json.JSONDecodeError {fname}") + continue return lang_to_results @@ -478,7 +478,7 @@ def summarize_results(results_dir, verbose, stats_languages=None): lang_to_results = load_results(results_dir, stats_languages) res = SimpleNamespace() - res.total_tests = len(list(Path(results_dir).glob("*/exercises/practice/*"))) + res.total_tests = len(list(Path(results_dir).glob("*/.aider.results.json"))) try: tries = max( @@ -936,6 +936,25 @@ async def run_test_real( # Remove any ignore files from the solution set that LLM will edit solution_files.difference_update(ignore_files) + # Try to find original relative path from cat.yaml + original_rel_path = None + cat_yaml = testdir / "cat.yaml" + if cat_yaml.exists(): + try: + with open(cat_yaml, "r") as f: + metadata = yaml.safe_load(f) + # We need to find where this exercise was in original_dname. + # Since we don't store the full relative path in cat.yaml, + # we have to search for it or rely on the fact that we know + # it was copied from original_dname. + # A better way is to look for the directory with the same name (hash) + # in original_dname. 
+ matches = list(original_dname.rglob(testdir.name)) + if matches: + original_rel_path = matches[0].relative_to(original_dname) + except Exception: + pass + # Copy all solution files for file_path in solution_files: src = testdir / Path(file_path) @@ -943,15 +962,11 @@ async def run_test_real( fnames.append(src) # restore the original file, in case we interrupted a prev run # Find the original file in the language-specific practice dir - if not dry and results_dir: - try: - rel_path = testdir.relative_to(results_dir) - original_fname = original_dname / rel_path / file_path - if original_fname.exists(): - os.makedirs(src.parent, exist_ok=True) - shutil.copy(original_fname, src) - except ValueError: - pass + if not dry and original_rel_path: + original_fname = original_dname / original_rel_path / file_path + if original_fname.exists(): + os.makedirs(src.parent, exist_ok=True) + shutil.copy(original_fname, src) else: logger.warning(f"Warning: Solution file not found: {src}") @@ -1200,6 +1215,12 @@ async def run_test_real( def run_unit_tests(original_dname, testdir, history_fname, test_files): timeout = 60 * 3 + # Find original relative path + original_rel_path = None + matches = list(original_dname.rglob(testdir.name)) + if matches: + original_rel_path = matches[0].relative_to(original_dname) + # Map of file extensions to test commands TEST_COMMANDS = { ".py": ["pytest"], @@ -1227,7 +1248,9 @@ def run_unit_tests(original_dname, testdir, history_fname, test_files): # Copy test files from original directory for file_path in test_files: - src = original_dname / Path(*testdir.parts[-4:]) / file_path + if not original_rel_path: + break + src = original_dname / original_rel_path / file_path dst = testdir / file_path if src.exists(): logger.info(f"copying {src} {dst}") From b5686000527e0ba6d941cb4d7bf2e5d81504d48c Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 15:53:11 +1000 Subject: [PATCH 41/48] fix: Improve readability of language filtering in 
benchmark Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index b76a2125f57..36621458667 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -462,7 +462,9 @@ def load_results(results_dir, stats_languages=None): pass if stats_languages: - languages = [lang.strip().lower() for lang in stats_languages.split(",")] + languages = [ + lang.strip().lower() for lang in stats_languages.split(",") + ] if lang.lower() not in languages: continue From 54442fefd8e9ccda1246c934807f710f74ebb457 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 16:03:23 +1000 Subject: [PATCH 42/48] change to gem --- benchmark/benchmark.py | 2 +- benchmark/debug_plan.md | 36 ------------------------------------ 2 files changed, 1 insertion(+), 37 deletions(-) delete mode 100644 benchmark/debug_plan.md diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 36621458667..cb1a093e978 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -96,7 +96,7 @@ def main( results_dir: Optional[str] = typer.Argument( "unnamed", help="Results directory slug" ), - model: str = typer.Option("gpt-3.5-turbo", "--model", "-m", help="Model name"), + model: str = typer.Option("gemini/gemini-3-flash-preview", "--model", "-m", help="Model name"), sleep: float = typer.Option( 0, "--sleep", help="Sleep seconds between tests when single threaded" ), diff --git a/benchmark/debug_plan.md b/benchmark/debug_plan.md deleted file mode 100644 index b1ceced44a2..00000000000 --- a/benchmark/debug_plan.md +++ /dev/null @@ -1,36 +0,0 @@ -# Debug Plan - -The benchmark script is failing because `Coder` has been converted to use -`async/await`, but `benchmark.py` is still synchronous. - -## Symptom - -`AttributeError: 'coroutine' object has no attribute 'ignore_mentions'` when -accessing properties of the result of `Coder.create()`. 
- -## Diagnosis - -1. `Coder.create()` is `async def` and returns a coroutine. -2. `benchmark.py` calls it as `coder = Coder.create(...)` without awaiting. -3. `coder.run()` is also `async def` and needs to be awaited. -4. `coder.apply_updates()` is also `async def` and needs to be awaited (used in - replay mode). - -## Plan - -We need to bridge the synchronous benchmark runner with the async `Coder`. - -1. Modify `benchmark/benchmark.py`. -2. Import `asyncio`. -3. Wrap the coder creation and execution in an async function. -4. Use `asyncio.run()` to execute that function within `run_test_real`. - -The async function needs to handle: - -- `coder = await Coder.create(...)` -- `response = await coder.run(...)` -- `await coder.apply_updates()` - -## Files to Edit - -- `benchmark/benchmark.py` From 8150fefdadf47852312097600c5bae70043f9580 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 16:37:44 +1000 Subject: [PATCH 43/48] refactor: Copy new exercises when continuing benchmark runs Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index cb1a093e978..975f5c95c7f 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -342,14 +342,21 @@ def get_exercise_dirs( results_dir.rename(dest) - if not results_dir.exists() and not dry: - logger.info(f"Copying {original_dname} -> {results_dir} ...") - os.makedirs(results_dir, exist_ok=True) + if not dry: + if not results_dir.exists(): + logger.info(f"Copying {original_dname} -> {results_dir} ...") + os.makedirs(results_dir, exist_ok=True) + + copied = False for exercise_dir in exercise_dirs: dest_dir = results_dir / exercise_dir.name if not dest_dir.exists(): + if not copied: + logger.info(f"Adding missing exercises to {results_dir} ...") shutil.copytree(exercise_dir, dest_dir) - logger.info("...done") + copied = True + if copied: + 
logger.info("...done") test_dnames = sorted(d.name for d in exercise_dirs) From 29e3c5d1bf106d7c2f77d7077f54e28b60064e92 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 16:37:46 +1000 Subject: [PATCH 44/48] fix: Format benchmark.py to conform to linter Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/benchmark.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index 975f5c95c7f..660aa50d57c 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -96,7 +96,9 @@ def main( results_dir: Optional[str] = typer.Argument( "unnamed", help="Results directory slug" ), - model: str = typer.Option("gemini/gemini-3-flash-preview", "--model", "-m", help="Model name"), + model: str = typer.Option( + "gemini/gemini-3-flash-preview", "--model", "-m", help="Model name" + ), sleep: float = typer.Option( 0, "--sleep", help="Sleep seconds between tests when single threaded" ), From 9c2359efa69e141397fc2d9664692ba5dbcbc5cd Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 18:29:15 +1000 Subject: [PATCH 45/48] refactor: Update benchmark README with Cecli Cats details Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/README.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/benchmark/README.md b/benchmark/README.md index e15ebb3c91a..21f3ef7f48d 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -160,13 +160,19 @@ You can see examples of the benchmark report yaml in the - Some of these tools are written as `bash` scripts, so it will be hard to use them on Windows. +## What's new with Cecli Cats? + +The benchmark has evolved into a collection of **Cecli Atomic Tests (Cats)**. + +- **YAML Metadata**: Every Cat has its own `cat.yaml` file containing metadata, including a unique UUID that may or may not be useful later. 
+- **Evolving Collection**: The directory structure of the Cats is laid out to facilitate the growth and evolution of the collection. As the benchmark matures, Cats will come and go. +- **Simplified Runner**: The test runner is being simplified to focus on its core job: executing tests and recording results. Downstream aggregation and analysis of results will be shifted to other tools and projects. + ## Enhancements The `aider-ce` benchmark harness includes several enhancements over the original `aider` benchmark: -- **YAML Metadata**: Exercises now use `cat.yaml` files for metadata, allowing - for richer categorization and filtering. - **Subset Filtering**: Use the `--sets` option to run specific groups of tests (e.g., `--sets core,strings`). - **K-fold Evaluation Slicing**: The `--hash-re` option allows for deterministic From 663cba4eff02fa14cd4367dce31badb647bf6472 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 18:29:18 +1000 Subject: [PATCH 46/48] fix: Update benchmark README with minor formatting changes Co-authored-by: aider-ce (gemini/gemini-3-flash-preview) --- benchmark/README.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/benchmark/README.md b/benchmark/README.md index 21f3ef7f48d..eeb5ca7c05f 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -164,9 +164,14 @@ You can see examples of the benchmark report yaml in the The benchmark has evolved into a collection of **Cecli Atomic Tests (Cats)**. -- **YAML Metadata**: Every Cat has its own `cat.yaml` file containing metadata, including a unique UUID that may or may not be useful later. -- **Evolving Collection**: The directory structure of the Cats is laid out to facilitate the growth and evolution of the collection. As the benchmark matures, Cats will come and go. -- **Simplified Runner**: The test runner is being simplified to focus on its core job: executing tests and recording results. 
Downstream aggregation and analysis of results will be shifted to other tools and projects. +- **YAML Metadata**: Every Cat has its own `cat.yaml` file containing metadata, + including a unique UUID that may or may not be useful later. +- **Evolving Collection**: The directory structure of the Cats is laid out to + facilitate the growth and evolution of the collection. As the benchmark + matures, Cats will come and go. +- **Simplified Runner**: The test runner is being simplified to focus on its + core job: executing tests and recording results. Downstream aggregation and + analysis of results will be shifted to other tools and projects. ## Enhancements From 65bdb1d1902b9f27fc95f1d8148982391df51a87 Mon Sep 17 00:00:00 2001 From: Erich Schulz Date: Tue, 23 Dec 2025 18:50:58 +1000 Subject: [PATCH 47/48] yada --- benchmark/README.md | 67 +++++++++++++++++---------------------------- 1 file changed, 25 insertions(+), 42 deletions(-) diff --git a/benchmark/README.md b/benchmark/README.md index eeb5ca7c05f..a07ff418a32 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -5,6 +5,9 @@ measure how well it works with various LLMs. This directory holds the harness and tools needed to run the benchmarking suite. +If you're familiar with the `aider` benchmarking, see the "What's new..." +section below. + ## Background The benchmark was based on the [Exercism](https://github.com/exercism/python) @@ -110,32 +113,7 @@ collecting stats not executing unsafe python. 
 ./benchmark/benchmark.py --stats tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run
 ```
 
-The benchmark report is a yaml record with statistics about the run:
-
-```yaml
-- dirname: 2024-07-04-14-32-08--claude-3.5-sonnet-diff-continue
-  test_cases: 225
-  model: claude-3.5-sonnet
-  edit_format: diff
-  commit_hash: 35f21b5
-  pass_rate_1: 57.1
-  pass_rate_2: 77.4
-  percent_cases_well_formed: 99.2
-  error_outputs: 23
-  num_malformed_responses: 4
-  num_with_malformed_responses: 1
-  user_asks: 2
-  lazy_comments: 0
-  syntax_errors: 1
-  indentation_errors: 0
-  exhausted_context_windows: 0
-  test_timeouts: 1
-  command: aider --sonnet
-  date: 2024-07-04
-  versions: 0.42.1-dev
-  seconds_per_case: 17.6
-  total_cost: 3.6346
-```
+The benchmark report is a yaml record with statistics about the run.
 
 The key statistics are the `pass_rate_#` entries, which report the percent of
 the tasks which had all tests passing. There will be multiple of these pass rate
@@ -148,17 +126,29 @@ commit the repo before starting a benchmark run.
 This way the `model`, `edit_format` and `commit_hash`
 should be enough to reliably reproduce any benchmark run.
 
-You can see examples of the benchmark report yaml in the
-[aider leaderboard data files](https://github.com/$ORG/aider/blob/main/aider/website/_data/).
+## Contributing
+
+Contributions of benchmark results and tests are welcome! Submit results by opening a PR.
+
+Note the roadmap priorities:
 
-## Limitations, notes
+1. Complete 'set up records' to support smart caching.
+2. Atomic data collection. Most of the data is saved, but protocols for sharing are still needed.
+3. **Dimensional Parameter Walking** allowing for n-dimensional parameter tuning,
+   facilitating a "gradient descent" approach to optimisation across multiple parameters.
+   The test runner should accept n lists of options, e.g., ["thinking: 100", "thinking: 200", "thinking: 400"], ["optionA: B", "optionD: C"].
+4. 
Smart Caching so the runner can optionally skip any tests for which "similar" result data
+   is already available based on fuzzy metadata matching. This aids iterative testing: when
+   adding a new option to a list of permutations, only the new permutations need to
+   be run. Also, when new Cats join the collection it is easy to incrementally collect the data.
+5. Data aggregation and analysis. These will be separate specialised tools.
 
-- Contributions of benchmark results are welcome! Submit results by opening a PR
-  with edits to the
-  [aider leaderboard data files](https://github.com/$ORG/aider/blob/main/aider/website/_data/).
-- These scripts are not intended for use by typical aider end users.
-- Some of these tools are written as `bash` scripts, so it will be hard to use
+## Limitations
+
+- These scripts are not intended for use by typical `cecli` end users.
+- Some of the old (possibly deprecated) tools are written as `bash` scripts, so it will be hard to use
  them on Windows.
+- Currently the JS and C++ tests appear broken.
 
 ## What's new with Cecli Cats?
 
@@ -172,14 +162,7 @@ The benchmark has evolved into a collection of **Cecli Atomic Tests (Cats)**.
 - **Simplified Runner**: The test runner is being simplified to focus on its
   core job: executing tests and recording results. Downstream aggregation and
   analysis of results will be shifted to other tools and projects.
-
-## Enhancements
-
-The `aider-ce` benchmark harness includes several enhancements over the original
-`aider` benchmark:
-
-- **Subset Filtering**: Use the `--sets` option to run specific groups of tests
-  (e.g., `--sets core,strings`).
+- **Subset Filtering**: Use the `--sets` option to run specific groups of tests. (Hopefully, the sets will grow with time.)
 - **K-fold Evaluation Slicing**: The `--hash-re` option allows for deterministic
   slicing of the exercise set based on the exercise hash. This is useful for
   parallelizing runs or performing k-fold cross-validation.
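The "Dimensional Parameter Walking" roadmap item above amounts to running the benchmark over the cross-product of several option lists. A minimal sketch of how such permutations could be enumerated; the option names (`thinking`, `edit_format`) are illustrative stand-ins, not the runner's actual flags:

```python
import itertools


def walk_dimensions(*option_lists):
    """Yield one merged option dict per permutation of the given dimensions."""
    for combo in itertools.product(*option_lists):
        merged = {}
        for options in combo:
            merged.update(options)
        yield merged


# Two illustrative dimensions: three thinking budgets x two edit formats.
thinking = [{"thinking": n} for n in (100, 200, 400)]
formats = [{"edit_format": f} for f in ("whole", "diff")]

runs = list(walk_dimensions(thinking, formats))  # 3 x 2 = 6 permutations
```

Smart caching would then skip any permutation whose results are already on disk, so adding a fourth thinking budget only costs the new runs.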
From fde846911d947e7058846844de4a9acd306c Mon Sep 17 00:00:00 2001
From: Erich Schulz
Date: Tue, 23 Dec 2025 22:07:52 +1000
Subject: [PATCH 48/48] yada

---
 benchmark/README.md | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/benchmark/README.md b/benchmark/README.md
index a07ff418a32..c35bcd61a95 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -63,6 +63,9 @@ Launch the docker container and run the benchmark inside it:
 
 ```
 # Launch the docker container
+# You probably want to tweak this script to import your service keys.
+# It's currently configured to import GEMINI_API_KEY only.
+# PRs welcome to more effectively grab the keys without causing anxiety.
 ./benchmark/docker.sh
 
 # Inside the container, install aider as a development build.
@@ -87,6 +90,16 @@ but here are the most useful to keep in mind:
 
 - `--edit-format` is the name of the edit format, same as you would pass directly
   to `aider`. When working with an experimental LLM, I recommend starting with
   `whole`
+- `--sets` runs specific groups of tests using the `sets` in the `cat.yaml`.
+  (Hopefully, the sets will grow with time, but currently it just bookmarks
+  the classic "polyglot" test battery.)
+- `--hash-re` allows for deterministic slicing of the exercise set based on the
+  exercise hash. This is useful for quickly grabbing a consistent subset or for
+  k-fold cross-validation. For example:
+  - `^0`: 1/16 of the set.
+  - `^[01]`: 1/8 of the set.
+  - `^[0-3]`: 1/4 of the set.
+  - `^.{2}[4-7]`: 1/4 of the set, using the 3rd character of the hash.
 - `--threads` specifies how many exercises to benchmark in parallel. Start with
   a single thread if you are working out the kinks on your benchmarking setup or
   working with a new model, etc. Once you are getting reliable results, you can
@@ -162,12 +175,6 @@ The benchmark has evolved into a collection of **Cecli Atomic Tests (Cats)**.
 - **Simplified Runner**: The test runner is being simplified to focus on its
   core job: executing tests and recording results. Downstream aggregation and
   analysis of results will be shifted to other tools and projects.
-- **Subset Filtering**: Use the `--sets` option to run specific groups of tests. (Hopefully, the sets will grow with time.)
+- **Subset Filtering**: see `--sets` above.
 - **K-fold Evaluation Slicing**: The `--hash-re` option allows for deterministic
-  slicing of the exercise set based on the exercise hash. This is useful for
-  parallelizing runs or performing k-fold cross-validation.
-  - `^0`: 1/16 of the set.
-  - `^[01]`: 1/8 of the set.
-  - `^[0-3]`: 1/4 of the set.
-  - `^.{2}[4-7]`: Targets the 3rd character of the hash for more granular
-    slicing.
+  slicing of the exercises (now `cats`) based on the exercise hash.
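The `--hash-re` fractions quoted above (`^0` keeps 1/16 of the set, `^[0-3]` keeps 1/4, and so on) fall out of matching the regex against the leading hex digits of each cat's hash. A hedged sketch of the idea, using SHA-1 as a stand-in since the harness's actual hash function is not shown here:

```python
import hashlib
import re


def slice_cats(names, hash_re):
    """Keep only the cats whose hex hash matches the slicing regex."""
    pattern = re.compile(hash_re)
    return [name for name in names
            if pattern.match(hashlib.sha1(name.encode()).hexdigest())]


names = [f"cat-{i}" for i in range(4000)]
sixteenth = slice_cats(names, r"^0")    # first hex digit is 0: ~1/16 of the set
quarter = slice_cats(names, r"^[0-3]")  # first digit 0-3: ~1/4 of the set
```

Because the hash is stable, the same regex always selects the same subset, which is what makes these slices usable for parallel runs and k-fold cross-validation.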