
Feature: print benchmark stats broken down by language#27

Merged
dwash96 merged 1 commit into cecli-dev:v0.87.8 from itsmeknt:feat/benchmark_stats_by_language
Sep 21, 2025

Conversation

@itsmeknt

When running the Aider benchmark, it is sometimes useful to analyze the model's performance by programming language. Some users may want to choose a model that does better specifically in Go, even if its overall benchmark score is lower.

I added some self-contained code to benchmark.py so that when you call benchmark.py --stats along with --verbose, it prints the benchmark stats broken down by language at the bottom of the report. Without --verbose, the behavior is unchanged.

Here is an example:

./benchmark/benchmark.py --stats --verbose reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium/

──────────────────────────────────────────── reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium ────────────────────────────────────────────
- dirname: 2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium
  test_cases: 225
  model: openai/openai/gpt-oss-20b
  edit_format: whole
  commit_hash: 32faf82-dirty
  reasoning_effort: medium
  pass_rate_1: 9.8
  pass_rate_2: 36.0
  pass_num_1: 22
  pass_num_2: 81
  percent_cases_well_formed: 100.0
  error_outputs: 27
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 154
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2162608
  completion_tokens: 1224921
  test_timeouts: 4
  total_tests: 225
  command: aider --model openai/openai/gpt-oss-20b
  date: 2025-09-12
  versions: 0.86.2.dev
  seconds_per_case: 801.2
  total_cost: 0.0000

costs: $0.0000/test-case, $0.00 total, $0.00 projected

======== Stats by language ========

| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
|                              |   python  |     go    |    rust   |    cpp    | javascript |    java   |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
| completed_tests              |        34 |        39 |        30 |        26 |         49 |        47 |
| duration                     | 24,957.62 | 21,706.71 | 17,028.67 | 51,506.41 |  29,789.68 | 35,275.56 |
| avg_duration_per_test        |    734.05 |    556.58 |    567.62 |  1,981.02 |     607.95 |    750.54 |
| cost                         |         - |         - |         - |         - |          - |         - |
| pass_rate_0                  |      5.88 |      5.13 |      6.67 |      7.69 |       4.08 |      4.26 |
| pass_rate_1                  |     35.29 |     30.77 |     40.00 |     46.15 |      24.49 |     25.53 |
| pass_num_0                   |         2 |         2 |         2 |         2 |          2 |         2 |
| pass_num_1                   |        12 |        12 |        12 |        12 |         12 |        12 |
| error_outputs                |         7 |         2 |         3 |         - |         14 |         1 |
| user_asks                    |         1 |         1 |         - |       139 |          - |        13 |
| test_timeouts                |         - |         - |         1 |         - |          2 |         1 |
| exhausted_context_windows    |         - |         - |         - |         - |          - |         - |
| num_malformed_responses      |         - |         - |         - |         - |          - |         - |
| num_with_malformed_responses |         - |         - |         - |         - |          - |         - |
| syntax_errors                |         - |         - |         - |         - |          - |         - |
| indentation_errors           |         - |         - |         - |         - |          - |         - |
| lazy_comments                |         - |         - |         - |         - |          - |         - |
| prompt_tokens                |   204,931 |   159,565 |   127,949 | 1,078,034 |    247,566 |   344,563 |
| completion_tokens            |   138,725 |   159,982 |   128,591 |   379,616 |    185,134 |   232,873 |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
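For reference, the per-language aggregation behind a table like the one above could be sketched as follows. This is a minimal illustration, not the PR's actual code: the `stats_by_language` helper and the shape of the per-test result dicts (`language`, `duration`, `passed_attempt`) are hypothetical stand-ins for whatever benchmark.py records internally.

```python
from collections import defaultdict


def stats_by_language(results):
    """Group per-test results by language and aggregate simple stats.

    Each result is assumed to be a dict like:
        {"language": "go", "duration": 12.3, "passed_attempt": 0}
    where passed_attempt is the 0-based attempt on which the test first
    passed, or None if it never passed.
    """
    stats = defaultdict(lambda: {
        "completed_tests": 0,
        "duration": 0.0,
        "pass_num_0": 0,  # passed on the first attempt
        "pass_num_1": 0,  # passed within two attempts
    })
    for r in results:
        s = stats[r["language"]]
        s["completed_tests"] += 1
        s["duration"] += r["duration"]
        if r["passed_attempt"] == 0:
            s["pass_num_0"] += 1
            s["pass_num_1"] += 1
        elif r["passed_attempt"] == 1:
            s["pass_num_1"] += 1

    # Derive averages and percentage pass rates per language.
    for s in stats.values():
        n = s["completed_tests"]
        s["avg_duration_per_test"] = s["duration"] / n if n else 0.0
        s["pass_rate_0"] = 100.0 * s["pass_num_0"] / n if n else 0.0
        s["pass_rate_1"] = 100.0 * s["pass_num_1"] / n if n else 0.0
    return dict(stats)
```

The resulting dict maps each language to its row of the table; printing it in the aligned format shown above is then just a formatting pass over the keys.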

@dwash96
Collaborator

dwash96 commented Sep 20, 2025

Very interesting! This makes a lot of sense. I will merge this into a new version tomorrow; thank you for taking the time.

@dwash96 dwash96 changed the base branch from main to v0.87.8 September 21, 2025 14:15
@dwash96 dwash96 merged commit b90d8f6 into cecli-dev:v0.87.8 Sep 21, 2025
7 of 8 checks passed
@dwash96 dwash96 mentioned this pull request Sep 21, 2025