
Feature: print benchmark stats broken down by language#27

Merged
dwash96 merged 1 commit into cecli-dev:v0.87.8 from itsmeknt:feat/benchmark_stats_by_language
Sep 21, 2025

Conversation

@itsmeknt

When running the Aider benchmark, it is sometimes useful to analyze the model's performance by programming language. Some users may want to choose a model that does better specifically in Go, even if its overall benchmark score is lower.

I added some self-contained code to benchmark.py so that when you call benchmark.py --stats along with --verbose, it prints the benchmark stats broken down by language at the bottom of the report. Without --verbose, the behavior is unchanged.

Here is an example:

./benchmark/benchmark.py --stats --verbose reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium/

──────────────────────────────────────────── reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium ────────────────────────────────────────────
- dirname: 2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium
  test_cases: 225
  model: openai/openai/gpt-oss-20b
  edit_format: whole
  commit_hash: 32faf82-dirty
  reasoning_effort: medium
  pass_rate_1: 9.8
  pass_rate_2: 36.0
  pass_num_1: 22
  pass_num_2: 81
  percent_cases_well_formed: 100.0
  error_outputs: 27
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 154
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2162608
  completion_tokens: 1224921
  test_timeouts: 4
  total_tests: 225
  command: aider --model openai/openai/gpt-oss-20b
  date: 2025-09-12
  versions: 0.86.2.dev
  seconds_per_case: 801.2
  total_cost: 0.0000

costs: $0.0000/test-case, $0.00 total, $0.00 projected

======== Stats by language ========

| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
|                              |   python  |     go    |    rust   |    cpp    | javascript |    java   |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
| completed_tests              |        34 |        39 |        30 |        26 |         49 |        47 |
| duration                     | 24,957.62 | 21,706.71 | 17,028.67 | 51,506.41 |  29,789.68 | 35,275.56 |
| avg_duration_per_test        |    734.05 |    556.58 |    567.62 |  1,981.02 |     607.95 |    750.54 |
| cost                         |         - |         - |         - |         - |          - |         - |
| pass_rate_0                  |      5.88 |      5.13 |      6.67 |      7.69 |       4.08 |      4.26 |
| pass_rate_1                  |     35.29 |     30.77 |     40.00 |     46.15 |      24.49 |     25.53 |
| pass_num_0                   |         2 |         2 |         2 |         2 |          2 |         2 |
| pass_num_1                   |        12 |        12 |        12 |        12 |         12 |        12 |
| error_outputs                |         7 |         2 |         3 |         - |         14 |         1 |
| user_asks                    |         1 |         1 |         - |       139 |          - |        13 |
| test_timeouts                |         - |         - |         1 |         - |          2 |         1 |
| exhausted_context_windows    |         - |         - |         - |         - |          - |         - |
| num_malformed_responses      |         - |         - |         - |         - |          - |         - |
| num_with_malformed_responses |         - |         - |         - |         - |          - |         - |
| syntax_errors                |         - |         - |         - |         - |          - |         - |
| indentation_errors           |         - |         - |         - |         - |          - |         - |
| lazy_comments                |         - |         - |         - |         - |          - |         - |
| prompt_tokens                |   204,931 |   159,565 |   127,949 | 1,078,034 |    247,566 |   344,563 |
| completion_tokens            |   138,725 |   159,982 |   128,591 |   379,616 |    185,134 |   232,873 |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
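For reference, the per-language aggregation behind a table like the one above could be sketched as follows. This is a minimal illustration, not the PR's actual code: the `stats_by_language` helper and the shape of the per-test result dicts (`language`, `duration`, `passed_attempt`) are hypothetical stand-ins for whatever benchmark.py records internally.

```python
from collections import defaultdict


def stats_by_language(results):
    """Group per-test results by language and aggregate simple stats.

    Each result is assumed to be a dict like:
        {"language": "go", "duration": 12.3, "passed_attempt": 0}
    where passed_attempt is the 0-based attempt on which the test first
    passed, or None if it never passed.
    """
    stats = defaultdict(lambda: {
        "completed_tests": 0,
        "duration": 0.0,
        "pass_num_0": 0,  # passed on the first attempt
        "pass_num_1": 0,  # passed within two attempts
    })
    for r in results:
        s = stats[r["language"]]
        s["completed_tests"] += 1
        s["duration"] += r["duration"]
        if r["passed_attempt"] == 0:
            s["pass_num_0"] += 1
            s["pass_num_1"] += 1
        elif r["passed_attempt"] == 1:
            s["pass_num_1"] += 1

    # Derive averages and percentage pass rates per language.
    for s in stats.values():
        n = s["completed_tests"]
        s["avg_duration_per_test"] = s["duration"] / n if n else 0.0
        s["pass_rate_0"] = 100.0 * s["pass_num_0"] / n if n else 0.0
        s["pass_rate_1"] = 100.0 * s["pass_num_1"] / n if n else 0.0
    return dict(stats)
```

The resulting dict maps each language to its row of the table; printing it in the aligned format shown above is then just a formatting pass over the keys.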

@dwash96
Collaborator

dwash96 commented Sep 20, 2025

Very interesting! This makes a lot of sense. I will merge this into a new version tomorrow; thank you for taking the time.

@dwash96 dwash96 changed the base branch from main to v0.87.8 September 21, 2025 14:15
@dwash96 dwash96 merged commit b90d8f6 into cecli-dev:v0.87.8 Sep 21, 2025
7 of 8 checks passed
@dwash96 dwash96 mentioned this pull request Sep 21, 2025