
Compiler statistics (--benchmark)#67

Open
Fiwo735 wants to merge 7 commits into main from code_stats

Conversation


@Fiwo735 Fiwo735 commented Mar 24, 2026

Addresses #52.

Simple compiler statistics measured over new test cases in tests/benchmark/, hidden behind the --benchmark flag. The terminal output is:

Passed 86/86 found test cases
Benchmark results:
matmul_sum: compilation time = 0.1 s, execution time = 31.3 us, binary size = 210 B

Current order of operation:

  1. Run tests as before, but skip tests in tests/benchmark/
  2. If benchmarking, run tests only in tests/benchmark/
  3. Measure statistics

The measured statistics are:

  1. Compilation time: /usr/bin/time is prepended to student_compiler(...), so it's always possible to check the run time by reading .c_compiler.stderr.log. Current limitation: /usr/bin/time is not very accurate; it only reports seconds to 2 decimal places (i.e., a precision of 0.01 s). A better solution would be to run student_compiler(...) in a loop and use time.perf_counter(), as any Python overhead would be completely insignificant over e.g. 10,000 repetitions.
  2. Execution time: /usr/bin/time is prepended to the spike invocation, so it's always possible to check the run time by reading .simulation.stderr.log. The benchmark program driver is designed to include a loop with 100,000 repetitions, so this method is safe. I've added logic to the benchmark driver that hopefully prevents GCC from optimising out the loop body.
  3. Binary size: sum of the .text + .data + .rodata sections of the ELF file, read with riscv32-unknown-elf-size -A <elf_file.o>

The reasons for splitting the test run in two:

  • benchmark programs will be difficult, so we don't include them in the assessed tests
  • normally, we'd make --benchmark and --validate_test mutually exclusive, or skip gathering stats when validating tests. Currently it's actually possible to combine these 2 flags - that's deliberate, so you can test the new logic without an implemented compiler. That's also the reason for the temporary try... except... block around reading the c_compiler compilation time, as with --validate_test that file isn't created.
  • to address the compilation time limitation explained above, we could pass a different compiler that repeats the compilation X times to get an accurate measurement - I can implement that once you're happy with the current approach.

Note: /usr/bin/time would need to be added to the environment; see the updated Dockerfile.
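For point 3, the sections reported by riscv32-unknown-elf-size -A (SysV format, decimal sizes by default) could be summed with a small helper; the function name and the sample layout in the test below are illustrative:

```python
def benchmarked_size(size_output, sections=(".text", ".data", ".rodata")):
    """Sum selected section sizes from `size -A` (SysV-style) output.

    Data lines look like "<section> <size> <addr>"; the header, the
    Total line and blank lines are skipped because their first token
    doesn't match a selected section name.
    """
    total = 0
    for line in size_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in sections:
            total += int(parts[1])
    return total
```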

With hopefully slightly more emphasis on extensions in the coming years, easy statistics tracking could motivate students to think about introducing optimisations into their design.

@Fiwo735 Fiwo735 requested a review from dwRchyngqxs March 24, 2026 13:29
@Fiwo735 Fiwo735 self-assigned this Mar 24, 2026
@dwRchyngqxs

I will review after #65 is merged.


@dwRchyngqxs dwRchyngqxs left a comment


I don't think this is measuring or reporting the right data.


Fiwo735 commented Mar 25, 2026

@dwRchyngqxs I see the point of going from raw averages to average differences/ratios compared to GCC. The advantage is a more meaningful statistic, but I worry that comparing against GCC might be a bit "scary" to students. I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation, so they'd see a very marginal change. I guess a solution would be showing stats with respect to different GCC optimisation levels?

My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogeneous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller". We could also measure the statistics across some selected (more complex) test cases, e.g. tests/programs/*, to avoid polluting results (both raw averages and GCC-relative averages) with very simple test cases that check only one tiny feature at a time. It seems to me that might be the best approach; we could even add/move 1 or 2 more complex test cases for that reason.

As for what we actually measure - could you summarise your suggestions? I'd like some orthogonal stats so that students can observe trade-offs like "the compiler takes X% longer to compile, but the code is Y% faster". Ideally the measurement would be simple to implement, so that we don't bloat the code base with a feature that's in early development.


dwRchyngqxs commented Mar 25, 2026

I worry that comparing against GCC might be a bit "scary" to students

If you're really worried about that we can use absolute perf and store the best student perf each time we measure it so that they try to improve personal best.

I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation - so they'd see a very marginal change.

Marginal change relative to gcc is marginal absolute change:
change = abs(new_perf - old_perf)
relative_perf = perf / gcc_perf
relative_change = abs(new_perf / gcc_perf - old_perf / gcc_perf) = abs((new_perf - old_perf) / gcc_perf) = change / gcc_perf
So if we expect marginal change in any case what even is the point of measuring and showing perf?
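That identity is easy to sanity-check numerically (the perf numbers below are made up):

```python
old_perf, new_perf, gcc_perf = 1.25, 1.30, 2.0

change = abs(new_perf - old_perf)
relative_change = abs(new_perf / gcc_perf - old_perf / gcc_perf)

# the relative change is just the absolute change scaled by the GCC baseline
assert abs(relative_change - change / gcc_perf) < 1e-12
```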

I guess a solution would be showing stats wrt to different GCC optimisation levels?

I don't think reference assembly is optimised. I was thinking about perf relative to reference assembly.

My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogenous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller".

I see the point of benchmarking, but your code isn't achieving it. You can get 50% smaller code by no longer passing more complicated tests; that's what I mean by heterogeneous, which I think wasn't clear, in retrospect.

We could also measure the statistics across some selected (more complex) test cases, e.g. tests/programs/* to avoid polluting results (both raw averages and GCC relative averages) with very simple test cases.

It is a good solution to the point I raised. It is also a good solution IMO because students shouldn't even look at perf before being able to pass 80% of the tests.

we could even add/move 1 or 2 more complex test cases for that reason.

We could even add 1 or 2 unmarked tests. Then have 10 runs for each test and get real estimates of perf data.
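The "10 runs for each test" idea maps naturally onto the standard library; a sketch (the summarise_runs name and the dict shape are made up, and the hook into test.py is left open):

```python
import statistics

def summarise_runs(times):
    """Aggregate repeated wall-clock measurements of one benchmark."""
    return {
        "mean": statistics.mean(times),
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
        "min": min(times),  # least noisy estimate on a quiet machine
    }
```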

As for what we actually measure - could you summarise what are your suggestions?

  1. Binary size: the sum of the sizes of the .text, .data, and .rodata sections from the object file generated from the assembly produced by the compiler; this way only the assembling pass of gcc interferes with the measurement.
  2. Compile time: wall clock time of running build/c_compiler; a student using parallelism to get better perf is valid, even though test.py -m interferes at the moment.
  3. Run time: ideally the executed instruction count (spike should provide it); if not, the wall clock time of spike pk.
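For the wall-clock fallback, the time already written to the .stderr logs can be parsed with a few lines. This sketch assumes /usr/bin/time is invoked with -p (the POSIX output format, i.e. a "real 0.05" line); the default GNU format is messier to parse:

```python
def wall_clock_seconds(stderr_log):
    """Extract the 'real' time from `/usr/bin/time -p` output.

    Expects a line like "real 0.05"; returns None if absent.
    """
    for line in stderr_log.splitlines():
        if line.startswith("real"):
            return float(line.split()[1])
    return None
```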


Fiwo735 commented Mar 26, 2026

Thanks for your thoughts; they helped me reach a significantly more reasonable V2. Previous changes have been overwritten, so you can check out the overall diff under "Files changed" at the top. Don't mind the actual code style, as the code will be polished once the methods are agreed upon and the changes made in #68 are integrated.


@Fiwo735 Fiwo735 changed the title Compiler statistics Compiler statistics (--benchmark) Mar 27, 2026
@Fiwo735 Fiwo735 mentioned this pull request Apr 1, 2026