
Compiler statistics (--benchmark)#67

Open
Fiwo735 wants to merge 7 commits into main from code_stats

Conversation


@Fiwo735 Fiwo735 commented Mar 24, 2026

Addresses #52.

Simple compiler statistics measured over new test cases in tests/benchmark/, hidden behind the --benchmark flag. The terminal output is:

Passed 86/86 found test cases
Benchmark results:
matmul_sum: compilation time = 0.1 s, execution time = 31.3 us, binary size = 210 B

Current order of operation:

  1. Run tests as before, but skip tests in tests/benchmark/
  2. If benchmarking, run tests only in tests/benchmark/
  3. Measure statistics

The measured statistics are:

  1. Compilation time: /usr/bin/time is prepended to student_compiler(...), so it's always possible to check the run time by reading .c_compiler.stderr.log. Current limitation: /usr/bin/time is not very accurate; it only reports seconds to 2 decimal places (i.e., a precision of 0.01 s). A better solution would be to run student_compiler(...) in a loop and use time.perf_counter(), as any Python overhead would be completely insignificant over e.g. 10,000 repetitions.
  2. Execution time: /usr/bin/time is prepended to the spike invocation, so it's always possible to check the run time by reading .simulation.stderr.log. The benchmark program driver is designed to include a loop with 100,000 repetitions, so this method is safe. I've added logic to the benchmark driver that hopefully prevents GCC from optimising out the loop body.
  3. Binary size: sum of the .text + .data + .rodata sections of the ELF file, read with riscv32-unknown-elf-size -A <elf_file.o>

The reasons for splitting the test run in two:

  • benchmark programs will be difficult, so we don't include them in the assessed tests
  • normally, we'd make --benchmark and --validate_test mutually exclusive, or skip gathering stats when validating tests. Currently it's actually possible to combine these 2 flags - that's deliberate, so you can test the new logic without an implemented compiler. That's also the reason for the temporary try... except... block around reading the c_compiler compilation time, as with --validate_test that file isn't created.
  • to address the compilation time limitation explained above, we could pass a different compiler that repeats the compilation X times to get an accurate measurement - I can implement that once you're happy with the current approach.

Note: /usr/bin/time would need to be added to the environment; see the updated Dockerfile.
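For point 3, the sections reported by riscv32-unknown-elf-size -A (SysV format, decimal sizes by default) could be summed with a small helper; the function name and the sample layout in the test below are illustrative:

```python
def benchmarked_size(size_output, sections=(".text", ".data", ".rodata")):
    """Sum selected section sizes from `size -A` (SysV-style) output.

    Data lines look like "<section> <size> <addr>"; the header, the
    Total line and blank lines are skipped because their first token
    doesn't match a selected section name.
    """
    total = 0
    for line in size_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in sections:
            total += int(parts[1])
    return total
```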

With hopefully slightly more emphasis on extensions in the coming years, easy statistics tracking could motivate students to think about introducing optimisations into their design.

@Fiwo735 Fiwo735 requested a review from dwRchyngqxs March 24, 2026 13:29
@Fiwo735 Fiwo735 self-assigned this Mar 24, 2026
@dwRchyngqxs

I will review after #65 is merged.


@dwRchyngqxs dwRchyngqxs left a comment


I don't think this is measuring or reporting the right data.


Fiwo735 commented Mar 25, 2026

@dwRchyngqxs I see the point of going from raw averages to average differences/ratios compared to GCC. The advantage is a more meaningful statistic, but I worry that comparing against GCC might be a bit "scary" to students. I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation, so they'd see a very marginal change. I guess a solution would be showing stats with respect to different GCC optimisation levels?

My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogeneous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller". We could also measure the statistics across some selected (more complex) test cases, e.g. tests/programs/*, to avoid polluting results (both raw averages and GCC-relative averages) with very simple test cases that check only one tiny feature at a time. It seems to me that might be the best approach; we could even add/move 1 or 2 more complex test cases for that reason.

As for what we actually measure - could you summarise your suggestions? I'd like some orthogonal stats so that students can observe trade-offs like "the compiler takes X% longer to compile, but the code is Y% faster". Ideally the measurement would be simple to implement, so that we don't bloat the code base with a feature that's in early development.


dwRchyngqxs commented Mar 25, 2026

I worry that comparing against GCC might be a bit "scary" to students

If you're really worried about that we can use absolute perf and store the best student perf each time we measure it so that they try to improve personal best.

I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation - so they'd see a very marginal change.

Marginal change relative to gcc is marginal absolute change:
change = abs(new_perf - old_perf)
relative_perf = perf / gcc_perf
relative_change = abs(new_perf / gcc_perf - old_perf / gcc_perf) = abs((new_perf - old_perf) / gcc_perf) = change / gcc_perf
So if we expect marginal change in any case what even is the point of measuring and showing perf?
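That identity is easy to sanity-check numerically (the perf numbers below are made up):

```python
old_perf, new_perf, gcc_perf = 1.25, 1.30, 2.0

change = abs(new_perf - old_perf)
relative_change = abs(new_perf / gcc_perf - old_perf / gcc_perf)

# the relative change is just the absolute change scaled by the GCC baseline
assert abs(relative_change - change / gcc_perf) < 1e-12
```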

I guess a solution would be showing stats wrt to different GCC optimisation levels?

I don't think reference assembly is optimised. I was thinking about perf relative to reference assembly.

My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogenous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller".

I see the point of benchmarking, but your code isn't achieving it. You can get 50% smaller code by no longer passing more complicated tests; that's what I mean by heterogeneous, which I think wasn't clear, in retrospect.

We could also measure the statistics across some selected (more complex) test cases, e.g. tests/programs/* to avoid polluting results (both raw averages and GCC relative averages) with very simple test cases.

It is a good solution to the point I raised. It is also a good solution IMO because students shouldn't even look at perf before being able to pass 80% of the tests.

we could even add/move 1 or 2 more complex test cases for that reason.

We could even add 1 or 2 unmarked tests. Then have 10 runs for each test and get real estimates of perf data.
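The "10 runs for each test" idea maps naturally onto the standard library; a sketch (the summarise_runs name and the dict shape are made up, and the hook into test.py is left open):

```python
import statistics

def summarise_runs(times):
    """Aggregate repeated wall-clock measurements of one benchmark."""
    return {
        "mean": statistics.mean(times),
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
        "min": min(times),  # least noisy estimate on a quiet machine
    }
```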

As for what we actually measure - could you summarise what are your suggestions?

  1. Binary size: the sum of the sizes of the .text, .data, and .rodata sections from the object file generated from the assembly produced by the compiler; this way only the assembling pass of gcc interferes with the measurement.
  2. Compile time: wall clock time of running build/c_compiler; a student using parallelism to get better perf is valid, even though test.py -m interferes at the moment.
  3. Run time: ideally the executed instruction count (spike should provide it); if not, the wall clock time of spike pk.
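For the wall-clock fallback, the time already written to the .stderr logs can be parsed with a few lines. This sketch assumes /usr/bin/time is invoked with -p (the POSIX output format, i.e. a "real 0.05" line); the default GNU format is messier to parse:

```python
def wall_clock_seconds(stderr_log):
    """Extract the 'real' time from `/usr/bin/time -p` output.

    Expects a line like "real 0.05"; returns None if absent.
    """
    for line in stderr_log.splitlines():
        if line.startswith("real"):
            return float(line.split()[1])
    return None
```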


Fiwo735 commented Mar 26, 2026

Thanks for your thoughts; they helped me reach a significantly more reasonable V2. Previous changes have been overwritten, so you can check out the overall diff under "Files changed" at the top. Don't mind the actual code style, as the code will be polished once the methods are agreed upon and the changes made in #68 are integrated.


@Fiwo735 Fiwo735 changed the title Compiler statistics Compiler statistics (--benchmark) Mar 27, 2026
@Fiwo735 Fiwo735 mentioned this pull request Apr 1, 2026