I will review after #65 is merged.
dwRchyngqxs left a comment:
I don't think this is measuring or reporting the right data.
@dwRchyngqxs I see the point of going from raw averages to average differences/ratios compared to GCC. The advantage is a more meaningful statistic, but I worry that comparing against GCC might be a bit "scary" to students. I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation, so they'd see a very marginal change. I guess a solution would be showing stats with respect to different GCC optimisation levels?

My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogeneous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller". We could also measure the statistics across some selected (more complex) test cases, e.g.

As for what we actually measure - could you summarise your suggestions? I'd like some orthogonal stats so that students can observe trade-offs like "the compiler takes X% longer to compile, but the code is Y% faster". Ideally the measurement would be simple to implement, so that we don't bloat the code base with a feature that's in early development.
If you're really worried about that, we can use absolute perf and store the best student perf each time we measure it, so that they try to improve on their personal best.
Marginal change relative to
I don't think reference assembly is optimised. I was thinking about perf relative to reference assembly.
I see the point of benchmarking, but your code isn't achieving it. You can get 50% smaller code by no longer passing the more complicated tests; that's what I meant by heterogeneous - I think that wasn't clear, in retrospect.
It is a good solution to the point I raised. It is also a good solution IMO because students shouldn't even look at perf before being able to pass 80% of the tests.
We could even add 1 or 2 unmarked tests, then do 10 runs for each test to get real estimates of the perf data.
Thanks for your thoughts, they helped me reach a significantly more reasonable V2. Previous changes have been overwritten, so you can check the overall diff under "Files changed" at the top. Don't mind the actual code style, as the code will be polished once the methods are agreed upon and the changes from #68 are integrated. The current order of operation, the measured statistics, and the notes are detailed in the updated description below.
Addresses #52.
Simple compiler statistics measured over new test cases in `tests/benchmark/`, hidden behind the `--benchmark` flag. The terminal output is:

Current order of operation:

The measured statistics are:

- Compilation time: `/usr/bin/time` is prepended to `student_compiler(...)`, so it's always possible to check the run time by reading `.c_compiler.stderr.log`. Current limitation: `/usr/bin/time` is not very accurate; it only reports seconds to 2 decimal places (i.e., a precision of 0.01 s). A better solution would be to run `student_compiler(...)` in a loop and use `time.perf_counter()`, as any Python overhead would be completely insignificant for e.g. 10,000 repetitions.
- Simulation time: `/usr/bin/time` is prepended to running `spike`, so it's always possible to check the run time by reading `.simulation.stderr.log`. The benchmark program driver is designed to include a loop with 100,000 repetitions, so this method is safe. I've added logic to the benchmark driver which hopefully prevents GCC from optimising out the loop body.
- Code size: the `.text` + `.data` + `.rodata` sections of the ELF file, read with `riscv32-unknown-elf-size -A <elf_file.o>`.
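The `time.perf_counter()` approach mentioned above could be sketched roughly as follows; the `cmd` argument stands in for the student compiler invocation, and the example command line is a hypothetical placeholder, not the repo's actual interface:

```python
import subprocess
import time


def mean_compile_time(cmd, repetitions=1000):
    """Average wall-clock seconds per invocation of `cmd`.

    Sketch only: `cmd` would be the student compiler command line,
    e.g. ["bin/c_compiler", "-S", "test.c", "-o", "test.s"] (hypothetical).
    The Python loop overhead is negligible next to a process spawn.
    """
    start = time.perf_counter()
    for _ in range(repetitions):
        subprocess.run(cmd, check=True, capture_output=True)
    return (time.perf_counter() - start) / repetitions
```

`perf_counter()` has nanosecond-scale resolution, so averaging over many repetitions sidesteps the 0.01 s granularity of `/usr/bin/time`.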
The reason for splitting the tests running into 2: make `--benchmark` and `--validate_test` mutually exclusive, or skip gathering stats when validating tests. Currently, it's actually possible to combine these 2 flags - that's on purpose, so you can test the new logic without an implemented compiler. That's also the reason for the temporary `try... except...` block for reading the `c_compiler` compilation time, as with `--validate_test` such a file isn't created.
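If the two flags are later made mutually exclusive as suggested, `argparse` supports this directly; a minimal sketch (the parser setup here is illustrative, not the repo's actual CLI):

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="test runner (sketch)")
    # argparse rejects any command line that supplies both flags at once
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--benchmark", action="store_true",
                       help="gather compile-time/run-time/code-size stats")
    group.add_argument("--validate_test", action="store_true",
                       help="only validate test cases, no stats")
    return parser
```

Passing both flags then makes `parse_args` print an error ("not allowed with argument ...") and exit, so the temporary `try... except...` workaround would no longer be needed.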
Note: `/usr/bin/time` would need to be added to the environment; see the updated `Dockerfile`.

With hopefully slightly more emphasis on extensions in the coming years, easy statistics tracking could motivate students to think about introducing optimisations into their design.
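Reading the run time back from the `.stderr.log` files could look like the sketch below. It assumes GNU time's default output format (a line like `0.01user 0.00system 0:01.50elapsed 99%CPU ...`); the log-file convention is the one described above, and the function name is my own:

```python
import re

# GNU /usr/bin/time's default format reports elapsed wall-clock time as
# M:SS.ss (or H:MM:SS for long runs); this sketch handles the M:SS.ss case.
_ELAPSED_RE = re.compile(r"(\d+):(\d+\.\d+)elapsed")


def elapsed_seconds(stderr_log_text):
    """Extract elapsed wall-clock seconds from /usr/bin/time output."""
    match = _ELAPSED_RE.search(stderr_log_text)
    if match is None:
        raise ValueError("no elapsed-time field found in log")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)
```

Note that BSD/macOS `time` and bash's built-in `time` print different formats, so this parser is tied to GNU time being the one installed in the Docker image.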