Performance benchmarking harness for deterministic ML inference on safety-critical hardware.
Measures latency, throughput, and worst-case execution time (WCET), with cryptographic verification that outputs are bit-identical across platforms.
```
x86_64 (Intel Xeon)                 RISC-V (Tenstorrent)
───────────────────                 ────────────────────
Output hash: 48f4eceb...     ═══    Output hash: 48f4eceb...
Latency (p99): 1,234 µs             Latency (p99): 2,567 µs
Throughput: 8,100 inf/s             Throughput: 3,890 inf/s
WCET bound: 2,450 µs                WCET bound: 5,120 µs

Bit-identical: YES ✓
RISC-V is 2.08x slower (but CORRECT)
```
Different hardware. Different performance. Same outputs.
How do you benchmark ML inference across platforms when you need to prove the results are identical?
Traditional benchmarks fail:
- They measure performance without verifying correctness
- Floating-point drift means "faster" might mean "wrong" (see the sketch below)
- No cryptographic proof that Platform A and B compute the same thing
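The drift problem is not hypothetical. Here is a minimal illustration (ours, not part of the library) of how the same C expression can produce different bits depending on whether the compiler contracts it into a fused multiply-add:

```c
/* Illustration only: floating-point drift from FMA contraction.
 * Whether a*b + c is computed with one rounding (FMA) or two depends
 * on the compiler, its flags (e.g. -ffp-contract), and the hardware. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1e16, b = 1.0000000000000002, c = -1e16;
    double separate = a * b + c;    /* a*b rounded first, then the add */
    double fused    = fma(a, b, c); /* exact a*b + c, rounded once */
    printf("separate: %.17g\n", separate); /* 2 */
    printf("fused:    %.17g\n", fused);    /* ~2.2204460492503131 */
    return 0;
}
```

A different optimizer flag or a different FPU is enough to flip which result you get, which is why "2% faster" on its own proves nothing.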
certifiable-bench solves this by:
- Hashing all inference outputs during the benchmark
- Comparing output hashes before comparing performance
- Refusing to report performance ratios if outputs differ
- Generating auditable JSON reports with full provenance
| Feature | Description |
|---|---|
| High-Resolution Timing | POSIX backend, 23ns overhead, nanosecond precision |
| Comprehensive Statistics | Min, max, mean, median, p95, p99, stddev, WCET |
| Histogram & Outliers | Configurable bins, MAD-based outlier detection |
| Cryptographic Verification | FIPS 180-4 SHA-256, constant-time comparison |
| Platform Detection | x86_64, aarch64, riscv64, CPU model, frequency |
| Environmental Monitoring | Temperature, throttle events, stability assessment |
| Cross-Platform Comparison | Bit-identity gate, Q16.16 fixed-point ratios |
| Report Generation | JSON, CSV, human-readable summary |
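The outlier row refers to the median-absolute-deviation test: a sample is flagged when its distance from the median exceeds k times the MAD. A sketch of the idea (the function and parameter names here are ours, not the library's API):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for uint64_t latency samples */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

static uint64_t median_sorted(const uint64_t *s, size_t n) {
    return (n % 2) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2;
}

/* Sets outlier[i] = 1 when |sample[i] - median| > k * MAD. */
static void mad_flag_outliers(const uint64_t *sample, size_t n,
                              uint64_t k, uint8_t *outlier) {
    uint64_t *tmp = malloc(n * sizeof *tmp);
    size_t i;
    if (!tmp) return;
    memcpy(tmp, sample, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_u64);
    uint64_t med = median_sorted(tmp, n);
    for (i = 0; i < n; i++) {           /* absolute deviations from median */
        uint64_t s = sample[i];
        tmp[i] = (s > med) ? s - med : med - s;
    }
    qsort(tmp, n, sizeof *tmp, cmp_u64);
    uint64_t mad = median_sorted(tmp, n);
    for (i = 0; i < n; i++) {
        uint64_t d = (sample[i] > med) ? sample[i] - med : med - sample[i];
        outlier[i] = (mad > 0 && d > k * mad);  /* mad == 0: flag nothing */
    }
    free(tmp);
}
```

Unlike mean/stddev thresholds, the MAD is robust: a handful of pathological samples cannot drag the threshold up and hide themselves.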
Build and test:

```
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
ctest --output-on-failure
```

Run the benchmark on each platform, then compare:

```
./bench_inference --iterations 5000 --output result_x86.json
./bench_inference --iterations 5000 --output result_riscv.json --compare result_x86.json
```

Output:
```
═══════════════════════════════════════════════════════════════════
              Cross-Platform Performance Comparison
  Reference: x86_64            Target: riscv64
═══════════════════════════════════════════════════════════════════
  Bit Identity: VERIFIED (outputs identical)

  Latency (p99):
    Diff:  +1,333,323 ns
    Ratio: 2.08x slower

  Throughput:
    Diff:  -4,210 inferences/sec
    Ratio: 0.48x
═══════════════════════════════════════════════════════════════════
```
Performance comparison is only valid when outputs are identical:
```c
cb_compare_results(&x86_result, &riscv_result, &comparison);

if (comparison.outputs_identical) {
    /* Safe to compare performance */
    printf("Ratio: %.2fx\n", comparison.latency_ratio_q16 / 65536.0);
} else {
    /* Something is wrong — don't trust performance numbers */
    printf("ERROR: Outputs differ, comparison invalid\n");
}
```

This is the certifiable-* philosophy: correctness first, then performance.
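The `latency_ratio_q16` field is a Q16.16 fixed-point value, so dividing by 65536.0 recovers the decimal ratio. A worked example with the p99 numbers above (the exact formula the library uses is an assumption on our part):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t reference_ns = 1234000;  /* x86_64 p99 */
    uint64_t target_ns    = 2567000;  /* riscv64 p99 */

    /* Q16.16: scale by 2^16, then integer-divide (assumed construction) */
    uint32_t ratio_q16 = (uint32_t)((target_ns << 16) / reference_ns);

    printf("ratio_q16 = %u\n", ratio_q16);             /* 136329 */
    printf("ratio     = %.2f\n", ratio_q16 / 65536.0); /* 2.08 */
    return 0;
}
```

Keeping the ratio itself in integer fixed-point means the comparison figures in a report are as deterministic as the measurements they summarize.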
CLI reference:

```
Usage: bench_inference [options]

Options:
  --iterations N    Measurement iterations (default: 1000)
  --warmup N        Warmup iterations (default: 100)
  --batch N         Batch size (default: 1)
  --output PATH     Output JSON path
  --csv PATH        Output CSV path
  --compare PATH    Compare with previous JSON result
  --help            Show this help

Examples:
  bench_inference --iterations 5000 --output result.json
  bench_inference --compare baseline.json --output new_result.json
```
#include "cb_runner.h"
#include "cb_report.h"
/* Your inference function */
cb_result_code_t my_inference(void *ctx, const void *in, void *out) {
/* Run neural network forward pass */
return CB_OK;
}
int main(void) {
cb_config_t config;
cb_result_t result;
cb_config_init(&config);
config.warmup_iterations = 100;
config.measure_iterations = 1000;
cb_run_benchmark(&config, my_inference, model,
input, input_size, output, output_size,
&result);
cb_print_summary(&result);
cb_write_json(&result, "benchmark.json");
return 0;
}The benchmark core follows CB-MATH-001 §7.2:
```c
for (i = 0; i < iterations; i++) {
    /* === CRITICAL LOOP START === */
    t_start = cb_timer_now_ns();
    fn(ctx, input, output);
    t_end = cb_timer_now_ns();
    /* === CRITICAL LOOP END === */

    samples[i] = t_end - t_start;

    /* Verification OUTSIDE timing: fold this iteration's output
       into the running SHA-256 verification context */
    cb_verify_ctx_update(&verify_ctx, output, size);
}
```

Timing measures only the inference. Hashing happens outside the critical path.
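The timer overhead quoted in the feature table (23 ns) is small but not zero. One common way such a figure is obtained is by timing back-to-back clock reads and keeping the minimum delta; a hypothetical calibration sketch, not the library's actual code:

```c
#include <stdint.h>
#include "cb_timer.h"  /* for cb_timer_now_ns() */

/* Hypothetical: estimate per-read timer cost from back-to-back reads.
 * The minimum observed delta is a lower bound on one read's overhead. */
static uint64_t estimate_timer_overhead_ns(void) {
    uint64_t min_delta = UINT64_MAX;
    for (int i = 0; i < 100000; i++) {
        uint64_t t0 = cb_timer_now_ns();
        uint64_t t1 = cb_timer_now_ns();
        if (t1 - t0 < min_delta)
            min_delta = t1 - t0;
    }
    return min_delta;
}
```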
Example JSON report:

```json
{
  "version": "1.0",
  "platform": "x86_64",
  "cpu_model": "Intel Xeon @ 2.20GHz",
  "latency": {
    "min_ns": 1234567,
    "max_ns": 2345678,
    "mean_ns": 1500000,
    "p99_ns": 2100000,
    "wcet_bound_ns": 3245678
  },
  "throughput": {
    "inferences_per_sec": 666
  },
  "verification": {
    "determinism_verified": true,
    "output_hash": "48f4eceb..."
  },
  "faults": {
    "overflow": false,
    "thermal_drift": false
  }
}
```

Repository layout:

```
certifiable-bench/
├── include/
│   ├── cb_types.h            Core types (config, result, faults)
│   ├── cb_timer.h            Timer API
│   ├── cb_metrics.h          Statistics API
│   ├── cb_platform.h         Platform detection API
│   ├── cb_verify.h           SHA-256, golden ref, result binding
│   ├── cb_runner.h           Benchmark runner API
│   └── cb_report.h           JSON/CSV/comparison API
├── src/
│   ├── timer/timer.c         POSIX backend (23ns overhead)
│   ├── metrics/metrics.c     Stats, histogram, outlier, WCET
│   ├── platform/platform.c   Detection, hwcounters, env monitoring
│   ├── verify/verify.c       FIPS 180-4 SHA-256
│   ├── runner/runner.c       Critical loop, warmup
│   └── report/report.c       JSON, CSV, compare, print
├── tests/unit/
│   ├── test_timer.c          10,032 assertions
│   ├── test_metrics.c        1,502 assertions
│   ├── test_platform.c       35 assertions
│   ├── test_verify.c         113 assertions
│   ├── test_runner.c         92 assertions
│   └── test_report.c         66 assertions
├── examples/
│   └── bench_inference.c     CLI benchmark tool
└── docs/
    ├── CB-MATH-001.md        Mathematical specification
    ├── CB-STRUCT-001.md      Data structure specification
    └── requirements/         6 SRS documents
```
```
$ ctest --output-on-failure
Test project /home/william/certifiable-bench/build
    Start 1: test_timer
1/6 Test #1: test_timer ....................... Passed 0.01 sec
    Start 2: test_metrics
2/6 Test #2: test_metrics ..................... Passed 0.00 sec
    Start 3: test_platform
3/6 Test #3: test_platform .................... Passed 0.01 sec
    Start 4: test_verify
4/6 Test #4: test_verify ...................... Passed 0.01 sec
    Start 5: test_runner
5/6 Test #5: test_runner ...................... Passed 0.00 sec
    Start 6: test_report
6/6 Test #6: test_report ...................... Passed 0.00 sec

100% tests passed, 0 tests failed out of 6
```

Total assertions: 11,840
| Document | Description |
|---|---|
| CB-MATH-001 | Mathematical specification |
| CB-STRUCT-001 | Data structure specification |
| SRS-001-TIMER | Timer requirements |
| SRS-002-METRICS | Statistics requirements |
| SRS-003-RUNNER | Benchmark runner requirements |
| SRS-004-VERIFY | Verification requirements |
| SRS-005-REPORT | Reporting requirements |
| SRS-006-PLATFORM | Platform detection requirements |
~233 SHALL statements across 6 SRS documents
| Platform | OS | Compiler | Result |
|---|---|---|---|
| x86_64 | Linux (Ubuntu) | GCC 12.2.0 | ✓ 23ns timer overhead |
| x86_64 | Container | GCC 13.3.0 | ✓ 34ns timer overhead |
| aarch64 | — | — | Pending |
| riscv64 | — | — | Pending (Semper Victus) |
certifiable-bench integrates with the certifiable-* ecosystem:

- certifiable-harness: "Are the outputs identical?" (correctness)
- certifiable-bench: "How fast is it?" (performance)

```
certifiable-inference ──→ certifiable-bench ──→ JSON report
          ↑                       │
          └───────────────────────┘
              Performance data
```
| Project | Purpose |
|---|---|
| certifiable-data | Deterministic data pipeline |
| certifiable-training | Deterministic training |
| certifiable-quant | Model quantization |
| certifiable-deploy | Bundle packaging |
| certifiable-inference | Fixed-point inference |
| certifiable-monitor | Runtime monitoring |
| certifiable-verify | Pipeline verification |
| certifiable-harness | End-to-end bit-identity |
Dual License:
- Open Source: GPL-3.0 — Free for open source projects
- Commercial: Contact william@fstopify.com for commercial licensing
Patent: UK Patent Application GB2521625.0
See CONTRIBUTING.md for guidelines.
All contributions require a signed Contributor License Agreement.
William Murray
The Murray Family Innovation Trust
SpeyTech · GitHub
Measure performance. Verify determinism. Generate evidence.
Copyright © 2026 The Murray Family Innovation Trust. All rights reserved.