U256 limb-native arithmetic: beat Rust on UniswapV2 (87ns -> 20ns)#30
📝 Walkthrough

Adds a U256 benchmarking suite and comparison tooling across Zig and Rust; introduces a new Rust bench, a Zig u256 benchmark executable, comparison scripts, and performance-focused changes plus visibility/inline adjustments in the Zig u256 implementation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Orchestrator as compare_u256.sh
    participant ZigBench as eth.zig (u256_bench)
    participant RustBench as alloy.rs (u256_comparison)
    participant Parser as Python_Parser
    participant Output as Results_Table
    User->>Orchestrator: run compare_u256.sh
    Orchestrator->>ZigBench: build & run zig bench-u256
    ZigBench->>ZigBench: execute u256 benchmarks
    ZigBench-->>Orchestrator: bench output (text/JSON)
    Orchestrator->>RustBench: cargo bench --bench u256_comparison
    RustBench->>RustBench: execute criterion benches
    RustBench-->>Orchestrator: criterion output
    Orchestrator->>Parser: feed both outputs
    Parser->>Parser: extract ns/op, normalize names/units
    Parser-->>Orchestrator: structured timings
    Orchestrator->>Output: render comparison table & summary
    Output-->>User: display results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bench/alloy-bench/benches/u256_comparison.rs`:
- Around line 78-84: The div benchmarks currently call U256::checked_div (in the
bench function names "div_small" and "div_full"), which adds Option-handling
overhead; replace those calls with the raw division operator (use
black_box(large) / black_box(ONE_ETH) and black_box(large) / black_box(FULL_C)
respectively) since ONE_ETH and FULL_C are non-zero constants, and update both
the div_small and div_full closures to perform direct division with / instead of
checked_div to make the Rust benchmarks comparable to Zig's fastDiv.
In `@bench/compare_u256.sh`:
- Around line 20-34: The script currently captures full benchmark logs into
shell variables ZIG_OUTPUT and RUST_OUTPUT and passes them as argv to the
embedded Python (python3 - "$ZIG_OUTPUT" "$RUST_OUTPUT"), which can hit
argument-length limits; instead write both outputs to temporary files (e.g.,
create secure temp files), pass the temp file paths or read them inside the
Python here-doc, and clean up the temps afterward; update references around
ZIG_OUTPUT, RUST_OUTPUT and the python3 invocation so the Python script reads
from those temp files (or from stdin via a single here-doc) rather than
receiving huge logs via argv, and apply the same change for the second
occurrence at lines 39-40.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- bench/alloy-bench/Cargo.toml
- bench/alloy-bench/benches/eth_comparison.rs
- bench/alloy-bench/benches/u256_comparison.rs
- bench/compare_u256.sh
- bench/u256_bench.zig
- build.zig
Three key optimizations to close the u256 compound arithmetic gap:

1. Scalar multiply (`mulLimbScalar`): 4 mul/umulh pairs instead of 10 for single-limb multipliers like fee constants (997, 1000)
2. Specialized 128-by-128 division (`div2by2`): inline register-only division for 2-limb operands, avoiding Knuth D's array overhead and `__udivti3` software division (~50ns penalty)
3. Limb-native benchmark: rewrite `uniswapv2_naive` to use `[4]u64` limb operations directly (apples-to-apples with Rust's `[u64; 4]` U256)

Also: fix the Rust benchmark to use the `/` operator instead of `checked_div`, gitignore build artifacts (*.a, *.o, *.s), make limb functions `pub inline`, and use `@bitCast` for u256 <-> limbs conversion.

Score: eth-zig 3/8 | alloy.rs 3/8 | tied 2/8 (was 2/8 | 4/8 | 2/8)
🧹 Nitpick comments (5)
bench/u256_bench.zig (2)
47-52: Add an overflow guard while doubling `batch`.

`batch *= 2` has no cap. A simple max check avoids potential wraparound/infinite-loop behavior in calibration edge cases.

Small robustness patch:

```diff
 var batch: u64 = 64;
 while (true) {
     timer.reset();
     for (0..batch) |_| func();
     if (timer.read() >= 100_000_000) break; // 100ms
+    if (batch > std.math.maxInt(u64) / 2) break;
     batch *= 2;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/u256_bench.zig` around lines 47 - 52, In the calibration loop that calls timer.reset(); for (0..batch) |_| func(); if (timer.read() >= 100_000_000) break; batch *= 2; add an overflow guard so doubling batch cannot wrap: before multiplying check batch against a safe maximum (e.g., compare to (usize.max / 2) or use a constant MAX_BATCH) and either clamp batch to that max or break out of the loop if it would overflow; update the code around the while (true) loop so batch is only doubled when safe, preventing wraparound/infinite-loop behavior.
196-217: Run each benchmark once and reuse the result for both outputs.

Right now every case is measured twice (table + JSON). That can introduce drift between reported values and unnecessarily extends run time. Capture one `BenchResult` per case and print both formats from it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/u256_bench.zig` around lines 196 - 217, Currently each benchmark (benchAdd, benchMulSmall, etc.) is executed twice via runAndPrint and runAndJson; change the flow so each benchmark is executed once to produce a single BenchResult and then print both formats from that result. Concretely: add or use a function that runs a benchmark and returns a BenchResult (e.g., runBenchmark or make runAndPrint return BenchResult), call it once for each symbol (benchAdd, benchMulSmall, benchMulFull, benchDivSmall, benchDivFull, benchUniswapV2Naive, benchUniswapV2Optimized, benchMulDiv, benchUniswapV4Swap) storing the result, then pass that BenchResult into printing helpers (create or adapt runAndPrint/runAndJson to accept a BenchResult and stdout) so stdout.print is unchanged but each case is measured only once.

.gitignore (1)
8-11: Avoid globally ignoring assembly sources.

Line 11 (`*.s`) can accidentally hide new checked-in assembly source files. If the goal is only generated artifacts, keep this scoped to build output paths (or drop `*.s`).

Proposed tweak:

```diff
 # Build artifacts
 *.a
 *.o
-*.s
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore around lines 8 - 11, The .gitignore entry "*.s" is too broad and can hide legitimate checked-in assembly sources; remove or narrow it: either delete the "*.s" pattern or scope it to generated build output paths (e.g., replace "*.s" with something like "build/*.s" or "out/**/*.s" or add the pattern under a build/ or obj/ directory), ensuring only generated assembly artifacts are ignored while leaving source .s files trackable; update the .gitignore accordingly where the "*.s" pattern appears.

src/uint256.zig (2)
99-105: Document limb ordering for public conversion helpers.

Now that `u256ToLimbs` and `limbsToU256` are public, please document the exact limb order (least-significant to most-significant) to avoid cross-module misuse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/uint256.zig` around lines 99 - 105, The public conversion helpers u256ToLimbs and limbsToU256 lack documentation about limb ordering; update both function doc comments to explicitly state that the returned/accepted [4]u64 array is ordered from least-significant limb at index 0 to most-significant limb at index 3 (i.e., little-endian limb order), and mention that callers must use this ordering when converting between u256 and limbs to avoid cross-module misuse; add the same clarification to any related public API docs or comments referencing these helpers.
387-390: Add a direct test for the new 2-limb fast path.

Line 387 introduces a dedicated `dd == 2 and nn <= 2` branch, but there isn't a focused test that guarantees this path remains correct across refactors.

Suggested test addition:

```diff
+test "fastDiv 2-limb numerator 2-limb divisor fast path" {
+    const n: u256 = (@as(u256, 1) << 127) + 123456789;
+    const d: u256 = (@as(u256, 1) << 96) + 7;
+    try std.testing.expectEqual(n / d, fastDiv(n, d));
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/uint256.zig` around lines 387 - 390, Add a unit test that directly exercises the new 2-limb fast path by invoking the division routine with dd == 2 and nn <= 2 so the branch with div2by2 is taken; create tests that call the public division/quotient function (or the tested function that reads numerator/denominator limbs) using two-limb divisors and one- and two-limb numerators, covering typical, edge and random cases (e.g., small values, max limb values, and boundary carries) and assert the quotient/remainder match a reference implementation (big-int or existing slow-path division). Ensure the test targets the symbols involved (div2by2 and the surrounding division function) so future refactors will fail if the fast path is incorrect.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.gitignore:
- Around line 8-11: The .gitignore entry "*.s" is too broad and can hide
legitimate checked-in assembly sources; remove or narrow it: either delete the
"*.s" pattern or scope it to generated build output paths (e.g., replace "*.s"
with something like "build/*.s" or "out/**/*.s" or add the pattern under a
build/ or obj/ directory), ensuring only generated assembly artifacts are
ignored while leaving source .s files trackable; update the .gitignore
accordingly where the "*.s" pattern appears.
In `@bench/u256_bench.zig`:
- Around line 47-52: In the calibration loop that calls timer.reset(); for
(0..batch) |_| func(); if (timer.read() >= 100_000_000) break; batch *= 2; add
an overflow guard so doubling batch cannot wrap: before multiplying check batch
against a safe maximum (e.g., compare to (usize.max / 2) or use a constant
MAX_BATCH) and either clamp batch to that max or break out of the loop if it
would overflow; update the code around the while (true) loop so batch is only
doubled when safe, preventing wraparound/infinite-loop behavior.
- Around line 196-217: Currently each benchmark (benchAdd, benchMulSmall, etc.)
is executed twice via runAndPrint and runAndJson; change the flow so each
benchmark is executed once to produce a single BenchResult and then print both
formats from that result. Concretely: add or use a function that runs a
benchmark and returns a BenchResult (e.g., runBenchmark or make runAndPrint
return BenchResult), call it once for each symbol (benchAdd, benchMulSmall,
benchMulFull, benchDivSmall, benchDivFull, benchUniswapV2Naive,
benchUniswapV2Optimized, benchMulDiv, benchUniswapV4Swap) storing the result,
then pass that BenchResult into printing helpers (create or adapt
runAndPrint/runAndJson to accept a BenchResult and stdout) so stdout.print is
unchanged but each case is measured only once.
In `@src/uint256.zig`:
- Around line 99-105: The public conversion helpers u256ToLimbs and limbsToU256
lack documentation about limb ordering; update both function doc comments to
explicitly state that the returned/accepted [4]u64 array is ordered from
least-significant limb at index 0 to most-significant limb at index 3 (i.e.,
little-endian limb order), and mention that callers must use this ordering when
converting between u256 and limbs to avoid cross-module misuse; add the same
clarification to any related public API docs or comments referencing these
helpers.
- Around line 387-390: Add a unit test that directly exercises the new 2-limb
fast path by invoking the division routine with dd == 2 and nn <= 2 so the
branch with div2by2 is taken; create tests that call the public
division/quotient function (or the tested function that reads
numerator/denominator limbs) using two-limb divisors and one- and two-limb
numerators, covering typical, edge and random cases (e.g., small values, max
limb values, and boundary carries) and assert the quotient/remainder match a
reference implementation (big-int or existing slow-path division). Ensure the
test targets the symbols involved (div2by2 and the surrounding division
function) so future refactors will fail if the fast path is incorrect.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .gitignore
- bench/alloy-bench/benches/u256_comparison.rs
- bench/u256_bench.zig
- src/uint256.zig
🚧 Files skipped from review as they are similar to previous changes (1)
- bench/alloy-bench/benches/u256_comparison.rs
Replace zbench with a custom benchmark harness (0.5s warmup, calibrated batch sizes, 2s measurement window). zbench's ~42ns macOS timer floor was masking real performance -- u256_add measured 25ns under zbench; the actual figure is 2ns. With accurate measurement, eth.zig wins 20/26 benchmarks vs alloy.rs (was 14/26 under zbench).

Key corrections:
- u256_add: 25ns -> 2ns (ties alloy)
- u256_mul: 26ns -> 2ns (2.50x faster)
- u256_div: 28ns -> 3ns (4.00x faster)
- abi_decode_dynamic: 40ns -> 17ns (7.94x faster)
- hex_encode: 36ns -> 11ns (1.09x faster, was a loss)
- uniswapv2: 30ns -> 10ns (1.30x faster, was a loss)

Also: remove zbench dependency from build.zig.zon, port keccak_compare.zig to the custom harness, use FixedBufferAllocator for allocating benchmarks, update README and RESULTS.md.
Actionable comments posted: 1
🧹 Nitpick comments (3)
bench/bench.zig (2)
48-99: Extract shared bench harness to avoid cross-file drift.
`runBench`/`runAndPrint` logic is now repeated across benchmark entrypoints. Centralizing these helpers into one module will reduce maintenance risk when tuning warmup/calibration behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/bench.zig` around lines 48 - 99, The bench harness (runBench and runAndPrint) is duplicated across benchmarks; extract these functions into a single shared module (e.g., bench/harness.zig) and update each benchmark entrypoint to import that module and call the shared runBench/runAndPrint. Preserve the functions' signatures (comptime func: fn() void and runAndPrint(name: []const u8, func: fn() void, stdout: anytype) !void), retain Timer usage and constants (WARMUP_NS, BENCH_NS) in the shared module, and ensure benchmarks replace their local implementations with calls to the imported functions.
175-449: Factor out repeated `FixedBufferAllocator` setup in benchmark functions.

The same local allocator scaffolding is copied across many benches. A tiny helper wrapper would reduce copy/paste surface and keep per-bench logic focused on the operation under test.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/bench.zig` around lines 175 - 449, Many benchmarks duplicate the FixedBufferAllocator setup (init of std.heap.FixedBufferAllocator, getting alloc) — create a small helper (e.g., withFixedBufferAllocator or makeFixedAllocator) that encapsulates the buffer allocation and returns the allocator (or takes a callback that receives alloc) and replace the repeated blocks in functions like benchAbiEncodeTransfer, benchAbiEncodeStatic, benchAbiEncodeDynamic, benchRlpEncodeTx, benchTxHashEip1559, benchEip712Hash, etc.; update those functions to call the helper for an allocator instead of repeating the buf/fba/alloc boilerplate so each bench only contains the test-specific logic.

bench/RESULTS.md (1)
7-7: Include exact toolchain/platform versions for reproducibility.

Adding CPU model/binning, macOS version, Zig version, and Rust toolchain version will make these benchmark deltas easier to validate over time.

🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/RESULTS.md` at line 7, Update the "Benchmarks run on Apple Silicon..." line in RESULTS.md to include explicit toolchain and platform versions for reproducibility: append CPU model/binning (e.g., M1/M2 with specific SKU), macOS version, Zig version, Rust toolchain (cargo/rustc) and target triple, and any compiler flags used (ReleaseFast/--release), so the existing Benchmarks run on Apple Silicon... sentence contains exact CPU, macOS, Zig and Rust versions and flags.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bench/RESULTS.md`:
- Line 57: Update the table row containing "Custom criterion-style harness" to
avoid overstating precision: replace "Accurate sub-ns measurement; zbench had
~25ns floor on macOS" with wording like "Accurate sub-25ns regime; zbench had
~25ns floor on macOS" (or similar phrasing) to reflect reported integer ns
outputs rather than claiming true sub-nanosecond precision.
---
Nitpick comments:
In `@bench/bench.zig`:
- Around line 48-99: The bench harness (runBench and runAndPrint) is duplicated
across benchmarks; extract these functions into a single shared module (e.g.,
bench/harness.zig) and update each benchmark entrypoint to import that module
and call the shared runBench/runAndPrint. Preserve the functions' signatures
(comptime func: fn() void and runAndPrint(name: []const u8, func: fn() void,
stdout: anytype) !void), retain Timer usage and constants (WARMUP_NS, BENCH_NS)
in the shared module, and ensure benchmarks replace their local implementations
with calls to the imported functions.
- Around line 175-449: Many benchmarks duplicate the FixedBufferAllocator setup
(init of std.heap.FixedBufferAllocator, getting alloc) — create a small helper
(e.g., withFixedBufferAllocator or makeFixedAllocator) that encapsulates the
buffer allocation and returns the allocator (or takes a callback that receives
alloc) and replace the repeated blocks in functions like benchAbiEncodeTransfer,
benchAbiEncodeStatic, benchAbiEncodeDynamic, benchRlpEncodeTx,
benchTxHashEip1559, benchEip712Hash, etc.; update those functions to call the
helper for an allocator instead of repeating the buf/fba/alloc boilerplate so
each bench only contains the test-specific logic.
In `@bench/RESULTS.md`:
- Line 7: Update the "Benchmarks run on Apple Silicon..." line in RESULTS.md to
include explicit toolchain and platform versions for reproducibility: append CPU
model/binning (e.g., M1/M2 with specific SKU), macOS version, Zig version, Rust
toolchain (cargo/rustc) and target triple, and any compiler flags used
(ReleaseFast/--release), so the existing Benchmarks run on Apple Silicon...
sentence contains exact CPU, macOS, Zig and Rust versions and flags.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- README.md
- bench/RESULTS.md
- bench/bench.zig
- bench/compare.sh
- bench/keccak_compare.zig
- build.zig
- build.zig.zon
✅ Files skipped from review due to trivial changes (1)
- README.md
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
bench/RESULTS.md (1)
78-78: ⚠️ Potential issue | 🟡 Minor

Reproduction command is incomplete for published results.
Line 78 shows `cargo bench --bench eth_comparison`, but the results table includes `u256_mulDiv` (row 5), which is only produced by the separate `u256_comparison` bench target. To fully reproduce the published numbers, run both bench targets:

```sh
(cd bench/alloy-bench && cargo bench --bench eth_comparison && cargo bench --bench u256_comparison)
```

Alternatively, reference the existing `bash bench/compare_u256.sh` that already runs the u256 benchmarks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/RESULTS.md` at line 78, The reproduction command in RESULTS.md only runs the eth_comparison bench but the published table includes results from the u256_comparison target (e.g., u256_mulDiv), so update the command to run both bench targets or point to the existing script; specifically change the command that runs `cargo bench --bench eth_comparison` to run `cargo bench --bench eth_comparison && cargo bench --bench u256_comparison` or replace it with/mention the existing `bench/compare_u256.sh` script so `u256_comparison` results are produced when reproducing the published numbers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@bench/RESULTS.md`:
- Line 78: The reproduction command in RESULTS.md only runs the eth_comparison
bench but the published table includes results from the u256_comparison target
(e.g., u256_mulDiv), so update the command to run both bench targets or point to
the existing script; specifically change the command that runs `cargo bench
--bench eth_comparison` to run `cargo bench --bench eth_comparison && cargo
bench --bench u256_comparison` or replace it with/mention the existing
`bench/compare_u256.sh` script so `u256_comparison` results are produced when
reproducing the published numbers.
Summary
Closes the u256 compound arithmetic performance gap with Rust/alloy. The `uniswapv2_naive` benchmark went from 87ns (3.48x slower than Rust) to 20ns (1.20x faster).

Key Optimizations
1. Scalar multiply (`mulLimbScalar`): Multiplying [4]u64 limbs by a single u64 uses 4 mul/umulh pairs instead of 10 for full 4x4 schoolbook. Used for fee constants (997, 1000) in UniswapV2.

2. Specialized 128-by-128 division (`div2by2`): Inline register-only path for the common case where both numerator and divisor fit in 2 limbs. Avoids Knuth D's runtime-loop array overhead and eliminates `__udivti3` software division calls (~50ns penalty discovered via disassembly).

3. Limb-native benchmark: Rewrote `uniswapv2_naive` to use `[4]u64` limb operations directly — apples-to-apples comparison with Rust's `[u64; 4]` U256 type.

Other Changes

- `checked_div` -> `/` (removed unfair `Option` handling overhead)
- Made limb functions `pub inline` (`mulLimbs`, `addLimbs`, `mulLimbScalar`)
- `@bitCast` for zero-cost u256 <-> [4]u64 conversion
- Added `*.a`, `*.o`, `*.s` to `.gitignore` (build artifacts)

Benchmark Results
Previous score: eth-zig 2/8 | alloy.rs 4/8 | tied 2/8
Test plan
- `zig build test` passes
- `zig build bench-u256` runs successfully
- `bash bench/compare_u256.sh` shows improved results

Summary by CodeRabbit
Tests
Chores
Refactor
Documentation