Closed
* chore: add toolchain and formatting config
Pin Rust 1.88 with minimal profile (cargo, rustc, clippy, rustfmt).
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(ci): switch to actions-rust-lang/setup-rust-toolchain
Respects rust-toolchain.toml automatically. Also normalize clippy
flags to use --all --all-targets consistently.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(primitives): add u128/i128 serialization support
Required by the Fp128 field backend.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(algebra): add prime fields, extensions, and modules
Introduces the algebra module with:
- Fp32/Fp64/Fp128 prime field backends with branchless constant-time
add/sub/neg and rejection-sampled random
- U256 helper for Fp128 wide multiplication
- Fp2/Fp4 tower extensions with Karatsuba-ready structure
- VectorModule<F, N> fixed-length vector module over any field
- Poly<F, D> fixed-size polynomial container
Co-authored-by: Cursor <cursoragent@cursor.com>
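The branchless constant-time add/sub pattern these field backends rely on can be sketched as follows. This is a minimal illustration with a toy 61-bit prime and hypothetical function names, not the crate's actual `Fp32`/`Fp64`/`Fp128` code:

```rust
// Branchless modular add/sub, assuming inputs are already reduced to [0, P).
// P is an illustrative Mersenne prime, not one of Hachi's backends.
const P: u64 = (1u64 << 61) - 1;

/// a + b mod P: compute the wrapped sum, then subtract P masked by
/// whether the sum reached P (no data-dependent branch).
fn add_mod(a: u64, b: u64) -> u64 {
    let s = a.wrapping_add(b); // < 2P, cannot overflow for P < 2^63
    let mask = 0u64.wrapping_sub((s >= P) as u64); // all-ones iff s >= P
    s.wrapping_sub(P & mask)
}

/// a - b mod P: subtract, then add P back masked by the borrow flag.
fn sub_mod(a: u64, b: u64) -> u64 {
    let (d, borrow) = a.overflowing_sub(b);
    let mask = 0u64.wrapping_sub(borrow as u64);
    d.wrapping_add(P & mask)
}

fn main() {
    assert_eq!(add_mod(P - 1, 1), 0);
    assert_eq!(sub_mod(0, 1), P - 1);
    println!("ok");
}
```

The mask trick compiles to a compare plus a conditional-select rather than a jump, which is what keeps the latency independent of the operand values.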
* feat(algebra): add NTT small-prime arithmetic and CRT helpers
Adds the ntt submodule with:
- NttPrime: per-prime Montgomery-like fpmul, Barrett-like fpred,
branchless csubq/caddq/center
- LimbQ/QData: radix-2^14 limb arithmetic for big-q coefficients
- logq=32 parameter preset (six NTT-friendly primes, CRT constants)
Co-authored-by: Cursor <cursoragent@cursor.com>
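The branchless `csubq`/`center` helpers mentioned above follow a standard sign-mask pattern. A minimal sketch, with an illustrative small prime rather than one of the actual NTT-friendly Q32 primes:

```rust
// Illustrative NTT-style prime; the real per-prime constants differ.
const Q: i32 = 7681;

/// Conditional subtract: maps [0, 2q) -> [0, q) without branching.
/// t = a - q is negative iff a < q; the arithmetic shift turns the sign
/// bit into an all-ones mask that adds q back exactly in that case.
fn csubq(a: i32) -> i32 {
    let t = a - Q;
    t + ((t >> 31) & Q)
}

/// Center: maps a representative in [0, q) into the balanced range
/// around zero by subtracting q from values above q/2, branchlessly.
fn center(a: i32) -> i32 {
    a - (Q & 0i32.wrapping_sub((a > Q / 2) as i32))
}

fn main() {
    assert_eq!(csubq(Q + 5), 5);
    assert_eq!(csubq(5), 5);
    assert_eq!(center(Q - 1), -1);
    println!("ok");
}
```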
* test(algebra): add comprehensive algebra test suite
24 tests covering:
- Field arithmetic, identities, and distributivity (Fp32/Fp64/Fp128)
- Zero inversion returns None
- Serialization round-trips (all field types, extensions, VectorModule)
- Fp2 conjugate, norm, and distributivity
- U256 wide multiply and bit access
- LimbQ round-trip and add/sub inverse
- QData consistency with preset constants
- NTT normalize range and fpmul commutativity
- Poly add/sub/neg
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs: add and update progress tracking document
Records Phase 0 status: all field types, extensions, NTT scaffolding,
constant-time arithmetic, and 24-test suite. Reflects the
fields/ntt/module/poly directory layout.
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(ntt): Rust-ify NTT/CRT port from C
Overhaul the NTT small-prime arithmetic and CRT modules:
- Add MontCoeff newtype (#[repr(transparent)] i16 wrapper) to enforce
Montgomery-domain vs canonical-domain separation at the type level
- NttPrime methods now take/return MontCoeff instead of bare i16:
fpmul→mul, fpred→reduce, csubq→csubp, caddq→caddp
- Add domain conversion: from_canonical (i16→Mont), to_canonical (Mont→i16)
- Delete free functions (pointwise_mul etc), replaced by methods on NttPrime
- LimbQ: replace add_limbs/sub_limbs/less_than with std Add/Sub/Ord impls
- LimbQ: replace from_u128/to_u128 with From<u128>/TryFrom for u128
- LimbQ: add Display impl, branchless csub_mod
- Rename all LABRADOR* constants to project-native Q32_* names
- Add #[cfg(test)] verification that re-derives pinv/v/mont/montsq from p
- Add MontCoeff round-trip and LimbQ ordering tests (28 total)
Co-authored-by: Cursor <cursoragent@cursor.com>
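The `MontCoeff` newtype idea — using the type system to keep Montgomery-domain values from mixing with canonical ones — can be sketched like this. The modulus, R, and inverse constants here are illustrative (Kyber-style q = 3329, R = 2^16), and `to_canonical` uses a naive modular inverse of R for clarity where the real backend would use Montgomery reduction:

```rust
const Q: i32 = 3329;
const R_MOD_Q: i32 = (1 << 16) % Q; // 2285

/// A coefficient known to be in the Montgomery domain (x * R mod q).
/// #[repr(transparent)] keeps it layout-identical to the raw integer,
/// so the wrapper is free at runtime but mixing domains is a type error.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct MontCoeff(i16);

impl MontCoeff {
    /// Enter the Montgomery domain: multiply by R mod q.
    fn from_canonical(x: i16) -> Self {
        MontCoeff(((x as i32 * R_MOD_Q) % Q) as i16)
    }

    /// Leave the Montgomery domain: multiply by R^{-1} mod q.
    fn to_canonical(self) -> i16 {
        const R_INV: i32 = 169; // (2^16)^{-1} mod 3329
        ((self.0 as i32 * R_INV).rem_euclid(Q)) as i16
    }
}

fn main() {
    let m = MontCoeff::from_canonical(1234);
    // A bare i16 cannot be passed where MontCoeff is expected, and
    // vice versa -- the domain separation is enforced at compile time.
    assert_eq!(m.to_canonical(), 1234);
    println!("ok");
}
```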
* chore: remove section banners, update progress doc
Remove // ---- Section ---- banner comments from prime.rs and crt.rs.
Add non-negotiable rules to HACHI_PROGRESS.md:
- No section-banner comments
- No commit/push without explicit user approval
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(ring): add CyclotomicRing, CyclotomicNtt, and NTT butterfly
Milestone 1 - CyclotomicRing<F, D> (coefficient form):
- Schoolbook negacyclic convolution Mul (X^D = -1)
- Add/Sub/Neg/AddAssign/SubAssign/MulAssign, scale, zero/one/x
- HachiSerialize/HachiDeserialize
Milestone 2 - NTT butterfly + CyclotomicNtt<K, D>:
- Merged negacyclic Cooley-Tukey forward NTT (twist folded into twiddles)
- Gentleman-Sande inverse NTT with D^{-1} scaling
- Runtime primitive-root finder and twiddle table computation
(TODO: migrate to compile-time const tables)
- CyclotomicNtt with per-prime pointwise Add/Sub/Neg/Mul
- Ring<->Ntt transforms with CRT reconstruction
Co-authored-by: Cursor <cursoragent@cursor.com>
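The schoolbook negacyclic convolution in Milestone 1 hinges on the identity X^D = -1: any product term that wraps past degree D picks up a minus sign. A minimal sketch with toy parameters (the real ring uses the crate's field types, not raw `u64`):

```rust
// Schoolbook multiplication in Z_q[X]/(X^D + 1) with illustrative q, D.
const Q: u64 = 97;
const D: usize = 4;

fn negacyclic_mul(a: &[u64; D], b: &[u64; D]) -> [u64; D] {
    let mut c = [0u64; D];
    for i in 0..D {
        for j in 0..D {
            let t = a[i] * b[j] % Q;
            if i + j < D {
                c[i + j] = (c[i + j] + t) % Q;
            } else {
                // X^{i+j} = X^{i+j-D} * X^D = -X^{i+j-D}
                c[i + j - D] = (c[i + j - D] + Q - t) % Q;
            }
        }
    }
    c
}

fn main() {
    // X * X^3 = X^4 = -1 in Z_q[X]/(X^4 + 1)
    let x = [0, 1, 0, 0];
    let x3 = [0, 0, 0, 1];
    assert_eq!(negacyclic_mul(&x, &x3), [Q - 1, 0, 0, 0]);
    println!("ok");
}
```

This is the O(D^2) baseline that the later NTT milestone replaces; it stays useful as a cross-check oracle in tests.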
* test(algebra): add ring and NTT tests, wrap in mod tests
Add 12 new tests:
- CyclotomicRing: negacyclic X^D=-1, mul identity/zero, commutativity,
distributivity, associativity, additive inverse, serde, degree-64
- NTT: forward/inverse round-trip (single prime + all primes),
NTT mul matches schoolbook cross-check
Wrap all integration tests in a single mod tests block and remove
section-banner comments.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(algebra): harden ring-NTT conversion and field decoding
Constrain ring/NTT conversions to explicit field backends and replace fragile CRT reconstruction with deterministic modular lifting. Enforce canonical deserialization checks in validated field decoding paths to reject malformed encodings.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(algebra): add CRT round-trip and serialization guard coverage
Add end-to-end ring->NTT->ring CRT round-trip tests plus reduced-ops stability checks. Expand serialization coverage for Fp4/Poly and verify checked deserialization rejects non-canonical field encodings.
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(bench): add ring_ntt benchmark target and CT tracking docs
Add a dedicated ring/NTT benchmark harness and register it in Cargo metadata. Record current constant-time review status and sync the implementation progress board with new milestones and test coverage.
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(field): split core, canonical, and sampling capabilities
Break the monolithic Field trait into FieldCore, CanonicalField, and FieldSampling, and update algebra primitives to depend on explicit capabilities for cleaner semantics and future backend integration.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(fields): add pow2-offset pseudo-mersenne registry and checks
Introduce the curated 2^k-offset prime registry and typed field aliases, then add dedicated Miller-Rabin regression tests to enforce probable primality for all enabled profiles.
Co-authored-by: Cursor <cursoragent@cursor.com>
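The Miller-Rabin regression check described above can be sketched as follows. This uses the standard witness set that is known to be deterministic for all `u64` inputs; the registry primes tested are the 2^k - c shapes named in this log (2^31 - 19, 2^30 - 35), and the function name is illustrative:

```rust
fn pow_mod(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut r: u64 = 1;
    b %= m;
    while e > 0 {
        if e & 1 == 1 { r = ((r as u128 * b as u128) % m as u128) as u64; }
        b = ((b as u128 * b as u128) % m as u128) as u64;
        e >>= 1;
    }
    r
}

/// Deterministic Miller-Rabin for u64 using the standard 12-witness set.
fn is_prime(n: u64) -> bool {
    if n < 2 { return false; }
    const WITNESSES: [u64; 12] = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37];
    for p in WITNESSES {
        if n % p == 0 { return n == p; }
    }
    let s = (n - 1).trailing_zeros();
    let d = (n - 1) >> s;
    'witness: for a in WITNESSES {
        let mut x = pow_mod(a, d, n);
        if x == 1 || x == n - 1 { continue; }
        for _ in 1..s {
            x = ((x as u128 * x as u128) % n as u128) as u64;
            if x == n - 1 { continue 'witness; }
        }
        return false; // a is a witness of compositeness
    }
    true
}

fn main() {
    assert!(is_prime((1u64 << 31) - 19)); // Pow2Offset31Field modulus
    assert!(is_prime((1u64 << 30) - 35)); // Pow2Offset30Field modulus
    assert!(!is_prime((1u64 << 31) - 20));
    println!("ok");
}
```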
* refactor(ring): introduce crt-ntt backend/domain layering
Rename the ring NTT representation to explicit CRT+NTT semantics and route conversions through backend traits, adding scalar backend and domain aliases for a cleaner representation-vs-execution boundary.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(algebra): cover backend parity and pow2-offset invariants
Expand algebra tests to validate default-vs-backend CRT+NTT equivalence, sampling bounds, and pow2-offset registry consistency under the new field and ring abstractions.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs(algebra): update progress notes and add prime analysis references
Refresh progress and constant-time notes to match the new CRT+NTT naming and field scope, and add the NTT prime analysis document plus local NIST standards artifacts used for parameter rationale.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(algebra): harden fp128 reduction and CRT reconstruction arithmetic
Make Fp128 reduction and CRT inner accumulation paths more timing-stable with branchless modular operations, and refresh ring/docs/tests status after the hardening cleanup pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(protocol): add transcript and commitment scaffold
Introduce Hachi protocol-layer interfaces and placeholder types with Blake2b/Keccak transcript backends plus phase-aligned labels, while making transcript absorption label-directed at call sites.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(protocol): add transcript and commitment contract coverage
Add deterministic transcript schedule checks (including keccak) and protocol commitment contract tests so transcript ordering and challenge derivation behavior are locked down.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs(protocol): align transcript spec and progress status
Document the protocol scaffold as in-progress, capture the commitment-focused transcript label vocabulary, and clarify deferred Jolt adapter expectations.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(protocol): add ring commitment core and seeded matrix derivation
Implement the ring-native commitment setup/commit core with config validation, utility modules, and seeded domain-separated public matrix derivation, while wiring prover/verifier stub modules for the next open-check phase.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(protocol): consolidate ring commitment and stub contract coverage
Unify ring commitment core and config validation checks in one test file and add explicit prover/verifier stub contract tests to lock current placeholder behavior before open-check implementation.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs(progress): update phase 2 status after commitment core landing
Record that ring-native §4.1 commitment setup/commit and protocol wiring are in place, and clarify that open-check prove/verify remains the next unfinished protocol milestone.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(algebra): harden CT inversion path and CRT final projection
Add a constant-time inversion helper for prime fields and replace scalar CRT's final `% q` projection with a division-free fixed-iteration reducer, so secret-bearing arithmetic paths avoid variable-latency behavior.
Co-authored-by: Cursor <cursoragent@cursor.com>
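The division-free fixed-iteration reducer that replaces the final `% q` can be sketched as restoring binary long division with a branchless conditional subtract at every step, so the instruction sequence never depends on the data. Widths and the function name are illustrative, not the crate's exact reducer:

```rust
/// Reduce x (< 2^63) modulo q without a hardware divide: subtract shifted
/// copies of q from the top position down. Every iteration executes the
/// same compare/mask/subtract, so the trip count depends only on q's width.
fn reduce_fixed(mut x: u64, q: u64) -> u64 {
    let qbits = 64 - q.leading_zeros();
    for shift in (0..=(63 - qbits)).rev() {
        let t = q << shift;
        // Branchless conditional subtract: mask is all-ones iff x >= t.
        let mask = 0u64.wrapping_sub((x >= t) as u64);
        x = x.wrapping_sub(t & mask);
    }
    x
}

fn main() {
    assert_eq!(reduce_fixed(100, 7), 100 % 7);
    assert_eq!(reduce_fixed(123456789, 7681), 123456789 % 7681);
    println!("ok");
}
```

Compared with `%`, this trades a variable-latency divide for a fixed number of cheap constant-time steps, which is the property the secret-bearing CRT path needs.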
* refactor(algebra): rename inversion helper API without ct suffix
Rename the secret-path inversion helper to `Invertible::inv_or_zero` while preserving constant-time semantics via doc contracts, and update CT tracking docs to match the new API names.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(algebra): clean inversion test naming and normalize formatting
Rename the inversion helper test to match the new API naming and keep the ring commitment test formatting consistent after linting.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(protocol): add sumcheck core module and tests
Introduce core sumcheck building blocks (univariate messages, compression, and transcript-driving prover/verifier driver) and add unit/integration tests. Update progress doc to reflect sumcheck core landing.
Co-authored-by: Cursor <cursoragent@cursor.com>
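The core mechanics of a sumcheck round — a degree-1 univariate message, a consistency check against the running claim, then folding the table at the verifier's challenge — can be sketched over a toy prime field. Names and the field are illustrative, not the module's actual API:

```rust
const P: u64 = 2013265921; // illustrative prime

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }

/// One prover round for a multilinear polynomial given by its eval table:
/// g(0) is the sum over the x_j = 0 half, g(1) over the x_j = 1 half.
fn round_message(evals: &[u64]) -> (u64, u64) {
    let half = evals.len() / 2;
    let g0 = evals[..half].iter().fold(0, |a, &v| add(a, v));
    let g1 = evals[half..].iter().fold(0, |a, &v| add(a, v));
    (g0, g1)
}

/// Fold the table at challenge r: e <- (1 - r) * lo + r * hi.
fn fold(evals: &[u64], r: u64) -> Vec<u64> {
    let half = evals.len() / 2;
    (0..half)
        .map(|i| add(mul(add(1, P - r % P), evals[i]), mul(r, evals[half + i])))
        .collect()
}

fn main() {
    let evals = vec![1, 2, 3, 4];
    let (g0, g1) = round_message(&evals);
    assert_eq!(add(g0, g1), 10); // matches the claimed sum
    // Next-round claim g(r) = g0 + (g1 - g0) * r must equal the sum
    // over the folded table -- the invariant the verifier replays.
    let r = 5;
    let claim = add(g0, mul(add(g1, P - g0), r));
    let folded = fold(&evals, r);
    assert_eq!(add(folded[0], folded[1]), claim);
    println!("ok");
}
```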
* Add reference PDF papers
* Add local agent instruction files
* Add Hachi and SuperNOVA digest docs
* Add general field, ring, and multilinear utilities
* Add sparse Fiat-Shamir challenge sampling
* Implement Polynomial Evaluation as Quadratic Equation
* Rename stub to prover and verifier
* Refactor code organization
* Replace decompose with balanced decompose
* Transform polynomial over Fq to ring
* Refactor function names
* Impl commitment_scheme API
* Add SolinasFp128 backend for sparse 128-bit primes
Introduce `SolinasFp128` with two-fold Solinas reduction for `p = 2^128 - c` (sparse `c`), plus `U256::sqr_u128`. Export descriptive prime aliases, add BigUint-backed correctness tests, and include a Criterion bench for mul/inv.
Co-authored-by: Cursor <cursoragent@cursor.com>
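The two-fold Solinas reduction for a pseudo-Mersenne prime p = 2^k - c rests on the identity 2^k ≡ c (mod p): the high half of a product can be folded down by multiplying it by c. A minimal sketch at k = 64 with `u128` intermediates for clarity — the crate's 128-bit version applies the same identity with `u64` limbs:

```rust
const C: u64 = 59; // p = 2^64 - 59, the largest prime below 2^64
const P: u64 = 0u64.wrapping_sub(C);

fn solinas_reduce(x: u128) -> u64 {
    let lo_mask = u64::MAX as u128;
    // Fold 1: hi * c + lo, since 2^64 ≡ c (mod p).
    let f1 = (x >> 64) * C as u128 + (x & lo_mask);
    // Fold 2: the new high word is at most ~c; fold it the same way.
    let f2 = (f1 >> 64) * C as u128 + (f1 & lo_mask);
    // At most one carry bit remains; 2^64 ≡ c absorbs it.
    let r = (f2 as u64).wrapping_add((f2 >> 64) as u64 * C);
    // One conditional subtract brings the result into [0, p).
    if r >= P { r - P } else { r }
}

fn mul_mod(a: u64, b: u64) -> u64 {
    solinas_reduce(a as u128 * b as u128)
}

fn main() {
    // (p - 1)^2 ≡ 1 (mod p)
    assert_eq!(mul_mod(P - 1, P - 1), 1);
    assert_eq!(mul_mod(2, 3), 6);
    println!("ok");
}
```

Because c is small, both folds are cheap multiplies, which is what makes this shape of prime attractive compared with generic Barrett or Montgomery reduction.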
* Tighten docs and minor clippy cleanups
Add missing rustdoc Errors/Panics sections and apply small simplifications suggested by clippy.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Add reduction steps to iteration prover
* Optimize Solinas mul/add/sub: fused u64-limb schoolbook + csel canonicalize
Rewrite mul_raw as a fused 2×2 schoolbook multiply with two-fold Solinas
reduction using explicit u64 limbs and mac helper, bypassing U256.
Replace mask-based canonicalize with carry-flag-based pattern that compiles
to adds+adcs+csel+csel (4 insns) instead of 10 on AArch64.
Add pure-mul, sqr, and throughput microbenchmarks.
Made-with: Cursor
* Switch SolinasFp128 repr from u128 to [u64; 2] for 8-byte alignment
Storage is now [u64; 2] (lo, hi) which halves alignment from 16 to 8
bytes, improving struct packing. Arithmetic hot paths convert to u128
for LLVM-optimal codegen (adds/adcs pairs), so no perf regression.
Made-with: Cursor
* Fuse overflow correction with canonicalize in fold2_canonicalize
When fold-2 overflows, the wrapped value s < C², so s + C < C(C+1) < P —
meaning s + C is already canonical. This lets us replace the separate
overflow-correction + canonicalize (3 + 4 insns) with a single fused
`if (overflow | carry) { s + C } else { s }` select, saving 2 instructions
on the critical path. Add compile-time assertion enforcing C(C+1) < P.
Made-with: Cursor
* Unify Fp128 with Solinas-optimized arithmetic, delete SolinasFp128
Replace the generic Fp128<const MODULUS: u128> (binary-long-division via
U256) with the Solinas-optimized implementation. Fp128<const P: u128>
now uses [u64; 2] storage, fused schoolbook 2x2 + two-fold Solinas
reduction (~23 cycles/mul on AArch64/x86-64), and compile-time
validation that P = 2^128 - C with C < 2^64.
Delete SolinasFp128, SolinasParams, solinas128.rs, and u256.rs. All
call sites updated; prime type aliases (Prime128M13M4P0 etc.) are now
simple Fp128<...> aliases in fp128.rs. Blanket PseudoMersenneField impl
for all Fp128<P>.
Made-with: Cursor
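The compile-time validation that P = 2^128 - C with C < 2^64 can be expressed with an associated const assertion, evaluated when the type is instantiated. This is a sketch of the pattern only; the struct and method names are illustrative, not the crate's actual `Fp128`:

```rust
#[allow(dead_code)]
struct Fp128<const P: u128>([u64; 2]);

impl<const P: u128> Fp128<P> {
    // 2^128 - P, computed by wrapping negation in u128.
    const C: u128 = 0u128.wrapping_sub(P);

    // A panicking const fails the *build* when the bound is violated,
    // so an ill-shaped modulus can never reach runtime.
    const VALID: () = assert!(
        Self::C > 0 && Self::C < 1u128 << 64,
        "P must be 2^128 - C with 0 < C < 2^64"
    );

    fn zero() -> Self {
        let () = Self::VALID; // force evaluation of the const assertion
        Fp128([0, 0])
    }
}

fn main() {
    // P = 2^128 - 159 (the largest prime below 2^128).
    let _ = Fp128::<{ 0u128.wrapping_sub(159) }>::zero();
    println!("ok");
}
```

Instantiating the type with, say, P = 2^128 - 2^65 would trip the assertion at compile time rather than silently producing wrong arithmetic.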
* Use git deps for ark-bn254/ark-ff instead of local paths
Switch from local path dependencies to the a16z/arkworks-algebra
git repo (branch dev/twist-shout) so collaborators can compile
without needing a local checkout of arkworks-algebra-jolt.
Made-with: Cursor
* Add template for sumchecks
* Optimize Fp128 mul path and expand Rust field benchmarks.
Refine Fp128 multiply/fold carry handling for better generated code and add isolated, passthrough, independent, and long-chain Rust microbenches to separate latency and throughput effects when comparing against BN254.
Made-with: Cursor
* Add 2^a±1 Fp128 reduction specialization and benches.
Detect C = 2^a ± 1 at compile time and route fold multiplications through a specialized shift-based path with generic fallback, plus add benchmark coverage for sparse 128-bit primes using this shape.
Made-with: Cursor
* Add packed Fp128 field backend scaffolding and focused benchmarks.
This introduces AArch64-first packed field abstractions with a scalar fallback and adds dedicated field-only validation/benchmark coverage before any ring or protocol integration.
Made-with: Cursor
* Refactor packed Fp128 backend to true SoA layout and stabilize benchmarking.
This switches packed lane storage to SoA with NEON add/sub kernels and a SoA mul path, and updates packed-field APIs and benches so scalar-vs-packed latency/throughput comparisons are measured consistently.
Made-with: Cursor
* Optimize packed Fp128 mul throughput with array-backed SoA lanes.
This keeps mul in true SoA form while removing repeated vector transmute overhead and inlining the limb-level Solinas lane kernel, improving packed mul throughput and latency against scalar baselines.
Made-with: Cursor
* Add Fp128 widening multiply API and specialized Solinas reduction
Expose mul_wide_u64, mul_wide, mul_wide_u128, solinas_reduce, and
to_limbs for deferred-reduction patterns needed by jolt-hachi.
Hand-optimized reduce paths for 3/4/5 limbs avoid generic loop
overhead. Refactor mul_raw to reuse mul_wide + reduce_4 (zero
overhead). Add 9 unit tests and widening/accumulator benchmarks.
Made-with: Cursor
* Clean up fp128: remove section banners, hoist std::ops imports, rename mul_wide free fn
Rename free function mul_wide → mul64_wide to avoid shadowing
Fp128::mul_wide. Move reduce_4 next to fold2_canonicalize. Replace
fully qualified std::ops::{Add,Sub,Mul,Neg} with use imports.
Made-with: Cursor
* Constrain Fp32/Fp64 to pseudo-Mersenne primes with Solinas reduction
Rework fp32.rs and fp64.rs to require p = 2^k - c (small c), matching
fp128's design. Compile-time constants BITS/C/MASK derived from P with
static assertions. Replace bit-serial reduction with two-fold Solinas
reduction (reduce_product for hot path, loop-based reduce_u64/u128 for
arbitrary inputs). Add widening ops (mul_wide, square, solinas_reduce).
Fix FieldSampling to use direct modular reduction instead of rejection
sampling. Blanket-impl PseudoMersenneField, remove manual impls. Rename
const generic MODULUS -> P at all call sites. Add latency + throughput
benchmarks. Hoist mid-function imports in tests/algebra.rs.
Made-with: Cursor
* Specialize Fp64 sub-word primes to u64-only arithmetic
For BITS < 64 (e.g. 2^40-195), avoid u128 intermediates in
reduce_product, add_raw, and sub_raw. Use mul_c_narrow which splits
C*high into u32x32->u64 widening multiplies (umaddl on AArch64),
preventing LLVM from promoting to u128. Brings 40-bit mul throughput
within 4% of 64-bit (690 vs 716 Melem/s), up from ~20% gap.
Made-with: Cursor
* Add 2^30 and 2^31 pseudo-Mersenne primes and expand benchmarks
Add Pow2Offset30Field (2^30-35) and Pow2Offset31Field (2^31-19) prime
definitions and type aliases. Refactor fp32/fp64 latency benchmarks with
chain_bench! macro, add throughput benchmarks for all new primes.
Made-with: Cursor
* Add NEON packed backends for Fp32 (4-wide) and Fp64 (2-wide)
PackedFp32Neon: 4 lanes in uint32x4_t with full NEON Solinas reduction
for mul (vmull_u32 + 2-fold reduce), umin trick for add/sub (BITS<=31),
overflow-aware paths for BITS==32. C_SHIFT_KIND optimization for
C=2^a+/-1.
PackedFp64Neon: 2 lanes in uint64x2_t with NEON add/sub (conditional
P for BITS<=62, carry-aware for BITS>=63), scalar-per-lane mul (no
native 64x64->128 on NEON).
Fp32 packed achieves 2.4-3.5x mul throughput and 3.5-5.0x add/sub
throughput over scalar. Includes HasPacking impls, type aliases,
NoPacking fallbacks, 7 correctness tests, and throughput benchmarks.
Made-with: Cursor
* Optimize packed Fp32/Fp64 Solinas multiply hot paths on NEON
For packed Fp32, remove the shift/add C-special-case in the Solinas fold and
always use vmull_u32 with a hoisted C broadcast, which improves stability and
removes the 24-bit mul regression. For packed Fp64, replace per-lane Fp64
wrapper multiplication with packed-local per-lane 64x64->128 products plus
specialized Solinas reduction (including the sub-word u64 fold path), reducing
mul overhead for both 40-bit and 64-bit packed variants.
Made-with: Cursor
* Tune packed Fp64 mul folding and add reducer/codegen probes
Switch packed Fp64 sub-word fold multiplication to direct `C*x`, which improves packed mul throughput in repeated A/B runs. Add dedicated reducer and codegen probe benches so we can compare 40-bit and 64-bit fold paths with instruction-level visibility.
Made-with: Cursor
* Optimize x86 BMI2 multiply paths for fp64/fp128 fields
Use BMI2 widening multiplies in scalar field hot paths and specialize x86 sub-word fold multiplication to a single 64-bit multiply, improving 40-bit fp64 throughput while keeping 64-bit and 128-bit paths stable.
Made-with: Cursor
* Optimize fp128 wide-limb multiply path for Jolt integration
Raise Hachi MSRV to 1.88, add specialized Fp128 mul_wide_limbs kernels for M={3,4} and OUT={4,5,6}, and add field_arith benches that track mul_wide_limbs-only and roundtrip costs to catch regressions.
Made-with: Cursor
* Specialize Fp128 CanonicalField small-int constructors
Make from_u64 use a direct canonical limb construction (no reduction path), fix from_i64 to use unsigned_abs to avoid i64::MIN overflow, and add a regression test for the min-value case.
Made-with: Cursor
* Impl sumchecks for hachi
* Add optimized one-hot commitment path for regular sparse witnesses
Exploits the structure of one-hot vectors (T chunks of K field elements,
each chunk with exactly one 1) to eliminate all inner ring multiplications.
Gadget decomposition of {0,1} coefficients is trivial (only level-0 digit
is nonzero), and the inner Ajtai t = A*s reduces to summing selected
columns of A with O(D) negacyclic rotations instead of O(D^2) ring muls.
Handles both K >= D and D >= K as long as one divides the other:
- K >= D: each nonzero ring element is a monomial X^j (single rotation)
- D >= K: each ring element is a sum of D/K monomials (multiple rotations)
Total inner cost: N_A * T * D coefficient additions (zero multiplications),
vs N_A * 2^M * delta * D^2 coefficient multiplications in the dense path.
Made-with: Cursor
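The O(D) negacyclic rotation at the heart of this path — multiplying a ring element by a monomial X^j instead of doing a full ring multiplication — can be sketched as a shift where coefficients that wrap past D are negated (X^D = -1). Parameters are illustrative:

```rust
const Q: u64 = 97;
const D: usize = 8;

/// Compute a * X^j in Z_q[X]/(X^D + 1): a coefficient shift with sign
/// flips on wraparound. O(D) work, zero multiplications.
fn rotate_negacyclic(a: &[u64; D], j: usize) -> [u64; D] {
    let mut out = [0u64; D];
    for (i, &c) in a.iter().enumerate() {
        let k = i + j;
        if k < D {
            out[k] = c;
        } else {
            out[k - D] = (Q - c) % Q; // wrapped past D: negate
        }
    }
    out
}

fn main() {
    let a = [1, 2, 3, 0, 0, 0, 0, 0];
    // Shifting by 6 pushes the degree-2 coefficient past D, negating it.
    assert_eq!(rotate_negacyclic(&a, 6), [Q - 3, 0, 0, 0, 0, 0, 1, 2]);
    println!("ok");
}
```

Summing such rotated columns of A is exactly how the one-hot path replaces O(D^2) ring multiplications with coefficient additions.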
* Apply rustfmt formatting to fp128 and field_arith bench
Made-with: Cursor
* Inject sumchecks into Hachi prover
* Absorb the commitment to w into the transcript
* Add AVX2 and AVX-512 packed field backends for Fp32, Fp64, Fp128
Implement vectorized SIMD arithmetic for x86_64:
- AVX2: 8-wide Fp32, 4-wide Fp64, 2-wide Fp128 (scalar delegation)
- AVX-512: 16-wide Fp32, 8-wide Fp64, 4-wide Fp128 (scalar delegation)
Fp32 uses even/odd lane split with 2-fold Solinas reduction.
Fp64 uses vectorized 64×64→128 schoolbook multiply (adapted from
plonky3 Goldilocks) with custom Solinas reduction for pseudo-Mersenne
primes p = 2^k - c.
Also: extract NEON backend into packed_neon.rs, add cfg-gated module
selection (AVX-512 > AVX2 > NEON > NoPacking), enable nightly
stdarch_x86_avx512 feature, add sumcheck-mix benchmark, and fix minor
clippy lints in fp64/fp128.
Made-with: Cursor
* Vectorize Fp128 packed add/sub on AVX-512 (8-wide) and AVX2 (4-wide)
Convert Fp128 packed backends from scalar delegation (AoS) to SoA layout
with vectorized add/sub via __m512i / __m256i. Mul remains scalar per-lane.
Add FIELD_OPS_PERF.md with Zen 5 benchmark results.
Fp128 packed add: +114% (1.08 → 2.31 Gelem/s on Zen 5 AVX-512)
Fp128 packed sub: +137% (1.34 → 3.18 Gelem/s)
Made-with: Cursor
* Add M4 Pro NEON benchmarks, remove mul_add experiment
Populate FIELD_OPS_PERF.md with Apple M4 Pro (NEON) results for all
primes across scalar, packed, and sumcheck MACC workloads. Remove
the experimental mul_add trait method (vectorized add already optimal
after inlining; scalar fused approach was 16% slower).
Made-with: Cursor
* Change sumcheck API
* Separate ring switch logic
* Rename sumchecks to NormSumcheck and RelationSumcheck
* Remove iteration prover
* Eliminate O(D^2) schoolbook ring multiplication from protocol hot paths
At production parameters (D=256/1024), schoolbook CyclotomicRing
multiplication is catastrophically expensive. Every protocol hot path
has exploitable operand structure that avoids the full D^2 cost:
- Add CyclotomicRing::mul_by_sparse for O(omega*D) sparse challenge
multiplication (90-140x speedup in compute_z_hat)
- Change RingOpeningPoint to store Vec<F> scalars; use scale() instead
of ring mul in compute_w_hat (256-1024x speedup)
- Add kron_scalars, kron_row_scale, kron_sparse_scale; refactor
generate_m to use scalar-aware Kronecker products
- Add zero-skip and scalar-detect in compute_r_via_poly_division
- Add sample_sparse_challenges, store Vec<SparseChallenge> in
QuadraticEquation throughout prover and verifier paths
Made-with: Cursor
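The sparse-challenge multiplication idea behind `mul_by_sparse` can be sketched as follows: when one operand has only omega nonzero (index, value) pairs, the D^2 schoolbook product collapses to omega shifted-and-scaled passes over the dense operand. This mirrors the idea, not the method's exact signature, and the parameters are illustrative:

```rust
const Q: u64 = 97;
const D: usize = 8;

/// a * sum_j(v_j * X^{i_j}) in Z_q[X]/(X^D + 1), in O(omega * D) work.
fn mul_by_sparse(a: &[u64; D], sparse: &[(usize, u64)]) -> [u64; D] {
    let mut out = [0u64; D];
    for &(j, v) in sparse {
        for (i, &c) in a.iter().enumerate() {
            let t = c * v % Q;
            let k = i + j;
            if k < D {
                out[k] = (out[k] + t) % Q;
            } else {
                out[k - D] = (out[k - D] + Q - t) % Q; // X^D = -1
            }
        }
    }
    out
}

fn main() {
    let a = [1, 2, 3, 4, 5, 6, 7, 8];
    // Multiplying by the sparse element {X^0 * 1} is the identity.
    assert_eq!(mul_by_sparse(&a, &[(0, 1)]), a);
    println!("ok");
}
```

With omega on the order of a few dozen and D = 256–1024, this is where the quoted 90–140x speedup comes from.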
* lint: section banner removal, naming hoist, cfg(test) for test-only paths
- Remove section banner comments (----, =====) repo-wide in src, tests, benches
- commitment_scheme: hoist RingCommitment, RingOpeningPoint, transcript labels
to top-level use; add #[cfg(test)] use for rederive_alpha_and_m_a body
(Blake2bTranscript, eval_ring_matrix_at, expand_m_a, labels) so that
function uses short names without polluting lib build
- Leave mod tests imports in place (no hoisting of test-module use blocks)
Made-with: Cursor
* Fix CI issues
---------
Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
Co-authored-by: Cursor <cursoragent@cursor.com>
…umcheck (#3) * Add rayon parallelism behind `parallel` feature flag (enabled by default) - New src/parallel.rs with cfg_iter!/cfg_into_iter!/cfg_chunks! macros that dispatch to rayon parallel iterators when `parallel` is enabled - Parallelize protocol hot paths: ring polynomial division, w_evals construction, M_alpha evaluation, ring vector evaluation, packed ring poly evaluation, coefficients-to-ring reduction, quadratic equation folding, and sumcheck round polynomial computation - All 174 tests pass with and without the parallel feature Made-with: Cursor * Add e2e benchmark and make HachiCommitmentScheme generic over config - Make HachiCommitmentScheme generic over <const D, Cfg> so different configs (and thus num_vars) can be used without code duplication. - Remove hardcoded DefaultCommitmentConfig::D from ring_switch.rs; WCommitmentConfig and commit_w now flow D generically. - Add benches/hachi_e2e.rs with configs sweeping nv=10,14,18,20. Made-with: Cursor * Refactor CRT-NTT backend: generalize over PrimeWidth, add Q128 support Make NTT primitives (NttPrime, NttTwiddles, MontCoeff, CyclotomicCrtNtt) generic over PrimeWidth (i16/i32) instead of hardcoding i16. Replace the monolithic QData struct with separate GarnerData and per-prime NttPrime arrays. Add Q128 parameter set (5 × i32 primes, D ≤ 1024) alongside the existing Q32 set. Simplify ScalarBackend by removing the const-generic limb count from to_ring_with_backend. Made-with: Cursor * Add extension field arithmetic and refactor sumcheck trait bounds Split CanonicalField into FromSmallInt (from_{u,i}{8,16,32,64} for all fields) and CanonicalField (u128 repr, base fields only). Implement FromSmallInt, Eq, Debug for Fp2/Fp4. Add ExtField<F> trait with EXT_DEGREE and from_base_slice. Optimize extension field arithmetic: Karatsuba multiplication for Fp2 and Fp4 (3 base muls instead of 4), specialized squaring (2 base muls for Fp2), non-residue IS_NEG_ONE specialization. 
Add concrete configs (TwoNr, NegOneNr, UnitNr) and type aliases Ext2<F>, Ext4<F>. Add transpose-based packed extension fields (PackedFp2, PackedFp4) for SIMD acceleration, following Plonky3's approach. Relax sumcheck bounds from E: CanonicalField to E: FromSmallInt (or E: FieldCore where spurious). Add sample_ext_challenge transcript helper. Includes tests for extension field sumcheck execution. Made-with: Cursor * Fix CRT+NTT correctness and optimize negacyclic NTT pipeline Correctness fixes: - Rewrite negacyclic NTT as twist + cyclic DIF/DIT pair (no bit-reversal permutation), correctly diagonalizing X^D+1. - Center coefficient→CRT mapping and Garner reconstruction to handle negacyclic sign wrapping consistently. - Fix i32 Montgomery csubp/caddp overflow via branchless i64 widening. - Fix q128 centering overflow in balanced_decompose_pow2 (avoid casting q≈2^128 into i128). - Remove dense-protocol schoolbook fallback; all mat-vec now routes through CRT+NTT. Performance optimizations: - Precompute per-stage twiddle roots in NttTwiddles (eliminate runtime pow_mod per butterfly stage). - Forward DIF butterfly skips reduce_range before Montgomery mul (safe because mul absorbs unreduced input). - Hoist centered-coefficient computation out of per-prime loop in from_ring. - Add fused pointwise multiply-accumulate for mat-vec inner loop. - Add batched mat_vec_mul_crt_ntt_many that precomputes matrix NTT once and reuses across many input vectors. - Wire commit_ring_blocks to batched A*s path. Benchmarks (D=64, Q32/K=6): - Single-prime forward+inverse NTT: 1.14µs → 0.43µs (2.7x) - CRT round-trip: 10.7µs → 6.3µs (1.7x) - Commit nv10: ~70% faster, nv20: ~47% faster Made-with: Cursor * Cache CRT+NTT matrix representations in setup to avoid repeated conversion The dense mat-vec paths (commit_ring_blocks, commit_onehot B-mul, compute_v) previously converted coefficient-form matrices to CRT+NTT on every call. 
Now the setup eagerly converts A, B, D into an NttMatrixCache and all dense operations use the pre-converted form. Coefficient-form matrices are retained for the onehot inner-product path and ring-switch/generate_m. Made-with: Cursor * Remove dead code (HachiRoutines, domains/, redundant trait methods) and extract shared field utilities - Delete unused HachiRoutines trait and dead algebra/domains/ module - Remove redundant FieldCore::add/sub/mul and Module::add/neg (covered by ops traits) - Extract is_pow2_u64, log2_pow2_u64, mul64_wide into fields/util.rs to deduplicate Made-with: Cursor * Unify Blake2b and Keccak transcript backends into generic HashTranscript Replace separate blake2b.rs and keccak.rs with a single generic HashTranscript<D: Digest> parameterized by hash function. Blake2bTranscript and KeccakTranscript are now type aliases. Made-with: Cursor * Fix sumcheck degree bug, split types, in-place fold, CommitWitness, rename configs, add soundness test - Fix CompressedUniPoly::degree() off-by-one that could let malformed proofs pass - Split sumcheck/mod.rs: extract types into types.rs, relocate multilinear_eval and fold_evals to algebra/poly.rs - Replace allocating fold_evals with in-place fold_evals_in_place - Add debug_assert guards to multilinear_eval and fold_evals_in_place - Introduce CommitWitness struct to replace error-prone 3-tuple returns - Rename DefaultCommitmentConfig to SmallTestCommitmentConfig, add ProductionFp128CommitmentConfig - Add verify_rejects_wrong_opening negative test for verifier soundness Made-with: Cursor * fix(test): resolve clippy needless_range_loop in algebra tests Use iter().enumerate() for schoolbook convolution loops and array::from_fn for pointwise NTT operations. Made-with: Cursor * Refactor commitment setup to runtime layout and staged artifacts. 
This removes compile-time commitment shape locks, derives beta from runtime layout, and threads layout-aware setup through commit/prove/verify with setup serialization roundtrip coverage. Made-with: Cursor * Soundness hardening: panic-free verifier, Fiat-Shamir binding, NTT overflow fix - Verifier path never panics; all errors return HachiError - Bind commitment, opening point, and y_ring in Fiat-Shamir transcript - Fix i16 csubp/caddp overflow by widening to i32 - multilinear_eval returns Result with dimension checks - build_w_evals validates w.len() is a multiple of d - UniPoly::degree uses saturating_sub instead of expect - Serialize usize as u64 for 32/64-bit portability - Fix from_i64(i64::MIN) via unsigned_abs - Remove Transcript::reset from public trait (move to inherent) - Add batched_sumcheck verifier empty-input guard Made-with: Cursor * Hoist fully qualified paths to use statements in touched files Replace inline crate::protocol::commitment::HachiCommitmentLayout, hachi_pcs::algebra::backend::{CrtReconstruct, NttPrimeOps}, and hachi_pcs::algebra::CyclotomicRing with top-level use imports. Made-with: Cursor * Dispatch norm sumcheck kernels by range size. Route small-b rounds through the point-eval interpolation kernel and keep the affine-coefficient kernel for larger b, while adding deterministic baseline-vs-dispatched benchmarks and parity tests to validate correctness across both strategies. Made-with: Cursor * Format commitment-related files for readability. Apply non-functional formatting and import ordering cleanups across commitment, ring-switch, and benchmark/test files to keep the codebase style consistent. Made-with: Cursor * Format: cargo fmt pass on commitment-related files Made-with: Cursor * feat: sequential coefficient ordering + streaming commitment Change coefficient-to-ring packing from strided to sequential, enabling true streaming where each trace chunk maps to exactly one inner Ajtai block. 
Implement StreamingCommitmentScheme for HachiCommitmentScheme. - reduce_coeffs_to_ring_elements: sequential packing (chunks_exact(D)) - prove/verify: opening point split flipped to (inner, outer) - ring_opening_point_from_field: outer split flipped to (M first, R second) - commit_coeffs: sequential block distribution - map_onehot_to_sparse_blocks: sequential block distribution - HachiChunkState + process_chunk / process_chunk_onehot / aggregate_chunks - Streaming commit tests (matches non-streaming, prove/verify roundtrip) Made-with: Cursor * refactor: decompose verify_batched_sumcheck into composable steps Split the monolithic verify_batched_sumcheck into three pieces: - verify_batched_sumcheck_rounds: replay rounds, return intermediate state - compute_batched_expected_output_claim: query verifier instances - check_batched_output_claim: enforce equality This enables callers (e.g. Greyhound) to intercept the intermediate sumcheck state before the final oracle check. The original function is preserved as a convenience wrapper. Made-with: Cursor * feat: accept Option<usize> in commit_onehot for sparse one-hot support Allows None entries in one-hot index arrays to represent inactive cycles. Adds public commit_onehot free function returning both commitment and hint. Made-with: Cursor * feat: submatrix commit for polynomials smaller than setup max commit_coeffs now accepts ring coefficient vectors shorter than the layout's full size, padding each block internally. prove/verify pad the opening point with zeros so the transcript stays consistent. This avoids materializing huge zero-padded field-element arrays. Made-with: Cursor * feat: add HachiSerialize impls for proof types Implement HachiSerialize/HachiDeserialize for HachiProof, HachiCommitmentHint, and SumcheckAux so they can be serialized through the ArkBridge adapter in Jolt. Made-with: Cursor * fix: relax balanced_decompose_pow2 assertion for 128-bit fields Allow levels * log_basis up to 128 + log_basis. 
For Fp128 with LOG_BASIS=4, the decomposition needs 33 levels (132 bits total) because 32 levels can't represent the full signed range [-q/2, q/2). The extra level's digit is at most ±1, and the i128 arithmetic remains safe since the quotient shrinks monotonically.
Made-with: Cursor
* feat: add DynamicSmallTestCommitmentConfig
Same D=16 security parameters as SmallTestCommitmentConfig, but derives layout from max_num_vars instead of using a fixed (4,2) shape.
Made-with: Cursor
* perf: true submatrix in commit_coeffs — skip zero blocks
Short polynomials no longer pad to block_len. commit_coeffs accepts fewer ring elements than num_blocks * block_len, decomposes only the non-zero blocks, and fills remaining entries with zero s/t_hat without allocation or mat-vec multiplication.
Also relax the debug_assert in mat_vec_mul_precomputed to >= (zip handles the shorter vector correctly).
Made-with: Cursor
* fix: use inner_width for zero_s in commit_coeffs/commit_onehot
prove expects s[i] to have inner_width entries. Use the correct length for zero blocks to match the dense path's decompose_block output size.
Made-with: Cursor
* fix: configure rayon with 64MB stack for D>=512 ring elements
CRT-NTT conversion puts ~28KB on the stack per ring element ([[MontCoeff; D]; K] + [i128; D]). With D=512 and the commit call-chain depth, rayon's default thread stack overflows. ensure_large_thread_stack() is called from setup() and is safe to call multiple times (only the first call configures the pool).
Made-with: Cursor
* feat: add commit_mixed for mega-polynomial commitment
Exposes a MegaPolyBlock enum (Dense/OneHot/Zero) and commit_mixed(), which processes heterogeneous blocks in a single commitment. This lets Jolt pack all witness polynomials into one Hachi commitment (one block per polynomial) instead of N independent commitments.
Also makes SparseBlockEntry and map_onehot_to_sparse_blocks public so callers can construct one-hot block descriptors.
Made-with: Cursor
* perf: drop s vectors from CommitWitness and HachiCommitmentHint
The basis-decomposed s_i vectors (one per block, each block_len*delta ring elements) were stored in both CommitWitness and HachiCommitmentHint. At production parameters (D=512, block_len=2048, delta=32), each s_i is 512 MB — storing all 64 of them consumed ~32 GB.
Instead, recompute s_i on the fly in compute_w_hat and compute_z_hat from ring_coeffs using decompose_block. Peak memory drops from O(blocks * block_len * delta) to O(block_len * delta) per thread.
Also adds setup_with_layout for caller-specified HachiCommitmentLayout, and makes decompose_block, SparseBlockEntry, map_onehot_to_sparse_blocks public for downstream (Jolt) mega-polynomial integration.
Made-with: Cursor
* chore: untrack docs/ and paper/ from version control
Keep these files locally for reference but remove them from the committed tree. They can be selectively re-added later.
Made-with: Cursor
* perf: fused sumcheck, split-eq streaming, compact w_evals — 8x memory reduction
Refactor the Hachi proving pipeline to eliminate the 13 GB matrix M and 2.6 GB vector z from memory, reducing peak prover allocation from ~30 GB to ~3.7 GB.
Key changes:
- QuadraticEquation: remove m/z fields; add compute_r_split_eq (split-eq factoring replaces full Kronecker materialization) and compute_m_a_streaming (row-at-a-time M·α evaluation).
- ring_switch: decompose z_pre on the fly in build_w_coeffs; add build_w_evals_compact returning Vec<i8> for round-0 storage (all entries fit in [-8, 7] from balanced_decompose_pow2 with LOG_BASIS=4).
- HachiSumcheckProver: fused norm+relation prover sharing a single w_table. Round 0 uses WTable::Compact(Vec<i8>), folding to WTable::Full(Vec<F>) at half size after the first challenge.
- HachiSumcheckVerifier: fused verifier combining both oracle checks with a batching_coeff sampled from the transcript.
- Remove dead batched mat-vec functions from linear.rs.
- Import hygiene: shorten crate::algebra::ring::X to crate::algebra::X; hoist mid-function use statements to top-level.
Made-with: Cursor
* revert: remove ensure_large_thread_stack rayon config
Stack sizing for D>=512 ring elements should be handled by the caller, not baked into the library's setup path.
Made-with: Cursor
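A caller-side sketch of the stack sizing this revert delegates: std's thread builder exposes `stack_size` directly, and rayon's `ThreadPoolBuilder` has an analogous knob. The 64 MiB constant, function names, and the 4 MiB stand-in temporary below are illustrative, not the crate's API.

```rust
use std::thread;

// Illustrative: deep call chains over large ring-element temporaries
// (e.g. ~28KB per element at D=512) can overflow a default thread stack,
// so the caller spawns workers with an explicit 64 MiB stack.
const WORKER_STACK_BYTES: usize = 64 * 1024 * 1024;

fn stack_heavy_sum() -> i64 {
    // A 4 MiB stack temporary standing in for per-ring-element scratch.
    let buf = [1i64; 1 << 19];
    buf.iter().sum()
}

fn run_with_large_stack() -> i64 {
    thread::Builder::new()
        .stack_size(WORKER_STACK_BYTES)
        .spawn(stack_heavy_sum)
        .expect("spawn worker")
        .join()
        .expect("join worker")
}
```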
…ine, NTT acceleration (#5)
* perf: parallelize commit phase and reduce allocations
- Add block-level parallelism to commit_ring_blocks, commit_coeffs, commit_onehot, and commit_mixed via cfg_iter!/cfg_into_iter!
- Parallelize vector-to-NTT conversion in mat_vec_mul_precomputed_with_params
- Cache CRT+NTT params inside NttMatrixCache, eliminating redundant select_crt_ntt_params calls on every mat-vec multiply
- Add balanced_decompose_pow2_into for in-place decomposition, removing per-element Vec allocations in decompose_block/decompose_rows
- Add inner_ajtai_onehot_t_only that skips the 16MB s-vector allocation when the caller discards it (commit_onehot, commit_mixed)
- Add one-hot and mixed commitment benchmarks to hachi_e2e
Made-with: Cursor
* chore: remove stale #[allow(non_snake_case)] from setup structs
HachiSetupSeed, HachiProverSetup, and HachiVerifierSetup have no uppercase fields — the allows were left over from earlier refactors.
Made-with: Cursor
* perf: hoist decomposition params to runtime, reduce allocations and cloning
Pre-existing change:
- Remove rows/cols from the matrix domain separator so the A matrix is reusable across poly/mega-poly layouts with the same m_vars.
New changes:
Move delta/tau/log_basis from CommitmentConfig associated constants into HachiCommitmentLayout runtime fields. This decouples decomposition parameters from the config type, allowing them to vary at runtime without monomorphization. All ~50 call sites updated.
Eliminate redundant work in the prover hot path:
- Flatten w_hat once and reuse it in both compute_v and compute_r_split_eq (it was flattened separately in each).
- Stream the z_hat decomposition directly in build_w_coeffs instead of collecting into a temporary Vec.
- Skip the unused w.to_vec() clone in ring_switch_verifier output.
- Take ownership of ring_opening_point and hint in QuadraticEquation constructors instead of cloning.
Reduce stack pressure for large ring elements (8KB at D=512, Fp128):
- Add CyclotomicRing::from_slice() to avoid std::array::from_fn intermediaries that create 8KB stack temporaries.
- Replace from_fn patterns in process_chunk, reduce_coeffs_to_ring_elements, commit_w, and compute_r_split_eq.
Made-with: Cursor
* feat: flexible decomposition depth and dual basis mode
Move DELTA/TAU/LOG_BASIS out of CommitmentConfig into runtime DecompositionParams (log_basis, log_coeff_bound). Delta and tau are now auto-derived from the coefficient bound, so small-coefficient polynomials (0/1, already range-checked) get proportionally cheaper commitments.
Add a BasisMode enum (Lagrange / Monomial) as a prove/verify-time parameter. Commitment is basis-agnostic; the mode only changes the tensor-product weights in the opening relation.
Made-with: Cursor
* fix: compute_m_a_streaming no longer needs padding
* refactor: unify polynomial API via HachiPolyOps trait, remove dead code, fix config validation
HachiPolyOps trait and implementations:
- Add HachiPolyOps<F, D> trait with 4 operation methods (evaluate_ring, fold_blocks, decompose_fold, commit_inner) replacing raw coefficient access
- Add DensePoly<F, D> for dense ring coefficient vectors
- Add OneHotPoly<F, D> for sparse one-hot polynomials with optimized ops
CommitmentScheme refactor:
- Parameterize CommitmentScheme<F, D> (was CommitmentScheme<F>)
- Generic commit/prove over P: HachiPolyOps<F, D>
- Rename OpeningProofHint to CommitHint, remove the Option wrapper from prove
- Remove batch_commit, combine_commitments, combine_hints
- Remove the StreamingCommitmentScheme trait, HachiChunkState, process_chunk*
Dead code removal:
- Delete the MegaPolyBlock enum and commit_mixed method
- Delete inner_ajtai_onehot (keep the _t_only variant)
- Delete the Polynomial trait, MultilinearLagrange trait
- Delete DenseMultilinearEvals and the multilinear_evals module
- Remove all unnecessary #[allow(...)] attributes
Proof simplification:
- Remove ring_coeffs from
HachiCommitmentHint (only t_hat remains)
- Update quadratic_equation to use HachiPolyOps methods
Config fix:
- Remove the overly strict delta*log_basis > 128 check in config.rs; balanced_decompose_pow2 already enforces the correct bound (levels*log_basis <= 128+log_basis)
Documentation:
- Add docs to all public items in test_utils and packed_ext
- Remove #[allow(missing_docs)] from the parallel, test_utils, packed_ext modules
Made-with: Cursor
* fix: remove test for deleted delta*log_basis validation
The setup_rejects_invalid_digit_budget test asserted the overly strict delta*log_basis > 128 check that was intentionally removed in the previous commit. Delete the test and its BadDigitBudgetConfig.
Made-with: Cursor
* style: fix formatting in ring_commitment_core.rs
Made-with: Cursor
* perf: parallelize proving hot paths, eliminate per-proof w-commitment setup
Parallelize the three proving bottlenecks (quad_eq, ring_switch, sumcheck) and remove the per-proof matrix generation in commit_w by reusing the main NTT cache.
Proving hot-path parallelism:
- Parallelize round-0 norm and relation sumcheck via the cfg_fold_reduce! macro
- Parallelize DensePoly::decompose_fold with parallel fold-reduce over blocks
- Parallelize fold_evals_in_place and build_w_evals_compact with cfg_into_iter!
- Add a cfg_fold_reduce! macro to unify parallel/sequential fold-reduce patterns
- Unify compute_round_{norm,relation}_{compact,full} into single generic fns
Sumcheck micro-optimizations:
- Unroll the 3-point relation evaluation to avoid redundant from_u64 conversions and multiply-by-zero/one at evaluation points 0 and 1
- Hoist gadget_recompose_pow2 out of the per-row loop in compute_r_split_eq
Eliminate per-proof w-commitment setup:
- Add w_ring_element_count() and w_commitment_layout() helpers to compute w-commitment matrix dimensions from the main layout
- Widen the A/B matrices at setup time to max(main, w) column counts so the main NTT cache always covers the w-commitment (required when delta_commit=1, e.g.
boolean polynomials)
- Rewrite commit_w to take &NttMatrixCache directly, inlining the commit logic with flat_map instead of an intermediate Vec<Vec<...>>
- Remove the w_setup field from HachiProverSetup
- Add ensure_matrix_shape_ge for >= column checks on widened matrices
Naming cleanup:
- Rename delta -> num_digits_commit, tau -> num_digits_fold, log_coeff_bound -> log_commit_bound throughout
- Add log_open_bound to DecompositionParams for recursive w commitments
- Hoist fully qualified paths (std::ops, std::mem, std::iter, crate::protocol::ring_switch::w_commitment_layout) to use statements
Made-with: Cursor
* perf: profile and accelerate opening proof hot paths
Replace the D/B-row schoolbook quotient extraction with an NTT-based unreduced quotient path, and add targeted tracing spans/timers plus a Perfetto profile example so prover bottlenecks are visible and cheaper to iterate on. Temporarily force the point-eval norm kernel to isolate fused-sumcheck behavior during profiling.
Made-with: Cursor
* perf: NTT-accelerate A-rows, reduce basis 16→8, fix saturation bug
Three optimizations to the proving pipeline:
1. NTT-accelerate A-rows in compute_r_split_eq: use unreduced_quotient_rows_ntt_cached for A*z_pre (O(D log D) instead of O(D^2) schoolbook). Also exploit the sparse challenge structure in add_sparse_ring_product (O(weight*D) instead of O(D^2)).
2. Reduce the decomposition basis from 16 to 8 (log_basis 4→3): halves the norm sumcheck range-check polynomial degree from 31 to 15, yielding ~4x speedup on the dominant prove-time bottleneck. Soundness is strictly improved (smaller MSIS norm bound).
3. Fix a u128 saturation bug in compute_num_digits and r_decomp_levels that caused an incorrect extra decomposition level when b^levels overflows u128. Skip the balanced-range check when levels*log_basis > log_bound, since the digit range is mathematically guaranteed sufficient for b >= 4.
Also: replace the hardcoded LOG_BASIS const with a log_basis() function derived from TinyConfig, fuse decompose+sparse-mul in decompose_fold to i32 arithmetic, and add a balanced_decompose_pow2_i8 variant.
Net result: prove time 4.76s → 1.57s (3.0x speedup) at num_vars=19.
Made-with: Cursor
* perf: i8 digit pipeline for w_hat — bypass Fp128 for small decomposed digits
Store w_hat/w_hat_flat as [i8; D] instead of CyclotomicRing<Fp128, D>, eliminating redundant field arithmetic on values in [-b/2, b/2).
- Add balanced_decompose_pow2_i8 and gadget_recompose_pow2_i8
- Add CyclotomicCrtNtt::from_i8_with_params / from_i8_cyclic for direct i8 → CRT+NTT conversion (skips Fp128 centering)
- Add mat_vec_mul_ntt_cached_i8 and unreduced_quotient_rows_ntt_cached_i8
- Change QuadraticEquation w_hat/w_hat_flat types + all consumers
- Simplify build_w_coeffs to write i8 digits directly as field elements
Made-with: Cursor
* perf(poly): optimize range_check_eval and fold_evals_in_place
range_check_eval: precompute w² and use (w²−k²) instead of (w−k)(w+k), saving one multiply per factor.
fold_evals_in_place: fold in place with truncate() instead of allocating a new Vec, removing the rayon dependency from this function.
Made-with: Cursor
* refactor(sumcheck): centralize and optimize norm sumcheck computation
Extract the duplicated norm round polynomial logic from NormSumcheckProver and HachiSumcheckProver into shared compute_norm_round_poly() and compute_norm_round_poly_compact() functions.
Optimizations:
- Flat contiguous storage for RangeAffinePrecomp (coeff_mix_flat + row_offsets)
- Precomputed small-integer LUT (h_i(w_0)) for round-0 compact accumulation
- Native i128 range-check evaluation path for b <= 10
- Precomputed squared offsets in PointEvalPrecomp
- Make choose_round_kernel public with an env var override and b-threshold dispatch
Made-with: Cursor
* feat(protocol): multi-level recursive folding proof
Replace the single-shot proof with recursive multi-level folding.
Instead of sending the full w vector after one round of quad_eq → ring_switch → sumcheck, the prover now recursively commits to w and opens it via the same protocol until w is small enough to send directly.
Key changes:
- HachiProof now holds Vec<HachiLevelProof> + final_w instead of flat fields
- Remove SumcheckAux; each level carries a w_eval claim instead
- Extract prove_one_level / verify_one_level from the monolithic prove/verify
- Folding stops via the should_stop_folding heuristic (MIN_W_LEN_FOR_FOLDING, MIN_SHRINK_RATIO)
- QuadraticEquation takes an explicit layout parameter for per-level configs
- ring_switch exports WCommitmentConfig for recursive w-openings
- The D matrix is widened to max(layout, w_layout) for shared setup
- HachiSumcheckVerifier gains w_val_override for intermediate levels
Made-with: Cursor
* chore(examples): update profile example for multi-level proofs and A/B kernel testing
- Extract a run_prove() helper for reuse across kernel configs
- Add an A/B test mode (HACHI_AB_TEST=1) to compare affine_coeff vs point_eval
- Update the layout from (6,4) to (8,8)
- Report multi-level proof stats (levels, final_w length, proof size)
- Set a 64 MiB rayon stack size
Made-with: Cursor
* style: remove section banners and hoist mid-function use statement
- Remove redundant section banner comments in proof.rs and commitment_scheme.rs
- Move the choose_round_kernel import from the function body to top level in hachi_sumcheck.rs
Made-with: Cursor
* perf(algebra): use bitwise ops for balanced digit decomposition
Replace rem_euclid(b) with bitwise AND and division with right shift in the CyclotomicRing digit decomposition (decompose_balanced, decompose_balanced_digit_planes, decompose_balanced_i8) and DensePoly commit_with_setup. Valid since b is always a power of two.
Made-with: Cursor
* perf: store t_hat as i8 digit planes, cache w_folded to skip recompose
Switch t_hat storage from Vec<Vec<CyclotomicRing<F,D>>> to Vec<Vec<[i8;D]>> throughout the commitment and proving pipeline.
Decomposed digits fit in log_basis bits (typically 3), so i8 is sufficient and avoids carrying full field-element ring elements through commit, ring-switch, and serialization.
Key changes:
- CommitWitness and HachiCommitmentHint now hold [i8; D] digit planes
- New i8 variants: decompose_block_i8, decompose_rows_i8, mat_vec_mul_ntt_cached_i8, gadget_recompose_pow2_i8
- HachiPolyOps::commit_blocks returns [i8; D] digit planes
- QuadraticEquation caches w_folded (pre-decomposition folded ring elements) so compute_r_split_eq avoids a gadget_recompose roundtrip
- Precomputed idx/sign lookup tables for sparse challenge multiplication
- Custom i8 serialization for HachiCommitmentHint
- Remove a bogus debug_assert constraining ring degree D<=128 in build_w_evals_compact (it was checking log2(D) but the message said log_basis)
Made-with: Cursor
* perf: optimize hot paths in commit/prove pipeline
- Hoist NTT conversions out of per-row quotient loops (crt_ntt, linear)
- Precompute c_alpha in compute_m_a_streaming (quadratic_equation)
- Compact alpha/m tables with variable-specific folding (sumcheck)
- Eliminate t_hat_flat rematerialization and zero_t_hat clones (commit, ring_switch, hachi_poly_ops)
- Merge duplicate w-eval passes (ring_switch, commitment_scheme)
- Clean up fully qualified paths (linear, relation_sumcheck, hachi_poly_ops)
Made-with: Cursor
* feat(algebra): add wide unreduced accumulators and fused shift-accumulate
Add Fp32x2i32, Fp64x4i32, and Fp128x8i32 types that split field elements into 16-bit limbs in i32 slots for carry-free SIMD-friendly addition. The overflow budget is ~32k signed adds before reduction.
Add shift_accumulate_into / shift_sub_into / mul_by_monomial_sum_into on CyclotomicRing for fused negacyclic shift + accumulate without temporary ring allocations.
Make the field offset constants C public.
Made-with: Cursor
* refactor(protocol): per-matrix NttSlotCache, fused one-hot commit, bench stack fix
Replace the monolithic NttMatrixCache with a per-matrix NttSlotCache, removing HachiPreparedSetup and the MatrixSlot enum. HachiProverSetup now holds three independent NttSlotCache instances (A, B, D). Simplify the dispatch macros in linear.rs to operate on a single slot.
Add a CommitCache associated type to the HachiPolyOps trait. Wire the one-hot commit path to use fused mul_by_monomial_sum_into, eliminating temporary allocations.
Fix a pre-existing benchmark stack overflow by configuring rayon with a 64MB thread stack (matching examples/profile.rs).
Made-with: Cursor
* feat(commit): column-tiled A matvec for cache-efficient commitment
Add mat_vec_mul_ntt_tiled_i8 and mat_vec_mul_ntt_tiled_single_i8, which tile the NTT matrix columns into L2-sized chunks (~400 cols). Each rayon thread owns one tile and iterates over all blocks, so the matrix is loaded from DRAM exactly once. Ring coefficients are decomposed on the fly per tile to avoid full digit materialization.
All call sites (commit, commit_coeffs, commit_onehot, ring_switch, quadratic_equation, HachiPolyOps::commit_inner) are updated to use the tiled API. Reduces total DRAM traffic ~25x for large traces.
Made-with: Cursor
* refactor: promote TWO_INV and ZERO to const associated items on FieldCore
Hoists two_inv from a trait method to a compile-time constant, and adds const ZERO so extension fields (Fp2, Fp4) can build their TWO_INV without runtime calls. Deduplicates the CrtNttParamSet computation across the A/B/D caches.
Made-with: Cursor
* refactor: remove two_inv parameters now that TWO_INV is a const
Functions and macros no longer thread two_inv through call chains; they reference F::TWO_INV directly. Also removes the runtime computation in batched_sumcheck.
Made-with: Cursor
* feat(commit): stub HachiSerialize for HachiProverSetup
Add Valid + HachiSerialize impls for HachiProverSetup that return an error on serialize (the NTT caches are runtime artifacts). Needed by downstream wrappers that require the trait bound.
Made-with: Cursor
* perf: fuse hot loops, eliminate allocations, cheaper CRT reduction
- mul_by_sparse: use shift_accumulate_into/shift_sub_into for ±1 coeffs
- inverse NTT: fuse the d_inv and psi_inv trailing passes into one loop
- CRT conversion: replace __modti3 (i128 % i128) with split i64 arithmetic
- Fp128 sqr_raw: 3 widening muls instead of 4 via squaring symmetry
- decompose_block_i8: add an _into variant, reuse the buffer across tiles
- sumcheck: fuse norm+relation into a single pass over w_table
- ring_switch: fuse expand_m_a+build_m_evals_x, rayon::join parallel phases
- ring_switch: build_w_evals_dual uses unzip instead of a triple allocation
- quadratic_equation: hoist scratch allocations out of the row loop
Made-with: Cursor
* feat: wide ring accumulators with NEON SIMD for one-hot commitment
Introduce carry-free wide accumulators (Fp32x2i32, Fp64x4i32, Fp128x8i32) that defer modular reduction during one-hot commitment, yielding 69x faster commit for sparse witnesses.
Key changes:
- AdditiveGroup trait decoupling additive ops from the full FieldCore
- WideCyclotomicRing<W, D> for carry-free ring accumulation
- HasWide / ReduceTo traits for type-level wide ↔ canonical dispatch
- NEON SIMD backends for Fp64x4i32 and Fp128x8i32 with scalar fallback
- inner_ajtai_onehot_wide replaces inner_ajtai_onehot_t_only
- The profile example now covers both dense and one-hot paths
Made-with: Cursor
* refactor: drop "_tiled" suffix from mat-vec functions
Tiling is an internal optimization detail, not an API distinction. The tiled versions are the only production path; non-tiled variants exist only as #[cfg(test)] reference implementations.
Made-with: Cursor
* refactor: rename Fp128CommitmentConfig, hoist inline qualified path
- Drop the "Production" prefix from ProductionFp128CommitmentConfig
- Hoist crate::algebra::fields::LiftBase to a use statement in sparse_challenge.rs
Made-with: Cursor
* feat: pack final_w as balanced digits, use Vec<i8> throughout prover
Represent the prover's witness vector w as Vec<i8> instead of Vec<F> throughout the folding pipeline. Introduces PackedDigits to bit-pack the final-level w into log_basis bits per element, reducing proof size by ~32x. Cleans up import hygiene in the profile example and proof module.
Made-with: Cursor
* perf: use const digit lookup table for i8-to-field conversion
Add a const fn digit_lut to Fp128 and a FromSmallInt trait for precomputing balanced-digit-to-field-element tables. Replaces per-element from_i64 calls with indexed loads in the three hot prover loops (commit_w, build_w_evals_dual, dense_poly_from_w).
Made-with: Cursor
* perf: add DigitMontLut for i8 mat-vec kernels, clean up imports
Add a precomputed Montgomery lookup table (DigitMontLut) for balanced digit values {-8..7}, replacing per-coefficient from_canonical calls in the i8→CRT+NTT conversion hot path. Wire it into mat_vec_mul_ntt_i8, mat_vec_mul_ntt_single_i8, and unreduced_quotient_rows_ntt_cached_i8.
Also: merge duplicate NTT butterfly imports, remove the duplicated doc comment on from_ring_cyclic, export DigitMontLut through the ring/algebra modules, apply cargo fmt.
Made-with: Cursor
* perf: NEON SIMD kernels, decompose_fold optimization, explicit layout API
Add AArch64 NEON SIMD for NTT butterflies, pointwise multiply-accumulate, and add-reduce (neon.rs). Dispatch from butterfly.rs and linear.rs with a runtime feature check and scalar fallback.
Optimize DensePoly::decompose_fold with a two-phase restructure: K=3 interleaved carry chains for ILP on decomposition, then a NEON rotate-and-add scatter (decompose_fold_neon.rs). ~2x speedup on compute_z_pre.
Optimize OneHotPoly::decompose_fold by replacing the O(omega*D) mul_by_sparse with a direct sparse scatter, O(omega*|nonzero_coeffs|). ~22x speedup.
Thread an explicit HachiCommitmentLayout through commit/prove/verify instead of computing it from the setup internally. Add an OneHotIndex trait for generic one-hot indices. The profile example now uses OneHotPoly end-to-end for the one-hot path.
Clean up imports: hoist qualified crate::algebra::ntt::neon paths, move test-function use statements to module scope.
Made-with: Cursor
* perf: unreduced accumulation for sumcheck, fused compact round-0 loop
Introduce a HasUnreducedOps trait with MulU64Accum / ProductAccum types for Fp64, Fp128, and Fp2, enabling widening multiplies that defer reduction until after accumulation.
Key changes:
- Fuse the norm + relation computation into a single pass for the compact table (Round 0) via compute_round_compact_fused, using split pos/neg MulU64Accum for the relation and i128/LUT arithmetic for the norm.
- Sparse integer representation for the affine-coeff precomputation (SparseCoeffEntry) with a batched x4 kernel (compute_entry_coeffs_x4).
- Two-level inner/outer ProductAccum accumulation for the affine-coeff kernel, on both the compact and full-field paths.
- Optimize fold_compact_to_full to use mul_u64_unreduced for r * delta.
- Parallelize OneHotPoly::evaluate_ring, fold_blocks, decompose_fold.
- Add a FromSmallInt::from_i128 default method.
Made-with: Cursor
* perf: two-level ProductAccum for full-field affine-coeff kernel
Upgrade the WTable::Full + AffineCoeffComposition path in HachiSumcheckProver to use two-level ProductAccum accumulation (outer loop over e_second, inner mul_to_product_accum, single reduction per j_high block), matching the standalone norm_sumcheck.rs implementation.
Also fix the missing FromSmallInt bound on multilinear_eval_small, switch the commitment_scheme w_eval to use w_evals_field (w_evals is moved), and add the missing doc on a ScaleI32 trait method.
Made-with: Cursor
* style: rustfmt formatting for poly.rs and hachi_sumcheck.rs
Made-with: Cursor
* fix(ci): use compound assignment operators to satisfy clippy
Made-with: Cursor
* chore: remove docs/ and paper/ from tracked files
Backed up to the quang/temp-docs branch. The files remain on disk.
Made-with: Cursor
* fix(ci): implement assign traits and fix all clippy assign_op_pattern lints
Add MulAssign for Fp128, and AddAssign/SubAssign/MulAssign for all PackedNeon types. Convert all x = x op y patterns to x op= y across benches, tests, and lib.
Made-with: Cursor
* fix(ci): add assign traits to NoPacking, AVX2/AVX512 packed types, Fp32, Fp64
NoPacking<T> (the x86_64 fallback) was missing AddAssign/SubAssign/MulAssign, causing CI failures on the GitHub runner. Add assign traits uniformly across all packed backends and scalar field types. Fix the remaining assign_op_pattern lints in benches and tests.
Made-with: Cursor
* fix(ci): fix no-default-features clippy — unused var, dead code, rayon gate
- Allow unused rel_combine (only used in the parallel reduce combiner)
- Allow dead_code on add_ntt_into (only used in parallel + aarch64)
- Gate rayon::ThreadPoolBuilder behind cfg(feature = "parallel")
- Fix the remaining assign_op_pattern in the norm_sumcheck bench
Made-with: Cursor
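The deferred-reduction idea behind the MulU64Accum/ProductAccum commits above can be shown in miniature: widen each product, accumulate without reducing, and perform a single modular reduction at the end. The modulus and function names below are toy stand-ins, not the crate's types.

```rust
// An illustrative 61-bit prime, not the crate's modulus.
const Q: u64 = (1u64 << 61) - 1;

// Reference path: reduce after every multiply and every add.
fn dot_reduced(a: &[u64], b: &[u64]) -> u64 {
    a.iter().zip(b).fold(0u64, |acc, (&x, &y)| {
        let p = ((x as u128 * y as u128) % Q as u128) as u64;
        ((acc as u128 + p as u128) % Q as u128) as u64
    })
}

// Unreduced path: the u128 accumulator absorbs many ~122-bit products
// before one reduction. A real implementation tracks the overflow
// budget; here the inputs are small enough that u128 cannot overflow.
fn dot_deferred(a: &[u64], b: &[u64]) -> u64 {
    let acc: u128 = a.iter().zip(b).map(|(&x, &y)| x as u128 * y as u128).sum();
    (acc % Q as u128) as u64
}
```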
… infrastructure (#7)
* fix: separate delta_commit and delta_open for t_hat decomposition
t = A * s produces full-field-size coefficients even when s has small (delta_commit-digit) entries. The code was decomposing t_hat using delta_commit instead of delta_open, causing lossy truncation and breaking verification for the onehot/logbasis commitment configs.
Split commit_inner's num_digits parameter into num_digits_commit (for s) and num_digits_open (for t_hat), and propagate this distinction through layout, commit, quadratic_equation, and ring_switch.
Also:
- Add Fp128FullCommitmentConfig, Fp128OneHotCommitmentConfig, and Fp128LogBasisCommitmentConfig bounded commitment configs
- Add optimal_m_r_split for dynamic m/r layout selection
- Refactor the profile example to be generic over CommitmentConfig and accept HACHI_NUM_VARS / HACHI_MODE env vars
Made-with: Cursor
* refactor(algebra): add repr(transparent) to CyclotomicRing types
Enables safe transmute between `[CyclotomicRing<F, D>]` and `[F]` for the upcoming FlatMatrix D-agnostic storage layer.
Made-with: Cursor
* refactor(commitment): D-agnostic FlatMatrix storage + halving-D scaffolding
Replace `Vec<Vec<CyclotomicRing<F, D>>>` in HachiExpandedSetup with `FlatMatrix<F>`, a D-agnostic flat field-element array that can be viewed at any ring dimension via `.view::<D>()`. This decouples setup storage from the const-generic D, enabling future varying-D prove loops.
Key changes:
- HachiExpandedSetup<F, D> → HachiExpandedSetup<F> (loses D)
- HachiVerifierSetup<F, D> → HachiVerifierSetup<F>
- NTT/CRT functions take RingMatrixView instead of &[Vec<CyclotomicRing>]
- New FlatMatrix, NttCache, and dispatch_ring_dim! infrastructure
- New CommitmentConfig::d_at_level / n_a_at_level trait methods
- New Fp128HalvingDCommitmentConfig (D=512→256→128→64)
- commit_w made pub for future varying-D usage
Made-with: Cursor
* refactor(bench): rewrite benchmarks with real configs and parameterized D
Replace the hand-rolled bench_config!
macro with real commitment configs (Fp128FullCommitmentConfig, Fp128OneHotCommitmentConfig, Fp128LogBasisCommitmentConfig). Parameterize D as a const generic instead of hardcoding it. Use random evaluations and iter_batched for the prove bench, and add a HACHI_PARALLEL=0 env var for single-threaded runs.
Made-with: Cursor
* fix: eliminate debug-build stack overflow via dispatch extraction and NTT cache boxing
Extract the dispatch_ring_dim!/dispatch_with_ntt! macro expansions into dedicated #[inline(never)] functions (dispatch_prove_level, dispatch_verify_level, dispatch_commit) so monomorphized match arms live in separate stack frames instead of bloating the caller.
Box the NttSlotCache<D> fields inside MultiDNttCaches to avoid ~465KB temporaries on the stack when constructing MultiDNttBundle. Remove the with_large_stack test wrappers and .cargo/config.toml — all tests now pass with the default 2MB stack in debug builds.
Clean up import hygiene: hoist in-function use statements, replace inline fully-qualified paths with top-level imports.
Made-with: Cursor
* fix: broken doc links and clippy needless_range_loop
- Use crate-qualified paths for the MultiDNttBundle and HachiExpandedSetup doc links in the dispatch_with_ntt macro
- Replace an index loop with an iterator in the flat_matrix test
Made-with: Cursor
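The FlatMatrix idea above can be sketched with safe chunking; the real crate additionally relies on repr(transparent) to reinterpret the flat storage as `[CyclotomicRing<F, D>]`. The struct name and the exact `view` signature below are illustrative.

```rust
// A D-agnostic flat field-element array: the same buffer can be viewed
// as ring elements of any dimension D that divides its length.
struct FlatMatrix<F> {
    data: Vec<F>,
}

impl<F> FlatMatrix<F> {
    // Each chunk of D consecutive field elements is one ring element.
    fn view<const D: usize>(&self) -> std::slice::ChunksExact<'_, F> {
        assert_eq!(self.data.len() % D, 0, "length must be a multiple of D");
        self.data.chunks_exact(D)
    }
}
```

The same storage then serves a halving-D schedule (e.g. D=512→256→128→64) without reallocating.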
* Add rayon parallelism behind `parallel` feature flag (enabled by default)
- New src/parallel.rs with cfg_iter!/cfg_into_iter!/cfg_chunks! macros
that dispatch to rayon parallel iterators when `parallel` is enabled
- Parallelize protocol hot paths: ring polynomial division, w_evals
construction, M_alpha evaluation, ring vector evaluation, packed ring
poly evaluation, coefficients-to-ring reduction, quadratic equation
folding, and sumcheck round polynomial computation
- All 174 tests pass with and without the parallel feature
Made-with: Cursor
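The cfg_iter!-style dispatch can be sketched as follows, modeled on the pattern popularized by ark-std's macro of the same name; the body here is a simplified stand-in for the crate's src/parallel.rs.

```rust
// With the `parallel` feature the macro expands to rayon's par_iter();
// without it, to a plain sequential iter(). Same call sites either way.
macro_rules! cfg_iter {
    ($collection:expr) => {{
        #[cfg(feature = "parallel")]
        let iter = $collection.par_iter();
        #[cfg(not(feature = "parallel"))]
        let iter = $collection.iter();
        iter
    }};
}

// A hot loop written once, compiled to either backend.
fn sum_squares(xs: &[u64]) -> u64 {
    cfg_iter!(xs).map(|x| x * x).sum()
}
```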
* Add e2e benchmark and make HachiCommitmentScheme generic over config
- Make HachiCommitmentScheme generic over <const D, Cfg> so different
configs (and thus num_vars) can be used without code duplication.
- Remove hardcoded DefaultCommitmentConfig::D from ring_switch.rs;
WCommitmentConfig and commit_w now flow D generically.
- Add benches/hachi_e2e.rs with configs sweeping nv=10,14,18,20.
Made-with: Cursor
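A minimal sketch of the `<const D, Cfg>` parameterization, assuming a hypothetical config trait with a single NUM_VARS constant (the crate's CommitmentConfig carries much more than this):

```rust
use std::marker::PhantomData;

// Hypothetical minimal config trait: each preset fixes num_vars.
trait CommitmentConfig {
    const NUM_VARS: usize;
}

struct Nv10;
impl CommitmentConfig for Nv10 { const NUM_VARS: usize = 10; }

struct Nv20;
impl CommitmentConfig for Nv20 { const NUM_VARS: usize = 20; }

// One scheme implementation shared by every (ring degree, config) pair.
struct Scheme<const D: usize, Cfg: CommitmentConfig> {
    _cfg: PhantomData<Cfg>,
}

impl<const D: usize, Cfg: CommitmentConfig> Scheme<D, Cfg> {
    // Degree-D ring elements needed to pack 2^NUM_VARS coefficients.
    fn num_ring_elements() -> usize {
        (1usize << Cfg::NUM_VARS).div_ceil(D)
    }
}
```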
* Refactor CRT-NTT backend: generalize over PrimeWidth, add Q128 support
Make NTT primitives (NttPrime, NttTwiddles, MontCoeff, CyclotomicCrtNtt)
generic over PrimeWidth (i16/i32) instead of hardcoding i16. Replace the
monolithic QData struct with separate GarnerData and per-prime NttPrime
arrays. Add Q128 parameter set (5 × i32 primes, D ≤ 1024) alongside the
existing Q32 set. Simplify ScalarBackend by removing the const-generic
limb count from to_ring_with_backend.
Made-with: Cursor
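A toy sketch of the PrimeWidth abstraction: NTT arithmetic written once over a limb type with an associated widened type for products. The trait contents are illustrative, not the crate's exact interface.

```rust
// Limb-width abstraction: i16 limbs widen to i32, i32 limbs to i64,
// so one generic kernel covers both the Q32-style and Q128-style primes.
trait PrimeWidth: Copy {
    type Wide: Copy + std::ops::Rem<Output = Self::Wide>;
    fn mul_wide(self, rhs: Self) -> Self::Wide;
}

impl PrimeWidth for i16 {
    type Wide = i32;
    fn mul_wide(self, rhs: Self) -> i32 {
        self as i32 * rhs as i32
    }
}

impl PrimeWidth for i32 {
    type Wide = i64;
    fn mul_wide(self, rhs: Self) -> i64 {
        self as i64 * rhs as i64
    }
}

// A width-generic modular multiply: widen, multiply, reduce.
fn mod_mul<W: PrimeWidth>(a: W, b: W, q: W::Wide) -> W::Wide {
    a.mul_wide(b) % q
}
```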
* Add extension field arithmetic and refactor sumcheck trait bounds
Split CanonicalField into FromSmallInt (from_{u,i}{8,16,32,64} for all
fields) and CanonicalField (u128 repr, base fields only). Implement
FromSmallInt, Eq, Debug for Fp2/Fp4. Add ExtField<F> trait with
EXT_DEGREE and from_base_slice.
Optimize extension field arithmetic: Karatsuba multiplication for Fp2
and Fp4 (3 base muls instead of 4), specialized squaring (2 base muls
for Fp2), non-residue IS_NEG_ONE specialization. Add concrete configs
(TwoNr, NegOneNr, UnitNr) and type aliases Ext2<F>, Ext4<F>.
Add transpose-based packed extension fields (PackedFp2, PackedFp4)
for SIMD acceleration, following Plonky3's approach.
Relax sumcheck bounds from E: CanonicalField to E: FromSmallInt (or
E: FieldCore where spurious). Add sample_ext_challenge transcript
helper. Includes tests for extension field sumcheck execution.
Made-with: Cursor
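The 3-mul Karatsuba for Fp2 mentioned above can be sketched over a toy base field. The modulus `P` and non-residue `NR` here are illustrative stand-ins, not the crate's Fp64/Fp128 parameters, and `NR` is only assumed to be a quadratic non-residue.

```rust
// Karatsuba multiplication in Fp2 = Fp[x]/(x^2 - NR): 3 base-field muls
// instead of the schoolbook 4. P and NR are illustrative values only.
const P: u64 = 2_147_483_647; // Mersenne prime 2^31 - 1
const NR: u64 = 7; // assumed quadratic non-residue mod P (illustrative)

fn fp_mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }
fn fp_add(a: u64, b: u64) -> u64 { (a + b) % P }
fn fp_sub(a: u64, b: u64) -> u64 { (a + P - b) % P }

/// (a0 + a1*x) * (b0 + b1*x) mod (x^2 - NR)
fn fp2_mul(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let v0 = fp_mul(a.0, b.0);                          // a0*b0
    let v1 = fp_mul(a.1, b.1);                          // a1*b1
    let t = fp_mul(fp_add(a.0, a.1), fp_add(b.0, b.1)); // (a0+a1)(b0+b1)
    let c0 = fp_add(v0, fp_mul(NR, v1));                // a0*b0 + NR*a1*b1
    let c1 = fp_sub(t, fp_add(v0, v1));                 // a0*b1 + a1*b0
    (c0, c1)
}

fn main() {
    // Cross-check against the schoolbook formula on a sample pair.
    let (a, b) = ((3u64, 5u64), (11u64, 13u64));
    let school = (
        fp_add(fp_mul(a.0, b.0), fp_mul(NR, fp_mul(a.1, b.1))),
        fp_add(fp_mul(a.0, b.1), fp_mul(a.1, b.0)),
    );
    assert_eq!(fp2_mul(a, b), school);
}
```

The IS_NEG_ONE specialization noted in the commit replaces the `fp_mul(NR, v1)` step with a negation when NR = -1, saving a further base multiply.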
* Fix CRT+NTT correctness and optimize negacyclic NTT pipeline
Correctness fixes:
- Rewrite negacyclic NTT as twist + cyclic DIF/DIT pair (no bit-reversal
permutation), correctly diagonalizing X^D+1.
- Center coefficient→CRT mapping and Garner reconstruction to handle
negacyclic sign wrapping consistently.
- Fix i32 Montgomery csubp/caddp overflow via branchless i64 widening.
- Fix q128 centering overflow in balanced_decompose_pow2 (avoid casting
q≈2^128 into i128).
- Remove dense-protocol schoolbook fallback; all mat-vec now routes
through CRT+NTT.
Performance optimizations:
- Precompute per-stage twiddle roots in NttTwiddles (eliminate runtime
pow_mod per butterfly stage).
- Forward DIF butterfly skips reduce_range before Montgomery mul (safe
because mul absorbs unreduced input).
- Hoist centered-coefficient computation out of per-prime loop in
from_ring.
- Add fused pointwise multiply-accumulate for mat-vec inner loop.
- Add batched mat_vec_mul_crt_ntt_many that precomputes matrix NTT once
and reuses across many input vectors.
- Wire commit_ring_blocks to batched A*s path.
Benchmarks (D=64, Q32/K=6):
- Single-prime forward+inverse NTT: 1.14µs → 0.43µs (2.7x)
- CRT round-trip: 10.7µs → 6.3µs (1.7x)
- Commit nv10: ~70% faster, nv20: ~47% faster
Made-with: Cursor
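The "branchless i64 widening" fix for the i32 Montgomery csubp/caddp overflow can be sketched as follows; the modulus is an illustrative value, not one of the crate's NTT primes.

```rust
// Branchless conditional subtract/add for an i32 prime field, widening to
// i64 so the intermediate t = x - p cannot overflow. The arithmetic shift
// (t >> 63) yields an all-ones mask exactly when t is negative.
fn csubp(x: i32, p: i32) -> i32 {
    let t = x as i64 - p as i64;
    (t + ((t >> 63) & p as i64)) as i32 // add p back iff t < 0
}

fn caddp(x: i32, p: i32) -> i32 {
    let t = x as i64;
    (t + ((t >> 63) & p as i64)) as i32 // add p iff x < 0
}

fn main() {
    let p = 0x7fff_f001_i32; // illustrative modulus near i32::MAX
    assert_eq!(csubp(5, p), 5);      // already reduced: unchanged
    assert_eq!(csubp(p + 3, p), 3);  // reduces once, no i32 overflow
    assert_eq!(caddp(-4, p), p - 4); // lifts negatives into [0, p)
    assert_eq!(caddp(7, p), 7);
}
```

Doing the subtraction in i32 directly would overflow for moduli near `i32::MAX`, which is exactly the bug the commit describes.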
* Cache CRT+NTT matrix representations in setup to avoid repeated conversion
The dense mat-vec paths (commit_ring_blocks, commit_onehot B-mul, compute_v)
previously converted coefficient-form matrices to CRT+NTT on every call.
Now the setup eagerly converts A, B, D into an NttMatrixCache and all
dense operations use the pre-converted form. Coefficient-form matrices
are retained for the onehot inner-product path and ring-switch/generate_m.
Made-with: Cursor
* Remove dead code (HachiRoutines, domains/, redundant trait methods) and extract shared field utilities
- Delete unused HachiRoutines trait and dead algebra/domains/ module
- Remove redundant FieldCore::add/sub/mul and Module::add/neg (covered by ops traits)
- Extract is_pow2_u64, log2_pow2_u64, mul64_wide into fields/util.rs to deduplicate
Made-with: Cursor
* Unify Blake2b and Keccak transcript backends into generic HashTranscript
Replace separate blake2b.rs and keccak.rs with a single generic
HashTranscript<D: Digest> parameterized by hash function. Blake2bTranscript
and KeccakTranscript are now type aliases.
Made-with: Cursor
* Fix sumcheck degree bug, split types, in-place fold, CommitWitness, rename configs, add soundness test
- Fix CompressedUniPoly::degree() off-by-one that could let malformed proofs pass
- Split sumcheck/mod.rs: extract types into types.rs, relocate multilinear_eval
and fold_evals to algebra/poly.rs
- Replace allocating fold_evals with in-place fold_evals_in_place
- Add debug_assert guards to multilinear_eval and fold_evals_in_place
- Introduce CommitWitness struct to replace error-prone 3-tuple returns
- Rename DefaultCommitmentConfig to SmallTestCommitmentConfig, add
ProductionFp128CommitmentConfig
- Add verify_rejects_wrong_opening negative test for verifier soundness
Made-with: Cursor
* fix(test): resolve clippy needless_range_loop in algebra tests
Use iter().enumerate() for schoolbook convolution loops and
array::from_fn for pointwise NTT operations.
Made-with: Cursor
* Refactor commitment setup to runtime layout and staged artifacts.
This removes compile-time commitment shape locks, derives beta from runtime layout, and threads layout-aware setup through commit/prove/verify with setup serialization roundtrip coverage.
Made-with: Cursor
* Soundness hardening: panic-free verifier, Fiat-Shamir binding, NTT overflow fix
- Verifier path never panics; all errors return HachiError
- Bind commitment, opening point, and y_ring in Fiat-Shamir transcript
- Fix i16 csubp/caddp overflow by widening to i32
- multilinear_eval returns Result with dimension checks
- build_w_evals validates w.len() is a multiple of d
- UniPoly::degree uses saturating_sub instead of expect
- Serialize usize as u64 for 32/64-bit portability
- Fix from_i64(i64::MIN) via unsigned_abs
- Remove Transcript::reset from public trait (move to inherent)
- Add batched_sumcheck verifier empty-input guard
Made-with: Cursor
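The `from_i64(i64::MIN)` fix above hinges on `unsigned_abs`. A minimal sketch, using an illustrative 64-bit prime rather than the crate's actual field modulus:

```rust
// Why unsigned_abs matters: negating i64::MIN panics in debug builds (and
// wraps in release) because +2^63 is not representable in i64.
// unsigned_abs returns the magnitude losslessly as a u64.
const P: u64 = 0xffff_ffff_0000_0001; // illustrative 64-bit prime

fn from_i64(x: i64) -> u64 {
    let mag = x.unsigned_abs() % P; // safe even for i64::MIN
    if x < 0 { (P - mag) % P } else { mag }
}

fn main() {
    assert_eq!(from_i64(5), 5);
    assert_eq!(from_i64(-1), P - 1);
    // i64::MIN has magnitude 2^63; naive -x would overflow here.
    assert_eq!(from_i64(i64::MIN), P - (1u64 << 63));
}
```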
* Hoist fully qualified paths to use statements in touched files
Replace inline crate::protocol::commitment::HachiCommitmentLayout,
hachi_pcs::algebra::backend::{CrtReconstruct, NttPrimeOps}, and
hachi_pcs::algebra::CyclotomicRing with top-level use imports.
Made-with: Cursor
* Dispatch norm sumcheck kernels by range size.
Route small-b rounds through the point-eval interpolation kernel and keep the affine-coefficient kernel for larger b, while adding deterministic baseline-vs-dispatched benchmarks and parity tests to validate correctness across both strategies.
Made-with: Cursor
* Format commitment-related files for readability.
Apply non-functional formatting and import ordering cleanups across commitment, ring-switch, and benchmark/test files to keep the codebase style consistent.
Made-with: Cursor
* Format: cargo fmt pass on commitment-related files
Made-with: Cursor
* feat: sequential coefficient ordering + streaming commitment
Change coefficient-to-ring packing from strided to sequential, enabling
true streaming where each trace chunk maps to exactly one inner Ajtai
block. Implement StreamingCommitmentScheme for HachiCommitmentScheme.
- reduce_coeffs_to_ring_elements: sequential packing (chunks_exact(D))
- prove/verify: opening point split flipped to (inner, outer)
- ring_opening_point_from_field: outer split flipped to (M first, R second)
- commit_coeffs: sequential block distribution
- map_onehot_to_sparse_blocks: sequential block distribution
- HachiChunkState + process_chunk / process_chunk_onehot / aggregate_chunks
- Streaming commit tests (matches non-streaming, prove/verify roundtrip)
Made-with: Cursor
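The sequential packing via `chunks_exact(D)` can be sketched as below. This is a hypothetical simplification in which a "ring element" is just its coefficient array; the real scheme wraps these in its cyclotomic ring type.

```rust
// Sequential packing: each consecutive run of D coefficients becomes one
// ring element, so a streamed trace chunk maps to whole Ajtai blocks
// (contrast with strided packing, where a chunk touches every block).
const D: usize = 4; // illustrative ring dimension (real configs use e.g. 64)

fn reduce_coeffs_to_ring_elements(coeffs: &[u64]) -> Vec<[u64; D]> {
    assert!(coeffs.len() % D == 0, "input must be a multiple of D");
    coeffs
        .chunks_exact(D)
        .map(|c| <[u64; D]>::try_from(c).unwrap())
        .collect()
}

fn main() {
    let coeffs: Vec<u64> = (0..8).collect();
    let rings = reduce_coeffs_to_ring_elements(&coeffs);
    // Sequential (not strided): element 0 holds coefficients 0..4.
    assert_eq!(rings[0], [0, 1, 2, 3]);
    assert_eq!(rings[1], [4, 5, 6, 7]);
}
```

Sequential ordering is what makes the streaming interface possible: `process_chunk` can commit a chunk's blocks without seeing the rest of the trace.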
* refactor: decompose verify_batched_sumcheck into composable steps
Split the monolithic verify_batched_sumcheck into three pieces:
- verify_batched_sumcheck_rounds: replay rounds, return intermediate state
- compute_batched_expected_output_claim: query verifier instances
- check_batched_output_claim: enforce equality
This enables callers (e.g. Greyhound) to intercept the intermediate
sumcheck state before the final oracle check. The original function
is preserved as a convenience wrapper.
Made-with: Cursor
* feat: Labrador/Greyhound recursive lattice proof protocol
Implements the full Labrador recursive amortization and Greyhound
evaluation reduction, ported from the C reference with Hachi-native
Fiat-Shamir transcript integration.
New modules:
- protocol::labrador — recursive proof (prover, verifier, fold, commit,
challenge rejection sampler, JL projection, config/guardrails, types)
- protocol::greyhound — evaluation reduction (4-row witness, 5
constraints, eval prover + verifier-side reduce)
- protocol::prg — pluggable PRG backends (SHAKE256, AES-128-CTR) for
commitment key and JL matrix derivation
Hachi-core changes:
- algebra::ring — conjugation automorphism, coeff_norm_sq, ternary/
quaternary samplers for Labrador challenges
- protocol::commitment — pre-derived setup matrices, PRG backend
abstraction for matrix derivation
- protocol::proof — HachiProof restructured as composite of folds +
GreyhoundEvalProof + LabradorProof
- protocol::ring_switch — externalized w_tilde(r) check for Greyhound
- protocol::transcript — ring-element challenge functions (dense +
rejection-sampled), 16 new Fiat-Shamir labels
- protocol::commitment_scheme — integrated Greyhound/Labrador into
prove/verify pipeline
- sumcheck tests decoupled from old proof structure
Made-with: Cursor
* Impl folded Labrador protocol
* Refactor Labrador Witness
* Refactor Labrador Constraints
* Change greyhound to use Labrador scheme
* Update gitignore
* Fix CI issues
* Use constants instead of hardcoded values
* feat: integrate Greyhound/Labrador lattice proof protocol into main
Port the Greyhound evaluation-reduction and Labrador recursive lattice
proof modules from dev-labrador onto main's optimized proving pipeline.
Greyhound/Labrador is invoked as a final proof step after multi-level
folding when D >= 64, providing post-quantum security for the opening.
New modules: protocol/greyhound, protocol/labrador, protocol/prg.
Algebra extensions: coefficients_mut, coeff_norm_sq,
balanced_decompose_pow2_with_carry, conjugation_automorphism_ntt,
sample_ternary/quaternary.
Made-with: Cursor
* Remove integration to Hachi
* Fix CI issue
---------
Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
* Save HachiExpandedSetup (seed + matrices A, B, D) to an OS-specific cache
directory on first generation, and transparently load it on subsequent calls
to avoid re-deriving matrices from SHAKE. NTT caches are rebuilt from the
deserialized matrices. Pattern follows Dory's disk-persistence approach but
saves only the expanded setup (not prover+verifier separately) since NTT
caches are not serializable and must be reconstructed.
Made-with: Cursor
* fix: harden CI workflow to resolve CodeQL security alerts
Pin all GitHub Actions to immutable commit SHAs and add least-privilege
permissions (contents: read) to address 9 medium-severity CodeQL alerts.
Made-with: Cursor
* chore: add .cursor/ to .gitignore
Made-with: Cursor
---------
Co-authored-by: Omid Bodaghi <42227752+omibo@users.noreply.github.com>
…outer_weights (#10)
* perf: gate debug diagnostics behind cfg(debug_assertions) and factor outer_weights
Three performance fixes for the prove path:
1. Gate prove_level_diagnostic, prove_level_selfcheck, and the w_eval
consistency check behind #[cfg(debug_assertions)]. These were running
unconditionally in release builds, causing duplicate compute_m_a_streaming
calls and full polynomial evaluations purely for debug verification.
2. Factor outer_weights in prove_one_level: instead of materializing the
full 2^(m_vars + r_vars) basis weight vector (~2.1 GB for large traces),
pass ring_opening_point.b (size 2^r_vars) and derive the evaluation from the
fold result: eval = Σ_i b[i] * fold(a)[i].
3. Update HachiPolyOps::evaluate_and_fold signature to accept factored
per-block outer scalars instead of the full tensor product.
Made-with: Cursor
* perf: streamline recursive Hachi proving path
Keep recursive w witnesses in digit form to avoid rebuilding dense
polynomials, and size setup and ring-switch work from exact runtime layouts
to cut redundant work.
Made-with: Cursor
* fix: satisfy clippy on setup and ring-switch helpers
Address the current CI failures with minimal changes by allowing the
internal layout helper's argument count and switching the fused m_evals_x
loops to iterator-based indexing.
Made-with: Cursor
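The outer_weights factoring (eval = Σ_i b[i] * fold(a)[i]) can be illustrated with plain integers standing in for field elements; the function names here are hypothetical, not the crate's API.

```rust
// Factored evaluation: instead of materializing the full tensor product
// weights[j*R + i] = a[j] * b[i] and dotting it with w, fold w against `a`
// per block first, then dot the R-sized fold with `b`. Memory drops from
// |a|*|b| weights to |b| accumulators.
fn eval_materialized(w: &[u64], a: &[u64], b: &[u64]) -> u64 {
    let mut acc = 0;
    for (j, &aj) in a.iter().enumerate() {
        for (i, &bi) in b.iter().enumerate() {
            acc += w[j * b.len() + i] * aj * bi; // full weight vector
        }
    }
    acc
}

fn eval_factored(w: &[u64], a: &[u64], b: &[u64]) -> u64 {
    let r = b.len();
    // fold(a)[i] = sum_j a[j] * w[j*R + i]: only R accumulators live
    let mut fold = vec![0u64; r];
    for (j, &aj) in a.iter().enumerate() {
        for i in 0..r {
            fold[i] += aj * w[j * r + i];
        }
    }
    // eval = sum_i b[i] * fold(a)[i]
    b.iter().zip(&fold).map(|(&bi, &fi)| bi * fi).sum()
}

fn main() {
    let (a, b) = (vec![2u64, 3], vec![5u64, 7, 11, 13]);
    let w: Vec<u64> = (1..=8).collect();
    assert_eq!(eval_materialized(&w, &a, &b), eval_factored(&w, &a, &b));
}
```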
* perf: tighten and speed up norm sumcheck
Enforce the balanced digit range produced by decomposition and reduce
round-zero norm sumcheck work with compact affine precomputation plus the
centered balanced point-eval form.
Made-with: Cursor
* feat: parameterize recursive w basis and expand profile comparisons
Allow recursive w openings to use a different gadget basis from level 0 so
we can explore decomposition and sumcheck tradeoffs directly. Add profile
modes for comparing basis choices across the main dense and onehot
workloads.
Made-with: Cursor
* perf: cache t rows and trim ring-switch witness overhead
Cache inner Ajtai t rows so A_row can reuse them directly and accumulate
only the quotient high half instead of recomposing from t_hat on every
block. Trim the ring-switch witness path by dropping the unused field
w-table, reusing decomposition scratch, and reading the final w evaluation
from the folded prover state.
Made-with: Cursor
* perf: skip padded x tails in fused sumcheck
Track the live x prefix from ring switch into the fused prover so x-rounds
only accumulate and fold the physical witness region instead of explicit
zero padding. Preserve the old semantics with round-by-round equivalence
tests against the padded prover.
Made-with: Cursor
* test: bundle sumcheck test helper params for clippy
Collapse the test-only Hachi sumcheck prover helper arguments into a small
params struct so clippy no longer rejects the PR on too-many-arguments.
Made-with: Cursor
* fix: allow no-default-features sumcheck lint path
Mark the parallel-only relation combiner as intentionally unused when the
parallel feature is disabled so the CI clippy matrix stays green in both
feature configurations.
Made-with: Cursor
* perf: specialize single-digit z_pre folds
Cache dense small-digit coefficients and add direct onehot and dense
single-digit fold paths so quadratic-equation z_pre construction stops
paying generic decomposition costs when the witness is already digit-sized.
Made-with: Cursor
* Add rayon support in Labrador
* Change labrador params and match with reference impl
* Impl Ajtai commitment scheme trait
* Add setup to Labrador prover
* Pass transcript to JL projection
* Fix the issue with JL matrix distribution
* Add benchmark for Labrador single level prover
* Update labrador single-level proof benchmark
* Refactor constraints in Labrador
* Add two level labrador prover benchmark
* Add docs for building next constraints functions
* Make Labrador benchmark more realistic based on Greyhound numbers
* Add NTT backend Ajtai commitment scheme
* Add tests to verify verifier rejects malicious proofs
* Add more tracing info for level prover
* Use constants in tests/commitment
* Optimize aggregation phase
* Fix recursive Labrador bug
* Integrate Greyhound and Hachi
* Integrate Labrador directly to Hachi
* Address CI issues
* Remove unused code
* Fix Labrador handoff binding and tail proof encoding
Bind Labrador tails to the carried Hachi commitment, harden verifier and JL
metadata checks, and make Labrador-tail serialization and size accounting
honest. Add regression coverage for spliced tails, malformed metadata,
variable-D handoff selection, and proof-size accounting.
Made-with: Cursor
* Update hachi e2e test
* Use existing setup matrices
* perf: switch bounded Fp128 configs to D=256
Align the default and halving Fp128 presets around the 256-dimensional
Labrador path so the baseline matches the supported challenge machinery.
Increase the sparse challenge weight at the lower ring dimension to preserve
the intended security margin.
Made-with: Cursor
* perf: speed up Labrador challenge sampling
Add a dedicated single-challenge fast path and reuse precomputed
operator-norm tables for sparse challenges. This keeps the Fiat-Shamir
distribution unchanged while removing repeated dense trigonometric work from
the sampler hot path.
Made-with: Cursor
* perf: pack Labrador JL matrices and replay reduced statements
Store JL signs in a packed ternary layout and aggregate directly into
ring-aligned phi blocks to cut the dominant collapse and projection
bandwidth. Carry recursive Labrador state as reduced constraint plans so
prover and verifier only materialize explicit sparse constraints when they
are actually needed.
Made-with: Cursor
* chore: trace Labrador setup and commitment helpers
Label fold planning, setup derivation, and NTT commitment entry points so
Perfetto traces attribute the remaining unlabeled setup and commit time to
concrete Labrador phases.
Made-with: Cursor
* test: right-size Labrador e2e coverage
Keep the Labrador end-to-end checks aligned with the current D=256 configs
while reducing the default test sizes and serializing the heavy cases so
nextest stays stable in CI.
Made-with: Cursor
* test: align Labrador coverage with profile path
Use the standard onehot and full configs in the Labrador e2e checks, and
benchmark the onehot prove path through OneHotPoly so the test and bench
coverage matches the intended profile example behavior.
Made-with: Cursor
* style: format Labrador e2e imports
Apply rustfmt's import grouping for the updated Labrador e2e test so the CI
format check matches the checked-in tree.
Made-with: Cursor
* fix: regenerate stale setup caches and clarify Labrador stream IDs
Avoid panicking on invalid cached setup files so local and CI runs can
rebuild cleanly, and rename the deterministic challenge stream selector so
CodeQL does not treat test vectors as hard-coded nonces.
Made-with: Cursor
---------
Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
* Optimize aggregating JL projection functions
* Add lookup to JL aggregation
* Add more parallelism
* perf: cut Labrador verifier recursion rebuild costs
Split Labrador recursion state into prover and verifier setup shapes and
compute reduced-plan verifier aggregation directly, so recursive
verification stops rebuilding dense intermediate rows and unused NTT caches.
Made-with: Cursor
* perf: stream Labrador JL replay and tail checks
Replay JL rows from the accepted transcript seed and verify the tail round
directly on decomposed payloads, so recursive verification avoids rebuilding
dense JL matrices and recomposed witness side data.
Made-with: Cursor
* perf: cut Labrador recursion earlier and batch hot kernels
Prefer tail cutover as soon as it beats another standard fold, and batch the
hottest aggregation, challenge replay, and linear-garbage kernels so
large-nv Labrador stops dwarfing the Hachi path.
Made-with: Cursor
* perf: speed up Labrador aggregation and challenge replay
Exploit sparse Labrador coefficient structure and cheaper challenge bound
checks to cut the remaining prover and verifier hotspots without changing
transcript behavior.
Made-with: Cursor
* perf: accelerate Labrador JL replay and aggregation kernels
Reuse the in-memory JL collapse path on verifier replay, cut repeated JL
scheduling overhead, and tighten dense ring accumulation so the remaining
Labrador prover and verifier aggregation paths spend less time in repeated
per-element work.
Authored by Cursor assistant (model: GPT-5.4) on behalf of Quang Dao.
Made-with: Cursor
* perf: tighten Labrador handoff accounting and profiling
Make profile runs fail fast outside --release and add the size diagnostics
needed to compare direct and Labrador tails from real serialized cost. Reuse
the handoff D-matrix NTT cache and compare recursive Labrador transitions
against actual carried payload size so tail selection reflects what the
proof will actually send.
Made-with: Cursor
* refactor: dedupe Labrador helper paths and quiet prover diagnostics
Share the repeated Labrador utility helpers in one place and move the
prover's profiling prints onto structured tracing, so the review feedback is
addressed without changing protocol behavior.
Made-with: Cursor
* fix: inline profile format args for clippy
Rewrite the remaining profile example format strings to use inline captures
so the CI Clippy job passes again without changing the example's output.
Made-with: Cursor
* perf: cut allocation churn in folding helpers
Reuse flat output buffers in ring-switch and sumcheck prefix folding, and
evaluate multilinears recursively over slices. This trims temporary Vec
creation on hot prover paths without changing protocol behavior.
Made-with: Cursor
* refactor: hoist opening-point helpers and simplify profile example
Centralize basis and opening-point conversions so the profile example and
protocol code reuse the same logic. Drop the setup-only profiling path so
the example stays focused on end-to-end proving runs.
Made-with: Cursor
* fix: restore opening-point test helper imports
Keep the commitment-scheme tests compiling after hoisting opening-point
helpers into their own module. Include the accompanying rustfmt cleanup in
touched Rust call sites.
Made-with: Cursor
* refactor: cut over Labrador naming and wire labels
Replace the terse Labrador config and payload vocabulary with descriptive
names across recursion, proofs, and transcript labels so the implementation
is easier to follow and the wire format stays internally consistent. Guard
the small-digit CRT/NTT fast path so deeper folds fall back safely once
coefficients leave the lookup-table range.
Made-with: Cursor
* Refine Labrador handoff selection and tests
---------
Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
* perf: add d64 partial-split NTT prototype
Isolate the q=2^128-5823 D=64 partial-split multiplication path, its packed
cached-domain kernels, and a focused benchmark/test suite so it can be
reviewed independently from the sumcheck work.
Made-with: Cursor
* fix: satisfy clippy in partial split benches
Clean up the benchmark and test scaffolding to avoid indexed iteration
warnings and packed-width modulo warnings in CI.
Made-with: Cursor
* perf: tighten partial split NTT hot helpers
Inline the small hot wrappers, collapse duplicated scalar and packed helper
kernels, and remove unused prototype-only APIs so the partial-split backend
is leaner without changing behavior.
Made-with: Cursor
* perf: re-fuse single-product partial-split kernels
Restore direct-write split multiply kernels so single-product and packed
batch workloads do not pay the zero-plus-accumulate cost introduced by the
cleanup refactor.
Made-with: Cursor
* perf: split Hachi sumcheck into two stages
Separate the prefix-range pass from the fused relation scan so stage 2 can
reuse the shared local w basis and avoid redundant work. This also completes
the stage naming cutover and removes the obsolete standalone sumcheck
modules.
Made-with: Cursor
* chore: fix doc placement on HachiLevelProof and use Prime128M8M4M1M0 in tests
Move the D-agnostic doc comment to HachiLevelProof where it belongs, and
replace Fp64<4294967197> with the named Prime128M8M4M1M0 alias in
ring_switch, hachi_stage1, and hachi_stage2 tests.
Made-with: Cursor
* fix: absorb s_claim into transcript before batching challenge + dedup trim_trailing_zeros
Absorb the prover-supplied s_claim into the Fiat-Shamir transcript before
sampling CHALLENGE_SUMCHECK_BATCH on both prover and verifier sides. Without
this, an adversary could choose among multiple valid s_claim values after
seeing the batching coefficient. Also extract the duplicated
trim_trailing_zeros helper from hachi_stage1 and hachi_stage2 into the
parent sumcheck module.
Made-with: Cursor
* perf: apply split-eq e_in-inside/e_out-outside optimization to all prefix_x paths
Factor out the e_second multiplication from the inner loop in stage 1 and
stage 2 prefix_x compute_round methods. Within each block of consecutive
pairs sharing the same j_high, accumulate contributions weighted by e_first
(e_in) only, then post-multiply the block result by e_second (e_out) once.
This eliminates one full field multiply per pair per round in all prefix_x
code paths.
Made-with: Cursor
* perf: optimize compact Hachi sumcheck folds
Use pair-fold lookup tables for compact stage-1 and stage-2 folds and absorb
stage-2 batching into split-eq so the fused kernels do less repeated field
work. Clarify the stage-2 relation docs to match the actual prover/verifier
identity.
Made-with: Cursor
* perf: skip recoverable norm linear coefficients in Hachi sumcheck
Use split-eq claim recovery to omit norm-round linear q terms during
accumulation while still reconstructing the full round polynomial when
needed. Track the prior norm claim in stage 2 and add split-eq recovery
tests so the reduced-coefficient path stays equivalent to the full
computation.
Made-with: Cursor
* perf: add bivariate-skip proofs for early Hachi sumcheck rounds
Build the first two stage-local bivariate-skip proofs directly, reconstruct
the omitted round polynomials from compact algebraic state, and tighten the
stage-2 prefix path so the skipped rounds stay cheap while the terminology
matches the math.
Made-with: Cursor
* fix: keep full stage2 m table through sparse x rounds
Carry the full stage2 m multilinear table across sparse prefix-x folding so
boundary pairs and quads still use the verifier's full relation data, and
harden the prefix tests around nonzero tail entries so the compact prover
path stays aligned with the padded reference.
Made-with: Cursor
* fix: count prefix fields in profile proof breakdown
Include both prefix option tag bytes and any serialized bivariate-skip
payloads in the profile size accounting, and expose size/presence helpers on
the staged proof payloads so the example can report those fields without
reaching into private internals.
Made-with: Cursor
* test: clean up bivariate-skip reference helpers for CI clippy
Use assign-op and iterator forms in the two-round prefix reference helpers
so the strict all-targets Clippy job stays green without changing the helper
math.
Made-with: Cursor
* fix: keep sumcheck prefix prover-only
Bind the transcript only to canonical round messages and reject malformed
proof shapes explicitly so verifier flow stays implementation-agnostic.
Made-with: Cursor
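The split-eq e_in-inside/e_out-outside optimization can be illustrated with plain integers standing in for field elements (function names are hypothetical, not the crate's prefix_x API):

```rust
// Within each block sharing the same outer index, weight terms by e_in only
// and apply e_out once per block, saving one multiply per element. Plain
// u64 products stand in for field multiplications.
fn weighted_sum_naive(vals: &[u64], e_in: &[u64], e_out: &[u64]) -> u64 {
    let inner = e_in.len();
    let mut acc = 0;
    for (j, &eo) in e_out.iter().enumerate() {
        for (i, &ei) in e_in.iter().enumerate() {
            acc += vals[j * inner + i] * ei * eo; // two muls per element
        }
    }
    acc
}

fn weighted_sum_factored(vals: &[u64], e_in: &[u64], e_out: &[u64]) -> u64 {
    let inner = e_in.len();
    let mut acc = 0;
    for (j, &eo) in e_out.iter().enumerate() {
        let block: u64 = e_in
            .iter()
            .enumerate()
            .map(|(i, &ei)| vals[j * inner + i] * ei) // e_in inside
            .sum();
        acc += block * eo; // e_out applied once per block
    }
    acc
}

fn main() {
    let vals: Vec<u64> = (1..=8).collect();
    let (e_in, e_out) = (vec![3u64, 5], vec![7u64, 11, 2, 4]);
    assert_eq!(
        weighted_sum_naive(&vals, &e_in, &e_out),
        weighted_sum_factored(&vals, &e_in, &e_out)
    );
}
```

Both forms compute the same sum; the factored one simply exploits that e_out is constant across each inner block, which is the identity the commit relies on.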
… estimator (#19)
* ci: add onehot nv32 benchmark reporting
Track onehot nv32 timing and RSS in CI with a sticky PR report so benchmark
changes stay visible across commits without heavier profiling artifacts.
Made-with: Cursor
* ci: clarify onehot sparsity labels
Describe the nv32 benchmark and D=64 estimator as 1-of-256 one-hot so
reviewers can read the sparsity assumptions directly from the check output
and reports.
Made-with: Cursor
* docs/ci: add onehot analysis notes and harden benchmark reporting
Bundle the supporting one-hot and SIS analysis notes with the benchmark
branch so the PR carries the rationale for the new parameter choices. Clean
up the remaining benchmark-reporting edge cases so traces stay alive for the
full run, partial baselines render correctly, and PR comment upserts fail
softly instead of surfacing hidden job errors.
Made-with: Cursor
* docs: remove local-only analysis notes from branch
Keep the root analysis markdowns local-only so the benchmark PR only carries
code and workflow changes. Preserve the local files via repo-local excludes
instead of tracking them in git.
Made-with: Cursor
* ci: fix onehot timing fallback attribution
Attribute missing split timings to Hachi when the benchmark log only exposes
total prove or verify time, so the report stays conservative instead of
assigning the whole interval to Labrador.
Made-with: Cursor
* ci: compare onehot bench to main and previous run
Render the onehot benchmark report against both the main-branch split point
and the previous successful PR update so regressions are visible against the
branch base as well as the last iteration.
Made-with: Cursor
…(#20)
* Use scalar field randomness instead of ring randomness
* Use AggregationRandomness enum for two randomness cases
* Remove b computation from aggregation
* Make JL projection matrix generation thread-friendly
* Speed up computing h
* Fix clippy
One-hot 32 Variables Benchmark Report
Posted by Cursor assistant (model: GPT-5) on behalf of the user (Quang Dao) with approval.
Summary
- HachiLevelParams Cfg::N_B/Cfg::N_D assumptions in ring-switch,
quadratic-equation, and recursive prove/verify paths
- K=256 from D in the profile example and e2e tests

Validation
- cargo fmt -q
- cargo clippy --all --message-format=short -q -- -D warnings
- cargo test -q

Profile (examples/profile.rs, HACHI_MODE=onehot, HACHI_NUM_VARS=32)
- Timings: 2.411s / 190.328ms / 943.577ms / 47.719ms
- Sizes: 157,670 B / 25,008 B / 132,657 B

Comparison To ONEHOT_PROOF_SIZE_OPTIMUM.md
- The note does not cover nv=32, but its D64 optimum bracket is 106,125 B at
nv=30 and 115,441 B at nv=38
- The nv=32 proof here is therefore about 42.2 KB to 51.5 KB above that
bracket
- 132,657 B versus the note's D64 fixed-point tail of 71,849 B

Note
- The b0=2 / wb=4 optimum from the note is not fully live yet; the PR lands
the D64 scheduling/runtime-row-count infrastructure first