joaquinbejar · Apr 25, 2026
diff --git a/BENCH.md b/BENCH.md
@@ -0,0 +1,209 @@
+# Tail-latency benchmarks
+
+This document covers the **HDR-histogram** bench suite added in 0.7.0
+under `benches/order_book/*_hdr.rs`. The default Criterion benches in
+the same directory remain; they publish HTML reports to
+`target/criterion/` and provide the mean-centric statistical comparison
+that Criterion does well. The HDR benches are the source of truth for
+the **tail** numbers (`p50` / `p99` / `p99.9` / `p99.99`) that tier-one
+electronic exchanges quote in SLOs.
+
+## How to run
+
+```bash
+make bench-hdr                 # all six scenarios
+cargo bench --bench mixed_70_20_10_hdr   # single scenario
+```
+
+Each bench writes its raw HDR histogram to
+`target/bench-hdr/<scenario>.hgrm` (V2 format) for downstream HDR
+plotters; the directory lives under `target/` and is gitignored.
+
+## Methodology
+
+- **Histogram resolution.** `Histogram::<u64>` sized for `1 ns` to `1 s`
+  with three significant figures. Three sig-figs is more than enough
+  to separate a `p99` from a `p99.9` an order of magnitude apart while
+  staying memory-cheap (~80 KB per histogram).
+- **Sample collection.** Each measured operation is wrapped in a closure
+  passed to `record(...)`, which times the closure with
+  `std::time::Instant::now()` (one call before, one after) and writes
+  the elapsed-nanosecond value into the histogram. The closure result
+  is consumed via `std::hint::black_box` to prevent dead-code
+  elimination.
+- **Warmup.** Long-running scenarios (`add_only`, `mixed_70_20_10`)
+  discard 200 000 ops before the measurement window starts.
+  Pre-loading scenarios (`cancel_only`, `aggressive_walk`,
+  `mass_cancel_burst`) seed the book in a non-measured loop instead.
+- **Workload determinism.** All scenarios drive a self-contained
+  xorshift PRNG seeded with `0xA5A5_A5A5_A5A5_A5A5`. Reproducing a run
+  with the same code produces the same op stream; the measured
+  latencies still vary with scheduling jitter on the host.
+- **Coordinated omission.** The bench loop is **closed-loop**: the
+  driver waits for each engine call to return before issuing the next.
+  Closed-loop measurements **systematically under-report** tail
+  latencies that a real load generator would observe under saturation,
+  because queueing delays that would build up under a fixed arrival
+  rate never materialize. **The numbers below are pure service time —
+  use them as a regression signal and a lower bound on the production
+  tail, not as a production SLO.** Open-loop measurement (record
+  `now - scheduled_arrival`, not `now - call_start`) is the right
+  follow-up; tracked but not in the initial drop.
+- **CPU pinning.** Optional. On Linux, `taskset -c <core> cargo bench
+  --bench mixed_70_20_10_hdr` reduces variance from cross-core
+  scheduling. On macOS the benches were run without pinning — see the
+  run conditions block below.
+
+## Run conditions for the numbers below
+
+| Item | Value |
+|---|---|
+| Host | Apple M4 Max, macOS 26.4 (Darwin 25.4.0, `arm64`) |
+| Pinning | None |
+| Toolchain | `rustc 1.95.0` (stable) |
+| Profile | `bench` (Cargo's `bench` profile inherits from `release`) |
+| `RUSTFLAGS` | unset |
+| Allocator | system allocator |
+| Date | 2026-04-25 |
+| Crate version | `0.7.0-unreleased` (commit on `issue-56-hdr-bench`) |
+
+## Headline numbers
+
+All values in nanoseconds. **Closed-loop service time** — see
+"Coordinated omission" above.
+
+### `add_only` — pure passive limit submission, no crossings
+
+200 000 warmup + 1 000 000 measured.
+
+| Quantile | Latency (ns) |
+|---|---|
+| p50    | 791 |
+| p99    | 78 847 |
+| p99.9  | 146 303 |
+| p99.99 | 401 663 |
+| max    | 528 895 |
+
+**Where the tail comes from.** The book grows monotonically across the
+measurement window, so each insert must walk the `SkipMap` to the
+right level. The dominant contributor at p99.99 is allocator jitter
+when `Arc<PriceLevel>` allocations churn under the system allocator;
+secondary is cache misses on the price-side `SkipMap` once the
+working set outgrows L1.
+
+### `cancel_only` — pre-loaded book, sequential cancels
+
+1 000 000 pre-loaded resting orders, all cancelled in order.
+
+| Quantile | Latency (ns) |
+|---|---|
+| p50    | 42 |
+| p99    | 25 167 |
+| p99.9  | 34 047 |
+| p99.99 | 172 031 |
+| max    | 1 271 807 |
+
+**Where the tail comes from.** `DashMap::remove` on the order index is
+a shard-local lock acquisition; the median is dominated by that
+uncontended fast path. The very long p99.99 / max tails reflect
+shard-contention windows when multiple removals land on the same
+shard back to back, plus rare allocator returns of large
+`PriceLevel` linked-list nodes.
+
+### `aggressive_walk` — taker market orders sweep multi-level book
+
+50 levels × 100 resting orders pre-loaded, then 100 000 aggressive
+buys with qty `5..=20`.
+
+| Quantile | Latency (ns) |
+|---|---|
+| p50    | 41 |
+| p99    | 7 083 |
+| p99.9  | 16 959 |
+| p99.99 | 33 823 |
+| max    | 203 263 |
+
+**Where the tail comes from.** The fill loop iterates per-order at
+each level until the requested quantity is consumed. Median is fast
+because most sweeps fill within a single level. Tail is driven by
+sweeps that span multiple levels and drop several `Arc<PriceLevel>`s
+at once.
+
+### `mixed_70_20_10` — 70 % submit, 20 % cancel, 10 % aggressive
+
+200 000 warmup + 1 000 000 measured. The "realistic" headline number.
+
+| Quantile | Latency (ns) |
+|---|---|
+| p50    | 667 |
+| p99    | 39 487 |
+| p99.9  | 71 999 |
+| p99.99 | 298 239 |
+| max    | 644 607 |
+
+**Where the tail comes from.** Mix of all three previous tails. The
+median tracks `add_only` (because submits are 70 % of the workload).
+The p99.99 comes from rare aggressive sweeps that interact with
+allocator returns released by recent cancels.
+
+### `thin_book_sweep` — book near-empty, IOC probing
+
+Refills 3 resting asks every 5 ops; 200 000 IOC buy probes with qty
+`1..=20`.
+
+| Quantile | Latency (ns) |
+|---|---|
+| p50    | 42 |
+| p99    | 5 711 |
+| p99.9  | 15 127 |
+| p99.99 | 50 431 |
+| max    | 418 303 |
+
+**Where the tail comes from.** Most probes either fully fill the
+small resting depth or partial-fill and short-circuit. The p99 is
+shaped by the partial-fill-then-cancel-remainder bookkeeping; the max
+is allocator jitter when the book transitions empty → non-empty.
+
+### `mass_cancel_burst` — dense book, then `cancel_all_orders`
+
+10 000 orders pre-loaded × 500 bursts. Each measured sample is
+**one full burst**, not one cancel — useful as an operator-side
+wall-clock guard rather than a per-op tail.
+
+| Quantile | Latency (ns) |
+|---|---|
+| p50    | 25 711 |
+| p99    | 48 447 |
+| p99.9  | 312 575 |
+| p99.99 | 312 575 |
+| max    | 312 575 |
+
+**Where the tail comes from.** Burst latency scales linearly with the
+book depth; on the host above the median is ~26 µs to drain 10 000
+orders, ~2.6 ns per order amortised. The p99.9 / p99.99 / max all
+collapse to the same value because only 500 samples were taken — the
+single worst-case observation dominates.
+
+## Limitations
+
+- **macOS, no pinning.** The host above is a workstation, not a
+  performance-tuned bench rig. Tail numbers will be tighter on a
+  Linux host with `isolcpus=` + `nohz_full=` + a pinned thread, with
+  the system allocator swapped for `jemalloc` or `mimalloc`.
+- **Closed-loop only.** As called out under Methodology — these
+  numbers are pure service time, not load-induced tail. Open-loop
+  measurement is the next iteration of this suite.
+- **Single-threaded driver.** The benches issue one op at a time. A
+  multi-writer driver would surface `DashMap` shard contention more
+  visibly; deferred to a follow-up.
+
+## Reproducing
+
+```bash
+git checkout issue-56-hdr-bench  # or main once merged
+make bench-hdr
+cat target/bench-hdr/*.hgrm     # raw histograms
+```
+
+`hgrm` files are V2 format, readable by `HdrHistogram` plot tooling
+or loadable via `hdrhistogram::serialization::Deserializer`.
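The closed-loop versus open-loop distinction from the "Coordinated omission" bullet in `BENCH.md` can be illustrated with a small virtual-clock simulation. This is only a sketch and not code from the suite: `closed_loop`, `open_loop`, the service times, and the 1 000 ns arrival interval are all hypothetical. It assumes a single server that cannot start an op before the previous one has finished.

```rust
/// Closed-loop view: each sample is just the service time of the call
/// (`now - call_start`), because the driver never issues early.
fn closed_loop(service_ns: &[u64]) -> Vec<u64> {
    service_ns.to_vec()
}

/// Open-loop view: op `i` is scheduled at `i * interval_ns` and the sample is
/// `finish - scheduled_arrival`, so time spent queued behind a slow op counts.
fn open_loop(service_ns: &[u64], interval_ns: u64) -> Vec<u64> {
    let mut prev_finish = 0u64;
    let mut latencies = Vec::with_capacity(service_ns.len());
    for (i, &svc) in service_ns.iter().enumerate() {
        let scheduled = i as u64 * interval_ns;
        // The op starts at its scheduled time, or later if the engine is busy.
        let start = scheduled.max(prev_finish);
        let finish = start + svc;
        prev_finish = finish;
        latencies.push(finish - scheduled);
    }
    latencies
}

fn main() {
    // Hypothetical service times: one 5 000 ns stall among 100 ns ops.
    let service: [u64; 6] = [100, 100, 5_000, 100, 100, 100];
    println!("closed: {:?}", closed_loop(&service));
    println!("open:   {:?}", open_loop(&service, 1_000));
}
```

In the closed-loop view only one sample is slow. In the open-loop view the three ops scheduled behind the stall inherit its delay, which is the queueing effect that the closed-loop numbers leave out.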
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,39 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 > below group changes by feature; everything ships in the same
 > 0.7.0 publish.
 
+### Added — HDR-histogram tail-latency bench suite (#56)
+
+- **Six new bench binaries** under `benches/order_book/*_hdr.rs` that
+  record per-sample latency into an `hdrhistogram::Histogram` and
+  emit `p50` / `p99` / `p99.9` / `p99.99` + min / max + sample count
+  to stdout. Scenarios: `add_only`, `cancel_only`,
+  `aggressive_walk`, `mixed_70_20_10`, `thin_book_sweep`,
+  `mass_cancel_burst`. Each is a `harness = false` binary that
+  coexists with the existing Criterion benches.
+- **Shared helpers** in `benches/order_book/hdr_common.rs`
+  (`new_histogram`, `record`, `report`, `persist`) and a
+  self-contained xorshift PRNG so the bench tree pulls no extra
+  runtime dependency beyond `hdrhistogram`.
+- **`hdrhistogram` ^7** as a dev-dependency.
+- **`make bench-hdr`** target — runs all six scenarios in series.
+- **`BENCH.md`** at repo root with methodology (warmup, closed-loop
+  vs open-loop disclosure), reproducibility steps, run conditions
+  block, and an honest table of the headline numbers from a single
+  M4 Max run plus a one-paragraph "where the tail comes from"
+  note per scenario. Format-version stays at `2`.
+- Raw histograms persist to `target/bench-hdr/<scenario>.hgrm` (V2
+  HDR format, gitignored under `target/`).
+
+### Notes — HDR bench
+
+- **Closed-loop service time only.** The driver waits for each call
+  to return before issuing the next, so tail latencies under
+  saturation will be worse than these numbers report. Use them as a
+  regression signal and a lower bound on production tail, not as a
+  published SLO. Open-loop measurement is a follow-up.
+- The Criterion benches under `benches/order_book/` (`add_orders.rs`,
+  `match_orders.rs`, etc.) are unchanged.
+
 ### Added — closed `RejectReason` enum (#55)
 
 - **New `RejectReason`** closed `#[non_exhaustive] #[repr(u16)]` enum
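
The xorshift PRNG in `hdr_common.rs` is not part of this diff excerpt. Below is a minimal sketch of what a self-contained generator and a 70/20/10 op selector could look like, assuming Marsaglia's 13/7/17 xorshift64 variant. Only `Rng::new` appears in the diff; `next_u64`, the `Op` enum, and `pick_op` are hypothetical names, and the real helper may use different shifts or a different selection rule.

```rust
/// Minimal xorshift64 generator (assumed 13/7/17 shift triple).
struct Rng(u64);

impl Rng {
    fn new(seed: u64) -> Self {
        // Zero is a fixed point of xorshift, so a zero seed would never advance.
        assert!(seed != 0, "xorshift seed must be non-zero");
        Rng(seed)
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum Op {
    Submit,
    Cancel,
    Aggressive,
}

/// Hypothetical 70/20/10 selector: bucket the next draw into 100 slots.
fn pick_op(rng: &mut Rng) -> Op {
    match rng.next_u64() % 100 {
        0..=69 => Op::Submit,
        70..=89 => Op::Cancel,
        _ => Op::Aggressive,
    }
}

fn main() {
    let mut rng = Rng::new(0xA5A5_A5A5_A5A5_A5A5);
    let mut counts = [0u32; 3];
    for _ in 0..100_000 {
        match pick_op(&mut rng) {
            Op::Submit => counts[0] += 1,
            Op::Cancel => counts[1] += 1,
            Op::Aggressive => counts[2] += 1,
        }
    }
    println!("submit/cancel/aggressive = {:?}", counts);
}
```

Because the generator state is the only source of randomness, two runs with the same seed issue the same sequence of ops; only the measured timings differ between runs.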

diff --git a/Cargo.toml b/Cargo.toml
@@ -61,12 +61,43 @@ criterion = { version = "0.8", features = ["html_reports"] }
 tokio = { version = "1.52", features = ["macros", "rt-multi-thread", "time"] }
 tempfile = "3"
 proptest = "1.7"
+hdrhistogram = "^7"
 
 [[bench]]
 name = "benches"
 path = "benches/mod.rs"
 harness = false
 
+[[bench]]
+name = "add_only_hdr"
+path = "benches/order_book/add_only_hdr.rs"
+harness = false
+
+[[bench]]
+name = "cancel_only_hdr"
+path = "benches/order_book/cancel_only_hdr.rs"
+harness = false
+
+[[bench]]
+name = "aggressive_walk_hdr"
+path = "benches/order_book/aggressive_walk_hdr.rs"
+harness = false
+
+[[bench]]
+name = "mixed_70_20_10_hdr"
+path = "benches/order_book/mixed_70_20_10_hdr.rs"
+harness = false
+
+[[bench]]
+name = "thin_book_sweep_hdr"
+path = "benches/order_book/thin_book_sweep_hdr.rs"
+harness = false
+
+[[bench]]
+name = "mass_cancel_burst_hdr"
+path = "benches/order_book/mass_cancel_burst_hdr.rs"
+harness = false
+
 [[test]]
 name = "tests"
 path = "tests/unit/mod.rs"

diff --git a/Makefile b/Makefile
@@ -167,6 +167,15 @@ bench-json: check-cargo-criterion
 bench-clean:
 	rm -rf target/criterion
 
+.PHONY: bench-hdr
+bench-hdr:
+	cargo bench --bench add_only_hdr
+	cargo bench --bench cancel_only_hdr
+	cargo bench --bench aggressive_walk_hdr
+	cargo bench --bench mixed_70_20_10_hdr
+	cargo bench --bench thin_book_sweep_hdr
+	cargo bench --bench mass_cancel_burst_hdr
+
 
 .PHONY: workflow-coverage
 workflow-coverage:

diff --git a/README.md b/README.md
@@ -48,6 +48,20 @@ This order book engine is built with the following design principles:
 
 ### What's New in Version 0.7.0
 
+#### v0.7.0 — HDR-histogram tail-latency bench suite
+
+- **Six new `*_hdr` bench binaries** under
+  `benches/order_book/`: `add_only`, `cancel_only`,
+  `aggressive_walk`, `mixed_70_20_10`, `thin_book_sweep`,
+  `mass_cancel_burst`. Each records per-sample nanosecond
+  latencies into an `hdrhistogram::Histogram` and emits
+  `p50` / `p99` / `p99.9` / `p99.99` + `min` / `max`. Coexists
+  with the existing Criterion benches.
+- **`make bench-hdr`** convenience target.
+- **Headline numbers + methodology** in `BENCH.md` at the repo
+  root, with a closed-loop disclosure block (the suite measures
+  service time, not load-induced tail).
+
 #### v0.7.0 — Closed `RejectReason` enum
 
 - **New [`RejectReason`]** — closed
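
`BENCH.md` notes that `p99.9`, `p99.99`, and `max` coincide for `mass_cancel_burst` because only 500 samples were taken. The rank arithmetic behind that can be sketched as follows, assuming a plain nearest-rank quantile over sorted samples. This is an approximation of the idea, not the bucketed lookup that `hdrhistogram` performs, and `nearest_rank` and `quantile` are hypothetical helpers.

```rust
/// 1-based nearest-rank index for a quantile given in parts per 10 000
/// (9_990 means p99.9). Computes ceil(n * q / 10_000) in integer arithmetic.
fn nearest_rank(n: u64, q_per_10k: u64) -> u64 {
    ((n * q_per_10k + 9_999) / 10_000).max(1)
}

/// Value at the given quantile of an ascending, non-empty slice.
fn quantile(sorted: &[u64], q_per_10k: u64) -> u64 {
    assert!(!sorted.is_empty(), "need at least one sample");
    let rank = nearest_rank(sorted.len() as u64, q_per_10k);
    sorted[(rank - 1) as usize]
}

fn main() {
    // 500 synthetic burst latencies: 1, 2, ..., 500.
    let samples: Vec<u64> = (1..=500).collect();
    let quantiles: [(&str, u64); 4] = [
        ("p50", 5_000),
        ("p99", 9_900),
        ("p99.9", 9_990),
        ("p99.99", 9_999),
    ];
    for &(name, q) in quantiles.iter() {
        println!(
            "{}: rank {} of 500 -> {}",
            name,
            nearest_rank(500, q),
            quantile(&samples, q)
        );
    }
}
```

With 500 samples, both `p99.9` and `p99.99` land on rank 500, the largest observation. Under this rule `p99.9` first separates from `max` at 1 000 samples, and `p99.99` at 10 000.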

diff --git a/benches/order_book/add_only_hdr.rs b/benches/order_book/add_only_hdr.rs
@@ -0,0 +1,33 @@
+// add_only_hdr — pure passive limit-order entry, no crossings.
+// Measures `add_order` insert cost in isolation.
+
+#[path = "hdr_common.rs"]
+mod common;
+
+use common::{Rng, new_histogram, persist, record, report, submit_gtc};
+
+const SCENARIO: &str = "add_only";
+const WARMUP_OPS: u64 = 200_000;
+const MEASURED_OPS: u64 = 1_000_000;
+const SEED: u64 = 0xA5A5_A5A5_A5A5_A5A5;
+
+fn main() {
+    let book = common::fresh_book();
+    let mut rng = Rng::new(SEED);
+    let mut hist = new_histogram();
+
+    // Warmup — discarded.
+    for i in 0..WARMUP_OPS {
+        submit_gtc(&book, &mut rng, i);
+    }
+
+    // Measurement — id space picks up where warmup stopped to avoid
+    // collisions inside `order_locations`.
+    for i in 0..MEASURED_OPS {
+        let id = WARMUP_OPS + i;
+        record(&mut hist, || submit_gtc(&book, &mut rng, id));
+    }
+
+    report(SCENARIO, &hist);
+    persist(SCENARIO, &hist).expect("persist hgrm");
+}
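
The `record` helper used above lives in `hdr_common.rs`, which is not shown in this excerpt. A minimal sketch of the timing pattern that `BENCH.md` describes (one `Instant` reading before the closure, one after, result passed through `black_box`) might look like this. As an assumption to keep the example standard-library only, a `Vec<u64>` stands in for the `hdrhistogram::Histogram`; the real helper takes the histogram and its signature may differ.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Time one operation and store its elapsed nanoseconds.
/// `samples` is a stand-in for the histogram used by the real suite.
fn record<T>(samples: &mut Vec<u64>, op: impl FnOnce() -> T) {
    let start = Instant::now();
    let out = op();
    let elapsed = start.elapsed();
    // Keep the result observable so the optimizer cannot delete the call.
    black_box(out);
    // Saturate instead of wrapping if the duration does not fit in u64.
    let nanos = elapsed.as_nanos().min(u64::MAX as u128) as u64;
    samples.push(nanos);
}

fn main() {
    let mut samples: Vec<u64> = Vec::new();
    for i in 0..1_000u64 {
        // Hypothetical workload standing in for `submit_gtc`.
        record(&mut samples, || i.wrapping_mul(2_654_435_761));
    }
    samples.sort_unstable();
    println!(
        "n={} min={} ns max={} ns",
        samples.len(),
        samples[0],
        samples[samples.len() - 1]
    );
}
```

The timer stops before `black_box` runs, so the sample covers only the closure itself plus the overhead of the two clock reads.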