209 changes: 209 additions & 0 deletions BENCH.md
@@ -0,0 +1,209 @@
# Tail-latency benchmarks

This document covers the **HDR-histogram** bench suite added in 0.7.0
under `benches/order_book/*_hdr.rs`. The default Criterion benches in
the same directory remain — they publish HTML reports to
`target/criterion/` and report a mean-centric statistical comparison
that Criterion does well. The HDR benches are the source of truth for
the **tail** numbers (`p50` / `p99` / `p99.9` / `p99.99`) that tier-one
electronic exchanges quote in SLOs.

## How to run

```bash
make bench-hdr # all six scenarios
cargo bench --bench mixed_70_20_10_hdr # single scenario
```

Each bench writes its raw HDR histogram to
`target/bench-hdr/<scenario>.hgrm` (V2 format) for downstream HDR
plotters; the directory lives under `target/` and is gitignored.

## Methodology

- **Histogram resolution.** `Histogram::<u64>` sized for `1 ns` to `1 s`
with three significant figures. Three significant figures is enough to
separate a `p99` from a `p99.9` that sit an order of magnitude apart
while staying memory-cheap (~80 KB per histogram).
- **Sample collection.** Each measured operation is wrapped in a closure
passed to `record(...)`, which times the closure with
`std::time::Instant::now()` (one call before, one after) and writes
the elapsed-nanosecond value into the histogram. The closure result
is consumed via `std::hint::black_box` to prevent dead-code
elimination.
- **Warmup.** Long-running scenarios (`add_only`, `mixed_70_20_10`)
discard 200 000 ops before the measurement window starts.
Pre-loading scenarios (`cancel_only`, `aggressive_walk`,
`mass_cancel_burst`) seed the book in a non-measured loop instead.
- **Workload determinism.** All scenarios drive a self-contained
xorshift PRNG seeded with `0xA5A5_A5A5_A5A5_A5A5`. Re-running the same
code reproduces the same op stream; the recorded latencies still vary
with scheduling jitter on the host.
- **Coordinated omission.** The bench loop is **closed-loop**: the
driver waits for each engine call to return before issuing the next.
Closed-loop measurements **systematically under-report** tail
latencies that a real load generator would observe under saturation,
because queueing delays that would build up under a fixed arrival
rate never materialize. **The numbers below are pure service time —
use them as a regression signal and a lower bound on the production
tail, not as a production SLO.** Open-loop measurement (record
`now - scheduled_arrival`, not `now - call_start`) is the right
follow-up; tracked but not in the initial drop.
- **CPU pinning.** Optional. On Linux, `taskset -c <core> cargo bench
--bench mixed_70_20_10_hdr` reduces variance from cross-core
scheduling. On macOS the benches were run without pinning — see the
run conditions block below.
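The sample-collection and warmup bullets above can be sketched as follows. This is a minimal illustration, not the suite's actual `hdr_common.rs`: `record_ns` and `run_closed_loop` are hypothetical names, and a plain `Vec<u64>` stands in for the HDR histogram so the sketch needs no external crate.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical stand-in for the suite's `record` helper: times one call
/// to `op` and appends the elapsed nanoseconds to `samples`.
/// A `Vec<u64>` replaces the HDR histogram to keep the sketch dependency-free.
fn record_ns<T>(samples: &mut Vec<u64>, op: impl FnOnce() -> T) {
    let start = Instant::now();
    let out = op();
    let elapsed = start.elapsed();
    // Consume the result so the optimiser cannot discard the call.
    black_box(out);
    // Saturate instead of wrapping if the u128 nanosecond count exceeds u64.
    samples.push(u64::try_from(elapsed.as_nanos()).unwrap_or(u64::MAX));
}

/// Runs `warmup` unrecorded calls, then `measured` recorded calls.
/// Ids continue where the warmup stopped, so no id is reused.
fn run_closed_loop(warmup: u64, measured: u64, mut op: impl FnMut(u64) -> u64) -> Vec<u64> {
    for i in 0..warmup {
        black_box(op(i));
    }
    let mut samples = Vec::with_capacity(measured as usize);
    for i in 0..measured {
        let id = warmup + i;
        record_ns(&mut samples, || op(id));
    }
    samples
}
```

Calling `run_closed_loop(200_000, 1_000_000, |id| /* engine call */ id)` would mirror the warmup and measurement counts quoted for the long-running scenarios; the timer brackets only the closure, so loop overhead between samples is excluded.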

## Run conditions for the numbers below

| Item | Value |
|---|---|
| Host | Apple M4 Max, macOS 26.4 (Darwin 25.4.0, `arm64`) |
| Pinning | None |
| Toolchain | `rustc 1.95.0` (stable) |
| Profile | Cargo `bench` profile (inherits from `release`) |
| `RUSTFLAGS` | unset |
| Allocator | system allocator |
| Date | 2026-04-25 |
| Crate version | `0.7.0-unreleased` (commit on `issue-56-hdr-bench`) |

## Headline numbers

All values in nanoseconds. **Closed-loop service time** — see
"Coordinated omission" above.
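To make the closed-loop caveat concrete, the sketch below models a single-server queue fed at a fixed arrival interval and compares the service time (what a closed-loop driver records) with the response time measured from the scheduled arrival (what an open-loop driver would record). The function name and the sample numbers are illustrative assumptions, not part of the suite.

```rust
/// Returns `(service_time, response_time)` pairs in nanoseconds for a
/// single-server queue where op `i` is scheduled at `i * interval_ns`.
/// `service_time` is the closed-loop sample; `response_time` is
/// `completion - scheduled_arrival`, the open-loop sample.
fn open_loop_latencies(service_ns: &[u64], interval_ns: u64) -> Vec<(u64, u64)> {
    let mut server_free_at = 0u64; // when the previous op finished
    let mut out = Vec::with_capacity(service_ns.len());
    for (i, &svc) in service_ns.iter().enumerate() {
        let scheduled = i as u64 * interval_ns;
        // An op cannot start before it arrives or before the server is free.
        let start = scheduled.max(server_free_at);
        let done = start + svc;
        server_free_at = done;
        out.push((svc, done - scheduled));
    }
    out
}
```

With service times `[100, 100, 1000, 100, 100]` and a 200 ns arrival interval, the single 1 000 ns stall leaves the two following ops with response times of 900 ns and 800 ns even though each took only 100 ns of service. A closed-loop driver records the 100 ns figures and never sees that queueing delay.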

### `add_only` — pure passive limit submission, no crossings

200 000 warmup + 1 000 000 measured.

| Quantile | Latency (ns) |
|---|---|
| p50 | 791 |
| p99 | 78 847 |
| p99.9 | 146 303 |
| p99.99 | 401 663 |
| max | 528 895 |

**Where the tail comes from.** The book grows monotonically across the
measurement window, so each insert must walk the `SkipMap` to the
right level. The dominant contributor at p99.99 is allocator jitter
when `Arc<PriceLevel>` allocations churn under the system allocator;
secondary is L1 cache misses on the price-side `SkipMap` once the
working set outgrows L1.

### `cancel_only` — pre-loaded book, sequential cancels

1 000 000 pre-loaded resting orders, all cancelled in order.

| Quantile | Latency (ns) |
|---|---|
| p50 | 42 |
| p99 | 25 167 |
| p99.9 | 34 047 |
| p99.99 | 172 031 |
| max | 1 271 807 |

**Where the tail comes from.** `DashMap::remove` on the order index is
a shard-local lock acquisition; the median is dominated by that
uncontended CAS fast path. The very long p99.99 / max tails reflect
shard-contention windows when multiple removals land on the same
shard back to back, plus rare allocator returns of large
`PriceLevel` linked-list nodes.

### `aggressive_walk` — taker market orders sweep multi-level book

50 levels × 100 resting orders pre-loaded, then 100 000 aggressive
buys with qty `5..=20`.

| Quantile | Latency (ns) |
|---|---|
| p50 | 41 |
| p99 | 7 083 |
| p99.9 | 16 959 |
| p99.99 | 33 823 |
| max | 203 263 |

**Where the tail comes from.** The fill loop iterates per-order at
each level until the requested quantity is consumed. Median is fast
because most sweeps fill within a single level. Tail is driven by
sweeps that span multiple levels and drop several `Arc<PriceLevel>`s
at once.

### `mixed_70_20_10` — 70 % submit, 20 % cancel, 10 % aggressive

200 000 warmup + 1 000 000 measured. The "realistic" headline number.

| Quantile | Latency (ns) |
|---|---|
| p50 | 667 |
| p99 | 39 487 |
| p99.9 | 71 999 |
| p99.99 | 298 239 |
| max | 644 607 |

**Where the tail comes from.** Mix of all three previous tails. The
median tracks `add_only` (because submits are 70 % of the workload).
The p99.99 comes from rare aggressive sweeps that interact with
allocator returns released by recent cancels.
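A deterministic 70 / 20 / 10 op stream can be sketched as below. The suite's own generator is not shown in this document, so the xorshift64 variant (shift triple 13 / 7 / 17), the `Op` enum, and the helper names here are assumptions chosen for illustration.

```rust
/// Minimal xorshift64 generator (shift triple 13 / 7 / 17).
/// Assumed variant; the suite's `Rng` may differ.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // A zero state would stay zero forever, so substitute a constant.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical op kinds for the mixed workload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    Submit,
    Cancel,
    Aggressive,
}

/// Maps a draw in `0..100` onto the 70 / 20 / 10 split.
fn pick_op(rng: &mut XorShift64) -> Op {
    match rng.next_u64() % 100 {
        0..=69 => Op::Submit,
        70..=89 => Op::Cancel,
        _ => Op::Aggressive,
    }
}

/// Generates the first `n` ops for a given seed.
fn op_stream(seed: u64, n: usize) -> Vec<Op> {
    let mut rng = XorShift64::new(seed);
    (0..n).map(|_| pick_op(&mut rng)).collect()
}
```

Two calls to `op_stream` with the same seed yield identical sequences, which is the property that lets a latency regression be attributed to code changes instead of workload drift.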

### `thin_book_sweep` — book near-empty, IOC probing

Refills 3 resting asks every 5 ops; 200 000 IOC buy probes with qty
`1..=20`.

| Quantile | Latency (ns) |
|---|---|
| p50 | 42 |
| p99 | 5 711 |
| p99.9 | 15 127 |
| p99.99 | 50 431 |
| max | 418 303 |

**Where the tail comes from.** Most probes either fully fill the
small resting depth or partial-fill and short-circuit. The p99 is
shaped by the partial-fill-then-cancel-remainder bookkeeping; the max
is allocator jitter when the book transitions empty → non-empty.

### `mass_cancel_burst` — dense book, then `cancel_all_orders`

10 000 orders pre-loaded × 500 bursts. Each measured sample is
**one full burst**, not one cancel — useful as an operator-side
wall-clock guard rather than a per-op tail.

| Quantile | Latency (ns) |
|---|---|
| p50 | 25 711 |
| p99 | 48 447 |
| p99.9 | 312 575 |
| p99.99 | 312 575 |
| max | 312 575 |

**Where the tail comes from.** Burst latency scales linearly with the
book depth; on this host the median is ~26 µs to drain 10 000
orders, ~2.6 ns per order amortised. The p99.9 / p99.99 / max all
collapse to the same value because only 500 samples were taken, so the
single worst-case observation sets all three.
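The quantile collapse and the per-order figure can be checked with a few lines of integer arithmetic. The sketch uses the textbook nearest-rank rule, which is an assumption here and may differ in rounding detail from the `hdrhistogram` crate's own quantile lookup; both function names are hypothetical.

```rust
/// Nearest-rank quantile over a sorted slice.
/// `num` is the quantile in parts per 10 000 (9_990 means p99.9).
fn nearest_rank(sorted: &[u64], num: u64) -> u64 {
    assert!(!sorted.is_empty() && num <= 10_000);
    let n = sorted.len() as u64;
    // 1-based rank, rounded up; integer maths avoids float rounding surprises.
    let rank = ((n * num + 9_999) / 10_000).max(1);
    sorted[(rank - 1) as usize]
}

/// Amortised cost per order for one burst.
fn amortised_ns_per_order(burst_ns: u64, orders: u64) -> f64 {
    burst_ns as f64 / orders as f64
}
```

With 500 samples, p99.9 asks for rank `ceil(0.999 × 500) = 500` and p99.99 for rank `ceil(0.9999 × 500) = 500`, so both land on the largest sample, while p99 resolves to rank 495. For the median burst, `amortised_ns_per_order(25_711, 10_000)` gives about 2.57 ns per order.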

## Limitations

- **macOS, no pinning.** The host above is a workstation, not a
performance-tuned bench rig. Tail numbers will be tighter on a
Linux host with `isolcpus=` + `nohz_full=` + a pinned thread, with
the system allocator swapped for `jemalloc` or `mimalloc`.
- **Closed-loop only.** As called out under Methodology — these
numbers are pure service time, not load-induced tail. Open-loop
measurement is the next iteration of this suite.
- **Single-threaded driver.** The benches issue one op at a time. A
multi-writer driver would surface `DashMap` shard contention more
visibly; deferred to a follow-up.

## Reproducing

```bash
git checkout issue-56-hdr-bench # or main once merged
make bench-hdr
cat target/bench-hdr/*.hgrm # raw histograms
```

`hgrm` files are V2 format, readable by `HdrHistogram` plot tooling
or decodable via the `hdrhistogram` crate's `serialization::Deserializer`.
33 changes: 33 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,39 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
> below group changes by feature; everything ships in the same
> 0.7.0 publish.

### Added — HDR-histogram tail-latency bench suite (#56)

- **Six new bench binaries** under `benches/order_book/*_hdr.rs` that
record per-sample latency into an `hdrhistogram::Histogram` and
emit `p50` / `p99` / `p99.9` / `p99.99` + min / max + sample count
to stdout. Scenarios: `add_only`, `cancel_only`,
`aggressive_walk`, `mixed_70_20_10`, `thin_book_sweep`,
`mass_cancel_burst`. Each is a `harness = false` binary that
coexists with the existing Criterion benches.
- **Shared helpers** in `benches/order_book/hdr_common.rs`
(`new_histogram`, `record`, `report`, `persist`) and a
self-contained xorshift PRNG so the bench tree pulls no extra
runtime dependency beyond `hdrhistogram`.
- **`hdrhistogram` ^7** as a dev-dependency.
- **`make bench-hdr`** target — runs all six scenarios in series.
- **`BENCH.md`** at repo root with methodology (warmup, closed-loop
vs open-loop disclosure), reproducibility steps, run conditions
block, and an honest table of the headline numbers from a single
M4 Max run plus a one-paragraph "where the tail comes from"
note per scenario. Format-version stays at `2`.
- Raw histograms persist to `target/bench-hdr/<scenario>.hgrm` (V2
HDR format, gitignored under `target/`).

### Notes — HDR bench

- **Closed-loop service time only.** The driver waits for each call
to return before issuing the next, so tail latencies under saturation
will be worse than what these numbers report. Use them as a regression
signal and a lower bound on the production tail, not as a published SLO.
Open-loop measurement is a follow-up.
- The Criterion benches under `benches/order_book/` (`add_orders.rs`,
`match_orders.rs`, etc.) are unchanged.

### Added — closed `RejectReason` enum (#55)

- **New `RejectReason`** closed `#[non_exhaustive] #[repr(u16)]` enum
31 changes: 31 additions & 0 deletions Cargo.toml
@@ -61,12 +61,43 @@ criterion = { version = "0.8", features = ["html_reports"] }
tokio = { version = "1.52", features = ["macros", "rt-multi-thread", "time"] }
tempfile = "3"
proptest = "1.7"
hdrhistogram = "^7"

[[bench]]
name = "benches"
path = "benches/mod.rs"
harness = false

[[bench]]
name = "add_only_hdr"
path = "benches/order_book/add_only_hdr.rs"
harness = false

[[bench]]
name = "cancel_only_hdr"
path = "benches/order_book/cancel_only_hdr.rs"
harness = false

[[bench]]
name = "aggressive_walk_hdr"
path = "benches/order_book/aggressive_walk_hdr.rs"
harness = false

[[bench]]
name = "mixed_70_20_10_hdr"
path = "benches/order_book/mixed_70_20_10_hdr.rs"
harness = false

[[bench]]
name = "thin_book_sweep_hdr"
path = "benches/order_book/thin_book_sweep_hdr.rs"
harness = false

[[bench]]
name = "mass_cancel_burst_hdr"
path = "benches/order_book/mass_cancel_burst_hdr.rs"
harness = false

[[test]]
name = "tests"
path = "tests/unit/mod.rs"
9 changes: 9 additions & 0 deletions Makefile
@@ -167,6 +167,15 @@ bench-json: check-cargo-criterion
bench-clean:
	rm -rf target/criterion

.PHONY: bench-hdr
bench-hdr:
	cargo bench --bench add_only_hdr
	cargo bench --bench cancel_only_hdr
	cargo bench --bench aggressive_walk_hdr
	cargo bench --bench mixed_70_20_10_hdr
	cargo bench --bench thin_book_sweep_hdr
	cargo bench --bench mass_cancel_burst_hdr


.PHONY: workflow-coverage
workflow-coverage:
14 changes: 14 additions & 0 deletions README.md
@@ -48,6 +48,20 @@ This order book engine is built with the following design principles:

### What's New in Version 0.7.0

#### v0.7.0 — HDR-histogram tail-latency bench suite

- **Six new `*_hdr` bench binaries** under
`benches/order_book/`: `add_only`, `cancel_only`,
`aggressive_walk`, `mixed_70_20_10`, `thin_book_sweep`,
`mass_cancel_burst`. Each records per-sample nanosecond
latencies into an `hdrhistogram::Histogram` and emits
`p50` / `p99` / `p99.9` / `p99.99` + `min` / `max`. Coexists
with the existing Criterion benches.
- **`make bench-hdr`** convenience target.
- **Headline numbers + methodology** in `BENCH.md` at the repo
root, with a closed-loop disclosure block (the suite measures
service time, not load-induced tail).

#### v0.7.0 — Closed `RejectReason` enum

- **New [`RejectReason`]** — closed
33 changes: 33 additions & 0 deletions benches/order_book/add_only_hdr.rs
@@ -0,0 +1,33 @@
// add_only_hdr — pure passive limit-order entry, no crossings.
// Measures `add_order` insert cost in isolation.

#[path = "hdr_common.rs"]
mod common;

use common::{Rng, new_histogram, persist, record, report, submit_gtc};

const SCENARIO: &str = "add_only";
const WARMUP_OPS: u64 = 200_000;
const MEASURED_OPS: u64 = 1_000_000;
const SEED: u64 = 0xA5A5_A5A5_A5A5_A5A5;

fn main() {
    let book = common::fresh_book();
    let mut rng = Rng::new(SEED);
    let mut hist = new_histogram();

    // Warmup — discarded.
    for i in 0..WARMUP_OPS {
        submit_gtc(&book, &mut rng, i);
    }

    // Measurement — id space picks up where warmup stopped to avoid
    // collisions inside `order_locations`.
    for i in 0..MEASURED_OPS {
        let id = WARMUP_OPS + i;
        record(&mut hist, || submit_gtc(&book, &mut rng, id));
    }

    report(SCENARIO, &hist);
    persist(SCENARIO, &hist).expect("persist hgrm");
}