A high-performance limit order book implementation in modern C++17.
make # Build the demo
make test # Run tests
make benchmark # Build benchmark suite
make clean # Clean build artifacts

- Zero heap allocations on the hot path: All `Order` and `PriceLevel` objects are served from pre-allocated object pools. The fill buffer is reused across `match_order` calls, so no `std::vector<Fill>` is allocated per match. Deterministic mode (`-DLOB_DETERMINISTIC_POOL`) enforces zero allocation at compile time.
- Templatized matching on Side: `match_order`, `add_order_to_book`, and `remove_order_from_book` are `template<Side S>`, so the compiler generates separate BUY/SELL code paths, eliminating the runtime branch in the inner matching loop.
- Branch hints: `LOB_LIKELY`/`LOB_UNLIKELY` annotations on all hot-path branches (pool allocation, price range checks, matching loop predicates).
- Order struct layout: `Side` narrowed to `uint8_t`, fields reordered for zero internal padding. Hot fields (id, price, quantity, pointers) packed into the first 48 bytes (single cache line).
- ITCH 5.0 parser: Zero-copy binary protocol parser for Add Order (A/F), Order Executed (E), Order Cancel (X), Order Delete (D), and Order Replace (U) messages.
- Realistic workload benchmark: 93% cancel, 5% add, 2% modify — matching real exchange traffic patterns where the vast majority of orders are cancelled before execution.
- Cycle counter: Cross-platform `rdtsc` (x86) / `CNTVCT_EL0` (ARM64) cycle timer for sub-nanosecond measurement resolution.
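
To illustrate the "templatized matching on Side" item above, here is a minimal sketch of compile-time side dispatch. The names `crosses`, `count_crossable`, and `count_crossable_dispatch` are hypothetical and are not taken from this repository; only the `template<Side S>` pattern and the `uint8_t`-backed `Side` come from the description.

```cpp
#include <cassert>
#include <cstdint>

enum class Side : std::uint8_t { Buy, Sell };

// Hypothetical helper: does an incoming order on side S with limit price
// `limit` cross a resting order on the opposite side priced at `resting`?
// `if constexpr` resolves the comparison at compile time, so each
// instantiation contains only one of the two branches.
template <Side S>
constexpr bool crosses(std::int64_t limit, std::int64_t resting) {
    if constexpr (S == Side::Buy) {
        return limit >= resting;  // a buy crosses asks at or below its limit
    } else {
        return limit <= resting;  // a sell crosses bids at or above its limit
    }
}

// Hypothetical inner loop: counts how many resting prices (sorted
// best-first) the incoming order would cross. There is no side check
// inside the loop body.
template <Side S>
int count_crossable(std::int64_t limit, const std::int64_t* resting, int n) {
    int crossed = 0;
    for (int i = 0; i < n && crosses<S>(limit, resting[i]); ++i) {
        ++crossed;
    }
    return crossed;
}

// The single runtime branch on side happens once, at the entry point.
inline int count_crossable_dispatch(Side side, std::int64_t limit,
                                    const std::int64_t* resting, int n) {
    return side == Side::Buy ? count_crossable<Side::Buy>(limit, resting, n)
                             : count_crossable<Side::Sell>(limit, resting, n);
}
```

Whether this gives a measurable gain depends on the compiler and workload; the sketch only shows where the branch moves.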
- Each shard runs on a dedicated pinned CPU core (`pthread_setaffinity_np` on Linux, `thread_policy_set` on macOS), eliminating context-switch jitter and core migration.
- Single-threaded per shard: no locks, no atomics on the hot path.
- SPSC queues between producer and consumer threads.
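
As a rough illustration of the SPSC handoff between a producer and a shard's consumer thread, the following is a minimal bounded ring buffer. `SpscQueue`, `try_push`, and `try_pop` are hypothetical names, and the project's actual queue may differ. In this sketch the only atomics are the two indices used for the handoff; the matching logic behind the queue would stay single-threaded.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical bounded single-producer/single-consumer ring buffer.
// Capacity must be a power of two so indices can wrap with a bit mask.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Called only by the producer thread. Returns false when full.
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        slots_[head & (Capacity - 1)] = value;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Called only by the consumer thread. Returns nullopt when empty.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;
        T value = slots_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // release the slot
        return value;
    }

private:
    std::array<T, Capacity> slots_{};
    // Separate cache lines (64 bytes assumed) to limit false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};  // next write index
    alignas(64) std::atomic<std::size_t> tail_{0};  // next read index
};
```

A consumer that drains several items per wake-up (as in the batched processing described for the sharded engine) would simply call `try_pop` in a loop up to its batch size.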
AddOrder 26 ns | MatchOrder 47 ns | CancelHeavy 27 Mops/s
| Benchmark | Mean | Throughput | vs 1.2.0 |
|---|---|---|---|
| AddOrder | 26 ns | 38.9 Mops/s | 1.1x faster (was 29 ns) |
| MatchOrder | 47 ns | 21.4 Mops/s | comparable |
| CancelOrder | 16 ns | 63.6 Mops/s | comparable |
| MixedWorkload | 27 ns | 21.6 Mops/s | comparable |
| CancelHeavyWorkload | 19 ns | 27.0 Mops/s | NEW |
| Operation | P99 | P99.9 | P99.99 | vs 1.2.0 |
|---|---|---|---|---|
| AddOrder | 42 ns | 250 ns | 1.8 us | P99.9 4x better (was 1 us) |
| MatchOrder | 84 ns | 167 ns | 250 ns | P99.99 1.3x better (was 333 ns) |
| CancelHeavyWorkload | 83 ns | 666 ns | 833 ns | NEW |
| MixedWorkload | 167 ns | 250 ns | 500 ns | P99.99 7x better (was 3.5 us) |
See results/benchmark_result_1.3.0.csv for full data.
- Tick-indexed ladders: Contiguous arrays replace `std::map` + `std::unordered_map` for price levels. Active levels tracked via bitset with `__builtin_ctzll`/`__builtin_clzll` scanning.
- Object pool with growth control: Pre-allocated pools for `Order` and `PriceLevel`. Deterministic mode (`-DLOB_DETERMINISTIC_POOL`) prevents runtime heap allocation.
- Sharded engine: SPSC queue per shard, strict single-producer ownership, batched consumer processing, thread pinning.
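
The tick-indexed ladder idea can be sketched as follows. This is an assumption-laden illustration, not the repository's code: `TickLadder` and its members are hypothetical, each slot stores only an aggregate quantity, and the scan walks 64-bit words linearly. It relies on the GCC/Clang builtins named above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical ladder for one side of the book: slot i holds the resting
// quantity at price (min_tick + i); one bit per slot marks active levels.
class TickLadder {
public:
    TickLadder(std::int64_t min_tick, std::size_t num_ticks)
        : min_tick_(min_tick), qty_(num_ticks, 0), active_((num_ticks + 63) / 64, 0) {}

    void add(std::int64_t tick, std::uint64_t qty) {
        const std::size_t i = index(tick);
        qty_[i] += qty;
        active_[i / 64] |= (1ULL << (i % 64));
    }

    void clear(std::int64_t tick) {
        const std::size_t i = index(tick);
        qty_[i] = 0;
        active_[i / 64] &= ~(1ULL << (i % 64));
    }

    // Lowest active tick (e.g. best ask): first set bit, via count-trailing-zeros.
    std::optional<std::int64_t> lowest() const {
        for (std::size_t w = 0; w < active_.size(); ++w) {
            if (active_[w] != 0) {
                return min_tick_ + static_cast<std::int64_t>(w * 64 + __builtin_ctzll(active_[w]));
            }
        }
        return std::nullopt;
    }

    // Highest active tick (e.g. best bid): last set bit, via count-leading-zeros.
    std::optional<std::int64_t> highest() const {
        for (std::size_t w = active_.size(); w-- > 0;) {
            if (active_[w] != 0) {
                return min_tick_ + static_cast<std::int64_t>(w * 64 + 63 - __builtin_clzll(active_[w]));
            }
        }
        return std::nullopt;
    }

private:
    std::size_t index(std::int64_t tick) const {
        assert(tick >= min_tick_ && static_cast<std::size_t>(tick - min_tick_) < qty_.size());
        return static_cast<std::size_t>(tick - min_tick_);
    }

    std::int64_t min_tick_;
    std::vector<std::uint64_t> qty_;     // aggregate quantity per tick
    std::vector<std::uint64_t> active_;  // bit i set <=> level i is non-empty
};
```

The trade-off is memory for speed: the array covers the whole configured price range, so the range has to be bounded up front.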
AddOrder 28 ns | MatchOrder 46 ns | MixedWorkload 22.8 Mops/s
| Benchmark | Mean | Throughput | vs 1.1.0 |
|---|---|---|---|
| AddOrder | 28 ns | 35.0 Mops/s | 1.9x faster (was 54 ns) |
| MatchOrder | 46 ns | 22.0 Mops/s | 2.6x faster (was 118 ns) |
| MixedWorkload | 28 ns | 21.2 Mops/s | 1.6x faster (was 46 ns) |
| CancelOrder | 15 ns | 65.3 Mops/s | comparable |
| ModifyOrder | 16 ns | 61.3 Mops/s | comparable |
| Operation | P99 | P99.99 | vs 1.1.0 |
|---|---|---|---|
| AddOrder | 42 ns | 1.6 us | P99.99 7x better (was 11 us) |
| MatchOrder | 84 ns | 333 ns | P99.99 1.9x better (was 625 ns) |
| MixedWorkload | 125 ns | 3.5 us | P99 2.7x better (was 333 ns) |
Each row shows shards / consumer batch size. Producer submits 100k orders; workers process in parallel.
| Shards / Batch | Throughput |
|---|---|
| 1 / 64 | 9.69 Mops/s |
| 2 / 64 | 9.40 Mops/s |
| 4 / 64 | 10.55 Mops/s |
| 8 / 64 | 8.69 Mops/s |
| 1 / 256 | 9.61 Mops/s |
| 2 / 256 | 13.40 Mops/s |
| 4 / 256 | 10.60 Mops/s |
| 8 / 256 | 8.74 Mops/s |
Submit-to-completion latency per order, including queue transit and matching.
| Shards / Batch | Mean | P95 | P99 | P99.9 |
|---|---|---|---|---|
| 1 / 256 | 348 ns | 417 ns | 875 ns | 14.1 us |
| 2 / 256 | 529 ns | 958 ns | 3.7 us | 9.1 us |
| 4 / 256 | 1.6 us | 2.3 us | 13.8 us | 35.3 us |
See results/benchmark_result_1.2.0.csv for full data.
Implemented optimizations based on the article "How to Build a Fast Limit Order Book".
- Integer Prices: Changed from `double` to `int64_t` (ticks) for faster comparisons and better hashing
- Intrusive Doubly-Linked Lists: Orders contain `prev`/`next` pointers for O(1) removal
- Hash Maps: `std::unordered_map` for O(1) price level and order lookup
- Binary Search Tree: Price levels organized in BST for O(log M) insertion of new levels
- Cached Best Bid/Ask: Direct pointer access (`highest_buy_`/`lowest_sell_`) for O(1) best price queries
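
The O(1) removal enabled by intrusive `prev`/`next` pointers can be sketched like this. The `Order` and `PriceLevel` layouts below are simplified assumptions for illustration (the real structs carry more fields); the point is that unlinking needs no traversal once the order pointer is known, for example from the hash map lookup.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal order node with intrusive list links.
struct Order {
    std::uint64_t id = 0;
    std::uint64_t quantity = 0;
    Order* prev = nullptr;
    Order* next = nullptr;
};

// Hypothetical FIFO queue of orders resting at one price.
struct PriceLevel {
    Order* head = nullptr;  // oldest order, matched first
    Order* tail = nullptr;  // newest order
    std::uint64_t total_quantity = 0;

    // Append at the tail to preserve arrival order.
    void push_back(Order& o) {
        o.prev = tail;
        o.next = nullptr;
        if (tail) tail->next = &o; else head = &o;
        tail = &o;
        total_quantity += o.quantity;
    }

    // O(1) unlink: only the neighbours (or head/tail) are touched.
    void remove(Order& o) {
        if (o.prev) o.prev->next = o.next; else head = o.next;
        if (o.next) o.next->prev = o.prev; else tail = o.prev;
        o.prev = nullptr;
        o.next = nullptr;
        total_quantity -= o.quantity;
    }
};
```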
AddOrder 47 ns | MatchOrder 117 ns | MixedWorkload 16.6 Mops/s
| Operation | Mean | Improvement |
|---|---|---|
| AddOrder | 47 ns | 2.3x faster (was 109 ns) |
| CancelOrder | 16 ns | 1.4x faster (was 23 ns) |
| ModifyOrder | 15 ns | 1.6x faster (was 24 ns) |
| MatchOrder | 117 ns | 1.4x faster (was 160 ns) |
| MixedWorkload | 42 ns | 1.6x faster (was 69 ns) |
| Operation | P99 | P99.99 | vs 1.0.0 |
|---|---|---|---|
| AddOrder | 84 ns | 1.8 µs | P99 9x better (was 792 ns), P99.99 19x better (was 34 µs) |
| MatchOrder | 209 ns | 334 ns | P99.99 69x better (was 23 µs) |
| MixedWorkload | 292 ns | 708 ns | P99.99 28x better (was 20 µs) |
| Operation | Complexity | Notes |
|---|---|---|
| `add_order` (passive) | O(1) / O(log M) | O(1) at existing limit, O(log M) for new price level |
| `add_order` (aggressive) | O(F + L log M) | Includes matching; see below |
| `cancel_order` | O(1) | Hash lookup + intrusive list removal |
| `modify_order` | O(1) | Hash lookup + quantity update |
| `get_best_bid` | O(1) | Cached pointer |
| `get_best_ask` | O(1) | Cached pointer |
| `get_snapshot` | O(D) | D = depth requested |
Matching complexity: O(F + L log M)
- F = number of fills (resting orders matched)
- L = number of price levels fully exhausted (L ≤ F)
- M = total price levels on the opposing side
| Scenario | Complexity | Example |
|---|---|---|
| Passive (no cross) | O(1) | Bid below best ask |
| Single partial fill | O(1) | Match one order, level remains |
| Single level exhausted | O(log M) | Clear one price level |
| Multi-level sweep | O(F + L log M) | Large aggressive order |
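
To make F and L concrete, the sketch below sweeps an aggressive buy through an ask book and counts both. It is an illustration under assumptions: `AskBook`, `SweepStats`, and `sweep_buy` are hypothetical, and `std::map` with `std::deque` stands in for the project's BST and intrusive lists. Note that erasing a `std::map` element by iterator is amortized constant, so the log M term in the table reflects the project's own level structure, not this stand-in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>

struct SweepStats {
    int fills = 0;                // F: resting orders matched (fully or partially)
    int levels_exhausted = 0;     // L: price levels emptied and removed
    std::uint64_t remaining = 0;  // unfilled quantity of the incoming order
};

// Hypothetical ask side: price in ticks -> FIFO of resting order quantities.
using AskBook = std::map<std::int64_t, std::deque<std::uint64_t>>;

// Match a buy order with the given limit and quantity against the asks.
inline SweepStats sweep_buy(AskBook& asks, std::int64_t limit, std::uint64_t qty) {
    SweepStats stats;
    while (qty > 0 && !asks.empty() && asks.begin()->first <= limit) {
        auto level = asks.begin();  // best (lowest) ask
        auto& queue = level->second;
        while (qty > 0 && !queue.empty()) {
            const std::uint64_t traded = std::min(qty, queue.front());
            queue.front() -= traded;
            qty -= traded;
            ++stats.fills;
            if (queue.front() == 0) queue.pop_front();  // resting order fully filled
        }
        if (queue.empty()) {
            asks.erase(level);  // level exhausted: structural update
            ++stats.levels_exhausted;
        }
    }
    stats.remaining = qty;
    return stats;
}
```

For example, buying 14 at a limit of 101 against asks of 5 and 5 at 100 and 10 at 101 produces three fills and exhausts one level, which matches the "multi-level sweep" row with F = 3 and L = 1.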
- Price-Time Priority: Orders matched by best price first, then arrival time (FIFO)
- Efficient Data Structures: `std::map` for price levels, `std::list` for order queues, `std::unordered_map` for O(1) order lookup
- Clean API: Structured results with I/O separated from business logic
- Modern C++17: `std::optional`, `[[nodiscard]]`, namespaces, const-correctness
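
As a small illustration of how `std::optional` and `[[nodiscard]]` might combine with a `std::map`-based book, consider the sketch below. `SimpleBook`, `add_bid`, and `best_bid` are hypothetical names and do not reflect the project's actual API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <list>
#include <map>
#include <optional>

// Hypothetical bid side: prices sorted descending so begin() is the best bid;
// each level keeps its orders in arrival (FIFO) order.
class SimpleBook {
public:
    void add_bid(std::int64_t price, std::uint64_t quantity) {
        bids_[price].push_back(quantity);
    }

    // An empty book yields std::nullopt instead of a sentinel price, and
    // [[nodiscard]] asks the compiler to warn if the result is ignored.
    [[nodiscard]] std::optional<std::int64_t> best_bid() const {
        if (bids_.empty()) return std::nullopt;
        return bids_.begin()->first;
    }

private:
    std::map<std::int64_t, std::list<std::uint64_t>, std::greater<>> bids_;
};
```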
AddOrder 109 ns | MatchOrder 160 ns | MixedWorkload 11 Mops/s
| Benchmark | Mean | P99 | P99.99 |
|---|---|---|---|
| GetBestBid | 19 ns | 42 ns | 1.4 µs |
| GetBestAsk | 17 ns | 42 ns | 166 ns |
| GetSpread | 18 ns | 42 ns | 167 ns |
| CancelOrder | 23 ns | 125 ns | 459 ns |
| ModifyOrder | 24 ns | 42 ns | 167 ns |
| AddOrder | 109 ns | 792 ns | 34 µs |
| MatchOrder | 160 ns | 292 ns | 23 µs |
| MixedWorkload | 69 ns | 459 ns | 20 µs |
See benchmark/README.md for methodology.
- Memory pool allocators
- Market orders, IOC, FOK order types
- Thread safety / lock-free structures
- SIMD optimizations for batch operations