A high-frequency trading (HFT) system skeleton implemented in modern C++20, designed to demonstrate production-grade low-latency architecture patterns. This project implements a Spot-Futures Cash and Carry Arbitrage Strategy with emphasis on zero-allocation hot paths, lock-free data structures, and compile-time optimizations.
- Sub-microsecond latency: Tick-to-trade target < 1μs
- Zero-allocation hot path: Pre-allocated pools, arenas, and stack allocation
- Lock-free concurrency: SPSC queues with wait-free guarantees
- Compile-time optimization: CRTP eliminates virtual dispatch overhead
- Cross-platform: Tested on Linux (GCC/Clang), macOS, and Windows (MSVC)
- Production patterns: Cache-line alignment, false sharing prevention, RDTSC timing
- Highlights
- Architecture Overview
- Project Structure
- Design Philosophy
- Key Components
- Build Instructions
- Running Tests & Benchmarks
- Performance Considerations
- Design Decisions
┌─────────────────────────────────────────────────────────────────────────────┐
│ HFT NanoTick Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────────┐ │
│ │ Network Layer │───▶│ Market Data │───▶│ Order Book │ │
│ │ (FIX Protocol) │ │ Handler │ │ (Spot/Futures) │ │
│ └─────────────────┘ │ • Stale Check │ └──────────┬──────────┘ │
│ │ • Seq Gap Detect │ │ │
│ └──────────────────┘ │ │
│ ▼ │
│ ┌──────────────────┐ ┌─────────────────────┐ │
│ │ SPSC Queue │◀───│ Strategy Engine │ │
│ │ (Lock-Free) │ │ (CRTP Pattern) │ │
│ └────────┬─────────┘ │ Cash & Carry Arb │ │
│ │ └─────────────────────┘ │
│ ▼ │
│ ┌──────────────────┐ ┌─────────────────────┐ │
│ │ Order Entry │───▶│ Risk Management │ │
│ │ Gateway │ │ • Rate Limiting │ │
│ │ • FIX Builder │ │ • Position Limits │ │
│ └──────────────────┘ └─────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Support Infrastructure │ │
│ │ • Async Logger (Lock-Free) • Latency Histogram • Memory Arena │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
- Market Data Ingestion: FIX messages are parsed using a zero-copy parser
- Stale Data Protection: Messages older than threshold (default 50ms) are discarded
- Order Book Update: Limit order book is updated with price/quantity changes
- Strategy Evaluation: CRTP-based strategy engine evaluates arbitrage opportunities
- Order Generation: Paired orders (buy spot/sell futures or vice versa) are created
- Risk Validation: Pre-trade risk checks (position limits, rate limits) are applied
- Order Submission: FIX messages are built and sent to the exchange
HFT NanoTick/
├── CMakeLists.txt # Modern CMake build configuration
├── README.md # This documentation
│
├── include/ # Header-only library
│ ├── core/
│ │ ├── Types.hpp # Fundamental types, price representation
│ │ ├── SPSCQueue.hpp # Lock-free single-producer single-consumer queue
│ │ ├── MemoryArena.hpp # Object pools and linear allocators
│ │ └── Timestamp.hpp # RDTSC timing and latency histograms
│ │
│ ├── market_data/
│ │ ├── FixParser.hpp # Zero-copy FIX protocol parser
│ │ └── MarketDataHandler.hpp # Stale data protection, sequence gaps
│ │
│ ├── orderbook/
│ │ └── OrderBook.hpp # Templated limit order book
│ │
│ ├── strategy/
│ │ └── StrategyEngine.hpp # CRTP strategy framework, Cash & Carry
│ │
│ ├── gateway/
│ │ └── OrderEntryGateway.hpp # Order management, risk checks
│ │
│ └── utils/
│ └── Logger.hpp # Async lock-free logging
│
├── src/
│ └── main.cpp # Example application
│
├── tests/ # Google Test unit tests
│ ├── test_orderbook.cpp
│ ├── test_spsc_queue.cpp
│ ├── test_fix_parser.cpp
│ └── test_strategy.cpp
│
└── benchmarks/ # Google Benchmark performance tests
├── bench_orderbook.cpp
├── bench_spsc_queue.cpp
└── bench_latency.cpp
In HFT, memory allocation is a significant source of latency variance:
- malloc/new can take 100+ nanoseconds and cause page faults
- Dynamic allocation invokes the system allocator, potentially blocking
Our approach:
- Pre-allocated object pools for orders and messages
- Linear arenas for per-tick scratch memory
- Fixed-size buffers for FIX message building
- Stack allocation where possible
Modern CPUs execute faster than they can fetch data:
- L1 cache: ~4 cycles (~1.3ns)
- L2 cache: ~12 cycles (~4ns)
- L3 cache: ~40 cycles (~13ns)
- Main memory: ~200+ cycles (~70ns)
Our approach:
alignas(64)on frequently accessed structures to prevent false sharing- Contiguous memory layouts in order books
- SPSC queue slots aligned to cache lines
- Hot/cold data separation
Traditional locks introduce:
- Context switches (1000+ cycles)
- Priority inversion risks
- Deadlock potential
Our approach:
- SPSC queues for producer-consumer patterns
- Atomic operations with appropriate memory ordering
- Wait-free algorithms where possible
Lock-free single-producer single-consumer queue using Lamport's algorithm:
SPSCQueue<MarketDataMessage, 65536> queue;
// Producer thread
queue.tryPush(message);
// Consumer thread
MarketDataMessage msg;
if (queue.tryPop(msg)) {
processMessage(msg);
}Key features:
- Power-of-two capacity for fast modulo
- Acquire-release memory ordering
- Cache-line padding between head/tail
- Wait-free progress guarantee
Templated limit order book with O(1) best bid/ask access:
DefaultOrderBook book(symbolId);
// Add orders
book.addOrder(orderId, Side::Buy, price, quantity);
// Query BBO
Price bestBid = book.bestBid();
Price bestAsk = book.bestAsk();
Price spread = book.spread();
// Modify/Delete
book.modifyOrder(orderId, newQuantity);
book.deleteOrder(orderId);Key features:
- std::map for price-time priority
- Hash map for O(1) order lookup by ID
- Callback support for market data updates
Compile-time polymorphism eliminates virtual function overhead:
class MyStrategy : public StrategyBase<MyStrategy> {
public:
Signal computeSignal() noexcept {
// Strategy logic inlined at compile time
if (spread > threshold) {
return Signal::BuySpot;
}
return Signal::None;
}
};Why CRTP over virtual functions:
- vtable lookup: ~3-4 cycles + potential cache miss
- CRTP: Zero overhead, fully inlined
- Enables aggressive compiler optimizations
Stale data protection prevents trading on outdated quotes:
MarketDataConfig config;
config.staleThresholdNanos = 50'000'000; // 50ms
DefaultMarketDataHandler handler(config);
// Process message - automatically filters stale data
handler.processMessage(data, length, receiveTime);Stale data check logic:
if (receiveTime - sendingTime > threshold) {
discard message // Don't trade on stale quotes
}
Parses FIX messages without memory allocation:
FixParser parser;
if (parser.parse(data, length)) {
auto symbol = parser.symbol(); // string_view into original buffer
Price price = parser.getField(44).asPrice();
int64_t qty = parser.getField(38).asInt();
}Design:
- String views point into original message buffer
- Indexed tag lookup for common fields (O(1))
- Linear scan fallback for rare fields
- C++20 compatible compiler:
- GCC 10+
- Clang 12+
- MSVC 19.29+ (VS 2019 16.10+)
- CMake 3.20+
- Git (for fetching dependencies)
# Clone repository
git clone https://github.com/yourusername/hft-nanotick.git
cd hft-nanotick
# Create build directory
mkdir build && cd build
# Configure (Release build recommended for benchmarks)
cmake -DCMAKE_BUILD_TYPE=Release ..
# Build
cmake --build . -j$(nproc)| Option | Default | Description |
|---|---|---|
HFT_BUILD_TESTS |
ON | Build unit tests |
HFT_BUILD_BENCHMARKS |
ON | Build benchmarks |
HFT_ENABLE_SANITIZERS |
OFF | Enable ASan/UBSan |
HFT_ENABLE_LTO |
OFF | Link-time optimization |
HFT_NATIVE_ARCH |
ON | Optimize for native CPU |
# Example: Build with sanitizers for debugging
cmake -DHFT_ENABLE_SANITIZERS=ON -DCMAKE_BUILD_TYPE=Debug ..cd build
ctest --output-on-failure
# Or run directly
./hft_tests./hft_benchmarks
# Run specific benchmark
./hft_benchmarks --benchmark_filter=BM_OrderBook
# Output JSON for analysis
./hft_benchmarks --benchmark_format=json > results.json./hft_exampleMeasured on Apple M4 Pro, compiled with -O3 -march=native:
| Benchmark | Time | Throughput |
|---|---|---|
| AddOrder | 37.4 ns | 26.7 M ops/s |
| AddAndDelete | 63.4 ns | 15.8 M ops/s |
| ModifyOrder | 13.7 ns | 72.9 M ops/s |
| BestBidQuery | 0.27 ns | 3.75 G ops/s |
| MidPriceQuery | 0.47 ns | 2.14 G ops/s |
| SpreadQuery | 0.27 ns | 3.74 G ops/s |
| GetTopLevels | 41.7 ns | 24.0 M ops/s |
| RandomOperations | 111 ns | 9.0 M ops/s |
Deep Book Operations (batch add):
| Orders | Time | Throughput |
|---|---|---|
| 10 | 870 ns | 11.5 M ops/s |
| 64 | 3.58 µs | 17.9 M ops/s |
| 512 | 30.7 µs | 16.7 M ops/s |
| 1000 | 62.4 µs | 16.0 M ops/s |
| Benchmark | Time | Throughput |
|---|---|---|
| PushPop (single thread) | 3.09 ns | 647 M ops/s |
| Push (single thread) | 2.83 ns | 353 M ops/s |
| Emplace (single thread) | 2.84 ns | 352 M ops/s |
| Two-Thread Throughput | 49.3 ns | 20.3 M ops/s |
| Latency (avg) | 46.3 ns | 32.8 ns avg latency |
| FrontPeek | 0.31 ns | 3.28 G ops/s |
| EmptyCheck | 0.57 ns | 1.76 G ops/s |
| SizeApprox | 0.60 ns | 1.66 G ops/s |
Element Size Scaling:
| Size (bytes) | Time | Bandwidth |
|---|---|---|
| 8 | 2.68 ns | 5.6 GiB/s |
| 64 | 3.47 ns | 34.4 GiB/s |
| 128 | 4.44 ns | 53.7 GiB/s |
| 256 | 7.39 ns | 64.5 GiB/s |
Queue Size Impact:
| Capacity | Time | Throughput |
|---|---|---|
| 1K | 2.32 ns | 864 M ops/s |
| 64K | 2.72 ns | 734 M ops/s |
Burst Operations:
| Burst Size | Time | Throughput |
|---|---|---|
| 64 | 215 ns | 596 M ops/s |
| 512 | 2.26 µs | 454 M ops/s |
| 4096 | 19.4 µs | 422 M ops/s |
| Component | Target | Notes |
|---|---|---|
| SPSC Queue Push/Pop | < 20 ns | Single cache line access |
| Order Book Add | < 500 ns | Map insertion |
| Order Book BBO Query | < 10 ns | Cached values |
| FIX Parse | < 100 ns | Zero-copy, no allocation |
| Strategy Signal | < 50 ns | Inlined via CRTP |
| Tick-to-Trade | < 1 µs | End-to-end target |
- Compile with
-O3 -march=native - Pin threads to specific CPU cores
- Disable CPU frequency scaling
- Use huge pages for large allocations
- Isolate CPU cores from OS scheduler
- Consider kernel bypass (DPDK/RDMA) for networking
Virtual function dispatch involves:
- Load vtable pointer from object
- Index into vtable
- Load function pointer
- Indirect call
This chain causes:
- 2-3 pointer chasing operations
- Potential L1 cache misses
- Branch misprediction on indirect call
CRTP eliminates all of this by resolving at compile time.
Traditional mutex-based synchronization:
pthread_mutex_lock: 20-30ns uncontended- Context switch on contention: 1000+ cycles
- Introduces priority inversion risks
SPSC queue with atomics:
- Single atomic load/store: 10-20ns
- No system calls
- Wait-free progress
Floating-point issues in finance:
0.1 + 0.2 ≠ 0.3in IEEE 754- Rounding errors accumulate
- Different results on different hardware
Fixed-point with 8 decimal places:
- Exact representation
- Integer arithmetic (faster)
- Deterministic results
Traditional parsing copies strings:
std::string symbol = extractField(message, 55); // Allocates!Zero-copy approach:
std::string_view symbol = getField(55).value; // Just a pointer + lengthBenefits:
- No heap allocation
- No string copying
- Cache-friendly linear scan
MIT License - See LICENSE for details.
Contributions are welcome! Please read CONTRIBUTING.md for guidelines.
This is an educational project demonstrating HFT architecture patterns. It is not intended for production trading without significant additional work including:
- Network layer (kernel bypass, multicast handling)
- Persistence and recovery
- Comprehensive risk management
- Regulatory compliance
- Thorough testing and certification
Trading involves significant risk. Use at your own risk.