# ELCRQ

ELCRQ is a high-performance, lock-free, blocking-when-necessary concurrent queue implementation based on the LCRQ (Linked Concurrent Ring Queue) algorithm by Adam Morrison and Yehuda Afek.
## Table of Contents

- Overview
- Features
- Architecture
- Implementations
- API Reference
- Performance
- Requirements
- Building
- Configuration
- Usage Example
- References
- License
- Contributing
## Overview

ELCRQ extends the LCRQ algorithm with an Event-Count mechanism that allows threads to block efficiently when the queue is empty, rather than busy-waiting. This design significantly reduces CPU usage and power consumption in scenarios where the queue may be empty for extended periods.
The project provides implementations in both C and C++, with the C++ implementation being recommended for production use due to its more stable memory management.
## Features

- Lock-free Operations: Both enqueue and dequeue operations are lock-free, ensuring system-wide progress
- Linearizable: All operations are linearizable, providing strong consistency guarantees
- Block-when-necessary: Threads can efficiently wait when the queue is empty using Linux futex-based Event-Counts
- Shared Memory Support: Designed for inter-process communication via shared memory
- Hazard Pointers: Safe memory reclamation without garbage collection (C++ implementation)
- High Throughput: Optimized for high-concurrency scenarios with multiple producer/consumer threads
- Cache-line Aligned: Data structures are aligned to avoid false sharing
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       ELCRQ/SCRQueue                        │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                      EventCount                      │   │
│  │           (Futex-based blocking mechanism)           │   │
│  └──────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                       LCRQueue                       │   │
│  │  ┌─────────┐   ┌─────────┐   ┌─────────┐             │   │
│  │  │  Ring   │──▶│  Ring   │──▶│  Ring   │──▶ ...      │   │
│  │  │  Queue  │   │  Queue  │   │  Queue  │             │   │
│  │  └─────────┘   └─────────┘   └─────────┘             │   │
│  │       ▲                           ▲                  │   │
│  │       │                           │                  │   │
│  │     head                        tail                 │   │
│  └──────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                   Hazard Pointers                    │   │
│  │         (Safe memory reclamation - C++ only)         │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
```
### Ring Queue

Each ring queue is a fixed-size circular buffer with the following properties:

- Size: `2^RING_POW` elements (configurable)
- Each cell contains a value and an index
- Cells are 128-byte aligned to prevent false sharing
- When a ring fills up, a new ring is allocated and linked
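To make these properties concrete, here is a hypothetical sketch of a cell and ring layout. The names and fields are illustrative only, not the actual definitions in `LCRQueue.hpp`/`ELCRQ.h`:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr int RING_POW = 17;
constexpr std::size_t RING_SIZE = std::size_t{1} << RING_POW; // 131072 cells

// Each cell pairs a value with the index (round) it belongs to.
// alignas(128) gives each cell its own pair of cache lines, so two
// threads operating on neighboring cells never share a line.
struct alignas(128) Cell {
    std::atomic<std::uint64_t> val;
    std::atomic<std::uint64_t> idx;
};

struct Ring {
    std::atomic<std::uint64_t> head;
    std::atomic<std::uint64_t> tail;
    std::atomic<Ring*> next;    // link installed when this ring fills up
    Cell cells[RING_SIZE];      // fixed-size circular buffer
};
```

With a power-of-two ring size, a position maps to a slot with a mask instead of a modulo: `slot = pos & (RING_SIZE - 1)`.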
### Event-Count

The Event-Count provides efficient blocking without locks:

- `prepareWait()`: Register intent to wait, get current epoch
- `wait(key)`: Block until epoch changes (new item enqueued)
- `cancelWait()`: Cancel waiting if item was found
- `notify()`: Wake one waiting thread after enqueue
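The wait protocol above can be made concrete with a minimal, self-contained sketch. This is not the project's futex-based implementation (`EventCount.hpp` blocks on a Linux futex word); it substitutes a portable `std::condition_variable` so the epoch-based protocol is visible on its own:

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Portable stand-in for the futex-based EventCount (illustrative only).
class EventCount {
    std::atomic<std::uint64_t> epoch_{0};
    std::mutex m_;
    std::condition_variable cv_;
public:
    using Key = std::uint64_t;

    // Register intent to wait: snapshot the current epoch.
    Key prepareWait() { return epoch_.load(std::memory_order_acquire); }

    // Block until the epoch moves past the snapshot (someone notified).
    void wait(Key key) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] {
            return epoch_.load(std::memory_order_acquire) != key;
        });
    }

    // Nothing to undo in this simplified version.
    void cancelWait() {}

    // Advance the epoch, then wake a waiter (the real version uses futex).
    void notify() {
        epoch_.fetch_add(1, std::memory_order_acq_rel);
        std::lock_guard<std::mutex> lk(m_);
        cv_.notify_one();
    }
};
```

A dequeuer typically spins on the queue up to `MAX_PATIENCE` times, then calls `prepareWait()`, re-checks the queue (calling `cancelWait()` if an item appeared in the meantime), and only then blocks in `wait(key)`; every blocking `enqueue()` ends with `notify()`.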
## Implementations

### C++ Implementation (Recommended)

Located in the `c++/` directory. Uses:
- Boost.Interprocess for shared memory management
- Hazard Pointers for safe memory reclamation
- Modern C++ atomics and memory ordering
Components:
- `SCRQueue.hpp` - Main queue interface with Event-Count
- `LCRQueue.hpp` - Core LCRQ implementation
- `EventCount.hpp` - Futex-based blocking mechanism
- `HazardPointers.hpp` - Memory reclamation
- `Futex.hpp/cpp` - Linux futex wrapper
### C Implementation

Located in the `c/` directory. Uses:

- Custom shared memory allocator (`shm_malloc`)
- Inline assembly for atomic operations
⚠️ Warning: The C implementation uses an experimental shared memory allocator that may be unstable. Use for reference/proof-of-concept only.
Components:
- `ELCRQ.h` - Complete queue implementation
- `EventCount.h` - Futex-based Event-Count
- `primitives.h` - Atomic operation primitives
- `malloc.c/h` - Shared memory allocator
## API Reference

### C++: `SCRQueue`

```cpp
template<typename T>
class SCRQueue {
    // Constructor
    SCRQueue(fixed_managed_shared_memory *main_pool,
             fixed_managed_shared_memory *mem_pool,
             const int num_threads);

    // Blocking enqueue - notifies waiters after enqueue
    void enqueue(T *item, const int tid);

    // Non-blocking enqueue - no notification
    void spinEnqueue(T *item, const int tid);

    // Blocking dequeue - waits if queue is empty
    T* dequeue(const int tid);

    // Non-blocking dequeue with patience limit
    T* spinDequeue(const int tid);
};
```

### C: `ELCRQ`

```c
// Initialize a new queue
void init_queue(ELCRQ* q);

// Blocking enqueue with notification
void enqueue(Object arg, int pid, ELCRQ* q);

// Blocking dequeue - waits on empty queue
Object dequeue(int pid, ELCRQ* q);

// Non-blocking enqueue
void spinEnqueue(Object arg, int pid, ELCRQ* q);

// Non-blocking dequeue with patience limit
Object spinDequeue(int pid, ELCRQ* q);
```

### EventCount

```cpp
class EventCount {
    Key prepareWait();   // Register intent to wait
    void wait(Key key);  // Block until notified
    void cancelWait();   // Cancel waiting
    void notify();       // Wake one waiter
    void notifyAll();    // Wake all waiters
};
```

## Performance

The three-process roundtrip benchmark demonstrates inter-process queue communication:
| Configuration | Throughput |
|---|---|
| 6 threads/process (C) | See benchmark results |
| 8 threads/process (C++) | See benchmark results |
Note: Performance depends heavily on:
- Number of CPU cores (minimum 18-24 for default configuration)
- Memory bandwidth
- Cache coherence protocol efficiency
## Requirements

- OS: Linux (uses the `futex` system call)
- Architecture: x86-64 (uses 128-bit CAS via `cmpxchg16b`)
- CPU Cores: Minimum 18 (C) or 24 (C++) cores for the full benchmark
### C++

- C++11 or later
- Boost (Interprocess library)
- CMake 3.x+
### C

- C99 or later
- CMake 3.x+
- pthreads
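The x86-64 requirement comes from the 16-byte `cmpxchg16b` compare-and-swap the queue uses to update a cell's value and index as one atomic unit. The same packed-word technique can be shown portably at half width; names here are illustrative, not the project's code:

```cpp
#include <atomic>
#include <cstdint>

// Pack two 32-bit fields (value, index) into one 64-bit word so both
// can be compared and swapped in a single atomic instruction. The real
// queue does this at full width: a 16-byte (value, index) pair via
// cmpxchg16b on x86-64.
struct Packed {
    static std::uint64_t pack(std::uint32_t val, std::uint32_t idx) {
        return (std::uint64_t{val} << 32) | idx;
    }
    static std::uint32_t val(std::uint64_t w) { return std::uint32_t(w >> 32); }
    static std::uint32_t idx(std::uint64_t w) { return std::uint32_t(w); }
};

// Claim a cell: succeed only if the stored (value, index) pair still
// matches the snapshot we observed, installing the new pair atomically.
inline bool claim(std::atomic<std::uint64_t>& cell, std::uint64_t expected,
                  std::uint32_t new_val, std::uint32_t new_idx) {
    return cell.compare_exchange_strong(expected,
                                        Packed::pack(new_val, new_idx));
}
```

Updating the pair in one step is what keeps enqueue/dequeue lock-free: a thread with a stale snapshot simply fails the CAS and retries instead of observing a half-written cell.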
## Building

### C++

```sh
# Install Boost (Ubuntu/Debian)
sudo apt-get install libboost-all-dev

# Clone and build
git clone https://github.com/r10a/elcrq
cd elcrq/c++
cmake .
make

# Run benchmark
./main
```

### C

```sh
git clone https://github.com/r10a/elcrq
cd elcrq/c
cmake .
make

# Run benchmark
./c
```

## Configuration

### C++

```c
#define NUM_THREAD 8  // Threads per process (total = 3 * NUM_THREAD)
```

### C

`main.c`:
```c
#define NUM_THREAD 6    // Threads per process
#define NUM_ITERS 10    // Simulation iterations
#define NUM_RUNS 100    // Elements per iteration
#define SHM_FILE "/shm" // Shared memory file name
```

`ELCRQ.h`:

```c
#define RING_POW 17       // Ring size = 2^17 = 131072 elements
#define MAX_PATIENCE 1000 // Spin attempts before blocking
#define Object uint64_t   // Element type
```

The ring size (`2^RING_POW`) affects:
- Memory usage: Larger rings consume more memory
- Performance: Larger rings reduce ring allocation frequency
- Latency: Smaller rings may cause more frequent blocking
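As a rough sizing aid, and assuming each cell occupies a full 128 bytes (implied by the 128-byte cell alignment; the exact footprint depends on the actual cell layout), per-ring memory follows directly from `RING_POW`:

```cpp
#include <cstdint>

// Estimated bytes per ring: 2^ring_pow cells at an assumed 128 bytes
// each. Illustrative arithmetic, not a measured footprint.
constexpr std::uint64_t ring_bytes(int ring_pow) {
    return (std::uint64_t{1} << ring_pow) * 128;
}

// Default RING_POW = 17: 131072 cells, about 16 MiB per ring.
static_assert(ring_bytes(17) == 16u * 1024 * 1024, "16 MiB at RING_POW 17");

// Dropping to RING_POW = 16 halves memory, but a full ring then forces
// allocating and linking a new one twice as often.
static_assert(ring_bytes(16) == 8u * 1024 * 1024, "8 MiB at RING_POW 16");
```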
## Usage Example

### C++

```cpp
#include <boost/interprocess/managed_shared_memory.hpp>
#include "SCRQueue.hpp"

using namespace boost::interprocess;

int main() {
    // Create shared memory
    shared_memory_object::remove("MySharedMemory");
    managed_shared_memory segment(create_only, "MySharedMemory", 65536);

    // Create queue
    SCRQueue<int>* queue = segment.construct<SCRQueue<int>>("queue")(
        &segment, &segment, 4 /* num_threads */
    );

    // Thread 0: Enqueue
    int* item = segment.construct<int>(anonymous_instance)(42);
    queue->enqueue(item, 0);

    // Thread 1: Dequeue (will block if empty)
    int* result = queue->dequeue(1);

    return 0;
}
```

### C

```c
#include "ELCRQ.h"

int main() {
    ELCRQ queue;
    init_queue(&queue);

    // Thread 0: Enqueue
    enqueue(42, 0, &queue);

    // Thread 1: Dequeue (blocks if empty)
    Object result = dequeue(1, &queue);

    return 0;
}
```

## References

- LCRQ Paper: Morrison, A., & Afek, Y. (2013). Fast concurrent queues for x86 processors. PPoPP '13.
- Event-Counts
- Hazard Pointers: Michael, M. M. (2004). Hazard pointers: Safe memory reclamation for lock-free objects.
- Shared Memory (C)
## License

This project is licensed under the BSD 3-Clause License; see the copyright notice in the source files for details.
Copyright (c) 2013, Adam Morrison and Yehuda Afek.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
...
## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.
- Follow existing code style
- Add tests for new functionality
- Update documentation as needed
- Ensure changes compile on Linux x86-64
