Summary
Perf profiling shows the kernel TCP stack consumes ~56% of ProxySQL CPU time
under load, with ~25% in the TCP send/receive path and ~8% in nftables packet
filtering. The current event loop uses poll() + individual recv()/send()/
write() syscalls, resulting in thousands of syscalls per second per worker
thread.
io_uring could reduce this overhead by batching I/O submissions and avoiding
per-operation syscall transitions.
Profiling Evidence
Workload: oltp_read_write, 256 threads, SSL (TLSv1.3), pool 200, 30s
Syscall breakdown (from perf, 31,615 samples):
| Syscall    | Samples | % of syscalls |
|------------|---------|---------------|
| write      | 781     | 39.5%         |
| sendto     | 758     | 38.3%         |
| recvfrom   | 439     | 22.2%         |
| epoll_wait | 130     | —             |
Kernel overhead:
| Function            | Self CPU % | Notes                 |
|---------------------|------------|-----------------------|
| nft_do_chain        | 3.71%      | Firewall per-packet   |
| _copy_to_iter       | 1.30%      | Kernel↔user data copy |
| tcp_sendmsg_locked  | 0.60%      | TCP send path         |
| tcp_v4_rcv          | 0.90%      | TCP receive path      |
| __tcp_transmit_skb  | 0.68%      | SKB construction      |
| tcp_ack             | 0.57%      | ACK processing        |
Total kernel: 56% of CPU. Of that, syscall entry/exit + data copying
accounts for an estimated 5-10%.
Current Architecture
The PgSQL event loop (PgSQL_Thread::run() in lib/PgSQL_Thread.cpp:3104)
follows this pattern per iteration:
1. ProcessAllMyDS_BeforePoll() — set up poll events (POLLIN/POLLOUT)
2. poll(mypolls.fds, mypolls.len, ttw) — single blocking syscall
3. ProcessAllMyDS_AfterPoll() — check revents, call read_from_net()/write_to_net()
4. process_all_sessions() — run session state machines
I/O paths:
- Plaintext:
recv(fd, buf, len, 0) / send(fd, buf, len, MSG_NOSIGNAL)
in PgSQL_Data_Stream::read_from_net() / write_to_net()
- SSL:
recv() → BIO_write(rbio_ssl) → SSL_read() for decryption.
SSL_write() → BIO_read(wbio_ssl) → write(fd) for encryption.
Memory BIOs decouple SSL from socket I/O.
- Backend (libpq):
PQsendQuery() → PQflush() → poll for writability →
PQconsumeInput() → PQgetResult(). All non-blocking.
Key data structures:
ProxySQL_Poll<PgSQL_Data_Stream> — manages struct pollfd[] array + per-FD
metadata (data stream pointer, timestamps)
PgSQL_Data_Stream — owns queueIN/queueOUT buffers, SSL context,
poll_fds_idx for O(1) poll array access
PgSQL_Connection — async state machine (ASYNC_ST enum) driving libpq
non-blocking API
Proposed Implementation: Phased Approach
Phase 1: poll() → io_uring poll (lowest risk)
Replace poll() with IORING_OP_POLL_ADD to get event notification through
io_uring without changing any I/O code.
```cpp
// Current:
rc = poll(mypolls.fds, mypolls.len, ttw);

// Phase 1: submit POLL_ADD SQEs for all active FDs
for (int i = 0; i < mypolls.len; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, mypolls.fds[i].fd, mypolls.fds[i].events);
    io_uring_sqe_set_data(sqe, (void*)(uintptr_t)i); // index for lookup
}
io_uring_submit(&ring);

// Block for up to ttw ms (mirrors poll()'s timeout), then drain completions.
// A bare peek loop would busy-spin when no FD is ready.
struct __kernel_timespec ts;
ts.tv_sec  = ttw / 1000;
ts.tv_nsec = (long)(ttw % 1000) * 1000000L;
struct io_uring_cqe *cqe;
if (io_uring_wait_cqe_timeout(&ring, &cqe, &ts) == 0) {
    do {
        int idx = (int)(uintptr_t)io_uring_cqe_get_data(cqe);
        mypolls.fds[idx].revents = cqe->res; // res carries the event mask
        io_uring_cqe_seen(&ring, cqe);
    } while (io_uring_peek_cqe(&ring, &cqe) == 0);
}
```
Impact: Minimal — same revents processing, no I/O path changes.
Risk: Low — fallback to poll() trivial. Mostly a ProxySQL_Poll change.
Benefit: Eliminates one syscall per loop (poll→io_uring_enter), enables
future phases.
Phase 2: Async read/write for plaintext connections
For non-SSL data streams, replace recv()/send() with IORING_OP_RECV
/ IORING_OP_SEND SQEs.
```cpp
// In ProcessAllMyDS_BeforePoll, instead of just setting POLLIN:
if (myds->encrypted == false) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, fd, myds->queue_w_ptr(queueIN), available, 0);
    io_uring_sqe_set_data(sqe, encode(idx, OP_READ));
}
```
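The `encode()` call above is left abstract. A minimal sketch of packing the poll-array index and an operation tag into the 64-bit `user_data` field (the `OP_*` tags and `decode_*` helpers are illustrative names, not existing ProxySQL symbols):

```cpp
#include <stdint.h>

// Hypothetical operation tags carried in the CQE's user_data.
enum uring_op { OP_READ = 0, OP_WRITE = 1, OP_POLL = 2 };

// Pack the op tag into the top 8 bits, the poll-array index into the low 56.
static inline uint64_t encode(uint32_t idx, enum uring_op op) {
    return ((uint64_t)op << 56) | (uint64_t)idx;
}

static inline uint32_t decode_idx(uint64_t data) {
    return (uint32_t)(data & 0x00FFFFFFFFFFFFFFULL);
}

static inline enum uring_op decode_op(uint64_t data) {
    return (enum uring_op)(data >> 56);
}
```

On completion, `decode_idx()` recovers which data stream the CQE belongs to and `decode_op()` tells the handler whether it was a read or write, so one harvest loop can dispatch both.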
Key change: read_from_net() and write_to_net() for plaintext become
completion handlers rather than initiators. The I/O is submitted before the
ring enter, and completions are processed after.
Impact: Batches multiple read/write operations into a single
io_uring_enter(). With 256 sessions, this could batch 100+ I/O operations
per syscall.
Risk: Medium — changes the I/O timing model. Partial reads/writes need
careful handling: a CQE's res field reports fewer bytes than requested,
and the remainder must be resubmitted.
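Short transfers can be classified in one place before deciding whether to resubmit. A sketch, assuming the standard `cqe->res` semantics (bytes moved, or a negated errno) and invented verdict names:

```cpp
#include <errno.h>
#include <stddef.h>

enum io_verdict { IO_DONE, IO_RESUBMIT, IO_RETRY_LATER, IO_CLOSED, IO_ERROR };

// Classify a CQE result for a send/recv that requested `requested` bytes.
// `*advanced` receives how many bytes were actually transferred.
enum io_verdict classify_cqe(int res, size_t requested, size_t *advanced) {
    if (res > 0) {
        *advanced = (size_t)res;
        // Partial transfer: submit a new SQE for the remaining bytes.
        return ((size_t)res < requested) ? IO_RESUBMIT : IO_DONE;
    }
    *advanced = 0;
    if (res == 0)
        return IO_CLOSED;          // peer closed the connection (recv)
    if (res == -EAGAIN || res == -EINTR)
        return IO_RETRY_LATER;     // rearm on the next loop iteration
    return IO_ERROR;               // genuine socket error
}
```

The completion handler would then advance the queueIN/queueOUT cursors by `*advanced` and, on IO_RESUBMIT, queue a follow-up SQE for the tail of the buffer.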
Phase 3: Fixed buffers for zero-copy
Register frequently-used buffers with io_uring_register_buffers() to
eliminate the _copy_to_iter overhead (1.30% of CPU).
```cpp
// At thread init:
struct iovec iovs[MAX_SESSIONS * 2];
for (int i = 0; i < num_sessions; i++) {
    iovs[i*2].iov_base = session[i]->myds->queueIN.buffer;
    iovs[i*2].iov_len  = QUEUE_BUFFER_SIZE;
    // ... similar for OUT
}
io_uring_register_buffers(&ring, iovs, count);
// Then use IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED
```
Risk: High — buffer lifetime must be carefully managed. Sessions come
and go, so buffer registration needs to be dynamic.
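One way to cope with that churn is to register a fixed table of buffer slots once at thread init and recycle slot indices as sessions come and go. A minimal free-list sketch (the slot count and all names are assumptions):

```cpp
#include <stdint.h>

#define BUF_SLOTS 1024   // assumed upper bound on registered buffers

// Free-list of fixed-buffer indices: a session borrows a slot at creation
// and returns it at teardown, so the iovec table registered with
// io_uring_register_buffers() never needs wholesale re-registration.
struct buf_slab {
    int32_t free_list[BUF_SLOTS];
    int32_t top;                  // number of free slots
};

void buf_slab_init(struct buf_slab *s) {
    for (int32_t i = 0; i < BUF_SLOTS; i++)
        s->free_list[i] = BUF_SLOTS - 1 - i;  // pop order: 0, 1, 2, ...
    s->top = BUF_SLOTS;
}

int32_t buf_slab_acquire(struct buf_slab *s) {
    return s->top > 0 ? s->free_list[--s->top] : -1;  // -1: table full
}

void buf_slab_release(struct buf_slab *s, int32_t idx) {
    s->free_list[s->top++] = idx;
}
```

The slot index doubles as the `buf_index` argument of the FIXED opcodes. Newer kernels also support incremental updates of registered buffers (e.g. liburing's `io_uring_register_buffers_update_tag()`), which would avoid even the up-front full-table registration.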
Phase 4: SSL integration
The SSL path already uses memory BIOs, which means the socket I/O is
decoupled from SSL. The integration pattern:
```
Current:  recv(fd)  → BIO_write(rbio) → SSL_read() → app data
io_uring: RECV CQE → BIO_write(rbio) → SSL_read() → app data
```
The only change is how bytes get into rbio_ssl — from a recv() call
to an io_uring RECV completion. The SSL_read()/SSL_write() and BIO
layer remain unchanged.
```cpp
// Completion handler for SSL read:
void handle_ssl_recv_completion(PgSQL_Data_Stream *myds, void *buf, int len) {
    BIO_write(myds->rbio_ssl, buf, len);
    if (!SSL_is_init_finished(myds->ssl)) {
        myds->do_ssl_handshake();
    } else {
        int n = SSL_read(myds->ssl, myds->queue_w_ptr(queueIN), available);
        // process decrypted data
    }
}
```
Risk: Medium — SSL handshake is multi-step and currently relies on
synchronous recv()/send() within a single read_from_net() call.
With io_uring, each step becomes a separate completion, requiring state
tracking across completions.
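A sketch of that cross-completion state tracking: each CQE re-drives the handshake, and the `SSL_get_error()` outcome picks the next SQE to submit for the stream. The enums below are invented for illustration (they are not ProxySQL's `ASYNC_ST`, and the `ssl_want` values stand in for OpenSSL's error codes):

```cpp
// Hypothetical per-stream handshake bookkeeping for completion-driven SSL.
// Each CQE advances the state machine one step instead of looping inside
// a single read_from_net() call.
enum hs_state { HS_WANT_RECV, HS_WANT_SEND, HS_DONE, HS_FAILED };

// Stand-ins for the SSL_get_error() outcomes that matter mid-handshake.
enum ssl_want { WANT_READ, WANT_WRITE, WANT_NONE, WANT_FAIL };

// After feeding a completion into the BIOs and re-driving the handshake,
// map the result onto the next SQE to submit for this stream.
enum hs_state next_handshake_state(enum ssl_want w) {
    switch (w) {
    case WANT_READ:  return HS_WANT_RECV;  // submit a RECV SQE
    case WANT_WRITE: return HS_WANT_SEND;  // drain wbio, submit a SEND SQE
    case WANT_NONE:  return HS_DONE;       // handshake finished
    default:         return HS_FAILED;     // abort the session
    }
}
```

The state would live on the PgSQL_Data_Stream so that a multi-round-trip TLS handshake survives being split across many loop iterations.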
Files to Modify
| File                      | Change                                        | Phase |
|---------------------------|-----------------------------------------------|-------|
| include/ProxySQL_Poll.h   | Add io_uring ring, SQE submission methods     | 1     |
| lib/ProxySQL_Poll.cpp     | io_uring init/teardown, poll→SQE conversion   | 1     |
| lib/PgSQL_Thread.cpp      | Replace poll() call with io_uring submit+reap | 1     |
| lib/Base_Thread.cpp       | Template methods for before/after poll        | 1-2   |
| lib/PgSQL_Data_Stream.cpp | Async read/write via ring for plaintext       | 2     |
| lib/PgSQL_Data_Stream.cpp | SSL completion handler for ring I/O           | 4     |
| CMakeLists.txt / Makefile | Link liburing, feature detection              | 1     |
Challenges
- libpq internal I/O: PQsendQuery() / PQconsumeInput() use their own
  socket I/O internally. ProxySQL's fork of libpq would need modification to
  use io_uring; otherwise libpq calls remain traditional syscalls, limiting
  the benefit for backend I/O.
- Session lifecycle: Sessions are created and destroyed dynamically. The
  io_uring ring and any registered buffers must handle this churn.
- Backward compatibility: io_uring requires Linux 5.1+ (basic) or 5.6+
  (for IORING_OP_RECV/SEND). Older kernels need the poll() fallback.
- Testing: The async completion model changes the I/O ordering assumptions.
  Race conditions that were impossible with synchronous recv()/send() may appear.
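The backward-compatibility gate above can be checked once at startup to pick the io_uring or poll() backend. A sketch that parses the release string from uname(2) (the helper name is an assumption):

```cpp
#include <stdio.h>
#include <sys/utsname.h>

// Return 1 if the running kernel is at least want_major.want_minor, else 0.
// Release strings look like "5.15.0-91-generic"; sscanf stops at the first
// non-numeric character, which is all we need here.
int kernel_at_least(int want_major, int want_minor) {
    struct utsname u;
    int major = 0, minor = 0;
    if (uname(&u) != 0)
        return 0;
    if (sscanf(u.release, "%d.%d", &major, &minor) != 2)
        return 0;
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}
```

At init, `kernel_at_least(5, 6)` would gate the RECV/SEND phases while `kernel_at_least(5, 1)` gates Phase 1; anything older stays on poll(). A liburing probe (io_uring_get_probe()) would be the more precise check, since distributions backport features.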
Estimated Impact
| Phase                    | CPU reduction | Effort    |
|--------------------------|---------------|-----------|
| 1 (poll replacement)     | ~1%           | 1-2 weeks |
| 2 (async plaintext I/O)  | ~3-5%         | 3-4 weeks |
| 3 (fixed buffers)        | ~1-2%         | 1-2 weeks |
| 4 (SSL integration)      | ~1%           | 2-3 weeks |
Total potential: ~5-8% CPU reduction, translating to a proportional TPS
increase. The benefit scales with concurrency — more sessions = more I/O
operations batched per ring submission.
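A back-of-envelope model of why the benefit scales with concurrency: per loop iteration, the current design pays one poll plus up to two syscalls per ready session, while an ideal phases-1-2 loop pays a single io_uring_enter() (function names below are illustrative, and perfect batching is assumed):

```cpp
// Syscalls per event-loop iteration with N ready sessions.
int syscalls_poll_design(int ready_sessions) {
    return 1 + 2 * ready_sessions;   // poll + (recv + send) per session
}

int syscalls_uring_design(void) {
    return 1;                        // one submit-and-wait enter
}
```

With 256 ready sessions that is 513 syscalls collapsed into 1 per iteration in the ideal case; real batching will be smaller, but the ratio grows with load.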
References
- io_uring intro — Jens Axboe's original design doc
- liburing — userspace library
- Linux kernel 5.6+ for full RECV/SEND SQE support
- OpenSSL memory BIO documentation for custom I/O integration