PgSQL: investigate io_uring to reduce syscall overhead in event loop #5569

@renecannao

Description

Summary

Perf profiling shows the kernel TCP stack consumes ~56% of ProxySQL CPU time
under load, with ~25% in the TCP send/receive path and ~8% in nftables packet
filtering. The current event loop uses poll() + individual recv()/send()/
write() syscalls, resulting in thousands of syscalls per second per worker
thread.

io_uring could reduce this overhead by batching I/O submissions and avoiding
per-operation syscall transitions.

Profiling Evidence

Workload: oltp_read_write, 256 threads, SSL (TLSv1.3), pool 200, 30s

Syscall breakdown (from perf, 31,615 samples):

| Syscall    | Samples | % of I/O syscalls |
|------------|---------|-------------------|
| write      | 781     | 39.5%             |
| sendto     | 758     | 38.3%             |
| recvfrom   | 439     | 22.2%             |
| epoll_wait | 130     | —                 |

Kernel overhead:

| Function             | Self CPU % | Notes                  |
|----------------------|------------|------------------------|
| nft_do_chain         | 3.71%      | Firewall per-packet    |
| _copy_to_iter        | 1.30%      | Kernel↔user data copy  |
| tcp_sendmsg_locked   | 0.60%      | TCP send path          |
| tcp_v4_rcv           | 0.90%      | TCP receive path       |
| __tcp_transmit_skb   | 0.68%      | SKB construction       |
| tcp_ack              | 0.57%      | ACK processing         |

Total kernel: 56% of CPU. Of that, syscall entry/exit + data copying
accounts for an estimated 5-10%.

Current Architecture

The PgSQL event loop (PgSQL_Thread::run() in lib/PgSQL_Thread.cpp:3104)
follows this pattern per iteration:

1. ProcessAllMyDS_BeforePoll()  — set up poll events (POLLIN/POLLOUT)
2. poll(mypolls.fds, mypolls.len, ttw)  — single blocking syscall
3. ProcessAllMyDS_AfterPoll()  — check revents, call read_from_net()/write_to_net()
4. process_all_sessions()  — run session state machines
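The per-iteration pattern above can be illustrated with a minimal, self-contained sketch: plain poll() on a single fd standing in for the mypolls.fds array (names here are illustrative, not ProxySQL's):

```c
#include <poll.h>
#include <unistd.h>

// Minimal model of one event-loop iteration: arm POLLIN, block in a
// single poll() syscall, then service the fd if it became readable.
// Returns the number of bytes drained from fd, 0 on timeout, -1 on error.
static int one_poll_iteration(int fd, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);         // step 2: single blocking syscall
    if (rc <= 0) return rc;                     // timeout or error
    if (pfd.revents & POLLIN) {                 // step 3: check revents
        char buf[64];
        return (int)read(fd, buf, sizeof(buf)); // one read() per ready fd
    }
    return 0;
}
```

Note that every readable fd costs one extra read()/recv() syscall on top of the shared poll() — exactly the per-operation overhead the io_uring proposal targets.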

I/O paths:

  • Plaintext: recv(fd, buf, len, 0) / send(fd, buf, len, MSG_NOSIGNAL)
    in PgSQL_Data_Stream::read_from_net() / write_to_net()
  • SSL: recv() → BIO_write(rbio_ssl) → SSL_read() for decryption;
    SSL_write() → BIO_read(wbio_ssl) → write(fd) for encryption.
    Memory BIOs decouple SSL from socket I/O.
  • Backend (libpq): PQsendQuery() → PQflush() → poll for writability →
    PQconsumeInput() → PQgetResult(). All non-blocking.

Key data structures:

  • ProxySQL_Poll<PgSQL_Data_Stream> — manages struct pollfd[] array + per-FD
    metadata (data stream pointer, timestamps)
  • PgSQL_Data_Stream — owns queueIN/queueOUT buffers, SSL context,
    poll_fds_idx for O(1) poll array access
  • PgSQL_Connection — async state machine (ASYNC_ST enum) driving libpq
    non-blocking API

Proposed Implementation: Phased Approach

Phase 1: poll() → io_uring poll (lowest risk)

Replace poll() with IORING_OP_POLL_ADD to get event notification through
io_uring without changing any I/O code.

// Current:
rc = poll(mypolls.fds, mypolls.len, ttw);

// Phase 1:
// Submit POLL_ADD SQEs for all active FDs
for (int i = 0; i < mypolls.len; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, mypolls.fds[i].fd, mypolls.fds[i].events);
    io_uring_sqe_set_data(sqe, (void*)(uintptr_t)i);  // index for lookup
}
io_uring_submit(&ring);

// Wait up to ttw ms for at least one completion (mirrors poll()'s
// timeout), then drain every CQE that is ready
struct __kernel_timespec ts = { .tv_sec = ttw / 1000,
                                .tv_nsec = (ttw % 1000) * 1000000L };
struct io_uring_cqe *cqe;
if (io_uring_wait_cqe_timeout(&ring, &cqe, &ts) == 0) {
    unsigned head, nr = 0;
    io_uring_for_each_cqe(&ring, head, cqe) {
        int idx = (int)(uintptr_t)io_uring_cqe_get_data(cqe);
        mypolls.fds[idx].revents = (short)cqe->res;  // POLL_ADD res = event mask
        nr++;
    }
    io_uring_cq_advance(&ring, nr);
}

Impact: Minimal — same revents processing, no I/O path changes.
Risk: Low — falling back to poll() is trivial. Mostly a ProxySQL_Poll change.
Benefit: Collapses event waiting into a single io_uring_enter() per loop
iteration and lays the groundwork for the later phases.
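The poll() fallback needs a cheap capability probe at startup. One sketch (helper name is hypothetical; a real build would also feature-detect liburing at configure time) parses the running kernel version:

```c
#include <stdio.h>
#include <sys/utsname.h>

// Returns 1 if the running kernel is at least want_major.want_minor, else 0.
// io_uring needs 5.1+; IORING_OP_RECV/IORING_OP_SEND need 5.6+.
static int kernel_at_least(int want_major, int want_minor) {
    struct utsname u;
    if (uname(&u) != 0) return 0;
    int major = 0, minor = 0;
    if (sscanf(u.release, "%d.%d", &major, &minor) != 2) return 0;
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}
```

A version string alone is not sufficient in practice: seccomp profiles and container runtimes can still block io_uring_setup(), so the probe should also attempt io_uring_queue_init() once and fall back to poll() on failure.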

Phase 2: Async read/write for plaintext connections

For non-SSL data streams, replace recv()/send() with IORING_OP_RECV
/ IORING_OP_SEND SQEs.

// In ProcessAllMyDS_BeforePoll, instead of just setting POLLIN:
if (myds->encrypted == false) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, fd, myds->queue_w_ptr(queueIN), available, 0);
    io_uring_sqe_set_data(sqe, encode(idx, OP_READ));
}

Key change: read_from_net() and write_to_net() for plaintext become
completion handlers rather than initiators. The I/O is submitted before the
ring enter, and completions are processed after.

Impact: Batches multiple read/write operations into a single
io_uring_enter(). With 256 sessions, this could batch 100+ I/O operations
per syscall.

Risk: Medium — changes the I/O timing model. Partial reads/writes need
careful handling via short-read/write CQE flags.
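The encode(idx, OP_READ) helper in the Phase 2 snippet is hypothetical. One way to pack it, together with the short-read/short-write bookkeeping the risk note refers to (all names here are illustrative):

```c
#include <stdint.h>

enum { OP_READ = 0, OP_WRITE = 1 };

// Pack a poll-array index and an operation tag into the 64-bit
// user_data field carried from SQE to CQE.
static inline uint64_t encode(uint32_t idx, uint32_t op) {
    return ((uint64_t)op << 32) | idx;
}
static inline uint32_t decode_idx(uint64_t ud) { return (uint32_t)ud; }
static inline uint32_t decode_op(uint64_t ud)  { return (uint32_t)(ud >> 32); }

// Per-stream progress: a CQE's res may report a short read/write, so
// track how much of the queued buffer is done.
struct io_progress { uint32_t done, total; };

// Apply one completion result; returns 1 if another SQE must be
// submitted for the remaining bytes, 0 if the buffer is fully serviced.
static int apply_completion(struct io_progress *p, int32_t cqe_res) {
    if (cqe_res <= 0) return 0;   // error/EOF: handled elsewhere
    p->done += (uint32_t)cqe_res;
    return p->done < p->total;
}
```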

Phase 3: Fixed buffers for zero-copy

Register frequently-used buffers with io_uring_register_buffers() to
eliminate the _copy_to_iter overhead (1.30% of CPU).

// At thread init:
struct iovec iovs[MAX_SESSIONS * 2];
for (int i = 0; i < num_sessions; i++) {
    iovs[i*2].iov_base = session[i]->myds->queueIN.buffer;
    iovs[i*2].iov_len = QUEUE_BUFFER_SIZE;
    // ... similar for OUT
}
io_uring_register_buffers(&ring, iovs, count);

// Then use IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED

Risk: High — buffer lifetime must be carefully managed. Sessions come
and go, so buffer registration needs to be dynamic.
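One way to make registration effectively dynamic is a fixed pool of slots registered once at thread start and handed out through a free list, so session churn recycles slots instead of re-registering buffers (sketch with illustrative names; real code would pair this with a single io_uring_register_buffers() call at thread init):

```c
#include <stdint.h>

#define SLOT_COUNT 64   // buffers registered once at thread start

// Free list of registered-buffer indices: a session borrows a slot on
// connect and returns it on teardown; the registration itself is stable.
struct slot_pool {
    int16_t next_free[SLOT_COUNT];
    int16_t head;
};

static void pool_init(struct slot_pool *p) {
    for (int i = 0; i < SLOT_COUNT - 1; i++)
        p->next_free[i] = (int16_t)(i + 1);
    p->next_free[SLOT_COUNT - 1] = -1;  // end of free list
    p->head = 0;
}

static int pool_acquire(struct slot_pool *p) {   // returns -1 when exhausted
    int s = p->head;
    if (s >= 0) p->head = p->next_free[s];
    return s;
}

static void pool_release(struct slot_pool *p, int s) {
    p->next_free[s] = p->head;
    p->head = (int16_t)s;
}
```

Sessions beyond SLOT_COUNT would fall back to unregistered IORING_OP_RECV/SEND, keeping the hot connections on the zero-copy path.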

Phase 4: SSL integration

The SSL path already uses memory BIOs, which means the socket I/O is
decoupled from SSL. The integration pattern:

Current:  recv(fd) → BIO_write(rbio) → SSL_read() → app data
io_uring: RECV CQE → BIO_write(rbio) → SSL_read() → app data

The only change is how bytes get into rbio_ssl — from a recv() call
to an io_uring RECV completion. The SSL_read()/SSL_write() and BIO
layer remain unchanged.

// Completion handler for SSL read:
void handle_ssl_recv_completion(PgSQL_Data_Stream *myds, void *buf, int len) {
    BIO_write(myds->rbio_ssl, buf, len);
    if (!SSL_is_init_finished(myds->ssl)) {
        myds->do_ssl_handshake();
    } else {
        int n = SSL_read(myds->ssl, myds->queue_w_ptr(queueIN), available);
        // process decrypted data
    }
}

Risk: Medium — SSL handshake is multi-step and currently relies on
synchronous recv()/send() within a single read_from_net() call.
With io_uring, each step becomes a separate completion, requiring state
tracking across completions.
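That state tracking can be a small per-stream machine; a sketch (enum and function names are illustrative, not existing ProxySQL symbols — handshake_rc here models SSL_do_handshake()'s outcome as 1 = done, 0 = wants more ciphertext, -1 = has handshake bytes to send):

```c
// Handshake progress for one data stream. Each io_uring completion
// advances the machine instead of looping over synchronous recv()/send()
// inside a single read_from_net() call.
enum ssl_hs_state {
    HS_WANT_READ,    // need more ciphertext from a RECV completion
    HS_WANT_WRITE,   // wbio produced handshake bytes; await SEND completion
    HS_DONE          // SSL_is_init_finished() returned true
};

static enum ssl_hs_state next_state(int handshake_rc) {
    if (handshake_rc > 0) return HS_DONE;
    return handshake_rc == 0 ? HS_WANT_READ : HS_WANT_WRITE;
}
```

The stream stores the current state alongside its SSL context; each CQE feeds bytes into the appropriate BIO, re-runs the handshake step, and submits the next SQE implied by the new state.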

Files to Modify

| File                      | Change                                        | Phase |
|---------------------------|-----------------------------------------------|-------|
| include/ProxySQL_Poll.h   | Add io_uring ring, SQE submission methods     | 1     |
| lib/ProxySQL_Poll.cpp     | io_uring init/teardown, poll→SQE conversion   | 1     |
| lib/PgSQL_Thread.cpp      | Replace poll() call with io_uring submit+reap | 1     |
| lib/Base_Thread.cpp       | Template methods for before/after poll        | 1-2   |
| lib/PgSQL_Data_Stream.cpp | Async read/write via ring for plaintext       | 2     |
| lib/PgSQL_Data_Stream.cpp | SSL completion handler for ring I/O           | 4     |
| CMakeLists.txt / Makefile | Link liburing, feature detection              | 1     |

Challenges

  1. libpq internal I/O: PQsendQuery() / PQconsumeInput() use their own
    socket I/O internally. ProxySQL's fork of libpq would need modification to
    use io_uring, or libpq calls would remain as traditional syscalls (limiting
    the benefit for backend I/O).

  2. Session lifecycle: Sessions are created and destroyed dynamically. The
    io_uring ring and any registered buffers must handle this churn.

  3. Backward compatibility: io_uring requires Linux 5.1+ (basic) or 5.6+
    (for IORING_OP_RECV/SEND). Older kernels need the poll() fallback.

  4. Testing: The async completion model changes the I/O ordering assumptions.
    Race conditions that were impossible with synchronous recv/send may appear.

Estimated Impact

| Phase                   | CPU reduction | Effort    |
|-------------------------|---------------|-----------|
| 1 (poll replacement)    | ~1%           | 1-2 weeks |
| 2 (async plaintext I/O) | ~3-5%         | 3-4 weeks |
| 3 (fixed buffers)       | ~1-2%         | 1-2 weeks |
| 4 (SSL integration)     | ~1%           | 2-3 weeks |

Total potential: ~5-8% CPU reduction, translating to a roughly proportional
TPS increase. The benefit scales with concurrency — the more sessions there
are, the more I/O operations get batched into each ring submission.

References

  • io_uring intro — Jens Axboe's original design doc
  • liburing — userspace library
  • Linux kernel 5.6+ for full RECV/SEND SQE support
  • OpenSSL memory BIO documentation for custom I/O integration
