PgSQL: investigate io_uring to reduce syscall overhead in event loop #5569

@renecannao

Description

Summary

Perf profiling shows the kernel TCP stack consumes ~56% of ProxySQL CPU time
under load, with ~25% in the TCP send/receive path and ~8% in nftables packet
filtering. The current event loop uses poll() + individual recv()/send()/
write() syscalls, resulting in thousands of syscalls per second per worker
thread.

io_uring could reduce this overhead by batching I/O submissions and avoiding
per-operation syscall transitions.

Profiling Evidence

Workload: oltp_read_write, 256 threads, SSL (TLSv1.3), pool 200, 30s

Syscall breakdown (from perf, 31,615 samples):

| Syscall    | Samples | % of I/O syscalls |
|------------|---------|-------------------|
| write      | 781     | 39.5%             |
| sendto     | 758     | 38.3%             |
| recvfrom   | 439     | 22.2%             |
| epoll_wait | 130     | —                 |

Kernel overhead:

| Function             | Self CPU % | Notes                  |
|----------------------|------------|------------------------|
| nft_do_chain         | 3.71%      | Firewall per-packet    |
| _copy_to_iter        | 1.30%      | Kernel↔user data copy  |
| tcp_sendmsg_locked   | 0.60%      | TCP send path          |
| tcp_v4_rcv           | 0.90%      | TCP receive path       |
| __tcp_transmit_skb   | 0.68%      | SKB construction       |
| tcp_ack              | 0.57%      | ACK processing         |

Total kernel: 56% of CPU. Of that, syscall entry/exit + data copying
accounts for an estimated 5-10%.

Current Architecture

The PgSQL event loop (PgSQL_Thread::run() in lib/PgSQL_Thread.cpp:3104)
follows this pattern per iteration:

1. ProcessAllMyDS_BeforePoll()  — set up poll events (POLLIN/POLLOUT)
2. poll(mypolls.fds, mypolls.len, ttw)  — single blocking syscall
3. ProcessAllMyDS_AfterPoll()  — check revents, call read_from_net()/write_to_net()
4. process_all_sessions()  — run session state machines
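The per-iteration pattern above can be illustrated with a minimal, self-contained sketch: plain poll() on a single fd standing in for the mypolls.fds array (names here are illustrative, not ProxySQL's):

```c
#include <poll.h>
#include <unistd.h>

// Minimal model of one event-loop iteration: arm POLLIN, block in a
// single poll() syscall, then service the fd if it became readable.
// Returns the number of bytes drained from fd, 0 on timeout, -1 on error.
static int one_poll_iteration(int fd, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);         // step 2: single blocking syscall
    if (rc <= 0) return rc;                     // timeout or error
    if (pfd.revents & POLLIN) {                 // step 3: check revents
        char buf[64];
        return (int)read(fd, buf, sizeof(buf)); // one read() per ready fd
    }
    return 0;
}
```

Note that every readable fd costs one extra read()/recv() syscall on top of the shared poll() — exactly the per-operation overhead the io_uring proposal targets.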

I/O paths:

  • Plaintext: recv(fd, buf, len, 0) / send(fd, buf, len, MSG_NOSIGNAL)
    in PgSQL_Data_Stream::read_from_net() / write_to_net()
  • SSL: recv() → BIO_write(rbio_ssl) → SSL_read() for decryption;
    SSL_write() → BIO_read(wbio_ssl) → write(fd) for encryption.
    Memory BIOs decouple SSL from socket I/O.
  • Backend (libpq): PQsendQuery() → PQflush() → poll for writability →
    PQconsumeInput() → PQgetResult(). All non-blocking.

Key data structures:

  • ProxySQL_Poll<PgSQL_Data_Stream> — manages struct pollfd[] array + per-FD
    metadata (data stream pointer, timestamps)
  • PgSQL_Data_Stream — owns queueIN/queueOUT buffers, SSL context,
    poll_fds_idx for O(1) poll array access
  • PgSQL_Connection — async state machine (ASYNC_ST enum) driving libpq
    non-blocking API

Proposed Implementation: Phased Approach

Phase 1: poll() → io_uring poll (lowest risk)

Replace poll() with IORING_OP_POLL_ADD to get event notification through
io_uring without changing any I/O code.

// Current:
rc = poll(mypolls.fds, mypolls.len, ttw);

// Phase 1:
// Submit POLL_ADD SQEs for all active FDs
for (int i = 0; i < mypolls.len; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, mypolls.fds[i].fd, mypolls.fds[i].events);
    io_uring_sqe_set_data(sqe, (void*)(uintptr_t)i);  // index for lookup
}
io_uring_submit(&ring);

// Wait up to ttw ms for at least one completion (mirrors poll()'s
// timeout), then drain every CQE that is ready
struct __kernel_timespec ts = { .tv_sec = ttw / 1000,
                                .tv_nsec = (ttw % 1000) * 1000000L };
struct io_uring_cqe *cqe;
if (io_uring_wait_cqe_timeout(&ring, &cqe, &ts) == 0) {
    unsigned head, nr = 0;
    io_uring_for_each_cqe(&ring, head, cqe) {
        int idx = (int)(uintptr_t)io_uring_cqe_get_data(cqe);
        mypolls.fds[idx].revents = (short)cqe->res;  // POLL_ADD res = event mask
        nr++;
    }
    io_uring_cq_advance(&ring, nr);
}

Impact: Minimal — same revents processing, no I/O path changes.
Risk: Low — falling back to poll() is trivial. Mostly a ProxySQL_Poll change.
Benefit: Collapses event waiting into a single io_uring_enter() per loop
iteration and lays the groundwork for the later phases.
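The poll() fallback needs a cheap capability probe at startup. One sketch (helper name is hypothetical; a real build would also feature-detect liburing at configure time) parses the running kernel version:

```c
#include <stdio.h>
#include <sys/utsname.h>

// Returns 1 if the running kernel is at least want_major.want_minor, else 0.
// io_uring needs 5.1+; IORING_OP_RECV/IORING_OP_SEND need 5.6+.
static int kernel_at_least(int want_major, int want_minor) {
    struct utsname u;
    if (uname(&u) != 0) return 0;
    int major = 0, minor = 0;
    if (sscanf(u.release, "%d.%d", &major, &minor) != 2) return 0;
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}
```

A version string alone is not sufficient in practice: seccomp profiles and container runtimes can still block io_uring_setup(), so the probe should also attempt io_uring_queue_init() once and fall back to poll() on failure.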

Phase 2: Async read/write for plaintext connections

For non-SSL data streams, replace recv()/send() with IORING_OP_RECV
/ IORING_OP_SEND SQEs.

// In ProcessAllMyDS_BeforePoll, instead of just setting POLLIN:
if (myds->encrypted == false) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, fd, myds->queue_w_ptr(queueIN), available, 0);
    io_uring_sqe_set_data(sqe, encode(idx, OP_READ));
}

Key change: read_from_net() and write_to_net() for plaintext become
completion handlers rather than initiators. The I/O is submitted before the
ring enter, and completions are processed after.

Impact: Batches multiple read/write operations into a single
io_uring_enter(). With 256 sessions, this could batch 100+ I/O operations
per syscall.

Risk: Medium — changes the I/O timing model. Partial reads/writes need
careful handling via short-read/write CQE flags.
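The encode(idx, OP_READ) helper in the Phase 2 snippet is hypothetical. One way to pack it, together with the short-read/short-write bookkeeping the risk note refers to (all names here are illustrative):

```c
#include <stdint.h>

enum { OP_READ = 0, OP_WRITE = 1 };

// Pack a poll-array index and an operation tag into the 64-bit
// user_data field carried from SQE to CQE.
static inline uint64_t encode(uint32_t idx, uint32_t op) {
    return ((uint64_t)op << 32) | idx;
}
static inline uint32_t decode_idx(uint64_t ud) { return (uint32_t)ud; }
static inline uint32_t decode_op(uint64_t ud)  { return (uint32_t)(ud >> 32); }

// Per-stream progress: a CQE's res may report a short read/write, so
// track how much of the queued buffer is done.
struct io_progress { uint32_t done, total; };

// Apply one completion result; returns 1 if another SQE must be
// submitted for the remaining bytes, 0 if the buffer is fully serviced.
static int apply_completion(struct io_progress *p, int32_t cqe_res) {
    if (cqe_res <= 0) return 0;   // error/EOF: handled elsewhere
    p->done += (uint32_t)cqe_res;
    return p->done < p->total;
}
```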

Phase 3: Fixed buffers for zero-copy

Register frequently-used buffers with io_uring_register_buffers() to
eliminate the _copy_to_iter overhead (1.30% of CPU).

// At thread init:
struct iovec iovs[MAX_SESSIONS * 2];
for (int i = 0; i < num_sessions; i++) {
    iovs[i*2].iov_base = session[i]->myds->queueIN.buffer;
    iovs[i*2].iov_len = QUEUE_BUFFER_SIZE;
    // ... similar for OUT
}
io_uring_register_buffers(&ring, iovs, count);

// Then use IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED

Risk: High — buffer lifetime must be carefully managed. Sessions come
and go, so buffer registration needs to be dynamic.
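One way to make registration effectively dynamic is a fixed pool of slots registered once at thread start and handed out through a free list, so session churn recycles slots instead of re-registering buffers (sketch with illustrative names; real code would pair this with a single io_uring_register_buffers() call at thread init):

```c
#include <stdint.h>

#define SLOT_COUNT 64   // buffers registered once at thread start

// Free list of registered-buffer indices: a session borrows a slot on
// connect and returns it on teardown; the registration itself is stable.
struct slot_pool {
    int16_t next_free[SLOT_COUNT];
    int16_t head;
};

static void pool_init(struct slot_pool *p) {
    for (int i = 0; i < SLOT_COUNT - 1; i++)
        p->next_free[i] = (int16_t)(i + 1);
    p->next_free[SLOT_COUNT - 1] = -1;  // end of free list
    p->head = 0;
}

static int pool_acquire(struct slot_pool *p) {   // returns -1 when exhausted
    int s = p->head;
    if (s >= 0) p->head = p->next_free[s];
    return s;
}

static void pool_release(struct slot_pool *p, int s) {
    p->next_free[s] = p->head;
    p->head = (int16_t)s;
}
```

Sessions beyond SLOT_COUNT would fall back to unregistered IORING_OP_RECV/SEND, keeping the hot connections on the zero-copy path.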

Phase 4: SSL integration

The SSL path already uses memory BIOs, which means the socket I/O is
decoupled from SSL. The integration pattern:

Current:  recv(fd) → BIO_write(rbio) → SSL_read() → app data
io_uring: RECV CQE → BIO_write(rbio) → SSL_read() → app data

The only change is how bytes get into rbio_ssl — from a recv() call
to an io_uring RECV completion. The SSL_read()/SSL_write() and BIO
layer remain unchanged.

// Completion handler for SSL read:
void handle_ssl_recv_completion(PgSQL_Data_Stream *myds, void *buf, int len) {
    BIO_write(myds->rbio_ssl, buf, len);
    if (!SSL_is_init_finished(myds->ssl)) {
        myds->do_ssl_handshake();
    } else {
        int n = SSL_read(myds->ssl, myds->queue_w_ptr(queueIN), available);
        // process decrypted data
    }
}

Risk: Medium — SSL handshake is multi-step and currently relies on
synchronous recv()/send() within a single read_from_net() call.
With io_uring, each step becomes a separate completion, requiring state
tracking across completions.
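That state tracking can be a small per-stream machine; a sketch (enum and function names are illustrative, not existing ProxySQL symbols — handshake_rc here models SSL_do_handshake()'s outcome as 1 = done, 0 = wants more ciphertext, -1 = has handshake bytes to send):

```c
// Handshake progress for one data stream. Each io_uring completion
// advances the machine instead of looping over synchronous recv()/send()
// inside a single read_from_net() call.
enum ssl_hs_state {
    HS_WANT_READ,    // need more ciphertext from a RECV completion
    HS_WANT_WRITE,   // wbio produced handshake bytes; await SEND completion
    HS_DONE          // SSL_is_init_finished() returned true
};

static enum ssl_hs_state next_state(int handshake_rc) {
    if (handshake_rc > 0) return HS_DONE;
    return handshake_rc == 0 ? HS_WANT_READ : HS_WANT_WRITE;
}
```

The stream stores the current state alongside its SSL context; each CQE feeds bytes into the appropriate BIO, re-runs the handshake step, and submits the next SQE implied by the new state.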

Files to Modify

| File                      | Change                                        | Phase |
|---------------------------|-----------------------------------------------|-------|
| include/ProxySQL_Poll.h   | Add io_uring ring, SQE submission methods     | 1     |
| lib/ProxySQL_Poll.cpp     | io_uring init/teardown, poll→SQE conversion   | 1     |
| lib/PgSQL_Thread.cpp      | Replace poll() call with io_uring submit+reap | 1     |
| lib/Base_Thread.cpp       | Template methods for before/after poll        | 1-2   |
| lib/PgSQL_Data_Stream.cpp | Async read/write via ring for plaintext       | 2     |
| lib/PgSQL_Data_Stream.cpp | SSL completion handler for ring I/O           | 4     |
| CMakeLists.txt / Makefile | Link liburing, feature detection              | 1     |

Challenges

  1. libpq internal I/O: PQsendQuery() / PQconsumeInput() use their own
    socket I/O internally. ProxySQL's fork of libpq would need modification to
    use io_uring, or libpq calls would remain as traditional syscalls (limiting
    the benefit for backend I/O).

  2. Session lifecycle: Sessions are created and destroyed dynamically. The
    io_uring ring and any registered buffers must handle this churn.

  3. Backward compatibility: io_uring requires Linux 5.1+ (basic) or 5.6+
    (for IORING_OP_RECV/SEND). Older kernels need the poll() fallback.

  4. Testing: The async completion model changes the I/O ordering assumptions.
    Race conditions that were impossible with synchronous recv/send may appear.

Estimated Impact

| Phase                   | CPU reduction | Effort    |
|-------------------------|---------------|-----------|
| 1 (poll replacement)    | ~1%           | 1-2 weeks |
| 2 (async plaintext I/O) | ~3-5%         | 3-4 weeks |
| 3 (fixed buffers)       | ~1-2%         | 1-2 weeks |
| 4 (SSL integration)     | ~1%           | 2-3 weeks |

Total potential: ~5-8% CPU reduction, translating to a roughly proportional
TPS increase. The benefit scales with concurrency — the more sessions there
are, the more I/O operations get batched into each ring submission.

References

  • io_uring intro — Jens Axboe's original design doc
  • liburing — userspace library
  • Linux kernel 5.6+ for full RECV/SEND SQE support
  • OpenSSL memory BIO documentation for custom I/O integration
