From 29e02f2f07ba237276c8c4bf027187c7068f842c Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 18:15:18 -0700 Subject: [PATCH 01/16] Add concurrent request processing post (ATS, HAProxy, Envoy) New post covering how reverse proxies handle concurrent connections at scale: the thin layer constraint, event loop model, and the concurrency architecture of Apache Traffic Server (continuation system), HAProxy (single-process nbthread), and Envoy (thread-per-core isolation). Includes 5 SVG diagrams: hero, thread-models comparison, ATS event thread architecture, HAProxy process model, and Envoy worker isolation. Co-Authored-By: Claude Sonnet 4.6 --- ...03-09-concurrent-requests-reverse-proxy.md | 149 ++++++++++++++++++ .../img/posts/proxy-concurrency/ats-arch.svg | 108 +++++++++++++ .../posts/proxy-concurrency/envoy-arch.svg | 110 +++++++++++++ .../posts/proxy-concurrency/haproxy-arch.svg | 95 +++++++++++ assets/img/posts/proxy-concurrency/hero.svg | 101 ++++++++++++ .../posts/proxy-concurrency/thread-models.svg | 114 ++++++++++++++ 6 files changed, 677 insertions(+) create mode 100644 _posts/2026-03-09-concurrent-requests-reverse-proxy.md create mode 100644 assets/img/posts/proxy-concurrency/ats-arch.svg create mode 100644 assets/img/posts/proxy-concurrency/envoy-arch.svg create mode 100644 assets/img/posts/proxy-concurrency/haproxy-arch.svg create mode 100644 assets/img/posts/proxy-concurrency/hero.svg create mode 100644 assets/img/posts/proxy-concurrency/thread-models.svg diff --git a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md new file mode 100644 index 0000000..22a4f94 --- /dev/null +++ b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md @@ -0,0 +1,149 @@ +--- +title: "How Reverse Proxies Handle Concurrent Connections at Scale: ATS, HAProxy, and Envoy" +description: "The bottleneck is not throughput — it is managing tens of thousands of simultaneous connections without blocking, without ballooning memory, and without dropping a request. Here is how ATS, HAProxy, and Envoy each solve that problem, and the tradeoffs each approach carries." +date: 2026-03-09 12:00:00 +0000 +categories: [Distributed Systems, Reverse Proxy] +tags: [reverse-proxy, load-balancing, distributed-systems, haproxy, envoy, ats, concurrency] +image: + path: /assets/img/posts/proxy-concurrency/hero.svg + alt: "A reverse proxy managing thousands of simultaneous client connections, some active and some idle keepalive, forwarding to backends" +--- + +The first instinct when measuring proxy performance is throughput: requests per second, gigabits per second. That is the wrong place to start. + +The real constraint at scale is **concurrent connection count**. A proxy in front of your entire service fleet holds thousands of open connections simultaneously — clients waiting for upstream data, upstream connections waiting for backends, keepalive connections sitting idle, WebSocket streams that have been open for hours. How the proxy manages all of that bookkeeping, without running out of memory, file descriptors, or CPU, determines whether requests at the tail of the latency distribution are served in milliseconds or seconds. + +## The Thin Layer Constraint + +A reverse proxy has a narrow job: receive bytes on one socket, enforce policy, forward bytes on another socket. "Enforce policy" covers a lot — TLS termination, header rewriting, authentication, rate limiting — but the core is moving bytes efficiently. 
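Reduced to a sketch, that job is small. Here is a minimal Python byte pump, with blocking sockets, a hypothetical hard-coded backend address, and every interesting policy omitted; it exists only to show the shape of the work:

```python
import socket, threading

BACKEND = ("127.0.0.1", 9000)  # hypothetical upstream, hard-coded for illustration

def pump(src, dst):
    # the entire core job: read bytes from one socket, write them to the other
    while chunk := src.recv(4096):
        dst.sendall(chunk)
    dst.close()  # peer closed; full shutdown handling is elided here

listener = socket.socket()
listener.bind(("0.0.0.0", 8080))
listener.listen()

while True:
    client, _ = listener.accept()
    upstream = socket.create_connection(BACKEND)
    # one pump per direction; "enforce policy" hooks in between recv and sendall
    threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pump, args=(upstream, client), daemon=True).start()
```

Everything a production proxy adds sits on top of that loop, and whatever it adds is paid for once per connection.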
+ +This creates what I call the thin layer constraint: **the proxy must consume the minimum resources necessary per connection, because it holds thousands of them simultaneously.** Every unnecessary byte allocated per connection, every lock acquired on the hot path, every avoidable system call — it multiplies by the connection count. + +At 10,000 concurrent connections: + +- 1 KB per-connection overhead = 10 MB total +- 10 KB per-connection overhead = 100 MB total +- 100 KB per-connection overhead = 1 GB total + +A proxy that allocates generously because it is convenient survives normal traffic and falls apart during load spikes. Memory pressure starts evicting pages, the kernel starts swapping, latency climbs at the 99th percentile. The degradation looks like a capacity problem when it is an architecture problem. + +## Thread-per-Connection: The Obvious Model That Does Not Scale + +The simplest way to handle concurrent connections is a thread (or process) per connection. Apache HTTPd used this (prefork MPM), it is straightforward to reason about, and each connection gets isolated execution with no shared state to worry about. A blocking read waiting for a slow client just blocks that thread. Other connections continue on their own threads. + +The problem is that threads are expensive. + +A thread on Linux consumes roughly 8 MB of virtual memory for its default stack. Even with a tuned 512 KB stack, 10,000 connections requires 5 GB of stack space before any application work is done. The OS scheduler now manages 10,000 threads. Context switching between them — saving and restoring registers, TLB pressure, cache eviction — adds up. At high connection counts the scheduler overhead appears directly in latency measurements. + +The C10K problem (serving 10,000 concurrent connections efficiently) was a real practical limit for this model in the late 1990s. The solution was not faster hardware. It was a different concurrency model. + +![Thread-per-connection: each connection owns one thread, memory scales with N; event loop: one thread manages thousands via kernel I/O readiness notifications](/assets/img/posts/proxy-concurrency/thread-models.svg) + +## The Event Loop: Separating Holding from Working + +Most of the time, a connection is not doing anything. It is waiting — for the client to send the next byte, for the backend to respond, for a slow upstream to unblock. A thread blocked on a slow client is wasted capacity. + +The event loop separates the concepts of holding a connection and doing work on it. + +An event loop uses the OS's I/O readiness notification interface — `epoll` on Linux, `kqueue` on macOS and BSD — to monitor many file descriptors simultaneously with a single thread. The OS watches thousands of sockets. When one becomes readable (client sent data) or writable (backend acknowledged data), it notifies the event loop. The loop wakes up, does exactly the work that is ready, and returns to waiting. + +No threads blocked on slow connections. No context switches between thousands of threads. One thread, one event loop, as many file descriptors as the OS allows. The `ulimit -n` setting, commonly raised to 65,535 or higher in production, is now the practical limit rather than thread memory. + +The tradeoff is programming model complexity. A blocking operation inside the event loop blocks the entire loop — every connection on that thread stalls. Everything must be written as non-blocking callbacks or coroutines. 
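Here is the callback shape in miniature: an echo server on Python's `selectors` module, which wraps epoll/kqueue. It is illustrative only; partial writes and error paths are not handled.

```python
import selectors, socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on macOS/BSD

def accept(listener):
    conn, _ = listener.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)  # attach a callback to this socket

def echo(conn):
    data = conn.recv(4096)  # never blocks: only called when the socket is readable
    if data:
        conn.sendall(data.upper())
    else:
        sel.unregister(conn)  # client closed; stop watching the socket
        conn.close()

listener = socket.socket()
listener.bind(("0.0.0.0", 8080))
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, accept)

while True:                      # the event loop itself
    for key, _ in sel.select():  # sleeps until the kernel reports readiness
        key.data(key.fileobj)    # invoke whichever callback was registered
```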
This is harder to write correctly and harder to debug than sequential threaded code. + +Each proxy covered here takes this base model and makes different tradeoffs around it. + +## Apache Traffic Server: Event Threads and the Continuation System + +ATS does not use a single event loop. It uses a pool of event threads — one per CPU core by default, configured via `proxy.config.exec.thread.limit` — each running its own independent event loop. + +When a new connection arrives, it lands at a dedicated accept thread and is dispatched round-robin to one of the ET_NET (event thread network) threads. That thread owns the connection for its lifetime. Connections do not migrate between threads. + +![ATS accept thread dispatching connections round-robin to ET_NET event thread pool; each thread has its own event loop and continuation queue; blocking in a plugin stalls all connections on that thread](/assets/img/posts/proxy-concurrency/ats-arch.svg) + +The programming model inside ATS is the **continuation system**. A continuation is a callback object with associated state: it says "when event X occurs, call this handler." Processing a request is a chain of continuations scheduled on the event thread. I/O completes, a continuation runs, schedules the next I/O operation, and the continuation is rescheduled when that I/O completes. The thread never waits; it always moves to the next ready event. + +The consequence for plugin authors is significant. ATS plugins hook into the request pipeline by registering continuations. If a plugin's handler makes a blocking system call — a synchronous DNS lookup, a blocking HTTP request to an external service, a filesystem read — it blocks the entire ET_NET thread. Every connection on that thread stops making progress until the blocking call returns. This is not a theoretical concern; it is the most common cause of latency spikes in production ATS deployments. + +**Where ATS is strong:** CDN-scale HTTP caching and forward proxying. The continuation model is purpose-built for cache hit/miss processing. The cache integration is deep — content storage, freshness evaluation, and origin fetching are all built into the continuation chain. Organizations running CDN edge nodes at billions of requests per day have done so on ATS for years. The TSAPI plugin interface lets you customize behavior at every stage of request processing. + +**Where ATS struggles:** The continuation model has a steep learning curve, and the plugin isolation story is weak. A misbehaving plugin degrades the thread it runs on. Configuration is dense, and performance tuning requires understanding internal thread and event queue sizing. For general-purpose reverse proxy use cases outside of caching workloads, the operational complexity is hard to justify. + +## HAProxy: Single-Process Discipline, Then Careful Parallelism + +HAProxy's original design was a single-process, single-thread event loop. One process, one epoll loop, all connections. Everything the proxy did was handled in sequence within that event loop. + +This sounds limiting, but it produced a proxy with extraordinary predictability. No shared state, no locks, no concurrent access problems to reason about. A single core running a tight epoll loop handles tens of thousands of connections with sub-millisecond median latency. The memory footprint was negligible: HAProxy's per-connection overhead has historically been in the low hundreds of bytes. + +HAProxy added multi-threading in version 1.8 via the `nbthread` directive. 
The design stayed single-process. Multiple threads run inside that process, each with its own epoll loop. + +![HAProxy: single process with shared accept socket via SO_REUSEPORT; nbthread workers each run an independent epoll loop; shared state protected by spinlocks](/assets/img/posts/proxy-concurrency/haproxy-arch.svg) + +New connections are distributed using `SO_REUSEPORT` — a socket option that lets multiple threads call `accept()` on the same port, with the kernel distributing connections across them. This removes the accept bottleneck without a shared queue or mutex. Each thread then manages its connections independently. + +Shared state — stick-tables, global request counters, server health information — is protected by per-object spinlocks rather than a global lock. The shared surface is small by design; HAProxy's data model has always minimized it. + +Configuration is explicit: + +``` +global + nbthread auto # one thread per available CPU core + +frontend http-in + bind :80 thread all # all threads accept on this frontend + bind :443 ssl crt /etc/ssl/certs/ thread 1-2 # pin TLS to threads 1-2 +``` + +The `thread` directive on `bind` lines lets you pin frontends to specific thread subsets, giving traffic isolation between workloads on a single HAProxy instance without running separate processes. + +Hot reload works through process replacement: `haproxy -sf $(cat /var/run/haproxy.pid)` starts a new process that takes over the listening sockets, while the old process drains its in-flight connections. No dropped requests, no configuration gap. + +**Where HAProxy is strong:** Pure efficiency and predictable latency in L4 and L7 load balancing scenarios. For environments where memory budget is constrained (appliances, shared infrastructure), where configuration must be auditable and straightforward, or where the stick-table and ACL system's power is needed without external dependencies, HAProxy is the standard choice. Its runtime API (socket commands) supports dynamic configuration of server weights, server state, and ACLs without a reload. + +**Where HAProxy struggles:** The threading model was added to a single-process design; at very high thread counts, spinlock contention on shared state can surface. Lua (the extension scripting language) runs on the event loop thread, so complex Lua logic adds latency to other connections on that thread. HAProxy is not designed for deep L7 programmability — complex request transformation logic that would be straightforward in Envoy's filter chain is awkward to express in HAProxy's ACL/action model. + +## Envoy: Thread-per-Core with Complete Isolation + +Envoy was designed for service mesh: a proxy running as a sidecar alongside every service instance in the fleet. That use case required properties none of the existing proxies optimized for — deep L7 programmability, dynamic reconfiguration without restarts, and a concurrency model that would not allow a bug in one connection's processing to affect any other connection. + +The architecture is thread-per-core with a strict constraint: **worker threads share nothing by design.** + +A listener thread accepts incoming connections and dispatches each to a worker thread via a consistent hash. From that moment, the connection belongs entirely to that worker: its TLS session, its upstream connection pool, the entire L7 filter chain executing its request. Workers do not communicate with each other for connection processing. 
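The model reduces to a sketch like this, with one asyncio loop per worker thread and dispatch by a hash of the client address. This is a toy illustration of the shape only; Envoy itself is C++ and does far more per connection.

```python
import asyncio, socket, threading

NUM_WORKERS = 4
workers = []  # one event loop per worker thread; nothing is shared between them

async def handle(conn):
    loop = asyncio.get_running_loop()  # always this worker's own loop
    data = await loop.sock_recv(conn, 4096)
    await loop.sock_sendall(conn, data.upper())
    conn.close()

def run_worker(loop):
    asyncio.set_event_loop(loop)
    loop.run_forever()

for _ in range(NUM_WORKERS):
    loop = asyncio.new_event_loop()
    threading.Thread(target=run_worker, args=(loop,), daemon=True).start()
    workers.append(loop)

listener = socket.socket()  # plays the "listener thread" role
listener.bind(("0.0.0.0", 8080))
listener.listen()

while True:
    conn, addr = listener.accept()
    conn.setblocking(False)
    worker = workers[hash(addr) % NUM_WORKERS]  # deterministic dispatch by address
    asyncio.run_coroutine_threadsafe(handle(conn), worker)  # the worker owns it now
```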
+ +Each worker runs its own libevent-based event loop and holds its own copy of the proxy configuration — delivered as a snapshot via the xDS protocol. When the control plane pushes a configuration update (a new backend, a changed route, a rotated certificate), each worker receives and applies it independently. No coordination between workers, no global pause, no lock. + +![Envoy listener thread dispatching to worker threads; each worker is completely isolated with its own event loop, connection pool, filter chain, and xDS config snapshot; no shared state between workers](/assets/img/posts/proxy-concurrency/envoy-arch.svg) + +The filter chain model is the other defining feature. Every request passes through a configured sequence of L4 and L7 filters. Each filter can read and modify the request: JWT validation, header manipulation, rate limit checking, gRPC transcoding, circuit breaking. Filters are composable and independently configurable. The per-worker isolation means a filter's state is always thread-local — no locking required within the filter chain. + +The xDS API is the interface between Envoy and its control plane (Istio, custom implementations, or static config with dynamic overrides). Adding a backend endpoint, changing a route's timeout, draining an instance before it is decommissioned — all are xDS updates pushed to each worker independently. This is the operational model that makes zero-downtime deployments at fleet scale tractable. + +**Where Envoy is strong:** Complex L7 processing, service mesh sidecars, API gateways where routing rules change frequently, and environments with control-plane infrastructure. The filter chain model handles workloads that would require custom code in HAProxy or ATS. The xDS integration is the right tool when the proxy's configuration is driven programmatically rather than by static files. + +**Where Envoy struggles:** Memory footprint is higher than HAProxy, primarily from per-worker state duplication — each worker holds its own upstream connection pool and config snapshot. The operational surface is larger: debugging a misconfigured filter chain is harder than reading a HAProxy ACL. Custom filters require C++ or WASM, a higher bar than Lua scripting. For straightforward L4/L7 load balancing without complex routing logic, Envoy's weight is harder to justify than HAProxy's. + +## Robustness Under the Thin Layer + +Being thin does not mean being fragile. Each model comes with specific mechanisms for maintaining service through failures. + +**Graceful restart** is how all three proxies handle configuration updates and version upgrades without dropping connections. HAProxy's `-sf` flag passes file descriptors to the new process, which takes the listening sockets while the old process drains. ATS's traffic_manager handles restart sequencing. Envoy's hot-restart protocol passes sockets between old and new processes; the drain timer controls how long the old process waits for in-flight requests to complete. The common pattern — new process takes the port, old process finishes its work — is non-negotiable for a proxy in a live path. + +**Circuit breaking** prevents backend failure from cascading into proxy resource exhaustion. When a backend is slow or failing, the proxy must stop sending it new connections before queues grow unbounded. Envoy's circuit breaker is per-cluster with configurable thresholds: maximum pending requests, active requests, retries, and connections. 
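In a static bootstrap config, those thresholds sit on the cluster definition. The numbers below are illustrative, not recommendations:

```yaml
clusters:
  - name: backend
    type: STRICT_DNS
    connect_timeout: 1s
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 1024       # cap on upstream connections to this cluster
          max_pending_requests: 256   # queue depth before new requests are rejected
          max_requests: 1024          # in-flight request cap (HTTP/2)
          max_retries: 3              # concurrent retry budget
    load_assignment:
      cluster_name: backend
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: backend.internal, port_value: 8080 }
```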
HAProxy uses `maxconn` per server with queue management and health-check-driven server state transitions. ATS manages this through origin server connection limiting and retry configuration. The implementation differs; the requirement is the same: a proxy that blindly queues connections to a failing backend eventually exhausts memory and takes itself down. + +**Connection draining on backend removal** ensures in-flight requests complete when a backend exits the pool. HAProxy's "drain" server state stops new connections while allowing existing ones to finish. Envoy's endpoint discovery transitions endpoints through a draining state before removal. This is operationally critical for deployments — a rolling deployment that removes backends without draining will drop a predictable fraction of requests proportional to the ratio of removed capacity to total capacity. + +## Choosing the Right Model + +The three architectures are not interchangeable. Each is optimized for a specific problem space. + +**Use ATS** when the workload is HTTP caching and forward proxying at CDN scale. If cache hit rates are high and the fast path (cache hit, no origin fetch) is the common case, ATS's continuation system is extremely efficient for it. The cache integration is the primary differentiator; if you need it, ATS is the right tool. + +**Use HAProxy** when you need the lowest possible overhead and the most predictable latency for L4 or L7 load balancing. When configuration is managed as static files, when the stick-table ACL system covers your session affinity and rate limiting needs, or when you are operating on constrained hardware, HAProxy's single-process model is the right fit. + +**Use Envoy** when the proxy needs to be programmatically configurable, when routing logic is complex and changing frequently, or when the proxy is operating as a sidecar in a service mesh. The xDS model and filter chain are purpose-built for control-plane-driven infrastructure. If the operational question is "how do I push a new routing rule without restarting anything?" the answer is Envoy. + +The concurrency model is not incidental to these choices. ATS's continuation system is inseparable from its cache architecture. HAProxy's single-process model is what makes its ACL evaluation so cheap and its memory footprint so small. Envoy's worker isolation is what makes its filter chain safely extensible without inter-connection interference. The proxy you choose is a choice about which of these properties matters most for your traffic pattern. + +--- + +*Working through proxy architecture decisions at scale? 
I am on [LinkedIn](https://www.linkedin.com/in/singhsanjay12) or reachable by [email](mailto:hello@singh-sanjay.com).* diff --git a/assets/img/posts/proxy-concurrency/ats-arch.svg b/assets/img/posts/proxy-concurrency/ats-arch.svg new file mode 100644 index 0000000..ba26034 --- /dev/null +++ b/assets/img/posts/proxy-concurrency/ats-arch.svg @@ -0,0 +1,108 @@ + + + + + + + + + + + + + + + + Apache Traffic Server — Event Thread Architecture + + + new + conns + + + + + + + + Accept Thread + listen · accept · dispatch + + + + + + + + + + + + + + + + + round-robin + dispatch + + + + + + + + ET_NET 0 + event loop · 12 conns + + + + + + + ET_NET 1 + event loop · 9 conns + + + + + + ET_NET 2 + event loop · 18 conns ⚠ + + + + + + + ET_NET 3 + event loop · 11 conns + + + + + + Continuation Model + + request arrives → schedule + continuation (callback) + I/O ready → invoke + continuation handler + handler completes or + re-schedules next step + ⚠ blocking in a handler + stalls all conns on thread + thread count = proxy.config + .exec.thread.limit + + + + + + + + + + ATS distributes connections across event threads. A blocking plugin call stalls every connection on that thread. + orange ET_NET 2 = overloaded thread · thread count set via proxy.config.exec.thread.limit + diff --git a/assets/img/posts/proxy-concurrency/envoy-arch.svg b/assets/img/posts/proxy-concurrency/envoy-arch.svg new file mode 100644 index 0000000..3bd5455 --- /dev/null +++ b/assets/img/posts/proxy-concurrency/envoy-arch.svg @@ -0,0 +1,110 @@ + + + + + + + + + + + + + + + + Envoy — Thread-per-Core, Complete Isolation + + + clients + + + + + + + + Listener Thread + accept · TLS dispatch + assigns conn → worker + + + + + + + + + + + + + + + + + Worker 0 + + + libevent loop + L4 · L7 filter chain + connection pool + xDS snapshot + 5 conns · 0 shared state + + + + Worker 1 + + + libevent loop + L4 · L7 filter chain + connection pool + xDS snapshot + 8 conns · 0 shared state + + + + Worker 2 + + + libevent loop + L4 · L7 filter chain + connection pool + xDS snapshot + 6 conns · 0 shared state + + + + Worker 3 + + + libevent loop + L4 · L7 filter chain + connection pool + xDS snapshot + 4 conns · 0 shared state + + + + + NO SHARED STATE + + + + + + xDS + Control + Plane + + + + hot-push to + each worker + + + + Envoy workers are islands: each owns its connections, filter chain, and config snapshot. Nothing shared. + xDS pushes config to each worker independently · listener thread assigns new connections by hash + diff --git a/assets/img/posts/proxy-concurrency/haproxy-arch.svg b/assets/img/posts/proxy-concurrency/haproxy-arch.svg new file mode 100644 index 0000000..fa89048 --- /dev/null +++ b/assets/img/posts/proxy-concurrency/haproxy-arch.svg @@ -0,0 +1,95 @@ + + + + + + + + + + + + + + + + HAProxy — Single Process, nbthread Workers + + + clients + + + + + + + + Single Process (haproxy) + + + + Shared Accept Socket + SO_REUSEPORT · kernel distributes new connections + + + + + + + + + + + + + + + Thread 0 + + + + epoll loop + 8 conns + + + → backends + + + + Thread 1 + + + epoll loop + 11 conns + → backends + + + + Thread 2 + + + epoll loop + 7 conns + → backends + + + + Thread 3 + + + epoll loop + 9 conns + → backends + + + + Threads share: stick-tables, global counters, server health state — protected by a per-object spinlock + + + nbthread auto | thread-groups N | bind … thread 1-4 + + + + HAProxy is a single process: one socket, nbthread workers, minimal shared state. Predictably thin. 
+ SO_REUSEPORT lets the kernel load-balance accept() across threads without a global lock + diff --git a/assets/img/posts/proxy-concurrency/hero.svg b/assets/img/posts/proxy-concurrency/hero.svg new file mode 100644 index 0000000..2f979b3 --- /dev/null +++ b/assets/img/posts/proxy-concurrency/hero.svg @@ -0,0 +1,101 @@ + + + + + + + + + + + + + + + + + CLIENTS + REVERSE PROXY + BACKENDS + + + + + + + + + + + + + + + + + Client 1 + + + Client 2 + + + Client 3 + + + Client 4 + + + Client 5 + + + Client 6 + + ⋯ 9,994 more + + + + Reverse + Proxy + event loop + + + + + thin layer + + + + Backend A + 3 active + + + Backend B + 5 active + + + Backend C + idle + + + + + + + + + + + + + + + + active + + idle keepalive + + + + The proxy manages thousands of connections simultaneously — adding microseconds, not milliseconds. + solid = active request · dashed = idle keepalive · dot = request in flight + diff --git a/assets/img/posts/proxy-concurrency/thread-models.svg b/assets/img/posts/proxy-concurrency/thread-models.svg new file mode 100644 index 0000000..eefe1c8 --- /dev/null +++ b/assets/img/posts/proxy-concurrency/thread-models.svg @@ -0,0 +1,114 @@ + + + + + + + + + + + + + + + + + + + Thread-per-Connection + + + + + + + + + + Thread 1 · 1 conn + + + + + + Thread 2 · 1 conn + + + + + + Thread 3 · 1 conn + + + + + + Thread 4 · 1 conn + + + + + + Thread 5 · blocked + + + ⋯ N threads + + + + Cost: N × ~1 MB stack memory + OS context switches per request · blocking stalls all I/O on that thread + + + Event Loop + + + + + + + + + + + + + + + + + + + + + + ⋯ thousands + + + + + + + epoll/ + kqueue + Event Loop + + + + + + callbacks + only when + I/O ready + + + + Cost: 1 thread, fixed memory regardless of connection count + No context switch · I/O wait is non-blocking · kernel notifies when data is ready + + + + Thread-per-connection scales with memory. The event loop scales with kernel I/O notifications. 
+ orange thread = blocked on slow client · green callbacks = work only when I/O is ready + From d28ceb626e890e3d3538fb6ed64d4e61d5049bbb Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 18:41:14 -0700 Subject: [PATCH 02/16] Improve proxy-concurrency diagrams with cleaner visuals MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - thread-models.svg: bold VS divider, proper arrow markers, larger event loop box with arc icon centered, clean fan-in lines from connection dots - ats-arch.svg: bezier dispatch curves with arrowheads, compact ET_NET column with arc icons, continuation model panel with plugin warning - haproxy-arch.svg: fixed epoll arc+arrowhead icons, SO_REUSEPORT dispatch arrows, hot-reload callout, shared-state annotation inside process box - envoy-arch.svg: 2×2 worker grid with isolated badges, bold NO SHARED STATE barriers with pill labels, xDS arrows from control plane box - hero.svg: column separators, arrow markers, load bars on backends, arc loop icon and ~10K badge in proxy box Co-Authored-By: Claude Sonnet 4.6 --- .../img/posts/proxy-concurrency/ats-arch.svg | 162 ++++++++------- .../posts/proxy-concurrency/envoy-arch.svg | 180 +++++++++-------- .../posts/proxy-concurrency/haproxy-arch.svg | 138 +++++++------ assets/img/posts/proxy-concurrency/hero.svg | 178 +++++++++-------- .../posts/proxy-concurrency/thread-models.svg | 185 +++++++++--------- 5 files changed, 453 insertions(+), 390 deletions(-) diff --git a/assets/img/posts/proxy-concurrency/ats-arch.svg b/assets/img/posts/proxy-concurrency/ats-arch.svg index ba26034..be512f2 100644 --- a/assets/img/posts/proxy-concurrency/ats-arch.svg +++ b/assets/img/posts/proxy-concurrency/ats-arch.svg @@ -7,6 +7,12 @@ + + + + + + @@ -15,94 +21,100 @@ Apache Traffic Server — Event Thread Architecture - - new - conns - - - - + + new conns + + + + - - - Accept Thread - listen · accept · dispatch + + + Accept Thread + listen · accept · dispatch - - - - - - - + + + + + + - - - - - + + round-robin - - round-robin - dispatch - - - - + + - - ET_NET 0 - event loop · 12 conns - - - + + ET_NET 0 + event loop · 12 conns + + + + + + + + - - ET_NET 1 - event loop · 9 conns - - + + ET_NET 1 + event loop · 9 conns + + + + + - - - ET_NET 2 - event loop · 18 conns ⚠ - - - + + + ET_NET 2 ⚠ overloaded + event loop · 18 conns · plugin blocking + + + + + + + - - ET_NET 3 - event loop · 11 conns - - + + ET_NET 3 + event loop · 11 conns + + + + + - - - Continuation Model - - request arrives → schedule - continuation (callback) - I/O ready → invoke - continuation handler - handler completes or - re-schedules next step - ⚠ blocking in a handler - stalls all conns on thread - thread count = proxy.config - .exec.thread.limit + + + Continuation Model + + request arrives + → schedule callback + I/O ready + → invoke handler + handler completes + → schedule next step + + ⚠ Plugin blocking rule: + any blocking call inside + a handler stalls ALL + connections on that thread + + proxy.config.exec.thread.limit + = number of ET_NET threads - - - - - + + - - ATS distributes connections across event threads. A blocking plugin call stalls every connection on that thread. - orange ET_NET 2 = overloaded thread · thread count set via proxy.config.exec.thread.limit + + ATS distributes connections across event threads. A blocking plugin call stalls every connection on that thread. 
+ orange ET_NET 2 = thread blocked by plugin · thread count set via proxy.config.exec.thread.limit diff --git a/assets/img/posts/proxy-concurrency/envoy-arch.svg b/assets/img/posts/proxy-concurrency/envoy-arch.svg index 3bd5455..11b3fa1 100644 --- a/assets/img/posts/proxy-concurrency/envoy-arch.svg +++ b/assets/img/posts/proxy-concurrency/envoy-arch.svg @@ -7,104 +7,120 @@ + + + + + + - Envoy — Thread-per-Core, Complete Isolation + Envoy — Thread-per-Core, Complete Worker Isolation - clients - - - - - - - - Listener Thread - accept · TLS dispatch - assigns conn → worker - - - - - - - - - - - - - + clients + + + + + + + + Listener Thread + accept · TLS handshake + hash conn → worker + + + + + + + + + + - - Worker 0 - - - libevent loop - L4 · L7 filter chain - connection pool - xDS snapshot - 5 conns · 0 shared state + + Worker 0 + + + + libevent loop + L4 · L7 filter chain + upstream conn pool + xDS config snapshot + + 5 conns · 0 shared state - - Worker 1 - - - libevent loop - L4 · L7 filter chain - connection pool - xDS snapshot - 8 conns · 0 shared state + + Worker 1 + + + libevent loop + L4 · L7 filter chain + upstream conn pool + xDS config snapshot + + 8 conns · 0 shared state - - Worker 2 - - - libevent loop - L4 · L7 filter chain - connection pool - xDS snapshot - 6 conns · 0 shared state + + Worker 2 + + + libevent loop + L4 · L7 filter chain + upstream conn pool + xDS config snapshot + + 6 conns · 0 shared state - - Worker 3 - - - libevent loop - L4 · L7 filter chain - connection pool - xDS snapshot - 4 conns · 0 shared state - - - - - NO SHARED STATE - - - - - - xDS - Control - Plane - - - - hot-push to - each worker + + Worker 3 + + + libevent loop + L4 · L7 filter chain + upstream conn pool + xDS config snapshot + + 4 conns · 0 shared state + + + + + + NO SHARED STATE + + + + + + + xDS Control + Plane + + routes · endpoints + certs · timeouts + hot-push to each worker + + + + + + + + --concurrency N (default: hardware threads) · config via xDS API or static bootstrap YAML - - Envoy workers are islands: each owns its connections, filter chain, and config snapshot. Nothing shared. - xDS pushes config to each worker independently · listener thread assigns new connections by hash + + Envoy workers are islands: each owns its connections, filter chain, and config snapshot. Nothing shared. 
+ xDS pushes config to each worker independently · red barriers = zero cross-worker communication diff --git a/assets/img/posts/proxy-concurrency/haproxy-arch.svg b/assets/img/posts/proxy-concurrency/haproxy-arch.svg index fa89048..5967ebb 100644 --- a/assets/img/posts/proxy-concurrency/haproxy-arch.svg +++ b/assets/img/posts/proxy-concurrency/haproxy-arch.svg @@ -7,6 +7,15 @@ + + + + + + + + + @@ -16,80 +25,89 @@ HAProxy — Single Process, nbthread Workers - clients - - - - + clients + + + + - - - Single Process (haproxy) + + + Single Process (haproxy) - - - Shared Accept Socket - SO_REUSEPORT · kernel distributes new connections + + + Shared Accept Socket + SO_REUSEPORT · kernel distributes new connections across threads - - - - - + + + + + + - - - - + + - - Thread 0 - - - - epoll loop - 8 conns - - - → backends + + Thread 0 + + + + epoll loop + 8 conns + + + backends ↓ - - Thread 1 - - - epoll loop - 11 conns - → backends + + Thread 1 + + + epoll loop + 11 conns + + backends ↓ - - Thread 2 - - - epoll loop - 7 conns - → backends + + Thread 2 + + + epoll loop + 7 conns + + backends ↓ - - Thread 3 - - - epoll loop - 9 conns - → backends + + Thread 3 + + + epoll loop + 9 conns + + backends ↓ - - - Threads share: stick-tables, global counters, server health state — protected by a per-object spinlock + + + Threads share: stick-tables · global counters · server health state — each protected by a per-object spinlock - - nbthread auto | thread-groups N | bind … thread 1-4 + + nbthread auto · bind :80 thread all · bind :443 thread 1-2 + + + + Hot Reload + haproxy -sf PID + new proc takes socket + old proc drains - - HAProxy is a single process: one socket, nbthread workers, minimal shared state. Predictably thin. - SO_REUSEPORT lets the kernel load-balance accept() across threads without a global lock + + HAProxy is a single process: SO_REUSEPORT distributes connections, each thread runs its own epoll loop. + shared state is minimal by design · spinlocks on stick-tables, not on the connection hot path diff --git a/assets/img/posts/proxy-concurrency/hero.svg b/assets/img/posts/proxy-concurrency/hero.svg index 2f979b3..bd92630 100644 --- a/assets/img/posts/proxy-concurrency/hero.svg +++ b/assets/img/posts/proxy-concurrency/hero.svg @@ -4,98 +4,116 @@ - + + + + - - - CLIENTS - REVERSE PROXY - BACKENDS - - - - - - - - - - - - - - - - - Client 1 - - - Client 2 - - - Client 3 + + + - - Client 4 - - - Client 5 - - - Client 6 - - ⋯ 9,994 more - - - - Reverse - Proxy - event loop - - - - - thin layer + + CLIENTS + REVERSE PROXY + BACKENDS + + + + + + + + + + + + + + + + + Client 1 + + + + Client 2 + + + + Client 3 + + + + Client 4 + + + + Client 5 + + ⋯ 9,995 more connections + + + + + + + + Reverse Proxy + event loop · non-blocking I/O + + + ~10,000 concurrent - - Backend A - 3 active - - - Backend B - 5 active - - - Backend C - idle - - - - - - - - - - - - + + + Backend A + + + 3 req + + + + Backend B + + + 5 req + + + + Backend C + + idle + + + + + + + + + + + + - - active - - idle keepalive + + + active request + + idle keepalive + load bars = backend utilization - - The proxy manages thousands of connections simultaneously — adding microseconds, not milliseconds. - solid = active request · dashed = idle keepalive · dot = request in flight + + The proxy manages thousands of connections simultaneously — adding microseconds, not milliseconds. 
+ blue dot = active request in flight · dashed line = idle keepalive held open by the OS diff --git a/assets/img/posts/proxy-concurrency/thread-models.svg b/assets/img/posts/proxy-concurrency/thread-models.svg index eefe1c8..5afc401 100644 --- a/assets/img/posts/proxy-concurrency/thread-models.svg +++ b/assets/img/posts/proxy-concurrency/thread-models.svg @@ -7,108 +7,107 @@ + + + - - - - - Thread-per-Connection - - - - - - - - - - Thread 1 · 1 conn - - - - - - Thread 2 · 1 conn - - - - - - Thread 3 · 1 conn - - - - - - Thread 4 · 1 conn - - - - - - Thread 5 · blocked - - - ⋯ N threads - - - - Cost: N × ~1 MB stack memory - OS context switches per request · blocking stalls all I/O on that thread - - - Event Loop - - - - - - - - - - - - - - - - - - - - - - ⋯ thousands + + + VS + + + Thread-per-Connection + + + + + + Thread 1 · idle wait + + + + + + Thread 2 · idle wait + + + + + + Thread 3 · idle wait + + + + + + Thread 4 · idle wait + + + + + + Thread 5 · BLOCKED ⚠ + + + ⋯ N threads total + + + + Memory: N × ~1 MB stack space + OS schedules N threads · blocked thread wastes CPU + + + Event Loop (epoll / kqueue) + + + + + + + + + ⋯ thousands + + + + + + + + + + + + idle - - - - - epoll/ - kqueue - Event Loop - - - - - - callbacks - only when - I/O ready + + + + + + + + Event Loop + 1 thread · kernel I/O + socket becomes readable + → wake up · do work + → return to waiting + + + N connections · 1 thread - - Cost: 1 thread, fixed memory regardless of connection count - No context switch · I/O wait is non-blocking · kernel notifies when data is ready + + Memory: fixed · no N multiplier + No context switch · idle conns cost nothing · OS does the waiting - - Thread-per-connection scales with memory. The event loop scales with kernel I/O notifications. - orange thread = blocked on slow client · green callbacks = work only when I/O is ready + + Thread-per-connection scales with memory. The event loop scales with OS I/O notifications. + orange = thread blocked on slow client · dashed line = idle keepalive connection · blue dot = active request From 9883c29e8ddb41f201081cac023e4a6647fafc3b Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 18:43:05 -0700 Subject: [PATCH 03/16] Add GitHub account check to CLAUDE.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Always verify singhsanjay12 is the active gh account before creating PRs — ssingh1_LinkedIn (Enterprise Managed User) cannot create PRs on this repo. Co-Authored-By: Claude Sonnet 4.6 --- CLAUDE.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/CLAUDE.md b/CLAUDE.md index 785e08f..035013a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -5,3 +5,19 @@ - **Never push directly to `main`.** All changes must go through a pull request, no exceptions — including config changes, quick fixes, and single-line edits. - Always create a new branch before making changes. Branch names must follow the `ssingh1/` convention (e.g. `ssingh1/add-analytics`). - Open a PR after pushing the branch. Only merge after review. + +## GitHub Account + +This repo belongs to the personal account **`singhsanjay12`**, not the LinkedIn managed account. 
+ +Before creating any PR, verify the active account: + +```bash +gh auth status +``` + +If the active account is `ssingh1_LinkedIn`, switch before creating the PR: + +```bash +gh auth switch --user singhsanjay12 +``` From 11f4618eaac568c8483f1f7bc587ff707a53069e Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 18:43:40 -0700 Subject: [PATCH 04/16] Remove specific account name from GitHub account check Co-Authored-By: Claude Sonnet 4.6 --- CLAUDE.md | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 035013a..067b54c 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -8,16 +8,11 @@ ## GitHub Account -This repo belongs to the personal account **`singhsanjay12`**, not the LinkedIn managed account. +This repo belongs to the personal account **`singhsanjay12`**. -Before creating any PR, verify the active account: +Before creating any PR, verify the correct account is active and switch if needed: ```bash gh auth status -``` - -If the active account is `ssingh1_LinkedIn`, switch before creating the PR: - -```bash gh auth switch --user singhsanjay12 ``` From ed5fc3879469dab549a13071aade93e6a19e020e Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 18:44:47 -0700 Subject: [PATCH 05/16] Fix haproxy-arch.svg failing to load due to duplicate font-family attribute Duplicate XML attributes are invalid and prevent SVG parsing entirely. Merged into a single font-family value. Co-Authored-By: Claude Sonnet 4.6 --- assets/img/posts/proxy-concurrency/haproxy-arch.svg | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/assets/img/posts/proxy-concurrency/haproxy-arch.svg b/assets/img/posts/proxy-concurrency/haproxy-arch.svg index 5967ebb..c88f70d 100644 --- a/assets/img/posts/proxy-concurrency/haproxy-arch.svg +++ b/assets/img/posts/proxy-concurrency/haproxy-arch.svg @@ -97,7 +97,7 @@ Threads share: stick-tables · global counters · server health state — each protected by a per-object spinlock - nbthread auto · bind :80 thread all · bind :443 thread 1-2 + nbthread auto · bind :80 thread all · bind :443 thread 1-2 From b18937903a25f47effbabade16b72d0a09ec51bd Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 18:46:09 -0700 Subject: [PATCH 06/16] Add duplicate-attribute test; fix envoy-arch.svg MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit tests/svg_test.rb: new test_no_duplicate_attributes scans every opening tag for repeated attribute names — the class of bug that silently broke haproxy-arch.svg and envoy-arch.svg (two font-family= on one element). envoy-arch.svg: same copy-paste duplicate font-family= fixed, caught immediately by the new test. 
Co-Authored-By: Claude Sonnet 4.6 --- .../posts/proxy-concurrency/envoy-arch.svg | 2 +- tests/svg_test.rb | 20 +++++++++++++++++++ 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/assets/img/posts/proxy-concurrency/envoy-arch.svg b/assets/img/posts/proxy-concurrency/envoy-arch.svg index 11b3fa1..5d8b776 100644 --- a/assets/img/posts/proxy-concurrency/envoy-arch.svg +++ b/assets/img/posts/proxy-concurrency/envoy-arch.svg @@ -117,7 +117,7 @@ - --concurrency N (default: hardware threads) · config via xDS API or static bootstrap YAML + --concurrency N (default: hardware threads) · config via xDS API or static bootstrap YAML diff --git a/tests/svg_test.rb b/tests/svg_test.rb index 126b371..0a589a9 100644 --- a/tests/svg_test.rb +++ b/tests/svg_test.rb @@ -119,6 +119,26 @@ def test_caption_divider_uses_standard_stroke end end + # ── Duplicate attributes ────────────────────────────────────────────────── + + # Duplicate attributes on the same XML element are invalid and cause the + # entire SVG to fail to parse — the browser renders nothing. + # (Caused haproxy-arch.svg to go blank: two font-family= on one tag.) + def test_no_duplicate_attributes + all_svgs.each do |path| + content = File.read(path) + # Match each opening tag; \s+(name)= captures attribute names preceded + # by whitespace, avoiding false matches inside attribute values. + content.scan(/<[a-zA-Z][^>]*>/m) do |tag| + attr_names = tag.scan(/\s+([\w-]+)=["']/).flatten + duplicates = attr_names.group_by { |n| n }.select { |_, v| v.size > 1 }.keys + assert duplicates.empty?, + "#{path}: duplicate attribute(s) #{duplicates.inspect} on element: " \ + "#{tag[0, 80].strip}…" + end + end + end + # ── Bottom margin ───────────────────────────────────────────────────────── # For the standard viewBox (y = -25 to 395) no text baseline should sit From 403b32697be092664e7c994f7455e75024e8cd09 Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 18:47:29 -0700 Subject: [PATCH 07/16] Style HAProxy config block with Chirpy file label Adds {: file="haproxy.cfg" } and nginx syntax highlighting to the configuration snippet so it renders as a named code box in the theme. Co-Authored-By: Claude Sonnet 4.6 --- _posts/2026-03-09-concurrent-requests-reverse-proxy.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md index 22a4f94..e397763 100644 --- a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md +++ b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md @@ -85,7 +85,7 @@ Shared state — stick-tables, global request counters, server health informatio Configuration is explicit: -``` +```nginx global nbthread auto # one thread per available CPU core @@ -93,6 +93,7 @@ frontend http-in bind :80 thread all # all threads accept on this frontend bind :443 ssl crt /etc/ssl/certs/ thread 1-2 # pin TLS to threads 1-2 ``` +{: file="haproxy.cfg" } The `thread` directive on `bind` lines lets you pin frontends to specific thread subsets, giving traffic isolation between workloads on a single HAProxy instance without running separate processes. 
From 967c767e17bb10a268221dbdfee9ea6658eccd81 Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 18:53:30 -0700 Subject: [PATCH 08/16] Fix HAProxy config block rendering with self-contained HTML code box MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The blog disables Rouge (syntax_highlighter: disable: true), so Chirpy's {: file="..." } IAL and language specifiers do not produce styled output — they only added broken attributes to
<pre>. Replaced with a custom HTML <div>
block: dark #1e293b background, file icon + label header, monospace code,
and greyed-out inline comments. Renders correctly regardless of theme processing.

Co-Authored-By: Claude Sonnet 4.6 
---
 ...6-03-09-concurrent-requests-reverse-proxy.md | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md
index e397763..135eb72 100644
--- a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md
+++ b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md
@@ -85,15 +85,18 @@ Shared state — stick-tables, global request counters, server health informatio
 
 Configuration is explicit:
 
-```nginx
-global
-    nbthread auto          # one thread per available CPU core
+<div style="background:#1e293b; border-radius:6px; margin:1rem 0; overflow-x:auto">
+  <div style="padding:6px 12px; border-bottom:1px solid #334155; color:#94a3b8; font-family:monospace; font-size:0.85em">
+    <svg viewBox="0 0 16 16" width="12" height="12" fill="currentColor"><path d="M4 1h6l3 3v11H4V1z"/></svg> haproxy.cfg
+  </div>
+<pre style="margin:0; padding:12px; color:#e2e8f0; font-family:monospace; font-size:0.9em">global
+    nbthread auto          <span style="color:#64748b"># one thread per available CPU core</span>
 
 frontend http-in
-    bind :80 thread all    # all threads accept on this frontend
-    bind :443 ssl crt /etc/ssl/certs/ thread 1-2  # pin TLS to threads 1-2
-```
-{: file="haproxy.cfg" }
+    bind :80 thread all    <span style="color:#64748b"># all threads accept on this frontend</span>
+    bind :443 ssl crt /etc/ssl/certs/ thread 1-2  <span style="color:#64748b"># pin TLS to threads 1-2</span>
+</pre>
+</div>
The `thread` directive on `bind` lines lets you pin frontends to specific thread subsets, giving traffic isolation between workloads on a single HAProxy instance without running separate processes. From 59b32773c73d2a4da9d6f3f83e7b7bdfb86a29c8 Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 18:56:06 -0700 Subject: [PATCH 09/16] Add thread-per-connection and event loop code examples Two Python examples illustrate the concurrency model contrast: - Thread model: recv() blocks the OS thread; 10k clients = 10k threads - Event loop: await yields control; one thread serves thousands Both use the same dark HTML code box as the HAProxy config snippet, with keywords, literals, and comments lightly syntax-coloured. Co-Authored-By: Claude Sonnet 4.6 --- ...03-09-concurrent-requests-reverse-proxy.md | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md index 135eb72..466a4ca 100644 --- a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md +++ b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md @@ -33,6 +33,28 @@ The simplest way to handle concurrent connections is a thread (or process) per c The problem is that threads are expensive. +
+<div style="background:#1e293b; border-radius:6px; margin:1rem 0; overflow-x:auto">
+  <div style="padding:6px 12px; border-bottom:1px solid #334155; color:#94a3b8; font-family:monospace; font-size:0.85em">
+    <svg viewBox="0 0 16 16" width="12" height="12" fill="currentColor"><path d="M4 1h6l3 3v11H4V1z"/></svg> thread-per-connection — Python
+  </div>
+<pre style="margin:0; padding:12px; color:#e2e8f0; font-family:monospace; font-size:0.9em">import socket, threading
+
+def handle(conn):
+    data = conn.recv(4096)   # blocks here — thread is stuck until client sends
+    conn.sendall(data.upper())
+    conn.close()
+
+server = socket.socket()
+server.bind(('0.0.0.0', 8080))
+server.listen()
+
+while True:
+    conn, _ = server.accept()
+    threading.Thread(target=handle, args=(conn,)).start()
+    # a new OS thread for every connection — 10,000 clients = 10,000 threads
+</pre>
+</div>
+ A thread on Linux consumes roughly 8 MB of virtual memory for its default stack. Even with a tuned 512 KB stack, 10,000 connections requires 5 GB of stack space before any application work is done. The OS scheduler now manages 10,000 threads. Context switching between them — saving and restoring registers, TLB pressure, cache eviction — adds up. At high connection counts the scheduler overhead appears directly in latency measurements. The C10K problem (serving 10,000 concurrent connections efficiently) was a real practical limit for this model in the late 1990s. The solution was not faster hardware. It was a different concurrency model. @@ -49,6 +71,27 @@ An event loop uses the OS's I/O readiness notification interface — `epoll` on No threads blocked on slow connections. No context switches between thousands of threads. One thread, one event loop, as many file descriptors as the OS allows. The `ulimit -n` setting, commonly raised to 65,535 or higher in production, is now the practical limit rather than thread memory. +
+<div style="background:#1e293b; border-radius:6px; margin:1rem 0; overflow-x:auto">
+  <div style="padding:6px 12px; border-bottom:1px solid #334155; color:#94a3b8; font-family:monospace; font-size:0.85em">
+    <svg viewBox="0 0 16 16" width="12" height="12" fill="currentColor"><path d="M4 1h6l3 3v11H4V1z"/></svg> event loop — Python asyncio
+  </div>
+<pre style="margin:0; padding:12px; color:#e2e8f0; font-family:monospace; font-size:0.9em">import asyncio
+
+async def handle(reader, writer):
+    data = await reader.read(4096)  # yields — other connections run while we wait
+    writer.write(data.upper())
+    await writer.drain()             # yields again while the kernel flushes the write
+    writer.close()
+
+async def main():
+    server = await asyncio.start_server(handle, '0.0.0.0', 8080)
+    async with server:
+        await server.serve_forever()  # one thread, handles thousands of connections
+
+asyncio.run(main())
+</pre>
+</div>
+ The tradeoff is programming model complexity. A blocking operation inside the event loop blocks the entire loop — every connection on that thread stalls. Everything must be written as non-blocking callbacks or coroutines. This is harder to write correctly and harder to debug than sequential threaded code. Each proxy covered here takes this base model and makes different tradeoffs around it. From 521f10fa8b7501ca2e7c99d0269c2f0c93572c68 Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 19:04:37 -0700 Subject: [PATCH 10/16] Fix overlapping lines in envoy-arch.svg Three issues fixed: - Dispatch arrows to W2 and W3 were passing through W0 and W1 boxes. W2 arrow now routes above the grid (y=16), W3 routes below (y=320), so both arc cleanly around the outside without crossing any box. - Horizontal NO SHARED STATE barrier was at y=162, inside W0/W2 (which end at y=165). Moved W1/W3 down to y=175, creating a proper 10px gap; barrier now sits at y=170 between the rows. - xDS second arrow was a raw diagonal line. Replaced both xDS arrows with smooth curves routing from xDS box bottom to each worker's right-centre edge. Co-Authored-By: Claude Sonnet 4.6 --- .../posts/proxy-concurrency/envoy-arch.svg | 126 ++++++++++-------- 1 file changed, 71 insertions(+), 55 deletions(-) diff --git a/assets/img/posts/proxy-concurrency/envoy-arch.svg b/assets/img/posts/proxy-concurrency/envoy-arch.svg index 5d8b776..b1025a2 100644 --- a/assets/img/posts/proxy-concurrency/envoy-arch.svg +++ b/assets/img/posts/proxy-concurrency/envoy-arch.svg @@ -34,40 +34,50 @@ accept · TLS handshake hash conn → worker - - - - - - - - - + + + + + + + + + + + Worker 0 - libevent loop - L4 · L7 filter chain + L4 · L7 filter chain upstream conn pool xDS config snapshot - 5 conns · 0 shared state - - - - Worker 1 - - - libevent loop - L4 · L7 filter chain - upstream conn pool - xDS config snapshot - - 8 conns · 0 shared state + 5 conns · 0 shared state + + + + Worker 1 + + + libevent loop + L4 · L7 filter chain + upstream conn pool + xDS config snapshot + + 8 conns · 0 shared state @@ -75,49 +85,55 @@ libevent loop - L4 · L7 filter chain + L4 · L7 filter chain upstream conn pool xDS config snapshot - 6 conns · 0 shared state - - - - Worker 3 - - - libevent loop - L4 · L7 filter chain - upstream conn pool - xDS config snapshot - - 4 conns · 0 shared state + 6 conns · 0 shared state + + + + Worker 3 + + + libevent loop + L4 · L7 filter chain + upstream conn pool + xDS config snapshot + + 4 conns · 0 shared state - - - - NO SHARED STATE - - - + + + + NO SHARED STATE + + - xDS Control - Plane - - routes · endpoints - certs · timeouts + xDS Control + Plane + + routes · endpoints + certs · timeouts hot-push to each worker - - - + + + + + + ↓ each worker - - --concurrency N (default: hardware threads) · config via xDS API or static bootstrap YAML + + --concurrency N (default: hardware threads) · config via xDS API or static bootstrap YAML From 6a082f18017a771f60a3e92a430e2d7695f425f6 Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 19:33:00 -0700 Subject: [PATCH 11/16] Redesign envoy-arch.svg: single-column worker layout MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace the 2×2 grid with a vertical stack of 4 workers so dispatch arrows can fan out cleanly from the listener through a dedicated 76-px lane (x=254–328) without any crossing. 
Key changes: - Workers now stacked vertically (W0–W3, each 190×62 px, 16-px gaps) - Listener Thread centred vertically alongside the worker column (y=167) - Dispatch S-curves all confined to x=254–328, control points at x=292 - xDS arrows confined to x=520–580, control points at x=555 - Three horizontal NO SHARED STATE barriers (y=90, 168, 246); centre barrier is labelled with the red pill - All text baselines ≤385 px (within viewBox bounds); passes SVG tests Co-Authored-By: Claude Sonnet 4.6 --- .../posts/proxy-concurrency/envoy-arch.svg | 189 ++++++++---------- 1 file changed, 85 insertions(+), 104 deletions(-) diff --git a/assets/img/posts/proxy-concurrency/envoy-arch.svg b/assets/img/posts/proxy-concurrency/envoy-arch.svg index b1025a2..6dec6df 100644 --- a/assets/img/posts/proxy-concurrency/envoy-arch.svg +++ b/assets/img/posts/proxy-concurrency/envoy-arch.svg @@ -21,122 +21,103 @@ Envoy — Thread-per-Core, Complete Worker Isolation - - clients - - - - - - - - Listener Thread - accept · TLS handshake - hash conn → worker + + clients + + + + - - - - - - - - - + + + Listener Thread + accept · TLS handshake + hash conn → worker + + + + + + + + Worker 0 + + libevent loop · L4/L7 filter chain + upstream conn pool · xDS snapshot + + 5 conns · 0 shared state + + + + + + + Worker 1 + + libevent loop · L4/L7 filter chain + upstream conn pool · xDS snapshot + + 8 conns · 0 shared state + + + + + NO SHARED STATE + + + + Worker 2 + + libevent loop · L4/L7 filter chain + upstream conn pool · xDS snapshot + + 6 conns · 0 shared state + + + + + + + Worker 3 + + libevent loop · L4/L7 filter chain + upstream conn pool · xDS snapshot + + 4 conns · 0 shared state - - - Worker 0 - - - libevent loop - L4 · L7 filter chain - upstream conn pool - xDS config snapshot - - 5 conns · 0 shared state - - - - Worker 1 - - - libevent loop - L4 · L7 filter chain - upstream conn pool - xDS config snapshot - - 8 conns · 0 shared state - - - - Worker 2 - - - libevent loop - L4 · L7 filter chain - upstream conn pool - xDS config snapshot - - 6 conns · 0 shared state - - - - Worker 3 - - - libevent loop - L4 · L7 filter chain - upstream conn pool - xDS config snapshot - - 4 conns · 0 shared state - - - - - - NO SHARED STATE - - - - - - xDS Control - Plane - - routes · endpoints - certs · timeouts - hot-push to each worker + + + xDS Control + Plane + + routes · endpoints + certs · timeouts + hot-push to each worker - - - - - ↓ each worker + + + + - - --concurrency N (default: hardware threads) · config via xDS API or static bootstrap YAML + + --concurrency N (default: hardware threads) · config via xDS API or static bootstrap YAML - + Envoy workers are islands: each owns its connections, filter chain, and config snapshot. Nothing shared. 
xDS pushes config to each worker independently · red barriers = zero cross-worker communication From f9938850aa6ed96aa4868095cbcd4045f2eb432f Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 19:36:39 -0700 Subject: [PATCH 12/16] Add external links and talk reference to proxy-concurrency post - Link section headers to official docs: Apache Traffic Server, HAProxy, Envoy - Link body terms: Apache HTTPd, C10K problem, epoll, kqueue, TSAPI plugin interface, libevent, xDS protocol, Istio - Add HAProxy User Spotlight talk at the bottom: Modernizing LinkedIn's Traffic Stack Co-Authored-By: Claude Sonnet 4.6 --- ...03-09-concurrent-requests-reverse-proxy.md | 20 ++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md index 466a4ca..c92ee0f 100644 --- a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md +++ b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md @@ -29,7 +29,7 @@ A proxy that allocates generously because it is convenient survives normal traff ## Thread-per-Connection: The Obvious Model That Does Not Scale -The simplest way to handle concurrent connections is a thread (or process) per connection. Apache HTTPd used this (prefork MPM), it is straightforward to reason about, and each connection gets isolated execution with no shared state to worry about. A blocking read waiting for a slow client just blocks that thread. Other connections continue on their own threads. +The simplest way to handle concurrent connections is a thread (or process) per connection. [Apache HTTPd](https://httpd.apache.org/) used this (prefork MPM), it is straightforward to reason about, and each connection gets isolated execution with no shared state to worry about. A blocking read waiting for a slow client just blocks that thread. Other connections continue on their own threads. The problem is that threads are expensive. @@ -57,7 +57,7 @@ server.listen() A thread on Linux consumes roughly 8 MB of virtual memory for its default stack. Even with a tuned 512 KB stack, 10,000 connections requires 5 GB of stack space before any application work is done. The OS scheduler now manages 10,000 threads. Context switching between them — saving and restoring registers, TLB pressure, cache eviction — adds up. At high connection counts the scheduler overhead appears directly in latency measurements. -The C10K problem (serving 10,000 concurrent connections efficiently) was a real practical limit for this model in the late 1990s. The solution was not faster hardware. It was a different concurrency model. +The [C10K problem](http://www.kegel.com/c10k.html) (serving 10,000 concurrent connections efficiently) was a real practical limit for this model in the late 1990s. The solution was not faster hardware. It was a different concurrency model. ![Thread-per-connection: each connection owns one thread, memory scales with N; event loop: one thread manages thousands via kernel I/O readiness notifications](/assets/img/posts/proxy-concurrency/thread-models.svg) @@ -67,7 +67,7 @@ Most of the time, a connection is not doing anything. It is waiting — for the The event loop separates the concepts of holding a connection and doing work on it. -An event loop uses the OS's I/O readiness notification interface — `epoll` on Linux, `kqueue` on macOS and BSD — to monitor many file descriptors simultaneously with a single thread. The OS watches thousands of sockets. 
When one becomes readable (client sent data) or writable (backend acknowledged data), it notifies the event loop. The loop wakes up, does exactly the work that is ready, and returns to waiting. +An event loop uses the OS's I/O readiness notification interface — [`epoll`](https://man7.org/linux/man-pages/man7/epoll.7.html) on Linux, [`kqueue`](https://man.freebsd.org/cgi/man.cgi?kqueue) on macOS and BSD — to monitor many file descriptors simultaneously with a single thread. The OS watches thousands of sockets. When one becomes readable (client sent data) or writable (backend acknowledged data), it notifies the event loop. The loop wakes up, does exactly the work that is ready, and returns to waiting. No threads blocked on slow connections. No context switches between thousands of threads. One thread, one event loop, as many file descriptors as the OS allows. The `ulimit -n` setting, commonly raised to 65,535 or higher in production, is now the practical limit rather than thread memory. @@ -96,7 +96,7 @@ The tradeoff is programming model complexity. A blocking operation inside the ev Each proxy covered here takes this base model and makes different tradeoffs around it. -## Apache Traffic Server: Event Threads and the Continuation System +## [Apache Traffic Server](https://trafficserver.apache.org/): Event Threads and the Continuation System ATS does not use a single event loop. It uses a pool of event threads — one per CPU core by default, configured via `proxy.config.exec.thread.limit` — each running its own independent event loop. @@ -108,11 +108,11 @@ The programming model inside ATS is the **continuation system**. A continuation The consequence for plugin authors is significant. ATS plugins hook into the request pipeline by registering continuations. If a plugin's handler makes a blocking system call — a synchronous DNS lookup, a blocking HTTP request to an external service, a filesystem read — it blocks the entire ET_NET thread. Every connection on that thread stops making progress until the blocking call returns. This is not a theoretical concern; it is the most common cause of latency spikes in production ATS deployments. -**Where ATS is strong:** CDN-scale HTTP caching and forward proxying. The continuation model is purpose-built for cache hit/miss processing. The cache integration is deep — content storage, freshness evaluation, and origin fetching are all built into the continuation chain. Organizations running CDN edge nodes at billions of requests per day have done so on ATS for years. The TSAPI plugin interface lets you customize behavior at every stage of request processing. +**Where ATS is strong:** CDN-scale HTTP caching and forward proxying. The continuation model is purpose-built for cache hit/miss processing. The cache integration is deep — content storage, freshness evaluation, and origin fetching are all built into the continuation chain. Organizations running CDN edge nodes at billions of requests per day have done so on ATS for years. The [TSAPI](https://docs.trafficserver.apache.org/en/latest/developer-guide/plugins/plugin-interfaces.en.html) plugin interface lets you customize behavior at every stage of request processing. **Where ATS struggles:** The continuation model has a steep learning curve, and the plugin isolation story is weak. A misbehaving plugin degrades the thread it runs on. Configuration is dense, and performance tuning requires understanding internal thread and event queue sizing. 
For general-purpose reverse proxy use cases outside of caching workloads, the operational complexity is hard to justify. -## HAProxy: Single-Process Discipline, Then Careful Parallelism +## [HAProxy](https://www.haproxy.org/): Single-Process Discipline, Then Careful Parallelism HAProxy's original design was a single-process, single-thread event loop. One process, one epoll loop, all connections. Everything the proxy did was handled in sequence within that event loop. @@ -149,7 +149,7 @@ Hot reload works through process replacement: `haproxy -sf $(cat /var/run/haprox **Where HAProxy struggles:** The threading model was added to a single-process design; at very high thread counts, spinlock contention on shared state can surface. Lua (the extension scripting language) runs on the event loop thread, so complex Lua logic adds latency to other connections on that thread. HAProxy is not designed for deep L7 programmability — complex request transformation logic that would be straightforward in Envoy's filter chain is awkward to express in HAProxy's ACL/action model. -## Envoy: Thread-per-Core with Complete Isolation +## [Envoy](https://www.envoyproxy.io/): Thread-per-Core with Complete Isolation Envoy was designed for service mesh: a proxy running as a sidecar alongside every service instance in the fleet. That use case required properties none of the existing proxies optimized for — deep L7 programmability, dynamic reconfiguration without restarts, and a concurrency model that would not allow a bug in one connection's processing to affect any other connection. @@ -157,13 +157,13 @@ The architecture is thread-per-core with a strict constraint: **worker threads s A listener thread accepts incoming connections and dispatches each to a worker thread via a consistent hash. From that moment, the connection belongs entirely to that worker: its TLS session, its upstream connection pool, the entire L7 filter chain executing its request. Workers do not communicate with each other for connection processing. -Each worker runs its own libevent-based event loop and holds its own copy of the proxy configuration — delivered as a snapshot via the xDS protocol. When the control plane pushes a configuration update (a new backend, a changed route, a rotated certificate), each worker receives and applies it independently. No coordination between workers, no global pause, no lock. +Each worker runs its own [libevent](https://libevent.org/)-based event loop and holds its own copy of the proxy configuration — delivered as a snapshot via the [xDS protocol](https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol). When the control plane pushes a configuration update (a new backend, a changed route, a rotated certificate), each worker receives and applies it independently. No coordination between workers, no global pause, no lock. ![Envoy listener thread dispatching to worker threads; each worker is completely isolated with its own event loop, connection pool, filter chain, and xDS config snapshot; no shared state between workers](/assets/img/posts/proxy-concurrency/envoy-arch.svg) The filter chain model is the other defining feature. Every request passes through a configured sequence of L4 and L7 filters. Each filter can read and modify the request: JWT validation, header manipulation, rate limit checking, gRPC transcoding, circuit breaking. Filters are composable and independently configurable. The per-worker isolation means a filter's state is always thread-local — no locking required within the filter chain. 
-The xDS API is the interface between Envoy and its control plane (Istio, custom implementations, or static config with dynamic overrides). Adding a backend endpoint, changing a route's timeout, draining an instance before it is decommissioned — all are xDS updates pushed to each worker independently. This is the operational model that makes zero-downtime deployments at fleet scale tractable. +The xDS API is the interface between Envoy and its control plane ([Istio](https://istio.io/), custom implementations, or static config with dynamic overrides). Adding a backend endpoint, changing a route's timeout, draining an instance before it is decommissioned — all are xDS updates pushed to each worker independently. This is the operational model that makes zero-downtime deployments at fleet scale tractable. **Where Envoy is strong:** Complex L7 processing, service mesh sidecars, API gateways where routing rules change frequently, and environments with control-plane infrastructure. The filter chain model handles workloads that would require custom code in HAProxy or ATS. The xDS integration is the right tool when the proxy's configuration is driven programmatically rather than by static files. @@ -193,4 +193,6 @@ The concurrency model is not incidental to these choices. ATS's continuation sys --- +*I covered LinkedIn's experience with these proxies at scale in the HAProxy User Spotlight Series: [Modernizing LinkedIn's Traffic Stack](https://www.haproxy.com/user-spotlight-series/modernizing-linkedins-traffic-stack).* + *Working through proxy architecture decisions at scale? I am on [LinkedIn](https://www.linkedin.com/in/singhsanjay12) or reachable by [email](mailto:hello@singh-sanjay.com).* From 55041fcc6d050094c9d00ed8b1eb2c8445c454ce Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 19:45:02 -0700 Subject: [PATCH 13/16] Fix text overlap in thread-models.svg event loop panel MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two issues fixed: 1. Arc loop icon inside the event loop box was r=16 (~50px wide), whose right side extended into the text label zone. Text like "socket becomes readable" (centred at x=635, left edge ~x=574) overlapped the arc polygon (x≤579). Fix: shrink arc to r=10, reposition to top-left corner (x≈548-569, y≈118-133), shift all box text and badge from centre x=635 to x=648, giving ≥15 px clearance. 2. "idle" label (x=455, y=173, light-grey fill) sat directly on a fan-in line that passes through (455, ~175) in the same grey tone, making the label invisible. Fix: add opaque white background rect and darken label fill from #94a3b8 to #64748b. 
Co-Authored-By: Claude Sonnet 4.6 --- .../posts/proxy-concurrency/thread-models.svg | 27 ++++++++++--------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/assets/img/posts/proxy-concurrency/thread-models.svg b/assets/img/posts/proxy-concurrency/thread-models.svg index 5afc401..ac1a631 100644 --- a/assets/img/posts/proxy-concurrency/thread-models.svg +++ b/assets/img/posts/proxy-concurrency/thread-models.svg @@ -80,26 +80,27 @@ [diff body unrecoverable: SVG markup stripped in extraction. Surviving text nodes: "idle" label; event-loop panel text "Event Loop / 1 thread · kernel I/O / socket becomes readable / → wake up · do work / → return to waiting"; badge "N connections · 1 thread"] From 7950d3fe6981150aa3163b4d97dc7d8ab4ac252e Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 20:00:22 -0700 Subject: [PATCH 14/16] Fix text visibility issues in proxy-concurrency diagrams MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - thread-models.svg: move '⋯ thousands' label above the green annotation box (y=303→289) so it is not covered - hero.svg: right-align backend load labels ('3 req', '5 req', 'idle') at x=775 with text-anchor=end so they don't bleed into the box border - ats-arch.svg: expand ET_NET 2 box height 52→64 and split the long sublabel into two lines; reposition arc icon and dispatch arrow to match the new box centre Co-Authored-By: Claude Sonnet 4.6 --- assets/img/posts/proxy-concurrency/ats-arch.svg | 13 +++++++------ assets/img/posts/proxy-concurrency/hero.svg | 6 +++--- .../img/posts/proxy-concurrency/thread-models.svg | 8 ++++---- 3 files changed, 14 insertions(+), 13 deletions(-) diff --git a/assets/img/posts/proxy-concurrency/ats-arch.svg b/assets/img/posts/proxy-concurrency/ats-arch.svg index be512f2..f8f35a2 100644 --- a/assets/img/posts/proxy-concurrency/ats-arch.svg +++ b/assets/img/posts/proxy-concurrency/ats-arch.svg @@ -37,7 +37,7 @@ @@ -70,16 +70,17 @@ @@ -111,7 +112,7 @@ [three hunks unrecoverable: SVG markup stripped in extraction. Surviving text nodes: "ET_NET 2 ⚠ overloaded"; sublabel "event loop · 18 conns · plugin blocking" split onto two lines; "= number of ET_NET threads"] diff --git a/assets/img/posts/proxy-concurrency/hero.svg b/assets/img/posts/proxy-concurrency/hero.svg index bd92630..2332914 100644 --- a/assets/img/posts/proxy-concurrency/hero.svg +++ b/assets/img/posts/proxy-concurrency/hero.svg @@ -77,20 +77,20 @@ [hunk unrecoverable: SVG markup stripped in extraction. Surviving text nodes: "Backend A / 3 req", "Backend B / 5 req", "Backend C / idle", right-aligned in the new version] diff --git a/assets/img/posts/proxy-concurrency/thread-models.svg b/assets/img/posts/proxy-concurrency/thread-models.svg index ac1a631..562f846 100644 --- a/assets/img/posts/proxy-concurrency/thread-models.svg +++ b/assets/img/posts/proxy-concurrency/thread-models.svg @@ -70,7 +70,7 @@ @@ -88,9 +88,9 @@ [two hunks unrecoverable: SVG markup stripped in extraction. Surviving text nodes: "⋯ thousands" label; "Event Loop"] From c1599a4ecc5614d8ba2feb73a6c1f719d873e697 Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 20:05:38 -0700 Subject: [PATCH 15/16] Replace em dashes in proxy-concurrency post to pass lint MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit test_em_dash_count_in_post_sources caps em dashes at 3; the post had 32. Replaced all prose em dashes with colons, semicolons, commas, or parentheses as appropriate. Also updated code-label spans and inline code comments (thread-per-connection · Python, yields; other ...). Front-matter description is exempt from the check so its em dash is left as-is.
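The test's source is not part of this patch series; a rough sketch of what a check like `test_em_dash_count_in_post_sources` might look like, with the glob path and front-matter exemption logic assumed rather than taken from the repo:

```python
import re
from pathlib import Path

EM_DASH = "\u2014"   # the character the lint counts
MAX_EM_DASHES = 3    # cap described in the commit message

def prose_of(post: Path) -> str:
    text = post.read_text(encoding="utf-8")
    # Front matter is exempt: drop everything between the leading '---' fences.
    match = re.match(r"---\n.*?\n---\n", text, flags=re.DOTALL)
    return text[match.end():] if match else text

def test_em_dash_count_in_post_sources():
    for post in sorted(Path("_posts").glob("*.md")):
        count = prose_of(post).count(EM_DASH)
        assert count <= MAX_EM_DASHES, (
            f"{post}: {count} em dashes (limit {MAX_EM_DASHES})"
        )
```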
Co-Authored-By: Claude Sonnet 4.6 --- ...03-09-concurrent-requests-reverse-proxy.md | 50 +++++++++---------- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md index c92ee0f..adb20ba 100644 --- a/_posts/2026-03-09-concurrent-requests-reverse-proxy.md +++ b/_posts/2026-03-09-concurrent-requests-reverse-proxy.md @@ -11,13 +11,13 @@ image: The first instinct when measuring proxy performance is throughput: requests per second, gigabits per second. That is the wrong place to start. -The real constraint at scale is **concurrent connection count**. A proxy in front of your entire service fleet holds thousands of open connections simultaneously — clients waiting for upstream data, upstream connections waiting for backends, keepalive connections sitting idle, WebSocket streams that have been open for hours. How the proxy manages all of that bookkeeping, without running out of memory, file descriptors, or CPU, determines whether requests at the tail of the latency distribution are served in milliseconds or seconds. +The real constraint at scale is **concurrent connection count**. A proxy in front of your entire service fleet holds thousands of open connections simultaneously: clients waiting for upstream data, upstream connections waiting for backends, keepalive connections sitting idle, WebSocket streams that have been open for hours. How the proxy manages all of that bookkeeping, without running out of memory, file descriptors, or CPU, determines whether requests at the tail of the latency distribution are served in milliseconds or seconds. ## The Thin Layer Constraint -A reverse proxy has a narrow job: receive bytes on one socket, enforce policy, forward bytes on another socket. "Enforce policy" covers a lot — TLS termination, header rewriting, authentication, rate limiting — but the core is moving bytes efficiently. +A reverse proxy has a narrow job: receive bytes on one socket, enforce policy, forward bytes on another socket. "Enforce policy" covers a lot (TLS termination, header rewriting, authentication, rate limiting), but the core is moving bytes efficiently. -This creates what I call the thin layer constraint: **the proxy must consume the minimum resources necessary per connection, because it holds thousands of them simultaneously.** Every unnecessary byte allocated per connection, every lock acquired on the hot path, every avoidable system call — it multiplies by the connection count. +This creates what I call the thin layer constraint: **the proxy must consume the minimum resources necessary per connection, because it holds thousands of them simultaneously.** Every unnecessary byte allocated per connection, every lock acquired on the hot path, every avoidable system call: each multiplies by the connection count. At 10,000 concurrent connections: @@ -36,12 +36,12 @@ The problem is that threads are expensive.
- thread-per-connection — Python
+ thread-per-connection · Python
import socket, threading
 
 def handle(conn):
-    data = conn.recv(4096)   # blocks here — thread is stuck until client sends
+    data = conn.recv(4096)   # blocks here; thread is stuck until client sends
     conn.sendall(data.upper())
     conn.close()
 
@@ -52,10 +52,10 @@ server.listen()
 while True:
     conn, _ = server.accept()
     threading.Thread(target=handle, args=(conn,)).start()
-    # a new OS thread for every connection — 10,000 clients = 10,000 threads
+    # a new OS thread per connection; 10,000 clients = 10,000 threads
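One knob worth knowing before writing the model off entirely: CPython lets you shrink the per-thread stack reservation, which is where the arithmetic in the next paragraph comes from. A minimal sketch, not part of the original post:

```python
import threading

# Must be called before the threads are created: 512 KB per thread instead
# of the platform default stack (commonly 8 MB of virtual memory on Linux).
threading.stack_size(512 * 1024)

t = threading.Thread(target=lambda: None)  # this thread gets the 512 KB stack
t.start()
t.join()
```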
-A thread on Linux consumes roughly 8 MB of virtual memory for its default stack. Even with a tuned 512 KB stack, 10,000 connections requires 5 GB of stack space before any application work is done. The OS scheduler now manages 10,000 threads. Context switching between them — saving and restoring registers, TLB pressure, cache eviction — adds up. At high connection counts the scheduler overhead appears directly in latency measurements. +A thread on Linux consumes roughly 8 MB of virtual memory for its default stack. Even with a tuned 512 KB stack, 10,000 connections requires 5 GB of stack space before any application work is done. The OS scheduler now manages 10,000 threads. Context switching between them (saving and restoring registers, TLB pressure, cache eviction) adds up. At high connection counts the scheduler overhead appears directly in latency measurements. The [C10K problem](http://www.kegel.com/c10k.html) (serving 10,000 concurrent connections efficiently) was a real practical limit for this model in the late 1990s. The solution was not faster hardware. It was a different concurrency model. @@ -63,23 +63,23 @@ The [C10K problem](http://www.kegel.com/c10k.html) (serving 10,000 concurrent co ## The Event Loop: Separating Holding from Working -Most of the time, a connection is not doing anything. It is waiting — for the client to send the next byte, for the backend to respond, for a slow upstream to unblock. A thread blocked on a slow client is wasted capacity. +Most of the time, a connection is not doing anything. It is waiting: for the client to send the next byte, for the backend to respond, for a slow upstream to unblock. A thread blocked on a slow client is wasted capacity. The event loop separates the concepts of holding a connection and doing work on it. -An event loop uses the OS's I/O readiness notification interface — [`epoll`](https://man7.org/linux/man-pages/man7/epoll.7.html) on Linux, [`kqueue`](https://man.freebsd.org/cgi/man.cgi?kqueue) on macOS and BSD — to monitor many file descriptors simultaneously with a single thread. The OS watches thousands of sockets. When one becomes readable (client sent data) or writable (backend acknowledged data), it notifies the event loop. The loop wakes up, does exactly the work that is ready, and returns to waiting. +An event loop uses the OS's I/O readiness notification interface ([`epoll`](https://man7.org/linux/man-pages/man7/epoll.7.html) on Linux, [`kqueue`](https://man.freebsd.org/cgi/man.cgi?kqueue) on macOS and BSD) to monitor many file descriptors simultaneously with a single thread. The OS watches thousands of sockets. When one becomes readable (client sent data) or writable (backend acknowledged data), it notifies the event loop. The loop wakes up, does exactly the work that is ready, and returns to waiting. No threads blocked on slow connections. No context switches between thousands of threads. One thread, one event loop, as many file descriptors as the OS allows. The `ulimit -n` setting, commonly raised to 65,535 or higher in production, is now the practical limit rather than thread memory.
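Inspecting and raising that descriptor ceiling from inside the process, sketched with Python's stdlib `resource` module (illustrative; the post itself does not include this):

```python
import resource

# Per-process file descriptor ceiling, as a (soft, hard) pair;
# `ulimit -n` reports the soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# An unprivileged process may raise its soft limit up to the hard limit;
# raising the hard limit itself requires privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```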
- event loop — Python asyncio
+ event loop · Python asyncio
import asyncio
 
 async def handle(reader, writer):
-    data = await reader.read(4096)  # yields — other connections run while we wait
+    data = await reader.read(4096)  # yields; other connections run while we wait
     writer.write(data.upper())
     await writer.drain()             # yields again while the kernel flushes the write
     writer.close()
@@ -92,13 +92,13 @@ No threads blocked on slow connections. No context switches between thousands of
 asyncio.run(main())
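The same readiness loop written one level down, directly against the kernel interface, via Python's `selectors` module; a sketch, not part of the original post (`DefaultSelector` picks `epoll` on Linux and `kqueue` on BSD/macOS):

```python
import selectors
import socket

sel = selectors.DefaultSelector()        # epoll on Linux, kqueue on BSD/macOS

def accept(server: socket.socket) -> None:
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn: socket.socket) -> None:
    data = conn.recv(4096)               # will not block: the fd is ready
    if data:
        conn.send(data.upper())          # a real proxy would also wait for
    else:                                # EVENT_WRITE before writing back
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("0.0.0.0", 8080))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:                              # the event loop itself
    for key, _ in sel.select():          # sleeps in the kernel, not on a socket
        key.data(key.fileobj)            # run the handler stored at register()
```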
-The tradeoff is programming model complexity. A blocking operation inside the event loop blocks the entire loop — every connection on that thread stalls. Everything must be written as non-blocking callbacks or coroutines. This is harder to write correctly and harder to debug than sequential threaded code. +The tradeoff is programming model complexity. A blocking operation inside the event loop blocks the entire loop; every connection on that thread stalls. Everything must be written as non-blocking callbacks or coroutines. This is harder to write correctly and harder to debug than sequential threaded code. Each proxy covered here takes this base model and makes different tradeoffs around it. ## [Apache Traffic Server](https://trafficserver.apache.org/): Event Threads and the Continuation System -ATS does not use a single event loop. It uses a pool of event threads — one per CPU core by default, configured via `proxy.config.exec.thread.limit` — each running its own independent event loop. +ATS does not use a single event loop. It uses a pool of event threads, one per CPU core by default, configured via `proxy.config.exec.thread.limit`, each running its own independent event loop. When a new connection arrives, it lands at a dedicated accept thread and is dispatched round-robin to one of the ET_NET (event thread network) threads. That thread owns the connection for its lifetime. Connections do not migrate between threads. @@ -106,9 +106,9 @@ When a new connection arrives, it lands at a dedicated accept thread and is disp The programming model inside ATS is the **continuation system**. A continuation is a callback object with associated state: it says "when event X occurs, call this handler." Processing a request is a chain of continuations scheduled on the event thread. I/O completes, a continuation runs, schedules the next I/O operation, and the continuation is rescheduled when that I/O completes. The thread never waits; it always moves to the next ready event. -The consequence for plugin authors is significant. ATS plugins hook into the request pipeline by registering continuations. If a plugin's handler makes a blocking system call — a synchronous DNS lookup, a blocking HTTP request to an external service, a filesystem read — it blocks the entire ET_NET thread. Every connection on that thread stops making progress until the blocking call returns. This is not a theoretical concern; it is the most common cause of latency spikes in production ATS deployments. +The consequence for plugin authors is significant. ATS plugins hook into the request pipeline by registering continuations. If a plugin's handler makes a blocking system call (a synchronous DNS lookup, a blocking HTTP request to an external service, or a filesystem read), it blocks the entire ET_NET thread. Every connection on that thread stops making progress until the blocking call returns. This is not a theoretical concern; it is the most common cause of latency spikes in production ATS deployments. -**Where ATS is strong:** CDN-scale HTTP caching and forward proxying. The continuation model is purpose-built for cache hit/miss processing. The cache integration is deep — content storage, freshness evaluation, and origin fetching are all built into the continuation chain. Organizations running CDN edge nodes at billions of requests per day have done so on ATS for years. 
The [TSAPI](https://docs.trafficserver.apache.org/en/latest/developer-guide/plugins/plugin-interfaces.en.html) plugin interface lets you customize behavior at every stage of request processing. +**Where ATS is strong:** CDN-scale HTTP caching and forward proxying. The continuation model is purpose-built for cache hit/miss processing. The cache integration is deep: content storage, freshness evaluation, and origin fetching are all built into the continuation chain. Organizations running CDN edge nodes at billions of requests per day have done so on ATS for years. The [TSAPI](https://docs.trafficserver.apache.org/en/latest/developer-guide/plugins/plugin-interfaces.en.html) plugin interface lets you customize behavior at every stage of request processing. **Where ATS struggles:** The continuation model has a steep learning curve, and the plugin isolation story is weak. A misbehaving plugin degrades the thread it runs on. Configuration is dense, and performance tuning requires understanding internal thread and event queue sizing. For general-purpose reverse proxy use cases outside of caching workloads, the operational complexity is hard to justify. @@ -122,9 +122,9 @@ HAProxy added multi-threading in version 1.8 via the `nbthread` directive. The d ![HAProxy: single process with shared accept socket via SO_REUSEPORT; nbthread workers each run an independent epoll loop; shared state protected by spinlocks](/assets/img/posts/proxy-concurrency/haproxy-arch.svg) -New connections are distributed using `SO_REUSEPORT` — a socket option that lets multiple threads call `accept()` on the same port, with the kernel distributing connections across them. This removes the accept bottleneck without a shared queue or mutex. Each thread then manages its connections independently. +New connections are distributed using `SO_REUSEPORT`, a socket option that lets multiple threads call `accept()` on the same port, with the kernel distributing connections across them. This removes the accept bottleneck without a shared queue or mutex. Each thread then manages its connections independently. -Shared state — stick-tables, global request counters, server health information — is protected by per-object spinlocks rather than a global lock. The shared surface is small by design; HAProxy's data model has always minimized it. +Shared state (stick-tables, global request counters, server health information) is protected by per-object spinlocks rather than a global lock. The shared surface is small by design; HAProxy's data model has always minimized it. Configuration is explicit: @@ -147,37 +147,37 @@ Hot reload works through process replacement: `haproxy -sf $(cat /var/run/haprox **Where HAProxy is strong:** Pure efficiency and predictable latency in L4 and L7 load balancing scenarios. For environments where memory budget is constrained (appliances, shared infrastructure), where configuration must be auditable and straightforward, or where the stick-table and ACL system's power is needed without external dependencies, HAProxy is the standard choice. Its runtime API (socket commands) supports dynamic configuration of server weights, server state, and ACLs without a reload. -**Where HAProxy struggles:** The threading model was added to a single-process design; at very high thread counts, spinlock contention on shared state can surface. Lua (the extension scripting language) runs on the event loop thread, so complex Lua logic adds latency to other connections on that thread. 
HAProxy is not designed for deep L7 programmability — complex request transformation logic that would be straightforward in Envoy's filter chain is awkward to express in HAProxy's ACL/action model. +**Where HAProxy struggles:** The threading model was added to a single-process design; at very high thread counts, spinlock contention on shared state can surface. Lua (the extension scripting language) runs on the event loop thread, so complex Lua logic adds latency to other connections on that thread. HAProxy is not designed for deep L7 programmability; complex request transformation logic that would be straightforward in Envoy's filter chain is awkward to express in HAProxy's ACL/action model. ## [Envoy](https://www.envoyproxy.io/): Thread-per-Core with Complete Isolation -Envoy was designed for service mesh: a proxy running as a sidecar alongside every service instance in the fleet. That use case required properties none of the existing proxies optimized for — deep L7 programmability, dynamic reconfiguration without restarts, and a concurrency model that would not allow a bug in one connection's processing to affect any other connection. +Envoy was designed for service mesh: a proxy running as a sidecar alongside every service instance in the fleet. That use case required properties none of the existing proxies optimized for: deep L7 programmability, dynamic reconfiguration without restarts, and a concurrency model that would not allow a bug in one connection's processing to affect any other connection. The architecture is thread-per-core with a strict constraint: **worker threads share nothing by design.** A listener thread accepts incoming connections and dispatches each to a worker thread via a consistent hash. From that moment, the connection belongs entirely to that worker: its TLS session, its upstream connection pool, the entire L7 filter chain executing its request. Workers do not communicate with each other for connection processing. -Each worker runs its own [libevent](https://libevent.org/)-based event loop and holds its own copy of the proxy configuration — delivered as a snapshot via the [xDS protocol](https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol). When the control plane pushes a configuration update (a new backend, a changed route, a rotated certificate), each worker receives and applies it independently. No coordination between workers, no global pause, no lock. +Each worker runs its own [libevent](https://libevent.org/)-based event loop and holds its own copy of the proxy configuration, delivered as a snapshot via the [xDS protocol](https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol). When the control plane pushes a configuration update (a new backend, a changed route, a rotated certificate), each worker receives and applies it independently. No coordination between workers, no global pause, no lock. ![Envoy listener thread dispatching to worker threads; each worker is completely isolated with its own event loop, connection pool, filter chain, and xDS config snapshot; no shared state between workers](/assets/img/posts/proxy-concurrency/envoy-arch.svg) -The filter chain model is the other defining feature. Every request passes through a configured sequence of L4 and L7 filters. Each filter can read and modify the request: JWT validation, header manipulation, rate limit checking, gRPC transcoding, circuit breaking. Filters are composable and independently configurable. 
The per-worker isolation means a filter's state is always thread-local — no locking required within the filter chain. +The filter chain model is the other defining feature. Every request passes through a configured sequence of L4 and L7 filters. Each filter can read and modify the request: JWT validation, header manipulation, rate limit checking, gRPC transcoding, circuit breaking. Filters are composable and independently configurable. The per-worker isolation means a filter's state is always thread-local, so no locking is required within the filter chain. -The xDS API is the interface between Envoy and its control plane ([Istio](https://istio.io/), custom implementations, or static config with dynamic overrides). Adding a backend endpoint, changing a route's timeout, draining an instance before it is decommissioned — all are xDS updates pushed to each worker independently. This is the operational model that makes zero-downtime deployments at fleet scale tractable. +The xDS API is the interface between Envoy and its control plane ([Istio](https://istio.io/), custom implementations, or static config with dynamic overrides). Adding a backend endpoint, changing a route's timeout, draining an instance before it is decommissioned: all are xDS updates pushed to each worker independently. This is the operational model that makes zero-downtime deployments at fleet scale tractable. **Where Envoy is strong:** Complex L7 processing, service mesh sidecars, API gateways where routing rules change frequently, and environments with control-plane infrastructure. The filter chain model handles workloads that would require custom code in HAProxy or ATS. The xDS integration is the right tool when the proxy's configuration is driven programmatically rather than by static files. -**Where Envoy struggles:** Memory footprint is higher than HAProxy, primarily from per-worker state duplication — each worker holds its own upstream connection pool and config snapshot. The operational surface is larger: debugging a misconfigured filter chain is harder than reading a HAProxy ACL. Custom filters require C++ or WASM, a higher bar than Lua scripting. For straightforward L4/L7 load balancing without complex routing logic, Envoy's weight is harder to justify than HAProxy's. +**Where Envoy struggles:** Memory footprint is higher than HAProxy, primarily from per-worker state duplication: each worker holds its own upstream connection pool and config snapshot. The operational surface is larger: debugging a misconfigured filter chain is harder than reading a HAProxy ACL. Custom filters require C++ or WASM, a higher bar than Lua scripting. For straightforward L4/L7 load balancing without complex routing logic, Envoy's weight is harder to justify than HAProxy's. ## Robustness Under the Thin Layer Being thin does not mean being fragile. Each model comes with specific mechanisms for maintaining service through failures. -**Graceful restart** is how all three proxies handle configuration updates and version upgrades without dropping connections. HAProxy's `-sf` flag passes file descriptors to the new process, which takes the listening sockets while the old process drains. ATS's traffic_manager handles restart sequencing. Envoy's hot-restart protocol passes sockets between old and new processes; the drain timer controls how long the old process waits for in-flight requests to complete. The common pattern — new process takes the port, old process finishes its work — is non-negotiable for a proxy in a live path. 
+**Graceful restart** is how all three proxies handle configuration updates and version upgrades without dropping connections. HAProxy's `-sf` flag passes file descriptors to the new process, which takes the listening sockets while the old process drains. ATS's traffic_manager handles restart sequencing. Envoy's hot-restart protocol passes sockets between old and new processes; the drain timer controls how long the old process waits for in-flight requests to complete. The common pattern (new process takes the port, old process finishes its work) is non-negotiable for a proxy in a live path. **Circuit breaking** prevents backend failure from cascading into proxy resource exhaustion. When a backend is slow or failing, the proxy must stop sending it new connections before queues grow unbounded. Envoy's circuit breaker is per-cluster with configurable thresholds: maximum pending requests, active requests, retries, and connections. HAProxy uses `maxconn` per server with queue management and health-check-driven server state transitions. ATS manages this through origin server connection limiting and retry configuration. The implementation differs; the requirement is the same: a proxy that blindly queues connections to a failing backend eventually exhausts memory and takes itself down. -**Connection draining on backend removal** ensures in-flight requests complete when a backend exits the pool. HAProxy's "drain" server state stops new connections while allowing existing ones to finish. Envoy's endpoint discovery transitions endpoints through a draining state before removal. This is operationally critical for deployments — a rolling deployment that removes backends without draining will drop a predictable fraction of requests proportional to the ratio of removed capacity to total capacity. +**Connection draining on backend removal** ensures in-flight requests complete when a backend exits the pool. HAProxy's "drain" server state stops new connections while allowing existing ones to finish. Envoy's endpoint discovery transitions endpoints through a draining state before removal. This is operationally critical for deployments: a rolling deployment that removes backends without draining will drop a predictable fraction of requests proportional to the ratio of removed capacity to total capacity. 
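Reduced to its core, the circuit-breaking admission logic looks something like the following sketch (class, method, and threshold names invented for illustration; this is not Envoy's or HAProxy's actual implementation):

```python
class CircuitBreaker:
    """Admission gate for one backend cluster: refuse new work once
    in-flight counts cross configured ceilings (threshold values invented)."""

    def __init__(self, max_active: int = 1024, max_pending: int = 256):
        self.max_active = max_active
        self.max_pending = max_pending
        self.active = 0
        self.pending = 0

    def try_admit(self) -> bool:
        # Refuse before queues grow unbounded; the caller fails fast
        # (e.g. returns 503) instead of queueing into a failing backend.
        if self.active >= self.max_active or self.pending >= self.max_pending:
            return False
        self.pending += 1
        return True

    def on_dispatch(self) -> None:    # request left the queue, now in flight
        self.pending -= 1
        self.active += 1

    def on_complete(self) -> None:    # response finished or connection closed
        self.active -= 1
```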
## Choosing the Right Model From ea61b2c2fa7e196e5050fdbeb82040283f18060 Mon Sep 17 00:00:00 2001 From: Sanjay Singh Date: Mon, 9 Mar 2026 20:17:01 -0700 Subject: [PATCH 16/16] Fix ET_NET 2 layout and stray orange line in ats-arch.svg MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Move conn dots from cy=192 to cy=206 so they sit on a distinct row below the title instead of overlapping 'ET_NET 2 ⚠ overloaded' - Shift sublabel text down (y=210/223 → y=219/231) to match the new row spacing in the taller 64px box - Fix orange annotation line: extend from x=557→555 (box right edge) and x=578→580 (callout box left edge) to eliminate the floating gap - Add indigo pill background behind 'round-robin' label so it reads clearly over the crossing dispatch curves Co-Authored-By: Claude Sonnet 4.6 --- .../img/posts/proxy-concurrency/ats-arch.svg | 23 ++++++++++--------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/assets/img/posts/proxy-concurrency/ats-arch.svg b/assets/img/posts/proxy-concurrency/ats-arch.svg index f8f35a2..6e077cf 100644 --- a/assets/img/posts/proxy-concurrency/ats-arch.svg +++ b/assets/img/posts/proxy-concurrency/ats-arch.svg @@ -40,8 +40,9 @@ @@ -71,14 +72,14 @@ @@ -112,7 +113,7 @@ [three hunks unrecoverable: SVG markup stripped in extraction. Surviving text nodes: "round-robin" pill; "ET_NET 2 ⚠ overloaded / event loop · 18 conns / plugin blocking"; "= number of ET_NET threads"]