From cf0b94982f6c6f210bb7a9e74e6742d11ebf7a5c Mon Sep 17 00:00:00 2001
From: Hans Kristian Flaatten
Date: Mon, 24 Nov 2025 10:04:54 +0100
Subject: [PATCH 1/2] add guide for reliable communication between GCP and
 on-prem services

---
 .../explanations/migrating-to-gcp.md          |  24 ++-
 .../workloads/how-to/gcp-fss-communication.md | 194 ++++++++++++++++++
 2 files changed, 208 insertions(+), 10 deletions(-)
 create mode 100644 docs/workloads/how-to/gcp-fss-communication.md

diff --git a/docs/workloads/explanations/migrating-to-gcp.md b/docs/workloads/explanations/migrating-to-gcp.md
index 8d995b21..0d88cce4 100644
--- a/docs/workloads/explanations/migrating-to-gcp.md
+++ b/docs/workloads/explanations/migrating-to-gcp.md
@@ -109,13 +109,13 @@ A PVK is not a unique requirement for GCP, so all applications should already ha
 ???+ faq "Answer"
 
-    Nav has a cloud strategy that includes moving applications to the cloud. 
+    Nav has a cloud strategy that includes moving applications to the cloud.
     Read more about the [cloud strategy at Navet](https://navno.sharepoint.com/sites/enhet-it-avdelingen/SitePages/Skystrategi.aspx).
 
 ### What can we do in our GCP project?
 
-???+ faq "Answer" 
+???+ faq "Answer"
 
     The teams' GCP projects are primarily used for automatically generated resources (buckets and postgres).
     We're working on extending the service offering.
     However, additional access may be granted if required by the team
@@ -140,7 +140,7 @@ A PVK is not a unique requirement for GCP, so all applications should already ha
     The application _on-premises_ should generally fulfill the following requirements:
 
     1. Be secured with [OAuth 2.0][auth]. That is, either:
-        - a. [TokenX][tokenx], or 
+        - a. [TokenX][tokenx], or
         - b. [Entra ID][entra-id]
     2. Exposed to GCP using a special ingress:
         - `https://<app>.dev-fss-pub.nais.io`
@@ -149,7 +149,7 @@ A PVK is not a unique requirement for GCP, so all applications should already ha
     The application _on-premises_ must then:
 
     1. Add the ingress created above to the list of ingresses:
-
+
        ```yaml
       spec:
         ingresses:
@@ -173,24 +173,27 @@ A PVK is not a unique requirement for GCP, so all applications should already ha
       ```
 
    2. Consume the application using the special ingress. Other ingresses are not reachable from GCP.
 
+    !!! warning "Connection timeouts"
+        Applications calling on-prem services from GCP must handle firewall connection timeouts. See [Communication between GCP and on-prem][gcp-fss-comm] for details.
+
 ### How do I reach an application found on GCP from my application on-premises?
 
 ???+ faq "Answer"
 
     The application in GCP must be exposed on a [matching ingress][environments]:
 
     | ingress | reachable from zone |
-    | :--- | :--- |
-    | `<app>.intern.dev.nav.no` | `dev-fss` |
-    | `<app>.intern.nav.no` | `prod-fss` |
-    | `<app>.nav.no` | internet, i.e. all clusters |
+    | :------------------------ | :-------------------------- |
+    | `<app>.intern.dev.nav.no` | `dev-fss` |
+    | `<app>.intern.nav.no` | `prod-fss` |
+    | `<app>.nav.no` | internet, i.e. all clusters |
 
     The application on-premises should _not_ have to use webproxy to reach these ingresses.
 ## GCP compared to on-premises
 
 | Feature | on-prem | gcp | Comment |
-|:--------------------------|:-----------|:-------------------|:----------------------------------------------------------------|
+| :------------------------ | :--------- | :----------------- | :-------------------------------------------------------------- |
 | Deploy | yes | yes | different clustername when deploying |
 | Logging | yes | yes | different clustername in logs.adeo.no |
 | Metrics | yes | yes | same mechanism, different datasource |
@@ -225,6 +228,7 @@ A PVK is not a unique requirement for GCP, so all applications should already ha
 [entra-id]: ../../auth/entra-id/README.md
 [entra-id-access]: ../../auth/entra-id/how-to/secure.md#grant-access-to-consumers
 [access-policies]: ../how-to/access-policies.md
+[gcp-fss-comm]: ../how-to/gcp-fss-communication.md
 [roles-responsibilites]: ../../legal/roles-responsibilities.md
 [pvk]: ../../legal/app-pvk.md
 [ros]: ../../legal/nais-ros.md

diff --git a/docs/workloads/how-to/gcp-fss-communication.md b/docs/workloads/how-to/gcp-fss-communication.md
new file mode 100644
index 00000000..ba6ddefa
--- /dev/null
+++ b/docs/workloads/how-to/gcp-fss-communication.md
@@ -0,0 +1,194 @@
---
tags: [workloads, how-to, gcp, fss, on-premises, networking, timeout]
conditional: [tenant, nav]
---

# Communicate reliably between GCP and on-prem

This guide shows you how to configure HTTP clients to handle firewall timeouts when calling on-premises services from GCP.

## Prerequisites

- Application running in GCP
- Calling services in on-prem FSS environment via `*.fss-pub.nais.io` ingress
- Access to modify HTTP client configuration

## Background

The on-prem firewall drops idle connections after 60 minutes without sending TCP close signals. HTTP clients reusing these dead connections will fail with timeout or connection reset errors.
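Because the firewall closes connections silently, a dropped connection looks healthy in the pool until it is reused, so the fix is client-side configuration rather than retries alone. The steps below also assume that outbound traffic to the on-prem ingress is already allowed in your manifest; a minimal sketch, with the hostname as a placeholder for your own on-prem service:

```yaml
spec:
  accessPolicy:
    outbound:
      external:
        # Placeholder: use your service's *.fss-pub.nais.io host here
        - host: onprem-service.dev-fss-pub.nais.io
```

See [Access policies](access-policies.md) for details on outbound access.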
## Steps

### 1. Configure HTTP client connection time-to-live

Set connection TTL to 55 minutes (below the 60-minute firewall timeout):

=== "Apache HttpClient"

    ```java
    // In HttpClient 4.x the pool-wide TTL is a constructor argument
    PoolingHttpClientConnectionManager cm =
        new PoolingHttpClientConnectionManager(55, TimeUnit.MINUTES);
    cm.setMaxTotal(200);
    cm.setDefaultMaxPerRoute(20);

    CloseableHttpClient client = HttpClients.custom()
        .setConnectionManager(cm)
        .evictIdleConnections(55, TimeUnit.MINUTES)
        .build();
    ```

=== "OkHttp"

    ```java
    OkHttpClient client = new OkHttpClient.Builder()
        .connectionPool(new ConnectionPool(10, 55, TimeUnit.MINUTES))
        .build();
    ```

=== "Spring WebClient"

    ```java
    ConnectionProvider provider = ConnectionProvider.builder("onprem-pool")
        .maxConnections(200)
        .maxIdleTime(Duration.ofMinutes(55))
        .maxLifeTime(Duration.ofMinutes(59))
        .evictInBackground(Duration.ofMinutes(5))
        .build();

    HttpClient httpClient = HttpClient.create(provider)
        .option(ChannelOption.SO_KEEPALIVE, true)
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5000)
        .responseTimeout(Duration.ofSeconds(10));

    WebClient client = WebClient.builder()
        .clientConnector(new ReactorClientHttpConnector(httpClient))
        .build();
    ```

=== "Ktor CIO"

    ```kotlin
    val client = HttpClient(CIO) {
        engine {
            maxConnectionsCount = 200
            endpoint {
                keepAliveTime = 55 * 60 * 1000L // 55 minutes, in milliseconds
                connectTimeout = 5_000
                requestTimeout = 10_000
            }
        }
    }
    ```

=== "Node.js (http/https)"

    ```javascript
    const https = require('https');

    const agent = new https.Agent({
      keepAlive: true,
      keepAliveMsecs: 55 * 60 * 1000,
      maxSockets: 200,
      maxFreeSockets: 20,
      timeout: 55 * 60 * 1000
    });

    // Node's built-in fetch ignores `agent`; pass it to http/https requests instead
    https.get('https://onprem-service.fss-pub.nais.io', { agent }, (res) => {
      res.resume(); // consume the body so the socket returns to the pool
    });
    ```

=== "Axios"

    ```javascript
    const axios = require('axios');
    const https = require('https');

    const httpsAgent = new https.Agent({
      keepAlive: true,
      keepAliveMsecs: 55 * 60 * 1000,
      maxSockets: 200,
      maxFreeSockets: 20,
      timeout: 55 * 60 * 1000
    });

    const client = axios.create({
      httpsAgent: httpsAgent,
      timeout: 10000
    });
    ```

=== "Node-fetch"

    ```javascript
    const fetch = require('node-fetch');
    const https = require('https');

    const agent = new https.Agent({
      keepAlive: true,
      keepAliveMsecs: 55 * 60 * 1000,
      maxSockets: 200,
      maxFreeSockets: 20,
      timeout: 55 * 60 * 1000
    });

    fetch('https://onprem-service.fss-pub.nais.io', { agent })
      .then(res => res.json());
    ```

### 2. Enable TCP keep-alive

Enable SO_KEEPALIVE to send periodic packets on idle connections (shown in the Spring WebClient example above).

### 3. Configure background eviction

Set background eviction to proactively remove stale connections every 5 minutes (shown in the Spring WebClient example above). For Apache HttpClient, see the sketch below.
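Steps 2 and 3 are only shown inline for Spring WebClient above. For Apache HttpClient, roughly equivalent settings are sketched below, assuming HttpClient 4.5+ and the same classes as in step 1; treat this as a starting point rather than a drop-in configuration:

```java
// Step 2: enable TCP keep-alive on all pooled sockets
SocketConfig socketConfig = SocketConfig.custom()
        .setSoKeepAlive(true)
        .build();

PoolingHttpClientConnectionManager cm =
        new PoolingHttpClientConnectionManager(55, TimeUnit.MINUTES); // TTL from step 1
cm.setDefaultSocketConfig(socketConfig);

// Step 3: a background thread evicts expired and long-idle connections
CloseableHttpClient client = HttpClients.custom()
        .setConnectionManager(cm)
        .evictExpiredConnections()
        .evictIdleConnections(55, TimeUnit.MINUTES)
        .build();
```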
### 4. Monitor metrics and logs

Use [OpenTelemetry auto-instrumentation](../../observability/how-to/auto-instrumentation.md) to track error rates and latency:

=== "Java"

    ```promql
    # Error rate for outbound requests to FSS
    sum(rate(http_client_request_duration_seconds_count{
      server_address=~".*fss-pub.nais.io",
      http_response_status_code!="200"
    }[5m]))

    # Request latency to FSS services
    histogram_quantile(0.99,
      sum(rate(http_client_request_duration_seconds_bucket{
        server_address=~".*fss-pub.nais.io"
      }[5m])) by (le)
    )
    ```

=== "Node.js"

    ```promql
    # Error rate for outbound requests to FSS
    sum(rate(http_client_duration_milliseconds_count{
      server_address=~".*fss-pub.nais.io",
      http_response_status_code!="200"
    }[5m]))

    # Request latency to FSS services
    histogram_quantile(0.99,
      sum(rate(http_client_duration_milliseconds_bucket{
        server_address=~".*fss-pub.nais.io"
      }[5m])) by (le)
    )
    ```

Monitor application logs for these errors (they should decrease after the configuration change):

- `java.net.SocketTimeoutException: Connection timed out`
- `java.net.SocketException: Connection reset by peer`
- `Connection closed prematurely BEFORE response`

## Related resources

- [Access policies](access-policies.md) - Configure outbound access to FSS services
- [Migrating to GCP FAQ](../explanations/migrating-to-gcp.md#how-do-i-reach-an-application-found-on-premises-from-my-application-in-gcp) - Overview of GCP-FSS communication
- [OpenTelemetry metrics](../../observability/metrics/reference/otel.md#http-client-metrics) - Available HTTP client metrics

From 19add747f1e720851bdb219372bb80ac0ae700d Mon Sep 17 00:00:00 2001
From: Hans Kristian Flaatten
Date: Mon, 24 Nov 2025 10:40:38 +0100
Subject: [PATCH 2/2] link gcp-fss-communication guide to HTTP client
 connection management documentation

---
 .../http-client-connection-management.md      | 297 ++++++++++++++++++
 .../workloads/how-to/gcp-fss-communication.md |   1 +
 2 files changed, 298 insertions(+)
 create mode 100644 docs/workloads/explanations/http-client-connection-management.md

diff --git a/docs/workloads/explanations/http-client-connection-management.md b/docs/workloads/explanations/http-client-connection-management.md
new file mode 100644
index 00000000..6ec41977
--- /dev/null
+++ b/docs/workloads/explanations/http-client-connection-management.md
@@ -0,0 +1,297 @@
---
tags: [explanation, http, connections, timeouts, networking]
---

# HTTP client connection management

This page explains how HTTP client connection management works, including connection pooling, timeouts, and how network infrastructure affects your application's reliability.

## Why connection management matters

Modern applications make hundreds or thousands of HTTP requests. Opening a new TCP connection for each request is expensive:

- **TCP handshake overhead**: Three-way handshake (SYN, SYN-ACK, ACK) adds latency
- **TLS handshake**: Additional round trips for certificate exchange and key agreement
- **Slow start**: TCP congestion control starts with small windows
- **Resource consumption**: Each new connection consumes system resources

HTTP connection pooling solves this by reusing established connections, dramatically improving performance and reducing load.
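The difference is easy to observe. With Java's built-in `java.net.http.HttpClient`, which pools connections internally, the second request to the same host typically skips the DNS, TCP, and TLS setup entirely; a minimal, self-contained sketch with `example.com` standing in for a real service:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PoolingDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient(); // reuses connections per origin
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/")) // placeholder host
                .GET()
                .build();

        for (int i = 1; i <= 2; i++) {
            long start = System.nanoTime();
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // The first request pays for DNS + TCP + TLS handshakes;
            // the second usually reuses the pooled connection and is much faster.
            System.out.println("request " + i + ": HTTP " + response.statusCode()
                    + " in " + elapsedMs + " ms");
        }
    }
}
```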
## How connection pooling works

### Connection lifecycle

```mermaid
stateDiagram-v2
    [*] --> Establishing: Request needs connection
    Establishing --> Active: TCP + TLS handshake complete
    Active --> Idle: Request complete, kept in pool
    Idle --> Active: Reused for new request
    Idle --> Closed: TTL expired or evicted
    Active --> Closed: Connection error
    Closed --> [*]
```

When no pooled connection is available, the client performs a DNS lookup, establishes a TCP connection, negotiates TLS (if HTTPS), then sends the request. After the response completes, the connection returns to the pool instead of closing. The next request to the same host reuses this connection, bypassing all handshake overhead.

Connections are removed from the pool when the TTL expires, the idle timeout is reached, background eviction runs, a connection error is detected, or the pool size limit is exceeded.

### HTTP Keep-Alive

Connection pooling relies on HTTP Keep-Alive:

```http
Connection: keep-alive
Keep-Alive: timeout=60, max=100
```

This signals that the connection should remain open after the response completes. Both client and server must support it.

## DNS and connection failures

DNS plays a critical role in connection establishment and can be a source of intermittent failures.

### DNS caching issues

**Stale DNS cache:** When service IPs change (pod rotation, deployment), cached DNS entries point to old IPs, causing connection failures until the TTL expires.

**Key considerations:**

- Services on Nais have a short DNS TTL (30 seconds)
- Client-side DNS caches may not respect the TTL
- The JVM caches successful DNS lookups for 30 seconds by default, and indefinitely when a security manager is enabled (tune with `networkaddress.cache.ttl`)
- Connection pools may hold connections to old IPs

### DNS failures

DNS resolution can fail due to server overload, network partitions, or rate limiting, causing "Unknown host" errors even when services are healthy.

**Mitigation:**

- Set connection TTL to 5-10 minutes for periodic DNS re-resolution
- Implement retry logic for DNS failures
- Connection pooling reduces DNS lookup frequency

## Understanding timeout types

Different timeout settings control different aspects of connection behavior. Configuring them correctly is critical for reliability.

### Connection timeout

**What it controls:** Maximum time to wait for the initial TCP connection to establish.

**Common names:**

- `connectTimeout` (most libraries)
- `CONNECT_TIMEOUT_MILLIS` (Netty)
- Connection timeout (Apache HttpClient)

**Typical values:** 5-10 seconds

**What happens when exceeded:** The connection attempt fails immediately with a timeout exception.

**When to adjust:**

- Cross-cluster or cross-datacenter calls with high latency
- Calls through multiple proxies
- Networks with packet loss

### Socket/idle timeout

**What it controls:** Maximum time a connection can remain idle in the pool before being removed.

**Common names:**

- `timeout` (Node.js Agent)
- `connectionTimeToLive` (Apache HttpClient)
- `maxIdleTime` (Reactor Netty)
- `keepAliveTime` (Ktor)

**Typical values:** Based on infrastructure timeout constraints (e.g., 55 minutes for on-prem firewall timeouts)

**What happens when exceeded:** The connection is closed and removed from the pool.

**Why it matters:** Prevents attempting to reuse connections that network infrastructure has already dropped.

### Read/response timeout

**What it controls:** Maximum time to wait for the complete response after sending a request.
+ +**Common names:** + +- `responseTimeout` (Reactor Netty) +- `requestTimeout` (Ktor) +- `timeout` (Axios - request-level) +- Read timeout (Apache HttpClient) + +**Typical values:** 10-60 seconds, depending on endpoint characteristics + +**What happens when exceeded:** Request is cancelled with a timeout exception. + +**When to adjust:** + +- Long-running operations (batch processing, report generation) +- Large file downloads +- Streaming responses + +### Background eviction + +**What it controls:** Periodic cleanup of idle or stale connections from the pool. + +**Common names:** + +- `evictIdleConnections` (Apache HttpClient) +- `evictInBackground` (Reactor Netty) + +**Typical values:** Every 5 minutes + +**Why it matters:** Removes connections that may have been silently dropped by network infrastructure between requests, preventing errors on the next request. + +## How network infrastructure affects connections + +### Stateful firewalls + +Firewalls maintain connection state tables and drop idle connections to prevent exhaustion. Most firewalls drop connections **silently** without TCP FIN or RST packets - connections appear healthy in the pool until you try to reuse them. + +**Solution:** Configure connection TTL below firewall timeout threshold. + +### Load balancers and NAT gateways + +Load balancers and NAT gateways enforce their own idle timeouts (typically 60-600 seconds). + +**Key points:** + +- Client connection TTL should be less than load balancer timeout +- Keep-alive probes may not prevent timeouts +- Backend service connections have separate timeouts +- NAT timeout shorter than client TTL means silent connection drops + +### Proxies + +Forward and reverse proxies add another layer of timeout configuration: + +- **Proxy → Backend timeout**: How long proxy waits for backend response +- **Client → Proxy timeout**: How long client waits for proxy response +- **Proxy connection pooling**: Proxy may maintain separate connection pool to backends + +## Connection pool sizing + +### Maximum connections + +**Per-route/per-host limits:** + +Prevents overwhelming a single backend service: + +```java +cm.setDefaultMaxPerRoute(20); // Max 20 concurrent connections per host +``` + +**Total pool size:** + +Limits total connections across all hosts: + +```java +cm.setMaxTotal(200); // Max 200 connections total +``` + +### Pool exhaustion + +When all connections are in use, new requests must: + +- Wait for a connection to become available +- Timeout if wait exceeds configured limit +- Potentially fail with "Connection pool exhausted" + +**Symptoms:** + +- Requests fail even though backend is healthy +- High request latencies during traffic spikes +- "NoHttpResponseException" or similar errors + +**Solutions:** + +- Increase pool size if resources allow +- Reduce response timeout to fail faster +- Add circuit breaker to prevent cascade failures +- Scale application horizontally + +## Common configuration mistakes + +### Infinite or too-long connection TTL + +**Problem:** Connections never expire or expire after infrastructure drops them. + +**Symptoms:** Intermittent "Connection reset" or "Unexpected end of stream" errors, especially after idle periods. + +**Solution:** Set connection TTL below infrastructure timeout thresholds (e.g., 55 minutes for 60-minute firewall timeout). + +### No background eviction + +**Problem:** Dead connections remain in pool until used. + +**Symptoms:** First request after idle period fails, subsequent retry succeeds. 
**Solution:** Enable background eviction (e.g., every 5 minutes).

### Confusing request timeout with connection TTL

**Problem:** Setting a very short request timeout in the belief that it will refresh connections.

**Symptoms:**

- Legitimate long-running requests fail
- Unnecessary request failures and retries

**Solution:** Use connection TTL for pool management, and request timeout for detecting hung requests.

## Nais platform considerations

### Pod lifecycle and connection pools

On Nais, when your application pods are terminated (during deployments, scaling, or node maintenance):

1. The pod receives a SIGTERM signal
2. The pod enters the "Terminating" state
3. Endpoints are removed from the Service (eventual consistency)
4. A grace period allows in-flight requests to complete (default 30s)

**Implications for connection pools:**

- Your application may have pooled connections to terminating pods of other services
- Requests to terminating pods may fail if the grace period expires
- You need proper retry logic for pod rotation scenarios

**Best practices on Nais:**

- Implement graceful shutdown in your application
- Configure preStop hooks to delay shutdown
- Use readiness probes to stop traffic before shutdown
- Implement client-side retry with exponential backoff

### Cross-cluster and cross-datacenter calls

Higher network latency affects timeout tuning:

**Same cluster:**

- Connection timeout: 5-10 seconds
- Read timeout: 10-30 seconds

**Cross-cluster or cross-datacenter:**

- Connection timeout: 10-15 seconds
- Read timeout: 30-60 seconds

Also consider:

- Retry budget (avoid retry storms)
- Circuit breaker thresholds
- Hedged requests for latency-sensitive calls

## Related resources

{% if tenant() == "nav" %}
- [Communicate reliably between GCP and on-prem](../how-to/gcp-fss-communication.md) - Practical configuration for on-premises firewall timeouts
{% endif %}
- [Access policies](../how-to/access-policies.md) - Configure network access between services
- [Good practices](good-practices.md) - Application development best practices

diff --git a/docs/workloads/how-to/gcp-fss-communication.md b/docs/workloads/how-to/gcp-fss-communication.md
index ba6ddefa..87e4474c 100644
--- a/docs/workloads/how-to/gcp-fss-communication.md
+++ b/docs/workloads/how-to/gcp-fss-communication.md
@@ -189,6 +189,7 @@ Monitor application logs for these errors (should decrease after configuration):
 
 ## Related resources
 
+- [HTTP client connection management](../explanations/http-client-connection-management.md) - Understanding connection pooling and timeouts
 - [Access policies](access-policies.md) - Configure outbound access to FSS services
 - [Migrating to GCP FAQ](../explanations/migrating-to-gcp.md#how-do-i-reach-an-application-found-on-premises-from-my-application-in-gcp) - Overview of GCP-FSS communication
 - [OpenTelemetry metrics](../../observability/metrics/reference/otel.md#http-client-metrics) - Available HTTP client metrics