From ebd42d22408a34364256ec577c556862d6fdbd3d Mon Sep 17 00:00:00 2001
From: Jacob Sussmilch
Date: Tue, 10 Feb 2026 13:49:30 +1100
Subject: [PATCH] docs: CNPG SSL cert issue analysis for world-postgres

---
 .../KEEP-1371/CNPG-SSL-WORLD-POSTGRES.md | 249 ++++++++++++++++++
 1 file changed, 249 insertions(+)
 create mode 100644 docs/keeperhub/KEEP-1371/CNPG-SSL-WORLD-POSTGRES.md

diff --git a/docs/keeperhub/KEEP-1371/CNPG-SSL-WORLD-POSTGRES.md b/docs/keeperhub/KEEP-1371/CNPG-SSL-WORLD-POSTGRES.md
new file mode 100644
index 000000000..3456aee1c
--- /dev/null
+++ b/docs/keeperhub/KEEP-1371/CNPG-SSL-WORLD-POSTGRES.md
@@ -0,0 +1,249 @@
+# CNPG SSL + world-postgres: SELF_SIGNED_CERT_IN_CHAIN
+
+## Problem
+
+After deploying `@workflow/world-postgres` to staging, the app fails to start:
+
+```
+[ERROR] Failed to prepare server Error: An error occurred while loading instrumentation hook: self-signed certificate in certificate chain
+  code: 'SELF_SIGNED_CERT_IN_CHAIN'
+  at async Module.s (.next/server/chunks/_f5dc613a._.js:2:852)
+```
+
+The error occurs during `world.start()` in `instrumentation.ts` (line 71), which initializes pg-boss and opens PostgreSQL connections at server startup.
+
+## Why PR Environments Work
+
+PR environments construct the DATABASE_URL inline in the Helm values template:
+
+```
+postgresql://keeperhub:${DB_PASSWORD}@keeperhub-pr-${PR_NUMBER}-db-rw.pr-${PR_NUMBER}.svc.cluster.local:5432/keeperhub
+```
+
+This is a plain connection string with **no SSL parameters**. The CNPG cluster in the PR namespace accepts the connection without SSL negotiation (or the `pg` library defaults to no SSL when `sslmode` is absent).
+
+## Why Staging/Production Fail
+
+Staging and production read the DATABASE_URL from AWS Parameter Store:
+
+| Environment | Parameter |
+|---|---|
+| Staging | `/eks/maker-staging/keeperhub/db-url` |
+| Production | `/eks/maker-prod/keeperhub/db-url` |
+
+These URLs are generated by CNPG. CNPG enables SSL by default:
+
+1. The CNPG operator generates self-signed TLS certificates for the cluster
+2. The default `pg_hba.conf` is assumed to use `hostssl`, meaning **only SSL connections are accepted** (unconfirmed; see Investigation Needed)
+3. The connection string likely includes `?sslmode=require` or the server mandates SSL during handshake
+
+When `pg-boss` (which uses the `pg` / node-postgres library internally) connects:
+1. SSL is negotiated with the server
+2. Node.js TLS validates the certificate chain
+3. The chain contains a self-signed certificate (CNPG's generated CA)
+4. Node.js rejects it: `SELF_SIGNED_CERT_IN_CHAIN`
+
+## Why Drizzle ORM Doesn't Hit This
+
+The existing Drizzle ORM connection (`lib/db/index.ts`) uses the same DATABASE_URL but doesn't fail because:
+
+- Drizzle uses `postgres.js` v3 (the `postgres` npm package), not `pg` (node-postgres)
+- The connection is **lazy**: it only connects on the first database query, after the server is running
+- `postgres.js` v3 may handle SSL negotiation differently from `pg`
+
+In contrast, `world.start()` runs during the instrumentation hook (before the HTTP server starts) and pg-boss connects **eagerly**.
+
+## Connection Architecture
+
+```
+instrumentation.ts
+  world.start()
+    pg-boss (uses `pg` library) --> CNPG SSL --> SELF_SIGNED_CERT_IN_CHAIN
+    postgres.js v3 (direct queries) --> CNPG SSL --> may also fail
+
+lib/db/index.ts
+  postgres.js v3 (Drizzle ORM) --> CNPG SSL --> works (different SSL handling or lazy)
+```
+
+Both connect to the same CNPG cluster, same URL, same self-signed certs. The difference is the driver (`pg` vs `postgres.js`) and timing (eager vs lazy).
+
+## Solution Options
+
+### Option 1: Modify sslmode in the Connection URL
+
+Append or change the `sslmode` parameter before world-postgres consumes it.
+
+**In `instrumentation.ts`:**
+
+```typescript
+if (rawUrl) {
+  let encoded = encodePostgresPassword(rawUrl);
+  // Override sslmode; whether 'require' skips cert verification depends on the `pg` version
+  const urlObj = new URL(encoded);
+  urlObj.searchParams.set('sslmode', 'require'); // or 'prefer' or 'disable'
+  encoded = urlObj.toString();
+  process.env.WORKFLOW_POSTGRES_URL = encoded;
+}
+```
+
+PostgreSQL `sslmode` values:
+
+| Mode | SSL | Cert Verification | Notes |
+|---|---|---|---|
+| `disable` | No | N/A | No encryption. Only works if CNPG `pg_hba.conf` allows `host` (not just `hostssl`) |
+| `allow` | Optional | No | Client prefers non-SSL, server can force SSL |
+| `prefer` | Preferred | No | Client prefers SSL, falls back to non-SSL |
+| `require` | Yes | **Depends on driver** | `pg` library: may still verify certs. `libpq`: no verification |
+| `verify-ca` | Yes | CA only | Requires trusted CA certificate |
+| `verify-full` | Yes | CA + hostname | Strictest: requires matching CA and hostname |
+
+**Risk:** `sslmode=require` behavior varies between `pg` library versions. In some versions it maps to `ssl: { rejectUnauthorized: false }`, in others it maps to `ssl: true` (which still verifies). Need to check the pg-boss dependency tree to confirm which `pg` version is used.
+
+**Risk:** `sslmode=disable` won't work if CNPG's `pg_hba.conf` only has `hostssl` rules (rejects non-SSL connections).
+
+### Option 2: NODE_TLS_REJECT_UNAUTHORIZED=0 (Process-wide)
+
+Add the env var to staging and production Helm values:
+
+```yaml
+NODE_TLS_REJECT_UNAUTHORIZED:
+  type: kv
+  value: "0"
+```
+
+**How it works:** Tells Node.js to skip certificate verification for ALL TLS connections in the process.
+
+**Pros:**
+- Simple, guaranteed to work
+- Common pattern for K8s workloads with internal self-signed certs
+- Connections still use TLS encryption (data in transit is encrypted)
+- External API calls (OpenAI, Sentry, SendGrid) still work because they use public CAs
+
+**Cons:**
+- Disables cert verification for ALL outgoing HTTPS connections, not just PostgreSQL
+- A MITM attack on outbound connections (e.g., to OpenAI API) would not be detected
+- In practice, the risk is low inside AWS VPC / K8s cluster network
+
+### Option 3: Scoped NODE_TLS_REJECT_UNAUTHORIZED in instrumentation.ts
+
+Set the env var only during `world.start()`:
+
+```typescript
+const prevTls = process.env.NODE_TLS_REJECT_UNAUTHORIZED;
+process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0';
+try {
+  const { getWorld } = await import("workflow/runtime");
+  const world = getWorld();
+  if (world.start) {
+    await world.start();
+    console.log("[Workflow] Postgres World initialized");
+  }
+} finally { // Restore the previous value even if world.start() throws
+  if (prevTls !== undefined) {
+    process.env.NODE_TLS_REJECT_UNAUTHORIZED = prevTls;
+  } else {
+    delete process.env.NODE_TLS_REJECT_UNAUTHORIZED;
+  }
+}
+```
+
+**Pros:**
+- More targeted than Option 2
+- External HTTPS connections after startup use normal cert verification
+
+**Cons:**
+- pg-boss maintains a connection pool. If a connection drops and pg-boss creates a new one AFTER we restore the env var, the new connection will fail with the same SSL error
+- Gives a false sense of security: in practice, pg-boss reconnections will eventually fail
+
+### Option 4: Configure CNPG to Allow Non-SSL Connections
+
+Modify the CNPG cluster manifest to use `host` instead of `hostssl` in `pg_hba.conf`:
+
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+  name: keeperhub-staging-db
+spec:
+  postgresql:
+    pg_hba:
+      - host all all all scram-sha-256
+```
+
+**Pros:**
+- Root cause fix: removes the SSL requirement entirely for intra-cluster traffic
+- No application code changes needed
+- Pod-to-pod K8s traffic is already network-isolated
+
+**Cons:**
+- Infrastructure change: requires updating CNPG cluster manifests for staging and production
+- Removes encryption for DB traffic (acceptable within K8s network policies, but less defense-in-depth)
+- May require CNPG cluster restart/rolling update
+
+### Option 5: Add CNPG CA Certificate to Node.js Trust Store
+
+Mount the CNPG CA certificate and set `NODE_EXTRA_CA_CERTS`:
+
+```yaml
+NODE_EXTRA_CA_CERTS:
+  type: kv
+  value: "/etc/cnpg-certs/ca.crt"
+```
+
+With a volume mount from the CNPG TLS secret:
+
+```yaml
+volumes:
+  - name: cnpg-ca
+    secret:
+      secretName: keeperhub-staging-db-ca
+      items:
+        - key: ca.crt
+          path: ca.crt
+volumeMounts:
+  - name: cnpg-ca
+    mountPath: /etc/cnpg-certs
+    readOnly: true
+```
+
+**Pros:**
+- Most correct solution: SSL is maintained with proper cert verification
+- No security tradeoffs
+- External HTTPS connections unaffected
+
+**Cons:**
+- Most complex to implement
+- Requires knowing the CNPG CA secret name (varies per cluster)
+- Needs Helm values changes for volume mounts (may require chart modifications)
+- Must be done for both staging and production CNPG clusters
+
+## Recommendation
+
+**Short term:** Option 2 (`NODE_TLS_REJECT_UNAUTHORIZED=0` in Helm values). It's simple, works immediately, and the security risk is minimal inside K8s.
+
+**Long term:** Option 5 (mount CNPG CA cert) or Option 4 (allow non-SSL). Option 5 is the most correct but requires infrastructure work. Option 4 is simpler but removes encryption.
+
+## Investigation Needed
+
+Before implementing, verify:
+
+1. **What sslmode is in the Parameter Store URL?** Check the actual value of `/eks/maker-staging/keeperhub/db-url` to confirm whether it has `?sslmode=require` or another value
+2. **Does CNPG enforce SSL?** Check the CNPG cluster manifest for `pg_hba` configuration. If it's `hostssl` only, `sslmode=disable` won't work
+3. **What pg version does pg-boss use?** Check `node_modules/pg-boss/package.json` for the `pg` dependency version, since this determines how `sslmode=require` is interpreted
+4. **What's the CNPG CA secret name?** If going with Option 5, identify the secret that holds the CA certificate
+
+```bash
+# Check Parameter Store URL (mask credentials)
+aws ssm get-parameter --name /eks/maker-staging/keeperhub/db-url \
+  --with-decryption --query 'Parameter.Value' --output text \
+  | sed 's|://[^@]*@|://***:***@|'
+
+# Check CNPG cluster config
+kubectl get cluster -n keeperhub -o yaml | grep -A 10 pg_hba
+
+# Check CNPG CA secret
+kubectl get secrets -n keeperhub | grep ca
+
+# Check pg version in pg-boss
+jq '.dependencies.pg' node_modules/pg-boss/package.json
+```
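
The first investigation step above (which `sslmode` the Parameter Store URL carries) could also be checked from application code, for example in a startup log line. The following is a minimal sketch, not part of the patch: `summarizeDbUrl` and `DbUrlSummary` are hypothetical names, and it assumes the URL is already parseable by the WHATWG `URL` class (that is, any special characters in the password have been percent-encoded).

```typescript
// Hypothetical helper (not in the codebase): summarize the SSL-relevant parts
// of a PostgreSQL connection URL without exposing credentials.
interface DbUrlSummary {
  maskedUrl: string;
  host: string;
  sslmode: string | null; // null when the URL carries no sslmode parameter
}

function summarizeDbUrl(rawUrl: string): DbUrlSummary {
  // Assumption: rawUrl parses with the WHATWG URL class.
  const url = new URL(rawUrl);
  const sslmode = url.searchParams.get("sslmode");
  // Mask credentials, mirroring the sed expression in the shell commands above.
  if (url.username) url.username = "***";
  if (url.password) url.password = "***";
  return { maskedUrl: url.toString(), host: url.hostname, sslmode };
}
```

Logging `summarizeDbUrl(process.env.DATABASE_URL ?? "")` inside a `try` block at startup would show whether staging and the PR environments really differ in `sslmode`, without printing the password. An empty or malformed URL makes `new URL` throw, so the caller should handle that case.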