249 changes: 249 additions & 0 deletions docs/keeperhub/KEEP-1371/CNPG-SSL-WORLD-POSTGRES.md
@@ -0,0 +1,249 @@
# CNPG SSL + world-postgres: SELF_SIGNED_CERT_IN_CHAIN

## Problem

After deploying `@workflow/world-postgres` to staging, the app fails to start:

```
[ERROR] Failed to prepare server Error: An error occurred while loading instrumentation hook: self-signed certificate in certificate chain
code: 'SELF_SIGNED_CERT_IN_CHAIN'
at async Module.s (.next/server/chunks/_f5dc613a._.js:2:852)
```

The error occurs during `world.start()` in `instrumentation.ts` (line 71), which initializes pg-boss and opens PostgreSQL connections at server startup.

## Why PR Environments Work

PR environments construct the DATABASE_URL inline in the Helm values template:

```
postgresql://keeperhub:${DB_PASSWORD}@keeperhub-pr-${PR_NUMBER}-db-rw.pr-${PR_NUMBER}.svc.cluster.local:5432/keeperhub
```

This is a plain connection string with **no SSL parameters**. Either the CNPG cluster in the PR namespace accepts the connection without SSL negotiation, or the `pg` library never attempts SSL because `sslmode` is absent. In both cases no certificate is validated, so the error cannot occur.
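To tell the two cases apart, the `sslmode` carried by a URL can be inspected without printing credentials. The following is a minimal sketch, assuming the standard WHATWG `URL` parser accepts the connection string; `describeDatabaseUrl` is a hypothetical helper that does not exist in the codebase, and the second host name is an example value only.

```typescript
// Hypothetical helper: summarize a PostgreSQL URL without exposing credentials.
interface DatabaseUrlSummary {
  host: string;
  port: string;
  database: string;
  sslmode: string | null; // null means the URL carries no sslmode parameter
}

function describeDatabaseUrl(rawUrl: string): DatabaseUrlSummary {
  const url = new URL(rawUrl);
  return {
    host: url.hostname,
    port: url.port,
    database: url.pathname.replace(/^\//, ""),
    sslmode: url.searchParams.get("sslmode"),
  };
}

// Example values only; the first host follows the PR-environment pattern above.
const prSummary = describeDatabaseUrl(
  "postgresql://keeperhub:secret@keeperhub-pr-1-db-rw.pr-1.svc.cluster.local:5432/keeperhub"
);
const sslSummary = describeDatabaseUrl(
  "postgresql://keeperhub:secret@db.example.internal:5432/keeperhub?sslmode=require"
);
```

Logging such a summary at startup would show whether an environment's URL requests SSL at all, which is the first open question for staging and production.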

## Why Staging/Production Fail

Staging and production read the DATABASE_URL from AWS Parameter Store:

| Environment | Parameter |
|---|---|
| Staging | `/eks/maker-staging/keeperhub/db-url` |
| Production | `/eks/maker-prod/keeperhub/db-url` |

These URLs are generated by CNPG. CNPG enables SSL by default:

1. The CNPG operator generates self-signed TLS certificates for the cluster
2. The default `pg_hba.conf` is assumed to use `hostssl` rules, in which case **only SSL connections are accepted** (still to be confirmed; see Investigation Needed)
3. The connection string likely includes `?sslmode=require` or the server mandates SSL during handshake

When `pg-boss` (which uses the `pg` / node-postgres library internally) connects:
1. SSL is negotiated with the server
2. Node.js TLS validates the certificate chain
3. The chain contains a self-signed certificate (CNPG's generated CA)
4. Node.js rejects it: `SELF_SIGNED_CERT_IN_CHAIN`
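Because the failure surfaces as a generic startup error, it can help to recognize the TLS verification codes explicitly and log something actionable. The sketch below is an illustration only: the helper names are hypothetical, and the set lists Node.js verification error codes that indicate an untrusted chain.

```typescript
// Node.js TLS verification error codes that indicate an untrusted certificate chain.
const UNTRUSTED_CHAIN_CODES = new Set<string>([
  "SELF_SIGNED_CERT_IN_CHAIN",
  "DEPTH_ZERO_SELF_SIGNED_CERT",
  "UNABLE_TO_GET_ISSUER_CERT_LOCALLY",
  "UNABLE_TO_VERIFY_LEAF_SIGNATURE",
]);

// Hypothetical helper: true when the error carries one of the codes above.
function isUntrustedChainError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return typeof code === "string" && UNTRUSTED_CHAIN_CODES.has(code);
}

// Hypothetical helper: turn a startup error into a more actionable message.
function explainStartupError(err: unknown): string {
  if (isUntrustedChainError(err)) {
    return "PostgreSQL TLS chain is not trusted by Node.js; check sslmode and the CA bundle";
  }
  return err instanceof Error ? err.message : String(err);
}
```

Wrapping `world.start()` in a `try`/`catch` that calls `explainStartupError` would be one way to use it; whether to rethrow afterwards is a separate decision.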

## Why Drizzle ORM Doesn't Hit This

The existing Drizzle ORM connection (`lib/db/index.ts`) uses the same DATABASE_URL but doesn't fail because:

- Drizzle uses `postgres.js` v3 (the `postgres` npm package), not `pg` (node-postgres)
- The connection is **lazy** — it only connects on first database query, after the server is running
- `postgres.js` v3 may handle SSL negotiation differently from `pg`

In contrast, `world.start()` runs during the instrumentation hook (before the HTTP server starts) and pg-boss connects **eagerly**.

## Connection Architecture

```
instrumentation.ts
  world.start()
    pg-boss (uses `pg` library)       --> CNPG SSL --> SELF_SIGNED_CERT_IN_CHAIN
    postgres.js v3 (direct queries)   --> CNPG SSL --> may also fail

lib/db/index.ts
  postgres.js v3 (Drizzle ORM)        --> CNPG SSL --> works (different SSL handling or lazy)
```

Both connect to the same CNPG cluster, same URL, same self-signed certs. The difference is the driver (`pg` vs `postgres.js`) and timing (eager vs lazy).
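The timing difference can be illustrated with a simulation that counts connection attempts instead of opening sockets. This is a sketch only: `EagerClient`, `LazyClient` and `fakeConnect` are hypothetical stand-ins, not the real pg-boss or `postgres.js` APIs, and the calls are synchronous for brevity.

```typescript
// Simulated connector: counts attempts instead of opening a socket.
let connectAttempts = 0;
function fakeConnect(): string {
  connectAttempts += 1;
  return "connected";
}

// Eager client: connects inside start(), as pg-boss does according to this document.
class EagerClient {
  start(): void {
    fakeConnect();
  }
}

// Lazy client: connects on the first query, as postgres.js does according to this document.
class LazyClient {
  private connection: string | null = null;
  query(_sql: string): string {
    if (this.connection === null) {
      this.connection = fakeConnect();
    }
    return this.connection;
  }
}

const lazy = new LazyClient(); // construction alone opens nothing
const attemptsAfterLazyInit = connectAttempts;

new EagerClient().start(); // a TLS failure here would abort server startup
const attemptsAfterEagerStart = connectAttempts;

lazy.query("select 1"); // a TLS failure here would surface on the first request instead
const attemptsAfterFirstQuery = connectAttempts;
```

If the lazy path also rejects the certificate, the failure would appear as a runtime query error rather than a startup crash, which is why a clean startup before `world-postgres` does not prove that the Drizzle connection verifies the chain successfully.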

## Solution Options

### Option 1: Modify sslmode in the Connection URL

Append or change the `sslmode` parameter before world-postgres consumes it.

**In `instrumentation.ts`:**

```typescript
if (rawUrl) {
  let encoded = encodePostgresPassword(rawUrl);
  // Override sslmode before world-postgres reads the URL.
  // Whether 'require' skips certificate verification depends on the `pg`
  // version in use (see the risks below).
  const urlObj = new URL(encoded);
  urlObj.searchParams.set('sslmode', 'require'); // or 'prefer' or 'disable'
  encoded = urlObj.toString();
  process.env.WORKFLOW_POSTGRES_URL = encoded;
}
```

PostgreSQL `sslmode` values:

| Mode | SSL | Cert Verification | Notes |
|---|---|---|---|
| `disable` | No | N/A | No encryption. Only works if CNPG `pg_hba.conf` allows `host` (not just `hostssl`) |
| `allow` | Optional | No | Client prefers non-SSL, server can force SSL |
| `prefer` | Preferred | No | Client prefers SSL, falls back to non-SSL |
| `require` | Yes | **Depends on driver** | `pg` library: may still verify certs. `libpq`: no verification |
| `verify-ca` | Yes | CA only | Requires trusted CA certificate |
| `verify-full` | Yes | CA + hostname | Strictest — requires matching CA and hostname |

**Risk:** The behavior of `sslmode=require` varies between `pg` library versions. Some versions map it to `ssl: { rejectUnauthorized: false }`, others to `ssl: true` (which still verifies the chain). Check the pg-boss dependency tree to confirm which `pg` version is used before relying on this option.

**Risk:** `sslmode=disable` won't work if CNPG's `pg_hba.conf` only has `hostssl` rules (rejects non-SSL connections).
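The URL rewrite can be factored into a small pure function so that it is easy to test in isolation. This is a sketch under the assumption that the URL has already been made parseable (for example by `encodePostgresPassword`); `withSslMode` and the `SslMode` type are hypothetical names. Some versions of `pg-connection-string` are also believed to accept a non-standard `sslmode=no-verify`; that is an assumption to verify against the installed version and is deliberately left out of the type.

```typescript
// The standard libpq sslmode values listed in the table above.
type SslMode = "disable" | "allow" | "prefer" | "require" | "verify-ca" | "verify-full";

// Hypothetical helper: return a copy of the URL with sslmode set to `mode`,
// replacing any existing sslmode and keeping all other query parameters.
function withSslMode(rawUrl: string, mode: SslMode): string {
  const url = new URL(rawUrl);
  url.searchParams.set("sslmode", mode);
  return url.toString();
}

// Example value only.
const rewritten = withSslMode(
  "postgresql://keeperhub:secret@db.example.internal:5432/keeperhub?sslmode=require&application_name=world",
  "verify-full"
);
```

Using `searchParams.set` rather than string concatenation avoids ending up with two conflicting `sslmode` parameters when the Parameter Store URL already contains one.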

### Option 2: NODE_TLS_REJECT_UNAUTHORIZED=0 (Process-wide)

Add the env var to staging and production Helm values:

```yaml
NODE_TLS_REJECT_UNAUTHORIZED:
  type: kv
  value: "0"
```

**How it works:** Tells Node.js to skip certificate verification for ALL TLS connections in the process.

**Pros:**
- Simple, guaranteed to work
- Common pattern for K8s workloads with internal self-signed certs
- Connections still use TLS encryption (data in transit is encrypted)
- External API calls (OpenAI, Sentry, SendGrid) still work — they use public CAs

**Cons:**
- Disables cert verification for ALL outgoing HTTPS connections, not just PostgreSQL
- A MITM attack on outbound connections (e.g., to OpenAI API) would not be detected
- In practice, the risk is low for traffic that stays inside the AWS VPC / K8s cluster network, but calls to external APIs leave that boundary

### Option 3: Scoped NODE_TLS_REJECT_UNAUTHORIZED in instrumentation.ts

Set the env var only during `world.start()`:

```typescript
const prevTls = process.env.NODE_TLS_REJECT_UNAUTHORIZED;
process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0';

try {
  const { getWorld } = await import("workflow/runtime");
  const world = getWorld();
  if (world.start) {
    await world.start();
    console.log("[Workflow] Postgres World initialized");
  }
} finally {
  // Restore the previous value even if world.start() throws
  if (prevTls !== undefined) {
    process.env.NODE_TLS_REJECT_UNAUTHORIZED = prevTls;
  } else {
    delete process.env.NODE_TLS_REJECT_UNAUTHORIZED;
  }
}
```

**Pros:**
- More targeted than Option 2
- External HTTPS connections after startup use normal cert verification

**Cons:**
- pg-boss maintains a connection pool. If a connection drops and pg-boss creates a new one AFTER we restore the env var, the new connection will fail with the same SSL error
- Gives a false sense of security — in practice, pg-boss reconnections will eventually fail
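The reconnection problem can be shown with a simulation. The sketch below assumes, as this option does, that the variable is consulted each time a TLS connection is opened; `simulatedConnect` is a hypothetical stand-in for that check and opens no real connection.

```typescript
type ConnectResult = "ok" | "SELF_SIGNED_CERT_IN_CHAIN";

// Hypothetical stand-in for a TLS connect to a server with a self-signed chain:
// it succeeds only while verification is disabled at the moment of connecting.
function simulatedConnect(env: Record<string, string | undefined>): ConnectResult {
  return env["NODE_TLS_REJECT_UNAUTHORIZED"] === "0" ? "ok" : "SELF_SIGNED_CERT_IN_CHAIN";
}

const simulatedEnv: Record<string, string | undefined> = {};

// Startup: verification is disabled only around the initial connection.
simulatedEnv["NODE_TLS_REJECT_UNAUTHORIZED"] = "0";
const initialConnect = simulatedConnect(simulatedEnv);
delete simulatedEnv["NODE_TLS_REJECT_UNAUTHORIZED"];

// Later: the pool replaces a dropped connection after the variable was restored.
const reconnect = simulatedConnect(simulatedEnv);
```

The first attempt succeeds and the later one fails, which matches the failure mode described in the cons: the app starts cleanly and breaks only when the pool has to open a new connection.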

### Option 4: Configure CNPG to Allow Non-SSL Connections

Modify the CNPG cluster manifest to use `host` instead of `hostssl` in `pg_hba.conf`:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keeperhub-staging-db
spec:
  postgresql:
    pg_hba:
      - host all all all scram-sha-256
```

**Pros:**
- Root cause fix — removes SSL requirement entirely for intra-cluster traffic
- No application code changes needed
- Pod-to-pod K8s traffic is already network-isolated

**Cons:**
- Infrastructure change — requires updating CNPG cluster manifests for staging and production
- Removes encryption for DB traffic (acceptable within K8s network policies, but less defense-in-depth)
- May require CNPG cluster restart/rolling update

### Option 5: Add CNPG CA Certificate to Node.js Trust Store

Mount the CNPG CA certificate and set `NODE_EXTRA_CA_CERTS`:

```yaml
NODE_EXTRA_CA_CERTS:
  type: kv
  value: "/etc/cnpg-certs/ca.crt"
```

With a volume mount from the CNPG TLS secret:

```yaml
volumes:
  - name: cnpg-ca
    secret:
      secretName: keeperhub-staging-db-ca
      items:
        - key: ca.crt
          path: ca.crt
volumeMounts:
  - name: cnpg-ca
    mountPath: /etc/cnpg-certs
    readOnly: true
```

**Pros:**
- Most correct solution — SSL is maintained with proper cert verification
- No security tradeoffs
- External HTTPS connections unaffected

**Cons:**
- Most complex to implement
- Requires knowing the CNPG CA secret name (varies per cluster)
- Needs Helm values changes for volume mounts (may require chart modifications)
- Must be done for both staging and production CNPG clusters
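Node.js reads `NODE_EXTRA_CA_CERTS` only once, when the process launches, so a missing or empty mount is easy to overlook. A startup check can make that visible. The following is a minimal sketch: `checkExtraCaCerts` is a hypothetical helper, and the file reader is injected so that the function stays testable without touching the filesystem.

```typescript
type CaCheck =
  | { ok: true; path: string; certificates: number }
  | { ok: false; reason: string };

// Hypothetical helper: verify that NODE_EXTRA_CA_CERTS points at a readable
// file containing at least one PEM certificate.
function checkExtraCaCerts(
  env: Record<string, string | undefined>,
  readFile: (path: string) => string // injected reader; real code would wrap fs.readFileSync
): CaCheck {
  const path = env["NODE_EXTRA_CA_CERTS"];
  if (!path) return { ok: false, reason: "NODE_EXTRA_CA_CERTS is not set" };

  let pem: string;
  try {
    pem = readFile(path);
  } catch {
    return { ok: false, reason: `cannot read ${path}` };
  }

  // Count PEM blocks; this does not validate the certificates themselves.
  const certificates = (pem.match(/-----BEGIN CERTIFICATE-----/g) ?? []).length;
  return certificates > 0
    ? { ok: true, path, certificates }
    : { ok: false, reason: `${path} contains no PEM certificates` };
}
```

In `instrumentation.ts` this could be called as `checkExtraCaCerts(process.env, (p) => readFileSync(p, "utf8"))` before `world.start()`, logging the reason when the check fails. The check only confirms that the file is present and PEM-shaped, not that it is the CA that signed the CNPG server certificate.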

## Recommendation

**Short term:** Option 2 (`NODE_TLS_REJECT_UNAUTHORIZED=0` in Helm values). It's simple, works immediately, and the security risk is minimal inside K8s.

**Long term:** Option 5 (mount CNPG CA cert) or Option 4 (allow non-SSL). Option 5 is the most correct but requires infrastructure work. Option 4 is simpler but removes encryption.

## Investigation Needed

Before implementing, verify:

1. **What sslmode is in the Parameter Store URL?** Check the actual value of `/eks/maker-staging/keeperhub/db-url` to confirm whether it has `?sslmode=require` or another value
2. **Does CNPG enforce SSL?** Check the CNPG cluster manifest for `pg_hba` configuration — if it's `hostssl` only, `sslmode=disable` won't work
3. **What pg version does pg-boss use?** Check `node_modules/pg-boss/package.json` for the `pg` dependency version — this determines how `sslmode=require` is interpreted
4. **What's the CNPG CA secret name?** If going with Option 5, identify the secret that holds the CA certificate

```bash
# Check Parameter Store URL (mask credentials)
aws ssm get-parameter --name /eks/maker-staging/keeperhub/db-url \
--with-decryption --query 'Parameter.Value' --output text \
| sed 's|://[^@]*@|://***:***@|'

# Check CNPG cluster config
kubectl get cluster -n keeperhub -o yaml | grep -A 10 pg_hba

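# Check CNPG CA secret
kubectl get secrets -n keeperhub | grep -- '-ca'

# Check the pg version range declared by pg-boss
jq '.dependencies.pg' node_modules/pg-boss/package.json

# Check the pg version actually installed (may be nested under pg-boss)
jq -r '.version' node_modules/pg/package.json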
```