Skip to content

[Foundation F6.3] Tunnel health check + auto-reconnect#650

Open
furukama wants to merge 2 commits intomainfrom
codex/issue-568-tunnel-health
Open

[Foundation F6.3] Tunnel health check + auto-reconnect#650
furukama wants to merge 2 commits intomainfrom
codex/issue-568-tunnel-health

Conversation

@furukama
Copy link
Copy Markdown
Contributor

Summary

Closes #568.

Implements tunnel health monitoring for the built-in ngrok tunnel provider:

  • health-checks the active tunnel at a configurable interval, defaulting to 30s
  • exposes tunnel state transitions through status() and onStatusChange
  • reconnects failed tunnels with capped exponential backoff
  • records structured tunnel.up and tunnel.down audit events
  • documents deployment.tunnel.health_check_interval_ms and updates the config example/changelog

Acceptance Criteria

  • Health check at configurable interval (default 30s).
  • On failure: reconnect with capped backoff; surface state changes.
  • Audit-log entry per tunnel up/down.

Validation

  • /Users/bkoehler/src/hybridclaw-electron-mac/desktop/build/runtime-bin/node ./node_modules/vitest/vitest.mjs run --configLoader runner --config vitest.unit.config.ts tests/ngrok-tunnel.test.ts tests/runtime-config-deployment.test.ts
  • /Users/bkoehler/src/hybridclaw-electron-mac/desktop/build/runtime-bin/node ./node_modules/typescript/bin/tsc --noEmit --noUnusedLocals --noUnusedParameters
  • /Users/bkoehler/src/hybridclaw/node_modules/.bin/biome check --write src
  • /Users/bkoehler/src/hybridclaw/node_modules/.bin/biome check src

@furukama furukama marked this pull request as ready for review April 28, 2026 21:52
Copilot AI review requested due to automatic review settings April 28, 2026 21:52
@furukama
Copy link
Copy Markdown
Contributor Author

@claude Do a thorough review of this PR in terms of: KISS, YAGNI, DRY, Fail Fast, Secure by Default and performance/speed

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Claude finished @furukama's task in 3m 31s —— View job


PR Review: Tunnel health check + auto-reconnect

Tasks:

  • Gather context and read changed files
  • Review: KISS (Keep It Simple)
  • Review: YAGNI (You Aren't Gonna Need It)
  • Review: DRY (Don't Repeat Yourself)
  • Review: Fail Fast
  • Review: Secure by Default
  • Review: Performance / Speed
  • Post final review

Overall the implementation is solid: the state machine is correct, tests are thorough, and the backoff logic is well-structured. The issues below are worth addressing before merge, with one critical gap (the config value is never wired in).


Critical

health_check_interval_ms is stored in config but never passed to the provider

RuntimeDeploymentTunnelConfig.health_check_interval_ms is normalized and persisted, but createNgrokTunnelProvider (the only export for creating the provider) is never called anywhere in src/ outside of its own file. There are zero callers that pass healthCheckIntervalMs from config to the constructor. The operator can set deployment.tunnel.health_check_interval_ms all day and the tunnel provider silently ignores it, always using the 30 s default. Fix this →


KISS

updateStatus double-serializes for change detection (ngrok-tunnel-provider.ts:321,338)

const before = JSON.stringify(this.status());
// ... apply fields ...
const after = JSON.stringify(this.status());
if (before !== after) { this.publishStatusChange(); }

this.status() allocates a new object, and JSON.stringify walks it — twice, on every status update. A simple version counter (private statusVersion = 0) incremented when any field changes is both cheaper and clearer. Fix this →

unrefTimer has an over-engineered type guard (ngrok-tunnel-provider.ts:111-120)

if (
  typeof timer === 'object' && timer &&
  'unref' in timer && typeof timer.unref === 'function'
) { timer.unref(); }

TunnelTimer is ReturnType<typeof setTimeout>. In Node.js (the only runtime this code runs in) that's always NodeJS.Timeout, which always has .unref(). The 4-condition guard is guarding against a scenario that can't happen. if ('unref' in timer) (timer as NodeJS.Timeout).unref() is sufficient.

makeTunnelAuditRunId needlessly combines two sources of entropy (ngrok-tunnel-provider.ts:126-128)

return `tunnel_${Date.now()}_${randomUUID().slice(0, 8)}`;

randomUUID() is already globally unique. The Date.now() prefix adds no safety; it just makes the ID longer and harder to read. Use randomUUID() directly (or keep the tunnel_ prefix plus the full UUID). Fix this →

Two separate import blocks from the same module (ngrok-tunnel-provider.ts:5-16)

There are two separate import … from './tunnel-provider.js' statements — one for type-only imports and one for value imports. They can be merged into a single import block.

recordDefaultTunnelAudit uses dynamic import() on every invocation (ngrok-tunnel-provider.ts:130-135)

async function recordDefaultTunnelAudit(input: TunnelAuditRecord): Promise<void> {
  const { recordAuditEvent } = await import('../audit/audit-events.js');
  recordAuditEvent(input);
}

Node.js caches dynamic imports, so this won't re-execute the module, but it re-enters the dynamic import machinery every call (promise allocation, microtask queue, cache lookup). A top-level static import is simpler and faster. Fix this →


YAGNI

auditSessionId constructor option (ngrok-tunnel-provider.ts:49,167-168)

This is injectable but there is no caller that overrides it — all production and test paths use the 'system:tunnel' default. It's a pre-emptive extensibility point with no current use. Remove the option and keep the constant.

healthCheckPath constructor option (ngrok-tunnel-provider.ts:54,177)

The path /health is a reasonable hard convention for a service health check. No test or caller overrides this. If a future provider needs a different path, that's the time to add the option.

reconnectInitialBackoffMs / reconnectMaxBackoffMs as constructor options

These are tunable only through tests (vi.useFakeTimers()). They're not exposed in config, no production caller sets them, and their defaults are the only sensible values. Either promote them to config-backed options (alongside health_check_interval_ms) or drop them and use the constants directly.


DRY

errorMessage() is defined but not consistently used (ngrok-tunnel-provider.ts:351,419)

publishStatusChange and recordTunnelAudit both inline error instanceof Error ? error.message : String(error) directly — the exact pattern errorMessage() was extracted to avoid. They should call errorMessage(error). Fix this →

clearHealthTimer and clearReconnectTimer are identical in structure (ngrok-tunnel-provider.ts:424-434)

private clearHealthTimer(): void {
  if (!this.healthTimer) return;
  clearTimeout(this.healthTimer);
  this.healthTimer = null;
}
// same pattern repeated for reconnectTimer

A single private clearTimer(ref: 'healthTimer' | 'reconnectTimer') (or a generic inline helper) eliminates the duplication. Fix this →

"Reset to down" state update is scattered across 3 call sites

start() catch block (lines 224-231), stop() (lines 284-290), and markTunnelDown each compose a similar "clear listener, publicUrl, and set state=down/reconnecting" sequence. A small internal helper would make each site's intent clearer and reduce drift risk.


Fail Fast

No constructor validation that reconnectInitialBackoffMs ≤ reconnectMaxBackoffMs

If a caller passes reconnectInitialBackoffMs: 60_000 and reconnectMaxBackoffMs: 1_000, Math.min() silently caps the first attempt to 1 s, but subsequent attempts also stay at 1 s, making the "initial" parameter meaningless. A constructor-time check (if (this.reconnectInitialBackoffMs > this.reconnectMaxBackoffMs) throw new Error(...)) surfaces misconfiguration immediately.

start() guard is weaker than it looks (ngrok-tunnel-provider.ts:203-205)

if (this.listener && this.publicUrl) {
  return { public_url: this.publicUrl };
}

If start() is called while the tunnel is in 'reconnecting' state (listener is null, state is reconnecting), this guard does not fire. The method proceeds, cancels the reconnect timer, and re-attempts openTunnel(). This may be intentional ("manual reconnect beats auto-reconnect"), but it's silent. A comment or an explicit if (this.state === 'reconnecting') branch would make the intent clear.

Initial start() failure is silent in audit logs

When openTunnel() throws during the initial start() call, lastError is set, but no tunnel.down audit event is emitted (because the tunnel was never "up"). This is logically correct, but means a repeated start-failure cycle produces no audit trail. Consider a tunnel.start_failed event type, or at minimum document this gap.


Secure by Default

recordAuditEvent: false makes audit opt-out too easy

recordAuditEvent?: TunnelAuditRecorder | false;

Audit logging is a security control (operator visibility into tunnel lifecycle). The false escape hatch has no JSDoc warning and no enforcement. Tests use vi.fn() (correct), but a production caller can silently pass false with no indication that they're disabling a security feature. Consider requiring explicit documentation or removing the escape hatch in favour of the injected mock pattern already used in tests. Fix this →

Health checks hit the public internet URL

Each 30 s health check sends an HTTP GET to https://<uuid>.ngrok.app/health — through ngrok's infrastructure and back to the local service. This is correct for testing tunnel liveness, but:

  1. It consumes ngrok free-tier request quota every 30 s.
  2. It generates outbound traffic visible to ngrok's edge.

Consider documenting this in the config reference so operators can tune health_check_interval_ms knowing the cost.


Performance / Speed

JSON.stringify twice per updateStatus call (same as KISS point — repeated here as a perf note)

At 30 s intervals this is trivial, but during a reconnect storm (health check fails, reconnect loop, repeated updateStatus calls) this allocates unnecessary objects and strings in a hot path.

recordDefaultTunnelAudit dynamic import on every event (same as KISS point)

Even with module cache hits, the overhead is unnecessary for a code path that already exists only at module-load time.

No jitter in exponential backoff (ngrok-tunnel-provider.ts:504-523)

const delayMs = Math.min(
  this.reconnectMaxBackoffMs,
  this.reconnectInitialBackoffMs * 2 ** Math.max(0, attempt - 1),
);

Deterministic backoff is fine for a single instance, but adding ±10% jitter (delayMs * (0.9 + Math.random() * 0.2)) is a best practice for production reconnect logic and costs nothing. If multiple instances restart simultaneously (e.g., after a config change), they'll all hammer the ngrok API at identical intervals.


Minor

  • runId in audit records has no correlation role — each tunnel.up and its subsequent tunnel.down get different runId values (generated fresh each time). If the intent is to correlate a tunnel session's up/down pair, the runId should be generated once per tunnel session (e.g., in markTunnelUp) and carried through to the corresponding down event.
  • normalizeDeploymentConfig now always emits health_check_interval_ms — old config files on disk won't have this key; the normalizer correctly defaults to 30 000 ms. But the TypeScript type makes the field required (health_check_interval_ms: number), so any test fixture or factory that constructs a partial RuntimeDeploymentTunnelConfig without this field now gets a compile error. Audit test fixtures for silent breakage.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds health monitoring and auto-reconnect behavior to the built-in ngrok tunnel provider, with corresponding runtime-config support, audit events, tests, and documentation updates.

Changes:

  • Extend tunnel provider status to include lifecycle/health metadata and add defaults for health check + reconnect timings.
  • Implement ngrok tunnel health checks, status transitions, capped exponential backoff reconnect, and tunnel.up/tunnel.down structured audit events.
  • Add deployment.tunnel.health_check_interval_ms to runtime config defaults/normalization, update config example/docs/changelog, and expand test coverage.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/runtime-config-deployment.test.ts Updates expectations to include the new deployment tunnel health-check interval field.
tests/ngrok-tunnel.test.ts Adds/updates tests for expanded tunnel status, audit events, health checks, and reconnect backoff behavior.
src/tunnel/tunnel-provider.ts Expands TunnelStatus and introduces default constants for health checks/reconnect.
src/tunnel/ngrok-tunnel-provider.ts Implements health checking, reconnect logic, status transitions, and audit event recording for ngrok tunnels.
src/config/runtime-config.ts Bumps config version and adds deployment.tunnel.health_check_interval_ms default + normalization.
docs/development/reference/configuration.md Documents the new deployment tunnel health-check config key (but see list-rendering comment).
docs/development/guides/remote-access.md Updates remote-access guide with the new config key and behavior description.
docs/content/reference/configuration.md Mirrors configuration reference doc update for the docs site (but see list-rendering comment).
docs/content/guides/remote-access.md Mirrors remote-access guide update for the docs site.
config.example.json Adds the new config key and bumps the example config version.
README.md Updates the configuration key summary to include the new tunnel health-check interval.
CHANGELOG.md Notes tunnel health checks, audit events, and reconnect behavior in Unreleased.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +145 to +151
- `deployment.mode`, `deployment.public_url`, `deployment.tunnel.provider`, and
`deployment.tunnel.health_check_interval_ms` for declaring whether the
gateway runs behind a cloud URL or a local tunnel; cloud mode requires
`deployment.public_url`, while local mode requires a tunnel provider such as
`manual`, `ssh`, `ngrok`, `cloudflare`, or `tailscale`. The built-in ngrok
tunnel provider reads `NGROK_AUTHTOKEN` from the encrypted runtime secret
store and health-checks active tunnels every 30 seconds by default
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs/static/docs.js markdown renderer closes lists unless each line starts with a list marker, so this multi-line bullet (with continuation lines that start with indentation) will render as a single bullet followed by a separate paragraph. Consider rewriting as a single-line bullet, or split into multiple - bullets so it renders correctly on the docs site.

Suggested change
- `deployment.mode`, `deployment.public_url`, `deployment.tunnel.provider`, and
`deployment.tunnel.health_check_interval_ms` for declaring whether the
gateway runs behind a cloud URL or a local tunnel; cloud mode requires
`deployment.public_url`, while local mode requires a tunnel provider such as
`manual`, `ssh`, `ngrok`, `cloudflare`, or `tailscale`. The built-in ngrok
tunnel provider reads `NGROK_AUTHTOKEN` from the encrypted runtime secret
store and health-checks active tunnels every 30 seconds by default
- `deployment.mode`, `deployment.public_url`, `deployment.tunnel.provider`, and `deployment.tunnel.health_check_interval_ms` for declaring whether the gateway runs behind a cloud URL or a local tunnel; cloud mode requires `deployment.public_url`, while local mode requires a tunnel provider such as `manual`, `ssh`, `ngrok`, `cloudflare`, or `tailscale`. The built-in ngrok tunnel provider reads `NGROK_AUTHTOKEN` from the encrypted runtime secret store and health-checks active tunnels every 30 seconds by default

Copilot uses AI. Check for mistakes.
Comment on lines +179 to +185
- `deployment.mode`, `deployment.public_url`, `deployment.tunnel.provider`, and
`deployment.tunnel.health_check_interval_ms` for declaring whether the
gateway runs behind a cloud URL or a local tunnel; cloud mode requires
`deployment.public_url`, while local mode requires a tunnel provider such as
`manual`, `ssh`, `ngrok`, `cloudflare`, or `tailscale`. The built-in ngrok
tunnel provider reads `NGROK_AUTHTOKEN` from the encrypted runtime secret
store and health-checks active tunnels every 30 seconds by default
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs/static/docs.js markdown renderer closes lists unless each line starts with a list marker, so this multi-line bullet (with continuation lines that start with indentation) will render as a single bullet followed by a separate paragraph. Consider rewriting as a single-line bullet, or split into multiple - bullets so it renders correctly on the docs site.

Suggested change
- `deployment.mode`, `deployment.public_url`, `deployment.tunnel.provider`, and
`deployment.tunnel.health_check_interval_ms` for declaring whether the
gateway runs behind a cloud URL or a local tunnel; cloud mode requires
`deployment.public_url`, while local mode requires a tunnel provider such as
`manual`, `ssh`, `ngrok`, `cloudflare`, or `tailscale`. The built-in ngrok
tunnel provider reads `NGROK_AUTHTOKEN` from the encrypted runtime secret
store and health-checks active tunnels every 30 seconds by default
- `deployment.mode`, `deployment.public_url`, `deployment.tunnel.provider`, and `deployment.tunnel.health_check_interval_ms` for declaring whether the gateway runs behind a cloud URL or a local tunnel; cloud mode requires `deployment.public_url`, while local mode requires a tunnel provider such as `manual`, `ssh`, `ngrok`, `cloudflare`, or `tailscale`. The built-in ngrok tunnel provider reads `NGROK_AUTHTOKEN` from the encrypted runtime secret store and health-checks active tunnels every 30 seconds by default

Copilot uses AI. Check for mistakes.
Comment on lines +126 to +135
function makeTunnelAuditRunId(): string {
return `tunnel_${Date.now()}_${randomUUID().slice(0, 8)}`;
}

async function recordDefaultTunnelAudit(
input: TunnelAuditRecord,
): Promise<void> {
const { recordAuditEvent } = await import('../audit/audit-events.js');
recordAuditEvent(input);
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repo already exposes audit run ID generation via audit-events' makeAuditRunId(prefix); this file introduces a separate makeTunnelAuditRunId() implementation. Using the shared helper (or at least centralizing run-id generation) keeps run IDs consistent across audit events and avoids duplicating ID logic.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Foundation F6.3] Tunnel health check + auto-reconnect

2 participants