
Feat/kubernetes runtime #10

Merged
Chocksy merged 9 commits into Chocksy:main from vsevolod:feat/kubernetes-runtime
Apr 13, 2026
Conversation

@vsevolod (Contributor) commented Apr 10, 2026

Summary

This PR adds Kubernetes as a first-class sandbox runtime for Gooseherd and wires the existing run lifecycle through a shared execution backend model.

It introduces:

  • runtime selection via SANDBOX_RUNTIME=local|docker|kubernetes
  • a control-plane contract for runner payloads, events, artifacts, and completion reporting
  • a Kubernetes execution backend that launches per-run jobs and reconciles their state back into Gooseherd
  • a local minikube workflow for running Gooseherd end-to-end inside Kubernetes
  • supporting schema changes, docs, manifests, scripts, and test coverage
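The runtime selection above implies a preflight guard that rejects unknown `SANDBOX_RUNTIME` values before any backend is constructed. A minimal sketch of such a parser follows; the function name, default, and error message are illustrative assumptions, not the PR's actual code:

```typescript
// Hypothetical preflight validation for SANDBOX_RUNTIME=local|docker|kubernetes.
const SANDBOX_RUNTIMES = ["local", "docker", "kubernetes"] as const;
type SandboxRuntime = (typeof SANDBOX_RUNTIMES)[number];

function parseSandboxRuntime(value: string | undefined): SandboxRuntime {
  // Assumed default: fall back to the in-process runtime when unset.
  if (value === undefined || value === "") return "local";
  if ((SANDBOX_RUNTIMES as readonly string[]).includes(value)) {
    return value as SandboxRuntime;
  }
  throw new Error(
    `Unsupported SANDBOX_RUNTIME "${value}"; expected one of: ${SANDBOX_RUNTIMES.join(", ")}`,
  );
}
```

Failing fast here keeps a typo like `SANDBOX_RUNTIME=kubernets` from silently running in a different runtime.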

What Changed

  • Replaced the old binary sandbox toggle with explicit runtime modes and runtime preflight/hot-reload guards.
  • Added shared execution backend abstractions and routed RunManager through local, docker, and kubernetes backends.
  • Added the runner control-plane API and storage layer for payload delivery, auth tokens, event ingestion, artifact allocation, and completion idempotency.
  • Added the Kubernetes backend, job spec generation, Kubernetes resource client, runner client, and reconciliation flow.
  • Added local Kubernetes deployment assets:
    • kubernetes/local/*
    • kubernetes/app.Dockerfile
    • kubernetes/runner.Dockerfile
    • npm run k8s:local-up
    • npm run k8s:local-down
    • npm run k8s:local-status
    • npm run k8s:smoke
    • npm run k8s:harness
  • Added smoke/harness helper scripts for manual validation of the Kubernetes execution path.
  • Added migrations for runtime metadata and control-plane tables.
  • Updated dashboard settings to surface sandbox runtime status and labels.
  • Added follow-up test fix after rebasing onto main: control-plane router tests now pass the new startDashboardServer argument layout correctly.

Verification

  • docker compose run --rm -e DATABASE_URL_TEST=postgres://gooseherd:gooseherd@postgres:5432/gooseherd_test -v "$PWD:/app" --entrypoint sh gooseherd -lc 'node --test --test-reporter=tap --import tsx tests/control-plane-router.test.ts'
  • GitHub Actions build-and-test is the primary signal for the full PR because the full local suite includes tests/e2e-pipeline.test.ts, which depends on cloning epiccoders/pxls.

Notes

  • This PR is intentionally broad because it delivers the full vertical slice for Kubernetes runtime support: config, orchestration, runner protocol, local deployment path, docs, and tests.
  • The current local Kubernetes workflow is minikube-oriented and is meant to mirror the in-cluster production shape more closely than docker compose.

@vsevolod force-pushed the feat/kubernetes-runtime branch from d7559de to 1e4b68b on April 10, 2026 at 15:15
@Chocksy (Owner) left a comment

Code Review: PR #10 "Feat/kubernetes runtime"

Author: @vsevolod | Branch: feat/kubernetes-runtime | Size: 103 files, +9,809 / -151 lines
Spec Compliance Score: ~40-45% (core API solid, operational guarantees largely absent)
Review Date: 2026-04-10
Reviewers: 8 specialized AI review agents (security, architecture, performance, data integrity, patterns, simplicity, spec compliance, test runner)
Independent Verification: Codex (GPT) cross-checked all CRITICAL and HIGH findings against source code


Verification Summary

All 13 CRITICAL and HIGH findings were independently verified by Codex against the actual source code:

| Finding | Verdict | Notes |
| --- | --- | --- |
| C1 (timing-unsafe token comparison) | CONFIRMED ACCURATE | === used; safeTokenCompare exists in dashboard/auth.ts but is not imported |
| C2 (tokens never expire/revoked) | CONFIRMED ACCURATE | usedAt only ever written as null |
| C3 (migration journal collision) | PARTIALLY ACCURATE | Tag-prefix collision exists (0004_ x2), but idx values are distinct; risk depends on Drizzle version behavior |
| C4 (unbounded log capture) | CONFIRMED ACCURATE | No limitBytes or tailLines constraint |
| C5 (cancellation polling no try/catch) | CONFIRMED ACCURATE | Network error rejects promise, crashes runner |
| H1 (uuid/text mismatch, no FKs) | CONFIRMED ACCURATE | runs.id is uuid, control plane tables use text, no REFERENCES |
| H2 (plaintext token in manifest) | PARTIALLY ACCURATE | Token IS written to file; claim of "never cleaned up" may be overstated (workspace-cleaner exists) |
| H3 (no NetworkPolicy) | CONFIRMED ACCURATE | No NetworkPolicy manifest in kubernetes/local/ |
| H4 (eager KubeConfig loading) | PARTIALLY ACCURATE | Backend IS always constructed; whether it actually throws depends on environment |
| H5 (linear retry backoff) | CONFIRMED ACCURATE | BASE_RETRY_DELAY_MS * attempt is linear |
| H6 (no token validation caching) | CONFIRMED ACCURATE | DB hit on every auth check, no cache |
| H7 (smoke prefix for all runs) | CONFIRMED ACCURATE | gooseherd-smoke- prefix unconditionally |
| H8 (reconciler missing rows 3 & 4) | PARTIALLY ACCURATE | Gap exists; the fact source is the run status bootstrap, not the live K8s API |

Result: 9 CONFIRMED ACCURATE, 4 PARTIALLY ACCURATE, 0 INCORRECT. All findings are substantively correct. Partial accuracy notes are nuance corrections, not fundamental errors.


Overall Assessment

The structural architecture is well-designed. The RunExecutionBackend Strategy pattern is clean, the runner/control-plane boundary is properly separated, type safety is above average (generic runtime parameter, Pick<> narrowing), and the shared pipeline engine requirement is correctly met. The spec documents are thorough and the code largely follows them.

However, the PR has significant security gaps, missing spec compliance on operational guarantees, systematic code duplication, and an unnecessary hand-rolled YAML renderer when the yaml library is already a project dependency (imported in 5 files).


Findings by Severity

CRITICAL -- Must Fix Before Merge

C1. Timing-unsafe token comparison

File: src/runtime/control-plane-store.ts:235
Source: Security Review

Token validation uses === for hash comparison instead of timingSafeEqual. The codebase already has a safeTokenCompare() function in src/dashboard/auth.ts that uses crypto.timingSafeEqual. This inconsistency is a clear oversight.

// Current (vulnerable):
return row.tokenHash === hashToken(token);

// Fix (import existing helper):
import { safeTokenCompare } from "../dashboard/auth.js";
return safeTokenCompare(row.tokenHash, hashToken(token));

Impact: Timing side-channel attack on cluster-internal low-latency network.
Fix effort: 1 line.


C2. Tokens never expire, never revoked

Files: src/runtime/control-plane-store.ts:81-101, 225-236, src/db/schema.ts:101-110
Source: Security + Data Integrity Reviews

The usedAt column is checked (if (!row || row.usedAt) return false) but never set to a non-null value anywhere in the codebase. No expiresAt column exists. A leaked RUN_TOKEN can authenticate requests forever, even after the K8s Job is cleaned up.

Fix:

  1. Set usedAt on first successful validation (as a "first-use stamp", not single-use since the runner makes multiple calls).
  2. Add an expiresAt column, set to issuedAt + 2 * waitTimeoutMs.
  3. Delete or revoke the token in the cleanup() method of KubernetesExecutionBackend.

Impact: Stolen token = unlimited replay window for that run.
Fix effort: Small.
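The three-part fix above can be sketched as a pure validation helper. The row shape and field names below are assumptions modeled on the columns the fix proposes (expiresAt, first-use usedAt stamp), not the PR's actual store code:

```typescript
// Hypothetical token-row shape mirroring the proposed schema additions.
interface RunTokenRow {
  tokenHash: string;
  issuedAt: number;      // epoch ms
  expiresAt: number;     // epoch ms, e.g. issuedAt + 2 * waitTimeoutMs
  usedAt: number | null; // first-use stamp, NOT single-use
}

function isTokenUsable(row: RunTokenRow | undefined, now: number): boolean {
  if (!row) return false;
  // Reject tokens past their expiry window regardless of usage state.
  return now < row.expiresAt;
}

// Stamp first use without invalidating the runner's subsequent calls.
function markFirstUse(row: RunTokenRow, now: number): RunTokenRow {
  return row.usedAt === null ? { ...row, usedAt: now } : row;
}
```

The actual revocation on cleanup would be a DELETE in KubernetesExecutionBackend.cleanup(), which this sketch does not cover.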


C3. Migration journal ordering collision

File: drizzle/meta/_journal.json
Source: Data Integrity Review

Two entries share the 0004 prefix. 0004_agent_profiles (already on main) has a later timestamp than 0004_sandbox_runtime_modes. On a production database that already ran 0004_agent_profiles, the Drizzle migrator will see 0004_sandbox_runtime_modes as a new migration inserted before an already-executed one. This will either error or silently skip the migration.

Additionally, 0006_runs_runtime_repair and 0004_agent_profiles share the same timestamp (1778600000000).

Fix: Renumber the new migrations to 0005, 0006, 0007, 0008 with strictly monotonically increasing timestamps after the existing 0004_agent_profiles entry.
Impact: Will break on any database that already ran migrations from main.
Fix effort: Small (renumber files + update journal).


C4. Unbounded pod log capture

Files: src/runtime/kubernetes-backend.ts:267-274, src/runtime/kubernetes/resource-client.ts:124-131
Source: Performance Review

readJobLogs calls readNamespacedPodLog which returns the entire log as a single string. A verbose pipeline run could be 10-100MB. The entire payload is held in memory, then written to disk.

Fix: Use the K8s client limitBytes parameter:

return this.podLogReader.readNamespacedPodLog({
  name: podName,
  namespace,
  limitBytes: limitBytes ?? 5 * 1024 * 1024, // 5MB cap
});

Impact: OOM risk with concurrent completions (10 runs * 100MB = 1GB heap).
Fix effort: Small.


C5. Cancellation polling has no try/catch

File: src/runner/pipeline-runner.ts:117-128
Source: Performance Review

The cancellation polling loop calls client.getCancellation() without error handling. If the call throws (network timeout, HTTP 500), the pollingPromise rejects. In the success path, the await pollingPromise at line 133 then rethrows that rejection, causing a successful pipeline execution to be reported as failed.

Fix:

const pollingPromise = (async () => {
  while (!stopPolling && !abortController.signal.aborted) {
    try {
      const cancellation = await client.getCancellation();
      if (cancellation.cancelRequested) {
        cancellationObserved = true;
        await emit("run.cancellation_observed", { runId: run.id });
        abortController.abort();
        return;
      }
    } catch {
      // Transient failure; keep polling
    }
    await sleep(cancellationPollMs);
  }
})();

Impact: A network blip during a successful run turns success into failure.
Fix effort: 3 lines (add try/catch).


HIGH -- Should Fix Before Merge

H1. No foreign keys on control plane tables + uuid/text type mismatch

Files: drizzle/0005_runner_control_plane.sql, src/db/schema.ts
Source: Data Integrity Review

runs.id is uuid type. All five control plane tables (run_payloads, run_tokens, run_completions, run_events, run_artifacts) define run_id as text type. No foreign key constraints exist. Orphaned records are possible; JOINs require implicit casting.

Fix: Change all run_id columns to uuid type and add REFERENCES runs(id) ON DELETE CASCADE.


H2. Plaintext token written to manifest file on disk

File: src/runtime/kubernetes-backend.ts:159
Source: Security Review

writeFile(manifestPath, renderManifestYaml(secret, job)) writes the raw RUN_TOKEN in stringData to {workRoot}/{runId}/kubernetes-job.yaml. This file is never cleaned up. Combined with C2 (tokens never expire), anyone with read access to the work directory can extract valid tokens.

Fix: Either redact the token from the manifest file, or delete it after successfully applying K8s resources.


H3. No NetworkPolicy -- runner pods can reach any cluster service

Files: kubernetes/local/ (no NetworkPolicy manifest present)
Source: Security Review

Runner Job pods have unrestricted network access. They can reach PostgreSQL directly at postgres.gooseherd.svc.cluster.local:5432 with the well-known credentials gooseherd:gooseherd.

Fix: Add a NetworkPolicy that denies all egress by default for runner pods, allowing only egress to the gooseherd service on port 8787.
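A sketch of such a policy. The pod labels (app: gooseherd-runner, app: gooseherd) and namespace are assumptions about how the manifests in kubernetes/local/ label their resources; the DNS egress rule is included because the runner must resolve the control-plane service name:

```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gooseherd-runner-egress
  namespace: gooseherd
spec:
  podSelector:
    matchLabels:
      app: gooseherd-runner   # assumed label on runner Job pods
  policyTypes:
    - Egress
  egress:
    # Allow DNS lookups so the runner can resolve the control-plane service.
    - ports:
        - protocol: UDP
          port: 53
    # Allow traffic only to the gooseherd control-plane service port.
    - to:
        - podSelector:
            matchLabels:
              app: gooseherd   # assumed label on the control-plane pods
      ports:
        - protocol: TCP
          port: 8787
```

With policyTypes: [Egress], all other egress (including postgres:5432) is denied by default for the selected pods.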


H4. Eager KubeConfig loading breaks non-k8s environments

Files: src/index.ts:165, src/runtime/kubernetes/resource-client.ts:68
Source: Architecture Review

KubernetesExecutionBackend is always constructed on startup, triggering KubeConfig.loadFromDefault() even when SANDBOX_RUNTIME is local or docker. On machines without a kubeconfig file (CI, fresh dev environment, Docker container), this will throw or log warnings.

Fix: Lazy-initialize the resource client on first execute() call, or only register the kubernetes backend when config.sandboxRuntime === "kubernetes".
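The lazy-initialization option can be sketched with a thin wrapper that defers client construction to first use. KubeClient and makeClient here are stand-ins, not the real @kubernetes/client-node types:

```typescript
// Illustrative lazy-initialization pattern: expensive (and possibly throwing)
// setup like KubeConfig.loadFromDefault() runs only when the backend is used.
interface KubeClient {
  apply(manifest: unknown): Promise<void>;
}

class LazyKubernetesBackend {
  private client: KubeClient | undefined;

  constructor(private readonly makeClient: () => KubeClient) {}

  private getClient(): KubeClient {
    if (!this.client) this.client = this.makeClient();
    return this.client;
  }

  async execute(manifest: unknown): Promise<void> {
    await this.getClient().apply(manifest);
  }
}
```

Constructing LazyKubernetesBackend at startup is then harmless on machines without a kubeconfig, since makeClient never runs unless execute() is called.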


H5. Linear retry backoff instead of exponential with jitter

File: src/runner/control-plane-client.ts:171
Source: Performance Review

Retry delay is BASE_RETRY_DELAY_MS * attempt (250ms, 500ms, 750ms...). Under load-induced 503s, all retrying runners converge to hitting the server at the same intervals (thundering herd).

Fix:

const baseDelay = BASE_RETRY_DELAY_MS * Math.pow(2, attempt - 1);
const jitter = Math.random() * baseDelay * 0.5;
await sleep(baseDelay + jitter);

H6. Token validation hits DB on every request with no caching

Files: src/runtime/control-plane-store.ts:225-236, src/runtime/control-plane-auth.ts
Source: Performance Review

Every control-plane request queries the database to validate the bearer token. The runner's cancellation polling alone generates 12 auth queries/minute per active run. Since run tokens are immutable after issuance, they are safe to cache.

Fix: Add in-memory LRU cache with 60s TTL for validated token hashes.
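A minimal sketch of the TTL side of that cache, with an injectable clock for testability. A production version would also bound entry count (true LRU eviction); this shows only the expiry behavior:

```typescript
// Illustrative in-memory TTL cache for validated token hashes.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: force revalidation against the DB
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Keying on the token hash (never the raw token) keeps the cache itself from becoming a secret store.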


H7. defaultSmokeJobName naming misleads operators

File: src/runtime/kubernetes/job-spec.ts:74-78
Source: Pattern Review

The word "smoke" in the resource name functions (gooseherd-smoke-{runId}) is used for ALL production runs, not just smoke tests. Cluster operators will see gooseherd-smoke-* resources and assume they are test artifacts.

Fix: Rename to defaultJobName / defaultSecretName, change prefix to gooseherd-run-.


H8. Reconciler missing 2 of 9 decision table rows

File: src/runtime/reconciler.ts:33-39
Source: Architecture + Spec Compliance Reviews

When completion says success but K8s state is Failed or missing (rows 3 & 4 of the spec's finalization table), no branch matches. The run is left in a non-terminal state. The synchronous path in translateOutcome handles this correctly, but crash recovery via the reconciler does not.

Fix:

// Add after the cancellation check, before the success check:
if (completion?.payload.status === "success" && fact !== "succeeded" && fact !== "running") {
  await this.runStore.updateRun(runId, {
    status: "failed",
    phase: "failed",
    finishedAt: new Date().toISOString(),
    error: "success completion contradicted by runtime state",
  });
  return;
}

MEDIUM -- Should Fix, Can Follow Up

M1. Hand-rolled YAML renderer (130 lines) when yaml library exists

Files: src/runtime/kubernetes-backend.ts:32-99, scripts/kubernetes/seed-smoke-run.ts:38-134
Source: Simplicity Review

The yaml package (yaml@^2.8.2) is already a project dependency, imported in 5 source files (pipeline-loader.ts, repo-config.ts, trigger-rules.ts, image-resolver.ts, scenario-loader.ts). The hand-rolled renderManifestYaml is duplicated across two files. Replace with import { stringify } from "yaml".


M2. Systematic code duplication

Source: Pattern Review

| What | Copies | Files |
| --- | --- | --- |
| isRecord() | 3 | control-plane-router.ts, pipeline-runner.ts, control-plane-client.ts |
| sleep() | 4 | shell.ts (exported), pipeline-runner.ts, control-plane-client.ts, kubernetes-backend.ts |
| TerminalFact type | 2 | kubernetes-backend.ts, reconciler.ts |
| normalizeBaseUrl() | 2 | kubernetes-backend.ts, file-artifact-store.ts |
| SmokeMetadata interface | 4 | 4 scripts in scripts/kubernetes/ |

Fix: Extract to shared utility modules (src/utils/type-guards.ts, src/utils/sleep.ts, etc.).
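A sketch of what such a shared module might contain; the module path and exact signatures are assumptions based on the duplicated helpers listed above:

```typescript
// Sketch of a shared utility module (e.g. src/utils/guards.ts) consolidating
// the duplicated helpers. In the real module these would be exported.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

Each of the four call sites would then import these instead of redeclaring them, so a future fix (e.g. making sleep abortable) lands everywhere at once.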


M3. No K8s resource limits or security context on runner containers

File: src/runtime/kubernetes/job-spec.ts:103-162
Source: Security + Performance Reviews

No securityContext (no runAsNonRoot, no readOnlyRootFilesystem, no dropped capabilities) and no resources.limits. Runner containers run as root, get BestEffort QoS, and can consume all node resources.

Fix: Add securityContext and resources to the Job spec template.


M4. Non-idempotent migrations with separate repair migrations

Files: drizzle/0004-0007
Source: Data Integrity + Simplicity Reviews

0006 is a repair of 0004 (adds IF NOT EXISTS). 0007 is a repair of 0005 (adds IF NOT EXISTS). For an unreleased branch, squash: fold IF NOT EXISTS into the originals and remove the repairs.


M5. createRunEnvelope silently overwrites payload on upsert

File: src/runtime/control-plane-store.ts:56-79
Source: Data Integrity Review

A retry or race condition could replace the original pipeline payload. Since run_payloads has no history, the original dispatched payload is lost. Consider using onConflictDoNothing and returning the existing row.


M6. issueRunToken silently replaces previous token

File: src/runtime/control-plane-store.ts:81-102
Source: Data Integrity Review

Re-issuing a token replaces the hash and resets usedAt. The runner's existing token becomes instantly invalid with no notification. Consider throwing if a token already exists, or returning the existing token.


M7. parseJsonOrEmpty<T> unsafe cast

File: src/runner/control-plane-client.ts:45-51
Source: Pattern Review

Casts {} to any type T. If a caller inspects the returned value, this causes runtime errors TypeScript can't catch. Return undefined or throw instead.
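A sketch of the undefined-returning variant; the function name is illustrative, not the PR's actual identifier:

```typescript
// Illustrative replacement for an unsafe `{} as T` cast: return undefined for
// empty bodies so callers must handle the absent case explicitly.
function parseJsonOrUndefined<T>(text: string): T | undefined {
  if (text.trim() === "") return undefined;
  return JSON.parse(text) as T;
}
```

The union return type forces every caller through a narrowing check, which is exactly the compile-time safety the `{} as T` cast defeats.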


M8. Cleanup sequencing violates spec

File: src/runtime/kubernetes-backend.ts:161-171
Source: Spec Compliance Review

Cleanup is immediate and unconditional in a finally block. The spec requires a 6-step sequence: cancellation request, grace period for artifact flush, force-delete after grace period, reconcile business status, clean K8s resources, retain artifacts per retention policy.


M9. No runner protocol version validation

Source: Spec Compliance Review

The spec requires runnerProtocolVersion compatibility check before Job creation. Not implemented anywhere. Zero search results for protocolVersion in the codebase.


M10. RBAC grants unrestricted secrets access within namespace

File: kubernetes/local/gooseherd-rbac.yaml
Source: Security Review

The Role grants create, get, delete on ALL secrets in the namespace, not just run-token secrets. The gooseherd service account can read gooseherd-env containing DATABASE_URL, ENCRYPTION_KEY, GITHUB_TOKEN, and OPENROUTER_API_KEY.


LOW -- Nice to Have

| # | Finding | Source |
| --- | --- | --- |
| L1 | Error responses leak runId back to caller (enables run-ID enumeration) | Security |
| L2 | Advisory lock uses hashtext() (32-bit) -- hash collisions serialize unrelated runs | Performance |
| L3 | run_events.sequence index is non-unique -- buggy runner can create ambiguous ordering | Data Integrity |
| L4 | backend.typecheck.ts is dead weight -- TypeScript already enforces the constraint | Simplicity |
| L5 | Local/Docker backends are byte-identical -- merge into InProcessExecutionBackend<R> | Simplicity |
| L6 | assertImplementedSandboxRuntime() is a no-op (void runtime) -- remove | Simplicity |
| L7 | RunnerArtifactStore type alias adds no value over ArtifactStore | Patterns |
| L8 | makeRun() test factory duplicated in 8 test files | Patterns |

Spec Compliance Summary

| Requirement | Status |
| --- | --- |
| 5-endpoint API (payload, artifacts, events, cancellation, complete) | IMPLEMENTED |
| Error contract (2xx-5xx handling) | IMPLEMENTED |
| Event schema (eventId, eventType, timestamp, sequence, payload) | IMPLEMENTED |
| Completion idempotency (advisory lock + idempotency key) | IMPLEMENTED |
| Authentication (per-run bearer token, SHA-256 hashed) | IMPLEMENTED |
| Finalization decision table (9-row matrix) | PARTIAL (7/9 rows) |
| Required event types (7 types) | PARTIAL (run.artifact_status never emitted) |
| Cancellation polling | PARTIAL (works, but no grace period or force-delete) |
| Cleanup sequencing (6-step) | MISSING |
| Protocol version validation | MISSING |
| Artifact upload protocol | MISSING |
| Periodic reconciliation loop | MISSING (class exists, no scheduling) |
| Crash recovery / reconciliation window | MISSING |
| Control-plane reachability pre-flight | MISSING |
| Git/GitHub credential injection | MISSING |

What's Done Well

  1. Clean Strategy pattern -- RunExecutionBackend<R> + RuntimeRegistry mapped type is textbook Strategy with compile-time type safety
  2. Shared pipeline engine -- Runner imports and calls the same PipelineEngine class, avoiding feature parity drift
  3. Proper boundary separation -- Runner knows nothing about K8s, control plane knows nothing about pipeline execution
  4. Excellent integration tests -- runner-index.test.ts spawns actual child processes against mock HTTP servers
  5. Defensive input validation -- All POST endpoints validated with type guards before reaching the store
  6. Idempotent completion recording -- pg_advisory_xact_lock + idempotency key dedup is correct

Recommended Action Plan

Before merge (blockers):

  1. Fix timing-safe token comparison (C1) -- 1 line
  2. Add token revocation in cleanup + expiry column (C2)
  3. Fix migration journal ordering (C3) -- renumber migrations
  4. Cap log retrieval with limitBytes (C4)
  5. Add try/catch to cancellation polling loop (C5) -- 3 lines
  6. Lazy-initialize KubeConfig or conditionally register k8s backend (H4)
  7. Fix reconciler rows 3 & 4 (H8) -- 5 lines
  8. Rename defaultSmokeJobName to defaultJobName (H7) -- cosmetic but important

Before production:

  1. Add NetworkPolicy for runner pods (H3)
  2. Add pod securityContext + resource limits (M3)
  3. Replace hand-rolled YAML with stringify() from existing dependency (M1)
  4. Squash repair migrations (M4)
  5. Switch to exponential backoff with jitter (H5)
  6. Add token validation caching (H6)
  7. Fix FK constraints + uuid/text type alignment (H1)

Housekeeping:

  1. Deduplicate isRecord, sleep, TerminalFact, normalizeBaseUrl (M2)
  2. Delete backend.typecheck.ts, merge local/docker backends (L4, L5)
  3. Remove assertImplementedSandboxRuntime no-op (L6)

@Chocksy (Owner) left a comment

Test Suite Results (addendum to code review)

Ran all 74 test files on the feat/kubernetes-runtime branch.

70/74 pass, 3 failures, 1 hang, 0 regressions in existing tests.

All 13 new K8s-specific test files pass (kubernetes-backend, job-spec, local-manifests, local-scripts, resource-client, control-plane-store, control-plane-router, runtime-mode, run-execution-backends, run-manager-runtime, override-*).

Failures to fix:

  1. runner-client.test.ts — timing issue: AbortSignal.timeout() consumes all retries before mock server responds
  2. runner-index.test.ts (2 subtests) — run.log file not produced by runner; git clone attempted against fake URL instead of being mocked
  3. run-manager-sandbox-runtime.test.ts — async leak: mock store missing getRun method called after test ends
  4. dashboard-pipelines.test.ts — hangs indefinitely: afterEach hook doesn't close HTTP servers (open handles)

@vsevolod requested a review from Chocksy on April 10, 2026 at 19:26
@vsevolod mentioned this pull request on Apr 12, 2026
@Chocksy Chocksy merged commit a3010ff into Chocksy:main Apr 13, 2026
1 check passed