Chocksy left a comment
Code Review: PR #10 "Feat/kubernetes runtime"
Author: @vsevolod | Branch: feat/kubernetes-runtime | Size: 103 files, +9,809 / -151 lines
Spec Compliance Score: ~40-45% (core API solid, operational guarantees largely absent)
Review Date: 2026-04-10
Reviewers: 8 specialized AI review agents (security, architecture, performance, data integrity, patterns, simplicity, spec compliance, test runner)
Independent Verification: Codex (GPT) cross-checked all CRITICAL and HIGH findings against source code
Verification Summary
All 13 CRITICAL and HIGH findings were independently verified by Codex against the actual source code:
| Finding | Verdict | Notes |
|---|---|---|
| C1 (timing-unsafe token comparison) | CONFIRMED ACCURATE | === used; safeTokenCompare exists in dashboard/auth.ts but not imported |
| C2 (tokens never expire/revoked) | CONFIRMED ACCURATE | usedAt only ever written as null |
| C3 (migration journal collision) | PARTIALLY ACCURATE | Tag-prefix collision exists (0004_ x2), but idx values are distinct. Risk depends on Drizzle version behavior |
| C4 (unbounded log capture) | CONFIRMED ACCURATE | No limitBytes or tailLines constraint |
| C5 (cancellation polling no try/catch) | CONFIRMED ACCURATE | Network error rejects promise, crashes runner |
| H1 (uuid/text mismatch, no FKs) | CONFIRMED ACCURATE | runs.id is uuid, control plane tables use text, no REFERENCES |
| H2 (plaintext token in manifest) | PARTIALLY ACCURATE | Token IS written to file; claim of "never cleaned up" may be overstated (workspace-cleaner exists) |
| H3 (no NetworkPolicy) | CONFIRMED ACCURATE | No NetworkPolicy manifest in kubernetes/local/ |
| H4 (eager KubeConfig loading) | PARTIALLY ACCURATE | Backend IS always constructed; whether it actually throws depends on environment |
| H5 (linear retry backoff) | CONFIRMED ACCURATE | BASE_RETRY_DELAY_MS * attempt is linear |
| H6 (no token validation caching) | CONFIRMED ACCURATE | DB hit on every auth check, no cache |
| H7 (smoke prefix for all runs) | CONFIRMED ACCURATE | gooseherd-smoke- prefix unconditionally |
| H8 (reconciler missing rows 3 & 4) | PARTIALLY ACCURATE | Gap exists; fact source is from run status bootstrap, not live K8s API |
Result: 9 CONFIRMED ACCURATE, 4 PARTIALLY ACCURATE, 0 INCORRECT. All findings are substantively correct. Partial accuracy notes are nuance corrections, not fundamental errors.
Overall Assessment
The structural architecture is well-designed. The RunExecutionBackend Strategy pattern is clean, the runner/control-plane boundary is properly separated, type safety is above average (generic runtime parameter, Pick<> narrowing), and the shared pipeline engine requirement is correctly met. The spec documents are thorough and the code largely follows them.
However, the PR has significant security gaps, missing spec compliance on operational guarantees, systematic code duplication, and an unnecessary hand-rolled YAML renderer when the yaml library is already a project dependency (imported in 5 files).
Findings by Severity
CRITICAL -- Must Fix Before Merge
C1. Timing-unsafe token comparison
File: src/runtime/control-plane-store.ts:235
Source: Security Review
Token validation uses === for hash comparison instead of timingSafeEqual. The codebase already has a safeTokenCompare() function in src/dashboard/auth.ts that uses crypto.timingSafeEqual. This inconsistency is a clear oversight.
```typescript
// Current (vulnerable):
return row.tokenHash === hashToken(token);

// Fix (import existing helper):
import { safeTokenCompare } from "../dashboard/auth.js";
return safeTokenCompare(row.tokenHash, hashToken(token));
```

Impact: Timing side-channel attack on cluster-internal low-latency network.
Fix effort: 1 line.
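The body of `safeTokenCompare` is not shown in this review; a minimal sketch of an equivalent constant-time helper (the name `constantTimeCompare` is illustrative, not from the PR), assuming both inputs are fixed-length hex digests:

```typescript
import { timingSafeEqual } from "node:crypto";

// Constant-time comparison of two hash strings. The length check happens
// first because timingSafeEqual throws on unequal buffer lengths; for a
// fixed-size digest the length itself is public, so this leaks nothing.
function constantTimeCompare(a: string, b: string): boolean {
  const bufA = Buffer.from(a);
  const bufB = Buffer.from(b);
  if (bufA.length !== bufB.length) return false;
  return timingSafeEqual(bufA, bufB);
}
```

The key property is that comparison time does not depend on where the first differing byte occurs, unlike `===` on strings.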
C2. Tokens never expire, never revoked
Files: src/runtime/control-plane-store.ts:81-101, 225-236, src/db/schema.ts:101-110
Source: Security + Data Integrity Reviews
The usedAt column is checked (if (!row || row.usedAt) return false) but never set to a non-null value anywhere in the codebase. No expiresAt column exists. A leaked RUN_TOKEN can authenticate requests forever, even after the K8s Job is cleaned up.
Fix:
- Set `usedAt` on first successful validation (as a "first-use stamp", not single-use, since the runner makes multiple calls).
- Add an `expiresAt` column, set to `issuedAt + 2 * waitTimeoutMs`.
- Delete or revoke the token in the `cleanup()` method of `KubernetesExecutionBackend`.
Impact: Stolen token = unlimited replay window for that run.
Fix effort: Small.
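A sketch of the validation predicate once an `expiresAt` column exists (the `TokenRow` shape and `isTokenUsable` name are illustrative, not taken from the PR):

```typescript
// Hypothetical shape of a token row after the fix.
interface TokenRow {
  tokenHash: string;
  usedAt: Date | null; // first-use stamp, not single-use
  expiresAt: Date;     // e.g. issuedAt + 2 * waitTimeoutMs
}

// Reject unknown or expired tokens, but allow repeated use within the
// window, since the runner makes multiple calls per run.
function isTokenUsable(row: TokenRow | undefined, now: Date): boolean {
  if (!row) return false;
  return now.getTime() < row.expiresAt.getTime();
}
```

Revoking in `cleanup()` then bounds the replay window to the lifetime of the Job rather than forever.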
C3. Migration journal ordering collision
File: drizzle/meta/_journal.json
Source: Data Integrity Review
Two entries share the 0004 prefix. 0004_agent_profiles (already on main) has a later timestamp than 0004_sandbox_runtime_modes. On a production database that already ran 0004_agent_profiles, the Drizzle migrator will see 0004_sandbox_runtime_modes as a new migration inserted before an already-executed one. This will either error or silently skip the migration.
Additionally, 0006_runs_runtime_repair and 0004_agent_profiles share the same timestamp (1778600000000).
Fix: Renumber the new migrations to 0005, 0006, 0007, 0008 with strictly monotonically increasing timestamps after the existing 0004_agent_profiles entry.
Impact: Will break on any database that already ran migrations from main.
Fix effort: Small (renumber files + update journal).
C4. Unbounded pod log capture
Files: src/runtime/kubernetes-backend.ts:267-274, src/runtime/kubernetes/resource-client.ts:124-131
Source: Performance Review
readJobLogs calls readNamespacedPodLog which returns the entire log as a single string. A verbose pipeline run could be 10-100MB. The entire payload is held in memory, then written to disk.
Fix: Use the K8s client limitBytes parameter:
```typescript
return this.podLogReader.readNamespacedPodLog({
  name: podName,
  namespace,
  limitBytes: limitBytes ?? 5 * 1024 * 1024, // 5MB cap
});
```

Impact: OOM risk with concurrent completions (10 runs * 100MB = 1GB heap).
Fix effort: Small.
C5. Cancellation polling has no try/catch
File: src/runner/pipeline-runner.ts:117-128
Source: Performance Review
The cancellation polling loop calls client.getCancellation() without error handling. If this throws (network timeout, 500), the pollingPromise rejects. In the success path at line 133, await pollingPromise then throws, causing a successful pipeline execution to be reported as failed.
Fix:
```typescript
const pollingPromise = (async () => {
  while (!stopPolling && !abortController.signal.aborted) {
    try {
      const cancellation = await client.getCancellation();
      if (cancellation.cancelRequested) {
        cancellationObserved = true;
        await emit("run.cancellation_observed", { runId: run.id });
        abortController.abort();
        return;
      }
    } catch {
      // Transient failure; keep polling
    }
    await sleep(cancellationPollMs);
  }
})();
```

Impact: A network blip during a successful run turns success into failure.
Fix effort: 3 lines (add try/catch).
HIGH -- Should Fix Before Merge
H1. No foreign keys on control plane tables + uuid/text type mismatch
Files: drizzle/0005_runner_control_plane.sql, src/db/schema.ts
Source: Data Integrity Review
runs.id is uuid type. All five control plane tables (run_payloads, run_tokens, run_completions, run_events, run_artifacts) define run_id as text type. No foreign key constraints exist. Orphaned records are possible; JOINs require implicit casting.
Fix: Change all run_id columns to uuid type and add REFERENCES runs(id) ON DELETE CASCADE.
H2. Plaintext token written to manifest file on disk
File: src/runtime/kubernetes-backend.ts:159
Source: Security Review
writeFile(manifestPath, renderManifestYaml(secret, job)) writes the raw RUN_TOKEN in stringData to {workRoot}/{runId}/kubernetes-job.yaml. This file is never cleaned up. Combined with C2 (tokens never expire), anyone with read access to the work directory can extract valid tokens.
Fix: Either redact the token from the manifest file, or delete it after successfully applying K8s resources.
H3. No NetworkPolicy -- runner pods can reach any cluster service
Files: kubernetes/local/ (no NetworkPolicy manifest present)
Source: Security Review
Runner Job pods have unrestricted network access. They can reach PostgreSQL directly at postgres.gooseherd.svc.cluster.local:5432 with the well-known credentials gooseherd:gooseherd.
Fix: Add a NetworkPolicy that denies all egress by default for runner pods, allowing only egress to the gooseherd service on port 8787.
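A sketch of such a policy. The label selectors (`app: gooseherd-runner`, `app: gooseherd`) are assumptions; they must match whatever labels the Job template and control-plane Deployment actually set:

```yaml
# Hypothetical default-deny egress policy for runner pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gooseherd-runner-egress
  namespace: gooseherd
spec:
  podSelector:
    matchLabels:
      app: gooseherd-runner   # assumed label on runner Job pods
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: gooseherd   # assumed label on control-plane pods
      ports:
        - protocol: TCP
          port: 8787
```

In practice DNS egress (port 53 to kube-dns) usually has to be allowed as well, or the runner cannot resolve the control-plane service name.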
H4. Eager KubeConfig loading breaks non-k8s environments
Files: src/index.ts:165, src/runtime/kubernetes/resource-client.ts:68
Source: Architecture Review
KubernetesExecutionBackend is always constructed on startup, triggering KubeConfig.loadFromDefault() even when SANDBOX_RUNTIME is local or docker. On machines without a kubeconfig file (CI, fresh dev environment, Docker container), this will throw or log warnings.
Fix: Lazy-initialize the resource client on first execute() call, or only register the kubernetes backend when config.sandboxRuntime === "kubernetes".
H5. Linear retry backoff instead of exponential with jitter
File: src/runner/control-plane-client.ts:171
Source: Performance Review
Retry delay is BASE_RETRY_DELAY_MS * attempt (250ms, 500ms, 750ms...). Under load-induced 503s, all retrying runners converge to hitting the server at the same intervals (thundering herd).
Fix:

```typescript
const baseDelay = BASE_RETRY_DELAY_MS * Math.pow(2, attempt - 1);
const jitter = Math.random() * baseDelay * 0.5;
await sleep(baseDelay + jitter);
```

H6. Token validation hits DB on every request with no caching
Files: src/runtime/control-plane-store.ts:225-236, src/runtime/control-plane-auth.ts
Source: Performance Review
Every control-plane request queries the database to validate the bearer token. The runner's cancellation polling alone generates 12 auth queries/minute per active run. Since run tokens are immutable after issuance, they are safe to cache.
Fix: Add in-memory LRU cache with 60s TTL for validated token hashes.
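A sketch of the cache shape (a plain TTL map; a real LRU would also cap entry count). The `TtlCache` name and API are illustrative:

```typescript
// Time-bounded cache for validated token hashes. Safe because run tokens
// are immutable after issuance; a 60s TTL bounds staleness if revocation
// is added later.
class TtlCache<K, V> {
  private readonly entries = new Map<K, { value: V; expiresAt: number }>();

  constructor(private readonly ttlMs: number) {}

  get(key: K, now = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // lazy eviction of expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: K, value: V, now = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

The auth middleware would consult the cache keyed by token hash before falling through to the DB query, cutting the 12 queries/minute/run to roughly one per minute.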
H7. defaultSmokeJobName naming misleads operators
File: src/runtime/kubernetes/job-spec.ts:74-78
Source: Pattern Review
The word "smoke" in the resource name functions (gooseherd-smoke-{runId}) is used for ALL production runs, not just smoke tests. Cluster operators will see gooseherd-smoke-* resources and assume they are test artifacts.
Fix: Rename to defaultJobName / defaultSecretName, change prefix to gooseherd-run-.
H8. Reconciler missing 2 of 9 decision table rows
File: src/runtime/reconciler.ts:33-39
Source: Architecture + Spec Compliance Reviews
When completion says success but K8s state is Failed or missing (rows 3 & 4 of the spec's finalization table), no branch matches. The run is left in a non-terminal state. The synchronous path in translateOutcome handles this correctly, but crash recovery via the reconciler does not.
Fix:

```typescript
// Add after the cancellation check, before the success check:
if (completion?.payload.status === "success" && fact !== "succeeded" && fact !== "running") {
  await this.runStore.updateRun(runId, {
    status: "failed",
    phase: "failed",
    finishedAt: new Date().toISOString(),
    error: "success completion contradicted by runtime state",
  });
  return;
}
```

MEDIUM -- Should Fix, Can Follow Up
M1. Hand-rolled YAML renderer (130 lines) when yaml library exists
Files: src/runtime/kubernetes-backend.ts:32-99, scripts/kubernetes/seed-smoke-run.ts:38-134
Source: Simplicity Review
The yaml package (yaml@^2.8.2) is already a project dependency, imported in 5 source files (pipeline-loader.ts, repo-config.ts, trigger-rules.ts, image-resolver.ts, scenario-loader.ts). The hand-rolled renderManifestYaml is duplicated across two files. Replace with import { stringify } from "yaml".
M2. Systematic code duplication
Source: Pattern Review
| What | Copies | Files |
|---|---|---|
| `isRecord()` | 3 | control-plane-router.ts, pipeline-runner.ts, control-plane-client.ts |
| `sleep()` | 4 | shell.ts (exported), pipeline-runner.ts, control-plane-client.ts, kubernetes-backend.ts |
| `TerminalFact` type | 2 | kubernetes-backend.ts, reconciler.ts |
| `normalizeBaseUrl()` | 2 | kubernetes-backend.ts, file-artifact-store.ts |
| `SmokeMetadata` interface | 4 | 4 scripts in scripts/kubernetes/ |
Fix: Extract to shared utility modules (src/utils/type-guards.ts, src/utils/sleep.ts, etc.).
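The extracted helpers are small; typical implementations (the `src/utils/` paths are the review's suggestion, and these bodies are conventional rather than copied from the PR) would look like:

```typescript
// src/utils/type-guards.ts (suggested location)
/** Narrow an unknown value to a plain object record. */
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// src/utils/sleep.ts (suggested location)
/** Promise-based delay. */
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

Since `shell.ts` already exports a `sleep`, the simplest path may be re-exporting it rather than adding a new module.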
M3. No K8s resource limits or security context on runner containers
File: src/runtime/kubernetes/job-spec.ts:103-162
Source: Security + Performance Reviews
No securityContext (no runAsNonRoot, no readOnlyRootFilesystem, no dropped capabilities) and no resources.limits. Runner containers run as root, get BestEffort QoS, and can consume all node resources.
Fix: Add securityContext and resources to the Job spec template.
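A sketch of the hardened container fragment. Field names follow the Kubernetes Container API; the concrete values (UID, CPU/memory sizes) are suggestions, not from the PR:

```typescript
// Hypothetical hardening block to merge into the Job's container template.
const containerHardening = {
  securityContext: {
    runAsNonRoot: true,
    runAsUser: 1000,               // assumed non-root UID in the runner image
    allowPrivilegeEscalation: false,
    readOnlyRootFilesystem: true,  // runner writes only to mounted workdir
    capabilities: { drop: ["ALL"] },
  },
  resources: {
    requests: { cpu: "250m", memory: "512Mi" },
    limits: { cpu: "1", memory: "2Gi" }, // moves pods out of BestEffort QoS
  },
};
```

Setting both requests and limits gives the pods Burstable (or Guaranteed, if equal) QoS, so the scheduler and OOM killer treat them sensibly under node pressure.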
M4. Non-idempotent migrations with separate repair migrations
Files: drizzle/0004-0007
Source: Data Integrity + Simplicity Reviews
0006 is a repair of 0004 (adds IF NOT EXISTS). 0007 is a repair of 0005 (adds IF NOT EXISTS). For an unreleased branch, squash: fold IF NOT EXISTS into the originals and remove the repairs.
M5. createRunEnvelope silently overwrites payload on upsert
File: src/runtime/control-plane-store.ts:56-79
Source: Data Integrity Review
A retry or race condition could replace the original pipeline payload. Since run_payloads has no history, the original dispatched payload is lost. Consider using onConflictDoNothing and returning the existing row.
M6. issueRunToken silently replaces previous token
File: src/runtime/control-plane-store.ts:81-102
Source: Data Integrity Review
Re-issuing a token replaces the hash and resets usedAt. The runner's existing token becomes instantly invalid with no notification. Consider throwing if a token already exists, or returning the existing token.
M7. parseJsonOrEmpty<T> unsafe cast
File: src/runner/control-plane-client.ts:45-51
Source: Pattern Review
Casts {} to any type T. If a caller inspects the returned value, this causes runtime errors TypeScript can't catch. Return undefined or throw instead.
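A sketch of the safer variant (the `parseJsonOr` name is illustrative):

```typescript
// Return undefined instead of casting {} to T, forcing callers to
// handle the empty-body case explicitly.
function parseJsonOr<T>(raw: string): T | undefined {
  const trimmed = raw.trim();
  if (trimmed.length === 0) return undefined;
  try {
    return JSON.parse(trimmed) as T;
  } catch {
    return undefined; // malformed body treated the same as empty
  }
}
```

Callers that genuinely expect a body can then throw on `undefined` with a useful error, rather than failing later on a missing property.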
M8. Cleanup sequencing violates spec
File: src/runtime/kubernetes-backend.ts:161-171
Source: Spec Compliance Review
Cleanup is immediate and unconditional in a finally block. The spec requires a 6-step sequence: cancellation request, grace period for artifact flush, force-delete after grace period, reconcile business status, clean K8s resources, retain artifacts per retention policy.
M9. No runner protocol version validation
Source: Spec Compliance Review
The spec requires runnerProtocolVersion compatibility check before Job creation. Not implemented anywhere. Zero search results for protocolVersion in the codebase.
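A minimal sketch of a pre-dispatch gate. The field name `runnerProtocolVersion` comes from the spec; the major-version-equality rule and constant below are assumptions about how compatibility would be defined:

```typescript
const SUPPORTED_PROTOCOL_MAJOR = 1; // assumed current major version

// Reject runners whose protocol major version differs from the control
// plane's, before any Job is created.
function isProtocolCompatible(runnerProtocolVersion: string): boolean {
  const major = Number.parseInt(runnerProtocolVersion.split(".")[0] ?? "", 10);
  return Number.isInteger(major) && major === SUPPORTED_PROTOCOL_MAJOR;
}
```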
M10. RBAC grants unrestricted secrets access within namespace
File: kubernetes/local/gooseherd-rbac.yaml
Source: Security Review
The Role grants create, get, delete on ALL secrets in the namespace, not just run-token secrets. The gooseherd service account can read gooseherd-env containing DATABASE_URL, ENCRYPTION_KEY, GITHUB_TOKEN, and OPENROUTER_API_KEY.
LOW -- Nice to Have
| # | Finding | Source |
|---|---|---|
| L1 | Error responses leak runId back to caller (enables run-ID enumeration) | Security |
| L2 | Advisory lock uses hashtext() (32-bit) -- hash collisions serialize unrelated runs | Performance |
| L3 | run_events.sequence index is non-unique -- buggy runner can create ambiguous ordering | Data Integrity |
| L4 | backend.typecheck.ts is dead weight -- TypeScript already enforces the constraint | Simplicity |
| L5 | Local/Docker backends are byte-identical -- merge into InProcessExecutionBackend<R> | Simplicity |
| L6 | assertImplementedSandboxRuntime() is a no-op (void runtime) -- remove | Simplicity |
| L7 | RunnerArtifactStore type alias adds no value over ArtifactStore | Patterns |
| L8 | makeRun() test factory duplicated in 8 test files | Patterns |
Spec Compliance Summary
| Requirement | Status |
|---|---|
| 5-endpoint API (payload, artifacts, events, cancellation, complete) | IMPLEMENTED |
| Error contract (2xx-5xx handling) | IMPLEMENTED |
| Event schema (eventId, eventType, timestamp, sequence, payload) | IMPLEMENTED |
| Completion idempotency (advisory lock + idempotency key) | IMPLEMENTED |
| Authentication (per-run bearer token, SHA-256 hashed) | IMPLEMENTED |
| Finalization decision table (9-row matrix) | PARTIAL (7/9 rows) |
| Required event types (7 types) | PARTIAL (run.artifact_status never emitted) |
| Cancellation polling | PARTIAL (works, but no grace period or force-delete) |
| Cleanup sequencing (6-step) | MISSING |
| Protocol version validation | MISSING |
| Artifact upload protocol | MISSING |
| Periodic reconciliation loop | MISSING (class exists, no scheduling) |
| Crash recovery / reconciliation window | MISSING |
| Control-plane reachability pre-flight | MISSING |
| Git/GitHub credential injection | MISSING |
What's Done Well
- Clean Strategy pattern -- `RunExecutionBackend<R>` + `RuntimeRegistry` mapped type is textbook Strategy with compile-time type safety
- Shared pipeline engine -- Runner imports and calls the same `PipelineEngine` class, avoiding feature parity drift
- Proper boundary separation -- Runner knows nothing about K8s, control plane knows nothing about pipeline execution
- Excellent integration tests -- `runner-index.test.ts` spawns actual child processes against mock HTTP servers
- Defensive input validation -- All POST endpoints validated with type guards before reaching the store
- Idempotent completion recording -- `pg_advisory_xact_lock` + idempotency key dedup is correct
Recommended Action Plan
Before merge (blockers):
- Fix timing-safe token comparison (C1) -- 1 line
- Add token revocation in cleanup + expiry column (C2)
- Fix migration journal ordering (C3) -- renumber migrations
- Cap log retrieval with `limitBytes` (C4)
- Add try/catch to cancellation polling loop (C5) -- 3 lines
- Lazy-initialize KubeConfig or conditionally register k8s backend (H4)
- Fix reconciler rows 3 & 4 (H8) -- 5 lines
- Rename `defaultSmokeJobName` to `defaultJobName` (H7) -- cosmetic but important
Before production:
- Add NetworkPolicy for runner pods (H3)
- Add pod securityContext + resource limits (M3)
- Replace hand-rolled YAML with `stringify()` from existing dependency (M1)
- Squash repair migrations (M4)
- Switch to exponential backoff with jitter (H5)
- Add token validation caching (H6)
- Fix FK constraints + uuid/text type alignment (H1)
Housekeeping:
- Deduplicate `isRecord`, `sleep`, `TerminalFact`, `normalizeBaseUrl` (M2)
- Delete `backend.typecheck.ts`, merge local/docker backends (L4, L5)
- Remove `assertImplementedSandboxRuntime` no-op (L6)
Chocksy left a comment
Test Suite Results (addendum to code review)
Ran all 74 test files on the feat/kubernetes-runtime branch.
70/74 pass, 3 failures, 1 hang, 0 regressions in existing tests.
All 13 new K8s-specific test files pass (kubernetes-backend, job-spec, local-manifests, local-scripts, resource-client, control-plane-store, control-plane-router, runtime-mode, run-execution-backends, run-manager-runtime, override-*).
Failures to fix:
- `runner-client.test.ts` — timing issue: `AbortSignal.timeout()` consumes all retries before mock server responds
- `runner-index.test.ts` (2 subtests) — `run.log` file not produced by runner; git clone attempted against fake URL instead of being mocked
- `run-manager-sandbox-runtime.test.ts` — async leak: mock store missing `getRun` method called after test ends
- `dashboard-pipelines.test.ts` — hangs indefinitely: `afterEach` hook doesn't close HTTP servers (open handles)
Summary
This PR adds Kubernetes as a first-class sandbox runtime for Gooseherd and wires the existing run lifecycle through a shared execution backend model.
It introduces:
- `SANDBOX_RUNTIME=local|docker|kubernetes`
- a `minikube` workflow for running Gooseherd end-to-end inside Kubernetes

What Changed
- `RunManager` wired through `local`, `docker`, and `kubernetes` backends.
- `kubernetes/local/*`
- `kubernetes/app.Dockerfile`
- `kubernetes/runner.Dockerfile`
- `npm run k8s:local-up`, `npm run k8s:local-down`, `npm run k8s:local-status`, `npm run k8s:smoke`, `npm run k8s:harness`
- main: control-plane router tests now pass the new `startDashboardServer` argument layout correctly.

Verification
- `docker compose run --rm -e DATABASE_URL_TEST=postgres://gooseherd:gooseherd@postgres:5432/gooseherd_test -v "$PWD:/app" --entrypoint sh gooseherd -lc 'node --test --test-reporter=tap --import tsx tests/control-plane-router.test.ts'`
- `build-and-test` is the primary signal for the full PR because the full local suite includes `tests/e2e-pipeline.test.ts`, which depends on cloning `epiccoders/pxls`.

Notes
- The local Kubernetes setup is `minikube`-oriented and is meant to mirror the in-cluster production shape more closely than `docker compose`.