feat: add server readiness gate to shell shim#213
Merged
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a TCP dial check before the shim attempts enforcement. If the agentsh server isn't reachable, the shim falls through to bash.real instead of failing. This prevents boot loops when force=true is baked into server images and root's login shell is the shim — the server may not be ready during early boot. Once the server is up, all sessions (interactive and non-interactive with force=true) get enforced as before. The readiness probe uses a 200ms timeout TCP dial to AGENTSH_SERVER (default 127.0.0.1:18080), adding <1ms latency on loopback when the server is running. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ket support
Address roborev findings:
1. Unix socket support: serverAddrFromEnv now parses unix:// URLs and
returns ("unix", "/path") so the readiness probe dials the correct
socket type instead of silently falling back to TCP loopback.
2. Fail-open scope: only local servers (loopback TCP, unix sockets)
get fail-open behavior on probe failure. Remote servers fail-closed
with an error message, preventing a network issue from silently
disabling enforcement.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make serverAddrFromEnv scheme-aware: https:// without explicit port probes :443, http:// without port probes :80. The 18080 fallback only applies when AGENTSH_SERVER is unset (the agentsh default). Fixes: remote HTTPS config like AGENTSH_SERVER=https://agent.example.com previously probed :18080 and triggered fail-closed even though the server was healthy on :443. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address roborev findings: 1. Fail-open scope (HIGH): The readiness gate is now opt-in via ready_gate=true in shim.conf. Default behavior is fail-closed, preserving the security posture. Operators who need boot safety explicitly opt in, accepting the tradeoff that a crashed local server temporarily disables enforcement. 2. Unix socket parsing (MEDIUM): Align serverAddrFromEnv with the client transport logic (internal/client/client.go:55-61) — when both u.Host and u.Path are present in a unix:// URL, join them as u.Host+u.Path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address roborev findings: 1. Invalid AGENTSH_SERVER (HIGH): serverAddrFromEnv now returns an error for non-empty but unparseable values instead of falling back to the local default. The readiness gate treats parse errors as fail-closed, preventing a typo from silently enabling fail-open. 2. Test coverage (LOW): TestShimReadinessGate_ServerReachable_ForceEnforces now sets ready_gate=true so it actually exercises the gate path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
url.Parse accepts schemeless values like "localhost" or "/tmp/agentsh.sock" as path-only URLs, which bypassed scheme validation. Switch to strict scheme-based dispatch (unix, http, https only) and add test coverage for schemeless and empty-host inputs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes from review: 1. Use 2s dial timeout for remote servers (was 200ms for all endpoints), preventing false negatives on VPN/high-latency links. 2. Distinguish config validation errors (force=tru, ready_gate=tru) from I/O errors — validation errors now fail with a clear message instead of being silently swallowed into the fail-closed force=true path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… locality check 1. AGENTSH_SHIM_FORCE=1 (env) now skips the readiness gate entirely. The env var signals explicit operator intent to enforce — the gate's fail-open behavior would contradict that. Config-driven force=true still gets boot-safety via ready_gate. 2. serverIsLocal now resolves hostnames via DNS (200ms timeout) to handle /etc/hosts aliases that point to loopback. Resolution failure defaults to remote (fail-closed). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mixed-resolution hostnames (loopback + remote) were incorrectly treated as local, breaking the fail-closed guarantee for remote servers. Now all resolved addresses must be loopback for a hostname to be classified as local. Also makes the resolver injectable for testing and adds table tests for loopback alias, mixed resolution, and lookup failure cases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. serverAddrFromEnv now checks AGENTSH_TRANSPORT=grpc and probes AGENTSH_GRPC_ADDR instead of AGENTSH_SERVER, matching the client's transport selection. Prevents probing the wrong endpoint. 2. serverReachable now returns the dial error so the caller can distinguish transient errors (ECONNREFUSED, ENOENT — server not started) from hard errors (EACCES — bad socket permissions). Only transient errors trigger fail-open; hard errors fail-closed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. readiness gate error hint now mentions AGENTSH_GRPC_ADDR when
transport is grpc, instead of always saying AGENTSH_SERVER.
2. Added end-to-end shim tests:
- TestShimReadinessGate_GRPCTransport_ProbesGRPCAddr: verifies gate
probes gRPC endpoint even when AGENTSH_SERVER is unreachable.
- TestShimReadinessGate_UnixSocketEACCES_FailsClosed: verifies EACCES
on a unix socket triggers fail-closed, not fail-open fallthrough.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. TestShimReadinessGate_GRPCTransport_ProbesGRPCAddr now tracks accepts on the fake listener and asserts the probe hit it. Uses AGENTSH_BIN=/nonexistent for deterministic enforcement failure. 2. TestShimReadinessGate_UnixSocketEACCES_FailsClosed restricted to Linux only (chmod on unix sockets is OS-dependent on BSD/macOS). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace atomic counter with channel-based synchronization for the accept tracking in TestShimReadinessGate_GRPCTransport_ProbesGRPCAddr. The channel+timeout avoids the race between the accept goroutine updating the counter and the test reading it after cmd.Run() returns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Lowercase AGENTSH_TRANSPORT before comparison (CLI does the same in client.go and exec_pty.go). Strip scheme prefix from AGENTSH_GRPC_ADDR to match client.NewGRPC normalization (accepts "grpc://host:port"). Added tests for uppercase transport and scheme-bearing gRPC address. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. serverAddrFromEnv now rejects unknown AGENTSH_TRANSPORT values (e.g. "grcp" typo) with a clear error instead of silently falling through to the HTTP path, which could disable enforcement via ready_gate fail-open. 2. Added design comment to serverReachable explaining why a raw TCP/unix dial is used instead of an HTTP health check: the shim is a boot-time login shell that must be minimal, fast, and free of HTTP dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ready_gate=truein/etc/agentsh/shim.conf) that probes the agentsh server before enforcing policyforce=trueis baked into container images and the shim is root's login shell —agentsh.servicemay not be ready yet during early bootAGENTSH_SHIM_FORCE=1(env) bypasses the gate entirely — explicit enforcement intent takes precedenceHardening (via 10 roborev iterations)
internal/client/client.goAGENTSH_TRANSPORT=grpcprobesAGENTSH_GRPC_ADDR)grcporready_gate=trufail loudly)Test plan
go test ./cmd/agentsh-shell-shim/ ./internal/shim/— all passGOOS=windows go build ./...— cross-compile cleanroborev review --since de01ffe5 --wait— full-scope review passed cleanforce=true+ready_gate=true+ server not started → shim falls through to real shell🤖 Generated with Claude Code