Skip to content

feat: add server readiness gate to shell shim#213

Merged
erans merged 17 commits intomainfrom
shim-readiness-gate
Apr 10, 2026
Merged

feat: add server readiness gate to shell shim#213
erans merged 17 commits intomainfrom
shim-readiness-gate

Conversation

@erans
Copy link
Copy Markdown
Collaborator

@erans erans commented Apr 10, 2026

Summary

  • Adds an opt-in server readiness gate (ready_gate=true in /etc/agentsh/shim.conf) that probes the agentsh server before enforcing policy
  • Prevents boot loops when force=true is baked into container images and the shim is root's login shell — agentsh.service may not be ready yet during early boot
  • Local servers (loopback/unix) get fail-open behavior; remote servers always fail-closed
  • AGENTSH_SHIM_FORCE=1 (env) bypasses the gate entirely — explicit enforcement intent takes precedence

Hardening (via 10 roborev iterations)

  • Unix socket support with path parsing aligned to internal/client/client.go
  • Strict URL scheme validation (rejects schemeless values, validates host)
  • gRPC transport support (AGENTSH_TRANSPORT=grpc probes AGENTSH_GRPC_ADDR)
  • Transport and config value validation (typos like grcp or ready_gate=tru fail loudly)
  • Hostname resolution with all-loopback requirement (mixed resolution = remote)
  • EACCES on unix sockets fails closed (bad permissions ≠ "not started yet")
  • Longer probe timeout for remote servers (2s vs 200ms local)

Test plan

  • go test ./cmd/agentsh-shell-shim/ ./internal/shim/ — all pass
  • GOOS=windows go build ./... — cross-compile clean
  • roborev review --since de01ffe5 --wait — full-scope review passed clean
  • Verify boot-loop scenario: force=true + ready_gate=true + server not started → shim falls through to real shell
  • Verify enforcement resumes once server is up

🤖 Generated with Claude Code

erans and others added 17 commits April 10, 2026 13:21
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a TCP dial check before the shim attempts enforcement. If the
agentsh server isn't reachable, the shim falls through to bash.real
instead of failing. This prevents boot loops when force=true is baked
into server images and root's login shell is the shim — the server may
not be ready during early boot.

Once the server is up, all sessions (interactive and non-interactive
with force=true) get enforced as before. The readiness probe uses a
200ms timeout TCP dial to AGENTSH_SERVER (default 127.0.0.1:18080),
adding <1ms latency on loopback when the server is running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ket support

Address roborev findings:

1. Unix socket support: serverAddrFromEnv now parses unix:// URLs and
   returns ("unix", "/path") so the readiness probe dials the correct
   socket type instead of silently falling back to TCP loopback.

2. Fail-open scope: only local servers (loopback TCP, unix sockets)
   get fail-open behavior on probe failure. Remote servers fail-closed
   with an error message, preventing a network issue from silently
   disabling enforcement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make serverAddrFromEnv scheme-aware: https:// without explicit port
probes :443, http:// without port probes :80. The 18080 fallback only
applies when AGENTSH_SERVER is unset (the agentsh default).

Fixes: remote HTTPS config like AGENTSH_SERVER=https://agent.example.com
previously probed :18080 and triggered fail-closed even though the
server was healthy on :443.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address roborev findings:

1. Fail-open scope (HIGH): The readiness gate is now opt-in via
   ready_gate=true in shim.conf. Default behavior is fail-closed,
   preserving the security posture. Operators who need boot safety
   explicitly opt in, accepting the tradeoff that a crashed local
   server temporarily disables enforcement.

2. Unix socket parsing (MEDIUM): Align serverAddrFromEnv with the
   client transport logic (internal/client/client.go:55-61) — when
   both u.Host and u.Path are present in a unix:// URL, join them
   as u.Host+u.Path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address roborev findings:

1. Invalid AGENTSH_SERVER (HIGH): serverAddrFromEnv now returns an
   error for non-empty but unparseable values instead of falling back
   to the local default. The readiness gate treats parse errors as
   fail-closed, preventing a typo from silently enabling fail-open.

2. Test coverage (LOW): TestShimReadinessGate_ServerReachable_ForceEnforces
   now sets ready_gate=true so it actually exercises the gate path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
url.Parse accepts schemeless values like "localhost" or "/tmp/agentsh.sock"
as path-only URLs, which bypassed scheme validation. Switch to strict
scheme-based dispatch (unix, http, https only) and add test coverage for
schemeless and empty-host inputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes from review:
1. Use 2s dial timeout for remote servers (was 200ms for all endpoints),
   preventing false negatives on VPN/high-latency links.
2. Distinguish config validation errors (force=tru, ready_gate=tru) from
   I/O errors — validation errors now fail with a clear message instead
   of being silently swallowed into the fail-closed force=true path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… locality check

1. AGENTSH_SHIM_FORCE=1 (env) now skips the readiness gate entirely.
   The env var signals explicit operator intent to enforce — the gate's
   fail-open behavior would contradict that. Config-driven force=true
   still gets boot-safety via ready_gate.

2. serverIsLocal now resolves hostnames via DNS (200ms timeout) to
   handle /etc/hosts aliases that point to loopback. Resolution failure
   defaults to remote (fail-closed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mixed-resolution hostnames (loopback + remote) were incorrectly treated
as local, breaking the fail-closed guarantee for remote servers. Now all
resolved addresses must be loopback for a hostname to be classified as
local. Also makes the resolver injectable for testing and adds table
tests for loopback alias, mixed resolution, and lookup failure cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. serverAddrFromEnv now checks AGENTSH_TRANSPORT=grpc and probes
   AGENTSH_GRPC_ADDR instead of AGENTSH_SERVER, matching the client's
   transport selection. Prevents probing the wrong endpoint.

2. serverReachable now returns the dial error so the caller can
   distinguish transient errors (ECONNREFUSED, ENOENT — server not
   started) from hard errors (EACCES — bad socket permissions). Only
   transient errors trigger fail-open; hard errors fail-closed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. readiness gate error hint now mentions AGENTSH_GRPC_ADDR when
   transport is grpc, instead of always saying AGENTSH_SERVER.

2. Added end-to-end shim tests:
   - TestShimReadinessGate_GRPCTransport_ProbesGRPCAddr: verifies gate
     probes gRPC endpoint even when AGENTSH_SERVER is unreachable.
   - TestShimReadinessGate_UnixSocketEACCES_FailsClosed: verifies EACCES
     on a unix socket triggers fail-closed, not fail-open fallthrough.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. TestShimReadinessGate_GRPCTransport_ProbesGRPCAddr now tracks
   accepts on the fake listener and asserts the probe hit it. Uses
   AGENTSH_BIN=/nonexistent for deterministic enforcement failure.

2. TestShimReadinessGate_UnixSocketEACCES_FailsClosed restricted to
   Linux only (chmod on unix sockets is OS-dependent on BSD/macOS).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace atomic counter with channel-based synchronization for the accept
tracking in TestShimReadinessGate_GRPCTransport_ProbesGRPCAddr. The
channel+timeout avoids the race between the accept goroutine updating
the counter and the test reading it after cmd.Run() returns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Lowercase AGENTSH_TRANSPORT before comparison (CLI does the same in
client.go and exec_pty.go). Strip scheme prefix from AGENTSH_GRPC_ADDR
to match client.NewGRPC normalization (accepts "grpc://host:port").
Added tests for uppercase transport and scheme-bearing gRPC address.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. serverAddrFromEnv now rejects unknown AGENTSH_TRANSPORT values
   (e.g. "grcp" typo) with a clear error instead of silently falling
   through to the HTTP path, which could disable enforcement via
   ready_gate fail-open.

2. Added design comment to serverReachable explaining why a raw TCP/unix
   dial is used instead of an HTTP health check: the shim is a boot-time
   login shell that must be minimal, fast, and free of HTTP dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@erans erans merged commit 5a6ff8d into main Apr 10, 2026
9 checks passed
@erans erans deleted the shim-readiness-gate branch April 10, 2026 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant