
fix: IPC/ptrace hardening — fix 30s exec hang + related races #215

Merged
erans merged 7 commits into main from fix/ipc-ptrace-hardening on Apr 11, 2026
erans commented Apr 11, 2026

Summary

Fixes three IPC/ptrace bugs surfaced by a 30-second hang in agentsh exec after rebuilding with CGO=1. Root cause was that SetReadDeadline is a silent no-op on os.NewFile-wrapped socketpair fds (they aren't registered with Go's netpoll), so the FD-handoff and READY-byte handshakes had no effective timeout.

  • SOCK_CLOEXEC on the API-layer notify socketpair, matching the wrapper pattern in wrap_linux.go. Prevents fd leak on accidental fork/exec.
  • resumeTracedProcess race hardening: handles ECHILD from Wait4, already-exited/signaled status, and ESRCH from PtraceDetach. Each path is logged at debug for forensics.
  • 30s hang root fix: replaced SetReadDeadline with SO_RCVTIMEO (kernel-level timeout that works with raw blocking recvmsg) in both notify_linux.go and signal_handler_linux.go. Added lifecycle management in the notify handler so the 10s FD-handoff timeout is cleared before the 30s READY-byte read.
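For readers unfamiliar with the SO_RCVTIMEO approach, here is a minimal Linux-only sketch of bounding a raw blocking recvmsg on a socketpair fd. It is not the code from notify_linux.go or signal_handler_linux.go: the helper names (`setRecvTimeout`, `recvmsgBounded`, `timedOutOnIdlePair`), the use of the standard `syscall` package, and the SOCK_STREAM socket type are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
	"time"
)

// setRecvTimeout sets SO_RCVTIMEO on a raw socket fd.
// A zero duration clears the timeout, so reads block indefinitely again.
func setRecvTimeout(fd int, d time.Duration) error {
	tv := syscall.NsecToTimeval(d.Nanoseconds())
	return syscall.SetsockoptTimeval(fd, syscall.SOL_SOCKET, syscall.SO_RCVTIMEO, &tv)
}

// recvmsgBounded (hypothetical helper) issues a raw blocking recvmsg, the
// kind of call that bypasses Go's netpoll. The kernel enforces the timeout.
func recvmsgBounded(fd int, d time.Duration) (int, error) {
	if err := setRecvTimeout(fd, d); err != nil {
		return 0, err
	}
	data := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one passed fd
	for {
		n, _, _, _, err := syscall.Recvmsg(fd, data, oob, 0)
		if err == syscall.EINTR {
			continue // interrupted by a signal; retrying restarts the timeout
		}
		return n, err
	}
}

// timedOutOnIdlePair (hypothetical demo helper) reads from a socketpair whose
// peer never writes and reports whether the read ended with a timeout.
func timedOutOnIdlePair(d time.Duration) (bool, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return false, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	_, err = recvmsgBounded(fds[0], d)
	// On Linux an expired SO_RCVTIMEO surfaces as EAGAIN (same value as EWOULDBLOCK).
	return errors.Is(err, syscall.EAGAIN), nil
}

func main() {
	timedOut, err := timedOutOnIdlePair(50 * time.Millisecond)
	fmt.Println("timed out:", timedOut, "err:", err)
}
```

Because the timeout lives on the socket rather than in Go's runtime poller, it applies to any blocking receive on that fd, which is also why it has to be cleared or replaced before a later read with a different budget.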

Design spec and implementation plan live on main at docs/superpowers/specs/2026-04-10-ipc-ptrace-hardening-design.md and docs/superpowers/plans/2026-04-10-ipc-ptrace-hardening.md.

Test plan

  • go build ./... (linux)
  • GOOS=darwin go build ./...
  • GOOS=windows go build ./...
  • go test ./... (all packages pass)
  • go vet ./internal/api/... (clean on all platforms)
  • roborev review on every commit (no findings above Low)
  • Final spec + code review (no findings above Low)
  • Manual smoke: agentsh exec with CGO=1 no longer hangs
  • CI green

🤖 Generated with Claude Code

erans and others added 7 commits April 10, 2026 16:52
Atomically sets close-on-exec on the parent fd, eliminating a TOCTOU
race window in multi-threaded Go processes where the fd could leak
into a concurrently-forking child. Matches the CLI layer pattern
(internal/cli/wrap_linux.go:93).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
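The atomic close-on-exec pattern described in this commit can be sketched as follows. This is an illustration only, not the code in the API layer or in wrap_linux.go: `newNotifyPair` and `isCloexec` are hypothetical names, the standard `syscall` package stands in for whatever package the project uses, and SOCK_STREAM is an assumed socket type.

```go
package main

import (
	"fmt"
	"syscall"
)

// newNotifyPair (hypothetical) creates a socketpair with FD_CLOEXEC set by
// the kernel at creation time, so there is no window between socketpair()
// and a later fcntl() in which a concurrent fork/exec could inherit the fd.
// SOCK_STREAM is an assumption; the PR does not state the socket type.
func newNotifyPair() (parent, child int, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return -1, -1, err
	}
	return fds[0], fds[1], nil
}

// isCloexec reports whether FD_CLOEXEC is set on fd, using fcntl(F_GETFD).
func isCloexec(fd int) (bool, error) {
	flags, _, errno := syscall.Syscall(syscall.SYS_FCNTL, uintptr(fd), syscall.F_GETFD, 0)
	if errno != 0 {
		return false, errno
	}
	return flags&syscall.FD_CLOEXEC != 0, nil
}

func main() {
	parent, child, err := newNotifyPair()
	if err != nil {
		fmt.Println("socketpair:", err)
		return
	}
	defer syscall.Close(parent)
	defer syscall.Close(child)

	for _, fd := range []int{parent, child} {
		set, err := isCloexec(fd)
		fmt.Println("fd", fd, "cloexec:", set, "err:", err)
	}
}
```

With the flag set on both ends, the end intended for the child has to be handed over explicitly (for example through `exec.Cmd.ExtraFiles`), which keeps inheritance deliberate rather than accidental.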
Treat ECHILD (already reaped), exited/signaled status, and ESRCH on
detach as success with debug logging. Previously these races surfaced
as exit_code=127 errors to callers (exec.go, exec_stream.go).
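A rough sketch of the classification this commit describes is shown below. It is not the body of resumeTracedProcess: `traceeGone`, `detachOK`, and `waitOnNonChild` are hypothetical names, and the reason strings merely stand in for the debug log messages.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// traceeGone (hypothetical) decides whether a Wait4 result means the tracee
// no longer needs to be resumed or detached. The reason string stands in for
// a debug log message.
func traceeGone(waitErr error, ws syscall.WaitStatus) (bool, string) {
	switch {
	case errors.Is(waitErr, syscall.ECHILD):
		return true, "already reaped (ECHILD)"
	case waitErr != nil:
		return false, "wait failed"
	case ws.Exited():
		return true, "already exited"
	case ws.Signaled():
		return true, "killed by a signal"
	default:
		return false, "still traced"
	}
}

// detachOK (hypothetical) treats ESRCH from PtraceDetach as success: the
// process vanished between the wait and the detach.
func detachOK(err error) bool {
	return err == nil || errors.Is(err, syscall.ESRCH)
}

// waitOnNonChild (demo helper) provokes a real ECHILD by waiting on PID 1,
// which is never a child of the calling process.
func waitOnNonChild() error {
	var ws syscall.WaitStatus
	_, err := syscall.Wait4(1, &ws, syscall.WNOHANG, nil)
	return err
}

func main() {
	gone, why := traceeGone(waitOnNonChild(), 0)
	fmt.Println("gone:", gone, "reason:", why)
	fmt.Println("ESRCH benign:", detachOK(syscall.ESRCH))
	fmt.Println("EPERM benign:", detachOK(syscall.EPERM))
}
```

Any other wait or detach error still propagates to the caller in this sketch, so only the three race outcomes named in the commit are downgraded to success.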
SetReadDeadline fails on os.NewFile-wrapped socketpair fds (not
registered with Go's netpoll). The notify handler already handles
this correctly (notify_linux.go:218-221). Match that pattern so
signal monitoring is not silently disabled.
unixmon.RecvFD calls recvmsg on the raw fd, bypassing Go's netpoll,
so SetReadDeadline was a no-op — it never actually bounded the
receive. Replace with SO_RCVTIMEO, a kernel-level timeout that works
with raw blocking recvmsg. Matches the wrapper pattern at
cmd/agentsh-unixwrap/main.go:163.

Fixes a Medium-severity roborev finding on 0e0a6b6 where the signal
handler could block indefinitely on RecvFD if the wrapper stalled.
The same issue exists in the notify handler (pre-existing); both are
fixed here.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 10s SO_RCVTIMEO set on parentSock for the FD handoff persisted and
unexpectedly bounded the later READY byte read, which was intended to
have a 30s timeout. The previous SetReadDeadline calls in the READY
block were silent no-ops because parentSock wraps a raw socketpair fd
that isn't registered with Go's netpoll.

Clear SO_RCVTIMEO immediately after RecvFD, apply a fresh 30s
SO_RCVTIMEO before the READY read, and clear it again afterward so it
doesn't leak to any subsequent reads on the socket.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
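The set, read, clear lifecycle from this commit can be illustrated with a small Linux-only sketch. It is not the notify handler's code: `withRecvTimeout`, `readByte`, and `demoLifecycle` are hypothetical names, the standard `syscall` package is an assumption, and the 10s/30s budgets are scaled down to 50ms and 2s so the demo finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"syscall"
	"time"
)

// setRecvTimeout sets SO_RCVTIMEO on fd; a zero duration clears it.
func setRecvTimeout(fd int, d time.Duration) error {
	tv := syscall.NsecToTimeval(d.Nanoseconds())
	return syscall.SetsockoptTimeval(fd, syscall.SOL_SOCKET, syscall.SO_RCVTIMEO, &tv)
}

// withRecvTimeout (hypothetical) applies a timeout for one read step and
// always clears it afterward so it cannot leak into later reads.
func withRecvTimeout(fd int, d time.Duration, step func() error) error {
	if err := setRecvTimeout(fd, d); err != nil {
		return err
	}
	defer setRecvTimeout(fd, 0)
	return step()
}

// readByte does a raw blocking one-byte read, retrying on EINTR.
func readByte(fd int) (byte, error) {
	var buf [1]byte
	for {
		n, err := syscall.Read(fd, buf[:])
		switch {
		case err == syscall.EINTR:
			continue
		case err != nil:
			return 0, err
		case n == 0:
			return 0, io.EOF
		}
		return buf[0], nil
	}
}

// demoLifecycle (hypothetical) mimics the two phases with scaled-down budgets.
func demoLifecycle() (handoffTimedOut bool, ready byte, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return false, 0, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	// Phase 1 (handoff window): nothing is sent, so the short timeout fires.
	err = withRecvTimeout(fds[0], 50*time.Millisecond, func() error {
		_, e := readByte(fds[0])
		return e
	})
	handoffTimedOut = errors.Is(err, syscall.EAGAIN)

	// Phase 2 (READY window): the peer answers after 150ms, later than the
	// phase-1 budget but well within the fresh, longer one.
	go func() {
		time.Sleep(150 * time.Millisecond)
		syscall.Write(fds[1], []byte{'R'})
	}()
	err = withRecvTimeout(fds[0], 2*time.Second, func() error {
		var e error
		ready, e = readByte(fds[0])
		return e
	})
	return handoffTimedOut, ready, err
}

func main() {
	timedOut, ready, err := demoLifecycle()
	fmt.Printf("handoff timed out: %v, ready byte: %q, err: %v\n", timedOut, ready, err)
}
```

Scoping each timeout to a single step keeps the budgets independent: a reply that arrives after the first budget has elapsed is still accepted in the second phase.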
unix.SOCK_CLOEXEC is not defined on darwin, so the !windows build tag
broke darwin cross-compilation after the previous SOCK_CLOEXEC fix. The
seccomp user-notify code path this helper feeds is linux-only anyway
(matching the notify_stub.go pattern), so narrow the build tag to linux
and add a no-op stub for other unix platforms so core.go still builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The agentsh-unixwrap signal filter installs its own
SECCOMP_RET_USER_NOTIF filter on top of the main wrapper filter. When
the main filter is already using USER_NOTIF (for unix socket
monitoring, file monitoring, metadata interception, or execve
interception), stacking a second listener on the same thread breaks
notification delivery. On Alpine/musl this surfaces as libreadline
EBADF loops inside bash because the signal filter's listener
interferes with the main filter's openat notifications, causing
TestAlpineEnvInject_BashBuiltinDisabled to hang indefinitely.

Previously the wrap.go gate only disabled the signal filter when
execveEnabled was true; unix sockets and file monitoring were silently
stacked. core.go had no gate at all, so every exec that hit
setupSeccompWrapper with signal rules tripped the bug.

Extend the wrap.go signalFilterEnabled helper with a new
mainFilterUsesUserNotify method that mirrors the feature gates in
unixmon.InstallFilterWithConfig, and route core.go's setupSeccompWrapper
through the same helper. execveEnabled has to move up above the gate in
setupSeccompWrapper because the runtime value (which respects
hybrid-ptrace mode) is what the gate must see.
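The shape of that gate might look roughly like the sketch below. Only the names signalFilterEnabled and mainFilterUsesUserNotify come from the commit message; the `wrapperFeatures` struct, its fields, and the function signature are hypothetical stand-ins for whatever configuration the real helper inspects.

```go
package main

import "fmt"

// wrapperFeatures is a hypothetical stand-in for the runtime configuration
// the real helper inspects. The four fields mirror the features the commit
// message lists as users of USER_NOTIF in the main filter.
type wrapperFeatures struct {
	UnixSockets bool
	FileMonitor bool
	Metadata    bool
	Execve      bool // runtime value, i.e. after hybrid-ptrace mode is applied
}

// mainFilterUsesUserNotify reports whether the main wrapper filter already
// installs a SECCOMP_RET_USER_NOTIF listener for any feature.
func (f wrapperFeatures) mainFilterUsesUserNotify() bool {
	return f.UnixSockets || f.FileMonitor || f.Metadata || f.Execve
}

// signalFilterEnabled allows the separate signal filter only when signal
// rules exist and no other USER_NOTIF listener would be stacked under it.
func signalFilterEnabled(hasSignalRules bool, f wrapperFeatures) bool {
	return hasSignalRules && !f.mainFilterUsesUserNotify()
}

func main() {
	fmt.Println(signalFilterEnabled(true, wrapperFeatures{}))
	fmt.Println(signalFilterEnabled(true, wrapperFeatures{FileMonitor: true}))
	fmt.Println(signalFilterEnabled(false, wrapperFeatures{}))
}
```

Routing both call sites through one predicate is what keeps wrap.go and core.go from drifting apart again when a new USER_NOTIF feature is added.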

signal_handler_linux.go's earlier fix — exit on error instead of
hot-looping on ENOENT when the filter is destroyed — is included here
as an orthogonal defensive improvement.
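The difference between hot-looping and exiting on a receive error can be sketched as below. This is not the code in signal_handler_linux.go: `serveNotifications` and `demoServe` are hypothetical names, the receive function is injected so the example runs without seccomp, and retrying on EINTR is an added assumption.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// serveNotifications (hypothetical) drains notifications until recv fails.
// Returning on error, rather than logging and continuing, is what prevents a
// hot loop once the filter is destroyed and every receive yields ENOENT.
// Retrying on EINTR is an assumption of this sketch.
func serveNotifications(recv func() (uint64, error), handle func(id uint64)) error {
	for {
		id, err := recv()
		if errors.Is(err, syscall.EINTR) {
			continue
		}
		if err != nil {
			return fmt.Errorf("notify receive: %w", err)
		}
		handle(id)
	}
}

// demoServe (hypothetical) feeds two notifications and then ENOENT forever,
// counting how often recv is called to show that the loop stops immediately.
func demoServe() (handled, recvCalls int, err error) {
	pending := []uint64{1, 2}
	recv := func() (uint64, error) {
		recvCalls++
		if len(pending) == 0 {
			return 0, syscall.ENOENT
		}
		id := pending[0]
		pending = pending[1:]
		return id, nil
	}
	err = serveNotifications(recv, func(uint64) { handled++ })
	return handled, recvCalls, err
}

func main() {
	handled, calls, err := demoServe()
	fmt.Println("handled:", handled, "recv calls:", calls, "err:", err)
}
```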

Fixes TestAlpineEnvInject_BashBuiltinDisabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@erans erans merged commit b42c2ad into main Apr 11, 2026
9 checks passed