docs: daemon foundation design note (IPC, permission routing, local auth) #74
…al auth)

Covers the three items requested by @chauncygu in SafeRL-Lab#68 before the foundation PR lands. Scope is intentionally narrow: service inventory, phasing, persistence, and cost guardrail defaults were settled in the issue thread and are not re-litigated here.

Sections:
1. IPC: Unix-socket default, optional TCP, HTTP/1.1 + JSON-RPC + SSE
2. Permission routing: originator-bound, fixes the first-answer-wins race
3. Local auth: peer-cred on Unix socket, bearer token on TCP, threat model explicitly single-user single-host (security boundary, not multi-user feature)

Three open questions flagged for review.

Refs SafeRL-Lab#68
Hi @mxh1999

This is a solid note overall. The originator-based permission routing in §2 is the strongest part and resolves a real race that the earlier "first-answer-wins" draft would have shipped. Threat model is realistic, defaults are concrete, and the Open Questions section makes the trade-offs reviewable instead of buried.

I'd like a few items in the document itself before we accept it as the foundation PR's contract; most are one- or two-line additions. Comments below.

Must address before accept (1–9): threading model, SSE heartbeat,

Can land as follow-up checklist (10–12):

Inline comments (anchor each to the listed section)

§1 IPC › Protocol: threading model
Please specify

§1 IPC › Event channel: keep-alive heartbeat
SSE behind NAT or a reverse proxy will be silently closed after idle timeouts (commonly 30–60s) and the client won't know until it tries to reconnect. Please specify a server-side heartbeat: a single SSE comment line (

§1 IPC › Method namespace:
Builds on @chauncygu's spike branch (feature/daemon-spike, e980cdb) to ship the foundation runtime per the roadmap in docs/RFC/0002-daemon-foundation-roadmap.md. No service has been migrated yet; that's F-3 through F-8 work.

Spike modules kept as-is (encode the wire contract reviewed in #74):
cc_daemon/__init__.py, auth.py, events.py, methods.py, originator.py, permission.py, rpc.py, spike_client.py

Spike modules patched (per labor split agreed in #68; server is not on the keep-as-is list, but changes are minimal patches, not rewrites):
- cc_daemon/server.py: Windows guard around UnixStreamServer; DaemonState gains unauthenticated_metrics + config kwargs; /healthz /readyz /metrics route through health.payload_for(...); DaemonState registers system_methods alongside spike methods.
- cc_daemon/cli.py: rewritten to expose serve_main(argv) for the new `cheetahclaws serve` surface; legacy `python -m cc_daemon.cli {serve|status|stop|logs|rotate-token}` entry preserved as backward-compat for spike-notes commands.

Foundation glue (added):
- cc_daemon/discovery.py: atomic ~/.cheetahclaws/daemon.json so REPL/Web/bridge clients can locate the daemon (transport, address, version) without parsing CLI args. Pid_alive cross-platform.
- cc_daemon/system_methods.py: system.ping (RFC contract name) and system.shutdown (sets DaemonState.shutdown_event for cross-platform graceful exit; Windows can't deliver SIGTERM cleanly to another Python process).
- commands/daemon_cmd.py: `cheetahclaws daemon {status, stop, logs, rotate-token}`. Uses Cheetahclaws-Api-Version header on every RPC.
- cheetahclaws.py: main() short-circuit for `serve`, `daemon`, plus backward-compat alias for `cheetahclaws spike-daemon ...`.
- health.py: extracted module-level healthz_payload(config) / readyz_payload(config) / metrics_payload(config) / payload_for(path, config) so both the standalone health server and the daemon listener reuse the same circuit-breaker / quota / runtime-registry probes. Existing health_check_port behavior unchanged.
- docs/RFC/0002-daemon-foundation-roadmap.md: F-1..F-9 PR breakdown with per-PR acceptance criteria.
- docs/architecture.md: Daemon section pointing at cc_daemon modules and the foundation glue.

Tests (72 new, 13 spike untouched, all green):
- tests/test_cc_daemon_discovery.py (16): unit, write/read/locate
- tests/test_cc_daemon_system_methods.py (8): unit, ping/shutdown
- tests/test_daemon_cmd.py (14): unit, dispatch + tail + rotate
- tests/test_health_payloads.py (8): unit, real health.py wiring
- tests/e2e_daemon_skeleton.py (13): subprocess: real boot, discovery, RPC, auth, /events SSE heartbeat, all daemon subcommands
- tests/test_daemon_spike.py (13): chauncygu's, untouched

Behavior change worth flagging:
- /healthz, /readyz, /metrics are now auth-gated by default per RFC 0001 §3. Spike returned them unauthenticated; we route through health.payload_for(...) and require Authorization (or peer-cred on UDS). Opt out with `--unauthenticated-metrics` for Prometheus scrapers.

What is NOT in F-1 (intentional, per roadmap):
- agent.run integration (no real session.send): F-3+
- Bridges in daemon (Telegram/Slack/WeChat): F-6/F-7/F-8
- monitor/scheduler in daemon: F-3
- agent_runner subprocess-per-agent: F-4
- SQLite event persistence: F-2
- Cost guardrail conservative defaults: F-9
- macOS peer-cred: TODO left in cc_daemon/auth.py from spike

Refs #68, #74

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
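The "atomic ~/.cheetahclaws/daemon.json" behaviour listed under the foundation glue can be illustrated with a write-to-temp-then-rename pattern. This is a minimal sketch, not the code in cc_daemon/discovery.py: the function names `write_discovery_atomic` and `read_discovery` are hypothetical, and the commit message only says the file carries transport, address, and version.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_discovery_atomic(path: Path, info: dict) -> None:
    """Write `info` as JSON so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(info, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace overwrites an existing target on both POSIX and Windows.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def read_discovery(path: Path) -> Optional[dict]:
    """Return the parsed discovery file, or None if it is missing or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
```

A client that reads the file while the daemon is rewriting it sees either the old content or the new content, never a partial JSON document.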
…ut flush, test floor

Four nits surfaced while smoke-testing F-1 (PR SafeRL-Lab#80) on real subprocesses. None are blockers, but each breaks a docs-promised path, so they deserve a dedicated polish PR rather than getting buried in F-2's diff.

1. `cheetahclaws daemon {status, stop, rotate-token}` now read the token path from the discovery file when `serve` was started with `--token-path`. `discovery.make_info()` accepts a new optional `token_path` keyword (schema stays at 1; the change is additive); `cli.cmd_serve()` records it only when the path overrides the default; new `commands.daemon_cmd._resolve_token_path()` prefers discovery and falls back to the default location for old discovery files and the unset case. Previously these verbs always read `~/.cheetahclaws/daemon_token`, which silently created a *different* random token, then failed with 401 against the running daemon.

2. `python -m cc_daemon.cli --help` (and the `cheetahclaws spike-daemon --help` backward-compat alias) now print a usage banner and exit 0 instead of `unknown subcommand: --help` / exit 2. The unknown-subcommand branch also includes the banner so users see how to recover. The PR description for SafeRL-Lab#80 said the spike-daemon alias was "preserved"; this closes the gap.

3. The three serve-mode startup prints (`token: …`, `cheetahclaws daemon listening on …`, `audit log: …`) now use `flush=True` so they appear immediately when stdout is redirected to a file under `&`. Previously they sat in Python's 4 KB block buffer until the daemon exited, silently breaking the spike-notes' `--print-token > out.log &` workflow because the token line never reached disk.

4. `tests/e2e_daemon_skeleton.py::test_daemon_writes_discovery_and_token` token-length floor raised from `>= 32` to `>= 40`. `secrets.token_urlsafe(32)` yields 43 chars, so the previous floor was loose enough that an accidental shrink to 24 raw bytes (32 chars) would still ship green.

Tests: 10 new unit tests (4 covering `cli.main` dispatch, 4 covering `_resolve_token_path`, 2 covering `discovery.make_info`'s new field). Full suite 669/669 passing on `main`. End-to-end smoke verified all three runtime fixes against `cheetahclaws serve --listen tcp://...`.

Docs:
* `README.md` docs index: adds row for RFC 0002 (foundation roadmap); refreshes the spike row to reflect the actual landing path (SafeRL-Lab#77 → reverted → re-landed via SafeRL-Lab#81); marks F-1 as merged via SafeRL-Lab#80.
* `docs/RFC/0002-daemon-foundation-roadmap.md`: F-1 status `OPEN` → `MERGED SafeRL-Lab#80`.
* `docs/architecture.md`: daemon section now mentions the optional `token_path` discovery field and notes that the daemon-control verbs use it.
* `docs/news.md`: May 2, 2026 entry covering the spike re-land and F-1 merge sequence, the polish nits, and the intentional "not in F-1" list (agent.run, bridges, SQLite, cost guardrails, agent-runner subprocess, macOS peer-cred).

Refs SafeRL-Lab#68, SafeRL-Lab#74, SafeRL-Lab#80, SafeRL-Lab#81

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
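The "prefer discovery, fall back to the default" lookup from nit 1 might look roughly like the sketch below. It is an illustration under assumptions, not the body of `commands.daemon_cmd._resolve_token_path()`: the function name `resolve_token_path`, its signature, and the error handling are hypothetical; only the `token_path` field and the `~/.cheetahclaws/daemon_token` default come from the commit message.

```python
import json
from pathlib import Path

# Default location named in the commit message.
DEFAULT_TOKEN_PATH = Path.home() / ".cheetahclaws" / "daemon_token"


def resolve_token_path(discovery_file: Path, default: Path = DEFAULT_TOKEN_PATH) -> Path:
    """Prefer the token_path recorded in the discovery file; otherwise use the default."""
    try:
        info = json.loads(discovery_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # No discovery file, or it is not valid JSON: behave like the unset case.
        return default
    recorded = info.get("token_path") if isinstance(info, dict) else None
    if isinstance(recorded, str) and recorded:
        return Path(recorded).expanduser()
    # Old discovery files written before the field existed land here.
    return default
```

Because the field is optional and additive, a discovery file written by an older daemon resolves to the same default path as before.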
Adds an isolated cc_daemon/ package (~1.1k LoC across 9 files, plus 360 lines of pytest) that implements the daemon contract surface defined in docs/RFC/0001-daemon-design-note.md (PR SafeRL-Lab#74). This is a SPIKE: it exists to lock down the IPC + permission routing + local auth contract in runnable code so the upcoming foundation PR has a verified starting point. Nothing in cc_daemon/ is load-bearing for production.

Covers the ✓ rows of the RFC review-comment matrix: ThreadingHTTPServer with a non-default request_queue_size, 15s SSE heartbeat, client_id mint/persist/resume, sync RPC + async events (variant A of session.send), Cheetahclaws-Api-Version: 0 → 426 on mismatch, bounded event ring buffer with gap event on overflow, audit log default-on for both transports, 30 min interactive permission timeout with permission.refresh_timeout RPC, and originator-only permission.answer returning HTTP 403 / -32001 to non-originators.

Out of scope: agent.run integration, bridges migration, SQLite event store, cost guardrails, agent-runner subprocess isolation, /metrics, macOS peer-cred (TODO(macos) left in cc_daemon/auth.py).

Main code touches are minimal and isolated: a 6-line subcommand shim in cheetahclaws.py that intercepts `cheetahclaws spike-daemon ...` before the main argparse runs, plus one cc_daemon entry in pyproject.toml's package list. Existing CLI flags (--version, --help, prompt parsing) are unchanged.

Tests: 593 passing (580 existing + 13 new). Run the new suite with `pytest tests/test_daemon_spike.py -v`. Manual smoke and originator-routing demo in docs/RFC/0001-spike-notes.md "How to run it".

Refs SafeRL-Lab#68, SafeRL-Lab#74

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
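To make the "bounded event ring buffer with gap event on overflow" item concrete, here is a small sketch of one way such a buffer could behave. It is not the code in cc_daemon/events.py: the `EventRing` class, the monotonically increasing integer ids, and the shape of the `gap` event are all assumptions made for illustration.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class EventRing:
    """Keep the most recent `capacity` events; report a gap when older ones were evicted."""

    def __init__(self, capacity: int) -> None:
        self._buf: Deque[Dict[str, Any]] = deque(maxlen=capacity)
        self._next_id = 1

    def append(self, event_type: str, data: Any) -> int:
        event_id = self._next_id
        self._next_id += 1
        # A deque with maxlen silently drops the oldest entry once full.
        self._buf.append({"id": event_id, "type": event_type, "data": data})
        return event_id

    def since(self, last_seen_id: int) -> List[Dict[str, Any]]:
        """Events newer than last_seen_id, preceded by a gap event if some were lost."""
        out: List[Dict[str, Any]] = []
        oldest = self._buf[0]["id"] if self._buf else self._next_id
        if last_seen_id + 1 < oldest:
            # The client asked for ids that have already been evicted.
            out.append({
                "type": "gap",
                "data": {"missed_from": last_seen_id + 1, "missed_to": oldest - 1},
            })
        out.extend(e for e in self._buf if e["id"] > last_seen_id)
        return out
```

A reconnecting client that receives a `gap` event knows its view is incomplete and can refetch state through a regular RPC instead of assuming it saw every event.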
Covers the three items @chauncygu requested in #68 before the foundation PR lands.
Scope is intentionally narrow — service inventory, phasing, persistence, and cost guardrail defaults were settled in the issue thread and are not re-litigated here.
What's in this PR
A single new doc: `docs/RFC/0001-daemon-design-note.md` (164 lines). Three sections:

1. IPC: Unix socket by default, optional TCP (`--listen tcp://`); HTTP/1.1 framing on top; JSON-RPC 2.0 for the data plane (`POST /rpc`), SSE for events (`GET /events`); existing `/healthz`, `/readyz`, `/metrics` unchanged.
2. Permission routing: `PermissionRequest` carries an `originator`; only the originator may answer (other clients see read-only via `/events`). Fixes the first-answer-wins race called out in the "[Question] Should /monitor, /agent, bridges survive REPL exit?" #68 review.
3. Local auth: peer credentials (`SO_PEERCRED`/`LOCAL_PEERCRED`) on the Unix socket; bearer token on TCP. TLS out of scope; reverse-proxy recipe documented instead.

A short "Related decisions" section anchors the items already settled in #68 (subprocess-per-agent, bridges in foundation, cost defaults, API RC window) so reviewers know what's not up for debate here.
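The originator rule in the permission routing section can be sketched in a few lines. This is an illustration only, assuming a hypothetical `PermissionRouter` class and string client ids; the HTTP 403 / `-32001` pairing is the one quoted in the spike commit, while the 404 / `-32602` response for an unknown request is an assumption. The JSON-RPC `id` member is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

NOT_ORIGINATOR = -32001   # error code quoted in the spike summary
UNKNOWN_REQUEST = -32602  # JSON-RPC "invalid params"; this choice is an assumption


@dataclass
class PermissionRequest:
    request_id: str
    originator: str  # client_id of the client whose action triggered the prompt


def _error(code: int, message: str) -> Dict[str, Any]:
    return {"jsonrpc": "2.0", "error": {"code": code, "message": message}}


class PermissionRouter:
    def __init__(self) -> None:
        self._pending: Dict[str, PermissionRequest] = {}

    def open(self, request_id: str, originator: str) -> None:
        self._pending[request_id] = PermissionRequest(request_id, originator)

    def answer(self, request_id: str, client_id: str, decision: str) -> Tuple[int, Dict[str, Any]]:
        """Return (http_status, jsonrpc_body) for a permission.answer call."""
        req = self._pending.get(request_id)
        if req is None:
            return 404, _error(UNKNOWN_REQUEST, "unknown permission request")
        if client_id != req.originator:
            # Non-originators may observe the request but cannot resolve it.
            return 403, _error(NOT_ORIGINATOR, "only the originator may answer")
        del self._pending[request_id]
        return 200, {"jsonrpc": "2.0", "result": {"request_id": request_id, "decision": decision}}
```

Binding the answer to the originator removes the race: a second client answering first is rejected and the request stays pending until the originator responds.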
Open questions flagged in the doc
… `agent_runner` is its own originator class or whether the configured bridge is the originator for those requests.

Happy to discuss inline. Once these are resolved, the foundation PR follows.
Refs #68