docs: daemon foundation design note (IPC, permission routing, local auth) by mxh1999 · Pull Request #74 · SafeRL-Lab/cheetahclaws

mxh1999 · 2026-04-29T07:10:02Z

Covers the three items @chauncygu requested in #68 before the foundation PR lands.

Scope is intentionally narrow — service inventory, phasing, persistence, and cost guardrail defaults were settled in the issue thread and are not re-litigated here.

What's in this PR

A single new doc: docs/RFC/0001-daemon-design-note.md (164 lines).

Three sections:

IPC — Unix-socket default with optional TCP (--listen tcp://); HTTP/1.1 framing on top; JSON-RPC 2.0 for the data plane (POST /rpc), SSE for events (GET /events); existing /healthz /readyz /metrics unchanged.
Permission routing — every PermissionRequest carries an originator; only the originator may answer (other clients get a read-only view via /events). Fixes the first-answer-wins race called out in the review of #68 ([Question] Should /monitor, /agent, bridges survive REPL exit?).
Local auth — explicitly framed as a security boundary, not a multi-user feature. Peer credentials (SO_PEERCRED / LOCAL_PEERCRED) on the Unix socket; bearer token on TCP. TLS out of scope; reverse-proxy recipe documented instead.
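
The originator rule in the permission-routing item can be sketched as below. This is a minimal in-memory illustration, not the RFC's actual API: `PermissionRouter`, `open_request`, and `NotOriginator` are hypothetical names, and the `-32001` code is assumed from the spike's commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class NotOriginator(Exception):
    """Raised when a client other than the originator tries to answer."""
    code = -32001  # assumption: JSON-RPC error code for this case


@dataclass
class PermissionRequest:
    request_id: str
    originator: str                 # client_id that triggered the request
    answer: Optional[bool] = None   # None while the request is pending


@dataclass
class PermissionRouter:
    pending: Dict[str, PermissionRequest] = field(default_factory=dict)

    def open_request(self, request_id: str, originator: str) -> PermissionRequest:
        req = PermissionRequest(request_id, originator)
        self.pending[request_id] = req
        return req

    def answer(self, request_id: str, client_id: str, allow: bool) -> PermissionRequest:
        req = self.pending[request_id]
        if client_id != req.originator:
            # Other clients may observe the request via /events but cannot answer it.
            raise NotOriginator(f"{client_id} is not the originator of {request_id}")
        req.answer = allow
        del self.pending[request_id]
        return req
```

Because the check happens before any state changes, a rejected answer from a second client leaves the request pending for the originator, which is what removes the first-answer-wins race.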

A short "Related decisions" section anchors the items already settled in #68 (subprocess-per-agent, bridges in foundation, cost defaults, API RC window) so reviewers know what's not up for debate here.

Open questions flagged in the doc

HTTP-on-socket vs raw newline-delimited JSON-RPC.
Whether an autonomous agent_runner is its own originator class or whether the configured bridge is the originator for those requests.
Audit log default for the Unix socket (off vs always-on).

Happy to discuss inline. Once these are resolved, the foundation PR follows.

Refs #68

@chauncygu

…al auth)

Covers the three items requested by @chauncygu in SafeRL-Lab#68 before the foundation PR lands. Scope is intentionally narrow: service inventory, phasing, persistence, and cost guardrail defaults were settled in the issue thread and are not re-litigated here.

Sections:
1. IPC: Unix-socket default, optional TCP, HTTP/1.1 + JSON-RPC + SSE
2. Permission routing: originator-bound, fixes the first-answer-wins race
3. Local auth: peer-cred on Unix socket, bearer token on TCP, threat model explicitly single-user single-host (security boundary, not multi-user feature)

Three open questions flagged for review.

Refs SafeRL-Lab#68

chauncygu · 2026-04-30T17:23:41Z

Hi @mxh1999

This is a solid note overall. The originator-based permission routing in §2 is the strongest part and resolves a real race that the earlier "first-answer-wins" draft would have shipped. Threat model is realistic, defaults are concrete, and the Open Questions section makes the trade-offs reviewable instead of buried.

I'd like a few items in the document itself before we accept it as the foundation PR's contract — most are one- or two-line additions. Comments below.

Must address before accept (1–9): threading model, SSE heartbeat, client_id lifecycle, session.send semantics, macOS peer-cred reality, API versioning, event retention default, audit-log default flip, interactive permission timeout.

Can land as follow-up checklist (10–12): /events filter semantics, binary payload story, metrics-endpoint redaction.

Inline comments (anchor each to the listed section)

§1 IPC › Protocol — threading model

Reuses stdlib http.server and http.client. No third-party dependency.

http.server.HTTPServer is single-threaded — one long-lived SSE client blocks every /rpc call behind it. With multiple clients in scope (REPL + Web UI + Telegram + Slack + WeChat + monitor + agent runners), this is a fast path to a hang.

Please specify ThreadingHTTPServer (also stdlib) and a per-client SSE concurrency cap (e.g. ≤ 64) so we have an explicit number to reason about under load. One sentence in §1 is enough.

§1 IPC › Event channel — keep-alive heartbeat

SSE behind NAT or a reverse proxy will be silently closed after idle timeouts (commonly 30–60s) and the client won't know until it tries to reconnect. Please specify a server-side heartbeat: a single SSE comment line (:\n\n) every 15–30s. Trivial to implement, prevents an entire class of "events stopped arriving" bug reports.
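
One way to get the heartbeat almost for free is to let the idle timeout of the event queue drive it. The sketch below assumes events arrive as single-line strings on a `queue.Queue`; `sse_frames` and `HEARTBEAT_SECONDS` are hypothetical names.

```python
import queue
from typing import Iterator, Optional

HEARTBEAT_SECONDS = 15.0  # assumption: low end of the 15-30s range suggested above


def sse_frames(events: "queue.Queue[Optional[str]]",
               heartbeat: float = HEARTBEAT_SECONDS) -> Iterator[bytes]:
    """Yield SSE frames, emitting a comment line whenever the queue stays idle."""
    while True:
        try:
            data = events.get(timeout=heartbeat)
        except queue.Empty:
            # Lines starting with ':' are SSE comments. Clients ignore them,
            # but the bytes keep NAT and proxy idle timers from firing.
            yield b":\n\n"
            continue
        if data is None:  # sentinel: stop streaming
            return
        # Assumes single-line payloads (e.g. compact JSON) with no newlines.
        yield f"data: {data}\n\n".encode("utf-8")
```

A handler would write each yielded frame to the response and flush after every write; a failed write on a heartbeat is also the earliest point at which the server learns the client is gone.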

§1 IPC › Method namespace — `session.send` semantics

JSON-RPC is request/response, but agent.run() is a stream. The note doesn't say what session.send returns or where the streamed text/tool events surface.

Please pick one explicitly:

(A, recommended) session.send returns { "turn_id": "...", "accepted_at": "..." } immediately; all subsequent text chunks, tool starts/ends, permission requests for that turn flow through /events tagged with the turn_id. Keeps /rpc purely synchronous.
(B) Hold the HTTP response open and write JSON-RPC notifications until the turn ends. Breaks JSON-RPC's single-response semantics; harder to debug.

Pinning this in the note avoids re-litigating it during the foundation PR.
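
For concreteness, variant A might look like the following. Only the `turn_id` and `accepted_at` fields come from the proposal above; `EventLog`, the event type strings, and the payload fields are hypothetical.

```python
import itertools
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List


class EventLog:
    """Stand-in for the stream that GET /events would serve."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.events: List[Dict[str, Any]] = []

    def publish(self, turn_id: str, kind: str, payload: Dict[str, Any]) -> None:
        # Every event carries the turn_id so subscribers can correlate it.
        self.events.append({"id": next(self._ids), "turn_id": turn_id,
                            "type": kind, **payload})


def session_send(log: EventLog, session_id: str, text: str) -> Dict[str, str]:
    """Variant A: acknowledge immediately; results arrive later via events."""
    turn_id = uuid.uuid4().hex
    # A real daemon would hand the turn to a worker here; this sketch only records it.
    log.publish(turn_id, "turn.accepted",
                {"session_id": session_id, "chars": len(text)})
    return {"turn_id": turn_id,
            "accepted_at": datetime.now(timezone.utc).isoformat()}
```

The RPC result is fully determined at call time, so /rpc stays a plain request/response endpoint and all streaming concerns live on the event channel.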

§2 Permission routing › `client_id` lifecycle

Originator disconnects mid-request — the request is held until timeout. On reconnect, the originator gets the request back via SSE replay scoped to its own pending requests…

Two questions the note doesn't answer:

Who issues client_id? If the daemon mints a new one on every connection, "REPL crashed and restarted" is a different originator and the held request is lost.
Does the client persist its client_id across process restarts?

Suggest: daemon mints client_id on first connection from a given client kind, returns it in the connect response; client persists it at ~/.cheetahclaws/clients/<kind>.id (mode 0600); subsequent connections present the saved id to resume the same originator identity. Worth a short subsection — this is load-bearing for the disconnect-then-reconnect flow you already designed.
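
The client side of that suggestion is small. In the sketch below the directory is passed in rather than hard-coded, the helper names are hypothetical, and `uuid4` merely stands in for the id the daemon would mint in its connect response.

```python
import os
import uuid
from pathlib import Path
from typing import Optional


def load_client_id(kind: str, base: Path) -> Optional[str]:
    """Return the saved client_id for this client kind, or None."""
    try:
        return (base / f"{kind}.id").read_text(encoding="utf-8").strip() or None
    except FileNotFoundError:
        return None


def save_client_id(kind: str, client_id: str, base: Path) -> Path:
    """Persist client_id with owner-only permissions (mode 0600)."""
    base.mkdir(parents=True, exist_ok=True)
    path = base / f"{kind}.id"
    # Creating with mode 0o600 avoids a window where the file is world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(client_id + "\n")
    os.chmod(path, 0o600)  # enforce the mode even if the file already existed
    return path


def connect_identity(kind: str, base: Path) -> str:
    """Reuse a saved id if present; otherwise persist a freshly minted one."""
    saved = load_client_id(kind, base)
    if saved:
        return saved
    minted = uuid.uuid4().hex  # assumption: in the real flow the daemon mints this
    save_client_id(kind, minted, base)
    return minted
```

A REPL that crashes and restarts would call `connect_identity("repl", Path.home() / ".cheetahclaws" / "clients")` and present the same id, so its held permission requests can be replayed to it.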

§2 Permission routing › interactive timeout default

Defaults: 5 min for unattended mode, unlimited for interactive modes.

"Unlimited" for interactive is unsafe in practice: REPL crashes, user walks away, laptop sleeps — the request sits forever, holding the agent turn open and (for /agent runners) potentially blocking the schedule.

Suggest 30 min default for interactive, configurable per-session, with permission.refresh_timeout RPC if a client wants to extend an active request. Matches what most users will subjectively expect ("if I don't answer in half an hour, just deny it").
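
A sketch of the deadline bookkeeping this implies. `PendingPermission` and its methods are hypothetical; only the 30 min and 5 min figures and the `permission.refresh_timeout` name come from the discussion above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

INTERACTIVE_TIMEOUT = 30 * 60  # seconds; the interactive default suggested above
UNATTENDED_TIMEOUT = 5 * 60    # seconds; the note's unattended default


@dataclass
class PendingPermission:
    request_id: str
    timeout: float = INTERACTIVE_TIMEOUT
    created: float = field(default_factory=time.monotonic)
    deadline: float = field(init=False)

    def __post_init__(self) -> None:
        self.deadline = self.created + self.timeout

    def expired(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return now >= self.deadline

    def refresh(self, now: Optional[float] = None) -> float:
        """What a permission.refresh_timeout RPC might do: restart the clock."""
        now = time.monotonic() if now is None else now
        if self.expired(now):
            raise TimeoutError(f"{self.request_id} already timed out")
        self.deadline = now + self.timeout
        return self.deadline
```

Using a monotonic clock keeps the deadline immune to wall-clock adjustments; how the clock should behave across a laptop sleep is a separate decision this sketch does not address.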

§3 Local auth › macOS peer-cred reality check

Daemon checks peer credentials on accept (SO_PEERCRED on Linux, LOCAL_PEERCRED on macOS) and rejects connections from a different UID.

Two practical issues:

Python's socket module doesn't expose LOCAL_PEERCRED directly — it needs getsockopt(SOL_LOCAL, LOCAL_PEERCRED) with SOL_LOCAL = 0 and a hand-parsed xucred struct. Real implementation hazard.
Older macOS versions have inconsistent behavior on LOCAL_PEERCRED for newly-accepted sockets.

Recommend: on macOS, lean on getpeereid() via ctypes (stable, returns just (uid, gid)), or accept that the 0600 socket + 0700 parent directory is the auth and document peer-cred as a Linux-only enhancement. Either is fine; just don't promise LOCAL_PEERCRED and discover the parsing pain in the foundation PR.
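
To give a sense of the size of the getpeereid() route, here is a hedged sketch covering both platforms. `peer_uid_gid` and `same_user` are hypothetical helpers; the macOS branch assumes `uid_t` and `gid_t` are 32-bit unsigned, and the Linux branch assumes the usual `struct ucred` layout (pid, uid, gid).

```python
import ctypes
import ctypes.util
import os
import socket
import struct
import sys
from typing import Tuple


def peer_uid_gid(conn: socket.socket) -> Tuple[int, int]:
    """Return (uid, gid) of the process on the other end of a Unix socket."""
    if sys.platform.startswith("linux"):
        # struct ucred { pid_t pid; uid_t uid; gid_t gid; }
        fmt = "iII"
        raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                              struct.calcsize(fmt))
        _pid, uid, gid = struct.unpack(fmt, raw)
        return uid, gid
    # macOS / BSD: int getpeereid(int fd, uid_t *euid, gid_t *egid) from libc.
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    uid, gid = ctypes.c_uint32(), ctypes.c_uint32()
    if libc.getpeereid(conn.fileno(), ctypes.byref(uid), ctypes.byref(gid)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return uid.value, gid.value


def same_user(conn: socket.socket) -> bool:
    """Accept only peers running under the daemon's own UID."""
    return peer_uid_gid(conn)[0] == os.getuid()
```

The daemon would call `same_user(conn)` right after `accept()` and close the connection on `False`; no xucred parsing is involved on either platform.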

§3 Local auth › audit log default for Unix socket

Off by default for the Unix socket (peer-cred-checked, low-noise). On by default for TCP.

Push back: flip Unix socket to on by default.

Authentication failures on a peer-cred-protected socket are by definition rare; the noise argument is weak.
The forensics value of "something tried to connect from a different UID" is high — that's exactly the kind of event you want logged, even (especially) when it's rare.
Cost is one rotated JSONL file with a few lines/day in steady state.

Suggest aligning here.

Note-wide › API versioning

The note doesn't say how clients negotiate protocol version. chauncygu asked for an RC one minor version before the default flip — that requires the protocol to be identifiable as v0 from the foundation PR onward, otherwise we have nothing to RC.

Pick one:

Cheetahclaws-Api-Version: 0 request header, daemon rejects mismatched majors with 426.
Method prefix: v0.session.send, v0.permission.answer.

I'd take the header — keeps method names readable. Either way, please bake it into §1 so the foundation PR ships with it instead of retrofitting.
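
The header option amounts to a few lines at the top of request handling. In this sketch `check_api_version` is a hypothetical helper; treating a missing header as 426 and a malformed one as 400 are assumptions the note would need to confirm.

```python
from typing import Mapping, Optional, Tuple

API_VERSION_HEADER = "Cheetahclaws-Api-Version"
SUPPORTED_MAJOR = 0


def check_api_version(headers: Mapping[str, str]) -> Tuple[int, Optional[str]]:
    """Return (http_status, error_message); 200 means the request may proceed."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(API_VERSION_HEADER.lower())
    if raw is None:
        # Assumption: a client that sends no version is asked to upgrade.
        return 426, f"missing {API_VERSION_HEADER} header"
    try:
        major = int(raw.strip().split(".", 1)[0])
    except ValueError:
        # Assumption: an unparseable value is a client error, not a mismatch.
        return 400, f"malformed {API_VERSION_HEADER}: {raw!r}"
    if major != SUPPORTED_MAJOR:
        return 426, f"unsupported API major {major}; daemon speaks {SUPPORTED_MAJOR}"
    return 200, None
```

Only the major component is compared, so a client sending `0.3` still passes against a v0 daemon.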

§1 IPC › Event channel — retention default

Replay is bounded by the daemon_events retention window (rolling, default 7 days / 1M rows).

7 days of token chunks + tool events for an active user is a non-trivial SQLite file and a heavy backfill on reconnect. For a local-machine daemon with no multi-host story, the use case is "REPL just disconnected, what did I miss" — that's hours, not days.

Suggest 24h / 100K rows default, configurable. Long-term archival is a separate concern (session_store already handles it).
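
A dual age-and-count bound is straightforward to enforce in SQLite. The sketch below assumes a `daemon_events` table with hypothetical `id`, `created_at`, and `payload` columns; only the table name and the 24h / 100K figures come from the discussion.

```python
import sqlite3
import time
from typing import Optional

MAX_AGE_SECONDS = 24 * 60 * 60  # suggested default: 24h
MAX_ROWS = 100_000              # suggested default: 100K rows

# Assumption: illustrative schema, not the daemon's actual one.
SCHEMA = """
CREATE TABLE IF NOT EXISTS daemon_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at REAL NOT NULL,
    payload TEXT NOT NULL
)
"""


def prune_events(db: sqlite3.Connection, now: Optional[float] = None,
                 max_age: float = MAX_AGE_SECONDS, max_rows: int = MAX_ROWS) -> int:
    """Delete rows past the age limit, then trim to the newest max_rows."""
    now = time.time() if now is None else now
    with db:  # both deletes commit together, or neither does
        aged = db.execute(
            "DELETE FROM daemon_events WHERE created_at < ?",
            (now - max_age,),
        ).rowcount
        overflow = db.execute(
            "DELETE FROM daemon_events WHERE id NOT IN "
            "(SELECT id FROM daemon_events ORDER BY id DESC LIMIT ?)",
            (max_rows,),
        ).rowcount
    return aged + overflow
```

Whichever bound is hit first wins, so a burst of token chunks cannot grow the file without limit even inside the 24h window.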

Smaller items / can be follow-ups

§1 IPC › `/events?since=<id>` — filtering semantics

filtered by the caller's auth — see §2

In a single-user daemon, "filtered by auth" needs a sentence of clarification. Is it:

per-originator (REPL doesn't see permission events targeted at the Telegram bridge), or
per-client-kind (Web UI sees everything; bridges only see their own session traffic), or
per-session (each subscriber names which sessions to follow)?

Pick the model and write one sentence. The current phrasing reads as "auth-filtered" but in single-user there's only one identity, which is what's confusing.

Note-wide › binary payloads

Out of v1 scope, but worth one defensive sentence so we don't paint into a corner: image/audio/file payloads go out-of-band (e.g. future /blobs endpoint with daemon-issued URLs), not inline-base64 over SSE. Keeps the event channel small and avoids buffer-bloat decisions later.

§3 Local auth › metrics-endpoint redaction

--unauthenticated-metrics is the right escape hatch for Prometheus, but please add a one-liner: even with auth on, /metrics and /healthz payloads must never include token fragments, full session IDs, or rationale strings. They're the safest endpoint to leak by misconfiguration; design out the leakage.

Once 1–9 are addressed in the doc, this is good. Thanks for the careful write-up.

@chauncygu

Builds on @chauncygu's spike branch (feature/daemon-spike, e980cdb) to ship the foundation runtime per the roadmap in docs/RFC/0002-daemon-foundation-roadmap.md. No service has been migrated yet; that is F-3 through F-8 work.

Spike modules kept as-is (encode the wire contract reviewed in #74): cc_daemon/__init__.py, auth.py, events.py, methods.py, originator.py, permission.py, rpc.py, spike_client.py

Spike modules patched (per labor split agreed in #68; server is not on the keep-as-is list, but changes are minimal patches, not rewrites):
- cc_daemon/server.py: Windows guard around UnixStreamServer; DaemonState gains unauthenticated_metrics + config kwargs; /healthz /readyz /metrics route through health.payload_for(...); DaemonState registers system_methods alongside spike methods.
- cc_daemon/cli.py: rewritten to expose serve_main(argv) for the new `cheetahclaws serve` surface; legacy `python -m cc_daemon.cli {serve|status|stop|logs|rotate-token}` entry preserved as backward-compat for spike-notes commands.

Foundation glue (added):
- cc_daemon/discovery.py: atomic ~/.cheetahclaws/daemon.json so REPL/Web/bridge clients can locate the daemon (transport, address, version) without parsing CLI args. Pid_alive cross-platform.
- cc_daemon/system_methods.py: system.ping (RFC contract name) and system.shutdown (sets DaemonState.shutdown_event for cross-platform graceful exit; Windows can't deliver SIGTERM cleanly to another Python process).
- commands/daemon_cmd.py: `cheetahclaws daemon {status, stop, logs, rotate-token}`. Uses Cheetahclaws-Api-Version header on every RPC.
- cheetahclaws.py: main() short-circuit for `serve`, `daemon`, plus backward-compat alias for `cheetahclaws spike-daemon ...`.
- health.py: extracted module-level healthz_payload(config) / readyz_payload(config) / metrics_payload(config) / payload_for(path, config) so both the standalone health server and the daemon listener reuse the same circuit-breaker / quota / runtime-registry probes. Existing health_check_port behavior unchanged.
- docs/RFC/0002-daemon-foundation-roadmap.md: F-1..F-9 PR breakdown with per-PR acceptance criteria.
- docs/architecture.md: Daemon section pointing at cc_daemon modules and the foundation glue.

Tests (72 new, 13 spike untouched, all green):
- tests/test_cc_daemon_discovery.py (16): unit, write/read/locate
- tests/test_cc_daemon_system_methods.py (8): unit, ping/shutdown
- tests/test_daemon_cmd.py (14): unit, dispatch + tail + rotate
- tests/test_health_payloads.py (8): unit, real health.py wiring
- tests/e2e_daemon_skeleton.py (13): subprocess, real boot, discovery, RPC, auth, /events SSE heartbeat, all daemon subcommands
- tests/test_daemon_spike.py (13): chauncygu's, untouched

Behavior change worth flagging:
- /healthz, /readyz, /metrics are now auth-gated by default per RFC 0001 §3. Spike returned them unauthenticated; we route through health.payload_for(...) and require Authorization (or peer-cred on UDS). Opt out with `--unauthenticated-metrics` for Prometheus scrapers.

What is NOT in F-1 (intentional, per roadmap):
- agent.run integration (no real session.send): F-3+
- Bridges in daemon (Telegram/Slack/WeChat): F-6/F-7/F-8
- monitor/scheduler in daemon: F-3
- agent_runner subprocess-per-agent: F-4
- SQLite event persistence: F-2
- Cost guardrail conservative defaults: F-9
- macOS peer-cred: TODO left in cc_daemon/auth.py from spike

Refs #68, #74

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
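
The atomic discovery file mentioned in the F-1 commit message is typically done with a temp file plus rename. The sketch below illustrates that general technique and is not the contents of cc_daemon/discovery.py; `write_discovery` and `read_discovery` are hypothetical names, and the example field values are made up.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def write_discovery(path: Path, info: Dict[str, Any]) -> None:
    """Write JSON so readers see the old file or the new one, never a partial."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory for the rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(info, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic within a single filesystem
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def read_discovery(path: Path) -> Optional[Dict[str, Any]]:
    """Return the parsed discovery record, or None if absent or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

A client would call `read_discovery(Path.home() / ".cheetahclaws" / "daemon.json")` and fall back to its own defaults when the result is `None`.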

…ut flush, test floor

Four nits surfaced while smoke-testing F-1 (PR SafeRL-Lab#80) on real subprocesses. None are blockers, but they each break a docs-promised path so they deserve a dedicated polish PR rather than getting buried in F-2's diff.

1. `cheetahclaws daemon {status, stop, rotate-token}` now read the token path from the discovery file when `serve` was started with `--token-path`. `discovery.make_info()` accepts a new optional `token_path` keyword (schema stays at 1; the change is additive); `cli.cmd_serve()` records it only when the path overrides the default; new `commands.daemon_cmd._resolve_token_path()` prefers discovery and falls back to the default location for old discovery files / unset case. Previously these verbs always read `~/.cheetahclaws/daemon_token`, which silently created a *different* random token, then failed 401 against the running daemon.
2. `python -m cc_daemon.cli --help` (and the `cheetahclaws spike-daemon --help` backward-compat alias) now print a usage banner and exit 0 instead of `unknown subcommand: --help` / exit 2. The unknown-subcommand branch also includes the banner so users see how to recover. The PR description for SafeRL-Lab#80 said the spike-daemon alias was "preserved"; this closes the gap.
3. The three serve-mode startup prints (`token: …`, `cheetahclaws daemon listening on …`, `audit log: …`) now `flush=True` so they appear immediately when stdout is redirected to a file under `&`. Previously they sat in Python's 4 KB block buffer until the daemon exited, silently breaking the spike-notes' `--print-token > out.log &` workflow because the token line never reached disk.
4. `tests/e2e_daemon_skeleton.py::test_daemon_writes_discovery_and_token` token-length floor raised from `>= 32` to `>= 40`. `secrets.token_urlsafe(32)` yields ~43 chars, so the previous floor was loose enough that an accidental shrink to 16 raw bytes (~22 chars) would still ship green.

Tests: 10 new unit tests (4 covering `cli.main` dispatch, 4 covering `_resolve_token_path`, 2 covering `discovery.make_info`'s new field). Full suite 669/669 passing on `main`. End-to-end smoke verified all three runtime fixes against `cheetahclaws serve --listen tcp://...`.

Docs:
* `README.md` docs index: adds row for RFC 0002 (foundation roadmap); refreshes the spike row to reflect the actual landing path (SafeRL-Lab#77 → reverted → re-landed via SafeRL-Lab#81); marks F-1 as merged via SafeRL-Lab#80.
* `docs/RFC/0002-daemon-foundation-roadmap.md`: F-1 status `OPEN` → `MERGED SafeRL-Lab#80`.
* `docs/architecture.md`: daemon section now mentions the optional `token_path` discovery field and notes that the daemon-control verbs use it.
* `docs/news.md`: May 2, 2026 entry covering the spike re-land and F-1 merge sequence, the polish nits, and the intentional "not in F-1" list (agent.run, bridges, SQLite, cost guardrails, agent-runner subprocess, macOS peer-cred).

Refs SafeRL-Lab#68, SafeRL-Lab#74, SafeRL-Lab#80, SafeRL-Lab#81

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
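
The token-length arithmetic behind the raised test floor in the polish commit above can be checked directly. `secrets.token_urlsafe(n)` returns unpadded URL-safe base64, so its length is ceil(4n/3); the helper names and `MIN_TOKEN_CHARS` below are hypothetical.

```python
import math
import secrets

MIN_TOKEN_CHARS = 40  # the raised floor described in the commit message


def expected_chars(nbytes: int) -> int:
    """Length of unpadded base64 for nbytes of input."""
    return math.ceil(nbytes * 4 / 3)


def token_length(nbytes: int) -> int:
    """Actual length of a freshly generated token_urlsafe(nbytes) string."""
    return len(secrets.token_urlsafe(nbytes))
```

For 32 bytes this gives 43 characters and for 16 bytes it gives 22, so a floor of 40 passes the intended size and fails the accidental shrink, while the old floor of 32 would only have caught tokens shorter than that.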

Adds an isolated cc_daemon/ package (~1.1k LoC across 9 files, plus 360 lines of pytest) that implements the daemon contract surface defined in docs/RFC/0001-daemon-design-note.md (PR SafeRL-Lab#74).

This is a SPIKE: it exists to lock down the IPC + permission routing + local auth contract in runnable code so the upcoming foundation PR has a verified starting point. Nothing in cc_daemon/ is load-bearing for production.

Covers the ✓ rows of the RFC review-comment matrix: ThreadingHTTPServer with a non-default request_queue_size, 15s SSE heartbeat, client_id mint/persist/resume, sync RPC + async events (variant A of session.send), Cheetahclaws-Api-Version: 0 → 426 on mismatch, bounded event ring buffer with gap event on overflow, audit log default-on for both transports, 30 min interactive permission timeout with permission.refresh_timeout RPC, and originator-only permission.answer returning HTTP 403 / -32001 to non-originators.

Out of scope: agent.run integration, bridges migration, SQLite event store, cost guardrails, agent-runner subprocess isolation, /metrics, macOS peer-cred (TODO(macos) left in cc_daemon/auth.py).

Main code touches are minimal and isolated: a 6-line subcommand shim in cheetahclaws.py that intercepts `cheetahclaws spike-daemon ...` before the main argparse runs, plus one cc_daemon entry in pyproject.toml's package list. Existing CLI flags (--version, --help, prompt parsing) are unchanged.

Tests: 593 passing (580 existing + 13 new). Run the new suite with `pytest tests/test_daemon_spike.py -v`. Manual smoke and originator-routing demo in docs/RFC/0001-spike-notes.md "How to run it".

Refs SafeRL-Lab#68, SafeRL-Lab#74

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mxh1999 mentioned this pull request Apr 29, 2026

[Question] Should /monitor, /agent, bridges survive REPL exit? #68

Open

mxh1999 force-pushed the daemon-design-note branch from a1584ed to ac18a70 Compare April 29, 2026 12:18

chauncygu merged commit 2899254 into SafeRL-Lab:main Apr 29, 2026
6 checks passed

chauncygu mentioned this pull request Apr 30, 2026

feat(daemon): cc_daemon spike validating RFC 0001 contract end-to-end #77

Merged

mxh1999 mentioned this pull request May 2, 2026

feat(daemon): F-1 foundation — discovery, system methods, daemon CLI, health, e2e (on top of cc_daemon spike) #80

Merged

chauncygu mentioned this pull request May 2, 2026

Re-land cc_daemon spike for #80 (un-revert 3183fc6) Revert "Revert "Merge pull request #77 from SafeRL-Lab/feature/daemon… #81

Merged

chauncygu mentioned this pull request May 2, 2026

fix(daemon): F-1 polish — token_path discovery, --help dispatch, stdo… #82

Merged

chauncygu merged 1 commit into SafeRL-Lab:main from mxh1999:daemon-design-note
