
docs: daemon foundation design note (IPC, permission routing, local auth) #74

Merged
chauncygu merged 1 commit into SafeRL-Lab:main from mxh1999:daemon-design-note
Apr 29, 2026

Conversation

@mxh1999
Contributor

@mxh1999 mxh1999 commented Apr 29, 2026

Covers the three items @chauncygu requested in #68 before the foundation PR lands.

Scope is intentionally narrow — service inventory, phasing, persistence, and cost guardrail defaults were settled in the issue thread and are not re-litigated here.

What's in this PR

A single new doc: docs/RFC/0001-daemon-design-note.md (164 lines).

Three sections:

  1. IPC — Unix-socket default with optional TCP (--listen tcp://); HTTP/1.1 framing on top; JSON-RPC 2.0 for the data plane (POST /rpc), SSE for events (GET /events); existing /healthz /readyz /metrics unchanged.
  2. Permission routing — every PermissionRequest carries an originator; only the originator may answer (other clients see read-only via /events). Fixes the first-answer-wins race called out in the review of #68 ("[Question] Should /monitor, /agent, bridges survive REPL exit?").
  3. Local auth — explicitly framed as a security boundary, not a multi-user feature. Peer credentials (SO_PEERCRED / LOCAL_PEERCRED) on the Unix socket; bearer token on TCP. TLS out of scope; reverse-proxy recipe documented instead.
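
To make the routing rule in item 2 concrete, here is a minimal sketch rather than the note's actual implementation. The class and method names are hypothetical; the -32001 code is the one named in the later spike commit message on this page, and attaching it to the exception this way is an assumption.

```python
class NotOriginatorError(Exception):
    """Raised when a client other than the originator tries to answer."""

    code = -32001  # code named in the spike commit message; usage here is assumed


class PermissionRouter:
    """Tracks pending permission requests and who may answer each one."""

    def __init__(self):
        self._pending = {}  # request_id -> originator client_id

    def open_request(self, request_id: str, originator: str) -> None:
        self._pending[request_id] = originator

    def answer(self, request_id: str, client_id: str, allow: bool) -> bool:
        # KeyError here means the request is unknown or already answered.
        originator = self._pending[request_id]
        if client_id != originator:
            raise NotOriginatorError(
                f"{client_id} is not the originator of {request_id}"
            )
        del self._pending[request_id]
        return allow
```

Because a rejected answer leaves the request pending, a non-originator cannot consume or race the originator's decision.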

A short "Related decisions" section anchors the items already settled in #68 (subprocess-per-agent, bridges in foundation, cost defaults, API RC window) so reviewers know what's not up for debate here.

Open questions flagged in the doc

  1. HTTP-on-socket vs raw newline-delimited JSON-RPC.
  2. Whether an autonomous agent_runner is its own originator class or whether the configured bridge is the originator for those requests.
  3. Audit log default for the Unix socket (off vs always-on).

Happy to discuss inline. Once these are resolved, the foundation PR follows.

Refs #68

…al auth)

Covers the three items requested by @chauncygu in SafeRL-Lab#68 before the foundation
PR lands. Scope is intentionally narrow — service inventory, phasing,
persistence, and cost guardrail defaults were settled in the issue thread
and are not re-litigated here.

Sections:
1. IPC — Unix-socket default, optional TCP, HTTP/1.1 + JSON-RPC + SSE
2. Permission routing — originator-bound, fixes the first-answer-wins race
3. Local auth — peer-cred on Unix socket, bearer token on TCP, threat
   model explicitly single-user single-host (security boundary, not
   multi-user feature)

Three open questions flagged for review.

Refs SafeRL-Lab#68
@mxh1999 mxh1999 force-pushed the daemon-design-note branch from a1584ed to ac18a70 on April 29, 2026 12:18
@chauncygu chauncygu merged commit 2899254 into SafeRL-Lab:main Apr 29, 2026
6 checks passed
@chauncygu
Contributor

Hi @mxh1999

This is a solid note overall. The originator-based permission routing in §2 is the strongest part and resolves a real race that the earlier "first-answer-wins" draft would have shipped. Threat model is realistic, defaults are concrete, and the Open Questions section makes the trade-offs reviewable instead of buried.

I'd like a few items in the document itself before we accept it as the foundation PR's contract — most are one- or two-line additions. Comments below.

Must address before accept (1–9): threading model, SSE heartbeat, client_id lifecycle, session.send semantics, macOS peer-cred reality, API versioning, event retention default, audit-log default flip, interactive permission timeout.

Can land as follow-up checklist (10–12): /events filter semantics, binary payload story, metrics-endpoint redaction.


Inline comments (anchor each to the listed section)

§1 IPC › Protocol — threading model

Reuses stdlib http.server and http.client. No third-party dependency.

http.server.HTTPServer is single-threaded — one long-lived SSE client blocks every /rpc call behind it. With multiple clients in scope (REPL + Web UI + Telegram + Slack + WeChat + monitor + agent runners), this is a fast path to a hang.

Please specify ThreadingHTTPServer (also stdlib) and a per-client SSE concurrency cap (e.g. ≤ 64) so we have an explicit number to reason about under load. One sentence in §1 is enough.
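
A rough sketch of what that sentence could translate to, using only the stdlib. It reads the cap as a process-wide limit on concurrent /events streams, which is one possible interpretation; `SSE_MAX_CLIENTS`, the helper names, and the 503 response are all assumptions and not part of the note.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

SSE_MAX_CLIENTS = 64  # hypothetical cap, taken from the "e.g. <= 64" suggestion
_sse_slots = threading.BoundedSemaphore(SSE_MAX_CLIENTS)


def try_acquire_sse_slot() -> bool:
    """Claim one streaming slot without blocking; False when the cap is hit."""
    return _sse_slots.acquire(blocking=False)


def release_sse_slot() -> None:
    _sse_slots.release()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/events"):
            self.send_error(404)
            return
        if not try_acquire_sse_slot():
            self.send_error(503, "too many event subscribers")
            return
        try:
            self.send_response(200)
            self.send_header("Content-Type", "text/event-stream")
            self.end_headers()
            self.wfile.write(b":\n\n")  # real code would keep streaming here
        finally:
            release_sse_slot()


def make_server(host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    # One thread per connection, so a long-lived SSE stream cannot block /rpc.
    return ThreadingHTTPServer((host, port), Handler)
```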

§1 IPC › Event channel — keep-alive heartbeat

SSE behind NAT or a reverse proxy will be silently closed after idle timeouts (commonly 30–60s) and the client won't know until it tries to reconnect. Please specify a server-side heartbeat: a single SSE comment line (:\n\n) every 15–30s. Trivial to implement, prevents an entire class of "events stopped arriving" bug reports.
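
For illustration, the heartbeat can fall out of a queue read with a timeout. This is a sketch under assumptions: the function names and the `(event_id, data)` queue shape are invented here, and `data` is assumed to be single-line (for example compact JSON).

```python
import queue

HEARTBEAT = b":\n\n"  # SSE comment line; clients ignore it, proxies see traffic


def format_event(event_id: int, data: str) -> bytes:
    """Encode one event as an SSE frame. Assumes data contains no newlines."""
    return f"id: {event_id}\ndata: {data}\n\n".encode("utf-8")


def next_frame(events: "queue.Queue", heartbeat_interval: float = 15.0) -> bytes:
    """Return the next event frame, or a heartbeat if none arrives in time."""
    try:
        event_id, data = events.get(timeout=heartbeat_interval)
    except queue.Empty:
        return HEARTBEAT
    return format_event(event_id, data)
```

A handler loop would write each returned frame to the response and flush, so an idle stream still emits bytes every `heartbeat_interval` seconds.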

§1 IPC › Method namespace — session.send semantics

JSON-RPC is request/response, but agent.run() is a stream. The note doesn't say what session.send returns or where the streamed text/tool events surface.

Please pick one explicitly:

  • (A, recommended) session.send returns { "turn_id": "...", "accepted_at": "..." } immediately; all subsequent text chunks, tool starts/ends, permission requests for that turn flow through /events tagged with the turn_id. Keeps /rpc purely synchronous.
  • (B) Hold the HTTP response open and write JSON-RPC notifications until the turn ends. Breaks JSON-RPC's single-response semantics; harder to debug.

Pinning this in the note avoids re-litigating it during the foundation PR.
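
A minimal sketch of option (A), to show how small the synchronous part can stay. The handler names, the `publish` callback, and the `turn.accepted` event type are hypothetical; only the `turn_id` and `accepted_at` result fields come from the proposal above.

```python
import datetime
import uuid


def handle_session_send(params: dict, publish) -> dict:
    """Accept a turn and return at once; all output is published as events."""
    turn_id = uuid.uuid4().hex
    accepted_at = datetime.datetime.now(datetime.timezone.utc).isoformat()
    # Everything after this point would reach clients via /events, tagged turn_id.
    publish({
        "type": "turn.accepted",  # hypothetical event name
        "turn_id": turn_id,
        "session_id": params.get("session_id"),
    })
    return {"turn_id": turn_id, "accepted_at": accepted_at}


def rpc_response(request: dict, publish) -> dict:
    """Wrap the result in a single JSON-RPC 2.0 response object."""
    result = handle_session_send(request.get("params", {}), publish)
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}
```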

§2 Permission routing › client_id lifecycle

  1. Originator disconnects mid-request — the request is held until timeout. On reconnect, the originator gets the request back via SSE replay scoped to its own pending requests…

Two questions the note doesn't answer:

  1. Who issues client_id? If the daemon mints a new one on every connection, "REPL crashed and restarted" is a different originator and the held request is lost.
  2. Does the client persist its client_id across process restarts?

Suggest: daemon mints client_id on first connection from a given client kind, returns it in the connect response; client persists it at ~/.cheetahclaws/clients/<kind>.id (mode 0600); subsequent connections present the saved id to resume the same originator identity. Worth a short subsection — this is load-bearing for the disconnect-then-reconnect flow you already designed.
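
The client side of that suggestion might look roughly like the sketch below. The function names are hypothetical, `mint` stands in for whatever the daemon returns in its connect response, and `base_dir` would be `~/.cheetahclaws` in practice.

```python
import os
import uuid
from pathlib import Path


def client_id_path(base_dir: Path, kind: str) -> Path:
    return base_dir / "clients" / f"{kind}.id"


def resume_or_mint(base_dir: Path, kind: str, mint=lambda: uuid.uuid4().hex) -> str:
    """Reuse the saved client_id for this kind, or store a freshly minted one."""
    path = client_id_path(base_dir, kind)
    try:
        saved = path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        saved = ""
    if saved:
        return saved
    client_id = mint()  # placeholder for the id returned by the daemon
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with mode 0600 from the start instead of chmod-ing later.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(client_id + "\n")
    return client_id
```

With this shape, a REPL that crashes and restarts presents the same id and can reclaim its held permission request.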

§2 Permission routing › interactive timeout default

Defaults: 5 min for unattended mode, unlimited for interactive modes.

"Unlimited" for interactive is unsafe in practice: REPL crashes, user walks away, laptop sleeps — the request sits forever, holding the agent turn open and (for /agent runners) potentially blocking the schedule.

Suggest 30 min default for interactive, configurable per-session, with permission.refresh_timeout RPC if a client wants to extend an active request. Matches what most users will subjectively expect ("if I don't answer in half an hour, just deny it").
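
As a sketch of the proposed behaviour (class and field names are hypothetical, and time is passed in explicitly to keep it testable), a pending request only needs a deadline plus a refresh that pushes it out by one timeout period:

```python
from dataclasses import dataclass

INTERACTIVE_TIMEOUT_S = 30 * 60  # proposed interactive default
UNATTENDED_TIMEOUT_S = 5 * 60    # default quoted from the note


@dataclass
class PendingPermission:
    request_id: str
    timeout_s: float
    deadline: float

    @classmethod
    def open(cls, request_id: str, now: float, interactive: bool) -> "PendingPermission":
        timeout = INTERACTIVE_TIMEOUT_S if interactive else UNATTENDED_TIMEOUT_S
        return cls(request_id, timeout, now + timeout)

    def expired(self, now: float) -> bool:
        return now >= self.deadline

    def refresh(self, now: float) -> float:
        """What a permission.refresh_timeout call could do for a live request."""
        if self.expired(now):
            raise TimeoutError(f"{self.request_id} already timed out")
        self.deadline = now + self.timeout_s
        return self.deadline
```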

§3 Local auth › macOS peer-cred reality check

Daemon checks peer credentials on accept (SO_PEERCRED on Linux, LOCAL_PEERCRED on macOS) and rejects connections from a different UID.

Two practical issues:

  1. Python's socket module doesn't expose LOCAL_PEERCRED directly — it needs getsockopt(SOL_LOCAL, LOCAL_PEERCRED) with SOL_LOCAL = 0 and a hand-parsed xucred struct. Real implementation hazard.
  2. Older macOS versions have inconsistent behavior on LOCAL_PEERCRED for newly-accepted sockets.

Recommend: on macOS, lean on getpeereid() via ctypes (stable, returns just (uid, gid)), or accept that the 0600 socket + 0700 parent directory is the auth and document peer-cred as Linux-only enhancement. Either is fine; just don't promise LOCAL_PEERCRED and discover the parsing pain in the foundation PR.
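
A sketch of the first recommendation, with assumptions flagged: the helper names are invented, the Linux branch unpacks `struct ucred` as (pid, uid, gid), and the fallback assumes `uid_t`/`gid_t` are 32-bit unsigned, which holds on macOS. It has not been validated against older macOS releases.

```python
import ctypes
import ctypes.util
import os
import socket
import struct


def peer_uid_gid(conn: socket.socket) -> tuple:
    """Return (uid, gid) of the process on the other end of a Unix socket."""
    if hasattr(socket, "SO_PEERCRED"):
        # Linux: struct ucred { pid_t pid; uid_t uid; gid_t gid; }
        fmt = "iII"
        raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(fmt))
        _pid, uid, gid = struct.unpack(fmt, raw)
        return uid, gid
    # macOS/BSD: getpeereid() avoids hand-parsing the xucred struct.
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    uid = ctypes.c_uint32()
    gid = ctypes.c_uint32()
    if libc.getpeereid(conn.fileno(), ctypes.byref(uid), ctypes.byref(gid)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return uid.value, gid.value


def same_user(conn: socket.socket) -> bool:
    """Accept only peers running as the daemon's own effective UID."""
    uid, _gid = peer_uid_gid(conn)
    return uid == os.geteuid()
```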

§3 Local auth › audit log default for Unix socket

Off by default for the Unix socket (peer-cred-checked, low-noise). On by default for TCP.

Push back: flip Unix socket to on by default.

  • Authentication failures on a peer-cred-protected socket are by definition rare; the noise argument is weak.
  • The forensics value of "something tried to connect from a different UID" is high — that's exactly the kind of event you want logged, even (especially) when it's rare.
  • Cost is one rotated JSONL file with a few lines/day in steady state.

Suggest aligning here.

Note-wide › API versioning

The note doesn't say how clients negotiate protocol version. chauncygu asked for an RC one minor version before the default flip — that requires the protocol to be identifiable as v0 from the foundation PR onward, otherwise we have nothing to RC.

Pick one:

  • Cheetahclaws-Api-Version: 0 request header, daemon rejects mismatched majors with 426.
  • Method prefix: v0.session.send, v0.permission.answer.

I'd take the header — keeps method names readable. Either way, please bake it into §1 so the foundation PR ships with it instead of retrofitting.
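
For the header option, the check itself is tiny. In this sketch the function name is hypothetical, and treating a missing or unparseable header as a mismatch (also 426) is an assumption that the note would need to confirm.

```python
API_VERSION_HEADER = "Cheetahclaws-Api-Version"
SUPPORTED_MAJOR = 0


def check_api_version(headers) -> int:
    """Return 200 if the request's major version is supported, else 426."""
    raw = headers.get(API_VERSION_HEADER)
    if raw is None:
        return 426  # assumption: a missing header is rejected like a mismatch
    major_text = raw.strip().split(".", 1)[0]
    try:
        major = int(major_text)
    except ValueError:
        return 426
    return 200 if major == SUPPORTED_MAJOR else 426
```

With `http.server`, `self.headers` can be passed directly; its lookup is case-insensitive, unlike the plain dict used in quick tests.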

§1 IPC › Event channel — retention default

Replay is bounded by the daemon_events retention window (rolling, default 7 days / 1M rows).

7 days of token chunks + tool events for an active user is a non-trivial SQLite file and a heavy backfill on reconnect. For a local-machine daemon with no multi-host story, the use case is "REPL just disconnected, what did I miss" — that's hours, not days.

Suggest 24h / 100K rows default, configurable. Long-term archival is a separate concern (session_store already handles it).
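
A sketch of how the two bounds could be enforced together. The `daemon_events` table name comes from the note; the column layout (`id`, `ts`, `payload`) and the function names are assumptions made for the example.

```python
import sqlite3


def create_schema(db: sqlite3.Connection) -> None:
    # Hypothetical layout; only the table name is taken from the note.
    db.execute(
        "CREATE TABLE IF NOT EXISTS daemon_events "
        "(id INTEGER PRIMARY KEY, ts REAL NOT NULL, payload TEXT NOT NULL)"
    )


def prune_events(db: sqlite3.Connection, now: float,
                 max_age_s: float = 24 * 3600, max_rows: int = 100_000) -> int:
    """Drop events older than max_age_s, then keep only the newest max_rows."""
    removed = db.execute(
        "DELETE FROM daemon_events WHERE ts < ?", (now - max_age_s,)
    ).rowcount
    removed += db.execute(
        "DELETE FROM daemon_events WHERE id NOT IN "
        "(SELECT id FROM daemon_events ORDER BY id DESC LIMIT ?)",
        (max_rows,),
    ).rowcount
    db.commit()
    return removed
```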


Smaller items / can be follow-ups

§1 IPC › /events?since=<id> — filtering semantics

filtered by the caller's auth — see §2

In a single-user daemon, "filtered by auth" needs a sentence of clarification. Is it:

  • per-originator (REPL doesn't see permission events targeted at the Telegram bridge), or
  • per-client-kind (Web UI sees everything; bridges only see their own session traffic), or
  • per-session (each subscriber names which sessions to follow)?

Pick the model and write one sentence. The current phrasing reads as "auth-filtered" but in single-user there's only one identity, which is what's confusing.

Note-wide › binary payloads

Out of v1 scope, but worth one defensive sentence so we don't paint into a corner: image/audio/file payloads go out-of-band (e.g. future /blobs endpoint with daemon-issued URLs), not inline-base64 over SSE. Keeps the event channel small and avoids buffer-bloat decisions later.

§3 Local auth › metrics-endpoint redaction

--unauthenticated-metrics is the right escape hatch for Prometheus, but please add a one-liner: even with auth on, /metrics and /healthz payloads must never include token fragments, full session IDs, or rationale strings. They're the safest endpoint to leak by misconfiguration; design out the leakage.


Once 1–9 are addressed in the doc, this is good. Thanks for the careful write-up.

chauncygu pushed a commit that referenced this pull request May 2, 2026
Builds on @chauncygu's spike branch (feature/daemon-spike,
e980cdb) to ship the foundation runtime per the roadmap in
docs/RFC/0002-daemon-foundation-roadmap.md.  No service has been
migrated yet — that's F-3 through F-8 work.

Spike modules kept as-is (encode the wire contract reviewed in #74):
  cc_daemon/__init__.py, auth.py, events.py, methods.py,
  originator.py, permission.py, rpc.py, spike_client.py

Spike modules patched (per labor split agreed in #68 — server is not
on the keep-as-is list, but changes are minimal patches not rewrites):
  - cc_daemon/server.py: Windows guard around UnixStreamServer;
    DaemonState gains unauthenticated_metrics + config kwargs;
    /healthz /readyz /metrics route through health.payload_for(...);
    DaemonState registers system_methods alongside spike methods.
  - cc_daemon/cli.py: rewritten to expose serve_main(argv) for the
    new `cheetahclaws serve` surface; legacy `python -m cc_daemon.cli
    {serve|status|stop|logs|rotate-token}` entry preserved as
    backward-compat for spike-notes commands.

Foundation glue (added):
  - cc_daemon/discovery.py — atomic ~/.cheetahclaws/daemon.json so
    REPL/Web/bridge clients can locate the daemon (transport, address,
    version) without parsing CLI args.  Pid_alive cross-platform.
  - cc_daemon/system_methods.py — system.ping (RFC contract name) and
    system.shutdown (sets DaemonState.shutdown_event for cross-platform
    graceful exit; Windows can't deliver SIGTERM cleanly to another
    Python process).
  - commands/daemon_cmd.py — `cheetahclaws daemon {status, stop, logs,
    rotate-token}`.  Uses Cheetahclaws-Api-Version header on every RPC.
  - cheetahclaws.py — main() short-circuit for `serve`, `daemon`, plus
    backward-compat alias for `cheetahclaws spike-daemon ...`.
  - health.py — extracted module-level healthz_payload(config) /
    readyz_payload(config) / metrics_payload(config) /
    payload_for(path, config) so both the standalone health server and
    the daemon listener reuse the same circuit-breaker / quota /
    runtime-registry probes.  Existing health_check_port behavior
    unchanged.
  - docs/RFC/0002-daemon-foundation-roadmap.md — F-1..F-9 PR breakdown
    with per-PR acceptance criteria.
  - docs/architecture.md — Daemon section pointing at cc_daemon
    modules and the foundation glue.
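
For readers unfamiliar with the "atomic" discovery file mentioned above, the usual pattern is write-to-temp then rename. The sketch below illustrates that pattern only; the function names are hypothetical and it is not the code in cc_daemon/discovery.py.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_discovery(path: Path, info: dict) -> None:
    """Write info as JSON so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(info, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_discovery(path: Path) -> Optional[dict]:
    """Return the parsed discovery info, or None if absent or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```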

Tests (72 new, 13 spike untouched, all green):
  - tests/test_cc_daemon_discovery.py   (16) — unit, write/read/locate
  - tests/test_cc_daemon_system_methods.py (8) — unit, ping/shutdown
  - tests/test_daemon_cmd.py           (14) — unit, dispatch + tail + rotate
  - tests/test_health_payloads.py       (8) — unit, real health.py wiring
  - tests/e2e_daemon_skeleton.py       (13) — subprocess: real boot,
    discovery, RPC, auth, /events SSE heartbeat, all daemon subcommands
  - tests/test_daemon_spike.py         (13) — chauncygu's, untouched

Behavior change worth flagging:
  - /healthz, /readyz, /metrics are now auth-gated by default per RFC
    0001 §3.  Spike returned them unauthenticated; we route through
    health.payload_for(...) and require Authorization (or peer-cred
    on UDS).  Opt out with `--unauthenticated-metrics` for Prometheus
    scrapers.

What is NOT in F-1 (intentional, per roadmap):
  - agent.run integration (no real session.send) — F-3+
  - Bridges in daemon (Telegram/Slack/WeChat) — F-6/F-7/F-8
  - monitor/scheduler in daemon — F-3
  - agent_runner subprocess-per-agent — F-4
  - SQLite event persistence — F-2
  - Cost guardrail conservative defaults — F-9
  - macOS peer-cred — TODO left in cc_daemon/auth.py from spike

Refs #68, #74

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dsqueri-temeriteecorp pushed a commit to dsqueri-temeriteecorp/cheetahclaws that referenced this pull request May 4, 2026
…ut flush, test floor

Four nits surfaced while smoke-testing F-1 (PR SafeRL-Lab#80) on real subprocesses.
None are blockers, but they each break a docs-promised path so they
deserve a dedicated polish PR rather than getting buried in F-2's diff.

1. `cheetahclaws daemon {status, stop, rotate-token}` now read the token
   path from the discovery file when `serve` was started with
   `--token-path`. `discovery.make_info()` accepts a new optional
   `token_path` keyword (schema stays at 1 — additive); `cli.cmd_serve()`
   records it only when the path overrides the default; new
   `commands.daemon_cmd._resolve_token_path()` prefers discovery and falls
   back to the default location for old discovery files / unset case.
   Previously these verbs always read `~/.cheetahclaws/daemon_token`,
   which silently created a *different* random token, then failed 401
   against the running daemon.

2. `python -m cc_daemon.cli --help` (and the `cheetahclaws spike-daemon
   --help` backward-compat alias) now print a usage banner and exit 0
   instead of `unknown subcommand: --help` / exit 2. The unknown-
   subcommand branch also includes the banner so users see how to
   recover. The PR description for SafeRL-Lab#80 said the spike-daemon alias was
   "preserved" — this closes the gap.

3. The three serve-mode startup prints (`token: …`, `cheetahclaws daemon
   listening on …`, `audit log: …`) now `flush=True` so they appear
   immediately when stdout is redirected to a file under `&`. Previously
   they sat in Python's 4 KB block buffer until the daemon exited,
   silently breaking the spike-notes' `--print-token > out.log &`
   workflow because the token line never reached disk.

4. `tests/e2e_daemon_skeleton.py::test_daemon_writes_discovery_and_token`
   token-length floor raised from `>= 32` to `>= 40`. `secrets.token_urlsafe(32)`
   yields 43 chars, so the previous floor was loose enough that an
   accidental shrink to 24 raw bytes (32 chars) would still ship green.
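
The length arithmetic behind item 4 can be checked directly: unpadded URL-safe base64 emits one character per 6 bits. `urlsafe_len` is a helper written for this illustration and is not part of the test suite.

```python
import secrets


def urlsafe_len(nbytes: int) -> int:
    """Characters produced by unpadded URL-safe base64 for nbytes of input."""
    return (nbytes * 8 + 5) // 6  # ceil(bits / 6)


# 32 bytes clears both floors, 24 bytes clears only the old `>= 32` floor,
# and 16 bytes clears neither.
lengths = {n: len(secrets.token_urlsafe(n)) for n in (32, 24, 16)}
```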

Tests: 10 new unit tests (4 covering `cli.main` dispatch, 4 covering
`_resolve_token_path`, 2 covering `discovery.make_info`'s new field).
Full suite 669/669 passing on `main`. End-to-end smoke verified all
three runtime fixes against `cheetahclaws serve --listen tcp://...`.

Docs:
* `README.md` docs index — adds row for RFC 0002 (foundation roadmap);
  refreshes the spike row to reflect the actual landing path
  (SafeRL-Lab#77 → reverted → re-landed via SafeRL-Lab#81); marks F-1 as merged via SafeRL-Lab#80.
* `docs/RFC/0002-daemon-foundation-roadmap.md` — F-1 status `OPEN`
  → `MERGED SafeRL-Lab#80`.
* `docs/architecture.md` — daemon section now mentions the optional
  `token_path` discovery field and notes that the daemon-control verbs
  use it.
* `docs/news.md` — May 2, 2026 entry covering the spike re-land and
  F-1 merge sequence, the polish nits, and the intentional "not in F-1"
  list (agent.run, bridges, SQLite, cost guardrails, agent-runner
  subprocess, macOS peer-cred).

Refs SafeRL-Lab#68, SafeRL-Lab#74, SafeRL-Lab#80, SafeRL-Lab#81

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yamaceay pushed a commit to yamaceay/cheetahclaws that referenced this pull request May 6, 2026
Adds an isolated cc_daemon/ package (~1.1k LoC across 9 files, plus 360
lines of pytest) that implements the daemon contract surface defined in
docs/RFC/0001-daemon-design-note.md (PR SafeRL-Lab#74). This is a SPIKE — it
exists to lock down the IPC + permission routing + local auth contract
in runnable code so the upcoming foundation PR has a verified starting
point. Nothing in cc_daemon/ is load-bearing for production.

Covers the ✓ rows of the RFC review-comment matrix: ThreadingHTTPServer
with a non-default request_queue_size, 15s SSE heartbeat, client_id
mint/persist/resume, sync RPC + async events (variant A of
session.send), Cheetahclaws-Api-Version: 0 → 426 on mismatch, bounded
event ring buffer with gap event on overflow, audit log default-on for
both transports, 30 min interactive permission timeout with
permission.refresh_timeout RPC, and originator-only permission.answer
returning HTTP 403 / -32001 to non-originators.
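
As an illustration of "bounded event ring buffer with gap event on overflow", here is one possible shape. The class name and the `missed_from` / `missed_to` fields are hypothetical and not taken from cc_daemon/events.py.

```python
from collections import deque


class EventRing:
    """Keeps the newest `capacity` events and reports what a replay missed."""

    def __init__(self, capacity: int):
        self._buf = deque(maxlen=capacity)  # oldest entries fall off automatically
        self._next_id = 1

    def publish(self, payload: dict) -> int:
        event_id = self._next_id
        self._next_id += 1
        self._buf.append((event_id, payload))
        return event_id

    def replay(self, since: int) -> list:
        """Events with id > since, preceded by a gap marker if some were evicted."""
        oldest = self._buf[0][0] if self._buf else self._next_id
        out = []
        if since + 1 < oldest:
            out.append({"type": "gap", "missed_from": since + 1, "missed_to": oldest - 1})
        out.extend({"id": eid, **payload} for eid, payload in self._buf if eid > since)
        return out
```

A reconnecting client that receives the gap marker knows its view is incomplete and can resync instead of silently missing events.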

Out of scope: agent.run integration, bridges migration, SQLite event
store, cost guardrails, agent-runner subprocess isolation, /metrics,
macOS peer-cred (TODO(macos) left in cc_daemon/auth.py).

Main code touches are minimal and isolated: a 6-line subcommand shim in
cheetahclaws.py that intercepts `cheetahclaws spike-daemon ...` before
the main argparse runs, plus one cc_daemon entry in pyproject.toml's
package list. Existing CLI flags (--version, --help, prompt parsing)
are unchanged.

Tests: 593 passing (580 existing + 13 new). Run the new suite with
`pytest tests/test_daemon_spike.py -v`. Manual smoke and originator-
routing demo in docs/RFC/0001-spike-notes.md "How to run it".

Refs SafeRL-Lab#68, SafeRL-Lab#74

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants