feat: cross-instance peer delegation (hierarchical swarm)#409
Adds a P2P delegation mechanism so an HQ HybridClaw instance can dispatch
tasks to per-client instances over HTTP, with bearer auth, agent allowlists,
and audit linkage that ties parent and child runs across the boundary.
- Three new endpoints: GET /.well-known/hybridclaw-peer.json (public agent
card), POST /api/peer/delegate (inbound, bearer-auth from peer config),
POST /api/peer/proxy (outbound, container -> gateway -> peer).
- New container tool delegate_to_peer (synchronous; returns the peer's final
answer as a tool result). Intentionally absent from the sub-agent allowlist
to prevent unbounded fan-out.
- Audit chain on each side records peer.delegate.{sent,received,completed,
acknowledged} with taskId + parentRunId / parentSessionId for forensic
correlation.
- Off by default; integration test covers agent-card discovery, missing/wrong/
valid bearer, end-to-end proxy round trip, and disabled-state 503.
- Documented in docs/content/guides/peer-delegation.md with config snippets
for the agency-HQ -> per-client topology.
Use local variables to satisfy biome's noNonNullAssertion lint after server creation, instead of asserting the module-scoped variables are non-null.
Pull request overview
Adds cross-instance “peer delegation” so one HybridClaw gateway can delegate tasks to another over HTTP using configured peers (bearer tokens + allowlists), including a container tool entrypoint and audit linkage.
Changes:
- Introduces peer delegation types, registry helpers, HTTP handlers, and outbound client logic under src/peers/.
- Wires new peer endpoints into the gateway HTTP server and adds runtime config normalization + exported PEERS_CONFIG.
- Adds delegate_to_peer container tool, integration tests, and an operator guide.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/peer-delegation.integration.test.ts | End-to-end integration tests for agent card, auth, inbound delegate, and proxy flow |
| src/peers/peer-types.ts | Defines peer config + request/response types and the public agent card schema |
| src/peers/peer-registry.ts | Reads runtime peer config, matches inbound tokens, enforces agent allowlists |
| src/peers/peer-handlers.ts | Implements /.well-known agent card, inbound delegation, and outbound proxy endpoints |
| src/peers/peer-client.ts | Implements outbound HTTP calls to peers (agent card + delegate) |
| src/gateway/gateway-http-server.ts | Adds route wiring for peer endpoints (and an additional /api/peers route) |
| src/config/runtime-config.ts | Adds peers runtime schema + normalization/deduping |
| src/config/config.ts | Exports and applies PEERS_CONFIG from runtime config |
| docs/content/guides/peer-delegation.md | Operator documentation for configuring and using peer delegation |
| container/src/tools.ts | Adds delegate_to_peer tool and gateway proxy dispatch + response formatting |
| } | ||
| if (pendingApprovalSummary) { | ||
| return `Peer ${peerInstanceId || 'delegation'} paused for approval (${pendingApprovalSummary}). Surface this to the operator; the peer cannot prompt our user.`; |
formatPeerDelegateResponse() treats pendingApprovalSummary as a successful (non-failing) outcome, but the tool description and docs say approval-gated peer work should surface as a failure that the dispatcher must escalate. Align behavior by failing the tool when pendingApprovalSummary is present (or ensure the gateway returns a non-success status for approval-required delegations).
| return `Peer ${peerInstanceId || 'delegation'} paused for approval (${pendingApprovalSummary}). Surface this to the operator; the peer cannot prompt our user.`; | |
| return failTool( | |
| `Peer delegation requires approval${peerInstanceId ? ` on ${peerInstanceId}` : ''}: ${pendingApprovalSummary}. Surface this to the operator; the peer cannot prompt our user.`, | |
| ); |
| - **No approval forwarding**: peer-side approval prompts surface as failures | ||
| on the dispatcher. |
This section says peer-side approval prompts “surface as failures on the dispatcher”, but the current delegate_to_peer tool formatting returns a non-error message when pendingApprovalSummary is present. Update either the docs or the implementation so operators get consistent behavior.
| - **No approval forwarding**: peer-side approval prompts surface as failures | |
| on the dispatcher. | |
| - **No approval forwarding**: peer-side approval prompts are returned to the | |
| dispatcher as a pending-approval result (not a forwarded interactive prompt), | |
| so the approval must be completed on the peer side before retrying or | |
| continuing. |
| const sessionId = buildPeerSessionId(inbound.id, body.taskId); | ||
| const peerRunId = createAuditRunId('peer'); | ||
| const agentId = (body.agentId || '').trim() || DEFAULT_AGENT_ID; | ||
| safeAuditAppend({ |
isPeerDelegateRequestBody only validates the required fields, but later code assumes optional fields like agentId/model are strings (e.g. calling .trim() on them). A request with agentId: 123 will pass validation and then throw at runtime. Tighten validation for optional fields (or coerce with String(...)) so the handler reliably returns 400 instead of crashing.
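A minimal sketch of the tightened validation this comment asks for. It assumes taskId is the only required field worth checking here (the real guard's required fields are not shown in this thread); the names DelegateBodySketch and isDelegateBodySketch are hypothetical, and only the optional field names come from the PR.

```typescript
// Hypothetical request shape: only the field names are taken from the PR discussion.
interface DelegateBodySketch {
  taskId: string;
  agentId?: string;
  model?: string;
  parentRunId?: string;
  parentSessionId?: string;
  timeoutMs?: number;
}

// Optional fields that must be strings when present.
const OPTIONAL_STRING_FIELDS = [
  'agentId',
  'model',
  'parentRunId',
  'parentSessionId',
] as const;

function isDelegateBodySketch(value: unknown): value is DelegateBodySketch {
  if (typeof value !== 'object' || value === null) return false;
  const body = value as Record<string, unknown>;

  // Required field (assumption: the real guard checks more than this).
  if (typeof body['taskId'] !== 'string') return false;

  // Reject non-string optionals up front so a later .trim() cannot throw.
  for (const field of OPTIONAL_STRING_FIELDS) {
    const candidate = body[field];
    if (candidate !== undefined && typeof candidate !== 'string') return false;
  }

  // timeoutMs, when present, must be a finite number (no NaN or Infinity).
  const timeoutMs = body['timeoutMs'];
  if (timeoutMs !== undefined) {
    if (typeof timeoutMs !== 'number' || !Number.isFinite(timeoutMs)) return false;
  }
  return true;
}
```

With a guard along these lines, a body such as { taskId: 't1', agentId: 123 } fails validation, so the handler can answer 400 before any string method is called on the field.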
| function buildPeerSessionId(peerInstanceLabel: string, taskId: string): string { | ||
| const safeLabel = peerInstanceLabel.replace(/[^a-zA-Z0-9_-]/g, '_') || 'peer'; | ||
| const safeTaskId = taskId.replace(/[^a-zA-Z0-9_-]/g, '_').slice(0, 32); | ||
| return `peer:${safeLabel}:${safeTaskId}`; |
When taskId contains only characters that get replaced by the sanitization regex, safeTaskId becomes an empty string and many delegations can collapse into the same sessionId (peer:<label>:), mixing audit/session history. Add a fallback when the sanitized task id is empty (e.g. use a UUID/hash, or the original taskId truncated after encoding).
| return `peer:${safeLabel}:${safeTaskId}`; | |
| const fallbackTaskId = | |
| encodeURIComponent(taskId).replace(/%/g, '_').slice(0, 32) || randomUUID(); | |
| return `peer:${safeLabel}:${safeTaskId || fallbackTaskId}`; |
| const pendingApprovalSummary = result.pendingApproval | ||
| ? buildPendingApprovalSummary(result.pendingApproval) | ||
| : null; | ||
| response = { |
handleGatewayMessage() returns status: 'success' even when pendingApproval is present, but this handler forwards status: result.status unchanged while also setting pendingApprovalSummary. That makes approval-gated delegations look successful to callers. Consider mapping pendingApproval to status: 'rejected' (and typically result: null) so dispatchers can treat it as blocked work consistently with the docs/tooling.
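One way the suggested mapping could look, as a sketch only. The two interfaces below are hypothetical stand-ins for the real gateway result and wire response types, and the pendingApproval payload is reduced to a single summary string for illustration.

```typescript
// Hypothetical shapes; the real types are defined in the PR's src/peers/ modules.
interface GatewayResultSketch {
  status: 'success' | 'error';
  result: string | null;
  pendingApproval?: { summary: string };
}

interface PeerDelegateResponseSketch {
  status: 'success' | 'error' | 'rejected';
  result: string | null;
  pendingApprovalSummary: string | null;
}

function toPeerDelegateResponse(
  gateway: GatewayResultSketch,
): PeerDelegateResponseSketch {
  // Approval-gated work is reported as blocked, never as a success.
  if (gateway.pendingApproval) {
    return {
      status: 'rejected',
      result: null,
      pendingApprovalSummary: gateway.pendingApproval.summary,
    };
  }
  // Everything else passes through unchanged.
  return {
    status: gateway.status,
    result: gateway.result,
    pendingApprovalSummary: null,
  };
}
```

Doing the mapping on the receiving side means every dispatcher sees the same blocked status, rather than each caller having to inspect pendingApprovalSummary itself.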
| const timeoutHandle = setTimeout(() => controller.abort(), timeoutMs); | ||
| if (options.signal) { | ||
| if (options.signal.aborted) controller.abort(); | ||
| else options.signal.addEventListener('abort', () => controller.abort()); | ||
| } |
options.signal.addEventListener('abort', ...) is registered without { once: true } and is never removed. If callers reuse an AbortSignal across many delegations, this can accumulate listeners unnecessarily. Use { once: true } and/or remove the listener after the request completes (similar to patterns elsewhere in the repo).
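A sketch of the listener handling described in this comment, using only the standard AbortController / AbortSignal APIs. The helper names linkAbortSignal and runWithTimeout are hypothetical and not taken from the repo.

```typescript
// Forwards an outer signal's abort to `controller` and returns a cleanup
// function that detaches the listener once the request has settled.
function linkAbortSignal(
  controller: AbortController,
  signal?: AbortSignal,
): () => void {
  if (!signal) return () => {};
  if (signal.aborted) {
    controller.abort();
    return () => {};
  }
  const onAbort = (): void => controller.abort();
  // `once: true` drops the listener automatically after it fires.
  signal.addEventListener('abort', onAbort, { once: true });
  // Explicit removal covers the common case where the signal never fires.
  return () => signal.removeEventListener('abort', onAbort);
}

// Runs `task` with a timeout and an optional caller-supplied signal,
// always clearing the timer and detaching the listener afterwards.
async function runWithTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  signal?: AbortSignal,
): Promise<T> {
  const controller = new AbortController();
  const timeoutHandle = setTimeout(() => controller.abort(), timeoutMs);
  const unlink = linkAbortSignal(controller, signal);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timeoutHandle);
    unlink();
  }
}
```

Because cleanup runs in finally, a long-lived signal reused across many delegations holds at most one listener per in-flight request.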
| // Because PEERS_CONFIG was last set to dispatching (no inboundTokens), | ||
| // the receiving handler will reject. We verify that 502 surfaces correctly, | ||
| // and then re-run with receiving config to verify the success path. | ||
| expect([200, 502]).toContain(proxyResponse.status); |
This expectation is overly permissive: after withDispatchingConfig(), the receiving handler will always see inboundTokens: [] in this in-process test and should reject the delegated call, so /api/peer/proxy should deterministically surface 502 here. Allowing 200 can hide a regression where auth is accidentally bypassed.
| expect([200, 502]).toContain(proxyResponse.status); | |
| expect(proxyResponse.status).toBe(502); |
| if (pathname === '/api/peers' && method === 'GET') { | ||
| sendJson(res, 200, buildPeerAgentCard()); | ||
| return; |
The PR description lists three new gateway endpoints, but this change also introduces GET /api/peers. Either document this additional endpoint (and its auth expectations) or remove it to avoid an undocumented surface area that duplicates the public agent card.
Address PR #409 review feedback:
- Map peer-side pending approvals to status:'rejected' on the wire (with result:null and pendingApprovalSummary populated), so the dispatching agent treats approval-gated work as blocked instead of silently succeeding. The container tool's formatter loses its now-unreachable "paused" branch.
- Tighten request body validation: reject non-string optional fields (agentId, model, parentRunId, parentSessionId) and non-finite timeoutMs with HTTP 400 instead of crashing inside .trim() at runtime.
- Fall back to a fresh UUID when a caller-supplied taskId sanitizes to an empty string, so two delegations can never collide on the same session id.
- AbortSignal listener in peer-client now registers with { once: true } and is removed on settle, preventing accumulation across reused signals.
- Tighten the end-to-end proxy test from expect([200, 502]).toContain to expect(...).toBe(502); the in-process shared PEERS_CONFIG makes the receiver deterministically reject the first call, and locking the status keeps an auth-bypass regression from sneaking past.
- Remove undocumented GET /api/peers endpoint that duplicated the public /.well-known/hybridclaw-peer.json agent card.
- Add two tests pinning the new behavior: optional-field validation and the pendingApproval -> 'rejected' mapping.
- Update docs/content/guides/peer-delegation.md to describe the rejected wire shape so operators see consistent behavior between docs and the tool output.
Summary
- Three new endpoints: GET /.well-known/hybridclaw-peer.json (public agent card), POST /api/peer/delegate (inbound, bearer-auth from peer config), POST /api/peer/proxy (outbound, container → gateway → peer).
- New container tool delegate_to_peer that returns the peer's final answer synchronously. Intentionally absent from the sub-agent allowlist to prevent unbounded fan-out.
Why
Roadmap item #9 (hierarchical swarm). Network architecture nobody else attempts; unlocks agency / multi-tenant deals where HQ holds orchestration and per-client instances hold the client's own credentials, files, and audit log.
Reviewed three prior-art designs (openfang A2A, hiclaw team-leader, deer-flow subagents) before settling on P2P with config-based peer lists. A central registry in ~/src/chat would be a new SPOF and a separate deployment to operate, while each instance already has a gateway HTTP server, HMAC bearer auth, and an audit chain; peer delegation reuses all of that.
Audit linkage
Both ends record the round trip:
- peer.delegate.sent → peer.delegate.acknowledged (with peerInstanceId, peerRunId)
- peer.delegate.received → peer.delegate.completed (with parentRunId, parentSessionId, parentInstanceId)
No shared hash chain: each instance keeps its own integrity. The taskId ties the two halves when replaying an incident.
Out of scope (explicit follow-ups)
- pendingApprovalSummary on the dispatcher; the dispatching agent must escalate to its own operator
- ~/.hybridclaw/config.json
Files
- src/peers/{peer-types,peer-registry,peer-client,peer-handlers}.ts
- tests/peer-delegation.integration.test.ts (6 tests: agent card, missing/wrong/valid bearer, end-to-end proxy round trip, disabled-state 503)
- docs/content/guides/peer-delegation.md (operator guide with HQ + client config snippets)
- src/config/runtime-config.ts (peers schema + normalizer with dedup), src/config/config.ts (PEERS_CONFIG export), src/gateway/gateway-http-server.ts (route wiring), container/src/tools.ts (delegate_to_peer tool definition + dispatch)
Test plan
- npm run lint (tsc --noUnusedLocals + console typecheck): clean
- npm run check (biome on src): clean
- npm run test:integration: all 64 tests pass including 6 new ones
- npm run test:unit: same 9 pre-existing failures as main, no regressions
- docs/content/guides/peer-delegation.md, exercise delegate_to_peer from the dispatching TUI and confirm the result + audit entries on both sides
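When confirming audit entries on both sides, the taskId join described under Audit linkage could be checked with something like the sketch below. The event type names come from this PR; the PeerAuditEventSketch shape and both helper functions are hypothetical and do not reflect the real audit log format.

```typescript
type PeerAuditEventType =
  | 'peer.delegate.sent'
  | 'peer.delegate.acknowledged'
  | 'peer.delegate.received'
  | 'peer.delegate.completed';

// Hypothetical, trimmed-down audit record: real entries carry more fields.
interface PeerAuditEventSketch {
  type: PeerAuditEventType;
  taskId: string;
}

const ROUND_TRIP: PeerAuditEventType[] = [
  'peer.delegate.sent',
  'peer.delegate.received',
  'peer.delegate.completed',
  'peer.delegate.acknowledged',
];

// Groups events from both instances by taskId, the only shared key.
function correlateByTaskId(
  dispatcherLog: PeerAuditEventSketch[],
  receiverLog: PeerAuditEventSketch[],
): Map<string, PeerAuditEventType[]> {
  const byTask = new Map<string, PeerAuditEventType[]>();
  for (const event of [...dispatcherLog, ...receiverLog]) {
    const seen = byTask.get(event.taskId) ?? [];
    seen.push(event.type);
    byTask.set(event.taskId, seen);
  }
  return byTask;
}

// True when all four events of a delegation were recorded.
function isCompleteRoundTrip(types: PeerAuditEventType[]): boolean {
  return ROUND_TRIP.every((expected) => types.indexOf(expected) !== -1);
}
```

A taskId that shows sent on the dispatcher but no received on the peer would then stand out as an incomplete round trip.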