Promote dev → main: agent card + fleet-health filter + ceremony fixes (v0.7.22 candidate, re-cut) #466
chore: back-merge main → dev
Acknowledges main's squash commit as an ancestor to prevent phantom conflicts on the next dev→main promotion.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
fix: extension URIs use proto-labs.ai (not protolabs.ai) (#407)
All 27 references to https://protolabs.ai/a2a/ext/* changed to https://proto-labs.ai/a2a/ext/* to match the actual domain. These URIs are opaque identifiers (not published specs today), but they should reference a domain we own.
Breaking: external agents (Quinn, protoPen) whose cards declare the old URI will stop matching the registry until they update. Filed on Quinn to update her card.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chore: remove protoaudio + protovoice from agent rollcall (#410)
Both services are decommissioned: containers stopped and removed. The only reference in protoWorkstacean was the rollcall script.
Note: homelab-iac/stacks/ai/docker-compose.yml still has a worldmonitor network reference at line 521 and a service at line 833. Needs separate cleanup in that repo.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: upgrade web_search → searxng_search + give Ava fleet health tools (#411)
Two changes:
1. Replace the basic `web_search` tool (5 results, hardcoded engines) with `searxng_search` — adapted from rabbit-hole.io's full-surface SearXNG integration. New capabilities:
- Category routing: general, news, science, it
- Time range filtering: day, week, month, year
- Bang syntax: !wp (Wikipedia), !scholar, !gh (GitHub)
- Infoboxes, direct answers, suggestions in response
- Configurable max_results (default 10, was 5)
Updated in both bus-tools.ts (@protolabsai/sdk pattern) and deep-agent-executor.ts (LangChain pattern).
2. Give Ava three fleet health tools she was missing:
- get_ci_health — CI success rates across repos
- get_pr_pipeline — open PRs, conflicts, staleness
- get_incidents — security/ops incidents
Ava can now answer fleet health questions directly instead of always delegating to Quinn. Ava's tool count: 10 → 13.
Tool rename: web_search → searxng_search (greenfield, no backward-compat alias).
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chore(projects): register protoAgent in projects.yaml (#414)
protoAgent is the new GitHub Template repo that replaces per-agent A2A bootstrapping. Registers it as an active dev project owned by Quinn, matching the shape of existing entries.
Plane / GitHub webhook / Discord provisioning remain TODO — those integrations aren't configured in this deployment, so the onboard plugin skipped them.
Co-authored-by: Josh <artificialcitizens@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chore: back-merge main → dev
Acknowledges main's squash commit as an ancestor to prevent phantom conflicts on the next dev→main promotion.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
feat(ava): expand helm toolset, wire GOAP skills, fix ceremony disable bug (#415)
Ava agent audit + overhaul:
- Tools: 10 → 22 (direct observation, propose_config_change, incident reporting)
- Skills: 3 → 7 (debug_ci_failures, fleet_incident_response, downshift_models, investigate_orphaned_skills)
- System prompt rewritten: self-improvement instructions, escalation policy, GOAP-dispatch playbook
- DeepAgentExecutor now applies skill-level systemPromptOverride (goal_proposal, diagnose_pr_stuck)
- Fix ceremony loader bug: disabled ceremonies were filtered out, preventing hot-reload from cancelling timers
- Clean up board.pr-audit.yaml (remove spurious action field, restore schedule, keep disabled)
- Update docs: README, deep-agent runtime, agent-skills reference, self-improving loop
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix(pr-remediator): close dispatch gap — self-dispatch + broaden auto-approve (#417)
Two root causes prevented PRs from being auto-merged:
1. Dispatch gap: the tier_0 short-circuit in ActionDispatcherPlugin completed all actions immediately without dispatching to agent.skill.request. Every action in actions.yaml is tier_0, so the fireAndForget path (which publishes the skill request) was unreachable dead code. Fix: tier_0 now falls through when meta.fireAndForget is set.
2. Approval gap: readyToMerge requires reviewState=approved, but auto-approve only covered the `dependabot`/`renovate` authors and the `promote:`/`chore(deps` title prefixes. Human PRs, release PRs (`chore(release`), and github-actions PRs all lacked approved reviews and sat indefinitely. Fix: added `app/github-actions` to authors and `chore(release`, `chore:`, `docs(` to the safe title prefixes.
Additionally, PrRemediatorPlugin now self-dispatches remediation on every world.state.updated tick — checking for readyToMerge, dirty, failingCi, and changesRequested PRs directly from cached domain data. This removes the dependency on GOAP dispatch reaching the plugin via pr.remediate.* topics (which were never published in production after Arc 1.4 removed meta.topic routing).
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(goap): issue_zero domain + goals — track open GitHub issues across fleet (#419)
Adds a github_issues domain that polls /repos/{repo}/issues?state=open for all managed projects and classifies by label (critical, bug, enhancement). Three GOAP goals enforce issue hygiene:
- issues.zero_critical (critical severity, max: 0)
- issues.zero_bugs (high severity, max: 0)
- issues.total_low (medium severity, max: 5)
Each goal has a matching alert action and a triage dispatch action that invokes Ava's new issue_triage skill. The skill instructs Ava to resolve, convert to board features, delegate, or close issues with rationale — driving toward zero open issues across all repos.
The domain polls every 5 minutes (issue velocity is low, and GitHub rate limits are a concern with 6+ repos).
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: manage_board list action (#247) + a2a.trace extension (#359) (#420)
Two enhancements to reach issue zero:
manage_board list (#247):
- Added GET /api/board/features/list endpoint proxying to Studio
- Added "list" action to the manage_board tool with a status filter
- Ava can now query "show me all blocked features" directly
a2a.trace extension (#359):
- New langfuse-trace extension stamps a2a.trace metadata on all outbound A2A dispatches (traceId, callerAgent, skill, project)
- Quinn reads this to link Langfuse traces across agent boundaries
- Registered at startup alongside the cost/confidence/blast extensions
Closes #247, closes #359.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix(pr-remediator): case-insensitive auto-approve prefix matching (#421)
"Promote dev to main" titles start with a capital P, but the prefix check was case-sensitive against "promote:". Now lowercases the title before matching so both "promote:" and "Promote" patterns are caught.
(This squash also carried the #420 manage_board / a2a.trace commit, already described above.)
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…) (#427)
Closes the structural gap where 6+ tier_0 fire-and-forget alert skills had no registered executor, causing SkillDispatcherPlugin to log "No executor found" and silently drop the dispatch on every GOAP planning cycle.
- AlertSkillExecutorPlugin registers FunctionExecutors for all 24 bare alert.* actions in workspace/actions.yaml. Each translates the dispatch into a structured message.outbound.discord.alert event consumed by the existing WorldEngineAlertPlugin webhook routing.
- validate-action-executors.ts cross-checks the loaded ActionRegistry against the live ExecutorRegistry at startup. Surfaces every gap as a HIGH-severity Discord alert (goal platform.skills_unwired) and a loud console.error. Set WORKSTACEAN_STRICT_WIRING=1 to crash startup instead.
- action.issues_triage_bugs already routes correctly via meta.agentId=ava to the existing DeepAgentExecutor for Ava's issue_triage skill — no duplicate wiring needed (greenfield).
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CeremonyStateExtension was publishing { domain, data } envelopes on
`world.state.snapshot` after every ceremony completion. GoalEvaluatorPlugin
subscribes to `world.state.#`, treated the malformed payload as a WorldState,
and emitted a "Selector ... not found" violation for every loaded goal on
every ceremony tick (the cluster of 25+ violations at each :15/:30 boundary
in the live container logs). All listed selectors (flow.efficiency.ratio,
services.discord.connected, agent_health.agentCount, etc.) actually exist
in the producer output — the goals are correct.
Changes:
- Move ceremony snapshot publish to `ceremony.state.snapshot` (off the
world.state.# namespace). Leaves the existing CeremoniesState shape and
consumers unchanged.
- Goal evaluator: defensive payload shape check. Reject single-domain
envelopes ({ domain, data }) and other non-WorldState payloads loud-once
instead of generating one violation per goal.
- Goal evaluator: startup selector validator. After the first valid world
state arrives, walk every loaded goal's selector and HIGH-log any that
doesn't resolve. Re-armed on goals.reload / config.reload so drift caught
by future hot-reloads also surfaces.
- Tests: regression guard that CeremonyStateExtension does not publish on
world.state.#; goal evaluator ignores malformed payloads; validator
catches an intentionally broken selector.
Closes #424
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…429)
The Claude Code skill was calling the homelab-iac copy of agent-rollcall.sh, which had drifted from this repo's copy. The in-repo script knows about the in-process DeepAgent runtime (Ava, protoBot, Tuner) and the current A2A fleet; the homelab-iac copy still probed for the archived ava-agent container and the deprecated protoaudio/protovoice services.
Single source of truth: this repo. The homelab-iac copy was separately synced in homelab-iac@64e8dcf.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…th_discord (#431)
Adds CeremonySkillExecutorPlugin — registers FunctionExecutors that bridge GOAP `ceremony.*` actions to the matching `ceremony.<id>.execute` topic that CeremonyPlugin already listens for. Without this bridge, SkillDispatcherPlugin dropped every dispatch with "No executor found …" and (post-#427) emitted HIGH platform.skills_unwired alerts every cycle.
Mirrors the alert-skill-executor-plugin pattern from #427 — explicit action → ceremony id mapping; install order matters (after registry, before skill-dispatcher).
Partial fix for #430.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…432)
Closes the structural gap where 5 actions in workspace/actions.yaml route to handlers in PrRemediatorPlugin but had no registered executor:
- action.pr_update_branch → pr.remediate.update_branch
- action.pr_merge_ready → pr.remediate.merge_ready
- action.pr_fix_ci → pr.remediate.fix_ci
- action.pr_address_feedback → pr.remediate.address_feedback
- action.dispatch_backmerge → pr.backmerge.dispatch
Before this change, SkillDispatcherPlugin logged "No executor found" and dropped the dispatch every GOAP cycle. After PR #427's startup validator, the same gap raised platform.skills_unwired HIGH every tick.
Wiring follows the AlertSkillExecutorPlugin pattern from #427:
- PrRemediatorSkillExecutorPlugin registers FunctionExecutors that publish on the existing pr-remediator subscription topics, keeping "bus is the contract" — no plugin holds a reference to the other.
- Executors are fire-and-forget per actions.yaml meta. They return a successful SkillResult immediately; pr-remediator's handler runs asynchronously on the bus subscription.
- Install order matches alert-skill-executor: AFTER ExecutorRegistry construction, BEFORE skill-dispatcher.
For action.pr_merge_ready specifically, the meta.hitlPolicy (ttlMs: 1800000, onTimeout: approve) is now honoured. The executor forwards meta into the trigger payload; _handleMergeReady extracts it via _extractHitlPolicy and passes it to _emitHitlApproval, which populates HITLRequest.{ttlMs, onTimeout}. HITLPlugin already auto-publishes a synthetic approve response when onTimeout=approve fires.
Closes part of #430. Ceremony + protoMaker actions ship in a separate PR.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
protoMaker (apps/server in protoLabsAI/ava) has been A2A-ready for a while — it serves agent-card.json with 10 skills, including the two referenced by unwired GOAP actions:
- action.protomaker_triage_blocked → skill board_health
- action.protomaker_start_auto_mode → skill auto_mode
Both actions targeted [protomaker], but no agent named "protomaker" was registered, so the dispatcher couldn't route. Adding the entry closes the routing gap; A2AExecutor's existing target-matching does the rest.
Endpoint: http://automaker-server:3008/a2a (verified from inside the workstacean container with AVA_API_KEY → JSON-RPC 2.0 response).
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#433 added `subscribesTo: message.inbound.github.#` to the new protomaker agent registration, copy-pasted from quinn's pattern. That was wrong: protoMaker is reached via explicit GOAP `targets: [protomaker]` dispatches (action.protomaker_triage_blocked, action.protomaker_start_auto_mode), not as a broadcast inbound listener.
Quinn already subscribes to all GitHub inbound and dispatches `bug_triage` on protoMaker's behalf. Having protomaker subscribe to the same broadcast topic is one of the contributing paths to the duplicate-triage spam loop filed as protoLabsAI/protoMaker#3503 (the root cause is Quinn's handler not being idempotent — but this cleanup removes one extra firing path).
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…me (#435)
PR #411 renamed the tool from web_search to searxng_search in src/agent-runtime/tools/bus-tools.ts (line 393), but ava.yaml still declared the old name. Result: at startup the runtime warns "agent ava declares unknown tools: web_search" and Ava ends up with no search capability — when asked, she explicitly responds "I don't have a searxng_search or web_search tool in my current toolkit."
This is the config side of the half-finished rename that PR #411 missed. After this lands and workstacean restarts, Ava's toolkit should include searxng_search and the unknown-tools warning should be empty for her.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…up (#436)
Two GOAP actions on goal fleet.no_agent_stuck were spamming Discord because they had no `effects` and no cooldown:
- alert.fleet_agent_stuck → posts a Discord alert
- action.fleet_incident_response → dispatches Ava to file an incident, page on-call, and pause routing
Observed 2026-04-20: when auto-triage-sweep hit a 100% failure rate on bug_triage (cascading from the non-idempotent handler in protoLabsAI/protoMaker#3503), GOAP re-fired both actions every planning cycle. Ava filed INC-003 through INC-009 in ~30 seconds, each posting to Discord. The routing pause succeeded, but the rolling 1h failure-rate metric doesn't drop instantly, so the goal stayed violated and the loop kept re-firing.
Disabling both actions until proper dedup lands. Reinstate when:
1. action.fleet_incident_response gains a cooldown OR an `effects` marker that satisfies the goal for the cooldown window
2. Ava's fleet_incident_response skill checks for an existing open incident on the same agent before filing a new one
3. alert.fleet_agent_stuck gains per-agent rate limiting
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (#451)
After #436 disabled action.fleet_incident_response, observed similar loop spam from four more action.* dispatches that share the same architectural bug (no cooldown, no satisfying effects → re-fire every GOAP cycle while the goal stays violated):
- action.fleet_investigate_orphaned_skills — Ava posts orphaned-skill diagnosis to Discord ops on every cycle (8+ posts in 1 min observed)
- action.issues_triage_critical — ~447 fires in 30 min
- action.issues_triage_bugs — ~447 fires in 30 min; compounds with #3503 by re-triaging the same issues
- action.fleet_downshift_models — same pattern when cost exceeds budget
All share the same pattern as the alert.* actions and the original fleet_incident_response: effects: [] + no cooldownMs + persistent goal violation = infinite re-fire.
Mitigation only — Ava temporarily can't auto-investigate orphaned skills or auto-triage new GitHub issues. Reinstate when issue #437 ships action-level cooldown (meta.cooldownMs) or proper effects-with-TTL.
The alert.* siblings (24 of them) also have this bug but are non-impacting today because DISCORD_WEBHOOK_ALERTS is unset and WorldEngineAlertPlugin drops them silently — fixing #437 will cover both at the dispatcher level.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GOAP actions with `effects: []` re-fired every planning cycle (~3s) while their goal stayed violated. Two prod fires (PR #436, PR #451) had to disable 6 actions before this lands. Restoring autonomy.
ActionDispatcherPlugin now honors `meta.cooldownMs` on every action. When an action with a positive cooldownMs fires, the dispatcher records its timestamp; subsequent dispatches of the same action id within the window are dropped BEFORE the WIP queue and BEFORE the executor. This single chokepoint covers both alert.* (FunctionExecutor → Discord) and action.* (DeepAgent / A2A) paths. Drops log a fail-fast diagnostic with action id, age, and remaining window, plus bump a new `cooldown_dropped` telemetry event.
The cooldown bucket is keyed on action id alone — per-target keying isn't needed because each GOAP action targets one situation. Greenfield shape: absence of `meta.cooldownMs` naturally means "no cooldown"; no flag.
Defaults applied to workspace/actions.yaml:
- alert.* → 15 min
- action.* with skillHint → 30 min (agent work is expensive)
- action.pr_* → 5 min (remediation must stay responsive)
- ceremony.* → 30 min (treated as skill dispatch)
- action.dispatch_backmerge → none (the in-handler per-repo cooldown in pr-remediator stays authoritative)
Re-enables the 6 actions disabled by PRs #436 and #451: alert.fleet_agent_stuck, action.fleet_incident_response, action.fleet_downshift_models, action.fleet_investigate_orphaned_skills, action.issues_triage_critical, action.issues_triage_bugs.
Tests:
- action-dispatcher unit tests cover: blocks repeats within the window; A's cooldown does not affect B; window expiry admits the next dispatch; absent cooldownMs and cooldownMs <= 0 mean no throttling.
- End-to-end test spam-publishes 100 violations of fleet.no_skill_orphaned and asserts exactly 1 dispatch reaches the executor.
- bun test: 1023 / 1023 pass.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…roken (#454)
Audit of all 10 workspace/ceremonies/*.yaml against the live agent skill registry surfaced 7 with skill-target mismatches that fail on every fire.
ROUTED to the correct skill owner:
- board.cleanup — skill=board_audit, targets [all] → [quinn]
- board.health — skill=board_health, targets [all] → [protomaker]
- daily-standup — skill=board_audit, targets [ava] → [quinn]
- health-check — skill=board_audit, targets [ava] → [quinn]
DISABLED (no agent advertises the skill):
- agent-health — skill=health_check — no health_check anywhere
- board.retro — skill=pattern_analysis — no pattern_analysis anywhere
- service-health — skill=health_check — same as agent-health
Quinn owns board_audit (and bug_triage / pr_review / qa_report). Protomaker owns board_health (and 9 other apps/server skills).
The two `health_check`-keyed ceremonies were redundant anyway — the agent_health and services world-state domains poll the same data every 60s and expose it on /api/agent-health and /api/services. Re-enable the disabled three with a real skill if a periodic ANNOUNCE to Discord is wanted (a small `*_health_report` skill on protobot would do it).
Workspace is bind-mounted, so the live container picks this up on restart with no rebuild.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The GOAP planner could dispatch a skill to a target agent (e.g. auto-triage-sweep, user) where the named target wasn't in the live ExecutorRegistry. The dispatcher fired anyway, the executor errored 404-style, and the failure cascaded into stuck work items + duplicate incident filings (INC-003 through INC-018, ~93 work items in error state). The cooldown work in #437 / #452 masked the symptom, but the structural gap remained.
ActionDispatcherPlugin now takes the shared ExecutorRegistry handle and runs `_admitOrTargetUnresolved` immediately after the cooldown check and BEFORE the WIP queue / executor. Same chokepoint pattern as cooldown:
- target absent (skill-routed) → admit
- target = "all" (broadcast sentinel) → admit
- target matches a registration agentName → admit
- target unresolvable → drop, log loud, telemetry bump `target_unresolved`
Drops surface the action id, the unresolvable target, AND the agents that DO exist, so the routing mistake is immediately diagnosable. The only opt-out is the broadcast sentinel "all" — there is no flag, no enabled-bool. Greenfield-strict shape.
Wiring: src/index.ts passes the shared executorRegistry into the dispatcher factory. Test fixtures that don't exercise target routing omit the registry, and the check is skipped — so existing tests stay green without modification.
Audit of workspace/actions.yaml: only `protomaker` appears as an active agentId target (twice). That agent is registered in workspace/agents.yaml. The historical bad targets (`auto-triage-sweep`, `user`) were removed by prior cleanup; this PR ensures any future regression fails closed.
Tests added in src/plugins/action-dispatcher-plugin.test.ts:
- admits when target is registered
- drops when target is unregistered + bumps `target_unresolved`
- drops on mixed-target intent (single-target shape today via meta.agentId)
- admits when target is the "all" broadcast sentinel
- admits when meta.agentId is absent (skill-routed dispatch)
- admits when no registry is wired (legacy test fixtures)
bun test: 1029 / 1029 pass.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #415 fixed the hot-reload path so flipping a ceremony to disabled cancelled its timer, but the initial-load path still added every YAML entry (disabled or not) to the in-memory registry. Two consequences:
1. External `ceremony.<id>.execute` triggers (from CeremonySkillExecutorPlugin's GOAP bridge) found the disabled ceremony in the registry and fired it anyway.
2. After hot-reload flipped a ceremony enabled → disabled, the entry stayed in the registry — the same external-trigger leak.
Fix: filter `enabled === false` at every place that lands a ceremony in the registry (initial install, hot-reload new-file path, hot-reload changed-file path). Disabled ceremonies are loaded by the YAML parser (so the changed-file path can detect a flip) but never reach the registry, never schedule a timer, and cannot be resurrected by an external trigger. Operators see a `Skipping disabled ceremony: <id>` log line for each skip — fail-loud per project convention.
Greenfield: no flag, no toggle. enabled:false means disabled everywhere.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
13 TS2322 errors snuck through #452 and #455 because the CI test job has been failing on type-check while build-and-push (the gate that actually publishes :dev) is a separate workflow that runs on push. Result: main/dev were green for container publish even though tsc --noEmit was returning exit 2. The failure wasn't visible in PR merge gates either, because test.conclusion=failure + build-and-push.conclusion=skipped still resolved to a mergeable state.
Pattern of the 13 errors:
  bus.subscribe(T, "spy", (m) => requests.push(m));
Array.push() returns number, but the subscribe callback expects void | Promise<void>. Fix: wrap the body in a block so the arrow returns void:
  bus.subscribe(T, "spy", (m) => { requests.push(m); });
Applied via a sed-style regex across both test files. 1029 tests still pass (bun test). `bun run tsc --noEmit` now exits 0.
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… whitelist (closes #459) (#460)

AgentFleetHealthPlugin now takes an optional ExecutorRegistry (mirrors the #455 wiring on ActionDispatcherPlugin). On each inbound autonomous.outcome, `systemActor` is checked against `executorRegistry.list().map(r => r.agentName)`:

- Registered agent (ava, quinn, protomaker, ...) → aggregated in agents[] (existing shape, no behavior change).
- Anything else (pr-remediator, auto-triage-sweep, goap, user, ...) → routed to a separate systemActors[] bucket. No longer pollutes agentCount / maxFailureRate1h / orphanedSkillCount, so Ava's sitreps stop surfacing plugin names as "stuck agents".

First time a synthetic actor is seen, the plugin emits a one-time console.warn naming the actor + skill (fail-fast and loud, per policy) so operators know what's being filtered. No flag — same greenfield / chokepoint discipline as #437 (cooldown) and #444 (target guard).

Scope note on `_default`: this plugin keys on outcome `systemActor`, not registry `agentName`. `_default` only appears in /api/agent-health (the registry-driven view). Nothing currently publishes an outcome with `systemActor: "_default"`, so it doesn't reach agents[] here. If it ever did, the new whitelist would drop it to systemActors[] — the right outcome.

Verification plan (post-deploy):

    curl -s -X POST http://localhost:3000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"ava","messages":[{"role":"user","content":"fleet sitrep"}]}'

Expected: no `pr-remediator`, `auto-triage-sweep`, or `user` in agents[].

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
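The whitelist routing might look roughly like this. A sketch under stated assumptions: `routeOutcome`, `Buckets`, and the warn message are illustrative names, not the plugin's real internals:

```typescript
// Illustrative shapes; the real AgentFleetHealthPlugin internals differ.
interface Outcome {
  systemActor: string;
  skill: string;
}

interface Buckets {
  agents: Outcome[];       // aggregated into agentCount / failure rates
  systemActors: Outcome[]; // kept out of the agent-level aggregates
}

const warned = new Set<string>();

function routeOutcome(o: Outcome, registered: Set<string>, b: Buckets): void {
  if (registered.has(o.systemActor)) {
    b.agents.push(o); // existing aggregation path, unchanged
    return;
  }
  if (!warned.has(o.systemActor)) {
    // one-time console.warn per synthetic actor (fail-fast and loud)
    warned.add(o.systemActor);
    console.warn(`[fleet-health] non-agent actor filtered: ${o.systemActor} (${o.skill})`);
  }
  b.systemActors.push(o);
}
```

The registered-name set would come from `executorRegistry.list().map(r => r.agentName)` as described above.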
…oard URL (#462)

The card at /.well-known/agent-card.json was advertising a URL like http://ava:8081/a2a — host-mapped to the Astro dashboard, which 404s on /a2a. Spec-compliant clients (@a2a-js/sdk and friends) doing card discovery → POST to card.url could not reach the actual A2A endpoint; the voice agent team papered over this by switching to the /v1/chat/completions shim.

Fix: derive the card's `url` from variables that describe where the A2A endpoint actually lives.

1. WORKSTACEAN_PUBLIC_BASE_URL (e.g. https://ava.proto-labs.ai) → ${publicBase}/a2a. The canonical Cloudflare-fronted URL for external/Tailscale callers.
2. Otherwise, http://${WORKSTACEAN_INTERNAL_HOST ?? "workstacean"}:${WORKSTACEAN_HTTP_PORT}/a2a — docker-network service name + the actual API port.

Also populate `additionalInterfaces` with the JSON-RPC transport at the same URL so spec-compliant clients can pick deterministically. Drop the WORKSTACEAN_BASE_URL coupling in the card builder — that variable remains the externally-reachable URL stamped into A2A push-notification callbacks (different concern, separate documentation).

Closes #461

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
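The precedence above can be sketched as a small pure function. The env-var names come from the PR; the function name and trailing-slash handling are assumptions:

```typescript
// Sketch of the card-URL derivation; deriveCardUrl is a hypothetical name.
function deriveCardUrl(env: Record<string, string | undefined>): string {
  const publicBase = env.WORKSTACEAN_PUBLIC_BASE_URL;
  if (publicBase) {
    // Canonical Cloudflare-fronted URL for external/Tailscale callers.
    return `${publicBase.replace(/\/$/, "")}/a2a`;
  }
  // Docker-network fallback: service name + the actual API port.
  const host = env.WORKSTACEAN_INTERNAL_HOST ?? "workstacean";
  return `http://${host}:${env.WORKSTACEAN_HTTP_PORT}/a2a`;
}
```

Keeping this a pure function of an env snapshot makes the two-tier precedence easy to unit-test without touching process.env.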
chore: back-merge main → dev (post v0.7.21 — fix merge-base)
📝 Walkthrough

This PR implements multi-layer fixes for duplicate action dispatches, unregistered agent routing failures, and ceremony state isolation. It introduces action-level cooldown deduplication in

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Planner as GOAP Planner
    participant Dispatcher as ActionDispatcher
    participant Registry as ExecutorRegistry
    participant Executor as FunctionExecutor
    participant Bus as EventBus
    Planner->>Dispatcher: handleDispatch(action)
    rect rgba(255, 165, 0, 0.5)
        Note over Dispatcher: Cooldown Gate
        Dispatcher->>Dispatcher: _admitOrCooldown(action)
        alt Within cooldown window
            Dispatcher->>Dispatcher: log warning, bump telemetry
            Dispatcher-->>Planner: (dropped)
        end
    end
    rect rgba(100, 150, 255, 0.5)
        Note over Dispatcher: Target Registry Check
        Dispatcher->>Registry: resolve(skillHint, targets)
        Registry-->>Dispatcher: executor or null
        alt Target unregistered
            Dispatcher->>Dispatcher: log warning, bump target_unresolved
            Dispatcher-->>Planner: (dropped)
        end
    end
    rect rgba(144, 238, 144, 0.5)
        Note over Executor: Executor Dispatch
        Dispatcher->>Executor: _execute(skillRequest)
        Executor->>Bus: publish(message)
        Bus-->>Executor: ack
        Executor-->>Dispatcher: SkillResult
    end
    Dispatcher->>Dispatcher: record lastDispatchedAt
    Dispatcher-->>Planner: return
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 6
🧹 Nitpick comments (6)
.claude/commands/rollcall.md (1)
1-1: Good terminology refinement; consider hyphenation.

The updated description "(in-process DeepAgents, A2A external services)" is more precise and aligns well with the codebase terminology shown in the relevant code snippets.
Minor style note: As a compound adjective modifying "smoke test," "roll call" could be hyphenated as "roll-call."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/commands/rollcall.md at line 1, Update the sentence "Run the agent roll call smoke test. Reports status of all agents (in-process DeepAgents, A2A external services), AI infrastructure, media stack, and monitoring services." to hyphenate the compound adjective by changing "roll call" to "roll-call" (i.e., "Run the agent roll-call smoke test...") while preserving the parenthetical "(in-process DeepAgents, A2A external services)" and the rest of the description.

workspace/ceremonies/health-check.yaml (1)
5-12: Routing fix is correct; ceremony semantics are still misleading.

The target change to `quinn` is solid, but this config still executes `board_audit` under a service-health identity. Consider renaming/reframing this ceremony (or switching to an actual health-report skill) to avoid operator confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@workspace/ceremonies/health-check.yaml` around lines 5 - 12, Summary: The ceremony is routed correctly to quinn but its identity still reads as a service-health check while running the board_audit skill, which is misleading. Fix: change the ceremony identity or name and/or replace the skill to match intent so operators aren't confused — either rename this ceremony from a "service-health" framing to something that reflects a board audit, or swap the skill to an actual health-report skill; specifically update the YAML entries referencing the skill value "board_audit" and any metadata/label that implies "service-health" so the ceremony name/skill and the identity are consistent with each other and with the target "quinn".

__tests__/goal_evaluator_plugin.test.ts (1)
356-367: Capture `console.error` before install to make the startup-validator test robust.

If validator logging happens during `install()`, the current ordering can miss the signal and create a false-negative test.

Proposed test ordering tweak
```diff
- plugin = new GoalEvaluatorPlugin({ workspaceDir: join(TMP_DIR, "workspace") });
- plugin.install(bus);
-
  const errors: string[] = [];
  const origError = console.error;
  console.error = (...args: unknown[]) => {
    errors.push(args.map(a => (typeof a === "string" ? a : JSON.stringify(a))).join(" "));
  };
  try {
+   plugin = new GoalEvaluatorPlugin({ workspaceDir: join(TMP_DIR, "workspace") });
+   plugin.install(bus);
    bus.publish("world.state.updated", {
      id: "ws-2",
      correlationId: "ws-2",
```
Verify each finding against the current code and only fix it if needed. In `@__tests__/goal_evaluator_plugin.test.ts` around lines 356 - 367, The test captures console.error too late: move the console.error override so it runs before calling plugin.install(...) to ensure any validator logging during install is caught; specifically, in the test that constructs GoalEvaluatorPlugin and calls plugin.install(bus), set up the errors array and override console.error prior to invoking plugin.install, then restore console.error after bus.publish(...) completes, referencing GoalEvaluatorPlugin, install, console.error and bus.publish to locate the affected lines.

src/plugins/CeremonyPlugin.ts (1)
538-547: Consider clearing ceremony history when unregistering.

When a ceremony is disabled via hot-reload, `unregisterCeremony` removes it from `ceremonies` and `status` maps but — per the context snippet from `CeremonyStateExtension.ts:48-55` — does not clear `state.history` or `state.lastRun`. Old execution history may persist in snapshots, potentially affecting monitoring/reporting systems.

If stale history is acceptable (e.g., for audit trails), this is fine. Otherwise, consider extending `unregisterCeremony` to also clear history entries for the ceremony ID.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/plugins/CeremonyPlugin.ts` around lines 538 - 547, The unregisterCeremony path currently removes entries from the ceremonies and status maps but leaves per-ceremony execution data in CeremonyStateExtension (state.history and state.lastRun), so update the unregisterCeremony method to also clear any stored execution history and lastRun for the given ceremony id (e.g., remove state.history[id] and state.lastRun[id] or call the appropriate CeremonyStateExtension clear method) to avoid stale history persisting after a hot-disable; ensure you reference and clear the same keys used by CeremonyStateExtension.ts (state.history and state.lastRun) when unregistering.

src/plugins/__tests__/CeremonyPlugin.test.ts (1)
113-160: Consider increasing timing margins for CI stability.

The 5ms and 10ms delays may be insufficient on slow CI runners. Consider using slightly larger margins (e.g., 20-50ms) or a polling-based assertion pattern to reduce flakiness.
💡 Example polling pattern
```typescript
// Instead of fixed delays:
const waitFor = async (fn: () => boolean, timeoutMs = 100) => {
  const start = Date.now();
  while (!fn() && Date.now() - start < timeoutMs) {
    await new Promise(r => setTimeout(r, 5));
  }
};
await waitFor(() => !plugin.getCeremonies().map(c => c.id).includes("board.flipping"));
```
Verify each finding against the current code and only fix it if needed. In `@src/plugins/__tests__/CeremonyPlugin.test.ts` around lines 113 - 160, The test "CeremonyPlugin loader: hot-reload flips enabled→disabled (cancels timer, removes from registry)" is flaky due to very small fixed delays (setTimeout 5ms and 10ms); either replace those sleeps with a small polling helper (e.g., waitFor that repeatedly checks plugin.getCeremonies() or timers.has("board.flipping") until the condition is met) and await it after calling pluginAny._checkForChanges, or simply increase the delays to a safer margin (e.g., 20–50ms). Update references in the test to use the polling helper or larger timeouts around the pluginAny._checkForChanges calls that follow writeCeremony so CI stability improves.

src/planner/__tests__/validate-action-executors.test.ts (1)
148-156: Consider using a more explicit type cast.

The `bus as never` cast works but is unconventional. Using `as unknown as EventBus` would make the intent clearer to future readers.

♻️ Suggested change
```diff
- validateActionExecutors(actions, executors, { bus: bus as never });
+ validateActionExecutors(actions, executors, { bus: bus as unknown as EventBus });
```

Add the import at the top:

```typescript
import type { EventBus } from "../../../lib/types.ts";
```
Verify each finding against the current code and only fix it if needed. In `@src/planner/__tests__/validate-action-executors.test.ts` around lines 148 - 156, The test uses an unconventional cast "bus as never" when calling validateActionExecutors; instead import the EventBus type and cast the mock bus more explicitly (e.g., bus as unknown as EventBus) so intent is clear — update the test to add the import for EventBus and replace the "bus as never" cast with "as unknown as EventBus", referencing the local bus variable and the validateActionExecutors(...) call to locate where to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/commands/rollcall.md:
- Line 5: Replace the hard-coded absolute path in the rollcall command with a
repo-relative invocation so the script is portable; update the line that
currently runs `bash /home/josh/dev/protoWorkstacean/scripts/agent-rollcall.sh`
in .claude/commands/rollcall.md to call the repo-local script (e.g.,
`scripts/agent-rollcall.sh` or `./scripts/agent-rollcall.sh`) or use a repo-root
resolution (e.g., via git rev-parse --show-toplevel) so the command works from
the repository regardless of the developer's home directory or CI environment.
In `@lib/plugins/pr-remediator.ts`:
- Around line 1201-1214: The extracted hitl policy currently allows ttlMs values
like NaN or Infinity which later cause RangeError in _emitHitlApproval; update
_extractHitlPolicy to validate ttlMs by checking typeof p.ttlMs === "number" &&
Number.isFinite(p.ttlMs) && p.ttlMs >= 0 (optionally Math.floor if you want an
integer) before assigning out.ttlMs so only finite, non-negative numbers are
propagated to _emitHitlApproval.
In `@README.md`:
- Line 65: Add a dedicated environment variable entry for
WORKSTACEAN_INTERNAL_HOST in the env-table: describe it as the internal host
used for docker-network fallback, note its default value (workstacean) and how
it interacts with WORKSTACEAN_HTTP_PORT and WORKSTACEAN_PUBLIC_BASE_URL (i.e.,
used in the fallback
`http://${WORKSTACEAN_INTERNAL_HOST:-workstacean}:${WORKSTACEAN_HTTP_PORT}/a2a`),
so readers can easily find and override the internal host behavior.
In `@src/index.ts`:
- Around line 511-526: The wiring validation currently runs too early; move the
validateActionExecutors call (the block that imports
"./planner/validate-action-executors.js" and calls
validateActionExecutors(actionRegistry, executorRegistry, { bus, throwOnUnwired:
process.env.WORKSTACEAN_STRICT_WIRING === "1" })) so it executes after
loadWorkspacePlugins() completes, and also invoke the same
validateActionExecutors call from the config reload path that handles
TOPICS.CONFIG_RELOAD (so actions reloaded at runtime get validated against
executorRegistry). Ensure you reuse the same actionRegistry and executorRegistry
instances and preserve the existing logging/throwOnUnwired behavior.
- Around line 316-329: The pr-remediator skill executor is unconditionally
installed (name: "pr-remediator-skill-executor", factory: returns new
PrRemediatorSkillExecutorPlugin(executorRegistry)) causing actions to be routed
when the actual pr-remediator plugin is absent; change the registrar's condition
from () => true to the same gating predicate used by the real pr-remediator
plugin (e.g., reuse the GitHub-credentials check or the plugin presence check),
or make the condition consult the plugin registry
(pluginManager.has("pr-remediator") / similar) so the
PrRemediatorSkillExecutorPlugin is only registered when the real pr-remediator
plugin will be installed.
In `@workspace/agents.yaml`:
- Around line 90-95: The protomaker agent registration references AVA_API_KEY
but that env var is not documented or provisioned; add AVA_API_KEY to the
environment template and onboarding/provisioning docs/scripts so fresh or local
deployments export a placeholder or real key before registering the agent.
Update the onboarding guide and any env files (the env template used for local
dev / CI) to include AVA_API_KEY with a short note about its purpose and where
to obtain/set it, and ensure any deployment/provision scripts (used during
bootstrap) populate or validate AVA_API_KEY before the protomaker agent is
registered.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 95cf24cb-d206-4df0-bbe1-850bb259bae2
📒 Files selected for processing (45)

- .claude/commands/rollcall.md
- .env.dist
- README.md
- __tests__/goal_evaluator_plugin.test.ts
- docs/reference/ceremony-plugin.md
- docs/reference/env-vars.md
- docs/reference/http-api.md
- lib/plugins/pr-remediator.ts
- src/api/__tests__/agent-card.test.ts
- src/api/agent-card.ts
- src/config/env.ts
- src/index.ts
- src/loaders/ceremonyYamlLoader.ts
- src/planner/__tests__/end-to-end-loop.test.ts
- src/planner/__tests__/validate-action-executors.test.ts
- src/planner/types/action.ts
- src/planner/validate-action-executors.ts
- src/plugins/CeremonyPlugin.ts
- src/plugins/__tests__/CeremonyPlugin.test.ts
- src/plugins/__tests__/alert-skill-executor-plugin.test.ts
- src/plugins/__tests__/ceremony-skill-executor-plugin.test.ts
- src/plugins/__tests__/pr-remediator-skill-executor-plugin.test.ts
- src/plugins/action-dispatcher-plugin.test.ts
- src/plugins/action-dispatcher-plugin.ts
- src/plugins/agent-fleet-health-plugin.test.ts
- src/plugins/agent-fleet-health-plugin.ts
- src/plugins/alert-skill-executor-plugin.ts
- src/plugins/ceremony-skill-executor-plugin.ts
- src/plugins/goal_evaluator_plugin.ts
- src/plugins/pr-remediator-skill-executor-plugin.ts
- src/pr-remediator.test.ts
- src/schemas/yaml-schemas.ts
- src/telemetry/telemetry-service.ts
- src/world/extensions/CeremonyStateExtension.ts
- src/world/extensions/__tests__/CeremonyStateExtension.test.ts
- workspace/actions.yaml
- workspace/agents.yaml
- workspace/agents/ava.yaml
- workspace/ceremonies/agent-health.yaml
- workspace/ceremonies/board.cleanup.yaml
- workspace/ceremonies/board.health.yaml
- workspace/ceremonies/board.retro.yaml
- workspace/ceremonies/daily-standup.yaml
- workspace/ceremonies/health-check.yaml
- workspace/ceremonies/service-health.yaml
…l condition

`pr-remediator-skill-executor` was unconditionally installed (condition: () => true), but `pr-remediator` itself is gated on GitHub credentials. When creds are absent, dispatches to `pr.remediate.*` topics passed validation and the executor ran — but no subscriber consumed them, resulting in silent success with no actual work done.

Fix: apply the same condition guard used by `pr-remediator`:

    !!(process.env.QUINN_APP_PRIVATE_KEY || process.env.GITHUB_TOKEN)

Resolves GitHub issue #467 (CodeRabbit finding #4 from PR #466).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…geError (#467 finding #2) (#470)

* fix(pr-remediator): guard ttlMs against Infinity/NaN with Number.isFinite

  _extractHitlPolicy accepted any typeof "number" for ttlMs, which includes Infinity and NaN. Both pass the typeof check but cause new Date(Date.now() + ttlMs).toISOString() to throw a RangeError. Fix: add Number.isFinite(p.ttlMs) && p.ttlMs > 0 guard.

  Fixes finding #2 from GitHub issue #467 (CodeRabbit review on PR #466).

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(plugins): gate pr-remediator-skill-executor with GitHub credential condition

  `pr-remediator-skill-executor` was unconditionally installed (condition: () => true), but `pr-remediator` itself is gated on GitHub credentials. When creds are absent, dispatches to `pr.remediate.*` topics passed validation and the executor ran — but no subscriber consumed them, resulting in silent success with no actual work done. Fix: apply the same condition guard used by `pr-remediator`: !!(process.env.QUINN_APP_PRIVATE_KEY || process.env.GITHUB_TOKEN)

  Resolves GitHub issue #467 (CodeRabbit finding #4 from PR #466).

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
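The ttlMs guard can be sketched as below. The function name and object shape are simplified from `_extractHitlPolicy`; only the validation logic is taken from the commit:

```typescript
// Simplified sketch; the real _extractHitlPolicy handles more fields.
function extractTtlMs(p: { ttlMs?: unknown }): number | undefined {
  // typeof alone admits Infinity and NaN; both make
  // new Date(Date.now() + ttlMs).toISOString() throw a RangeError.
  if (typeof p.ttlMs === "number" && Number.isFinite(p.ttlMs) && p.ttlMs > 0) {
    return p.ttlMs;
  }
  return undefined;
}
```

With the guard in place, only finite positive values ever reach the `toISOString()` deadline computation.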
Closes the three open CodeRabbit findings from #466 not yet shipped (items #2 and #4 already landed via #470 and 9792b0c).

#467 finding #1 (.claude/commands/rollcall.md): Hard-coded operator path /home/josh/dev/... → relative path scripts/agent-rollcall.sh with explicit "from repo root" guidance. Works for any clone; matches every other repo-script reference.

#467 finding #3 (README.md env table): Add a dedicated env-table row for WORKSTACEAN_INTERNAL_HOST so it's discoverable by anyone overriding the docker-network default. Cross-references WORKSTACEAN_PUBLIC_BASE_URL.

#467 finding #5 (src/index.ts startup-validator): validateActionExecutors() ran BEFORE loadWorkspacePlugins(), so executor registrars shipped as workspace plugins were falsely flagged in strict mode. It also ran only once at startup, so config.reload of actions.yaml bypassed the fail-loud guard. Fix: extract a runWiringValidator(reason) helper, call it AFTER all plugin loading (core + registered + workspace), and re-run inside the CONFIG_RELOAD subscriber after loadActionsYaml(). New "[reload-validator]" log tag distinguishes the call sites.

Tests: 1054/1054 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
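The runWiringValidator(reason) pattern might look roughly like this. A sketch under stated assumptions: `makeWiringValidator` and `listUnwired` are illustrative names, not the real src/index.ts code:

```typescript
// Sketch of a shared validator helper: one closure, two call sites
// (startup after ALL plugin loading, and the config-reload subscriber).
function makeWiringValidator(
  listUnwired: () => string[], // cross-checks ActionRegistry vs ExecutorRegistry
  strict: boolean,             // e.g. WORKSTACEAN_STRICT_WIRING === "1"
) {
  return function runWiringValidator(reason: string): void {
    const unwired = listUnwired();
    if (unwired.length === 0) return;
    // fail-loud: name every gap and the call site that found it
    console.error(`[${reason}] unwired actions: ${unwired.join(", ")}`);
    if (strict) throw new Error(`unwired actions detected (${reason})`);
  };
}
```

Calling the same closure with a reason tag (e.g. "startup-validator" vs "reload-validator") is what lets the log distinguish the two call sites without duplicating the check.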
… (v0.7.22 candidate, re-cut) (#466) (#468)

* fix: extension URIs use proto-labs.ai (not protolabs.ai) (#407)

  All 27 references to https://protolabs.ai/a2a/ext/* changed to https://proto-labs.ai/a2a/ext/* to match the actual domain. These URIs are opaque identifiers (not published specs today) but should reference a domain we own. Breaking: external agents (Quinn, protoPen) whose cards declare the old URI will stop matching the registry until they update. Filed on Quinn to update her card.

* chore(release): bump to v0.7.20 (#408)

* chore: remove protoaudio + protovoice from agent rollcall (#410)

  Both services decommissioned. Containers stopped + removed. Only reference in protoWorkstacean was the rollcall script. Note: homelab-iac/stacks/ai/docker-compose.yml still has a worldmonitor network reference at line 521 + service at line 833. Needs separate cleanup in that repo.

* feat: upgrade web_search → searxng_search + give Ava fleet health tools (#411)

  Two changes:

  1. Replace the basic `web_search` tool (5 results, hardcoded engines) with `searxng_search` — adapted from rabbit-hole.io's full-surface SearXNG integration. New capabilities:
     - Category routing: general, news, science, it
     - Time range filtering: day, week, month, year
     - Bang syntax: !wp (Wikipedia), !scholar, !gh (GitHub)
     - Infoboxes, direct answers, suggestions in response
     - Configurable max_results (default 10, was 5)

     Updated in both bus-tools.ts (@protolabsai/sdk pattern) and deep-agent-executor.ts (LangChain pattern).

  2. Give Ava three fleet health tools she was missing:
     - get_ci_health — CI success rates across repos
     - get_pr_pipeline — open PRs, conflicts, staleness
     - get_incidents — security/ops incidents

     Ava can now answer fleet health questions directly instead of always delegating to Quinn. Ava's tool count: 10 → 13. Tool rename: web_search → searxng_search (greenfield, no backward compat alias).
* chore(projects): register protoAgent in projects.yaml (#414)

  protoAgent is the new GitHub Template repo that replaces per-agent A2A bootstrapping. Registers it as an active dev project owned by Quinn, matching the shape of existing entries. Plane / GitHub webhook / Discord provisioning remain TODO — those integrations aren't configured in this deployment, so the onboard plugin skipped them.

* chore(release): bump to v0.7.21 (#413)

* feat(ava): expand helm toolset, wire GOAP skills, fix ceremony disable bug (#415)

  Ava agent audit + overhaul:
  - Tools: 10 → 22 (direct observation, propose_config_change, incident reporting)
  - Skills: 3 → 7 (debug_ci_failures, fleet_incident_response, downshift_models, investigate_orphaned_skills)
  - System prompt rewritten: self-improvement instructions, escalation policy, GOAP-dispatch playbook
  - DeepAgentExecutor now applies skill-level systemPromptOverride (goal_proposal, diagnose_pr_stuck)
  - Fix ceremony loader bug: disabled ceremonies were filtered out, preventing hot-reload from cancelling timers
  - Clean up board.pr-audit.yaml (remove spurious action field, restore schedule, keep disabled)
  - Update docs: README, deep-agent runtime, agent-skills reference, self-improving loop

* fix(pr-remediator): close dispatch gap — self-dispatch + broaden auto-approve (#417)

  Two root causes prevented PRs from being auto-merged:

  1. Dispatch gap: tier_0 short-circuit in ActionDispatcherPlugin completed all actions immediately without dispatching to agent.skill.request. Every action in actions.yaml is tier_0, so the fireAndForget path (which publishes the skill request) was unreachable dead code. Fix: tier_0 now falls through when meta.fireAndForget is set.

  2. Approval gap: readyToMerge requires reviewState=approved, but auto-approve only covered dependabot/renovate/promote:/chore(deps. Human PRs, release PRs (chore(release), and github-actions PRs all lacked approved reviews and sat indefinitely.
  Fix: added app/github-actions to authors, chore(release, chore:, docs( to safe title prefixes. Additionally, PrRemediatorPlugin now self-dispatches remediation on every world.state.updated tick — checking for readyToMerge, dirty, failingCi, and changesRequested PRs directly from cached domain data. This removes the dependency on GOAP dispatch reaching the plugin via pr.remediate.* topics (which were never published in production after Arc 1.4 removed meta.topic routing).

* feat(goap): issue_zero domain + goals — track open GitHub issues across fleet (#419)

  Adds a github_issues domain that polls /repos/{repo}/issues?state=open for all managed projects and classifies by label (critical, bug, enhancement). Three GOAP goals enforce issue hygiene:
  - issues.zero_critical (critical severity, max: 0)
  - issues.zero_bugs (high severity, max: 0)
  - issues.total_low (medium severity, max: 5)

  Each goal has a matching alert action and a triage dispatch action that invokes Ava's new issue_triage skill. The skill instructs Ava to resolve, convert to board features, delegate, or close issues with rationale — driving toward zero open issues across all repos. Domain polls every 5 minutes (issue velocity is low, GitHub rate limits are a concern with 6+ repos).

* feat: manage_board list action (#247) + a2a.trace extension (#359) (#420)

  Two enhancements to reach issue zero:

  manage_board list (#247):
  - Added GET /api/board/features/list endpoint proxying to Studio
  - Added "list" action to manage_board tool with status filter
  - Ava can now query "show me all blocked features" directly

  a2a.trace extension (#359):
  - New langfuse-trace extension stamps a2a.trace metadata on all outbound A2A dispatches (traceId, callerAgent, skill, project)
  - Quinn reads this to link Langfuse traces across agent boundaries
  - Registered at startup alongside cost/confidence/blast extensions

  Closes #247, closes #359.
* fix(pr-remediator): case-insensitive auto-approve prefix matching (#421)

  * feat: manage_board list action (#247) + a2a.trace extension (#359)

    Two enhancements to reach issue zero:

    manage_board list (#247):
    - Added GET /api/board/features/list endpoint proxying to Studio
    - Added "list" action to manage_board tool with status filter
    - Ava can now query "show me all blocked features" directly

    a2a.trace extension (#359):
    - New langfuse-trace extension stamps a2a.trace metadata on all outbound A2A dispatches (traceId, callerAgent, skill, project)
    - Quinn reads this to link Langfuse traces across agent boundaries
    - Registered at startup alongside cost/confidence/blast extensions

    Closes #247, closes #359.

  * fix(pr-remediator): case-insensitive auto-approve prefix matching

    "Promote dev to main" titles start with capital P, but the prefix check was case-sensitive against "promote:". Now lowercases the title before matching so both "promote:" and "Promote" patterns are caught.

  ---------

* fix(skill-dispatcher): wire alert.* executors + startup validator (#426) (#427)

  Closes the structural gap where 6+ tier_0 fire-and-forget alert skills had no registered executor, causing SkillDispatcherPlugin to log "No executor found" and silently drop the dispatch on every GOAP planning cycle.

  - AlertSkillExecutorPlugin registers FunctionExecutors for all 24 bare alert.* actions in workspace/actions.yaml. Each translates the dispatch into a structured message.outbound.discord.alert event consumed by the existing WorldEngineAlertPlugin webhook routing.
  - validate-action-executors.ts cross-checks the loaded ActionRegistry against the live ExecutorRegistry at startup. Surfaces every gap as a HIGH-severity Discord alert (goal platform.skills_unwired) and a loud console.error. Set WORKSTACEAN_STRICT_WIRING=1 to crash startup instead.
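The lowercase-then-prefix-match fix above can be sketched as follows. The prefix list here is illustrative, assembled from prefixes mentioned in these commits; the plugin's real list differs:

```typescript
// Illustrative prefix list; the real auto-approve list lives in the plugin.
const SAFE_TITLE_PREFIXES = ["promote", "chore(release", "chore:", "docs("];

function isAutoApprovableTitle(title: string): boolean {
  // Lowercase once so "Promote dev to main" and "promote: ..." both match.
  const t = title.toLowerCase();
  return SAFE_TITLE_PREFIXES.some((p) => t.startsWith(p));
}
```

Lowercasing the title (rather than each prefix) keeps the prefix list canonical and makes the comparison a single pass.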
- action.issues_triage_bugs already routes correctly via meta.agentId=ava to the existing DeepAgentExecutor for Ava's issue_triage skill — no duplicate wiring needed (greenfield). * fix(ceremonies): stop world.state.# leak from ceremony snapshots (#428) CeremonyStateExtension was publishing { domain, data } envelopes on `world.state.snapshot` after every ceremony completion. GoalEvaluatorPlugin subscribes to `world.state.#`, treated the malformed payload as a WorldState, and emitted a "Selector ... not found" violation for every loaded goal on every ceremony tick (the cluster of 25+ violations at each :15/:30 boundary in the live container logs). All listed selectors (flow.efficiency.ratio, services.discord.connected, agent_health.agentCount, etc.) actually exist in the producer output — the goals are correct. Changes: - Move ceremony snapshot publish to `ceremony.state.snapshot` (off the world.state.# namespace). Leaves the existing CeremoniesState shape and consumers unchanged. - Goal evaluator: defensive payload shape check. Reject single-domain envelopes ({ domain, data }) and other non-WorldState payloads loud-once instead of generating one violation per goal. - Goal evaluator: startup selector validator. After the first valid world state arrives, walk every loaded goal's selector and HIGH-log any that doesn't resolve. Re-armed on goals.reload / config.reload so drift caught by future hot-reloads also surfaces. - Tests: regression guard that CeremonyStateExtension does not publish on world.state.#; goal evaluator ignores malformed payloads; validator catches an intentionally broken selector. Closes #424 * fix(rollcall): point /rollcall skill at in-repo script (closes #425) (#429) The Claude Code skill was calling the homelab-iac copy of agent-rollcall.sh, which had drifted from this repo's copy. 
The in-repo script knows about the in-process DeepAgent runtime (Ava, protoBot, Tuner) and the current A2A fleet; the homelab-iac copy still probed for the archived ava-agent container and the deprecated protoaudio/protovoice services. Single source of truth: this repo. The homelab-iac copy was separately synced in homelab-iac@64e8dcf.

* Promote dev to main (v0.7.21) (#418) (#423)

* fix: extension URIs use proto-labs.ai (not protolabs.ai) (#407)
* chore(release): bump to v0.7.20 (#408)
* chore: remove protoaudio + protovoice from agent rollcall (#410)
* feat: upgrade web_search → searxng_search + give Ava fleet health tools (#411)

* chore(projects): register protoAgent in projects.yaml (#414)

protoAgent is the new GitHub Template repo that replaces per-agent A2A bootstrapping. Registers it as an active dev project owned by Quinn, matching the shape of existing entries. Plane / GitHub webhook / Discord provisioning remain TODO — those integrations aren't configured in this deployment, so the onboard plugin skipped them.

* chore(release): bump to v0.7.21 (#413)

* feat(ava): expand helm toolset, wire GOAP skills, fix ceremony disable bug (#415)

Ava agent audit + overhaul:
- Tools: 10 → 22 (direct observation, propose_config_change, incident reporting)
- Skills: 3 → 7 (debug_ci_failures, fleet_incident_response, downshift_models, investigate_orphaned_skills)
- System prompt rewritten: self-improvement instructions, escalation policy, GOAP-dispatch playbook
- DeepAgentExecutor now applies skill-level systemPromptOverride (goal_proposal, diagnose_pr_stuck)
- Fix ceremony loader bug: disabled ceremonies were filtered out, preventing hot-reload from cancelling timers
- Clean up board.pr-audit.yaml (remove spurious action field, restore schedule, keep disabled)
- Update docs: README, deep-agent runtime, agent-skills reference, self-improving loop

* fix(pr-remediator): close dispatch gap — self-dispatch + broaden auto-approve (#417)

Two root causes prevented PRs from being auto-merged:
1. Dispatch gap: the tier_0 short-circuit in ActionDispatcherPlugin completed all actions immediately without dispatching to agent.skill.request. Every action in actions.yaml is tier_0, so the fireAndForget path (which publishes the skill request) was unreachable dead code. Fix: tier_0 now falls through when meta.fireAndForget is set.
2. Approval gap: readyToMerge requires reviewState=approved, but auto-approve only covered dependabot/renovate/`promote:`/`chore(deps`. Human PRs, release PRs (`chore(release`), and github-actions PRs all lacked approved reviews and sat indefinitely. Fix: added `app/github-actions` to authors and `chore(release`, `chore:`, `docs(` to safe title prefixes.

Additionally, PrRemediatorPlugin now self-dispatches remediation on every world.state.updated tick — checking for readyToMerge, dirty, failingCi, and changesRequested PRs directly from cached domain data. This removes the dependency on GOAP dispatch reaching the plugin via pr.remediate.* topics (which were never published in production after Arc 1.4 removed meta.topic routing).

* feat(goap): issue_zero domain + goals — track open GitHub issues across fleet (#419)
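The broadened auto-approve title check can be sketched as below. This is a hedged illustration, not the plugin's actual code: the prefix list and helper name are assumptions, and the real matching logic may differ; it only demonstrates the lowercase-before-prefix-match approach the commit messages describe.

```typescript
// Hypothetical sketch of the auto-approve safe-title check (#417/#421).
// Prefix list is illustrative, assembled from the prefixes named in the PR text.
const SAFE_TITLE_PREFIXES = ["promote", "chore(release", "chore:", "docs(", "chore(deps"];

// Lowercase the title first so "Promote dev to main" and "promote: ..." both match.
function isAutoApprovableTitle(title: string): boolean {
  const t = title.toLowerCase();
  return SAFE_TITLE_PREFIXES.some((prefix) => t.startsWith(prefix));
}
```

With a case-sensitive check against "promote:" alone, "Promote dev to main" would never match; lowercasing (and a bare "promote" prefix) covers both spellings.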
* feat: manage_board list action (#247) + a2a.trace extension (#359) (#420)
* fix(pr-remediator): case-insensitive auto-approve prefix matching (#421)

---------

* feat(ceremony): wire ceremony.security_triage + ceremony.service_health_discord (#431)

Adds CeremonySkillExecutorPlugin — registers FunctionExecutors that bridge GOAP `ceremony.*` actions to the matching `ceremony.<id>.execute` topic CeremonyPlugin already listens for. Without this bridge, SkillDispatcherPlugin dropped every dispatch with "No executor found …" and (post-#427) emitted HIGH platform.skills_unwired alerts every cycle.

Mirrors the alert-skill-executor-plugin pattern from #427 — explicit action→ceremony id mapping; install order matters (after registry, before skill-dispatcher). Partial fix for #430.

* fix(pr-remediator): wire 5 GOAP-dispatched skills + honor hitlPolicy (#432)

Closes the structural gap where 5 actions in workspace/actions.yaml route to handlers in PrRemediatorPlugin but had no registered executor:
- action.pr_update_branch → pr.remediate.update_branch
- action.pr_merge_ready → pr.remediate.merge_ready
- action.pr_fix_ci → pr.remediate.fix_ci
- action.pr_address_feedback → pr.remediate.address_feedback
- action.dispatch_backmerge → pr.backmerge.dispatch

Before this change, SkillDispatcherPlugin logged "No executor found" and dropped the dispatch every GOAP cycle. After PR #427's startup validator, the same gap raised platform.skills_unwired HIGH every tick.

Wiring follows the AlertSkillExecutorPlugin pattern from #427:
- PrRemediatorSkillExecutorPlugin registers FunctionExecutors that publish on the existing pr-remediator subscription topics, keeping "bus is the contract" — no plugin holds a reference to the other.
- Executors are fire-and-forget per actions.yaml meta. They return a successful SkillResult immediately; pr-remediator's handler runs asynchronously on the bus subscription.
- Install order matches alert-skill-executor: AFTER ExecutorRegistry construction, BEFORE skill-dispatcher.

For action.pr_merge_ready specifically, the meta.hitlPolicy (ttlMs: 1800000, onTimeout: approve) is now honoured. The executor forwards meta into the trigger payload; _handleMergeReady extracts it via _extractHitlPolicy and passes it to _emitHitlApproval, which populates HITLRequest.{ttlMs, onTimeout}. HITLPlugin already auto-publishes a synthetic approve response when onTimeout=approve fires.

Closes part of #430.
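The fire-and-forget bridge pattern shared by #427, #431, and #432 can be sketched as a tiny executor factory. All names here (Bus, SkillResult, the factory) are illustrative stand-ins, not the repo's actual types; the sketch only shows the "publish the trigger topic, report success immediately, let the real handler run on its own subscription" shape described above.

```typescript
// Hedged sketch of the "bus is the contract" executor bridge.
interface Bus {
  publish(topic: string, payload: unknown): void;
}

interface SkillResult {
  ok: boolean;
  detail?: string;
}

// Returns an executor that bridges a GOAP action to a bus topic.
// Neither plugin holds a reference to the other; the topic is the contract.
function makeFireAndForgetExecutor(bus: Bus, topic: string) {
  return (payload: unknown): SkillResult => {
    bus.publish(topic, payload);        // real handler consumes this asynchronously
    return { ok: true, detail: topic }; // report success immediately (fire-and-forget)
  };
}
```

Usage under this sketch: register `makeFireAndForgetExecutor(bus, "pr.remediate.merge_ready")` for action.pr_merge_ready; the dispatch succeeds at once while the remediation handler runs on its own bus subscription.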
Ceremony + protoMaker actions ship in a separate PR.

* feat(agents): register protomaker A2A agent (closes part of #430) (#433)

protoMaker (apps/server in protoLabsAI/ava) has been A2A-ready for a while — it serves agent-card.json with 10 skills, including the two referenced by unwired GOAP actions:
- action.protomaker_triage_blocked → skill board_health
- action.protomaker_start_auto_mode → skill auto_mode

Both actions targeted [protomaker], but no agent named "protomaker" was registered, so the dispatcher couldn't route. Adding the entry closes the routing gap; A2AExecutor's existing target-matching does the rest.

Endpoint: http://automaker-server:3008/a2a (verified from inside the workstacean container with AVA_API_KEY → JSON-RPC 2.0 response).

* fix(agents): drop overscoped subscribesTo from protomaker entry (#434)

#433 added `subscribesTo: message.inbound.github.#` to the new protomaker agent registration, copy-pasted from quinn's pattern. That was wrong: protoMaker is reached via explicit GOAP `targets: [protomaker]` dispatches (action.protomaker_triage_blocked, action.protomaker_start_auto_mode), not as a broadcast inbound listener. Quinn already subscribes to all GitHub inbound and dispatches `bug_triage` on protoMaker's behalf. Having protomaker subscribe to the same broadcast topic is one of the contributing paths to the duplicate-triage spam loop filed as protoLabsAI/protoMaker#3503 (the root cause is Quinn's handler not being idempotent — but this cleanup removes one extra firing path).

* fix(ava): rename web_search → searxng_search to match runtime tool name (#435)

PR #411 renamed the tool from web_search to searxng_search in src/agent-runtime/tools/bus-tools.ts (line 393), but ava.yaml still declared the old name. Result: at startup the runtime warns "agent ava declares unknown tools: web_search" and Ava ends up with no search capability — when asked, she explicitly responds "I don't have a searxng_search or web_search tool in my current toolkit."

This is the config side of the half-finished rename that PR #411 missed. After this lands and workstacean restarts, Ava's toolkit should include searxng_search and the unknown-tools warning should be empty for her.

* fix(goap): disable fleet_agent_stuck loop — fires every cycle, no dedup (#436)

Two GOAP actions on goal fleet.no_agent_stuck were spamming Discord because they had no `effects` and no cooldown:
- alert.fleet_agent_stuck → posts a Discord alert
- action.fleet_incident_response → dispatches Ava to file an incident, page on-call, and pause routing

Observed 2026-04-20: when auto-triage-sweep hit a 100% failure rate on bug_triage (cascading from the non-idempotent handler in protoLabsAI/protoMaker#3503), GOAP re-fired both actions every planning cycle. Ava filed INC-003 through INC-009 in ~30 seconds, each posting to Discord. The routing pause succeeded, but the rolling 1h failure-rate metric doesn't drop instantly, so the goal stayed violated and the loop kept re-firing.

Disabling both actions until proper dedup lands. Reinstate when:
1. action.fleet_incident_response gains a cooldown OR an `effects` marker that satisfies the goal for the cooldown window
2. Ava's fleet_incident_response skill checks for an existing open incident on the same agent before filing a new one
3. alert.fleet_agent_stuck gains per-agent rate limiting

* fix(goap): disable 4 more action.* loops — same no-cooldown bug as #436 (#451)

After #436 disabled action.fleet_incident_response, observed similar loop spam from FOUR more action.* dispatches that share the same architectural bug (no cooldown, no satisfying effects → re-fire every GOAP cycle while the goal stays violated):
- action.fleet_investigate_orphaned_skills — Ava posts orphaned-skill diagnosis to Discord ops on every cycle (8+ posts in 1 min observed)
- action.issues_triage_critical — ~447 fires in 30 min
- action.issues_triage_bugs — ~447 fires in 30 min; compounds with #3503 by re-triaging the same issues
- action.fleet_downshift_models — same pattern when cost exceeds budget

ALL share the same pattern as the alert.* actions and the original fleet_incident_response: effects: [] + no cooldownMs + persistent goal violation = infinite re-fire.

Mitigation only — Ava temporarily can't auto-investigate orphaned skills or auto-triage new GitHub issues. Reinstate when issue #437 ships action-level cooldown (meta.cooldownMs) or proper effects-with-TTL. The alert.* siblings (24 of them) also have this bug but are non-impacting today because DISCORD_WEBHOOK_ALERTS is unset and WorldEngineAlertPlugin drops them silently — fixing #437 will cover both at the dispatcher level.

* feat(goap): per-action cooldown in ActionDispatcher (closes #437) (#452)

GOAP actions with `effects: []` re-fired every planning cycle (~3s) while their goal stayed violated. Two prod fires (PR #436, PR #451) had to disable 6 actions before this landed. Restoring autonomy.

ActionDispatcherPlugin now honors `meta.cooldownMs` on every action. When an action with a positive cooldownMs fires, the dispatcher records its timestamp; subsequent dispatches of the same action id within the window are dropped BEFORE the WIP queue and BEFORE the executor. A single chokepoint covers both the alert.* (FunctionExecutor → Discord) and action.* (DeepAgent / A2A) paths. Drops log a fail-fast diagnostic with action id, age, and remaining window, plus bump a new `cooldown_dropped` telemetry event.

The cooldown bucket is keyed on action id alone — per-target keying isn't needed because each GOAP action targets one situation. Greenfield shape: absence of `meta.cooldownMs` naturally means "no cooldown"; no flag.

Defaults applied to workspace/actions.yaml:
- alert.* → 15 min
- action.* with skillHint → 30 min (agent work is expensive)
- action.pr_* → 5 min (remediation must stay responsive)
- ceremony.* → 30 min (treated as skill dispatch)
- action.dispatch_backmerge → none (the in-handler per-repo cooldown in pr-remediator stays authoritative)

Re-enables the 6 actions disabled by PRs #436 and #451: alert.fleet_agent_stuck, action.fleet_incident_response, action.fleet_downshift_models, action.fleet_investigate_orphaned_skills, action.issues_triage_critical, action.issues_triage_bugs.

Tests:
- action-dispatcher unit tests cover: blocks repeats within the window; A's cooldown does not affect B; window expiry admits the next dispatch; absent cooldownMs and cooldownMs <= 0 mean no throttling.
- End-to-end test spam-publishes 100 violations of fleet.no_skill_orphaned and asserts exactly 1 dispatch reaches the executor.
- bun test: 1023 / 1023 pass.
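The cooldown chokepoint can be sketched as a small gate keyed on action id. This is an illustrative reconstruction (class and method names are mine, not ActionDispatcherPlugin's actual code), assuming only the behavior stated above: record the timestamp on a successful fire, drop repeats inside the window, and treat an absent or non-positive cooldownMs as "no cooldown".

```typescript
// Hypothetical sketch of the meta.cooldownMs gate from #452.
class CooldownGate {
  private lastFired = new Map<string, number>();

  // Returns true if the dispatch may proceed; false means "drop before WIP queue".
  admit(actionId: string, cooldownMs: number | undefined, now = Date.now()): boolean {
    if (!cooldownMs || cooldownMs <= 0) return true; // no cooldown declared → no throttling
    const last = this.lastFired.get(actionId);
    if (last !== undefined && now - last < cooldownMs) return false; // inside window → drop
    this.lastFired.set(actionId, now); // record the fire; drops do NOT reset the window
    return true;
  }
}
```

Keying on action id alone matches the rationale above: each GOAP action targets one situation, so per-target buckets would add state without changing outcomes.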
* fix(ceremonies): route 4 ceremonies to right skill owner, disable 3 broken (#454)

An audit of all 10 workspace/ceremonies/*.yaml against the live agent skill registry surfaced 7 with skill-target mismatches that fail on every fire.

ROUTED to the correct skill owner:
- board.cleanup (skill=board_audit): targets [all] → [quinn]
- board.health (skill=board_health): targets [all] → [protomaker]
- daily-standup (skill=board_audit): targets [ava] → [quinn]
- health-check (skill=board_audit): targets [ava] → [quinn]

DISABLED (no agent advertises the skill):
- agent-health (skill=health_check) — no health_check anywhere
- board.retro (skill=pattern_analysis) — no pattern_analysis anywhere
- service-health (skill=health_check) — same as agent-health

Quinn owns board_audit (and bug_triage / pr_review / qa_report). Protomaker owns board_health (and 9 other apps/server skills). The two `health_check`-keyed ceremonies were redundant anyway — the agent_health and services world-state domains poll the same data every 60s and expose it on /api/agent-health and /api/services.

Re-enable the disabled three with a real skill if a periodic ANNOUNCE to Discord is wanted (a small `*_health_report` skill on protobot would do it). The workspace is bind-mounted, so the live container picks this up on restart with no rebuild.

* feat(goap): pre-dispatch target registry guard (closes #444) (#455)

The GOAP planner could dispatch a skill to a target agent (e.g. auto-triage-sweep, user) where the named target wasn't in the live ExecutorRegistry. The dispatcher fired anyway, the executor errored 404-style, and the failure cascaded into stuck work items + duplicate incident filings (INC-003 through INC-018, ~93 work items in error state). The cooldown work in #437 / #452 masked the symptom, but the structural gap remained.

ActionDispatcherPlugin now takes the shared ExecutorRegistry handle and runs `_admitOrTargetUnresolved` immediately after the cooldown check and BEFORE the WIP queue / executor. Same chokepoint pattern as cooldown:
- target absent (skill-routed) → admit
- target = "all" (broadcast sentinel) → admit
- target matches a registration agentName → admit
- target unresolvable → drop, log loud, telemetry bump `target_unresolved`

Drops surface the action id, the unresolvable target, AND the agents that DO exist, so the routing mistake is immediately diagnosable. The only opt-out is the broadcast sentinel "all" — there is no flag, no enabled-bool. Greenfield-strict shape.

Wiring: src/index.ts passes the shared executorRegistry into the dispatcher factory. Test fixtures that don't exercise target routing omit the registry and the check is skipped — so existing tests stay green without modification.

Audit of workspace/actions.yaml: only `protomaker` appears as an active agentId target (twice). That agent is registered in workspace/agents.yaml. The historical bad targets (`auto-triage-sweep`, `user`) were removed by prior cleanup; this PR ensures any future regression fails closed.

Tests added in src/plugins/action-dispatcher-plugin.test.ts:
- admits when target is registered
- drops when target is unregistered + bumps `target_unresolved`
- drops on mixed-target intent (single-target shape today via meta.agentId)
- admits when target is the "all" broadcast sentinel
- admits when meta.agentId is absent (skill-routed dispatch)
- admits when no registry is wired (legacy test fixtures)

bun test: 1029 / 1029 pass.

* fix(ceremony): honor enabled:false on initial load (closes #453) (#456)

PR #415 fixed the hot-reload path so flipping a ceremony to disabled cancelled its timer, but the initial-load path still added every YAML entry (disabled or not) to the in-memory registry. Two consequences:
1. External `ceremony.<id>.execute` triggers (from CeremonySkillExecutorPlugin's GOAP bridge) found the disabled ceremony in the registry and fired it anyway.
2. After a hot-reload flipped a ceremony enabled→disabled, the entry stayed in the registry — the same external-trigger leak.

Fix: filter `enabled === false` at every place that lands a ceremony in the registry (initial install, hot-reload new-file path, hot-reload changed-file path). Disabled ceremonies are loaded by the YAML parser (so the changed-file path can detect a flip) but never reach the registry, never schedule a timer, and cannot be resurrected by an external trigger. Operators see a `Skipping disabled ceremony: <id>` log line for each skip — fail-loud per project convention.

Greenfield: no flag, no toggle. enabled:false means disabled everywhere.

* fix(tests): wrap subscribe-spy push callbacks to return void (#457)

13 TS2322 errors snuck through #452 and #455 because the CI test job has been failing on type-check, while build-and-push (the gate that actually publishes :dev) is a separate workflow that runs on push. Result: main/dev were green for container publish even though tsc --noEmit was returning exit 2. Not visible in PR merge gates either, because test.conclusion=failure + build-and-push.conclusion=skipped still resolved to a mergeable state.

Pattern of the 13 errors:

    bus.subscribe(T, "spy", (m) => requests.push(m));
                                   ^^^^^^^^^^^^^^^^

Array.push() returns a number, but the subscribe callback expects void | Promise<void>. Fix: wrap the body in a block so the arrow returns void:

    bus.subscribe(T, "spy", (m) => { requests.push(m); });

Applied via a sed-style regex across both test files. 1029 tests still pass (bun test). `bun run tsc --noEmit` now exits 0.

* fix(fleet-health): filter outcomes from synthetic actors via registry whitelist (closes #459) (#460)

AgentFleetHealthPlugin now takes an optional ExecutorRegistry (mirrors the #455 wiring on ActionDispatcherPlugin). On each inbound autonomous.outcome, `systemActor` is checked against `executorRegistry.list().map(r => r.agentName)`:
- Registered agent (ava, quinn, protomaker, ...) → aggregated in agents[] (existing shape, no behavior change).
- Anything else (pr-remediator, auto-triage-sweep, goap, user, ...) → routed to a separate systemActors[] bucket. No longer pollutes agentCount / maxFailureRate1h / orphanedSkillCount, so Ava's sitreps stop surfacing plugin names as "stuck agents".

The first time a synthetic actor is seen, the plugin emits a one-time console.warn naming the actor + skill (fail-fast and loud, per policy) so operators know what's being filtered. No flag — same greenfield / chokepoint discipline as #437 (cooldown) and #444 (target guard).

Scope note on `_default`: this plugin keys on outcome `systemActor`, not registry `agentName`. `_default` only appears in /api/agent-health (the registry-driven view). Nothing currently publishes an outcome with `systemActor: "_default"`, so it doesn't reach agents[] here. If it ever did, the new whitelist would drop it to systemActors[] — the right outcome.

Verification plan (post-deploy):

    curl -s -X POST http://localhost:3000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"ava","messages":[{"role":"user","content":"fleet sitrep"}]}'

Expected: no `pr-remediator`, `auto-triage-sweep`, or `user` in agents[].

* fix(a2a): agent card advertises canonical A2A endpoint, not the dashboard URL (#462)

The card at /.well-known/agent-card.json was advertising a URL like http://ava:8081/a2a — host-mapped to the Astro dashboard, which 404s on /a2a. Spec-compliant clients (@a2a-js/sdk and friends) doing card discovery → POST to card.url could not reach the actual A2A endpoint; the voice agent team papered over this by switching to the /v1/chat/completions shim.

Fix: derive the card's `url` from variables that describe where the A2A endpoint actually lives.
1. WORKSTACEAN_PUBLIC_BASE_URL (e.g. https://ava.proto-labs.ai) → ${publicBase}/a2a. The canonical Cloudflare-fronted URL for external/Tailscale callers.
2. Otherwise, http://${WORKSTACEAN_INTERNAL_HOST ?? "workstacean"}:${WORKSTACEAN_HTTP_PORT}/a2a — docker-network service name + the actual API port.

Also populate `additionalInterfaces` with the JSON-RPC transport at the same URL so spec-compliant clients can pick deterministically. Drop the WORKSTACEAN_BASE_URL coupling in the card builder — that variable remains the externally-reachable URL stamped into A2A push-notification callbacks (a different concern, documented separately).

Closes #461

---------

Co-authored-by: Josh Mabry <31560031+mabry1985@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Josh <artificialcitizens@gmail.com>
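The two-step card-URL derivation in #462 can be sketched as follows. Only the environment variable names and the "workstacean" fallback host come from the PR text; the helper name, the trailing-slash trim, and the 3000 fallback port are assumptions of this sketch, not the card builder's actual code.

```typescript
// Hedged sketch of deriving the agent card's canonical A2A URL (#462).
function deriveAgentCardUrl(env: Record<string, string | undefined>): string {
  // 1. Cloudflare-fronted public base wins for external/Tailscale callers.
  const publicBase = env.WORKSTACEAN_PUBLIC_BASE_URL;
  if (publicBase) return `${publicBase.replace(/\/$/, "")}/a2a`;

  // 2. Otherwise: docker-network service name + the actual API port.
  //    (Port fallback "3000" is this sketch's assumption, not the PR's.)
  const host = env.WORKSTACEAN_INTERNAL_HOST ?? "workstacean";
  const port = env.WORKSTACEAN_HTTP_PORT ?? "3000";
  return `http://${host}:${port}/a2a`;
}
```

The key design point is that both branches describe where the A2A endpoint actually listens, rather than reusing a dashboard-mapped or push-notification-callback URL for card discovery.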
…ning (v0.7.23 candidate) (#475)

* fix: extension URIs use proto-labs.ai (not protolabs.ai) (#407)
* chore(release): bump to v0.7.20 (#408)
* chore: remove protoaudio + protovoice from agent rollcall (#410)
* feat: upgrade web_search → searxng_search + give Ava fleet health tools (#411)
* chore(projects): register protoAgent in projects.yaml (#414)
* chore(release): bump to v0.7.21 (#413)
* feat(ava): expand helm toolset, wire GOAP skills, fix ceremony disable bug (#415)
* fix(pr-remediator): close dispatch gap — self-dispatch + broaden auto-approve (#417)
* feat(goap): issue_zero domain + goals — track open GitHub issues across fleet (#419)
* feat: manage_board list action (#247) + a2a.trace extension (#359) (#420)
* fix(pr-remediator): case-insensitive auto-approve prefix matching (#421)
* fix(skill-dispatcher): wire alert.* executors + startup validator (#426) (#427)

(Full commit messages identical to the v0.7.21 / v0.7.22 promotions above.)

* fix(ceremonies): stop world.state.# leak from ceremony snapshots (#428)

CeremonyStateExtension was publishing { domain, data } envelopes on `world.state.snapshot` after every ceremony completion.
GoalEvaluatorPlugin subscribes to `world.state.#`, treated the malformed payload as a WorldState, and emitted a "Selector ... not found" violation for every loaded goal on every ceremony tick (the cluster of 25+ violations at each :15/:30 boundary in the live container logs). All listed selectors (flow.efficiency.ratio, services.discord.connected, agent_health.agentCount, etc.) actually exist in the producer output — the goals are correct. Changes: - Move ceremony snapshot publish to `ceremony.state.snapshot` (off the world.state.# namespace). Leaves the existing CeremoniesState shape and consumers unchanged. - Goal evaluator: defensive payload shape check. Reject single-domain envelopes ({ domain, data }) and other non-WorldState payloads loud-once instead of generating one violation per goal. - Goal evaluator: startup selector validator. After the first valid world state arrives, walk every loaded goal's selector and HIGH-log any that doesn't resolve. Re-armed on goals.reload / config.reload so drift caught by future hot-reloads also surfaces. - Tests: regression guard that CeremonyStateExtension does not publish on world.state.#; goal evaluator ignores malformed payloads; validator catches an intentionally broken selector. Closes #424 Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rollcall): point /rollcall skill at in-repo script (closes #425) (#429) The Claude Code skill was calling the homelab-iac copy of agent-rollcall.sh, which had drifted from this repo's copy. The in-repo script knows about the in-process DeepAgent runtime (Ava, protoBot, Tuner) and the current A2A fleet; the homelab-iac copy still probed for the archived ava-agent container and the deprecated protoaudio/protovoice services. Single source of truth: this repo. The homelab-iac copy was separately synced in homelab-iac@64e8dcf. 
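The defensive payload-shape check described for the goal evaluator in #428 might look like the following sketch. All names here (WorldState, isWorldStatePayload, acceptWorldState) are illustrative assumptions, not the repo's actual types:

```typescript
// Stand-in for the real WorldState type in the workstacean source.
type WorldState = Record<string, unknown> & { domains?: Record<string, unknown> };

// A single-domain envelope like { domain, data } is what CeremonyStateExtension
// used to leak onto world.state.# — it must not be treated as a WorldState.
function isWorldStatePayload(payload: unknown): payload is WorldState {
  if (payload === null || typeof payload !== "object") return false;
  const p = payload as Record<string, unknown>;
  // Reject the malformed single-domain envelope shape outright.
  if ("domain" in p && "data" in p && Object.keys(p).length === 2) return false;
  return true;
}

let warnedOnce = false;
function acceptWorldState(payload: unknown): WorldState | null {
  if (isWorldStatePayload(payload)) return payload;
  if (!warnedOnce) {
    // Loud-once, instead of one "Selector ... not found" violation per loaded goal.
    console.error("[goal-evaluator] rejected non-WorldState payload on world.state.#");
    warnedOnce = true;
  }
  return null;
}
```

The point of the shape check is that rejection happens once at the subscription boundary, before the per-goal selector walk ever runs.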
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Promote dev to main (v0.7.21) (#418) (#423)

* fix: extension URIs use proto-labs.ai (not protolabs.ai) (#407)
* chore(release): bump to v0.7.20 (#408)
* chore: remove protoaudio + protovoice from agent rollcall (#410)
* feat: upgrade web_search → searxng_search + give Ava fleet health tools (#411)
* chore(projects): register protoAgent in projects.yaml (#414)
* chore(release): bump to v0.7.21 (#413)
* feat(ava): expand helm toolset, wire GOAP skills, fix ceremony disable bug (#415)
* fix(pr-remediator): close dispatch gap — self-dispatch + broaden auto-approve (#417)
* feat(goap): issue_zero domain + goals — track open GitHub issues across fleet (#419)
* feat: manage_board list action (#247) + a2a.trace extension (#359) (#420)
* fix(pr-remediator): case-insensitive auto-approve prefix matching (#421)
--------- --------- Co-authored-by: Josh Mabry <31560031+mabry1985@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Josh <artificialcitizens@gmail.com> * feat(ceremony): wire ceremony.security_triage + ceremony.service_health_discord (#431) Adds CeremonySkillExecutorPlugin — registers FunctionExecutors that bridge GOAP `ceremony.*` actions to the matching `ceremony.<id>.execute` topic CeremonyPlugin already listens for. Without this bridge, SkillDispatcherPlugin dropped every dispatch with "No executor found …" and (post-#427) emitted HIGH platform.skills_unwired alerts every cycle. Mirrors the alert-skill-executor-plugin pattern from #427 — explicit action→ceremony id mapping, install order matters (after registry, before skill-dispatcher). Partial fix for #430. Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pr-remediator): wire 5 GOAP-dispatched skills + honor hitlPolicy (#432) Closes the structural gap where 5 actions in workspace/actions.yaml route to handlers in PrRemediatorPlugin but had no registered executor: - action.pr_update_branch → pr.remediate.update_branch - action.pr_merge_ready → pr.remediate.merge_ready - action.pr_fix_ci → pr.remediate.fix_ci - action.pr_address_feedback → pr.remediate.address_feedback - action.dispatch_backmerge → pr.backmerge.dispatch Before this change SkillDispatcherPlugin logged "No executor found" and dropped the dispatch every GOAP cycle. After PR #427's startup validator the same gap raised platform.skills_unwired HIGH every tick. 
Wiring follows the AlertSkillExecutorPlugin pattern from #427: - PrRemediatorSkillExecutorPlugin registers FunctionExecutors that publish on the existing pr-remediator subscription topics, keeping "bus is the contract" — no plugin holds a reference to the other. - Executors are fire-and-forget per actions.yaml meta. They return a successful SkillResult immediately; pr-remediator's handler runs asynchronously on the bus subscription. - Install order matches alert-skill-executor: AFTER ExecutorRegistry construction, BEFORE skill-dispatcher. For action.pr_merge_ready specifically, the meta.hitlPolicy (ttlMs: 1800000, onTimeout: approve) is now honoured. The executor forwards meta into the trigger payload; _handleMergeReady extracts it via _extractHitlPolicy and passes it to _emitHitlApproval, which populates HITLRequest.{ttlMs, onTimeout}. HITLPlugin already auto- publishes a synthetic approve response when onTimeout=approve fires. Closes part of #430. Ceremony + protoMaker actions ship in a separate PR. Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(agents): register protomaker A2A agent (closes part of #430) (#433) protoMaker (apps/server in protoLabsAI/ava) has been A2A-ready for a while — serves agent-card.json with 10 skills including the two referenced by unwired GOAP actions: - action.protomaker_triage_blocked → skill board_health - action.protomaker_start_auto_mode → skill auto_mode Both actions targeted [protomaker], but no agent named "protomaker" was registered, so the dispatcher couldn't route. Adding the entry closes the routing gap; A2AExecutor's existing target-matching does the rest. Endpoint: http://automaker-server:3008/a2a (verified from inside the workstacean container with AVA_API_KEY → JSON-RPC 2.0 response). 
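The executor-bridge pattern shared by #427, #431, and #432 — register a FunctionExecutor that publishes on a topic the owning plugin already subscribes to, and return success immediately — can be sketched as follows. The Bus, SkillResult, and topic names here are illustrative assumptions, not the repo's actual interfaces:

```typescript
type SkillResult = { ok: boolean; detail?: string };

// Minimal in-process pub/sub stand-in for the real message bus.
class Bus {
  private subs = new Map<string, Array<(msg: unknown) => void>>();
  subscribe(topic: string, fn: (msg: unknown) => void): void {
    const list = this.subs.get(topic) ?? [];
    list.push(fn);
    this.subs.set(topic, list);
  }
  publish(topic: string, msg: unknown): void {
    for (const fn of this.subs.get(topic) ?? []) fn(msg);
  }
}

// The bridge: each GOAP action id maps to a bus topic the owning plugin
// already listens on. The executor publishes and reports success right away
// ("bus is the contract" — neither plugin holds a reference to the other).
function makeBridgeExecutor(bus: Bus, topic: string): (payload: unknown) => SkillResult {
  return (payload) => {
    bus.publish(topic, payload);
    return { ok: true, detail: `dispatched to ${topic} (fire-and-forget)` };
  };
}
```

The fire-and-forget shape matters: the SkillResult only asserts that the dispatch reached the bus, while the subscribed handler does the real work asynchronously.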
Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(agents): drop overscoped subscribesTo from protomaker entry (#434) #433 added `subscribesTo: message.inbound.github.#` to the new protomaker agent registration, copy-pasted from quinn's pattern. That was wrong: protoMaker is reached via explicit GOAP `targets: [protomaker]` dispatches (action.protomaker_triage_blocked, action.protomaker_start_auto_mode), not as a broadcast inbound listener. Quinn already subscribes to all GitHub inbound and dispatches `bug_triage` on protoMaker's behalf. Having protomaker subscribe to the same broadcast topic is one of the contributing paths to the duplicate-triage spam loop filed as protoLabsAI/protoMaker#3503 (the root cause is Quinn's handler not being idempotent — but this cleanup removes one extra firing path). Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ava): rename web_search → searxng_search to match runtime tool name (#435) PR #411 renamed the tool from web_search to searxng_search in src/agent-runtime/tools/bus-tools.ts (line 393), but ava.yaml still declared the old name. Result: at startup the runtime warns "agent ava declares unknown tools: web_search" and Ava ends up with no search capability — when asked, she explicitly responds "I don't have a searxng_search or web_search tool in my current toolkit." This is the config side of the half-finished rename that PR #411 missed. After this lands and workstacean restarts, Ava's toolkit should include searxng_search and the unknown-tools warning should be empty for her. 
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(goap): disable fleet_agent_stuck loop — fires every cycle, no dedup (#436)

Two GOAP actions on goal fleet.no_agent_stuck were spamming Discord because they had no `effects` and no cooldown:

- alert.fleet_agent_stuck → posts a Discord alert
- action.fleet_incident_response → dispatches Ava to file an incident, page on-call, and pause routing

Observed 2026-04-20: when auto-triage-sweep hit 100% failure rate on bug_triage (cascading from the non-idempotent handler in protoLabsAI/protoMaker#3503), GOAP re-fired both actions every planning cycle. Ava filed INC-003 through INC-009 in ~30 seconds, each posting to Discord. The pause routing succeeded but the rolling 1h failure rate metric doesn't drop instantly, so the goal stayed violated and the loop kept re-firing.

Disabling both actions until proper dedup lands. Reinstate when:

1. action.fleet_incident_response gains a cooldown OR an `effects` marker that satisfies the goal for the cooldown window
2. Ava's fleet_incident_response skill checks for an existing open incident on the same agent before filing a new one
3. alert.fleet_agent_stuck gains per-agent rate limiting

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(goap): disable 4 more action.* loops — same no-cooldown bug as #436 (#451)

After #436 disabled action.fleet_incident_response, observed similar loop spam from FOUR more action.* dispatches that share the same architectural bug (no cooldown, no satisfying effects → re-fire every GOAP cycle while goal stays violated):

- action.fleet_investigate_orphaned_skills — Ava posts orphaned-skill diagnosis to Discord ops on every cycle (8+ posts in 1 min observed)
- action.issues_triage_critical — ~447 fires in 30 min
- action.issues_triage_bugs — ~447 fires in 30 min; compounds with #3503 by re-triaging same issues
- action.fleet_downshift_models — same pattern when cost exceeds budget

ALL share the same pattern as the alert.* actions and the original fleet_incident_response: effects: [] + no cooldownMs + persistent goal violation = infinite re-fire.

Mitigation only — Ava temporarily can't auto-investigate orphaned skills or auto-triage new GitHub issues. Reinstate when issue #437 ships action-level cooldown (meta.cooldownMs) or proper effects-with-TTL.

The alert.* siblings (24 of them) also have this bug but are non-impacting today because DISCORD_WEBHOOK_ALERTS is unset and WorldEngineAlertPlugin drops them silently — fixing #437 will cover both at the dispatcher level.

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(goap): per-action cooldown in ActionDispatcher (closes #437) (#452)

GOAP actions with `effects: []` re-fired every planning cycle (~3s) while their goal stayed violated. Two prod fires (PR #436, PR #451) had to disable 6 actions before this lands. Restoring autonomy.

ActionDispatcherPlugin now honors `meta.cooldownMs` on every action.
When an action with a positive cooldownMs fires, the dispatcher records its timestamp; subsequent dispatches of the same action id within the window are dropped BEFORE the WIP queue and BEFORE the executor. Single chokepoint covers both alert.* (FunctionExecutor → Discord) and action.* (DeepAgent / A2A) paths.

Drops log a fail-fast diagnostic with action id, age, and remaining window, plus bump a new `cooldown_dropped` telemetry event. Cooldown bucket is keyed on action id alone — per-target keying isn't needed because each GOAP action targets one situation. Greenfield shape: absence of `meta.cooldownMs` means "no cooldown" naturally, no flag.

Defaults applied to workspace/actions.yaml:

- alert.* → 15 min
- action.* with skillHint → 30 min (agent work is expensive)
- action.pr_* → 5 min (remediation must stay responsive)
- ceremony.* → 30 min (treated as skill dispatch)
- action.dispatch_backmerge → none (in-handler per-repo cooldown in pr-remediator stays authoritative)

Re-enables the 6 actions disabled by PRs #436 and #451: alert.fleet_agent_stuck, action.fleet_incident_response, action.fleet_downshift_models, action.fleet_investigate_orphaned_skills, action.issues_triage_critical, action.issues_triage_bugs.

Tests:

- action-dispatcher unit tests cover: blocks repeats within window; A's cooldown does not affect B; window expiry admits next dispatch; absent cooldownMs and cooldownMs<=0 mean no throttling.
- End-to-end test spam-publishes 100 violations of fleet.no_skill_orphaned and asserts exactly 1 dispatch reaches the executor.
- bun test: 1023 / 1023 pass.
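The cooldown chokepoint described above can be sketched roughly as follows. CooldownGate and ActionMeta are illustrative names, not the plugin's actual types:

```typescript
type ActionMeta = { cooldownMs?: number };

class CooldownGate {
  private lastFired = new Map<string, number>();

  // Returns true when the dispatch should be DROPPED (still inside the window).
  // Called before the WIP queue and before any executor lookup.
  shouldDrop(actionId: string, meta: ActionMeta, now = Date.now()): boolean {
    const windowMs = meta.cooldownMs;
    // Absence of meta.cooldownMs (or a non-positive value) means no cooldown.
    if (windowMs === undefined || windowMs <= 0) return false;
    const last = this.lastFired.get(actionId);
    if (last !== undefined && now - last < windowMs) {
      // Fail-fast diagnostic: action id, age, remaining window.
      console.warn(
        `[dispatcher] cooldown_dropped ${actionId}: age=${now - last}ms, ` +
        `remaining=${windowMs - (now - last)}ms`,
      );
      return true;
    }
    this.lastFired.set(actionId, now);
    return false;
  }
}
```

Keying the bucket on action id alone keeps the gate a single map lookup per dispatch, which is what lets one chokepoint cover both the alert.* and action.* paths.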
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ceremonies): route 4 ceremonies to right skill owner, disable 3 broken (#454)

Audit of all 10 workspace/ceremonies/*.yaml against the live agent skill registry surfaced 7 with skill-target mismatches that fail every fire:

ROUTED to correct skill owner:
- board.cleanup   skill=board_audit   targets [all] → [quinn]
- board.health    skill=board_health  targets [all] → [protomaker]
- daily-standup   skill=board_audit   targets [ava] → [quinn]
- health-check    skill=board_audit   targets [ava] → [quinn]

DISABLED (no agent advertises the skill):
- agent-health    skill=health_check — no health_check anywhere
- board.retro     skill=pattern_analysis — no pattern_analysis anywhere
- service-health  skill=health_check — same as agent-health

Quinn owns board_audit (and bug_triage / pr_review / qa_report). Protomaker owns board_health (and 9 other apps/server skills).

The two `health_check`-keyed ceremonies were redundant anyway — the agent_health and services world-state domains poll the same data every 60s and expose it on /api/agent-health and /api/services. Re-enable the disabled three with a real skill if a periodic ANNOUNCE to Discord is wanted (small `*_health_report` skill on protobot would do it).

Workspace is bind-mounted, so the live container picks this up on restart with no rebuild.

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(goap): pre-dispatch target registry guard (closes #444) (#455)

The GOAP planner could dispatch a skill to a target agent (e.g. auto-triage-sweep, user) where the named target wasn't in the live ExecutorRegistry. The dispatcher fired anyway, the executor errored 404-style, and the failure cascaded into stuck work items + duplicate incident filings (INC-003 through INC-018, ~93 work items in error state).
The cooldown work in #437 / #452 masked the symptom but the structural gap remained. ActionDispatcherPlugin now takes the shared ExecutorRegistry handle and runs `_admitOrTargetUnresolved` immediately after the cooldown check and BEFORE the WIP queue / executor. Same chokepoint pattern as cooldown: - target absent (skill-routed) → admit - target = "all" (broadcast sentinel) → admit - target matches a registration agentName → admit - target unresolvable → drop, log loud, telemetry bump `target_unresolved` Drops surface action id, the unresolvable target, AND the agents that DO exist so the routing mistake is immediately diagnosable. The only opt-out is the broadcast sentinel "all" — there is no flag, no enabled-bool. Greenfield-strict shape. Wiring: src/index.ts passes the shared executorRegistry into the dispatcher factory. Test fixtures that don't exercise target routing omit the registry and the check is skipped — so existing tests stay green without modification. Audit of workspace/actions.yaml: only `protomaker` appears as an active agentId target (twice). That agent is registered in workspace/agents.yaml. The historical bad targets (`auto-triage-sweep`, `user`) were removed by prior cleanup; this PR ensures any future regression fails closed. Tests added in src/plugins/action-dispatcher-plugin.test.ts: - admits when target is registered - drops when target is unregistered + bumps `target_unresolved` - drops on mixed-target intent (single-target shape today via meta.agentId) - admits when target is the "all" broadcast sentinel - admits when meta.agentId is absent (skill-routed dispatch) - admits when no registry is wired (legacy test fixtures) bun test: 1029 / 1029 pass. 
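The admit/drop decision at the heart of the target guard above can be sketched like this. Registration and the function name are illustrative assumptions; only the decision logic is modeled:

```typescript
type Registration = { agentName: string };

function admitOrTargetUnresolved(
  target: string | undefined,
  registrations: Registration[],
): boolean {
  // Skill-routed dispatch (no explicit agent target) → admit.
  if (target === undefined) return true;
  // Broadcast sentinel → admit. This is the only opt-out; there is no flag.
  if (target === "all") return true;
  // Target matches a live registration → admit.
  if (registrations.some((r) => r.agentName === target)) return true;
  // Drop loud: name the unresolvable target AND the agents that DO exist,
  // so the routing mistake is immediately diagnosable from the log line.
  console.error(
    `[dispatcher] target_unresolved "${target}"; registered agents: ` +
    registrations.map((r) => r.agentName).join(", "),
  );
  return false;
}
```

Failing closed here means a future actions.yaml regression (another `auto-triage-sweep` or `user` target) dies at the chokepoint instead of cascading into stuck work items.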
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ceremony): honor enabled:false on initial load (closes #453) (#456)

PR #415 fixed the hot-reload path so flipping a ceremony to disabled cancelled its timer, but the initial-load path still added every YAML entry (disabled or not) to the in-memory registry. Two consequences:

1. External `ceremony.<id>.execute` triggers (from CeremonySkillExecutorPlugin's GOAP bridge) found the disabled ceremony in the registry and fired it anyway.
2. After hot-reload flipped a ceremony enabled→disabled, the entry stayed in the registry — same external-trigger leak.

Fix: filter `enabled === false` at every place that lands a ceremony in the registry (initial install, hot-reload new-file path, hot-reload changed-file path). Disabled ceremonies are loaded by the YAML parser (so the changed-file path can detect a flip) but never reach the registry, never schedule a timer, and cannot be resurrected by an external trigger.

Operators see a `Skipping disabled ceremony: <id>` log line for each skip — fail-loud per project convention. Greenfield: no flag, no toggle. enabled:false means disabled everywhere.

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): wrap subscribe-spy push callbacks to return void (#457)

13 TS2322 errors snuck through #452 and #455 because the CI test job has been failing on type-check while build-and-push (the gate that actually publishes :dev) is a separate workflow that runs on push. Result: main/dev were green for container publish even though tsc --noEmit was returning exit 2. Not visible in PR merge gates either because test.conclusion=failure + build-and-push.conclusion=skipped still resolved to a mergeable state.
Pattern of the 13 errors:

    bus.subscribe(T, "spy", (m) => requests.push(m));

Array.push() returns number, but the subscribe callback expects void | Promise<void>. Fix: wrap the body in a block so the arrow returns void:

    bus.subscribe(T, "spy", (m) => { requests.push(m); });

Applied via sed-style regex across both test files. 1029 tests still pass (bun test). `bun run tsc --noEmit` now exits 0.

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(fleet-health): filter outcomes from synthetic actors via registry whitelist (closes #459) (#460)

AgentFleetHealthPlugin now takes an optional ExecutorRegistry (mirrors the #455 wiring on ActionDispatcherPlugin). On each inbound autonomous.outcome, `systemActor` is checked against `executorRegistry.list().map(r => r.agentName)`:

- Registered agent (ava, quinn, protomaker, ...) → aggregated in agents[] (existing shape, no behavior change).
- Anything else (pr-remediator, auto-triage-sweep, goap, user, ...) → routed to a separate systemActors[] bucket. No longer pollutes agentCount / maxFailureRate1h / orphanedSkillCount, so Ava's sitreps stop surfacing plugin names as "stuck agents".

First time a synthetic actor is seen, the plugin emits a one-time console.warn naming the actor + skill (fail-fast and loud, per policy) so operators know what's being filtered. No flag — same greenfield / chokepoint discipline as #437 (cooldown) and #444 (target guard).

Scope note on `_default`: this plugin keys on outcome `systemActor`, not registry `agentName`. `_default` only appears in /api/agent-health (the registry-driven view). Nothing currently publishes an outcome with `systemActor: "_default"`, so it doesn't reach agents[] here. If it ever did, the new whitelist would drop it to systemActors[] — the right outcome.
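The registry-whitelist split described above might look like the following sketch. Outcome, FleetHealthBuckets, and the bucket shapes are illustrative assumptions, not the plugin's actual types:

```typescript
type Outcome = { systemActor: string; skill: string; success: boolean };

class FleetHealthBuckets {
  agents = new Map<string, Outcome[]>();
  systemActors = new Map<string, Outcome[]>();
  private warned = new Set<string>();

  ingest(outcome: Outcome, registeredAgents: string[]): void {
    const isAgent = registeredAgents.includes(outcome.systemActor);
    const bucket = isAgent ? this.agents : this.systemActors;
    if (!isAgent && !this.warned.has(outcome.systemActor)) {
      // One-time loud warn so operators know what is being filtered.
      console.warn(
        `[fleet-health] filtering synthetic actor "${outcome.systemActor}" ` +
        `(skill=${outcome.skill}) out of agents[]`,
      );
      this.warned.add(outcome.systemActor);
    }
    const list = bucket.get(outcome.systemActor) ?? [];
    list.push(outcome);
    bucket.set(outcome.systemActor, list);
  }
}
```

Because synthetic actors still land in systemActors[] rather than being discarded, their outcomes stay observable without skewing the agent-level aggregates.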
Verification plan (post-deploy):

    curl -s -X POST http://localhost:3000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"ava","messages":[{"role":"user","content":"fleet sitrep"}]}'

Expected: no `pr-remediator`, `auto-triage-sweep`, or `user` in agents[].

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(a2a): agent card advertises canonical A2A endpoint, not the dashboard URL (#462)

The card at /.well-known/agent-card.json was advertising a URL like http://ava:8081/a2a — host-mapped to the Astro dashboard, which 404s on /a2a. Spec-compliant clients (@a2a-js/sdk and friends) doing card discovery → POST to card.url could not reach the actual A2A endpoint; the voice agent team papered over this by switching to the /v1/chat/completions shim.

Fix: derive the card's `url` from variables that describe where the A2A endpoint actually lives.

1. WORKSTACEAN_PUBLIC_BASE_URL (e.g. https://ava.proto-labs.ai) → ${publicBase}/a2a. The canonical Cloudflare-fronted URL for external/Tailscale callers.
2. Otherwise, http://${WORKSTACEAN_INTERNAL_HOST ?? "workstacean"}:${WORKSTACEAN_HTTP_PORT}/a2a — docker-network service name + the actual API port.

Also populate `additionalInterfaces` with the JSON-RPC transport at the same URL so spec-compliant clients can pick deterministically.

Drop the WORKSTACEAN_BASE_URL coupling in the card builder — that variable remains the externally-reachable URL stamped into A2A push-notification callbacks (different concern, separate documentation).

Closes #461

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pr-remediator): never auto-close promotion PRs on decomposable verdict (closes #465) (#469)

Two-layer fix per the issue body:

Layer A — Ava's diagnose_pr_stuck prompt (workspace/agents/ava.yaml): The promotion-PR rule is now stated FIRST, above the four verdict definitions.
When the head is dev/staging OR base is main/staging OR title starts "Promote"/"promote:", the verdict is always rebasable — drift between branches is fixed with a back-merge of base into head, not by splitting the PR. Phrased in positive framing per the no-negative-reinforcement memory. Layer B — code guard (lib/plugins/pr-remediator.ts): New isPromotionPr() helper, called at the case "decomposable" handler chokepoint before the close+comment path. On promotion PRs the guard warns loudly (naming head/base/title), escalates to HITL via the existing _emitStuckHitlEscalation pathway, and returns. PrDomainEntry gains an optional headRef field; src/api/github.ts surfaces pr.head.ref in the pr_pipeline domain so the guard can see it. _dispatchDiagnose also adds "Head branch:" to the prompt payload so Ava sees the same field. Same defense-in-depth shape as #437 (cooldown), #444 (target registry), #459 (synthetic actor filter): invariants live at the action-site chokepoint, not at the planner. A drifting prompt cannot close a release-pipeline PR. Tests: 4 new cases in src/pr-remediator.test.ts cover refactor PR (still closes), dev→main / staging→main / dev→staging promotion PRs (escalate to HITL, do NOT close). Full suite 1047/1047 pass. Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pr-remediator): Number.isFinite guard on ttlMs — Infinity/NaN RangeError (#467 finding #2) (#470) * fix(pr-remediator): guard ttlMs against Infinity/NaN with Number.isFinite _extractHitlPolicy accepted any typeof "number" for ttlMs, which includes Infinity and NaN. Both pass the typeof check but cause new Date(Date.now() + ttlMs).toISOString() to throw a RangeError. Fix: add Number.isFinite(p.ttlMs) && p.ttlMs > 0 guard. Fixes finding #2 from GitHub issue #467 (CodeRabbit review on PR #466). 
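The Number.isFinite guard described above (finding #2) is small enough to show in full as a sketch. The HitlPolicy shape and function name are assumptions taken from the commit text:

```typescript
type HitlPolicy = { ttlMs?: number; onTimeout?: "approve" | "reject" };

// typeof ttlMs === "number" alone admits Infinity and NaN, and
// new Date(Date.now() + Infinity).toISOString() throws a RangeError.
function extractTtlDeadline(p: HitlPolicy, now = Date.now()): string | undefined {
  const ttl = p.ttlMs;
  if (typeof ttl !== "number" || !Number.isFinite(ttl) || ttl <= 0) return undefined;
  return new Date(now + ttl).toISOString();
}
```

Returning undefined for the bad shapes lets the caller fall back to "no deadline" instead of crashing mid-escalation.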
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(plugins): gate pr-remediator-skill-executor with GitHub credential condition

`pr-remediator-skill-executor` was unconditionally installed (condition: () => true), but `pr-remediator` itself is gated on GitHub credentials. When creds are absent, dispatches to `pr.remediate.*` topics passed validation and the executor ran — but no subscriber consumed them, resulting in silent success with no actual work done.

Fix: apply the same condition guard used by `pr-remediator`:

!!(process.env.QUINN_APP_PRIVATE_KEY || process.env.GITHUB_TOKEN)

Resolves GitHub issue #467 (CodeRabbit finding #4 from PR #466).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(a2a-server): default to [ava] target + sanitize upstream HTML errors (closes #471) (#473)

Two bugs reported by protoVoice against ava.proto-labs.ai/a2a:

1. message/send with no metadata routed to protoBot (the router's default chat agent) instead of Ava. This endpoint is Ava's — the orchestrator card aggregates the fleet's skills, but routing defaults here must be Ava. Callers targeting another agent pass explicit metadata.targets. Logs an info line on the default path so operators can see the fallback firing.

2. When a downstream A2A sub-call was misrouted (e.g. upstream protoLabsAI/protoMaker#3536 — broken card URL), the raw HTML 404 page bubbled through the bus response's content field and was rendered as the assistant's reply text. BusAgentExecutor now detects HTML-looking payloads (<!DOCTYPE, <html, Cannot POST/GET, 404 Not Found), treats them as failures, logs the raw payload at warn level for debugging, and replaces the final message text with a sanitized operator-facing string that includes a short stripped debug hint.

No flag, no opt-out — greenfield; the new behavior is the only behavior.
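The HTML-payload detection in bug 2 can be sketched as follows. This is a hedged illustration built from the markers named in the commit message; the real BusAgentExecutor logic, marker list, and message wording are assumptions.

```typescript
// Markers the commit names for HTML-looking payloads; case-insensitive.
const HTML_MARKERS = ["<!doctype", "<html", "cannot post", "cannot get", "404 not found"];

function looksLikeHtmlError(payload: string): boolean {
  // Only sniff the head of the payload — error pages declare themselves early.
  const head = payload.trimStart().slice(0, 512).toLowerCase();
  return HTML_MARKERS.some((m) => head.includes(m));
}

// Replace an HTML-looking reply with an operator-facing string carrying a
// short stripped debug hint (tags removed, whitespace collapsed, truncated).
function sanitizeReply(payload: string): string {
  if (!looksLikeHtmlError(payload)) return payload;
  const hint = payload.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim().slice(0, 120);
  return `Upstream agent call failed (received an HTML error page). Debug hint: ${hint}`;
}
```

In the real plugin the raw payload is also logged at warn level before being replaced, so the sanitized string is what the caller sees while the full page stays in the logs.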
Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(467): clear remaining hardening items from CodeRabbit PR #466 review

Closes the three open CodeRabbit findings from #466 not yet shipped (items #2 and #4 already landed via #470 and 9792b0c).

#467 finding #1 (.claude/commands/rollcall.md): hard-coded operator path /home/josh/dev/... → relative path scripts/agent-rollcall.sh with explicit "from repo root" guidance. Works for any clone; matches every other repo-script reference.

#467 finding #3 (README.md env table): add a dedicated env-table row for WORKSTACEAN_INTERNAL_HOST so it's discoverable by anyone overriding the docker-network default. Cross-references WORKSTACEAN_PUBLIC_BASE_URL.

#467 finding #5 (src/index.ts startup-validator): validateActionExecutors() ran BEFORE loadWorkspacePlugins(), so executor registrars shipped as workspace plugins were falsely flagged in strict mode. It also ran only once at startup, so a config.reload of actions.yaml bypassed the fail-loud guard. Fix: extract a runWiringValidator(reason) helper, call it AFTER all plugin loading (core + registered + workspace), and re-run it inside the CONFIG_RELOAD subscriber after loadActionsYaml(). A new "[reload-validator]" log tag distinguishes the call sites.

Tests: 1054/1054 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Promote dev → main: agent card + fleet-health filter + ceremony fixes (v0.7.22 candidate, re-cut) (#466) (#468)

* fix: extension URIs use proto-labs.ai (not protolabs.ai) (#407)

All 27 references to https://protolabs.ai/a2a/ext/* changed to https://proto-labs.ai/a2a/ext/* to match the actual domain. These URIs are opaque identifiers (not published specs today) but should reference a domain we own.

Breaking: external agents (Quinn, protoPen) whose cards declare the old URI will stop matching the registry until they update. Filed on Quinn to update her card.
* chore(release): bump to v0.7.20 (#408)

* chore: remove protoaudio + protovoice from agent rollcall (#410)

Both services decommissioned. Containers stopped + removed. Only reference in protoWorkstacean was the rollcall script.

Note: homelab-iac/stacks/ai/docker-compose.yml still has a worldmonitor network reference at line 521 + a service at line 833. Needs separate cleanup in that repo.

* feat: upgrade web_search → searxng_search + give Ava fleet health tools (#411)

Two changes:

1. Replace the basic `web_search` tool (5 results, hardcoded engines) with `searxng_search` — adapted from rabbit-hole.io's full-surface SearXNG integration. New capabilities:
   - Category routing: general, news, science, it
   - Time range filtering: day, week, month, year
   - Bang syntax: !wp (Wikipedia), !scholar, !gh (GitHub)
   - Infoboxes, direct answers, suggestions in response
   - Configurable max_results (default 10, was 5)

   Updated in both bus-tools.ts (@protolabsai/sdk pattern) and deep-agent-executor.ts (LangChain pattern).

2. Give Ava three fleet health tools she was missing:
   - get_ci_health — CI success rates across repos
   - get_pr_pipeline — open PRs, conflicts, staleness
   - get_incidents — security/ops incidents

   Ava can now answer fleet health questions directly instead of always delegating to Quinn. Ava's tool count: 10 → 13.

Tool rename: web_search → searxng_search (greenfield, no backward-compat alias).

* chore(projects): register protoAgent in projects.yaml (#414)

protoAgent is the new GitHub Template repo that replaces per-agent A2A bootstrapping. Registers it as an active dev project owned by Quinn, matching the shape of existing entries. Plane / GitHub webhook / Discord provisioning remain TODO — those integrations aren't configured in this deployment, so the onboard plugin skipped them.
* chore(release): bump to v0.7.21 (#413)

* feat(ava): expand helm toolset, wire GOAP skills, fix ceremony disable bug (#415)

Ava agent audit + overhaul:
- Tools: 10 → 22 (direct observation, propose_config_change, incident reporting)
- Skills: 3 → 7 (debug_ci_failures, fleet_incident_response, downshift_models, investigate_orphaned_skills)
- System prompt rewritten: self-improvement instructions, escalation policy, GOAP-dispatch playbook
- DeepAgentExecutor now applies skill-level systemPromptOverride (goal_proposal, diagnose_pr_stuck)
- Fix ceremony loader bug: disabled ceremonies were filtered out, preventing hot-reload from cancelling timers
- Clean up board.pr-audit.yaml (remove spurious action field, restore schedule, keep disabled)
- Update docs: README, deep-agent runtime, agent-skills reference, self-improving loop

* fix(pr-remediator): close dispatch gap — self-dispatch + broaden auto-approve (#417)

Two root causes prevented PRs from being auto-merged:

1. Dispatch gap: the tier_0 short-circuit in ActionDispatcherPlugin completed all actions immediately without dispatching to agent.skill.request. Every action in actions.yaml is tier_0, so the fireAndForget path (which publishes the skill request) was unreachable dead code. Fix: tier_0 now falls through when meta.fireAndForget is set.

2. Approval gap: readyToMerge requires reviewState=approved, but auto-approve only covered dependabot/renovate/promote:/chore(deps. Human PRs, release PRs (chore(release), and github-actions PRs all lacked approved reviews and sat indefinitely. Fix: added app/github-actions to authors, and chore(release, chore:, docs( to safe title prefixes.

Additionally, PrRemediatorPlugin now self-dispatches remediation on every world.state.updated tick — checking for readyToMerge, dirty, failingCi, and changesRequested PRs directly from cached domain data.
This removes the dependency on GOAP dispatch reaching the plugin via pr.remediate.* topics (which were never published in production after Arc 1.4 removed meta.topic routing).

* feat(goap): issue_zero domain + goals — track open GitHub issues across fleet (#419)

Adds a github_issues domain that polls /repos/{repo}/issues?state=open for all managed projects and classifies by label (critical, bug, enhancement). Three GOAP goals enforce issue hygiene:
- issues.zero_critical (critical severity, max: 0)
- issues.zero_bugs (high severity, max: 0)
- issues.total_low (medium severity, max: 5)

Each goal has a matching alert action and a triage dispatch action that invokes Ava's new issue_triage skill. The skill instructs Ava to resolve, convert to board features, delegate, or close issues with rationale — driving toward zero open issues across all repos.

The domain polls every 5 minutes (issue velocity is low, and GitHub rate limits are a concern with 6+ repos).

* feat: manage_board list action (#247) + a2a.trace extension (#359) (#420)

Two enhancements to reach issue zero:

manage_board list (#247):
- Added GET /api/board/features/list endpoint proxying to Studio
- Added "list" action to manage_board tool with status filter
- Ava can now query "show me all blocked features" directly

a2a.trace extension (#359):
- New langfuse-trace extension stamps a2a.trace metadata on all outbound A2A dispatches (traceId, callerAgent, skill, project)
- Quinn reads this to link Langfuse traces across agent boundaries
- Registered at startup alongside cost/confidence/blast extensions

Closes #247, closes #359.
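The a2a.trace stamping in #420 can be sketched as below. Only the four field names (traceId, callerAgent, skill, project) and the `a2a.trace` metadata key come from the commit text; the dispatch envelope shape and helper name are assumptions for illustration.

```typescript
// Fields the langfuse-trace extension stamps, per the commit message.
interface TraceMeta {
  traceId: string;
  callerAgent: string;
  skill: string;
  project: string;
}

// Hypothetical outbound A2A dispatch envelope — not the real SDK type.
interface A2ADispatch {
  method: string;
  params: Record<string, unknown>;
  metadata?: Record<string, unknown>;
}

// Attach a2a.trace metadata to an outbound dispatch so a downstream agent
// (Quinn) can link Langfuse traces across agent boundaries.
function stampTrace(dispatch: A2ADispatch, trace: TraceMeta): A2ADispatch {
  return {
    ...dispatch,
    metadata: { ...(dispatch.metadata ?? {}), "a2a.trace": trace },
  };
}
```

Stamping happens on every outbound dispatch, so cross-agent trace linkage needs no cooperation from the callee beyond echoing metadata.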
* fix(pr-remediator): case-insensitive auto-approve prefix matching (#421)

"Promote dev to main" titles start with a capital P, but the prefix check was case-sensitive against "promote:". The title is now lowercased before matching, so both "promote:" and "Promote" patterns are caught.

---------

* fix(skill-dispatcher): wire alert.* executors + startup validator (#426) (#427)

Closes the structural gap where 6+ tier_0 fire-and-forget alert skills had no registered executor, causing SkillDispatcherPlugin to log "No executor found" and silently drop the dispatch on every GOAP planning cycle.

- AlertSkillExecutorPlugin registers FunctionExecutors for all 24 bare alert.* actions in workspace/actions.yaml. Each translates the dispatch into a structured message.outbound.discord.alert event consumed by the existing WorldEngineAlertPlugin webhook routing.
- validate-action-executors.ts cross-checks the loaded ActionRegistry against the live ExecutorRegistry at startup. Surfaces every gap as a HIGH-severity Discord alert (goal platform.skills_unwired) and a loud console.error. Set WORKSTACEAN_STRICT_WIRING=1 to crash startup instead.
- action.issues_triage_bugs already routes correctly via meta.agentId=ava to the existing DeepAgentExecutor for Ava's issue_triage skill — no duplicate wiring needed (greenfield).

* fix(ceremonies): stop world.state.# leak from ceremony snapshots (#428)

CeremonyStateExtension was publishing { domain, data } envelopes on `world.state.snapshot` after every ceremony completion. GoalEvaluatorPlugin subscribes to `world.state.#`, treated the malformed payload as a WorldState, and emitted a "Selector ... not found" violation for every loaded goal on every ceremony tick (the cluster of 25+ violations at each :15/:30 boundary in the live container logs). All listed selectors (flow.efficiency.ratio, services.discord.connected, agent_health.agentCount, etc.) actually exist in the producer output — the goals are correct.

Changes:
- Move the ceremony snapshot publish to `ceremony.state.snapshot` (off the world.state.# namespace). Leaves the existing CeremoniesState shape and consumers unchanged.
- Goal evaluator: defensive payload shape check. Reject single-domain envelopes ({ domain, data }) and other non-WorldState payloads loud-once instead of generating one violation per goal.
- Goal evaluator: startup selector validator. After the first valid world state arrives, walk every loaded goal's selector and HIGH-log any that doesn't resolve. Re-armed on goals.reload / config.reload so drift caught by future hot-reloads also surfaces.
- Tests: regression guard that CeremonyStateExtension does not publish on world.state.#; goal evaluator ignores malformed payloads; validator catches an intentionally broken selector.

Closes #424

* fix(rollcall): point /rollcall skill at in-repo script (closes #425) (#429)

The Claude Code skill was calling the homelab-iac copy of agent-rollcall.sh, which had drifted from this repo's copy.
The in-repo script knows about the in-process DeepAgent runtime (Ava, protoBot, Tuner) and the current A2A fleet; the homelab-iac copy still probed for the archived ava-agent container and the deprecated protoaudio/protovoice services. Single source of truth: this repo. The homelab-iac copy was separately synced in homelab-iac@64e8dcf.

* Promote dev to main (v0.7.21) (#418) (#423)

---------

* feat(ceremony): wire ceremony.security_triage + ceremony.service_health_discord (#431)

Adds CeremonySkillExecutorPlugin — registers FunctionExecutors that bridge GOAP `ceremony.*` actions to the matching `ceremony.<id>.execute` topic CeremonyPlugin already listens for.
Without this bridge, SkillDispatcherPlugin dropped every dispatch with "No executor found …" and (post-#427) emitted HIGH platform.skills_unwired alerts every cycle. Mirrors the alert-skill-executor-plugin pattern from #427 — explicit action→ceremony id mapping; install order matters (after registry, before skill-dispatcher). Partial fix for #430.

* fix(pr-remediator): wire 5 GOAP-dispatched skills + honor hitlPolicy (#432)

Closes the structural gap where 5 actions in workspace/actions.yaml route to handlers in PrRemediatorPlugin but had no registered executor:
- action.pr_update_branch → pr.remediate.update_branch
- action.pr_merge_ready → pr.remediate.merge_ready
- action.pr_fix_ci → pr.remediate.fix_ci
- action.pr_address_feedback → pr.remediate.address_feedback
- action.dispatch_backmerge → pr.backmerge.dispatch

Before this change, SkillDispatcherPlugin logged "No executor found" and dropped the dispatch every GOAP cycle. After PR #427's startup validator, the same gap raised platform.skills_unwired HIGH every tick.

Wiring follows the AlertSkillExecutorPlugin pattern from #427:
- PrRemediatorSkillExecutorPlugin registers FunctionExecutors that publish on the existing pr-remediator subscription topics, keeping "the bus is the contract" — no plugin holds a reference to the other.
- Executors are fire-and-forget per actions.yaml meta. They return a successful SkillResult immediately; pr-remediator's handler runs asynchronously on the bus subscription.
- Install order matches alert-skill-executor: AFTER ExecutorRegistry construction, BEFORE skill-dispatcher.

For action.pr_merge_ready specifically, the meta.hitlPolicy (ttlMs: 1800000, onTimeout: approve) is now honoured. The executor forwards meta into the trigger payload; _handleMergeReady extracts it via _extractHitlPolicy and passes it to _emitHitlApproval, which populates HITLRequest.{ttlMs, onTimeout}. HITLPlugin already auto-publishes a synthetic approve response when onTimeout=approve fires.

Closes part of #430.
Ceremony + protoMaker actions ship in a separate PR.

* feat(agents): register protomaker A2A agent (closes part of #430) (#433)

protoMaker (apps/server in protoLabsAI/ava) has been A2A-ready for a while — it serves agent-card.json with 10 skills, including the two referenced by unwired GOAP actions:
- action.protomaker_triage_blocked → skill board_health
- action.protomaker_start_auto_mode → skill auto_mode

Both actions targeted [protomaker], but no agent named "protomaker" was registered, so the dispatcher couldn't route. Adding the entry closes the routing gap; A2AExecutor's existing target-matching does the rest.

Endpoint: http://automaker-server:3008/a2a (verified from inside the workstacean container with AVA_API_KEY → JSON-RPC 2.0 response).

* fix(agents): drop overscoped subscribesTo from protomaker entry (#434)

#433 added `subscribesTo: message.inbound.github.#` to the new protomaker agent registration, copy-pasted from quinn's pattern. That was wrong: protoMaker is reached via explicit GOAP `targets: [protomaker]` dispatches (action.protomaker_triage_blocked, action.protomaker_start_auto_mode), not as a broadcast inbound listener. Quinn already subscribes to all GitHub inbound and dispatches `bug_triage` on protoMaker's behalf. Having protomaker subscribe to the same broadcast topic is one of the contributing paths to the duplicate-triage spam loop filed as protoLabsAI/protoMaker#3503 (the root cause is Quinn's handler not being idempotent — but this cleanup remove…
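The executor-bridge pattern shared by #427, #431 and #432 — a FunctionExecutor that publishes the dispatch onto the topic the owning plugin already subscribes to, then reports success immediately — can be sketched as below. All names here (Bus, SkillResult, registerBridge) are illustrative, not the real @protolabsai/sdk API.

```typescript
interface SkillResult { ok: boolean; note?: string }
type Executor = (payload: unknown) => SkillResult;

// Minimal stand-in for the message bus.
interface Bus {
  publish(topic: string, payload: unknown): void;
}

// Map each GOAP action id to the topic its plugin listens on,
// e.g. "action.pr_merge_ready" -> "pr.remediate.merge_ready".
function registerBridge(
  bus: Bus,
  executors: Map<string, Executor>,
  actionToTopic: Record<string, string>,
): void {
  for (const [actionId, topic] of Object.entries(actionToTopic)) {
    executors.set(actionId, (payload) => {
      // Fire-and-forget: the subscriber does the real work asynchronously.
      // The bus is the contract — no plugin holds a reference to the other.
      bus.publish(topic, payload);
      return { ok: true, note: `dispatched to ${topic}` };
    });
  }
}
```

The design choice this illustrates: the dispatcher only sees a registered executor and an immediate success, while the owning plugin keeps its existing bus subscription as the sole integration surface.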
Re-cut of #463 after back-merge #464 cleared the merge-base drift. Full PR list (17 fixes since v0.7.21):
All landed and verified live on :dev for 24+ hours. Discord ops silent for the duration (cooldown + registry guard holding).
HITL-gated per protocol — operator approval required. Do not auto-merge.
Note: pr-remediator's auto-close-on-decomposable misfired on the original #463 (filed as #465). The code fix has not yet merged, so this re-cut is at risk of being auto-closed again if a new conflict surfaces while it's open. Mitigation: the back-merge should have cleared the drift, so this PR should be conflict-free at open time.
Summary by CodeRabbit
Release Notes
New Features
- `WORKSTACEAN_PUBLIC_BASE_URL` and `WORKSTACEAN_INTERNAL_HOST` control externally-advertised agent endpoint URLs.

Bug Fixes
Documentation
Configuration
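The endpoint derivation these configuration variables control (described in the agent-card fix above) can be sketched as below. The env var names and precedence come from the PR; the helper name, the "workstacean" host default, and the port fallback are assumptions.

```typescript
// Hypothetical sketch of the agent card's `url` derivation.
function deriveCardUrl(env: Record<string, string | undefined>): string {
  // 1. Public base wins: the canonical Cloudflare-fronted URL
  //    for external/Tailscale callers.
  const publicBase = env.WORKSTACEAN_PUBLIC_BASE_URL;
  if (publicBase) return `${publicBase.replace(/\/$/, "")}/a2a`;

  // 2. Otherwise: docker-network service name + the actual API port.
  //    The "8080" fallback is an assumption, not from the PR.
  const host = env.WORKSTACEAN_INTERNAL_HOST ?? "workstacean";
  const port = env.WORKSTACEAN_HTTP_PORT ?? "8080";
  return `http://${host}:${port}/a2a`;
}
```

Either branch yields a URL that actually serves the A2A endpoint, so spec-compliant clients doing card discovery → POST to card.url no longer land on the dashboard.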