feat: brand-voice + output classifier guard plugin by furukama · Pull Request #408 · HybridAIOne/hybridclaw

furukama · 2026-04-26T09:50:21Z

Summary

Problem: Marketers' #1 fear with agents is "the bot said something dumb to a client." Roadmap item #8 calls for a brand-voice + output classifier that gates final responses before they ship.
Why it matters: Off-brand language reaches the user only after the agent has already produced it. Without a gate, every drift incident is irreversible.
What changed: Added an output-guard extension point to the plugin SDK and a brand-voice plugin that uses it. The gateway now runs registered guards on the finalized resultText before recordSuccessfulTurn/audit, so the gated text is what the user sees, what's persisted, and what's logged. Companion brand-voice SKILL added.
What did not change: No changes to memory, approval policy, audit hash chain, container/IPC, or other plugins. Plugins without an output guard are unaffected — applyOutputGuards is a no-op when no guards are registered.

Change Type

Feature
Tests

Linked Context

Roadmap item #8 (brand-voice + output classifier)

Validation

npm run typecheck            # pass
npm run lint                 # pass
npm run check                # pass
npx vitest run --config vitest.unit.config.ts \
  tests/brand-voice-plugin.test.ts \
  tests/plugin-manager.test.ts \
  tests/gateway-service.plugins.test.ts
# 65/65 pass

Verified manually: rule-based block (banned phrase + missing required), clean-output allow-through, mocked Anthropic rewrite path with x-api-key header propagation, fallback-to-block when mode=rewrite but rewriter.provider=none, mode=flag stays transparent, guard-pipeline priority ordering and short-circuit on block, error isolation when a guard throws.
Edge cases checked: empty/short text bypass via minLength, classifier returns non-parseable JSON (logged + ignored), rewriter returns text that still violates rules (escalates to block), aux-model API errors (default failureMode: allow so a flaky API never hard-blocks the agent).
Skipped checks: full unit suite has 9 pre-existing failures on main (host-runner.worker-restart, host-runner.redaction, eval-command, admin-terminal). Verified identical failure set on main — unrelated to this PR. Integration/e2e/live not run; no integration surface touched.
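Two of the edge cases listed above (the `minLength` bypass and the non-parseable classifier reply) can be sketched as follows. This is a minimal sketch, not the plugin's actual code: `parseVerdict`, `shouldInspect`, and the `{ onBrand, reasons }` verdict shape are hypothetical names chosen for illustration. Only the `minLength` option and the "logged + ignored" behavior come from the description above.

```javascript
// Hypothetical helper: returns a verdict object, or null when the
// classifier reply cannot be used (the caller then ignores it).
function parseVerdict(raw, log = console.warn) {
  try {
    const value = JSON.parse(raw);
    if (value && typeof value === 'object' && typeof value.onBrand === 'boolean') {
      return {
        onBrand: value.onBrand,
        reasons: Array.isArray(value.reasons) ? value.reasons : [],
      };
    }
    log('classifier verdict has an unexpected shape; ignoring');
  } catch (error) {
    log(`classifier returned non-JSON output; ignoring: ${error.message}`);
  }
  return null;
}

// Hypothetical helper: text shorter than minLength bypasses the guard.
function shouldInspect(text, minLength) {
  return text.trim().length >= minLength;
}
```

Returning `null` instead of throwing keeps a malformed classifier reply from affecting the rule-based result.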

Docs And Config Impact

README, docs, or examples updated — new SKILL at skills/brand-voice/SKILL.md; new plugin manifest with configSchema + configUiHints so the admin console renders config correctly.
Config or environment behavior changed — only when the user opts in by enabling the plugin.
Templates or workspace bootstrap files changed
No docs or config impact

Risk Notes

Security-sensitive paths touched? No. The guard only inspects/rewrites the agent's outgoing text; it cannot escalate permissions, bypass approvals, or change tool execution.
Gateway, audit, approval, or container boundaries touched? Gateway: yes (response finalization in gateway-chat-service.ts). Audit: indirect — the guard runs before audit/recordSuccessfulTurn so audit reflects gated text, which is the desired behavior. Approval/container: no.
Failure mode: the entire guard pipeline is wrapped in a try/catch in the gateway; if it errors, the original output ships and a warning is logged. Individual guard errors are caught and skipped inside applyOutputGuards. A misconfigured plugin (e.g. rewriter API down) defaults to failureMode: allow. Block is the only mode that replaces user-facing text, and only when a guard explicitly returns {action: 'block'}.
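The two layers of error handling described above can be sketched roughly like this. It is an illustrative stand-in, not the `PluginManager` code: `runGuardPipeline`, `finalizeResultText`, the `{ id, priority, inspect }` guard shape, and the `replacement` field on a block decision are all hypothetical, and the assumption that a lower `priority` runs first is mine (the PR does not state the direction).

```javascript
// Hypothetical pipeline over guards shaped like { id, priority, inspect }.
// Assumption: lower `priority` runs first.
async function runGuardPipeline(guards, text, log = console.warn) {
  if (guards.length === 0) return { text, blocked: false }; // no-op path
  const ordered = [...guards].sort((a, b) => a.priority - b.priority);
  let current = text;
  for (const guard of ordered) {
    let decision;
    try {
      decision = await guard.inspect({ text: current });
    } catch (error) {
      // Error isolation: a throwing guard is skipped, not fatal.
      log(`guard ${guard.id} failed, skipping: ${error.message}`);
      continue;
    }
    if (!decision || decision.action === 'allow') continue;
    if (decision.action === 'rewrite') {
      current = decision.text;
    } else if (decision.action === 'block') {
      // Short-circuit: later guards never see a blocked response.
      return { text: decision.replacement, blocked: true };
    }
  }
  return { text: current, blocked: false };
}

// Gateway-side wrapper: if the pipeline itself errors, ship the original.
async function finalizeResultText(guards, resultText, log = console.warn) {
  try {
    return (await runGuardPipeline(guards, resultText, log)).text;
  } catch (error) {
    log(`output guard pipeline failed, shipping original: ${error.message}`);
    return resultText;
  }
}
```

In this sketch only an explicit `block` decision replaces the text; every error path falls back to shipping what the agent produced.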

Evidence

New or updated test coverage — tests/brand-voice-plugin.test.ts (7 tests) covers config, rules, block/allow/rewrite/flag modes and the Anthropic rewrite path; tests/plugin-manager.test.ts adds priority-order, short-circuit-on-block, and guard-error-isolation tests for the pipeline mechanics.

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds a new plugin SDK extension point for “output guards” and introduces a bundled brand-voice output-guard plugin that can allow/rewrite/block finalized assistant responses before they’re persisted/logged/shown.

Changes:

Extend plugin types + API + SDK exports to support output-guard plugins and guard registration.
Implement guard registration/storage and an applyOutputGuards() pipeline in PluginManager, plus gateway integration to run guards on finalized resultText.
Add bundled brand-voice plugin + skill docs, along with targeted unit tests for plugin-manager ordering/error isolation and brand-voice behavior.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

File	Description
tests/plugin-manager.test.ts	Adds tests for guard ordering, block short-circuiting, and guard error isolation.
tests/gateway-service.plugins.test.ts	Updates gateway plugin-manager mock to include output-guard methods.
tests/brand-voice-plugin.test.ts	New unit tests covering brand-voice config/rules and allow/block/rewrite/flag behavior.
src/plugins/plugin-types.ts	Introduces `output-guard` plugin kind and guard-related types + API surface.
src/plugins/plugin-sdk.ts	Re-exports new output-guard types via the plugin SDK.
src/plugins/plugin-manager.ts	Adds guard registration, snapshot rollback support, and the guard pipeline implementation.
src/plugins/plugin-api.ts	Exposes `registerOutputGuard()` to plugins.
src/gateway/gateway-chat-service.ts	Applies output guards to finalized response text before citations/audit/recording.
skills/brand-voice/SKILL.md	Documents the companion skill and how it interacts with guard modes.
plugins/brand-voice/src/rules.js	Implements rule-based detection + summaries for banned/required phrases and patterns.
plugins/brand-voice/src/llm.js	Adds aux-model calls + classifier verdict parsing helpers.
plugins/brand-voice/src/index.js	Registers the brand-voice output guard and a `brand-voice` command.
plugins/brand-voice/src/guard.js	Implements the guard decision logic (allow/flag/block/rewrite + failure-mode handling).
plugins/brand-voice/src/config.js	Normalizes/validates config, compiles patterns, resolves voice file, builds voice brief.
plugins/brand-voice/package.json	Adds bundled plugin package metadata (private, node engine, ESM).
plugins/brand-voice/hybridclaw.plugin.yaml	Adds plugin manifest, credentials, configSchema, and admin UI hints.

Copilot · 2026-04-26T09:54:58Z

+  const bannedPatternStrings = normalizeStringArray(rawConfig?.bannedPatterns);
+  const bannedPatterns = bannedPatternStrings
+    .map((entry) => compileRegexEntry(entry, errors))
+    .filter((value) => value !== null);


bannedPatterns is derived by mapping bannedPatternStrings and then filtering out invalid regexes, but bannedPatternStrings is left unfiltered. This can desync the two arrays (e.g., when one pattern fails to compile), and downstream code indexes into bannedPatternStrings by the bannedPatterns index, producing incorrect detail values. Consider building a single array of { pattern, source } pairs (or filtering both arrays in lockstep) so the displayed pattern string always matches the compiled RegExp.

Suggested change

-  const bannedPatternStrings = normalizeStringArray(rawConfig?.bannedPatterns);
-  const bannedPatterns = bannedPatternStrings
-    .map((entry) => compileRegexEntry(entry, errors))
-    .filter((value) => value !== null);
+  const bannedPatternEntries = normalizeStringArray(rawConfig?.bannedPatterns)
+    .map((entry) => {
+      const pattern = compileRegexEntry(entry, errors);
+      return pattern === null ? null : { source: entry, pattern };
+    })
+    .filter((value) => value !== null);
+  const bannedPatternStrings = bannedPatternEntries.map((entry) => entry.source);
+  const bannedPatterns = bannedPatternEntries.map((entry) => entry.pattern);
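The desync described in this comment can be reproduced in isolation. In the sketch below, `compileRegexEntry` is a simplified stand-in for the plugin's helper; its behavior (record an error and return `null` for a pattern that does not compile) is an assumption inferred from the `.filter((value) => value !== null)` call in the diff, and the sample patterns are made up.

```javascript
// Simplified stand-in for the plugin's compileRegexEntry (assumed behavior:
// push an error and return null when the pattern does not compile).
function compileRegexEntry(entry, errors) {
  try {
    return new RegExp(entry);
  } catch (error) {
    errors.push(`invalid pattern "${entry}": ${error.message}`);
    return null;
  }
}

const errors = [];
const sources = ['(unclosed', 'synergy']; // first entry is an invalid regex

// Before: parallel arrays. Dropping the invalid entry shifts the indices.
const compiled = sources
  .map((entry) => compileRegexEntry(entry, errors))
  .filter((value) => value !== null);
const staleDetail = sources[0]; // index 0 of `compiled` is /synergy/

// After: one list of pairs, so source and pattern cannot drift apart.
const entries = sources
  .map((entry) => {
    const pattern = compileRegexEntry(entry, errors);
    return pattern === null ? null : { source: entry, pattern };
  })
  .filter((value) => value !== null);

console.log(staleDetail);       // the invalid pattern's string
console.log(entries[0].source); // the string that actually compiled
```

With parallel arrays, a match on the second configured pattern is reported under the first pattern's string; the paired list reports the string that actually matched.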

Copilot · 2026-04-26T09:54:58Z

+  for (let index = 0; index < config.bannedPatterns.length; index++) {
+    const pattern = config.bannedPatterns[index];
+    if (!pattern) continue;
+    if (pattern.test(text)) {
+      violations.push({
+        kind: 'banned_pattern',
+        detail: config.bannedPatternStrings[index] || pattern.source,


This loop assumes config.bannedPatternStrings[index] corresponds to config.bannedPatterns[index]. If invalid regex entries were dropped during config parsing, the indices can shift and the reported detail can refer to the wrong pattern. Fix by iterating over a single structured list (pattern + original string) instead of parallel arrays, or ensure the arrays are filtered in sync.

Suggested change

-  for (let index = 0; index < config.bannedPatterns.length; index++) {
-    const pattern = config.bannedPatterns[index];
-    if (!pattern) continue;
-    if (pattern.test(text)) {
-      violations.push({
-        kind: 'banned_pattern',
-        detail: config.bannedPatternStrings[index] || pattern.source,
+  for (const pattern of config.bannedPatterns) {
+    if (!pattern) continue;
+    if (pattern.test(text)) {
+      violations.push({
+        kind: 'banned_pattern',
+        detail: pattern.source,

Copilot · 2026-04-26T09:54:59Z

+          type: number
+          default: 8000
+          minimum: 1000
+          maximum: 60000


resolveModelClientConfig() reads maxRetries for both classifier and rewriter, but the configSchema for classifier does not include maxRetries and has additionalProperties: false. Any attempt to configure classifier.maxRetries will be rejected by Ajv validation at plugin load. Either add maxRetries to the classifier schema (and UI hints if desired) or remove/ignore it in the classifier resolver for consistency.

Suggested change

-          maximum: 60000
+          maximum: 60000
+        maxRetries:
+          type: number
+          default: 1
+          minimum: 0
+          maximum: 3

Copilot · 2026-04-26T09:54:59Z

+        const value = await entry.guard.inspect(guardContext);
+        if (value && typeof value === 'object' && 'action' in value) {
+          decision = value as PluginOutputGuardDecision;
+        }


applyOutputGuards() treats any object with an action property as a valid PluginOutputGuardDecision without validating that action is one of allow|rewrite|block. If a plugin returns { action: 'foo' }, it will currently fall through and be treated as a block decision, potentially blocking user output unexpectedly. Validate action against the allowed set before accepting the decision; otherwise log and treat it as allow/ignore.
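One way to implement the validation this comment asks for is a small normalizer applied to whatever `inspect` returns. This is a sketch under assumptions: `normalizeDecision` and `VALID_ACTIONS` are hypothetical names, and the real fix in `applyOutputGuards` may be structured differently.

```javascript
const VALID_ACTIONS = new Set(['allow', 'rewrite', 'block']);

// Hypothetical normalizer: anything that is not a well-formed decision
// is logged and downgraded to `allow`, so it can never block by accident.
function normalizeDecision(value, log = console.warn) {
  if (value && typeof value === 'object' && VALID_ACTIONS.has(value.action)) {
    return value;
  }
  if (value !== undefined && value !== null) {
    log(`ignoring guard decision with unknown action: ${String(value.action)}`);
  }
  return { action: 'allow' };
}
```

A guard that returns `{ action: 'foo' }` then behaves the same as a guard that returns nothing, apart from the log line.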

Adds an output-guard extension point to the plugin SDK and a `brand-voice` plugin that uses it to keep off-brand language out of agent replies. The guard runs after the agent finishes a turn, before the response is stored or shipped to the user, so the gated text is what the user sees, what's persisted, and what's logged.

Detection is rules-first (banned phrases, banned regex patterns, required phrases) with an optional aux-model classifier. When violations are found in `rewrite` mode (the default), an aux model rewrites the response and the rewrite is re-checked against the rules before shipping; if it still violates, the guard falls back to block. `block` mode replaces the text outright; `flag` mode only logs.

Classifier and rewriter each have their own provider/model/key/timeout, so a cheap classifier can pair with a stronger rewriter. Aux-model errors default to `allow` so a flaky API never hard-blocks the agent.

Implements roadmap item #8 (brand-voice + output classifier).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
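The mode handling described in this commit message can be sketched as a single decision function. This is illustrative only: `decide`, `findViolations`, and `rewrite` are hypothetical injected stand-ins for the rules detector and the aux-model rewriter, the optional classifier is left out, and `failureMode: 'block'` as the alternative to `allow` is an assumption.

```javascript
// Hypothetical decision flow for the three modes (flag, block, rewrite).
async function decide({ text, mode, findViolations, rewrite, failureMode = 'allow', log = console.warn }) {
  const violations = findViolations(text);
  if (violations.length === 0) return { action: 'allow' };
  if (mode === 'flag') {
    log(`brand-voice violations: ${violations.join(', ')}`);
    return { action: 'allow' }; // flag mode only logs
  }
  if (mode === 'block') return { action: 'block', violations };

  // mode === 'rewrite': without a rewriter, fall back to block.
  if (!rewrite) return { action: 'block', violations };
  let rewritten;
  try {
    rewritten = await rewrite(text, violations);
  } catch {
    // Aux-model failure: honor failureMode (default `allow`).
    return failureMode === 'block' ? { action: 'block', violations } : { action: 'allow' };
  }
  // Re-check the rewrite against the rules before shipping it.
  if (findViolations(rewritten).length > 0) return { action: 'block', violations };
  return { action: 'rewrite', text: rewritten };
}
```

The re-check after rewriting is what lets a rewrite that still violates the rules escalate to block instead of shipping.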

- Pair banned regex patterns with their original source strings in a single structured list so an invalid pattern dropped during compile no longer desyncs the displayed `detail` reported by the rules detector.
- Add `maxRetries` to the classifier configSchema; the resolver already reads it for both classifier and rewriter, but Ajv's `additionalProperties: false` was rejecting it on the classifier side at plugin load.
- Validate guard `action` values in `applyOutputGuards`; an unknown action now logs and is treated as `allow` instead of falling through and being treated as a block. Adds a regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 26, 2026 09:50

Copilot started reviewing on behalf of furukama April 26, 2026 09:50

Copilot AI reviewed Apr 26, 2026

Benedikt Koehler and others added 2 commits April 26, 2026 14:15

furukama force-pushed the feat/brand-voice-output-guard branch from 9014ca6 to d716c2e Compare April 26, 2026 12:15

furukama mentioned this pull request Apr 28, 2026: docs(roadmap): add status snapshot + per-feature progress markers #640 (merged)

furukama wants to merge 2 commits into main from feat/brand-voice-output-guard
