feat: brand-voice + output classifier guard plugin #408

Open
furukama wants to merge 2 commits into main from feat/brand-voice-output-guard

@furukama
Contributor

Summary

  • Problem: Marketers' #1 fear with agents is "the bot said something dumb to a client." Roadmap item #8 calls for a brand-voice + output classifier that gates final responses before they ship.
  • Why it matters: Off-brand language reaches the user only after the agent has already produced it. Without a gate, every drift incident is irreversible.
  • What changed: Added an output-guard extension point to the plugin SDK and a brand-voice plugin that uses it. The gateway now runs registered guards on the finalized resultText before recordSuccessfulTurn/audit, so the gated text is what the user sees, what's persisted, and what's logged. Companion brand-voice SKILL added.
  • What did not change: No changes to memory, approval policy, audit hash chain, container/IPC, or other plugins. Plugins without an output guard are unaffected — applyOutputGuards is a no-op when no guards are registered.

Change Type

  • Feature
  • Tests

Linked Context

Validation

npm run typecheck            # pass
npm run lint                 # pass
npm run check                # pass
npx vitest run --config vitest.unit.config.ts \
  tests/brand-voice-plugin.test.ts \
  tests/plugin-manager.test.ts \
  tests/gateway-service.plugins.test.ts
# 65/65 pass
  • Verified manually: rule-based block (banned phrase + missing required), clean-output allow-through, mocked Anthropic rewrite path with x-api-key header propagation, fallback-to-block when mode=rewrite but rewriter.provider=none, mode=flag stays transparent, guard-pipeline priority ordering and short-circuit on block, error isolation when a guard throws.
  • Edge cases checked: empty/short text bypass via minLength, classifier returns non-parseable JSON (logged + ignored), rewriter returns text that still violates rules (escalates to block), aux-model API errors (default failureMode: allow so a flaky API never hard-blocks the agent).
  • Skipped checks: full unit suite has 9 pre-existing failures on main (host-runner.worker-restart, host-runner.redaction, eval-command, admin-terminal). Verified identical failure set on main — unrelated to this PR. Integration/e2e/live not run; no integration surface touched.
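
The decision flow exercised by the manual checks above can be sketched roughly as follows. This is a simplified illustration, not the code in plugins/brand-voice/src/guard.js: the `VoiceConfig` shape, the `inspect` and `findViolations` names, the case-insensitive phrase matching, and the `rewrite` callback (standing in for the aux-model rewriter, with `undefined` modelling `rewriter.provider = none`) are all assumptions.

```typescript
type Mode = "rewrite" | "block" | "flag";
type FailureMode = "allow" | "block";
type Decision =
  | { action: "allow" }
  | { action: "rewrite"; text: string }
  | { action: "block"; reason: string };

// Hypothetical config shape; the real plugin has more fields.
interface VoiceConfig {
  mode: Mode;
  minLength: number;
  failureMode: FailureMode;
  bannedPhrases: string[];
  requiredPhrases: string[];
  // Stand-in for the aux-model rewriter; undefined models `rewriter.provider = none`.
  rewrite?: (text: string, violations: string[]) => Promise<string>;
}

// Rule check: banned phrases present, required phrases missing (assumed case-insensitive).
function findViolations(text: string, config: VoiceConfig): string[] {
  const lower = text.toLowerCase();
  const banned = config.bannedPhrases
    .filter((phrase) => lower.includes(phrase.toLowerCase()))
    .map((phrase) => `banned: ${phrase}`);
  const missing = config.requiredPhrases
    .filter((phrase) => !lower.includes(phrase.toLowerCase()))
    .map((phrase) => `missing: ${phrase}`);
  return [...banned, ...missing];
}

async function inspect(text: string, config: VoiceConfig): Promise<Decision> {
  // Short text bypasses the guard entirely.
  if (text.length < config.minLength) return { action: "allow" };
  const violations = findViolations(text, config);
  if (violations.length === 0) return { action: "allow" };
  // flag mode stays transparent; a real guard would log the violations here.
  if (config.mode === "flag") return { action: "allow" };
  const blocked: Decision = { action: "block", reason: violations.join("; ") };
  // block mode, or rewrite mode without a rewriter, falls back to block.
  if (config.mode === "block" || !config.rewrite) return blocked;
  try {
    const rewritten = await config.rewrite(text, violations);
    // The rewrite is re-checked against the rules; a failing rewrite escalates to block.
    return findViolations(rewritten, config).length === 0
      ? { action: "rewrite", text: rewritten }
      : blocked;
  } catch {
    // Aux-model errors follow failureMode, which defaults to allow in the plugin.
    return config.failureMode === "allow" ? { action: "allow" } : blocked;
  }
}
```

Keeping the rules check as a pure function is what makes the post-rewrite re-check cheap: the same `findViolations` call runs on both the original and the rewritten text.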

Docs And Config Impact

  • README, docs, or examples updated — new SKILL at skills/brand-voice/SKILL.md; new plugin manifest with configSchema + configUiHints so the admin console renders config correctly.
  • Config or environment behavior changed — only when the user opts in by enabling the plugin.
  • Templates or workspace bootstrap files changed
  • No docs or config impact

Risk Notes

  • Security-sensitive paths touched? No. The guard only inspects/rewrites the agent's outgoing text; it cannot escalate permissions, bypass approvals, or change tool execution.
  • Gateway, audit, approval, or container boundaries touched? Gateway: yes (response finalization in gateway-chat-service.ts). Audit: indirect — the guard runs before audit/recordSuccessfulTurn so audit reflects gated text, which is the desired behavior. Approval/container: no.
  • Failure mode: the entire guard pipeline is wrapped in a try/catch in the gateway; if it errors, the original output ships and a warning is logged. Individual guard errors are caught and skipped inside applyOutputGuards. A misconfigured plugin (e.g. rewriter API down) defaults to failureMode: allow. Block is the only mode that replaces user-facing text, and only when a guard explicitly returns {action: 'block'}.
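
A minimal sketch of the two layers of error handling described in the failure-mode note, under stated assumptions: `GuardEntry`, `gateResultText`, and the `warn` callback are hypothetical names, lower `priority` values are assumed to run first, a rewrite is assumed to feed into the next guard, and a block decision is assumed to carry its replacement text. The real pipeline lives in src/plugins/plugin-manager.ts and may differ on each of these points.

```typescript
type GuardDecision =
  | { action: "allow" }
  | { action: "rewrite"; text: string }
  | { action: "block"; text: string };

interface GuardEntry {
  id: string;
  priority: number; // assumption: lower values run first
  inspect: (text: string) => Promise<GuardDecision>;
}

type Warn = (message: string) => void;

// Inner layer: run guards in priority order, isolating per-guard errors.
async function applyOutputGuards(
  entries: GuardEntry[],
  text: string,
  warn: Warn,
): Promise<string> {
  let current = text;
  const ordered = [...entries].sort((a, b) => a.priority - b.priority);
  for (const entry of ordered) {
    let decision: GuardDecision;
    try {
      decision = await entry.inspect(current);
    } catch (error) {
      // A throwing guard is skipped; the remaining guards still run.
      warn(`guard ${entry.id} threw; skipping: ${String(error)}`);
      continue;
    }
    // Block short-circuits: no later guard sees the text.
    if (decision.action === "block") return decision.text;
    if (decision.action === "rewrite") current = decision.text;
  }
  // With no guards registered this returns the input unchanged.
  return current;
}

// Outer layer: if the pipeline itself fails, ship the original text and warn.
async function gateResultText(
  entries: GuardEntry[],
  resultText: string,
  warn: Warn,
): Promise<string> {
  try {
    return await applyOutputGuards(entries, resultText, warn);
  } catch (error) {
    warn(`output guard pipeline failed; shipping original text: ${String(error)}`);
    return resultText;
  }
}
```

In this arrangement the gateway would pass the returned string on to recording and audit, so all three consumers see the same gated text.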

Evidence

  • New or updated test coverage — tests/brand-voice-plugin.test.ts (7 tests) covers config, rules, block/allow/rewrite/flag modes and the Anthropic rewrite path; tests/plugin-manager.test.ts adds priority-order, short-circuit-on-block, and guard-error-isolation tests for the pipeline mechanics.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 26, 2026 09:50

Copilot AI left a comment

Pull request overview

Adds a new plugin SDK extension point for “output guards” and introduces a bundled brand-voice output-guard plugin that can allow/rewrite/block finalized assistant responses before they’re persisted/logged/shown.

Changes:

  • Extend plugin types + API + SDK exports to support output-guard plugins and guard registration.
  • Implement guard registration/storage and an applyOutputGuards() pipeline in PluginManager, plus gateway integration to run guards on finalized resultText.
  • Add bundled brand-voice plugin + skill docs, along with targeted unit tests for plugin-manager ordering/error isolation and brand-voice behavior.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/plugin-manager.test.ts | Adds tests for guard ordering, block short-circuiting, and guard error isolation. |
| tests/gateway-service.plugins.test.ts | Updates gateway plugin-manager mock to include output-guard methods. |
| tests/brand-voice-plugin.test.ts | New unit tests covering brand-voice config/rules and allow/block/rewrite/flag behavior. |
| src/plugins/plugin-types.ts | Introduces output-guard plugin kind and guard-related types + API surface. |
| src/plugins/plugin-sdk.ts | Re-exports new output-guard types via the plugin SDK. |
| src/plugins/plugin-manager.ts | Adds guard registration, snapshot rollback support, and the guard pipeline implementation. |
| src/plugins/plugin-api.ts | Exposes registerOutputGuard() to plugins. |
| src/gateway/gateway-chat-service.ts | Applies output guards to finalized response text before citations/audit/recording. |
| skills/brand-voice/SKILL.md | Documents the companion skill and how it interacts with guard modes. |
| plugins/brand-voice/src/rules.js | Implements rule-based detection + summaries for banned/required phrases and patterns. |
| plugins/brand-voice/src/llm.js | Adds aux-model calls + classifier verdict parsing helpers. |
| plugins/brand-voice/src/index.js | Registers the brand-voice output guard and a brand-voice command. |
| plugins/brand-voice/src/guard.js | Implements the guard decision logic (allow/flag/block/rewrite + failure-mode handling). |
| plugins/brand-voice/src/config.js | Normalizes/validates config, compiles patterns, resolves voice file, builds voice brief. |
| plugins/brand-voice/package.json | Adds bundled plugin package metadata (private, node engine, ESM). |
| plugins/brand-voice/hybridclaw.plugin.yaml | Adds plugin manifest, credentials, configSchema, and admin UI hints. |

Comment thread plugins/brand-voice/src/config.js Outdated
Comment on lines +135 to +138
    const bannedPatternStrings = normalizeStringArray(rawConfig?.bannedPatterns);
    const bannedPatterns = bannedPatternStrings
      .map((entry) => compileRegexEntry(entry, errors))
      .filter((value) => value !== null);
Copilot AI Apr 26, 2026

bannedPatterns is derived by mapping bannedPatternStrings and then filtering out invalid regexes, but bannedPatternStrings is left unfiltered. This can desync the two arrays (e.g., when one pattern fails to compile), and downstream code indexes into bannedPatternStrings by the bannedPatterns index, producing incorrect detail values. Consider building a single array of { pattern, source } pairs (or filtering both arrays in lockstep) so the displayed pattern string always matches the compiled RegExp.

Suggested change
    - const bannedPatternStrings = normalizeStringArray(rawConfig?.bannedPatterns);
    - const bannedPatterns = bannedPatternStrings
    -   .map((entry) => compileRegexEntry(entry, errors))
    -   .filter((value) => value !== null);
    + const bannedPatternEntries = normalizeStringArray(rawConfig?.bannedPatterns)
    +   .map((entry) => {
    +     const pattern = compileRegexEntry(entry, errors);
    +     return pattern === null ? null : { source: entry, pattern };
    +   })
    +   .filter((value) => value !== null);
    + const bannedPatternStrings = bannedPatternEntries.map((entry) => entry.source);
    + const bannedPatterns = bannedPatternEntries.map((entry) => entry.pattern);

Comment thread plugins/brand-voice/src/rules.js Outdated
Comment on lines +10 to +16
    for (let index = 0; index < config.bannedPatterns.length; index++) {
      const pattern = config.bannedPatterns[index];
      if (!pattern) continue;
      if (pattern.test(text)) {
        violations.push({
          kind: 'banned_pattern',
          detail: config.bannedPatternStrings[index] || pattern.source,
Copilot AI Apr 26, 2026

This loop assumes config.bannedPatternStrings[index] corresponds to config.bannedPatterns[index]. If invalid regex entries were dropped during config parsing, the indices can shift and the reported detail can refer to the wrong pattern. Fix by iterating over a single structured list (pattern + original string) instead of parallel arrays, or ensure the arrays are filtered in sync.

Suggested change
    - for (let index = 0; index < config.bannedPatterns.length; index++) {
    -   const pattern = config.bannedPatterns[index];
    -   if (!pattern) continue;
    -   if (pattern.test(text)) {
    -     violations.push({
    -       kind: 'banned_pattern',
    -       detail: config.bannedPatternStrings[index] || pattern.source,
    + for (const pattern of config.bannedPatterns) {
    +   if (!pattern) continue;
    +   if (pattern.test(text)) {
    +     violations.push({
    +       kind: 'banned_pattern',
    +       detail: pattern.source,
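
For illustration, the "single structured list" alternative mentioned in the comment could look roughly like this end to end. The names `BannedPatternEntry`, `compileBannedPatterns`, and `detectBannedPatterns` are hypothetical, and compiling with a bare `new RegExp(source)` is an assumption; the plugin's `compileRegexEntry` may handle flags or delimiters differently.

```typescript
// Hypothetical pairing of a compiled pattern with the string it came from.
interface BannedPatternEntry {
  source: string;
  pattern: RegExp;
}

interface Violation {
  kind: "banned_pattern";
  detail: string;
}

// Compile each configured string; invalid entries are reported and dropped
// together with their source, so nothing can shift out of alignment.
function compileBannedPatterns(raw: string[], errors: string[]): BannedPatternEntry[] {
  const entries: BannedPatternEntry[] = [];
  for (const source of raw) {
    try {
      entries.push({ source, pattern: new RegExp(source) });
    } catch (error) {
      errors.push(`invalid pattern ${source}: ${String(error)}`);
    }
  }
  return entries;
}

// Report the original configured string for every pattern that matches.
function detectBannedPatterns(text: string, entries: BannedPatternEntry[]): Violation[] {
  return entries
    .filter((entry) => entry.pattern.test(text))
    .map((entry) => ({ kind: "banned_pattern" as const, detail: entry.source }));
}
```

With the config `["(", "guarantee\\w*"]`, the first entry fails to compile. Under the parallel-array layout the surviving pattern would sit at index 0 and be reported with the detail `"("`; with paired entries it is reported under its own source string.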

type: number
default: 8000
minimum: 1000
maximum: 60000
Copilot AI Apr 26, 2026

resolveModelClientConfig() reads maxRetries for both classifier and rewriter, but the configSchema for classifier does not include maxRetries and has additionalProperties: false. Any attempt to configure classifier.maxRetries will be rejected by Ajv validation at plugin load. Either add maxRetries to the classifier schema (and UI hints if desired) or remove/ignore it in the classifier resolver for consistency.

Suggested change
    -     maximum: 60000
    +     maximum: 60000
    +   maxRetries:
    +     type: number
    +     default: 1
    +     minimum: 0
    +     maximum: 3
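
The rejection described above comes from how `additionalProperties: false` works: any key not listed under `properties` fails validation. The sketch below is a hand-rolled stand-in for that single rule, not Ajv itself; `ObjectSchema`, `unknownKeys`, and the `timeoutMs` property name are hypothetical and used only for illustration.

```typescript
// Minimal stand-in for one JSON Schema rule; real validation is done by Ajv.
interface ObjectSchema {
  properties: Record<string, unknown>;
  additionalProperties: boolean;
}

// Returns the keys that `additionalProperties: false` would reject.
function unknownKeys(schema: ObjectSchema, value: Record<string, unknown>): string[] {
  if (schema.additionalProperties) return [];
  return Object.keys(value).filter(
    (key) => !Object.prototype.hasOwnProperty.call(schema.properties, key),
  );
}

// Hypothetical classifier schema before and after the suggested change.
const classifierBefore: ObjectSchema = {
  properties: { timeoutMs: {} },
  additionalProperties: false,
};
const classifierAfter: ObjectSchema = {
  properties: { timeoutMs: {}, maxRetries: {} },
  additionalProperties: false,
};

const userConfig = { timeoutMs: 8000, maxRetries: 2 };
const rejectedBefore = unknownKeys(classifierBefore, userConfig);
const rejectedAfter = unknownKeys(classifierAfter, userConfig);
```

Here `rejectedBefore` contains `maxRetries` while `rejectedAfter` is empty, which mirrors why the resolver could read a value that the schema never allowed a user to set.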

Comment on lines +1879 to +1882
    const value = await entry.guard.inspect(guardContext);
    if (value && typeof value === 'object' && 'action' in value) {
      decision = value as PluginOutputGuardDecision;
    }
Copilot AI Apr 26, 2026

applyOutputGuards() treats any object with an action property as a valid PluginOutputGuardDecision without validating that action is one of allow|rewrite|block. If a plugin returns { action: 'foo' }, it will currently fall through and be treated as a block decision, potentially blocking user output unexpectedly. Validate action against the allowed set before accepting the decision; otherwise log and treat it as allow/ignore.
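
One way to do the validation suggested here is to narrow the untrusted return value before accepting it. This is a sketch only: `GuardDecision` stands in for `PluginOutputGuardDecision` with assumed fields, and `toDecision` and the `warn` callback are hypothetical names.

```typescript
// Assumed decision shape; the real type is PluginOutputGuardDecision.
type GuardDecision =
  | { action: "allow" }
  | { action: "rewrite"; text: string }
  | { action: "block"; text: string };

const KNOWN_ACTIONS = new Set<string>(["allow", "rewrite", "block"]);

// Accept a plugin's return value only if its action is in the known set;
// anything else is logged and treated as allow.
function toDecision(value: unknown, warn: (message: string) => void): GuardDecision {
  if (value && typeof value === "object" && "action" in value) {
    const action = (value as { action: unknown }).action;
    if (typeof action === "string" && KNOWN_ACTIONS.has(action)) {
      return value as GuardDecision;
    }
    warn(`ignoring unknown guard action: ${String(action)}`);
  }
  return { action: "allow" };
}
```

Defaulting to allow keeps a buggy third-party guard from silently blocking user output, at the cost of letting its intended decision go unenforced; the warning is what makes that visible.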

furukama pushed a commit that referenced this pull request Apr 26, 2026
- Pair banned regex patterns with their original source strings in a single
  structured list so an invalid pattern dropped during compile no longer
  desyncs the displayed `detail` reported by the rules detector.
- Add `maxRetries` to the classifier configSchema; the resolver already
  reads it for both classifier and rewriter, but Ajv's `additionalProperties:
  false` was rejecting it on the classifier side at plugin load.
- Validate guard `action` values in `applyOutputGuards`; an unknown action
  now logs and is treated as `allow` instead of falling through and being
  treated as a block. Adds a regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benedikt Koehler and others added 2 commits April 26, 2026 14:15
Adds an output-guard extension point to the plugin SDK and a `brand-voice`
plugin that uses it to keep off-brand language out of agent replies. The
guard runs after the agent finishes a turn, before the response is stored
or shipped to the user, so the gated text is what the user sees, what's
persisted, and what's logged.

Detection is rules-first (banned phrases, banned regex patterns, required
phrases) with an optional aux-model classifier. When violations are found
in `rewrite` mode (the default), an aux model rewrites the response and
the rewrite is re-checked against the rules before shipping; if it still
violates, the guard falls back to block. `block` mode replaces the text
outright; `flag` mode only logs.

Classifier and rewriter each have their own provider/model/key/timeout,
so a cheap classifier can pair with a stronger rewriter. Aux-model errors
default to `allow` so a flaky API never hard-blocks the agent.

Implements roadmap item #8 (brand-voice + output classifier).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Pair banned regex patterns with their original source strings in a single
  structured list so an invalid pattern dropped during compile no longer
  desyncs the displayed `detail` reported by the rules detector.
- Add `maxRetries` to the classifier configSchema; the resolver already
  reads it for both classifier and rewriter, but Ajv's `additionalProperties:
  false` was rejecting it on the classifier side at plugin load.
- Validate guard `action` values in `applyOutputGuards`; an unknown action
  now logs and is treated as `allow` instead of falling through and being
  treated as a block. Adds a regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>