feat: brand-voice + output classifier guard plugin#408
feat: brand-voice + output classifier guard plugin#408
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new plugin SDK extension point for “output guards” and introduces a bundled brand-voice output-guard plugin that can allow/rewrite/block finalized assistant responses before they’re persisted/logged/shown.
Changes:
- Extend plugin types + API + SDK exports to support
output-guardplugins and guard registration. - Implement guard registration/storage and an
applyOutputGuards()pipeline inPluginManager, plus gateway integration to run guards on finalizedresultText. - Add bundled
brand-voiceplugin + skill docs, along with targeted unit tests for plugin-manager ordering/error isolation and brand-voice behavior.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/plugin-manager.test.ts | Adds tests for guard ordering, block short-circuiting, and guard error isolation. |
| tests/gateway-service.plugins.test.ts | Updates gateway plugin-manager mock to include output-guard methods. |
| tests/brand-voice-plugin.test.ts | New unit tests covering brand-voice config/rules and allow/block/rewrite/flag behavior. |
| src/plugins/plugin-types.ts | Introduces output-guard plugin kind and guard-related types + API surface. |
| src/plugins/plugin-sdk.ts | Re-exports new output-guard types via the plugin SDK. |
| src/plugins/plugin-manager.ts | Adds guard registration, snapshot rollback support, and the guard pipeline implementation. |
| src/plugins/plugin-api.ts | Exposes registerOutputGuard() to plugins. |
| src/gateway/gateway-chat-service.ts | Applies output guards to finalized response text before citations/audit/recording. |
| skills/brand-voice/SKILL.md | Documents the companion skill and how it interacts with guard modes. |
| plugins/brand-voice/src/rules.js | Implements rule-based detection + summaries for banned/required phrases and patterns. |
| plugins/brand-voice/src/llm.js | Adds aux-model calls + classifier verdict parsing helpers. |
| plugins/brand-voice/src/index.js | Registers the brand-voice output guard and a brand-voice command. |
| plugins/brand-voice/src/guard.js | Implements the guard decision logic (allow/flag/block/rewrite + failure-mode handling). |
| plugins/brand-voice/src/config.js | Normalizes/validates config, compiles patterns, resolves voice file, builds voice brief. |
| plugins/brand-voice/package.json | Adds bundled plugin package metadata (private, node engine, ESM). |
| plugins/brand-voice/hybridclaw.plugin.yaml | Adds plugin manifest, credentials, configSchema, and admin UI hints. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const bannedPatternStrings = normalizeStringArray(rawConfig?.bannedPatterns); | ||
| const bannedPatterns = bannedPatternStrings | ||
| .map((entry) => compileRegexEntry(entry, errors)) | ||
| .filter((value) => value !== null); |
There was a problem hiding this comment.
bannedPatterns is derived by mapping bannedPatternStrings and then filtering out invalid regexes, but bannedPatternStrings is left unfiltered. This can desync the two arrays (e.g., when one pattern fails to compile), and downstream code indexes into bannedPatternStrings by the bannedPatterns index, producing incorrect detail values. Consider building a single array of { pattern, source } pairs (or filtering both arrays in lockstep) so the displayed pattern string always matches the compiled RegExp.
| const bannedPatternStrings = normalizeStringArray(rawConfig?.bannedPatterns); | |
| const bannedPatterns = bannedPatternStrings | |
| .map((entry) => compileRegexEntry(entry, errors)) | |
| .filter((value) => value !== null); | |
| const bannedPatternEntries = normalizeStringArray(rawConfig?.bannedPatterns) | |
| .map((entry) => { | |
| const pattern = compileRegexEntry(entry, errors); | |
| return pattern === null ? null : { source: entry, pattern }; | |
| }) | |
| .filter((value) => value !== null); | |
| const bannedPatternStrings = bannedPatternEntries.map((entry) => entry.source); | |
| const bannedPatterns = bannedPatternEntries.map((entry) => entry.pattern); |
| for (let index = 0; index < config.bannedPatterns.length; index++) { | ||
| const pattern = config.bannedPatterns[index]; | ||
| if (!pattern) continue; | ||
| if (pattern.test(text)) { | ||
| violations.push({ | ||
| kind: 'banned_pattern', | ||
| detail: config.bannedPatternStrings[index] || pattern.source, |
There was a problem hiding this comment.
This loop assumes config.bannedPatternStrings[index] corresponds to config.bannedPatterns[index]. If invalid regex entries were dropped during config parsing, the indices can shift and the reported detail can refer to the wrong pattern. Fix by iterating over a single structured list (pattern + original string) instead of parallel arrays, or ensure the arrays are filtered in sync.
| for (let index = 0; index < config.bannedPatterns.length; index++) { | |
| const pattern = config.bannedPatterns[index]; | |
| if (!pattern) continue; | |
| if (pattern.test(text)) { | |
| violations.push({ | |
| kind: 'banned_pattern', | |
| detail: config.bannedPatternStrings[index] || pattern.source, | |
| for (const pattern of config.bannedPatterns) { | |
| if (!pattern) continue; | |
| if (pattern.test(text)) { | |
| violations.push({ | |
| kind: 'banned_pattern', | |
| detail: pattern.source, |
| type: number | ||
| default: 8000 | ||
| minimum: 1000 | ||
| maximum: 60000 |
There was a problem hiding this comment.
resolveModelClientConfig() reads maxRetries for both classifier and rewriter, but the configSchema for classifier does not include maxRetries and has additionalProperties: false. Any attempt to configure classifier.maxRetries will be rejected by Ajv validation at plugin load. Either add maxRetries to the classifier schema (and UI hints if desired) or remove/ignore it in the classifier resolver for consistency.
| maximum: 60000 | |
| maximum: 60000 | |
| maxRetries: | |
| type: number | |
| default: 1 | |
| minimum: 0 | |
| maximum: 3 |
| const value = await entry.guard.inspect(guardContext); | ||
| if (value && typeof value === 'object' && 'action' in value) { | ||
| decision = value as PluginOutputGuardDecision; | ||
| } |
There was a problem hiding this comment.
applyOutputGuards() treats any object with an action property as a valid PluginOutputGuardDecision without validating that action is one of allow|rewrite|block. If a plugin returns { action: 'foo' }, it will currently fall through and be treated as a block decision, potentially blocking user output unexpectedly. Validate action against the allowed set before accepting the decision; otherwise log and treat it as allow/ignore.
- Pair banned regex patterns with their original source strings in a single structured list so an invalid pattern dropped during compile no longer desyncs the displayed `detail` reported by the rules detector. - Add `maxRetries` to the classifier configSchema; the resolver already reads it for both classifier and rewriter, but Ajv's `additionalProperties: false` was rejecting it on the classifier side at plugin load. - Validate guard `action` values in `applyOutputGuards`; an unknown action now logs and is treated as `allow` instead of falling through and being treated as a block. Adds a regression test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an output-guard extension point to the plugin SDK and a `brand-voice` plugin that uses it to keep off-brand language out of agent replies. The guard runs after the agent finishes a turn, before the response is stored or shipped to the user, so the gated text is what the user sees, what's persisted, and what's logged. Detection is rules-first (banned phrases, banned regex patterns, required phrases) with an optional aux-model classifier. When violations are found in `rewrite` mode (the default), an aux model rewrites the response and the rewrite is re-checked against the rules before shipping; if it still violates, the guard falls back to block. `block` mode replaces the text outright; `flag` mode only logs. Classifier and rewriter each have their own provider/model/key/timeout, so a cheap classifier can pair with a stronger rewriter. Aux-model errors default to `allow` so a flaky API never hard-blocks the agent. Implements roadmap item #8 (brand-voice + output classifier). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Pair banned regex patterns with their original source strings in a single structured list so an invalid pattern dropped during compile no longer desyncs the displayed `detail` reported by the rules detector. - Add `maxRetries` to the classifier configSchema; the resolver already reads it for both classifier and rewriter, but Ajv's `additionalProperties: false` was rejecting it on the classifier side at plugin load. - Validate guard `action` values in `applyOutputGuards`; an unknown action now logs and is treated as `allow` instead of falling through and being treated as a block. Adds a regression test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9014ca6 to
d716c2e
Compare
Summary
output-guardextension point to the plugin SDK and abrand-voiceplugin that uses it. The gateway now runs registered guards on the finalizedresultTextbeforerecordSuccessfulTurn/audit, so the gated text is what the user sees, what's persisted, and what's logged. Companionbrand-voiceSKILL added.applyOutputGuardsis a no-op when no guards are registered.Change Type
Linked Context
Validation
x-api-keyheader propagation, fallback-to-block whenmode=rewritebutrewriter.provider=none,mode=flagstays transparent, guard-pipeline priority ordering and short-circuit on block, error isolation when a guard throws.minLength, classifier returns non-parseable JSON (logged + ignored), rewriter returns text that still violates rules (escalates to block), aux-model API errors (defaultfailureMode: allowso a flaky API never hard-blocks the agent).main(host-runner.worker-restart, host-runner.redaction, eval-command, admin-terminal). Verified identical failure set onmain— unrelated to this PR. Integration/e2e/live not run; no integration surface touched.Docs And Config Impact
skills/brand-voice/SKILL.md; new plugin manifest withconfigSchema+configUiHintsso the admin console renders config correctly.Risk Notes
gateway-chat-service.ts). Audit: indirect — the guard runs before audit/recordSuccessfulTurn so audit reflects gated text, which is the desired behavior. Approval/container: no.applyOutputGuards. A misconfigured plugin (e.g. rewriter API down) defaults tofailureMode: allow. Block is the only mode that replaces user-facing text, and only when a guard explicitly returns{action: 'block'}.Evidence
tests/brand-voice-plugin.test.ts(7 tests) covers config, rules, block/allow/rewrite/flag modes and the Anthropic rewrite path;tests/plugin-manager.test.tsadds priority-order, short-circuit-on-block, and guard-error-isolation tests for the pipeline mechanics.🤖 Generated with Claude Code