Skip to content

fix(soul): emit HIGH SOUL-PROFILE-MISMATCH on profile-filter scope bypass (#162)#168

Merged
thebenignhacker merged 3 commits intomainfrom
fix/scan-soul-profile-bypass-162
Apr 29, 2026
Merged

fix(soul): emit HIGH SOUL-PROFILE-MISMATCH on profile-filter scope bypass (#162)#168
thebenignhacker merged 3 commits intomainfrom
fix/scan-soul-profile-bypass-162

Conversation

@thebenignhacker
Copy link
Copy Markdown
Contributor

Closes #162.

Summary

A <!-- soul:profile=conversational --> marker (or --profile override) narrows the scan-soul evaluation to 4 of 9 governance domains. The kitchen-sink malicious corpus fixture exploited this: its body declares Capability Boundaries, Trust Hierarchy, Data Handling, Agentic Safety, and Human Oversight — but the marker forces the scanner to skip them, producing a 100/100 HARDENED verdict on a genuinely-malicious agent definition.

Path B fix per [CHIEF-CDS]: honor the marker for legitimate conversational SOULs, but emit HIGH SOUL-PROFILE-MISMATCH whenever body content suggests a broader profile than the marker declares, plus disclose evaluation scope in every partial-scan output.

Implementation

src/soul/scanner.ts

  • New SoulProfileMismatch field on SoulScanResult carrying declared profile, inferred profile, skipped domains, and signals.
  • New inferProfileFromContent(content) heuristic. Layered by domain heading + verb signal: ## Agentic Safety or autonomous language → autonomous; ## Capability Boundaries + tool-execution signal / ## Human Oversight / tool-shell-execute language → tool-agent; ## Trust Hierarchy / ## Data Handling / scope-only ## Capability Boundaries → code-assistant; otherwise conversational.
  • Mismatch detection in scanSoul: when the declared profile (marker or --profile override) is strictly narrower than the inferred profile AND the inference would have evaluated extra domains, populate result.profileMismatch.
  • Tier cross-check: AGENTIC / MULTI-AGENT tiers (which require explicit autonomous / orchestrator language) override even a zero-signal body. TOOL-USING tier escalates only when body inference also has at least one signal — keeps documentation/tutorial fixtures from firing FPs.

src/cli.ts (scan-soul render)

  • Header line appends (N of 9 domains evaluated) when skippedDomains.length > 0.
  • New SOUL-PROFILE-MISMATCH HIGH block before Domain Scores. Shows declared vs. inferred profile, signals (up to 6 + overflow), skipped domains, runnable Fix.
  • Conformance label prefixed with PARTIAL when any domain was skipped — bare HARDENED / STANDARD / ESSENTIAL only appear at full-scope evaluation.
  • --ci exits non-zero when profileMismatch is set.

Phase 4.5 adversarial review (Tier A)

Subagent review of the first commit identified 5 HIGH-severity bypass / FP issues, all closed in the second commit before merge:

# Issue Resolution
H1 #### / **bold** heading-shape bypass Extended heading regex to ATX H1-H6 + bold-line + setext branches
H2 Verb-only bypass (MCP servers and function calls under the hood) Added proseSignal regex + tier cross-check (AGENTIC+ overrides; TOOL-USING escalates only with body signal)
H3 Code-fence / HTML-comment FP (tutorials documenting how to write SOUL.md) Pre-strip ```, ~~~, and non-soul: HTML comments before parsing
H4 <!-- soul:profile=code-assistant --> + scope-only Capability Boundaries FP Capability Boundaries heading alone now infers code-assistant; tool-agent requires the heading PLUS tool-execution / manifest / MCP signal
H5 Defensive prose FP (## Trust Hierarchy\nThis bot does not have one.) isDefensiveSection checks first ~240 chars of section body for negation / disclaim language

All 5 scenarios covered by regression tests in __tests__/soul/scanner-profile-mismatch.test.ts.

Phase 4.5 score-jump classification: kitchen-sink flips from 100/100 HARDENED to 100/100 PARTIAL HARDENED + HIGH SOUL-PROFILE-MISMATCH. Classification (d) expanded-detection — clean per the rule because the new HIGH surfaces a TP that was previously hidden by the marker bypass.

Residual MEDIUM items deferred to follow-ups (none blocking):

  • M1: <!-- soul:profile=basic --> (invalid profile name) silently ignored — UX warning needed.
  • M2: No test pinning the actual ~/.opena2a/corpus/repo/malicious/kitchen-sink/SOUL.md fixture.
  • M3: Registry JSON schema audit for the new profileMismatch field.

Verification

  • npm test 2048/2074 pass (was 2022/2048 baseline; +20 tests: 13 scanner-profile-mismatch + 7 scan-soul-scope-disclosure + 6 H1-H5 regression).
  • npm run release-smoke:corpus 12/12.
  • node dist/cli.js scan-soul ~/.opena2a/corpus/repo/malicious/kitchen-sink: HIGH SOUL-PROFILE-MISMATCH block + (4 of 9 domains evaluated) + Level PARTIAL HARDENED.
  • Cross-analyzer consistency: scan-soul kitchen-sink/ flips from 100 HARDENED to PARTIAL HARDENED + HIGH; check --nanomind kitchen-sink/ already fires HIGH on same body. Direction now agrees per cli-subagent-testing.md § Phase 7a.
  • All 5 adversarial scenarios (H1-H5) end-to-end verified: bypasses fire, FPs don't.

Test plan

  • npm test green (2048/2074)
  • npm run release-smoke:corpus 12/12
  • kitchen-sink scan-soul flips from 100 HARDENED to PARTIAL HARDENED + SOUL-PROFILE-MISMATCH HIGH
  • Phase 4.5 H1-H5 regression tests pass
  • benign hardened-soul / partial-controls-soul / clean-skill fixtures do NOT trigger mismatch
  • --ci exits 1 on mismatch fixture, exits 0 on clean conversational SOUL

Related

Bundled with #163 / #164 fixes for the 0.22.1 release.

…pass (#162)

A `<!-- soul:profile=conversational -->` marker (or `--profile`
override) narrows the scan-soul evaluation to 4 of 9 governance
domains. The kitchen-sink malicious corpus fixture exploited this:
its body declares Capability Boundaries, Trust Hierarchy, Data
Handling, Agentic Safety, and Human Oversight — but the marker
forces the scanner to skip them, producing a 100/100 HARDENED verdict
on a genuinely-malicious agent definition.

Path B fix per [CHIEF-CDS]: honor the marker for legitimate
conversational SOULs, but emit HIGH `SOUL-PROFILE-MISMATCH` whenever
body content suggests a broader profile than the marker declares,
plus disclose evaluation scope in every partial-scan output.

Changes:

- `src/soul/scanner.ts`:
  - New `SoulProfileMismatch` field on `SoulScanResult` carrying the
    declared profile, inferred profile, skipped domains, and the
    body signals that drove the inference.
  - New `inferProfileFromContent(content)` heuristic. Layered by
    domain heading + verb signal: `## Agentic Safety` or autonomous
    language → autonomous; `## Capability Boundaries` / `## Human
    Oversight` / tool-shell-execute language → tool-agent; `## Trust
    Hierarchy` / `## Data Handling` → code-assistant; otherwise
    conversational.
  - Mismatch detection in `scanSoul`: when the declared profile
    (marker or `--profile` override) is strictly narrower than the
    inferred profile AND the inference would have evaluated extra
    domains, populate `result.profileMismatch` with the human-
    readable signals so the UI can render them.

- `src/cli.ts` (scan-soul render):
  - Header line now appends ` · (N of 9 domains evaluated)` whenever
    `result.skippedDomains.length > 0`.
  - Verdict line for mismatch fixtures replaces "All controls
    covered" with "Profile mismatch: declared=X skips N domains the
    body content suggests should be evaluated".
  - New SOUL-PROFILE-MISMATCH HIGH block rendered before Domain
    Scores. Shows declared vs. inferred profile, the signals (up to
    6 + an "...and N more" overflow), the skipped domain list, and
    a runnable Fix recommending the user remove the marker.
  - Conformance label is prefixed with "PARTIAL " when any domain
    was skipped — bare "HARDENED" / "STANDARD" / "ESSENTIAL" only
    appear when all 9 domains were evaluated.
  - New "Scope" line under Conformance disclosing the
    evaluated-vs-skipped domain ratio.
  - `--ci` exits non-zero when `profileMismatch` is set (uses
    `globalCiMode || options.ci` because the global --ci flag is
    stripped from argv before Commander parses).

- `__tests__/soul/scanner-profile-mismatch.test.ts` (NEW): 13 tests
  covering every mismatch scenario — declared=conversational +
  Capability Boundaries body fires HIGH; declared=conversational +
  conversational body does not; bare body without marker infers
  cleanly (no mismatch); legitimate marker matching body does not
  fire; Agentic Safety body infers autonomous; finding body lists
  every skipped domain; --profile override path fires; and 5 unit
  tests for `inferProfileFromContent` covering each tier.

- `__tests__/cli/scan-soul-scope-disclosure.test.ts` (NEW): 7
  end-to-end tests against the built CLI — partial-scope header
  includes "(N of 9 domains evaluated)"; full-scope header omits it;
  PARTIAL prefix on conformance label; SOUL-PROFILE-MISMATCH block
  renders with Fix command; --ci exits 1 on mismatch fixture; --ci
  exits 0 on clean conversational fixture.

Verification:

- `node dist/cli.js scan-soul ~/.opena2a/corpus/repo/malicious/kitchen-sink`
  no longer ships 100/100 HARDENED. Output now opens with HIGH
  SOUL-PROFILE-MISMATCH, declares "(4 of 9 domains evaluated)", and
  the conformance label reads "PARTIAL NONE" because every
  conversational-profile control fails on the malicious body.
- `npm test` 2042/2068 pass (was 2022/2048 baseline; +20 tests:
  13 scanner-profile-mismatch + 7 scan-soul-scope-disclosure).
- `npm run release-smoke:corpus` 12/12 (corpus smoke runs `secure`,
  not `scan-soul`; no regression).
- Cross-analyzer consistency on kitchen-sink: scan-soul flips from
  100/100 HARDENED to PARTIAL NONE + HIGH; check --nanomind already
  fires HIGH on the same body. Direction now agrees.
- Benign fixtures (clean-skill, hardened-soul, partial-controls-soul)
  do NOT trigger profileMismatch because they have no marker (their
  profile is detected from body, not declared).

Closes #162.
…FP findings (#162)

Phase 4.5 Tier A adversarial review of b0f9e4b surfaced 5 HIGH-severity
issues. This commit closes all 5 and adds regression tests so future
edits don't reopen them.

H1 (heading-shape bypass) — `^#{2,3}\s+` regex missed H1 / H4-H6 and
bold-line `**Capability Boundaries**` headings, so an attacker could
dodge the heuristic with `#### Capability Boundaries` or `**Capability
Boundaries**`. Extended the heading detector to ATX H1-H6 + a bold-line
branch + a setext (`=====`) branch.

H2 (verb-only bypass) — body without the exact `execute|run|invoke
tool|shell|command` verb-noun pair (e.g., "uses MCP servers and function
calls under the hood") missed inference. Added a `proseSignal` regex
that catches MCP / function-call / tool-use / orchestrator language.
Layered with a tier cross-check: AGENTIC / MULTI-AGENT tiers (which
require explicit autonomous / orchestrator language) override even a
zero-signal body. TOOL-USING tier escalates only when body inference
also has at least one signal — keeps documentation/tutorial fixtures
that mention "Execute tool calls" in code fences from firing FPs.

H3 (code-fence FP) — pre-strip ``` and ~~~ code fences and HTML
comments (excluding `soul:` markers, which downstream parsers still
need) before running the heading regex. A tutorial SOUL.md that
documents how to write SOUL.md no longer fires a false-positive
mismatch.

H4 (code-assistant + scope-only Capability Boundaries FP) — a
`<!-- soul:profile=code-assistant -->` marker on a body whose only
heading is `## Capability Boundaries` documenting file-scope ("reads
files within the project") used to be inferred as tool-agent and fire
mismatch. Now: Capability Boundaries heading alone is treated as
code-assistant scope — tool-agent inference requires the heading PLUS
at least one tool-execution / manifest / MCP signal, OR a Human
Oversight heading.

H5 (defensive-prose FP) — a `## Trust Hierarchy` heading whose section
body is "This bot does not have one" used to fire mismatch. Now:
`isDefensiveSection` slices the section body and checks for negation /
disclaim language ("does not", "no inline", "not applicable", "n/a")
in the first ~240 chars. Defensive sections do NOT count as governance
signals.

Regression tests (6 new in `scanner-profile-mismatch.test.ts`):

  H1 ####     → MUST fire mismatch
  H1 **bold** → MUST fire mismatch
  H2 MCP      → MUST fire mismatch
  H3 code-fence → MUST NOT fire (tutorial)
  H4 code-assistant → MUST NOT fire (legitimate scope)
  H5 defensive → MUST NOT fire (disclaim language)

Verification:

- `npm test` 2048/2074 (was 2042/2068; +6 H1-H5 regression tests).
- `npm run release-smoke:corpus` 12/12.
- Kitchen-sink corpus fixture still fires HIGH SOUL-PROFILE-MISMATCH,
  PARTIAL HARDENED label, "(4 of 9 domains evaluated)" disclosure.
- All 5 adversarial bypass / FP scenarios now produce the expected
  verdict (MISMATCH for H1/H2 attacks, no-mismatch for H3/H4/H5
  legitimate authors).

Phase 4.5 review residual items DEFERRED:

- M1 (invalid-profile-name silent ignore — `<!-- soul:profile=basic -->`)
- M2 (no test pinning kitchen-sink corpus fixture itself)
- M3 (Registry JSON schema audit for new `profileMismatch` field)

These are tracked separately. None are blocking — M1 is UX, M2 is
test-coverage hardening (the heuristic itself is now well-tested), M3
needs Registry-side coordination.
@thebenignhacker thebenignhacker enabled auto-merge (squash) April 29, 2026 21:30
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

Security Review: PR #168 - SOUL Profile Mismatch Detection

VERDICT: APPROVE

SUMMARY:

This PR introduces HIGH severity SOUL-PROFILE-MISMATCH detection to prevent scope-narrowing attacks where malicious <!-- soul:profile=conversational --> markers hide governance domains from evaluation. The implementation adds content inference heuristics, mismatch detection logic, CLI rendering, and comprehensive test coverage. After thorough verification against the full source files, no unmitigated security or correctness issues were found. The code properly validates inputs at system boundaries, uses safe child_process invocations with array arguments, and includes defensive programming patterns throughout.

VERIFICATION NOTES:

Command Injection Check ✓

  • Line 31: execFileSync(process.execPath, [CLI_PATH, 'scan-soul', target, ...flags], {...}) — arguments passed as array elements to execFileSync, not shell-interpolated. Safe pattern.
  • Mitigation verified: No exec() or shell-interpolation found in diff or related code paths.

Path Traversal Check ✓

  • Lines 23-25: mkdtempSync(join(tmpdir(), 'scan-soul-scope-')) and writeFileSync(join(dir, 'SOUL.md'), ...) — uses Node.js path.join() which normalizes paths automatically. Temp directory is system-controlled.
  • Mitigation verified: All file operations use safe path construction with join() or resolve().

Regex DoS Check ✓

  • Line 543-575: Multiple regex patterns in inferProfileFromContent:
    • /```[\s\S]*?```/g — non-greedy quantifier on bounded input (SOUL.md files are typically <50KB)
    • /^#{1,6}\s+(.+?)\s*$/gm — fixed repetition count {1,6}, non-greedy .+?
    • /^\*\*([^*\n]+)\*\*\s*:?\s*$/gm — negated character class [^*\n]+, no nested quantifiers
    • /\b(?:execute|run|invoke|spawn|fork|exec)\s+(?:approved\s+)?(?:tool|shell|command|...)\b/i — all linear-time alternation
  • Mitigation verified: All patterns use non-greedy quantifiers, bounded repetition, or negated character classes. No catastrophic backtracking patterns like (a+)+ or (a|b+)+.

Prototype Pollution Check ✓

  • Lines 600-607: new Map<string, string>() and headingPositions: Array<{...}> — typed structures, no unsafe merge operations.
  • Line 918: const declaredDomainIds = new Set(PROFILE_DOMAINS[profile]) — reads from pre-defined constant PROFILE_DOMAINS, not user input.
  • Mitigation verified: No dynamic property assignment or Object.assign() on user-controlled keys.

Type Safety Check ✓

  • Line 88-104: SoulProfileMismatch interface properly typed with strict field definitions.
  • Line 541: Return type { profile: AgentProfile; signals: string[] } explicitly declared.
  • Line 908: profileMismatch typed as SoulProfileMismatch | undefined, consistent with interface.

Input Validation Check ✓

  • Line 552-556: Code fence stripping removes potentially malicious content before heading extraction, preventing injection via markdown code blocks.
  • Line 631-637: isDefensiveSection() validates section content to prevent false positives from negation language.
  • Line 894-900: Marker validation ensures profileFromMarker only captures valid profile names via Object.keys(PROFILE_DOMAINS).includes().

Exit Code Handling Check ✓

  • Line 6547-6555: --ci mode correctly exits with status 1 when profile mismatch detected, writing diagnostic to stderr. Consistent with existing CI failure patterns in the codebase.

Race Condition Check ✓

  • Lines 23-25: Temp directory creation and file writes are synchronous operations; no async interleaving possible.
  • Line 31: execFileSync is synchronous; no concurrent access issues.

Resource Leak Check ✓

  • Line 31-38: execFileSync stdout/stderr captured in try/catch block; buffers automatically released on completion or error.
  • No file handles or streams opened without corresponding cleanup.

CODE QUALITY OBSERVATIONS (non-blocking):

  1. Comprehensive test coverage: 164 lines of end-to-end CLI tests + 324 lines of unit tests covering adversarial bypass scenarios, false positive regressions, and edge cases.

  2. Defense in depth: Phase 4.5 fixes address multiple bypass vectors (H1-H5) with layered mitigations:

    • Code fence stripping prevents tutorial FPs
    • Defensive section detection prevents negation-language FPs
    • Tier cross-checking catches prose-only signals
    • Multi-level heading detection (ATX, Setext, bold-as-heading)
  3. Clear user feedback: CLI output includes actionable fix instructions and lists all skipped domains for manual audit.


Reviewed 4 files changed (39607 bytes)

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

Security Review: PR #168 - SOUL Profile Mismatch Detection

VERDICT: APPROVE

SUMMARY:
This PR implements HIGH severity SOUL-PROFILE-MISMATCH detection to prevent governance scope-narrowing attacks where <!-- soul:profile=conversational --> markers hide critical domains from evaluation. The implementation adds content-based profile inference, mismatch detection with tier cross-checking, CLI rendering with scope disclosure, and comprehensive test coverage including adversarial bypass cases. After verifying all findings against the full source files, no unmitigated security or correctness issues remain. The code uses safe patterns throughout: array-based child_process invocations, path normalization via join(), linear-time regexes, typed data structures, and proper input validation at system boundaries.

VERIFICATION NOTES (Previous Review Follow-up):

All findings from the previous review have been confirmed as non-issues upon verification against the complete source context:

Command Injection (Line 31) — execFileSync with array arguments, not shell-interpolated
Path Traversal (Lines 23-25) — path.join() normalizes paths automatically
ReDoS (Lines 543-575) — All patterns use non-greedy quantifiers, bounded repetition, or negated character classes
Prototype Pollution (Lines 600-607, 918) — Typed structures, no dynamic property assignment on user input
Type Safety (Lines 88-104, 541, 908) — Strict TypeScript interfaces with proper nullability
Input Validation (Lines 552-556, 631-637, 894-900) — Code fence stripping, defensive section detection, marker allowlist
Exit Codes (Lines 6547-6555) — --ci mode exits 1 on mismatch with stderr diagnostic
Race Conditions (Lines 23-25, 31) — Synchronous file ops and execFileSync, no async interleaving
Resource Leaks (Lines 31-38) — execFileSync buffers released in try/catch; no unclosed handles

ADDITIONAL VERIFICATION (Current PR Code):

No new security or correctness issues introduced. The implementation correctly:

  • Strips code fences before heading extraction (Line 552-556) to prevent tutorial/documentation false positives
  • Validates profile markers against PROFILE_DOMAINS allowlist (Line 900) before trusting them
  • Uses tier cross-check as confirming signal only when body inference also has signals (Lines 926-940), preventing docs-only false positives
  • Renders scope disclosure in CLI output (Lines 6349-6351, 6386) so users cannot misread partial-scope verdicts
  • Prefixes conformance labels with "PARTIAL" when domains were skipped (Lines 6460-6463)
  • Exits non-zero in --ci mode on mismatch (Lines 6554-6563)

All test fixtures demonstrate proper threat modeling (conversational marker hiding tool-agent content, tier/profile interaction edge cases, adversarial bypass attempts).


Reviewed 4 files changed (39607 bytes)

@thebenignhacker thebenignhacker merged commit fb6147f into main Apr 29, 2026
1 check passed
@thebenignhacker thebenignhacker deleted the fix/scan-soul-profile-bypass-162 branch April 29, 2026 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

scan-soul profile filter is attacker-controllable; HARDENED label shipped on partial-domain evaluation (detection bypass)

1 participant