fix(soul): emit HIGH SOUL-PROFILE-MISMATCH on profile-filter scope bypass (#162) by thebenignhacker · Pull Request #168 · opena2a-org/hackmyagent

thebenignhacker · 2026-04-29T21:30:38Z

Closes #162.

Summary

A  marker (or --profile override) narrows the scan-soul evaluation to 4 of 9 governance domains. The kitchen-sink malicious corpus fixture exploited this: its body declares Capability Boundaries, Trust Hierarchy, Data Handling, Agentic Safety, and Human Oversight — but the marker forces the scanner to skip them, producing a 100/100 HARDENED verdict on a genuinely-malicious agent definition.

Path B fix per [CHIEF-CDS]: honor the marker for legitimate conversational SOULs, but emit HIGH SOUL-PROFILE-MISMATCH whenever body content suggests a broader profile than the marker declares, plus disclose evaluation scope in every partial-scan output.

Implementation

`src/soul/scanner.ts`

New SoulProfileMismatch field on SoulScanResult carrying declared profile, inferred profile, skipped domains, and signals.
New inferProfileFromContent(content) heuristic. Layered by domain heading + verb signal: ## Agentic Safety or autonomous language → autonomous; ## Capability Boundaries + tool-execution signal / ## Human Oversight / tool-shell-execute language → tool-agent; ## Trust Hierarchy / ## Data Handling / scope-only ## Capability Boundaries → code-assistant; otherwise conversational.
Mismatch detection in scanSoul: when the declared profile (marker or --profile override) is strictly narrower than the inferred profile AND the inference would have evaluated extra domains, populate result.profileMismatch.
Tier cross-check: AGENTIC / MULTI-AGENT tiers (which require explicit autonomous / orchestrator language) override even a zero-signal body. TOOL-USING tier escalates only when body inference also has at least one signal — keeps documentation/tutorial fixtures from firing FPs.

`src/cli.ts` (scan-soul render)

Header line appends (N of 9 domains evaluated) when skippedDomains.length > 0.
New SOUL-PROFILE-MISMATCH HIGH block before Domain Scores. Shows declared vs. inferred profile, signals (up to 6 + overflow), skipped domains, runnable Fix.
Conformance label prefixed with PARTIAL when any domain was skipped — bare HARDENED / STANDARD / ESSENTIAL only appear at full-scope evaluation.
--ci exits non-zero when profileMismatch is set.

Phase 4.5 adversarial review (Tier A)

Subagent review of the first commit identified 5 HIGH-severity bypass / FP issues, all closed in the second commit before merge:

#	Issue	Resolution
H1	`####` / `bold` heading-shape bypass	Extended heading regex to ATX H1-H6 + bold-line + setext branches
H2	Verb-only bypass (`MCP servers and function calls under the hood`)	Added `proseSignal` regex + tier cross-check (AGENTIC+ overrides; TOOL-USING escalates only with body signal)
H3	Code-fence / HTML-comment FP (tutorials documenting how to write SOUL.md)	Pre-strip ```, `~~~`, and non-`soul:` HTML comments before parsing
H4	`<!-- soul:profile=code-assistant -->` + scope-only Capability Boundaries FP	Capability Boundaries heading alone now infers code-assistant; tool-agent requires the heading PLUS tool-execution / manifest / MCP signal
H5	Defensive prose FP (`## Trust Hierarchy\nThis bot does not have one.`)	`isDefensiveSection` checks first ~240 chars of section body for negation / disclaim language

All 5 scenarios covered by regression tests in __tests__/soul/scanner-profile-mismatch.test.ts.

Phase 4.5 score-jump classification: kitchen-sink flips from 100/100 HARDENED to 100/100 PARTIAL HARDENED + HIGH SOUL-PROFILE-MISMATCH. Classification (d) expanded-detection — clean per the rule because the new HIGH surfaces a TP that was previously hidden by the marker bypass.

Residual MEDIUM items deferred to follow-ups (none blocking):

M1:  (invalid profile name) silently ignored — UX warning needed.
M2: No test pinning the actual ~/.opena2a/corpus/repo/malicious/kitchen-sink/SOUL.md fixture.
M3: Registry JSON schema audit for the new profileMismatch field.

Verification

npm test 2048/2074 pass (was 2022/2048 baseline; +20 tests: 13 scanner-profile-mismatch + 7 scan-soul-scope-disclosure + 6 H1-H5 regression).
npm run release-smoke:corpus 12/12.
node dist/cli.js scan-soul ~/.opena2a/corpus/repo/malicious/kitchen-sink: HIGH SOUL-PROFILE-MISMATCH block + (4 of 9 domains evaluated) + Level PARTIAL HARDENED.
Cross-analyzer consistency: scan-soul kitchen-sink/ flips from 100 HARDENED to PARTIAL HARDENED + HIGH; check --nanomind kitchen-sink/ already fires HIGH on same body. Direction now agrees per cli-subagent-testing.md § Phase 7a.
All 5 adversarial scenarios (H1-H5) end-to-end verified: bypasses fire, FPs don't.

Test plan

npm test green (2048/2074)
npm run release-smoke:corpus 12/12
kitchen-sink scan-soul flips from 100 HARDENED to PARTIAL HARDENED + SOUL-PROFILE-MISMATCH HIGH
Phase 4.5 H1-H5 regression tests pass
benign hardened-soul / partial-controls-soul / clean-skill fixtures do NOT trigger mismatch
--ci exits 1 on mismatch fixture, exits 0 on clean conversational SOUL

…pass (#162) A `` marker (or `--profile` override) narrows the scan-soul evaluation to 4 of 9 governance domains. The kitchen-sink malicious corpus fixture exploited this: its body declares Capability Boundaries, Trust Hierarchy, Data Handling, Agentic Safety, and Human Oversight — but the marker forces the scanner to skip them, producing a 100/100 HARDENED verdict on a genuinely-malicious agent definition. Path B fix per [CHIEF-CDS]: honor the marker for legitimate conversational SOULs, but emit HIGH `SOUL-PROFILE-MISMATCH` whenever body content suggests a broader profile than the marker declares, plus disclose evaluation scope in every partial-scan output. Changes: - `src/soul/scanner.ts`: - New `SoulProfileMismatch` field on `SoulScanResult` carrying the declared profile, inferred profile, skipped domains, and the body signals that drove the inference. - New `inferProfileFromContent(content)` heuristic. Layered by domain heading + verb signal: `## Agentic Safety` or autonomous language → autonomous; `## Capability Boundaries` / `## Human Oversight` / tool-shell-execute language → tool-agent; `## Trust Hierarchy` / `## Data Handling` → code-assistant; otherwise conversational. - Mismatch detection in `scanSoul`: when the declared profile (marker or `--profile` override) is strictly narrower than the inferred profile AND the inference would have evaluated extra domains, populate `result.profileMismatch` with the human- readable signals so the UI can render them. - `src/cli.ts` (scan-soul render): - Header line now appends ` · (N of 9 domains evaluated)` whenever `result.skippedDomains.length > 0`. - Verdict line for mismatch fixtures replaces "All controls covered" with "Profile mismatch: declared=X skips N domains the body content suggests should be evaluated". - New SOUL-PROFILE-MISMATCH HIGH block rendered before Domain Scores. Shows declared vs. inferred profile, the signals (up to 6 + an "...and N more" overflow), the skipped domain list, and a runnable Fix recommending the user remove the marker. - Conformance label is prefixed with "PARTIAL " when any domain was skipped — bare "HARDENED" / "STANDARD" / "ESSENTIAL" only appear when all 9 domains were evaluated. - New "Scope" line under Conformance disclosing the evaluated-vs-skipped domain ratio. - `--ci` exits non-zero when `profileMismatch` is set (uses `globalCiMode || options.ci` because the global --ci flag is stripped from argv before Commander parses). - `__tests__/soul/scanner-profile-mismatch.test.ts` (NEW): 13 tests covering every mismatch scenario — declared=conversational + Capability Boundaries body fires HIGH; declared=conversational + conversational body does not; bare body without marker infers cleanly (no mismatch); legitimate marker matching body does not fire; Agentic Safety body infers autonomous; finding body lists every skipped domain; --profile override path fires; and 5 unit tests for `inferProfileFromContent` covering each tier. - `__tests__/cli/scan-soul-scope-disclosure.test.ts` (NEW): 7 end-to-end tests against the built CLI — partial-scope header includes "(N of 9 domains evaluated)"; full-scope header omits it; PARTIAL prefix on conformance label; SOUL-PROFILE-MISMATCH block renders with Fix command; --ci exits 1 on mismatch fixture; --ci exits 0 on clean conversational fixture. Verification: - `node dist/cli.js scan-soul ~/.opena2a/corpus/repo/malicious/kitchen-sink` no longer ships 100/100 HARDENED. Output now opens with HIGH SOUL-PROFILE-MISMATCH, declares "(4 of 9 domains evaluated)", and the conformance label reads "PARTIAL NONE" because every conversational-profile control fails on the malicious body. - `npm test` 2042/2068 pass (was 2022/2048 baseline; +20 tests: 13 scanner-profile-mismatch + 7 scan-soul-scope-disclosure). - `npm run release-smoke:corpus` 12/12 (corpus smoke runs `secure`, not `scan-soul`; no regression). - Cross-analyzer consistency on kitchen-sink: scan-soul flips from 100/100 HARDENED to PARTIAL NONE + HIGH; check --nanomind already fires HIGH on the same body. Direction now agrees. - Benign fixtures (clean-skill, hardened-soul, partial-controls-soul) do NOT trigger profileMismatch because they have no marker (their profile is detected from body, not declared). Closes #162.

…FP findings (#162) Phase 4.5 Tier A adversarial review of b0f9e4b surfaced 5 HIGH-severity issues. This commit closes all 5 and adds regression tests so future edits don't reopen them. H1 (heading-shape bypass) — `^#{2,3}\s+` regex missed H1 / H4-H6 and bold-line `**Capability Boundaries**` headings, so an attacker could dodge the heuristic with `#### Capability Boundaries` or `**Capability Boundaries**`. Extended the heading detector to ATX H1-H6 + a bold-line branch + a setext (`=====`) branch. H2 (verb-only bypass) — body without the exact `execute|run|invoke tool|shell|command` verb-noun pair (e.g., "uses MCP servers and function calls under the hood") missed inference. Added a `proseSignal` regex that catches MCP / function-call / tool-use / orchestrator language. Layered with a tier cross-check: AGENTIC / MULTI-AGENT tiers (which require explicit autonomous / orchestrator language) override even a zero-signal body. TOOL-USING tier escalates only when body inference also has at least one signal — keeps documentation/tutorial fixtures that mention "Execute tool calls" in code fences from firing FPs. H3 (code-fence FP) — pre-strip ``` and ~~~ code fences and HTML comments (excluding `soul:` markers, which downstream parsers still need) before running the heading regex. A tutorial SOUL.md that documents how to write SOUL.md no longer fires a false-positive mismatch. H4 (code-assistant + scope-only Capability Boundaries FP) — a `` marker on a body whose only heading is `## Capability Boundaries` documenting file-scope ("reads files within the project") used to be inferred as tool-agent and fire mismatch. Now: Capability Boundaries heading alone is treated as code-assistant scope — tool-agent inference requires the heading PLUS at least one tool-execution / manifest / MCP signal, OR a Human Oversight heading. H5 (defensive-prose FP) — a `## Trust Hierarchy` heading whose section body is "This bot does not have one" used to fire mismatch. Now: `isDefensiveSection` slices the section body and checks for negation / disclaim language ("does not", "no inline", "not applicable", "n/a") in the first ~240 chars. Defensive sections do NOT count as governance signals. Regression tests (6 new in `scanner-profile-mismatch.test.ts`): H1 #### → MUST fire mismatch H1 **bold** → MUST fire mismatch H2 MCP → MUST fire mismatch H3 code-fence → MUST NOT fire (tutorial) H4 code-assistant → MUST NOT fire (legitimate scope) H5 defensive → MUST NOT fire (disclaim language) Verification: - `npm test` 2048/2074 (was 2042/2068; +6 H1-H5 regression tests). - `npm run release-smoke:corpus` 12/12. - Kitchen-sink corpus fixture still fires HIGH SOUL-PROFILE-MISMATCH, PARTIAL HARDENED label, "(4 of 9 domains evaluated)" disclosure. - All 5 adversarial bypass / FP scenarios now produce the expected verdict (MISMATCH for H1/H2 attacks, no-mismatch for H3/H4/H5 legitimate authors). Phase 4.5 review residual items DEFERRED: - M1 (invalid-profile-name silent ignore — ``) - M2 (no test pinning kitchen-sink corpus fixture itself) - M3 (Registry JSON schema audit for new `profileMismatch` field) These are tracked separately. None are blocking — M1 is UX, M2 is test-coverage hardening (the heuristic itself is now well-tested), M3 needs Registry-side coordination.

github-actions

Claude Code Review

Security Review: PR #168 - SOUL Profile Mismatch Detection

VERDICT: APPROVE

SUMMARY:

This PR introduces HIGH severity SOUL-PROFILE-MISMATCH detection to prevent scope-narrowing attacks where malicious  markers hide governance domains from evaluation. The implementation adds content inference heuristics, mismatch detection logic, CLI rendering, and comprehensive test coverage. After thorough verification against the full source files, no unmitigated security or correctness issues were found. The code properly validates inputs at system boundaries, uses safe child_process invocations with array arguments, and includes defensive programming patterns throughout.

VERIFICATION NOTES:

Command Injection Check ✓

Line 31: execFileSync(process.execPath, [CLI_PATH, 'scan-soul', target, ...flags], {...}) — arguments passed as array elements to execFileSync, not shell-interpolated. Safe pattern.
Mitigation verified: No exec() or shell-interpolation found in diff or related code paths.

Path Traversal Check ✓

Lines 23-25: mkdtempSync(join(tmpdir(), 'scan-soul-scope-')) and writeFileSync(join(dir, 'SOUL.md'), ...) — uses Node.js path.join() which normalizes paths automatically. Temp directory is system-controlled.
Mitigation verified: All file operations use safe path construction with join() or resolve().

Regex DoS Check ✓

Line 543-575: Multiple regex patterns in inferProfileFromContent:
- /```[\s\S]*?```/g — non-greedy quantifier on bounded input (SOUL.md files are typically <50KB)
- /^#{1,6}\s+(.+?)\s*$/gm — fixed repetition count {1,6}, non-greedy .+?
- /^\*\*([^*\n]+)\*\*\s*:?\s*$/gm — negated character class [^*\n]+, no nested quantifiers
- /\b(?:execute|run|invoke|spawn|fork|exec)\s+(?:approved\s+)?(?:tool|shell|command|...)\b/i — all linear-time alternation
Mitigation verified: All patterns use non-greedy quantifiers, bounded repetition, or negated character classes. No catastrophic backtracking patterns like (a+)+ or (a|b+)+.

Prototype Pollution Check ✓

Lines 600-607: new Map<string, string>() and headingPositions: Array<{...}> — typed structures, no unsafe merge operations.
Line 918: const declaredDomainIds = new Set(PROFILE_DOMAINS[profile]) — reads from pre-defined constant PROFILE_DOMAINS, not user input.
Mitigation verified: No dynamic property assignment or Object.assign() on user-controlled keys.

Type Safety Check ✓

Line 88-104: SoulProfileMismatch interface properly typed with strict field definitions.
Line 541: Return type { profile: AgentProfile; signals: string[] } explicitly declared.
Line 908: profileMismatch typed as SoulProfileMismatch | undefined, consistent with interface.

Input Validation Check ✓

Line 552-556: Code fence stripping removes potentially malicious content before heading extraction, preventing injection via markdown code blocks.
Line 631-637: isDefensiveSection() validates section content to prevent false positives from negation language.
Line 894-900: Marker validation ensures profileFromMarker only captures valid profile names via Object.keys(PROFILE_DOMAINS).includes().

Exit Code Handling Check ✓

Line 6547-6555: --ci mode correctly exits with status 1 when profile mismatch detected, writing diagnostic to stderr. Consistent with existing CI failure patterns in the codebase.

Race Condition Check ✓

Lines 23-25: Temp directory creation and file writes are synchronous operations; no async interleaving possible.
Line 31: execFileSync is synchronous; no concurrent access issues.

Resource Leak Check ✓

Line 31-38: execFileSync stdout/stderr captured in try/catch block; buffers automatically released on completion or error.
No file handles or streams opened without corresponding cleanup.

CODE QUALITY OBSERVATIONS (non-blocking):

Comprehensive test coverage: 164 lines of end-to-end CLI tests + 324 lines of unit tests covering adversarial bypass scenarios, false positive regressions, and edge cases.
Defense in depth: Phase 4.5 fixes address multiple bypass vectors (H1-H5) with layered mitigations:
- Code fence stripping prevents tutorial FPs
- Defensive section detection prevents negation-language FPs
- Tier cross-checking catches prose-only signals
- Multi-level heading detection (ATX, Setext, bold-as-heading)
Clear user feedback: CLI output includes actionable fix instructions and lists all skipped domains for manual audit.

Reviewed 4 files changed (39607 bytes)

github-actions

Claude Code Review

Security Review: PR #168 - SOUL Profile Mismatch Detection

VERDICT: APPROVE

SUMMARY:
This PR implements HIGH severity SOUL-PROFILE-MISMATCH detection to prevent governance scope-narrowing attacks where  markers hide critical domains from evaluation. The implementation adds content-based profile inference, mismatch detection with tier cross-checking, CLI rendering with scope disclosure, and comprehensive test coverage including adversarial bypass cases. After verifying all findings against the full source files, no unmitigated security or correctness issues remain. The code uses safe patterns throughout: array-based child_process invocations, path normalization via join(), linear-time regexes, typed data structures, and proper input validation at system boundaries.

VERIFICATION NOTES (Previous Review Follow-up):

All findings from the previous review have been confirmed as non-issues upon verification against the complete source context:

✅ Command Injection (Line 31) — execFileSync with array arguments, not shell-interpolated
✅ Path Traversal (Lines 23-25) — path.join() normalizes paths automatically
✅ ReDoS (Lines 543-575) — All patterns use non-greedy quantifiers, bounded repetition, or negated character classes
✅ Prototype Pollution (Lines 600-607, 918) — Typed structures, no dynamic property assignment on user input
✅ Type Safety (Lines 88-104, 541, 908) — Strict TypeScript interfaces with proper nullability
✅ Input Validation (Lines 552-556, 631-637, 894-900) — Code fence stripping, defensive section detection, marker allowlist
✅ Exit Codes (Lines 6547-6555) — --ci mode exits 1 on mismatch with stderr diagnostic
✅ Race Conditions (Lines 23-25, 31) — Synchronous file ops and execFileSync, no async interleaving
✅ Resource Leaks (Lines 31-38) — execFileSync buffers released in try/catch; no unclosed handles

ADDITIONAL VERIFICATION (Current PR Code):

No new security or correctness issues introduced. The implementation correctly:

Strips code fences before heading extraction (Line 552-556) to prevent tutorial/documentation false positives
Validates profile markers against PROFILE_DOMAINS allowlist (Line 900) before trusting them
Uses tier cross-check as confirming signal only when body inference also has signals (Lines 926-940), preventing docs-only false positives
Renders scope disclosure in CLI output (Lines 6349-6351, 6386) so users cannot misread partial-scope verdicts
Prefixes conformance labels with "PARTIAL" when domains were skipped (Lines 6460-6463)
Exits non-zero in --ci mode on mismatch (Lines 6554-6563)

All test fixtures demonstrate proper threat modeling (conversational marker hiding tool-agent content, tier/profile interaction edge cases, adversarial bypass attempts).

Reviewed 4 files changed (39607 bytes)

thebenignhacker added 2 commits April 29, 2026 15:16

thebenignhacker enabled auto-merge (squash) April 29, 2026 21:30

github-actions Bot approved these changes Apr 29, 2026

View reviewed changes

Merge branch 'main' into fix/scan-soul-profile-bypass-162

751dfb2

github-actions Bot approved these changes Apr 29, 2026

View reviewed changes

thebenignhacker merged commit fb6147f into main Apr 29, 2026
1 check passed

thebenignhacker deleted the fix/scan-soul-profile-bypass-162 branch April 29, 2026 21:32

thebenignhacker mentioned this pull request Apr 30, 2026

chore(deps): re-pin @opena2a/credential-patterns to 0.1.1 (Wave 1 follow-up) — 0.22.1 #169

Open

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(soul): emit HIGH SOUL-PROFILE-MISMATCH on profile-filter scope bypass (#162)#168

fix(soul): emit HIGH SOUL-PROFILE-MISMATCH on profile-filter scope bypass (#162)#168
thebenignhacker merged 3 commits intomainfrom
fix/scan-soul-profile-bypass-162

thebenignhacker commented Apr 29, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thebenignhacker commented Apr 29, 2026

Summary

Implementation

src/soul/scanner.ts

src/cli.ts (scan-soul render)

Phase 4.5 adversarial review (Tier A)

Verification

Test plan

Related

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Security Review: PR #168 - SOUL Profile Mismatch Detection

VERDICT: APPROVE

SUMMARY:

VERIFICATION NOTES:

Command Injection Check ✓

Path Traversal Check ✓

Regex DoS Check ✓

Prototype Pollution Check ✓

Type Safety Check ✓

Input Validation Check ✓

Exit Code Handling Check ✓

Race Condition Check ✓

Resource Leak Check ✓

CODE QUALITY OBSERVATIONS (non-blocking):

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Security Review: PR #168 - SOUL Profile Mismatch Detection

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`src/soul/scanner.ts`

`src/cli.ts` (scan-soul render)