extractAndEnrich writes pages from untrusted text without a gate

## Summary

`extractAndEnrich` in `src/core/enrichment-service.ts` runs a regex over arbitrary text, pulls out anything that looks like a capitalized name, and unconditionally calls `engine.putPage()` on the owner's brain under `people/{slug}` or `companies/{slug}`. There is no confirmation step, no allow-list, no review queue, and the source of the text is whatever skill invoked the call — ingest, meeting-ingestion, idea-ingest, any of the recipes that forward raw email / transcript / pasted content.

The practical shape:

1. A meeting transcript, pasted article, or email comes in via one of the ingest skills.
2. The skill calls `extractAndEnrich(engine, text, sourceSlug)` as part of enrichment.
3. `extractEntities` greps for `\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,3}\b`, accepts any hit, and feeds each to `enrichEntity`.
4. If the entity page doesn't exist, `enrichEntity` creates it with `generateStubContent(name, type, context)` — where `context` is a 50-char window of the raw input text — and writes it to the brain as an authoritative page.

So the attacker-shaped case looks like: a hostile email (or a spoofed meeting transcript, or a poisoned CSV attached to a briefing) carrying the sentences

> *I had lunch with Paul Graham today. He mentioned that Ycombinator Inc. is pivoting to crypto.*

ends up creating `people/paul-graham` and `companies/ycombinator-inc` pages in the owner's brain, with the sentence copied in as the summary. The owner never typed that. Agents that query the brain later ("what does gbrain say about Paul Graham?") quote it back as ground truth.
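
For reference, here is a minimal sketch of what the step-3 pattern accepts from that sentence. The regex is the one quoted above with the `g` flag added; the helper name `findCandidateNames` is made up for illustration and is not the in-tree `extractEntities`.

```typescript
// Pattern quoted in step 3, with the global flag so every hit is returned.
const ENTITY_PATTERN = /\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,3}\b/g;

// Hypothetical helper (not the real extractEntities): returns every
// substring the pattern accepts, with no filtering at all.
function findCandidateNames(text: string): string[] {
  return text.match(ENTITY_PATTERN) ?? [];
}

const hostileText =
  "I had lunch with Paul Graham today. He mentioned that Ycombinator Inc. is pivoting to crypto.";

console.log(findCandidateNames(hostileText));
// Two candidates: "Paul Graham" and "Ycombinator Inc"
```

Both hits come straight from attacker-controlled text, and nothing between the match and the write asks who supplied them.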

This is a data-injection / authz gap, not an RCE. It's latent today because the current in-tree caller set is small, but the skill-driven ingest paths this wires into (meeting-ingestion, idea-ingest, media-ingest) all take external content by design.

## Why it matters

The brain's value is that it's trusted. Everything downstream — search, enrichment chains, agent responses — assumes pages in `people/` and `companies/` reflect the owner's observations. Once untrusted ingest gets unbounded write access to those namespaces, the trust model collapses silently: the owner can't tell which entries came from their own notes and which came from a sentence that happened to sit in an incoming email.

Secondary effects:

- **Slug collisions.** `slugifyEntity` normalizes `Tim<Cook>` and `Tim Cook` to the same slug. An attacker who wants to tamper with an existing legitimate page just picks a name that slugifies to its path and ships a sentence that mentions it — `enrichEntity` sees the page exists, takes the UPDATE branch, and appends a timeline entry quoting the attacker's context (see `enrichEntity` around lines 85-100 — the UPDATE path is silent on who supplied the timeline text). R6 filed this as a separate finding (slugify collision); it's more impactful once F004 is closed but worth mentioning here.
- **Regex is over-permissive.** The pattern accepts any run of two to four capitalized words, so it matches `Dear John`, `Best Regards`, `New York`, `Bar Mitzvah`. The false-positive rate on real email means a brain wired to this function fills with junk entries even without an adversary.
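
The false-positive point is easy to check in isolation. This is a small sketch that runs the pattern from `extractEntities` (as quoted in the Summary) over an invented, entirely benign email line; the sample text is mine, not from the audit.

```typescript
// Same pattern as quoted in the Summary, global flag added.
const NAME_PATTERN = /\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,3}\b/g;

// Invented benign input: a greeting, an event, a city, a sign-off.
const benignEmail =
  "Dear John, see you at the Bar Mitzvah in New York. Best Regards";

const falsePositives: string[] = benignEmail.match(NAME_PATTERN) ?? [];

console.log(falsePositives);
// Four hits, none of which the owner would want as a people/ or companies/ page:
// "Dear John", "Bar Mitzvah", "New York", "Best Regards"
```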

## Proposed approaches

Three shapes, ordered by blast radius. All of them keep the function exported with the same name so existing in-tree callers keep working.

### 1. Quarantine namespace by default

- `extractAndEnrich` writes proposed pages under `_pending_review/{slug}` instead of `people/{slug}` / `companies/{slug}`.
- A separate command (`gbrain review --pending`) lets the owner approve, reject, or merge each proposal into the real namespace.
- Ingest skills are unaffected — they still call `extractAndEnrich` — but the user is the gate on what becomes authoritative.

Smallest behavior change from the owner's point of view, strongest defense. The review queue is the audit trail.
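
A rough sketch of the routing, assuming a proposal record that remembers where the page would land on approval. `PendingProposal`, `proposePage`, and the field names are hypothetical; only the `_pending_review/`, `people/`, and `companies/` prefixes come from this issue.

```typescript
type EntityType = "person" | "company";

// Hypothetical proposal record; field names are illustrative only.
interface PendingProposal {
  pendingSlug: string; // where the unreviewed page is written
  targetSlug: string;  // where it would move if the owner approves
  sourceSlug: string;  // which ingest path produced it (the audit trail)
}

const NAMESPACE: Record<EntityType, string> = {
  person: "people",
  company: "companies",
};

// Instead of writing to people/ or companies/ directly, record a proposal.
function proposePage(
  type: EntityType,
  slug: string,
  sourceSlug: string,
): PendingProposal {
  return {
    pendingSlug: `_pending_review/${slug}`,
    targetSlug: `${NAMESPACE[type]}/${slug}`,
    sourceSlug,
  };
}

const proposal = proposePage("person", "paul-graham", "ingest/email");
console.log(proposal.pendingSlug); // _pending_review/paul-graham
console.log(proposal.targetSlug);  // people/paul-graham
```

Keeping `targetSlug` on the record is one way to let an approve step move the page without re-deriving the entity type; whether that belongs in page metadata or a side table is a maintainer call.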

### 2. Require an explicit `autocreate: true` flag per request

- Default `enrichEntity` / `extractAndEnrich` to `action: 'skipped'` when the page doesn't exist.
- Callers that genuinely want page creation pass `{ autocreate: true }` and accept the risk.
- Skills that take external content (meeting-ingestion, idea-ingest, media-ingest) get the default; skills that take owner-typed input (`gbrain new person …` if it exists) set the flag explicitly.

Smaller diff than (1). Downside: the default is a behavior change from today, and every skill that currently relies on auto-create needs a touch. That's the skills surface you asked to keep sensitive — happy to leave the audit to a maintainer-led review rather than a drive-by PR.
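
The decision itself could be as small as the sketch below. `EnrichOptions` and `decideAction` are hypothetical names, and the `'updated'` action value is an assumption on my part; only `'created'` and `'skipped'` appear in this issue.

```typescript
// 'updated' is an assumed label for the existing-page branch.
type EnrichAction = "created" | "updated" | "skipped";

// Hypothetical options bag; autocreate defaults to off.
interface EnrichOptions {
  autocreate?: boolean;
}

// Pure decision: existing pages keep today's UPDATE path, missing pages
// are only created when the caller opted in explicitly.
function decideAction(
  pageExists: boolean,
  options: EnrichOptions = {},
): EnrichAction {
  if (pageExists) return "updated";
  return options.autocreate === true ? "created" : "skipped";
}

console.log(decideAction(false));                       // skipped
console.log(decideAction(false, { autocreate: true })); // created
console.log(decideAction(true));                        // updated
```

Note this only gates creation. The UPDATE branch is untouched, so timeline entries on existing pages would still need their own handling.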

### 3. Source-based policy (config-driven)

- Add `enrichment.policy` in gbrain config: `allow | quarantine | deny`, per `sourceSlug` prefix.
- `meeting-ingestion` → `quarantine`, `manual-entry` → `allow`, etc.
- Ships as config, no code change beyond one read at the top of `enrichEntity`.

Most flexible, highest cognitive cost (another policy knob the owner has to remember).
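
One possible shape for the lookup, assuming longest-prefix-wins and a configurable default. The `EnrichmentConfig` layout, the `defaultPolicy` field, and `resolvePolicy` are all hypothetical; the three policy values and the two example sources are the ones listed above.

```typescript
type Policy = "allow" | "quarantine" | "deny";

interface PolicyRule {
  prefix: string; // matched against the start of sourceSlug
  policy: Policy;
}

// Hypothetical config layout for `enrichment.policy`.
interface EnrichmentConfig {
  defaultPolicy: Policy;
  rules: PolicyRule[];
}

// Longest matching prefix wins; fall back to the default when nothing matches.
function resolvePolicy(config: EnrichmentConfig, sourceSlug: string): Policy {
  let best: PolicyRule | undefined;
  for (const rule of config.rules) {
    const matches = sourceSlug.startsWith(rule.prefix);
    if (matches && (!best || rule.prefix.length > best.prefix.length)) {
      best = rule;
    }
  }
  return best ? best.policy : config.defaultPolicy;
}

const config: EnrichmentConfig = {
  defaultPolicy: "quarantine",
  rules: [
    { prefix: "meeting-ingestion", policy: "quarantine" },
    { prefix: "manual-entry", policy: "allow" },
  ],
};

console.log(resolvePolicy(config, "manual-entry/note"));         // allow
console.log(resolvePolicy(config, "meeting-ingestion/standup")); // quarantine
console.log(resolvePolicy(config, "ingest/email"));              // quarantine (default)
```

Defaulting unmatched sources to `quarantine` rather than `allow` keeps a forgotten rule from reopening the silent-write path.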

## PoC

Runtime PoC from the audit (internal, not public):

```ts
// Minimal shape; real PoC file at workspace/gbrain/report/evidence/poc-r6-f004-*.ts
// Assumes `engine` is an already-initialized brain engine pointed at a scratch brain.
const { extractAndEnrich } = await import('./src/core/enrichment-service.ts');
const hostileText = 'I met with Paul Graham. He mentioned Ycombinator Inc. is pivoting.';
const results = await extractAndEnrich(engine, hostileText, 'ingest/email');
console.log(results[0].action); // 'created' in the audit run
// After: engine.getPage('people/paul-graham') returns a page that was never
// authored by the owner, with the hostile sentence as its Summary field.
```

Nothing in the call chain rejects it. `results[0].action === 'created'` and the page is now discoverable via `searchKeyword('Paul Graham')`.

## What I'd file against if chosen

Happy to send a PR for any of the three approaches. I'd lean (1) because it preserves current behavior for the owner while slamming the door on silent writes — owner still sees the proposal, can accept in one command, and gets an audit log for free. Open to (2) or (3) if you prefer less churn in the review UX.

No PR yet because (a) this is shaped like a product decision more than a bug fix, and (b) the right fix depends on how you want enrichment to feel end-to-end — which touches the skill surface you own directly. Happy to defer, happy to spec out the chosen approach once you pick.

## Related

- #156 (path traversal in `check-resolvable.ts`) — same audit round, different finding.
- #158 (postgres pool timeout leak) — same round, different subsystem.
- Slug collision in `slugifyEntity` is a separate follow-up that becomes easier to exploit if F004 stays open; can be addressed alongside or independently.

## Out of scope

- Timeline entry injection on existing pages (related, deserves its own issue).
- Entity-regex false positives (UX problem, not security).
- Tier auto-escalation based on untrusted source count (tangential; an attacker who can create a page can also inflate its tier, but that's an amplifier, not the root cause).

