Skip to content

[recipes] Gmail smart pull — sensitivity routing + contact entities#220

Open
alanshurafa wants to merge 10 commits intoNateBJones-Projects:mainfrom
alanshurafa:contrib/alanshurafa/gmail-smart-pull
Open

[recipes] Gmail smart pull — sensitivity routing + contact entities#220
alanshurafa wants to merge 10 commits intoNateBJones-Projects:mainfrom
alanshurafa:contrib/alanshurafa/gmail-smart-pull

Conversation

@alanshurafa
Copy link
Copy Markdown
Contributor

Summary

Adds recipes/gmail-smart-pull/ — a Gmail puller that routes messages by sensitivity tier (restricted / personal / standard), maintains a contact entity cache with relationship_tier, and atomizes thread content via a four-provider LLM fan-out (Anthropic API, OpenRouter, Claude CLI, Codex CLI). Read scope only: gmail.readonly.

Ships two idempotent SQL migrations:

  • 001_merge_thought_metadata.sql — shallow-merge RPC so repeated pulls update thought metadata without clobbering unrelated keys.
  • 002_entities_canonical_email.sql — adds canonical_email column plus supporting indexes for contact lookups.

Coupling

  • Atomizer recipe: soft-coupled. This PR ships a self-contained lib/atomize-text.mjs copy so either recipe can merge first.
  • CRM person tiers recipe: soft-optional. The contacts cache reads tier data when that recipe is installed, otherwise falls back to defaults.

Security review

Four P1 findings from pre-review were fixed; they absorbed five P2 findings along the way. Two commits in this branch exist specifically to harden security surface area.

OAuth hardening bundle:

  • CSRF state parameter with constant-time-ish comparison on the callback.
  • Callback server bound to 127.0.0.1 (loopback only).
  • error query param HTML-escaped before reflection.
  • tokenRes.ok checks on all token exchange calls.
  • Bounded retry/backoff honoring Retry-After, with jitter, limited to 429/5xx.
  • Atomic token.json write with 0o600 permissions.

Sensitivity classifier: 9 new restricted-tier patterns — OpenAI (sk-proj, sk-svcacct, sk-admin), Anthropic (sk-ant), AWS (AKIA/ASIA + secret-key proximity), GCP (AIza), JWTs, PEM private keys, GitHub tokens (ghp, gho, ghu, ghs, ghr), and Slack xoxp.

Prompt injection: the default atomize prompt now frames email content as untrusted input inside an explicit SECURITY section. The --dangerously-bypass-approvals-and-sandbox default on the Codex atomizer was dropped — bypass is now gated behind an explicit opt-in env var.

Credentials: the README credential tracker no longer references a service_role key.

Known follow-ups

  • Tune sensitivity classifier false-positive rate on innocuous alphanumeric strings.
  • Stronger prompt-injection defense (requiring atoms to be verbatim substrings of the input) was scoped out — it costs ~3x throughput.
  • Path-traversal hardening on CONTACTS_CACHE_PATH and GMAIL_TOKEN_PATH.
  • Streaming pack output for very large pulls, RFC 2822 group-address parsing, IDN/EAI support, and a richer htmlToText parser.

Four P2 findings and three P3 findings are documented as deferred follow-ups in the branch review notes.

alanshurafa and others added 8 commits April 21, 2026 16:31
Core Gmail puller script for a new recipe under recipes/gmail-smart-pull/.
The puller fetches messages from the Gmail API (read-only scope), strips
quoted replies and signatures, filters auto-generated noise, and emits
an OB1 ingest pack that downstream pipelines can feed into fingerprint
dedup + sensitivity-gate + upsert.

Also includes two small pure-JS libs the puller depends on:

- scripts/lib/sensitivity.mjs tags each message body against two
  pattern sets (restricted: SSN, passport, bank, API keys, passwords,
  credit cards; personal: email/phone/health/financial signals) so the
  ingest side can route tiers to the right store. Tagging only — the
  recipe does not enforce a routing policy itself.

- scripts/lib/entity-resolver.mjs does RFC 2822 header parsing
  (From/To/Cc with quoted commas, display-name variants) into
  { name, email } pairs so structured correspondents can be carried in
  the pack and upserted as first-class entities later.

OAuth credentials come from GMAIL_OAUTH_CLIENT_ID and
GMAIL_OAUTH_CLIENT_SECRET env vars. No real email addresses, client
IDs, or secrets are embedded anywhere. The only scope requested is
https://www.googleapis.com/auth/gmail.readonly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The LLM atomizer splits long email bodies into multiple atomic thoughts
before the puller emits them in the pack. Two behaviors carried over
from upstream experience running this at scale:

1. Prompts are piped to CLI providers via stdin, not via the -p
   command-line argument. On Windows shell:true cmd.exe mangled
   multi-line prompts containing quotes and newlines so the child
   process received a truncated/empty string and the LLM replied
   conversationally ("Looks like your message got cut off..."). 190/190
   atomize calls in one real batch failed this way until stdin fixed
   it. Same fix applied to the codex provider.

2. A new 'codex' provider shells out to `codex exec` so users
   orchestrating the recipe from a Codex session can atomize without
   crossing the streams with a nested claude-cli (which would fail
   nested-process detection). The `claude-cli` provider still works
   from standalone terminals and refuses to run inside Claude Code.

OB1 users will typically use provider='anthropic' (direct Messages
API) or 'openrouter' since OB1 is cloud-first and those are already
provisioned. CLI providers are opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anonical_email

Two idempotent migrations that complete the pack's handoff to a
downstream ingest pipeline:

1. merge_thought_metadata(p_id, p_patch) — shallow-merge a JSONB patch
   into a thought's metadata without re-triggering the full upsert
   path (no embedding regen, no enrichment, no fingerprint recompute).
   Useful for per-row metadata backfills like flipping a
   relationship_tier on a batch of thoughts after regenerating the
   contacts cache.

2. entities.canonical_email — adds a nullable TEXT column + a partial
   unique index to public.entities so email correspondents parsed from
   the pack's structured From/To/Cc blocks can be upserted by normalized
   email address. Existing uniqueness on (entity_type, normalized_name)
   is preserved because two people can legitimately share a display
   name; email is the stable identifier.

Both use CREATE OR REPLACE / IF NOT EXISTS guards — safe to re-run.
Neither drops or renames existing columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README documents the full setup path: Gmail OAuth Desktop-app client,
env vars (no credentials on disk), first-run consent flow, dry-run,
real run, and optional migration install. Explicitly covers the four
design choices most likely to surprise a new user:

- Sensitivity routing is tag-only — the recipe does not enforce a
  policy, the ingest pipeline does. Calls out that OB1 is cloud-first
  so "restricted stays local" needs explicit wiring (two-store setup
  or block-on-import).
- Engagement filter defaults to engaged-only with STARRED/IMPORTANT
  bypass, with clear instructions to disable or rebuild.
- Relationship tier is metadata (contact/known/unknown), not a gate.
  Three ways to produce the contacts cache documented.
- Atomization is opt-in per-message (>= 150 words default) with
  anthropic/openrouter/claude-cli/codex provider choice. Graceful
  fallback to whole-message capture on atomizer failure.

metadata.json follows the schema template at recipes/_template/ with
required fields (name, description, category, author, version,
requires.open_brain, tags, difficulty, estimated_time) and no extras.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l tracker

Codex review (P2 originally, elevated to P1 in triage): the credential
tracker in the README asked users to paste their Supabase service-role
key into a plaintext doc, but this recipe never touches that key — the
puller only emits a pack file, and any downstream ingest pipeline that
needs service_role should read it from env/secret manager, not from a
user's text editor.

Removing the field avoids an entirely avoidable leak path for a highly
privileged database secret, and adds a note so contributors who copy
this tracker pattern into other recipes don't reintroduce the mistake.
… loopback bind + HTML escape + HTTP checks)

Codex identified four coupled OAuth weaknesses in scripts/pull-gmail.mjs:

- No OAuth state parameter: authUrl was built without a random state and
  the callback handler accepted the first ?code= it saw. Any local
  process or malicious localhost page could race the browser redirect
  and bind the script to an attacker-controlled Google account.
- server.listen() without an address defaulted to IPv6-any/0.0.0.0 on
  some platforms, briefly exposing the callback to the LAN.
- URL error parameter reflected into HTML without escaping — low-impact
  reflected XSS but trivial to fix.
- Token exchange and refresh called res.json() before checking res.ok,
  so proxy/5xx responses produced a useless JSON parse error instead of
  a useful OAuth failure with status + body.

Fix: generate 16 bytes of random hex as state, require the callback to
echo it back (mismatch -> hard reject), bind createServer to 127.0.0.1
explicitly, HTML-escape the error param before reflecting, and gate
both token POSTs on res.ok with a bounded body preview on failure.
…s to sensitivity classifier

Codex flagged (originally P2, elevated to P1 in triage because the
sensitivity tier drives downstream routing): the restricted-tier pattern
set missed several common secret formats, so emails containing them
would be classified 'standard' and flow into the general thoughts pool
instead of the restricted-only store.

Adds patterns for:

- openai_key       — sk-proj-, sk-svcacct-, sk-admin- variants
- anthropic_key    — sk-ant-api / sk-ant-admin tokens
- aws_access_key_id    — AKIA/ASIA/AROA/AIDA prefixes
- aws_secret_access_key — proximity match near "aws secret" label
- gcp_api_key      — AIza<35 chars> canonical form
- jwt_token        — eyJ<header>.<payload>.<sig> three-segment form
- pem_private_key  — BEGIN PRIVATE KEY blocks (RSA, EC, DSA, OPENSSH, PGP, ENCRYPTED)
- github_token     — ghp/gho/ghu/ghs/ghr _ 36+ char bodies
- slack_token      — xox[aboprs]- tokens

The existing generic api_key_pattern is kept as a belt-and-suspenders
fallback. All patterns still fail-open (standard tier) on no match —
classification never throws, so a missing pattern degrades gracefully
rather than blocking the pull.
…+ harden atomize prompt against injection

Codex flagged this as the highest-severity finding in the atomize lib
(originally tagged P1-5 + P2): the 'codex' provider spawned
`codex exec --dangerously-bypass-approvals-and-sandbox -` with an email
body interpolated directly into the prompt. A malicious sender can embed
'IGNORE PREVIOUS INSTRUCTIONS' or tool-call primers, and because the
child Codex agent ran with the sandbox disabled, prompt-injection
escalated to arbitrary local command/file access.

Fixes:

1. Remove the --dangerously-bypass-approvals-and-sandbox flag from the
   default codex invocation. Users who actively need it for an
   atomization-only run can opt in via GMAIL_ATOMIZE_CODEX_BYPASS=1 env
   var, which documents the risk at the opt-in site.
2. Strengthen DEFAULT_ATOMIZE_PROMPT with an explicit SECURITY section
   that frames the INPUT THOUGHT as untrusted data, not instructions,
   and forbids emitting system/tool/assistant markers in the output.
3. Add a top-of-file comment describing the prompt-injection threat
   model so callers who override the prompt don't silently drop the
   hardening.

This does not eliminate prompt injection (no prompt-only defense can),
but it removes the most dangerous escalation path and raises the bar
from "read email -> run code" to "read email -> influence atoms".
@github-actions github-actions Bot added the recipe Contribution: step-by-step recipe label Apr 21, 2026
The previous regex `\b(?:aws[_ -]?secret|aws[_ -]?access[_ -]?key)\b`
could not match `aws_secret_access_key=...` — the most common env-var
form — because `_` is a word char, so the `\b` between `t` and `_` in
`aws_secret_access_key` didn't fire, and neither alternation caught
the combined phrase.

Restructured the alternation so `aws_secret` can optionally absorb the
trailing `_access_key`:

  aws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key)

Verified against 8 test cases covering kvp form, uppercase, hyphen
separators, space separators, standalone `aws_secret`, standalone
`aws_access_key`, a negative case, and the full env-var pair. All
pass with no false positives.
@github-actions github-actions Bot added the schema Contribution: database extension label Apr 22, 2026
@alanshurafa
Copy link
Copy Markdown
Contributor Author

Refreshing upstream checks after fork-side readiness cleanup.

@alanshurafa alanshurafa reopened this Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

recipe Contribution: step-by-step recipe schema Contribution: database extension

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant