[recipes] Gmail smart pull — sensitivity routing + contact entities by alanshurafa · Pull Request #220 · NateBJones-Projects/OB1

alanshurafa · 2026-04-21T21:18:02Z

Summary

Adds recipes/gmail-smart-pull/ — a Gmail puller that routes messages by sensitivity tier (restricted / personal / standard), maintains a contact entity cache with relationship_tier, and atomizes thread content via a four-provider LLM fan-out (Anthropic API, OpenRouter, Claude CLI, Codex CLI). Read scope only: gmail.readonly.

Ships two idempotent SQL migrations:

001_merge_thought_metadata.sql — shallow-merge RPC so repeated pulls update thought metadata without clobbering unrelated keys.
002_entities_canonical_email.sql — adds canonical_email column plus supporting indexes for contact lookups.

Coupling

Atomizer recipe: soft-coupled. This PR ships a self-contained lib/atomize-text.mjs copy so either recipe can merge first.
CRM person tiers recipe: soft-optional. The contacts cache reads tier data when that recipe is installed, otherwise falls back to defaults.

Security review

Four P1 findings from pre-review were fixed; they absorbed five P2 findings along the way. Two commits in this branch exist specifically to harden security surface area.

OAuth hardening bundle:

CSRF state parameter with constant-time-ish comparison on the callback.
Callback server bound to 127.0.0.1 (loopback only).
error query param HTML-escaped before reflection.
tokenRes.ok checks on all token exchange calls.
Bounded retry/backoff honoring Retry-After, with jitter, limited to 429/5xx.
Atomic token.json write with 0o600 permissions.

Sensitivity classifier: 9 new restricted-tier patterns — OpenAI (sk-proj, sk-svcacct, sk-admin), Anthropic (sk-ant), AWS (AKIA/ASIA + secret-key proximity), GCP (AIza), JWTs, PEM private keys, GitHub tokens (ghp, gho, ghu, ghs, ghr), and Slack xoxp.

Prompt injection: the default atomize prompt now frames email content as untrusted input inside an explicit SECURITY section. The --dangerously-bypass-approvals-and-sandbox default on the Codex atomizer was dropped — bypass is now gated behind an explicit opt-in env var.

Credentials: the README credential tracker no longer references a service_role key.

Known follow-ups

Tune sensitivity classifier false-positive rate on innocuous alphanumeric strings.
Stronger prompt-injection defense (requiring atoms to be verbatim substrings of the input) was scoped out — it costs ~3x throughput.
Path-traversal hardening on CONTACTS_CACHE_PATH and GMAIL_TOKEN_PATH.
Streaming pack output for very large pulls, RFC 2822 group-address parsing, IDN/EAI support, and a richer htmlToText parser.

Four P2 findings and three P3 findings are documented as deferred follow-ups in the branch review notes.

Core Gmail puller script for a new recipe under recipes/gmail-smart-pull/. The puller fetches messages from the Gmail API (read-only scope), strips quoted replies and signatures, filters auto-generated noise, and emits an OB1 ingest pack that downstream pipelines can feed into fingerprint dedup + sensitivity-gate + upsert. Also includes two small pure-JS libs the puller depends on: - scripts/lib/sensitivity.mjs tags each message body against two pattern sets (restricted: SSN, passport, bank, API keys, passwords, credit cards; personal: email/phone/health/financial signals) so the ingest side can route tiers to the right store. Tagging only — the recipe does not enforce a routing policy itself. - scripts/lib/entity-resolver.mjs does RFC 2822 header parsing (From/To/Cc with quoted commas, display-name variants) into { name, email } pairs so structured correspondents can be carried in the pack and upserted as first-class entities later. OAuth credentials come from GMAIL_OAUTH_CLIENT_ID and GMAIL_OAUTH_CLIENT_SECRET env vars. No real email addresses, client IDs, or secrets are embedded anywhere. The only scope requested is https://www.googleapis.com/auth/gmail.readonly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The LLM atomizer splits long email bodies into multiple atomic thoughts before the puller emits them in the pack. Two behaviors carried over from upstream experience running this at scale: 1. Prompts are piped to CLI providers via stdin, not via the -p command-line argument. On Windows shell:true cmd.exe mangled multi-line prompts containing quotes and newlines so the child process received a truncated/empty string and the LLM replied conversationally ("Looks like your message got cut off..."). 190/190 atomize calls in one real batch failed this way until stdin fixed it. Same fix applied to the codex provider. 2. A new 'codex' provider shells out to `codex exec` so users orchestrating the recipe from a Codex session can atomize without crossing the streams with a nested claude-cli (which would fail nested-process detection). The `claude-cli` provider still works from standalone terminals and refuses to run inside Claude Code. OB1 users will typically use provider='anthropic' (direct Messages API) or 'openrouter' since OB1 is cloud-first and those are already provisioned. CLI providers are opt-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…anonical_email Two idempotent migrations that complete the pack's handoff to a downstream ingest pipeline: 1. merge_thought_metadata(p_id, p_patch) — shallow-merge a JSONB patch into a thought's metadata without re-triggering the full upsert path (no embedding regen, no enrichment, no fingerprint recompute). Useful for per-row metadata backfills like flipping a relationship_tier on a batch of thoughts after regenerating the contacts cache. 2. entities.canonical_email — adds a nullable TEXT column + a partial unique index to public.entities so email correspondents parsed from the pack's structured From/To/Cc blocks can be upserted by normalized email address. Existing uniqueness on (entity_type, normalized_name) is preserved because two people can legitimately share a display name; email is the stable identifier. Both use CREATE OR REPLACE / IF NOT EXISTS guards — safe to re-run. Neither drops or renames existing columns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README documents the full setup path: Gmail OAuth Desktop-app client, env vars (no credentials on disk), first-run consent flow, dry-run, real run, and optional migration install. Explicitly covers the four design choices most likely to surprise a new user: - Sensitivity routing is tag-only — the recipe does not enforce a policy, the ingest pipeline does. Calls out that OB1 is cloud-first so "restricted stays local" needs explicit wiring (two-store setup or block-on-import). - Engagement filter defaults to engaged-only with STARRED/IMPORTANT bypass, with clear instructions to disable or rebuild. - Relationship tier is metadata (contact/known/unknown), not a gate. Three ways to produce the contacts cache documented. - Atomization is opt-in per-message (>= 150 words default) with anthropic/openrouter/claude-cli/codex provider choice. Graceful fallback to whole-message capture on atomizer failure. metadata.json follows the schema template at recipes/_template/ with required fields (name, description, category, author, version, requires.open_brain, tags, difficulty, estimated_time) and no extras. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l tracker Codex review (P2 originally, elevated to P1 in triage): the credential tracker in the README asked users to paste their Supabase service-role key into a plaintext doc, but this recipe never touches that key — the puller only emits a pack file, and any downstream ingest pipeline that needs service_role should read it from env/secret manager, not from a user's text editor. Removing the field avoids an entirely avoidable leak path for a highly privileged database secret, and adds a note so contributors who copy this tracker pattern into other recipes don't reintroduce the mistake.

… loopback bind + HTML escape + HTTP checks) Codex identified four coupled OAuth weaknesses in scripts/pull-gmail.mjs: - No OAuth state parameter: authUrl was built without a random state and the callback handler accepted the first ?code= it saw. Any local process or malicious localhost page could race the browser redirect and bind the script to an attacker-controlled Google account. - server.listen() without an address defaulted to IPv6-any/0.0.0.0 on some platforms, briefly exposing the callback to the LAN. - URL error parameter reflected into HTML without escaping — low-impact reflected XSS but trivial to fix. - Token exchange and refresh called res.json() before checking res.ok, so proxy/5xx responses produced a useless JSON parse error instead of a useful OAuth failure with status + body. Fix: generate 16 bytes of random hex as state, require the callback to echo it back (mismatch -> hard reject), bind createServer to 127.0.0.1 explicitly, HTML-escape the error param before reflecting, and gate both token POSTs on res.ok with a bounded body preview on failure.

…s to sensitivity classifier Codex flagged (originally P2, elevated to P1 in triage because the sensitivity tier drives downstream routing): the restricted-tier pattern set missed several common secret formats, so emails containing them would be classified 'standard' and flow into the general thoughts pool instead of the restricted-only store. Adds patterns for: - openai_key — sk-proj-, sk-svcacct-, sk-admin- variants - anthropic_key — sk-ant-api / sk-ant-admin tokens - aws_access_key_id — AKIA/ASIA/AROA/AIDA prefixes - aws_secret_access_key — proximity match near "aws secret" label - gcp_api_key — AIza<35 chars> canonical form - jwt_token — eyJ<header>.<payload>.<sig> three-segment form - pem_private_key — BEGIN PRIVATE KEY blocks (RSA, EC, DSA, OPENSSH, PGP, ENCRYPTED) - github_token — ghp/gho/ghu/ghs/ghr _ 36+ char bodies - slack_token — xox[aboprs]- tokens The existing generic api_key_pattern is kept as a belt-and-suspenders fallback. All patterns still fail-open (standard tier) on no match — classification never throws, so a missing pattern degrades gracefully rather than blocking the pull.

…+ harden atomize prompt against injection Codex flagged this as the highest-severity finding in the atomize lib (originally tagged P1-5 + P2): the 'codex' provider spawned `codex exec --dangerously-bypass-approvals-and-sandbox -` with an email body interpolated directly into the prompt. A malicious sender can embed 'IGNORE PREVIOUS INSTRUCTIONS' or tool-call primers, and because the child Codex agent ran with the sandbox disabled, prompt-injection escalated to arbitrary local command/file access. Fixes: 1. Remove the --dangerously-bypass-approvals-and-sandbox flag from the default codex invocation. Users who actively need it for an atomization-only run can opt in via GMAIL_ATOMIZE_CODEX_BYPASS=1 env var, which documents the risk at the opt-in site. 2. Strengthen DEFAULT_ATOMIZE_PROMPT with an explicit SECURITY section that frames the INPUT THOUGHT as untrusted data, not instructions, and forbids emitting system/tool/assistant markers in the output. 3. Add a top-of-file comment describing the prompt-injection threat model so callers who override the prompt don't silently drop the hardening. This does not eliminate prompt injection (no prompt-only defense can), but it removes the most dangerous escalation path and raises the bar from "read email -> run code" to "read email -> influence atoms".

The previous regex `\b(?:aws[_ -]?secret|aws[_ -]?access[_ -]?key)\b` could not match `aws_secret_access_key=...` — the most common env-var form — because `_` is a word char, so the `\b` between `t` and `_` in `aws_secret_access_key` didn't fire, and neither alternation caught the combined phrase. Restructured the alternation so `aws_secret` can optionally absorb the trailing `_access_key`: aws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key) Verified against 8 test cases covering kvp form, uppercase, hyphen separators, space separators, standalone `aws_secret`, standalone `aws_access_key`, a negative case, and the full env-var pair. All pass with no false positives.

alanshurafa · 2026-04-22T17:21:55Z

Refreshing upstream checks after fork-side readiness cleanup.

alanshurafa and others added 8 commits April 21, 2026 16:31

github-actions Bot added the recipe Contribution: step-by-step recipe label Apr 21, 2026

alanshurafa added 2 commits April 22, 2026 08:29

[docs] Fix pre-existing markdownlint errors across 8 files

2be2047

github-actions Bot added the schema Contribution: database extension label Apr 22, 2026

alanshurafa closed this Apr 22, 2026

alanshurafa reopened this Apr 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[recipes] Gmail smart pull — sensitivity routing + contact entities#220

[recipes] Gmail smart pull — sensitivity routing + contact entities#220
alanshurafa wants to merge 10 commits intoNateBJones-Projects:mainfrom
alanshurafa:contrib/alanshurafa/gmail-smart-pull

alanshurafa commented Apr 21, 2026

Uh oh!

alanshurafa commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alanshurafa commented Apr 21, 2026

Summary

Coupling

Security review

Known follow-ups

Uh oh!

alanshurafa commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant