[recipes] Gmail smart pull — sensitivity routing + contact entities #220
Open
alanshurafa wants to merge 10 commits into NateBJones-Projects:main from
Core Gmail puller script for a new recipe under recipes/gmail-smart-pull/.
The puller fetches messages from the Gmail API (read-only scope), strips
quoted replies and signatures, filters auto-generated noise, and emits
an OB1 ingest pack that downstream pipelines can feed into fingerprint
dedup + sensitivity-gate + upsert.
Also includes two small pure-JS libs the puller depends on:
- scripts/lib/sensitivity.mjs tags each message body against two
pattern sets (restricted: SSN, passport, bank, API keys, passwords,
credit cards; personal: email/phone/health/financial signals) so the
ingest side can route tiers to the right store. Tagging only — the
recipe does not enforce a routing policy itself.
- scripts/lib/entity-resolver.mjs does RFC 2822 header parsing
(From/To/Cc with quoted commas, display-name variants) into
{ name, email } pairs so structured correspondents can be carried in
the pack and upserted as first-class entities later.
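The header parsing described for the entity resolver can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/lib/entity-resolver.mjs: the function names `splitAddressList`, `parseAddress`, and `parseAddressHeader` are hypothetical, and the sketch handles only quoted display names and angle-bracket addresses, not comments or group syntax.

```javascript
// Hypothetical sketch: split an address-list header on commas that sit
// outside double quotes and outside <...>, then pull out name and email.
function splitAddressList(headerValue) {
  const parts = [];
  let current = "";
  let inQuotes = false;
  let inAngle = false;
  for (let i = 0; i < headerValue.length; i++) {
    const ch = headerValue[i];
    if (ch === "\\" && inQuotes && i + 1 < headerValue.length) {
      current += ch + headerValue[++i]; // keep escaped character inside quotes
      continue;
    }
    if (ch === '"') inQuotes = !inQuotes;
    else if (ch === "<" && !inQuotes) inAngle = true;
    else if (ch === ">" && !inQuotes) inAngle = false;
    if (ch === "," && !inQuotes && !inAngle) {
      parts.push(current); // a real separator between two mailboxes
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts.map((p) => p.trim()).filter(Boolean);
}

function parseAddress(raw) {
  const match = raw.match(/^(.*)<([^<>]+)>\s*$/);
  if (!match) return { name: "", email: raw.trim() }; // bare address, no display name
  let name = match[1].trim();
  if (name.length >= 2 && name.startsWith('"') && name.endsWith('"')) {
    name = name.slice(1, -1).replace(/\\(.)/g, "$1"); // unquote and unescape
  }
  return { name, email: match[2].trim() };
}

function parseAddressHeader(headerValue) {
  return splitAddressList(headerValue).map(parseAddress);
}
```

For example, `parseAddressHeader('"Doe, Jane" <jane@example.com>, bob@example.org')` keeps the comma inside the quoted display name and yields two pairs rather than three.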
OAuth credentials come from GMAIL_OAUTH_CLIENT_ID and
GMAIL_OAUTH_CLIENT_SECRET env vars. No real email addresses, client
IDs, or secrets are embedded anywhere. The only scope requested is
https://www.googleapis.com/auth/gmail.readonly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The LLM atomizer splits long email bodies into multiple atomic thoughts
before the puller emits them in the pack. Two behaviors carried over
from upstream experience running this at scale:
1. Prompts are piped to CLI providers via stdin, not via the -p
command-line argument. On Windows, with shell:true, cmd.exe mangled
multi-line prompts containing quotes and newlines, so the child
process received a truncated or empty string and the LLM replied
conversationally ("Looks like your message got cut off..."). In one
real batch, 190/190 atomize calls failed this way until the switch
to stdin fixed it. The same fix was applied to the codex provider.
2. A new 'codex' provider shells out to `codex exec` so users
orchestrating the recipe from a Codex session can atomize without
crossing the streams with a nested claude-cli (which would fail
nested-process detection). The `claude-cli` provider still works
from standalone terminals and refuses to run inside Claude Code.
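The stdin approach from item 1 might look roughly like the sketch below. The helper name `runWithStdin` is hypothetical and this is not the atomizer's actual code. It assumes the CLI reads its prompt from stdin, and it spawns without a shell by default, which may not suit every installation (for instance CLIs installed as .cmd shims on Windows); the `spawnOptions` parameter is there for that case. The point is only that the prompt never appears in the argument list, so no shell gets a chance to re-parse it.

```javascript
import { spawn } from "node:child_process";

// Hypothetical helper: run a CLI and deliver the prompt on stdin rather
// than as a command-line argument.
function runWithStdin(command, args, prompt, spawnOptions = {}) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, {
      ...spawnOptions,
      stdio: ["pipe", "pipe", "inherit"], // stdin and stdout piped, stderr passed through
    });
    let stdout = "";
    child.stdout.setEncoding("utf8");
    child.stdout.on("data", (chunk) => {
      stdout += chunk;
    });
    child.on("error", reject); // e.g. command not found
    child.stdin.on("error", reject); // e.g. child exited before reading
    child.on("close", (code) => {
      if (code === 0) resolve(stdout);
      else reject(new Error(`${command} exited with code ${code}`));
    });
    child.stdin.end(prompt); // write the whole prompt, then signal EOF
  });
}
```

A multi-line prompt with quotes and ampersands passes through byte-for-byte this way, which is the property the -p argument route lost under cmd.exe.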
OB1 users will typically use provider='anthropic' (direct Messages
API) or 'openrouter' since OB1 is cloud-first and those are already
provisioned. CLI providers are opt-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anonical_email

Two idempotent migrations that complete the pack's handoff to a downstream ingest pipeline:

1. merge_thought_metadata(p_id, p_patch): shallow-merge a JSONB patch into a thought's metadata without re-triggering the full upsert path (no embedding regen, no enrichment, no fingerprint recompute). Useful for per-row metadata backfills like flipping a relationship_tier on a batch of thoughts after regenerating the contacts cache.
2. entities.canonical_email: adds a nullable TEXT column + a partial unique index to public.entities so email correspondents parsed from the pack's structured From/To/Cc blocks can be upserted by normalized email address. Existing uniqueness on (entity_type, normalized_name) is preserved because two people can legitimately share a display name; email is the stable identifier.

Both use CREATE OR REPLACE / IF NOT EXISTS guards, so they are safe to re-run. Neither drops or renames existing columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
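To make "shallow-merge" concrete, here is a small JavaScript model of the semantics. It is an illustration only: the migration itself is SQL and is not shown here, and the function name `shallowMergeMetadata` and the sample keys other than relationship_tier are invented for the example. The behavior modeled is that top-level keys in the patch replace top-level keys in the existing metadata, while nested objects are replaced whole rather than merged.

```javascript
// Hypothetical model of a shallow metadata merge: patch keys win at the
// top level, untouched keys survive, nested objects are NOT deep-merged.
function shallowMergeMetadata(existing, patch) {
  return { ...existing, ...patch };
}

const before = {
  relationship_tier: "unknown",
  source: "gmail", // illustrative key
  labels: { starred: true, important: false }, // illustrative nested value
};

const after = shallowMergeMetadata(before, {
  relationship_tier: "contact",
  labels: { starred: false },
});
// after.relationship_tier is "contact", after.source is still "gmail",
// and after.labels is exactly { starred: false } because the nested
// object was replaced, not merged.
```

The practical consequence for backfills is that a patch touching one top-level key leaves unrelated keys alone, but a patch that includes a nested object must carry the full nested value.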
README documents the full setup path: Gmail OAuth Desktop-app client, env vars (no credentials on disk), first-run consent flow, dry-run, real run, and optional migration install.

Explicitly covers the four design choices most likely to surprise a new user:

- Sensitivity routing is tag-only: the recipe does not enforce a policy, the ingest pipeline does. Calls out that OB1 is cloud-first so "restricted stays local" needs explicit wiring (two-store setup or block-on-import).
- Engagement filter defaults to engaged-only with STARRED/IMPORTANT bypass, with clear instructions to disable or rebuild.
- Relationship tier is metadata (contact/known/unknown), not a gate. Three ways to produce the contacts cache documented.
- Atomization is opt-in per-message (>= 150 words default) with anthropic/openrouter/claude-cli/codex provider choice. Graceful fallback to whole-message capture on atomizer failure.

metadata.json follows the schema template at recipes/_template/ with required fields (name, description, category, author, version, requires.open_brain, tags, difficulty, estimated_time) and no extras.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
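The atomization rule in the last bullet (word-count threshold plus fallback to whole-message capture) could be sketched like this. The names `countWords` and `atomizeOrWhole` are hypothetical, the `atomize` callback stands in for whichever provider is configured, and treating an empty atomizer result as a failure is an assumption of this sketch.

```javascript
// Hypothetical sketch of the threshold + graceful-fallback behavior.
function countWords(text) {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

async function atomizeOrWhole(body, atomize, minWords = 150) {
  // Short messages are captured whole and never sent to the atomizer.
  if (countWords(body) < minWords) return [body];
  try {
    const atoms = await atomize(body);
    // Assumption: an empty or non-array result counts as atomizer failure.
    return Array.isArray(atoms) && atoms.length > 0 ? atoms : [body];
  } catch {
    return [body]; // provider error: fall back to whole-message capture
  }
}
```

With this shape, a provider outage degrades the pack to one thought per message rather than dropping messages.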
…l tracker

Codex review (P2 originally, elevated to P1 in triage): the credential tracker in the README asked users to paste their Supabase service-role key into a plaintext doc, but this recipe never touches that key. The puller only emits a pack file, and any downstream ingest pipeline that needs service_role should read it from env/secret manager, not from a user's text editor.

Removing the field avoids an entirely avoidable leak path for a highly privileged database secret, and adds a note so contributors who copy this tracker pattern into other recipes don't reintroduce the mistake.
… loopback bind + HTML escape + HTTP checks)

Codex identified four coupled OAuth weaknesses in scripts/pull-gmail.mjs:

- No OAuth state parameter: authUrl was built without a random state and the callback handler accepted the first ?code= it saw. Any local process or malicious localhost page could race the browser redirect and bind the script to an attacker-controlled Google account.
- server.listen() without an address defaulted to IPv6-any/0.0.0.0 on some platforms, briefly exposing the callback to the LAN.
- URL error parameter reflected into HTML without escaping: low-impact reflected XSS but trivial to fix.
- Token exchange and refresh called res.json() before checking res.ok, so proxy/5xx responses produced a useless JSON parse error instead of a useful OAuth failure with status + body.

Fix: generate 16 bytes of random hex as state, require the callback to echo it back (mismatch -> hard reject), bind createServer to 127.0.0.1 explicitly, HTML-escape the error param before reflecting, and gate both token POSTs on res.ok with a bounded body preview on failure.
…s to sensitivity classifier

Codex flagged (originally P2, elevated to P1 in triage because the sensitivity tier drives downstream routing): the restricted-tier pattern set missed several common secret formats, so emails containing them would be classified 'standard' and flow into the general thoughts pool instead of the restricted-only store.

Adds patterns for:

- openai_key: sk-proj-, sk-svcacct-, sk-admin- variants
- anthropic_key: sk-ant-api / sk-ant-admin tokens
- aws_access_key_id: AKIA/ASIA/AROA/AIDA prefixes
- aws_secret_access_key: proximity match near "aws secret" label
- gcp_api_key: AIza<35 chars> canonical form
- jwt_token: eyJ<header>.<payload>.<sig> three-segment form
- pem_private_key: BEGIN PRIVATE KEY blocks (RSA, EC, DSA, OPENSSH, PGP, ENCRYPTED)
- github_token: ghp_/gho_/ghu_/ghs_/ghr_ prefixes with 36+ char bodies
- slack_token: xox[aboprs]- tokens

The existing generic api_key_pattern is kept as a belt-and-suspenders fallback. All patterns still fail-open (standard tier) on no match. Classification never throws, so a missing pattern degrades gracefully rather than blocking the pull.
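A rough idea of how such a pattern table and a never-throwing classifier fit together is sketched below. The regexes are approximations written from the descriptions in the list above, not the expressions in scripts/lib/sensitivity.mjs, and only a subset is shown; the name `classifySensitivity` and the `{ tier, matches }` return shape are assumptions, and the personal tier is left out for brevity.

```javascript
// Hypothetical subset of restricted-tier patterns (approximate regexes).
const RESTRICTED_PATTERNS = [
  ["openai_key", /\bsk-(?:proj|svcacct|admin)-[A-Za-z0-9_-]{20,}/],
  ["aws_access_key_id", /\b(?:AKIA|ASIA|AROA|AIDA)[A-Z0-9]{16}\b/],
  ["gcp_api_key", /\bAIza[0-9A-Za-z_-]{35}/],
  ["jwt_token", /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/],
  ["pem_private_key", /-----BEGIN (?:[A-Z]+ )?PRIVATE KEY(?: BLOCK)?-----/],
  ["github_token", /\bgh[pousr]_[A-Za-z0-9]{36,}/],
  ["slack_token", /\bxox[aboprs]-[A-Za-z0-9-]{10,}/],
];

function classifySensitivity(body) {
  try {
    const matches = RESTRICTED_PATTERNS
      .filter(([, pattern]) => pattern.test(body))
      .map(([name]) => name);
    return { tier: matches.length > 0 ? "restricted" : "standard", matches };
  } catch {
    // Fail open: classification must never block the pull.
    return { tier: "standard", matches: [] };
  }
}
```

None of the regexes carry the `g` flag, so repeated `test` calls on the shared pattern objects do not carry `lastIndex` state between messages.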
…+ harden atomize prompt against injection

Codex flagged this as the highest-severity finding in the atomize lib (originally tagged P1-5 + P2): the 'codex' provider spawned `codex exec --dangerously-bypass-approvals-and-sandbox -` with an email body interpolated directly into the prompt. A malicious sender can embed 'IGNORE PREVIOUS INSTRUCTIONS' or tool-call primers, and because the child Codex agent ran with the sandbox disabled, prompt-injection escalated to arbitrary local command/file access.

Fixes:

1. Remove the --dangerously-bypass-approvals-and-sandbox flag from the default codex invocation. Users who actively need it for an atomization-only run can opt in via GMAIL_ATOMIZE_CODEX_BYPASS=1 env var, which documents the risk at the opt-in site.
2. Strengthen DEFAULT_ATOMIZE_PROMPT with an explicit SECURITY section that frames the INPUT THOUGHT as untrusted data, not instructions, and forbids emitting system/tool/assistant markers in the output.
3. Add a top-of-file comment describing the prompt-injection threat model so callers who override the prompt don't silently drop the hardening.

This does not eliminate prompt injection (no prompt-only defense can), but it removes the most dangerous escalation path and raises the bar from "read email -> run code" to "read email -> influence atoms".
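Fix 1 amounts to making the dangerous flag conditional on an explicit opt-in. A minimal sketch, assuming the argument order of the invocation quoted above and a strict "1" check on the env var (the helper name `buildCodexArgs` is hypothetical, and the real provider may build its arguments differently):

```javascript
// Hypothetical sketch: the sandbox bypass flag is only added when the
// operator has explicitly opted in via the environment.
function buildCodexArgs(env) {
  const args = ["exec"];
  if (env.GMAIL_ATOMIZE_CODEX_BYPASS === "1") {
    // Opt-in only: with this flag, an injected prompt can reach local
    // commands and files through the child agent.
    args.push("--dangerously-bypass-approvals-and-sandbox");
  }
  args.push("-"); // trailing "-" as in the quoted invocation (prompt via stdin)
  return args;
}
```

Called as `buildCodexArgs(process.env)`, the default result contains no bypass flag, so the safe path requires no configuration.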
The previous regex `\b(?:aws[_ -]?secret|aws[_ -]?access[_ -]?key)\b` could not match `aws_secret_access_key=...` (the most common env-var form) because `_` is a word char, so the `\b` between `t` and `_` in `aws_secret_access_key` didn't fire, and neither alternation caught the combined phrase.

Restructured the alternation so `aws_secret` can optionally absorb the trailing `_access_key`:

    aws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key)

Verified against 8 test cases covering kvp form, uppercase, hyphen separators, space separators, standalone `aws_secret`, standalone `aws_access_key`, a negative case, and the full env-var pair. All pass with no false positives.
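The difference between the two expressions can be reproduced directly. In the sketch below both regexes are copied from the text above; the case-insensitive flag is an assumption (the message mentions an uppercase test case but not the flags), and the sample inputs are illustrative rather than the 8 cases from the commit.

```javascript
// Old and new label patterns, copied from the commit message.
// Assumption: the `i` flag, since an uppercase case is said to pass.
const OLD_AWS_LABEL = /\b(?:aws[_ -]?secret|aws[_ -]?access[_ -]?key)\b/i;
const NEW_AWS_LABEL = /aws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key)/i;

const samples = [
  "aws_secret_access_key=EXAMPLEVALUE", // kvp form: old regex misses this
  "AWS_SECRET_ACCESS_KEY",
  "aws-secret-access-key",
  "aws secret access key",
  "aws_secret",
  "aws_access_key",
  "awesome secret plans", // negative case
];

const results = samples.map((text) => ({
  text,
  old: OLD_AWS_LABEL.test(text),
  new: NEW_AWS_LABEL.test(text),
}));
```

On the first sample the old pattern returns false, because after `aws_secret` the next character is `_` and no word boundary exists there, while the new pattern matches the whole `aws_secret_access_key` label.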
Refreshing upstream checks after fork-side readiness cleanup.
Summary

Adds `recipes/gmail-smart-pull/`, a Gmail puller that routes messages by sensitivity tier (restricted / personal / standard), maintains a contact entity cache with `relationship_tier`, and atomizes thread content via a four-provider LLM fan-out (Anthropic API, OpenRouter, Claude CLI, Codex CLI). Read scope only: `gmail.readonly`.

Ships two idempotent SQL migrations:

- `001_merge_thought_metadata.sql`: shallow-merge RPC so repeated pulls update thought metadata without clobbering unrelated keys.
- `002_entities_canonical_email.sql`: adds `canonical_email` column plus supporting indexes for contact lookups.

Coupling

- `lib/atomize-text.mjs` copy so either recipe can merge first.

Security review

Four P1 findings from pre-review were fixed; they absorbed five P2 findings along the way. Two commits in this branch exist specifically to harden security surface area.

OAuth hardening bundle:

- `state` parameter with constant-time-ish comparison on the callback.
- `127.0.0.1` (loopback only).
- `error` query param HTML-escaped before reflection.
- `tokenRes.ok` checks on all token exchange calls.
- `Retry-After`, with jitter, limited to 429/5xx.
- `token.json` write with `0o600` permissions.

Sensitivity classifier: 9 new restricted-tier patterns: OpenAI (`sk-proj`, `sk-svcacct`, `sk-admin`), Anthropic (`sk-ant`), AWS (`AKIA`/`ASIA` + secret-key proximity), GCP (`AIza`), JWTs, PEM private keys, GitHub tokens (`ghp`, `gho`, `ghu`, `ghs`, `ghr`), and Slack `xoxp`.

Prompt injection: the default atomize prompt now frames email content as untrusted input inside an explicit SECURITY section. The `--dangerously-bypass-approvals-and-sandbox` default on the Codex atomizer was dropped; bypass is now gated behind an explicit opt-in env var.

Credentials: the README credential tracker no longer references a `service_role` key.

Known follow-ups

- `CONTACTS_CACHE_PATH` and `GMAIL_TOKEN_PATH`.
- `htmlToText` parser.

Four P2 findings and three P3 findings are documented as deferred follow-ups in the branch review notes.
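The OAuth hardening bundle mentions `Retry-After`, jitter, and a 429/5xx limit without showing the retry policy, so the following is only a guess at its general shape. Everything here is an assumption: the helper name `retryDelayMs`, the 500 ms base, the 30 s cap, applying jitter only when no `Retry-After` value is present, and handling only the delta-seconds form of the header (not the HTTP-date form).

```javascript
// Hypothetical sketch: decide how long to wait before retrying a request.
// Returns null when the status is not retryable.
function retryDelayMs(status, retryAfterHeader, attempt, random = Math.random) {
  const retryable = status === 429 || (status >= 500 && status <= 599);
  if (!retryable) return null;

  // Honor a numeric Retry-After (seconds) when the server provides one.
  if (retryAfterHeader != null && retryAfterHeader !== "") {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) return Math.round(seconds * 1000);
  }

  // Otherwise: capped exponential backoff with jitter in [base/2, base].
  const base = Math.min(30000, 500 * 2 ** attempt);
  return Math.round(base / 2 + random() * (base / 2));
}
```

The `random` parameter exists so the jitter can be pinned in tests; callers would normally leave it at the default.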