[recipes] Gmail smart pull — sensitivity routing + contact entities + atomize fixes#20
alanshurafa wants to merge 10 commits into main
Conversation
Core Gmail puller script for a new recipe under recipes/gmail-smart-pull/.
The puller fetches messages from the Gmail API (read-only scope), strips
quoted replies and signatures, filters auto-generated noise, and emits
an OB1 ingest pack that downstream pipelines can feed into fingerprint
dedup + sensitivity-gate + upsert.
Also includes two small pure-JS libs the puller depends on:
- scripts/lib/sensitivity.mjs tags each message body against two
pattern sets (restricted: SSN, passport, bank, API keys, passwords,
credit cards; personal: email/phone/health/financial signals) so the
ingest side can route tiers to the right store. Tagging only — the
recipe does not enforce a routing policy itself.
- scripts/lib/entity-resolver.mjs does RFC 2822 header parsing
(From/To/Cc with quoted commas, display-name variants) into
{ name, email } pairs so structured correspondents can be carried in
the pack and upserted as first-class entities later.
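The tag-only contract can be sketched as follows. This is a hypothetical simplification — the function name, pattern names, and regexes here are illustrative stand-ins, far cruder than the real pattern sets in sensitivity.mjs:

```javascript
// Illustrative sketch of tag-only sensitivity classification.
// Names and regexes are assumptions, not the recipe's actual exports.
const RESTRICTED = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  credit_card: /\b(?:\d[ -]?){13,16}\b/,
  api_key: /\b(?:api[_ -]?key|secret[_ -]?key)\b/i,
};
const PERSONAL = {
  phone: /\b\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b/,
  email: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/,
};

function classifySensitivity(body) {
  const reasons = [];
  for (const [name, re] of Object.entries(RESTRICTED)) {
    if (re.test(body)) reasons.push(name);
  }
  if (reasons.length) return { sensitivity: "restricted", sensitiveReasons: reasons };
  for (const [name, re] of Object.entries(PERSONAL)) {
    if (re.test(body)) reasons.push(name);
  }
  if (reasons.length) return { sensitivity: "personal", sensitiveReasons: reasons };
  // Fail-open default: no match means standard tier, never a throw.
  return { sensitivity: "standard", sensitiveReasons: [] };
}
```

The ingest side reads the tier and routes; nothing here blocks or drops a message.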
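The quoted-comma header parsing described above can be sketched like this — a hypothetical simplification that handles the display-name-with-comma case but far fewer edge cases than the real entity-resolver.mjs:

```javascript
// Sketch only: the real resolver covers more RFC 2822 forms than this.
function splitAddresses(header) {
  // Split on commas, but not commas inside double-quoted display names.
  const parts = [];
  let buf = "", inQuotes = false;
  for (const ch of header) {
    if (ch === '"') inQuotes = !inQuotes;
    if (ch === "," && !inQuotes) { parts.push(buf); buf = ""; }
    else buf += ch;
  }
  if (buf.trim()) parts.push(buf);
  return parts.map((p) => p.trim());
}

function parseAddress(part) {
  // Matches `"Display Name" <addr>` and `Display Name <addr>` forms.
  const m = part.match(/^"?([^"<]*)"?\s*<([^>]+)>$/);
  if (m) return { name: m[1].trim(), email: m[2].trim().toLowerCase() };
  return { name: "", email: part.trim().toLowerCase() }; // bare address form
}
```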
OAuth credentials come from GMAIL_OAUTH_CLIENT_ID and
GMAIL_OAUTH_CLIENT_SECRET env vars. No real email addresses, client
IDs, or secrets are embedded anywhere. The only scope requested is
https://www.googleapis.com/auth/gmail.readonly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The LLM atomizer splits long email bodies into multiple atomic thoughts
before the puller emits them in the pack. Two behaviors carried over
from upstream experience running this at scale:
1. Prompts are piped to CLI providers via stdin, not via the -p
command-line argument. On Windows shell:true cmd.exe mangled
multi-line prompts containing quotes and newlines so the child
process received a truncated/empty string and the LLM replied
conversationally ("Looks like your message got cut off..."). 190/190
atomize calls in one real batch failed this way until stdin fixed
it. Same fix applied to the codex provider.
2. A new 'codex' provider shells out to `codex exec` so users
orchestrating the recipe from a Codex session can atomize without
crossing the streams with a nested claude-cli (which would fail
nested-process detection). The `claude-cli` provider still works
from standalone terminals and refuses to run inside Claude Code.
OB1 users will typically use provider='anthropic' (direct Messages
API) or 'openrouter' since OB1 is cloud-first and those are already
provisioned. CLI providers are opt-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anonical_email

Two idempotent migrations that complete the pack's handoff to a downstream ingest pipeline:

1. `merge_thought_metadata(p_id, p_patch)` — shallow-merges a JSONB patch into a thought's metadata without re-triggering the full upsert path (no embedding regen, no enrichment, no fingerprint recompute). Useful for per-row metadata backfills like flipping a `relationship_tier` on a batch of thoughts after regenerating the contacts cache.
2. `entities.canonical_email` — adds a nullable TEXT column + a partial unique index to `public.entities` so email correspondents parsed from the pack's structured From/To/Cc blocks can be upserted by normalized email address. Existing uniqueness on `(entity_type, normalized_name)` is preserved because two people can legitimately share a display name; email is the stable identifier.

Both use `CREATE OR REPLACE` / `IF NOT EXISTS` guards — safe to re-run. Neither drops or renames existing columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
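A hedged sketch of what the two guarded migrations might look like — the signatures follow the description above, but the `thoughts` table name, return type, and index details are assumptions:

```sql
-- Sketch only: body details and the thoughts table name are assumed.
CREATE OR REPLACE FUNCTION merge_thought_metadata(p_id BIGINT, p_patch JSONB)
RETURNS void AS $$
  -- Shallow merge: keys in p_patch overwrite, all other keys are preserved.
  UPDATE thoughts
  SET metadata = COALESCE(metadata, '{}'::jsonb) || p_patch
  WHERE id = p_id;
$$ LANGUAGE sql;

ALTER TABLE public.entities
  ADD COLUMN IF NOT EXISTS canonical_email TEXT;

-- Partial unique index: rows with NULL emails stay non-unique.
CREATE UNIQUE INDEX IF NOT EXISTS entities_canonical_email_key
  ON public.entities (canonical_email)
  WHERE canonical_email IS NOT NULL;
```

Both statements are guarded, so re-running the file is a no-op.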
README documents the full setup path: Gmail OAuth Desktop-app client, env vars (no credentials on disk), first-run consent flow, dry-run, real run, and optional migration install. Explicitly covers the four design choices most likely to surprise a new user:

- Sensitivity routing is tag-only — the recipe does not enforce a policy; the ingest pipeline does. Calls out that OB1 is cloud-first, so "restricted stays local" needs explicit wiring (two-store setup or block-on-import).
- Engagement filter defaults to engaged-only with STARRED/IMPORTANT bypass, with clear instructions to disable or rebuild.
- Relationship tier is metadata (contact/known/unknown), not a gate. Three ways to produce the contacts cache are documented.
- Atomization is opt-in per message (>= 150 words default) with anthropic/openrouter/claude-cli/codex provider choice. Graceful fallback to whole-message capture on atomizer failure.

metadata.json follows the schema template at recipes/_template/ with the required fields (name, description, category, author, version, requires.open_brain, tags, difficulty, estimated_time) and no extras.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5396bab6b7
```ts
if (!over) return;

const thoughtId = active.id as number;
const newStatus = over.id as string;
```
Resolve drop target to a kanban status before updating
handleDragEnd treats over.id as the destination status, but with @dnd-kit/sortable the pointer is often over another card, so over.id is a numeric thought id in non-empty columns. In that case we send an invalid status to /api/kanban/update, the server rejects it, and cross-column drops into populated columns consistently revert. This breaks the primary drag-and-drop workflow whenever the target column already has cards.
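The suggested resolution can be sketched as a pure helper — the status values and the card map here are illustrative, not the app's real schema:

```javascript
// Sketch: resolve a @dnd-kit `over.id` to a column status, whether the
// pointer landed on a column droppable or on another card in that column.
const STATUSES = ["todo", "doing", "done"]; // illustrative status set

function resolveDropStatus(overId, thoughtsById) {
  if (STATUSES.includes(overId)) return overId; // dropped on a column itself
  const card = thoughtsById[overId];            // dropped on another card
  return card ? card.status : null;             // unknown target: do nothing
}
```

With this, a drop onto a card in a populated column resolves to that card's column status instead of sending a numeric thought id to the update API.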
```ts
const results: Thought[] = [];
for (const thoughtType of ["task", "idea"]) {
  const sp = new URLSearchParams();
  sp.set("per_page", "100");
```
Fetch all kanban pages instead of truncating at 100
fetchKanbanThoughts hard-caps each thought type to a single per_page=100 request and never follows pagination, so boards with more than 100 task or idea records silently drop the remainder. That causes inaccurate workflow counts and makes a subset of items impossible to see or move from the kanban UI.
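A pagination loop of the kind the review asks for might look like the sketch below; the `page`/`perPage` parameter names and the short-page termination rule are assumptions about the API, and the fetcher is injected so the sketch stays endpoint-agnostic:

```javascript
// Sketch: follow pages until a short page signals the end, instead of
// truncating at a single per_page=100 request.
async function fetchAllPages(fetchPage, perPage = 100) {
  const all = [];
  for (let page = 1; ; page++) {
    const batch = await fetchPage(page, perPage);
    all.push(...batch);
    if (batch.length < perPage) break; // short page: no more records
  }
  return all;
}
```

Calling this once per thought type replaces the hard-capped single request.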
```js
else if (a.startsWith("--after=")) args.after = a.slice("--after=".length);
else if (a.startsWith("--before=")) args.before = a.slice("--before=".length);
else if (a.startsWith("--labels=")) {
  args.labels = a.slice("--labels=".length).split(",").map((l) => l.trim().toUpperCase()).filter(Boolean);
```
Preserve Gmail label ID casing when parsing --labels
Uppercasing --labels values mutates user label IDs (for example IDs returned by --list-labels), but those IDs are passed directly to Gmail labelIds filtering and must match exactly. This makes custom-label pulls fail or return no messages when users provide label IDs as documented, limiting the recipe to a subset of label workflows.
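The fix amounts to dropping the `toUpperCase()` call; a hedged sketch of case-preserving parsing (trim and de-duplicate only, since Gmail label IDs must match exactly):

```javascript
// Sketch: preserve label ID casing; Gmail labelIds filtering is exact-match.
function parseLabelsFlag(value) {
  return [...new Set(value.split(",").map((l) => l.trim()).filter(Boolean))];
}
```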
…l tracker

Codex review (P2 originally, elevated to P1 in triage): the credential tracker in the README asked users to paste their Supabase service-role key into a plaintext doc, but this recipe never touches that key — the puller only emits a pack file, and any downstream ingest pipeline that needs `service_role` should read it from an env var or secret manager, not from a user's text editor. Removing the field closes an entirely avoidable leak path for a highly privileged database secret; a note is also added so contributors who copy this tracker pattern into other recipes don't reintroduce the mistake.
… loopback bind + HTML escape + HTTP checks)

Codex identified four coupled OAuth weaknesses in scripts/pull-gmail.mjs:

- No OAuth state parameter: authUrl was built without a random state, and the callback handler accepted the first ?code= it saw. Any local process or malicious localhost page could race the browser redirect and bind the script to an attacker-controlled Google account.
- server.listen() without an address defaulted to IPv6-any/0.0.0.0 on some platforms, briefly exposing the callback to the LAN.
- The URL error parameter was reflected into HTML without escaping — low-impact reflected XSS, but trivial to fix.
- Token exchange and refresh called res.json() before checking res.ok, so proxy/5xx responses produced a useless JSON parse error instead of a useful OAuth failure with status + body.

Fix: generate 16 bytes of random hex as state, require the callback to echo it back (mismatch -> hard reject), bind createServer to 127.0.0.1 explicitly, HTML-escape the error param before reflecting it, and gate both token POSTs on res.ok with a bounded body preview on failure.
…s to sensitivity classifier

Codex flagged (originally P2, elevated to P1 in triage because the sensitivity tier drives downstream routing): the restricted-tier pattern set missed several common secret formats, so emails containing them would be classified 'standard' and flow into the general thoughts pool instead of the restricted-only store. Adds patterns for:

- openai_key — sk-proj-, sk-svcacct-, sk-admin- variants
- anthropic_key — sk-ant-api / sk-ant-admin tokens
- aws_access_key_id — AKIA/ASIA/AROA/AIDA prefixes
- aws_secret_access_key — proximity match near the "aws secret" label
- gcp_api_key — AIza<35 chars> canonical form
- jwt_token — eyJ<header>.<payload>.<sig> three-segment form
- pem_private_key — BEGIN PRIVATE KEY blocks (RSA, EC, DSA, OPENSSH, PGP, ENCRYPTED)
- github_token — ghp/gho/ghu/ghs/ghr prefixes with 36+ char bodies
- slack_token — xox[aboprs]- tokens

The existing generic api_key_pattern is kept as a belt-and-suspenders fallback. All patterns still fail open (standard tier) on no match — classification never throws, so a missing pattern degrades gracefully rather than blocking the pull.
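A few of the added patterns, reconstructed from the list above; treat the exact character classes and lengths as assumptions rather than the recipe's literal regexes:

```javascript
// Reconstructed sketches of several restricted-tier patterns.
const RESTRICTED_PATTERNS = {
  aws_access_key_id: /\b(?:AKIA|ASIA|AROA|AIDA)[A-Z0-9]{16}\b/,
  gcp_api_key: /\bAIza[0-9A-Za-z_-]{35}\b/,
  github_token: /\bgh[pours]_[A-Za-z0-9]{36,}\b/,
  slack_token: /\bxox[aboprs]-[A-Za-z0-9-]+\b/,
  jwt_token: /\beyJ[\w-]+\.[\w-]+\.[\w-]+\b/,
};
```

Each is a plain `test()` against the message body; a hit pushes the pattern name into sensitiveReasons and promotes the tier to restricted.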
…+ harden atomize prompt against injection

Codex flagged this as the highest-severity finding in the atomize lib (originally tagged P1-5 + P2): the 'codex' provider spawned `codex exec --dangerously-bypass-approvals-and-sandbox -` with an email body interpolated directly into the prompt. A malicious sender can embed "IGNORE PREVIOUS INSTRUCTIONS" or tool-call primers, and because the child Codex agent ran with the sandbox disabled, prompt injection escalated to arbitrary local command/file access. Fixes:

1. Remove the --dangerously-bypass-approvals-and-sandbox flag from the default codex invocation. Users who actively need it for an atomization-only run can opt in via the GMAIL_ATOMIZE_CODEX_BYPASS=1 env var, which documents the risk at the opt-in site.
2. Strengthen DEFAULT_ATOMIZE_PROMPT with an explicit SECURITY section that frames the INPUT THOUGHT as untrusted data, not instructions, and forbids emitting system/tool/assistant markers in the output.
3. Add a top-of-file comment describing the prompt-injection threat model so callers who override the prompt don't silently drop the hardening.

This does not eliminate prompt injection (no prompt-only defense can), but it removes the most dangerous escalation path and raises the bar from "read email -> run code" to "read email -> influence atoms".
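The opt-in gating from fix 1 can be sketched as below; the flag and env-var names come from the commit message, while the surrounding argument layout is an assumption:

```javascript
// Sketch: the sandbox bypass is off by default and only enabled by an
// explicit env-var opt-in, which keeps the risk visible at the call site.
function buildCodexArgs(env) {
  const args = ["exec"];
  if (env.GMAIL_ATOMIZE_CODEX_BYPASS === "1") {
    // Dangerous: disables the child agent's sandbox. Opt-in only.
    args.push("--dangerously-bypass-approvals-and-sandbox");
  }
  args.push("-"); // read the prompt from stdin
  return args;
}
```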
The previous regex `\b(?:aws[_ -]?secret|aws[_ -]?access[_ -]?key)\b` could not match `aws_secret_access_key=...` — the most common env-var form — because `_` is a word char, so the `\b` between `t` and `_` in `aws_secret_access_key` didn't fire, and neither alternation caught the combined phrase. Restructured the alternation so `aws_secret` can optionally absorb the trailing `_access_key`:

`aws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key)`

Verified against 8 test cases covering the kvp form, uppercase, hyphen separators, space separators, standalone `aws_secret`, standalone `aws_access_key`, a negative case, and the full env-var pair. All pass with no false positives.
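The restructured alternation can be checked directly; this snippet just embeds the regex from the commit message so a few of the described cases can be exercised:

```javascript
// Restructured pattern: aws_secret optionally absorbs _access_key, so the
// combined env-var form matches even though `_` is a word character.
const awsSecret = /\baws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key)\b/i;
```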
Refreshing checks after markdownlint cleanup merged into fork main.
Refreshing checks after fork markdownlint workflow fix. |
Summary
Adds `recipes/gmail-smart-pull/` — a Gmail puller that emits an ingest-ready pack with local sensitivity routing, engagement filtering, contact-based relationship tiers, and LLM atomization of long messages. Ports the EXO-0129 through EXO-0137 work from Alan's ExoCortex second brain into a generalized OB1 recipe.

Complements the existing `recipes/email-history-import/` (one-email-one-thought onboarding). This recipe is for users whose mailbox is big enough that they need careful filtering, routing, and splitting before ingest.

What's ported
- `scripts/pull-gmail.mjs`: read-only Gmail API fetch, quoted-reply + signature stripping, auto-generated-noise filter, engagement gate (threads where you've replied), RFC 2822 threading headers captured at source, structured correspondents parsed once at pull time.
- `scripts/lib/sensitivity.mjs`: two pattern sets — restricted (SSN, passport, bank, API keys, passwords, credit cards) and personal (email/phone/health/financial) — tag-only, no enforcement.
- `scripts/lib/atomize-text.mjs`: long messages (>= 150 words default) split into atomic thoughts; providers `anthropic` / `openrouter` / `claude-cli` / `codex`; graceful fallback to whole-message on failure.
- `scripts/lib/entity-resolver.mjs`: pure parsing only — the pack carries `{ name, email }` arrays so a downstream job can upsert correspondents as first-class entities.
- `sql/`: `merge_thought_metadata` RPC for targeted metadata backfills, and `entities.canonical_email` column + indexes so the correspondents the pack carries can be upserted as entities. Both `CREATE OR REPLACE` / `IF NOT EXISTS`.

Atomize fixes included
`scripts/lib/atomize-text.mjs` carries two fixes that surfaced during real-world use:

- Prompts are piped to CLI providers via stdin, not the `-p` command-line flag. Under Windows `shell:true`, cmd.exe mangled multi-line prompts containing quotes/newlines so the child received a truncated string and the LLM replied conversationally ("Looks like your message got cut off..."). Same fix applied to the `codex` provider.
- A `codex` provider shells out to `codex exec` so users orchestrating from a Codex session can atomize without crossing streams with a nested `claude-cli` (which fails nested-process detection).

Dependencies on other candidates
`public.persons` / `person_tiers` tables become a natural source for the contacts-cache JSON file that powers `relationship_tier`. Not a hard dependency — the README documents two other ways to build the cache (Google Contacts API dump, vCard export).

What this recipe does NOT do
It enforces no routing policy — each emitted message carries `sensitivity` + `sensitiveReasons`; your ingest pipeline decides what to do with restricted / personal atoms. The README spells this out explicitly because OB1 is cloud-first and "restricted stays local" isn't automatic.

Pre-review status
This is the fork PR. Not pushing upstream yet — waiting on cross-AI review (`gsd-code-reviewer` + `codex exec`) per Alan's OB1 PR protocol before opening the upstream PR to `NateBJones-Projects/OB1`.

Test plan
- `node --check` passes on all four JS files (`pull-gmail.mjs`, `atomize-text.mjs`, `entity-resolver.mjs`, `sensitivity.mjs`) — verified locally
- `metadata.json` parses as valid JSON — verified locally
- `markdownlint-cli2` error count stays at 57 (baseline on `origin/main`; this branch adds 0 new errors) — verified locally
- Scope is `gmail.readonly` only — verified in the `SCOPES` constant
- Migrations are idempotent (`CREATE OR REPLACE`, `IF NOT EXISTS`) and contain no `DROP TABLE`, `TRUNCATE`, or unqualified `DELETE FROM` — verified by inspection
- Atomization exercised with `provider=anthropic` on a >= 150-word synthetic email

🤖 Generated with Claude Code