
[recipes] Gmail smart pull — sensitivity routing + contact entities + atomize fixes #20

Open
alanshurafa wants to merge 10 commits into main from contrib/alanshurafa/gmail-smart-pull

Conversation

@alanshurafa
Owner

Summary

Adds recipes/gmail-smart-pull/ — a Gmail puller that emits an ingest-ready pack with local sensitivity routing, engagement filtering, contact-based relationship tiers, and LLM atomization of long messages. Ports the EXO-0129 through EXO-0137 work from Alan's ExoCortex second brain into a generalized OB1 recipe.

Complements the existing recipes/email-history-import/ (one-email-one-thought onboarding). This recipe is for users whose mailbox is big enough that they need careful filtering, routing, and splitting before ingest.

What's ported

  • Core puller (scripts/pull-gmail.mjs): read-only Gmail API fetch, quoted-reply + signature stripping, auto-generated-noise filter, engagement gate (threads where you've replied), RFC 2822 threading headers captured at source, structured correspondents parsed once at pull time.
  • Local sensitivity detection (scripts/lib/sensitivity.mjs): two pattern sets — restricted (SSN, passport, bank, API keys, passwords, credit cards) and personal (email/phone/health/financial) — tag-only, no enforcement.
  • Relationship tier: contact / known / unknown, metadata-only (does not gate routing), driven by a JSON contacts cache you can produce from a CRM schema, the Google Contacts API, or a vCard export.
  • LLM atomization (scripts/lib/atomize-text.mjs): long messages (>= 150 words default) split into atomic thoughts; providers anthropic / openrouter / claude-cli / codex; graceful fallback to whole-message on failure.
  • RFC 2822 header parser (scripts/lib/entity-resolver.mjs): pure parsing only — the pack carries { name, email } arrays so a downstream job can upsert correspondents as first-class entities.
  • Two idempotent migrations (sql/): merge_thought_metadata RPC for targeted metadata backfills, and entities.canonical_email column + indexes so the correspondents the pack carries can be upserted as entities. Both CREATE OR REPLACE / IF NOT EXISTS.
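
Taken together, a single pack record might look roughly like this. This is a sketch only — every field name below is inferred from the feature list above, not taken from the recipe's actual schema:

```javascript
// Hypothetical shape of one ingest-ready pack record. Field names are
// assumptions inferred from the PR description, not the real pack schema.
const exampleRecord = {
  messageId: "<CAF+abc123@mail.gmail.com>", // RFC 2822 threading captured at pull time
  inReplyTo: null,
  subject: "Quarterly planning notes",
  correspondents: {
    from: [{ name: "Ada Lovelace", email: "ada@example.com" }],
    to: [{ name: "Alan", email: "alan@example.com" }],
    cc: [],
  },
  relationshipTier: "contact", // contact | known | unknown — metadata only, not a gate
  sensitivity: "personal",     // restricted | personal | standard — tag only, no enforcement
  sensitiveReasons: ["phone"],
  atoms: ["First atomic thought.", "Second atomic thought."], // from LLM atomization
};

console.log(exampleRecord.correspondents.from[0].email); // "ada@example.com"
```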

Atomize fixes included

scripts/lib/atomize-text.mjs carries two fixes that surfaced during real-world use:

  1. Multi-line prompts now pipe via stdin instead of the -p command-line flag. On Windows with shell:true, cmd.exe mangled multi-line prompts containing quotes/newlines, so the child received a truncated string and the LLM replied conversationally ("Looks like your message got cut off..."). The same fix is applied to the codex provider.
  2. A new codex provider shells out to codex exec so users orchestrating from a Codex session can atomize without crossing streams with a nested claude-cli (which fails nested-process detection).
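
A minimal sketch of the stdin approach using Node's child_process — the child command here is a stand-in (node echoing stdin back), not the recipe's actual provider invocation:

```javascript
// Pipe the prompt via stdin so the shell never has to quote it; multi-line
// prompts with quotes/newlines reach the child intact. The child command
// below is a stand-in, not a real LLM provider.
import { spawnSync } from "node:child_process";

function runWithStdinPrompt(command, args, prompt) {
  const result = spawnSync(command, args, { input: prompt, encoding: "utf8" });
  if (result.status !== 0) {
    throw new Error(`provider exited ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}

// A multi-line prompt with quotes — exactly the shape the -p flag mangled.
const prompt = 'Split this email into atoms.\n"Quoted line"\nSecond line.';
const out = runWithStdinPrompt(
  process.execPath,
  ["-e", "process.stdin.pipe(process.stdout)"],
  prompt,
);
console.log(out === prompt); // true — the prompt survived the round trip
```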

[!NOTE] Coordination with candidate #1 (Atomizer)
These fixes may overlap with #1 Atomizer (recipes/atomizer/), which ports the separate re-atomization/audit batch scripts (re-atomize-gmail-thought.mjs, atomize-packs.mjs, etc.). Both branches reference the same underlying atomize-text.mjs library. If #1 ships first with its own copy of scripts/lib/atomize-text.mjs, this recipe should be updated to import from the shared location during final review. I've kept a local copy here so this recipe is self-contained and can merge in either order.

Dependencies on other candidates

What this recipe does NOT do

  • It does not ingest into Supabase itself — it produces a pack file. Your ingest pipeline consumes it. The separation keeps the recipe portable across Open Brain deployments.
  • It does not enforce a sensitivity routing policy. The pack records carry sensitivity + sensitiveReasons; your ingest pipeline decides what to do with restricted / personal atoms. The README spells this out explicitly because OB1 is cloud-first and "restricted stays local" isn't automatic.
  • It does not ship a contacts-export step. Different deployments have different authoritative sources; the README documents three options.

Pre-review status

This is the fork PR. Not pushing upstream yet — waiting on cross-AI review (gsd-code-reviewer + codex exec) per Alan's OB1 PR protocol before opening the upstream PR to NateBJones-Projects/OB1.

Test plan

  • node --check passes on all four JS files (pull-gmail.mjs, atomize-text.mjs, entity-resolver.mjs, sensitivity.mjs) — verified locally
  • metadata.json parses as valid JSON — verified locally
  • Whole-repo markdownlint-cli2 error count stays at 57 (baseline on origin/main; this branch adds 0 new errors) — verified locally
  • No OAuth credentials, real email addresses, or personal data embedded anywhere — verified by inspection
  • Gmail access scoped to gmail.readonly only — verified in SCOPES constant
  • Migrations are idempotent (CREATE OR REPLACE, IF NOT EXISTS) and contain no DROP TABLE, TRUNCATE, or unqualified DELETE FROM — verified by inspection
  • Smoke test the full OAuth flow + dry-run on a small STARRED window against a test Gmail account
  • Smoke test atomization with provider=anthropic on a >= 150-word synthetic email

🤖 Generated with Claude Code

alanshurafa and others added 4 commits April 21, 2026 16:31
Core Gmail puller script for a new recipe under recipes/gmail-smart-pull/.
The puller fetches messages from the Gmail API (read-only scope), strips
quoted replies and signatures, filters auto-generated noise, and emits
an OB1 ingest pack that downstream pipelines can feed into fingerprint
dedup + sensitivity-gate + upsert.

Also includes two small pure-JS libs the puller depends on:

- scripts/lib/sensitivity.mjs tags each message body against two
  pattern sets (restricted: SSN, passport, bank, API keys, passwords,
  credit cards; personal: email/phone/health/financial signals) so the
  ingest side can route tiers to the right store. Tagging only — the
  recipe does not enforce a routing policy itself.

- scripts/lib/entity-resolver.mjs does RFC 2822 header parsing
  (From/To/Cc with quoted commas, display-name variants) into
  { name, email } pairs so structured correspondents can be carried in
  the pack and upserted as first-class entities later.

OAuth credentials come from GMAIL_OAUTH_CLIENT_ID and
GMAIL_OAUTH_CLIENT_SECRET env vars. No real email addresses, client
IDs, or secrets are embedded anywhere. The only scope requested is
https://www.googleapis.com/auth/gmail.readonly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The LLM atomizer splits long email bodies into multiple atomic thoughts
before the puller emits them in the pack. Two behaviors carried over
from upstream experience running this at scale:

1. Prompts are piped to CLI providers via stdin, not via the -p
   command-line argument. On Windows shell:true cmd.exe mangled
   multi-line prompts containing quotes and newlines so the child
   process received a truncated/empty string and the LLM replied
   conversationally ("Looks like your message got cut off..."). 190/190
   atomize calls in one real batch failed this way until stdin fixed
   it. Same fix applied to the codex provider.

2. A new 'codex' provider shells out to `codex exec` so users
   orchestrating the recipe from a Codex session can atomize without
   crossing the streams with a nested claude-cli (which would fail
   nested-process detection). The `claude-cli` provider still works
   from standalone terminals and refuses to run inside Claude Code.

OB1 users will typically use provider='anthropic' (direct Messages
API) or 'openrouter' since OB1 is cloud-first and those are already
provisioned. CLI providers are opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anonical_email

Two idempotent migrations that complete the pack's handoff to a
downstream ingest pipeline:

1. merge_thought_metadata(p_id, p_patch) — shallow-merge a JSONB patch
   into a thought's metadata without re-triggering the full upsert
   path (no embedding regen, no enrichment, no fingerprint recompute).
   Useful for per-row metadata backfills like flipping a
   relationship_tier on a batch of thoughts after regenerating the
   contacts cache.

2. entities.canonical_email — adds a nullable TEXT column + a partial
   unique index to public.entities so email correspondents parsed from
   the pack's structured From/To/Cc blocks can be upserted by normalized
   email address. Existing uniqueness on (entity_type, normalized_name)
   is preserved because two people can legitimately share a display
   name; email is the stable identifier.

Both use CREATE OR REPLACE / IF NOT EXISTS guards — safe to re-run.
Neither drops or renames existing columns.
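
As a sketch of the downstream side (not part of this PR), a consumer might normalize and dedupe the pack's correspondents before upserting on canonical_email. The lowercase/trim rule below is an assumption — the migration only adds the column and index:

```javascript
// Hypothetical normalization a downstream ingest job might apply before
// upserting correspondents keyed on entities.canonical_email.
function canonicalEmail(raw) {
  return raw.trim().toLowerCase();
}

function dedupeCorrespondents(pairs) {
  const seen = new Map();
  for (const { name, email } of pairs) {
    const key = canonicalEmail(email);
    // Email is the stable identifier; the first display-name variant wins.
    if (!seen.has(key)) seen.set(key, { name, canonical_email: key });
  }
  return [...seen.values()];
}

const rows = dedupeCorrespondents([
  { name: "Ada Lovelace", email: "Ada@Example.com " },
  { name: "A. Lovelace", email: "ada@example.com" },
]);
console.log(rows.length); // 1 — same person, one upsert row
```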

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README documents the full setup path: Gmail OAuth Desktop-app client,
env vars (no credentials on disk), first-run consent flow, dry-run,
real run, and optional migration install. Explicitly covers the four
design choices most likely to surprise a new user:

- Sensitivity routing is tag-only — the recipe does not enforce a
  policy, the ingest pipeline does. Calls out that OB1 is cloud-first
  so "restricted stays local" needs explicit wiring (two-store setup
  or block-on-import).
- Engagement filter defaults to engaged-only with STARRED/IMPORTANT
  bypass, with clear instructions to disable or rebuild.
- Relationship tier is metadata (contact/known/unknown), not a gate.
  Three ways to produce the contacts cache documented.
- Atomization is opt-in per-message (>= 150 words default) with
  anthropic/openrouter/claude-cli/codex provider choice. Graceful
  fallback to whole-message capture on atomizer failure.

metadata.json follows the schema template at recipes/_template/ with
required fields (name, description, category, author, version,
requires.open_brain, tags, difficulty, estimated_time) and no extras.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5396bab6b7


```ts
if (!over) return;

const thoughtId = active.id as number;
const newStatus = over.id as string;
```

P1: Resolve drop target to a kanban status before updating

handleDragEnd treats over.id as the destination status, but with @dnd-kit/sortable the pointer is often over another card, so over.id is a numeric thought id in non-empty columns. In that case we send an invalid status to /api/kanban/update, the server rejects it, and cross-column drops into populated columns consistently revert. This breaks the primary drag-and-drop workflow whenever the target column already has cards.
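
One common resolution is to fall back to the card's containing column when over.id is another card. This is a sketch under the assumption that @dnd-kit/sortable attaches a sortable.containerId to the over entry's data — the status names and shapes here are hypothetical, not the app's code:

```javascript
// Resolve a drop target to a column status: if over.id is another card
// (a numeric thought id), use that card's container instead.
// Shapes loosely mirror a dnd-kit drag-end payload; statuses are made up.
const KANBAN_STATUSES = new Set(["todo", "doing", "done"]);

function resolveDropStatus(over) {
  if (over == null) return null;
  if (typeof over.id === "string" && KANBAN_STATUSES.has(over.id)) {
    return over.id; // dropped directly on an (empty) column
  }
  // Dropped on a card: read the card's containing column from its data.
  return over.data?.current?.sortable?.containerId ?? null;
}

console.log(resolveDropStatus({ id: "done" })); // "done"
console.log(resolveDropStatus({
  id: 42,
  data: { current: { sortable: { containerId: "doing" } } },
})); // "doing"
```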


```ts
const results: Thought[] = [];
for (const thoughtType of ["task", "idea"]) {
  const sp = new URLSearchParams();
  sp.set("per_page", "100");
```

P2: Fetch all kanban pages instead of truncating at 100

fetchKanbanThoughts hard-caps each thought type to a single per_page=100 request and never follows pagination, so boards with more than 100 task or idea records silently drop the remainder. That causes inaccurate workflow counts and makes a subset of items impossible to see or move from the kanban UI.
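
A minimal pagination loop sketch — the page/per_page semantics are assumptions drawn from the comment, not the app's actual API:

```javascript
// Follow pagination until a short page signals the end.
// fetchPage(page, perPage) is a stand-in for the real request function.
async function fetchAllPages(fetchPage, perPage = 100) {
  const all = [];
  for (let page = 1; ; page++) {
    const batch = await fetchPage(page, perPage);
    all.push(...batch);
    if (batch.length < perPage) break; // a short page means no more results
  }
  return all;
}

// Demo with an in-memory "API" of 250 items.
const items = Array.from({ length: 250 }, (_, i) => i);
const fakeFetch = async (page, perPage) =>
  items.slice((page - 1) * perPage, page * perPage);

fetchAllPages(fakeFetch).then((all) => console.log(all.length)); // 250
```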


```js
else if (a.startsWith("--after=")) args.after = a.slice("--after=".length);
else if (a.startsWith("--before=")) args.before = a.slice("--before=".length);
else if (a.startsWith("--labels=")) {
  args.labels = a.slice("--labels=".length).split(",").map((l) => l.trim().toUpperCase()).filter(Boolean);
```

P2: Preserve Gmail label ID casing when parsing --labels

Uppercasing --labels values mutates user label IDs (for example IDs returned by --list-labels), but those IDs are passed directly to Gmail labelIds filtering and must match exactly. This makes custom-label pulls fail or return no messages when users provide label IDs as documented, limiting the recipe to a subset of label workflows.
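
One possible fix is simply dropping the toUpperCase() call — a sketch, not the merged change; system labels like STARRED are already uppercase, so nothing documented regresses:

```javascript
// Preserve user-supplied label ID casing; Gmail labelIds filtering
// matches IDs exactly. (Sketch of a possible fix, not the merged change.)
function parseLabels(arg) {
  return arg
    .slice("--labels=".length)
    .split(",")
    .map((l) => l.trim())
    .filter(Boolean);
}

console.log(parseLabels("--labels=STARRED, Label_123abc")); // ["STARRED", "Label_123abc"]
```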


…l tracker

Codex review (P2 originally, elevated to P1 in triage): the credential
tracker in the README asked users to paste their Supabase service-role
key into a plaintext doc, but this recipe never touches that key — the
puller only emits a pack file, and any downstream ingest pipeline that
needs service_role should read it from env/secret manager, not from a
user's text editor.

Removing the field avoids an entirely avoidable leak path for a highly
privileged database secret, and adds a note so contributors who copy
this tracker pattern into other recipes don't reintroduce the mistake.
… loopback bind + HTML escape + HTTP checks)

Codex identified four coupled OAuth weaknesses in scripts/pull-gmail.mjs:

- No OAuth state parameter: authUrl was built without a random state and
  the callback handler accepted the first ?code= it saw. Any local
  process or malicious localhost page could race the browser redirect
  and bind the script to an attacker-controlled Google account.
- server.listen() without an address defaulted to IPv6-any/0.0.0.0 on
  some platforms, briefly exposing the callback to the LAN.
- URL error parameter reflected into HTML without escaping — low-impact
  reflected XSS but trivial to fix.
- Token exchange and refresh called res.json() before checking res.ok,
  so proxy/5xx responses produced a useless JSON parse error instead of
  a useful OAuth failure with status + body.

Fix: generate 16 bytes of random hex as state, require the callback to
echo it back (mismatch -> hard reject), bind createServer to 127.0.0.1
explicitly, HTML-escape the error param before reflecting, and gate
both token POSTs on res.ok with a bounded body preview on failure.
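
A sketch of the state handshake with illustrative names — the script's actual variable names and server wiring will differ:

```javascript
// Generate 16 random bytes of hex, send it as ?state= in authUrl, and
// hard-reject any callback that doesn't echo it back exactly.
import { randomBytes, timingSafeEqual } from "node:crypto";

const expectedState = randomBytes(16).toString("hex"); // goes into authUrl

function extractCode(callbackUrl) {
  const url = new URL(callbackUrl, "http://127.0.0.1"); // loopback-only server
  const got = Buffer.from(url.searchParams.get("state") ?? "");
  const want = Buffer.from(expectedState);
  // timingSafeEqual throws on length mismatch, so check length first.
  if (got.length !== want.length || !timingSafeEqual(got, want)) {
    throw new Error("OAuth state mismatch — rejecting callback");
  }
  return url.searchParams.get("code");
}

console.log(extractCode(`/cb?code=4/abc&state=${expectedState}`)); // "4/abc"
```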
…s to sensitivity classifier

Codex flagged (originally P2, elevated to P1 in triage because the
sensitivity tier drives downstream routing): the restricted-tier pattern
set missed several common secret formats, so emails containing them
would be classified 'standard' and flow into the general thoughts pool
instead of the restricted-only store.

Adds patterns for:

- openai_key       — sk-proj-, sk-svcacct-, sk-admin- variants
- anthropic_key    — sk-ant-api / sk-ant-admin tokens
- aws_access_key_id    — AKIA/ASIA/AROA/AIDA prefixes
- aws_secret_access_key — proximity match near "aws secret" label
- gcp_api_key      — AIza<35 chars> canonical form
- jwt_token        — eyJ<header>.<payload>.<sig> three-segment form
- pem_private_key  — BEGIN PRIVATE KEY blocks (RSA, EC, DSA, OPENSSH, PGP, ENCRYPTED)
- github_token     — ghp/gho/ghu/ghs/ghr prefixes, 36+ char bodies
- slack_token      — xox[aboprs]- tokens

The existing generic api_key_pattern is kept as a belt-and-suspenders
fallback. All patterns still fail-open (standard tier) on no match —
classification never throws, so a missing pattern degrades gracefully
rather than blocking the pull.
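
Illustrative approximations of a few of the added patterns, wired into a fail-open classifier — these are sketches, and the recipe's actual regexes may be stricter:

```javascript
// Approximate versions of some added restricted-tier patterns.
// Sketches only; the recipe's real pattern set differs in detail.
const patterns = {
  gcp_api_key: /\bAIza[0-9A-Za-z_-]{35}\b/,
  jwt_token: /\beyJ[0-9A-Za-z_-]+\.[0-9A-Za-z_-]+\.[0-9A-Za-z_-]+/,
  github_token: /\bgh[pousr]_[0-9A-Za-z]{36,}\b/,
  slack_token: /\bxox[aboprs]-[0-9A-Za-z-]{10,}/,
};

// Fail-open: no match means "standard"; classification never throws.
function classify(body) {
  const reasons = Object.entries(patterns)
    .filter(([, re]) => re.test(body))
    .map(([name]) => name);
  return { tier: reasons.length > 0 ? "restricted" : "standard", reasons };
}

console.log(classify("new token: ghp_" + "A".repeat(36)).tier); // "restricted"
console.log(classify("lunch at noon?").tier);                   // "standard"
```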
…+ harden atomize prompt against injection

Codex flagged this as the highest-severity finding in the atomize lib
(originally tagged P1-5 + P2): the 'codex' provider spawned
`codex exec --dangerously-bypass-approvals-and-sandbox -` with an email
body interpolated directly into the prompt. A malicious sender can embed
'IGNORE PREVIOUS INSTRUCTIONS' or tool-call primers, and because the
child Codex agent ran with the sandbox disabled, prompt-injection
escalated to arbitrary local command/file access.

Fixes:

1. Remove the --dangerously-bypass-approvals-and-sandbox flag from the
   default codex invocation. Users who actively need it for an
   atomization-only run can opt in via GMAIL_ATOMIZE_CODEX_BYPASS=1 env
   var, which documents the risk at the opt-in site.
2. Strengthen DEFAULT_ATOMIZE_PROMPT with an explicit SECURITY section
   that frames the INPUT THOUGHT as untrusted data, not instructions,
   and forbids emitting system/tool/assistant markers in the output.
3. Add a top-of-file comment describing the prompt-injection threat
   model so callers who override the prompt don't silently drop the
   hardening.

This does not eliminate prompt injection (no prompt-only defense can),
but it removes the most dangerous escalation path and raises the bar
from "read email -> run code" to "read email -> influence atoms".
The previous regex `\b(?:aws[_ -]?secret|aws[_ -]?access[_ -]?key)\b`
could not match `aws_secret_access_key=...` — the most common env-var
form — because `_` is a word char, so the `\b` between `t` and `_` in
`aws_secret_access_key` didn't fire, and neither alternation caught
the combined phrase.

Restructured the alternation so `aws_secret` can optionally absorb the
trailing `_access_key`:

  aws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key)

Verified against 8 test cases covering kvp form, uppercase, hyphen
separators, space separators, standalone `aws_secret`, standalone
`aws_access_key`, a negative case, and the full env-var pair. All
pass with no false positives.
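
The restructured alternation can be exercised directly — the regex below is the commit's; the harness around it is just illustrative:

```javascript
// The commit's restructured pattern, run against the cases described above.
const awsPattern = /\baws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key)/i;

const positives = [
  "aws_secret_access_key=wJalrXUtnFEMI", // the env-var form that used to miss
  "AWS_SECRET_ACCESS_KEY",               // uppercase
  "aws-secret-access-key",               // hyphen separators
  "aws secret access key",               // space separators
  "aws_secret",                          // standalone secret
  "aws_access_key",                      // standalone access key
];
const negatives = ["saw some keys", "awesome secret"];

console.log(positives.every((s) => awsPattern.test(s)));  // true
console.log(negatives.every((s) => !awsPattern.test(s))); // true
```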
@alanshurafa
Owner Author

Refreshing checks after markdownlint cleanup merged into fork main.

@alanshurafa alanshurafa reopened this Apr 22, 2026
@alanshurafa
Owner Author

Refreshing checks after fork markdownlint workflow fix.

@alanshurafa alanshurafa reopened this Apr 22, 2026