
[recipes] Gmail smart pull — sensitivity routing + contact entities + atomize fixes #20

Open
alanshurafa wants to merge 10 commits into main from contrib/alanshurafa/gmail-smart-pull

Conversation

@alanshurafa
Owner

Summary

Adds recipes/gmail-smart-pull/ — a Gmail puller that emits an ingest-ready pack with local sensitivity routing, engagement filtering, contact-based relationship tiers, and LLM atomization of long messages. Ports the EXO-0129 through EXO-0137 work from Alan's ExoCortex second brain into a generalized OB1 recipe.

Complements the existing recipes/email-history-import/ (one-email-one-thought onboarding). This recipe is for users whose mailbox is big enough that they need careful filtering, routing, and splitting before ingest.

What's ported

  • Core puller (scripts/pull-gmail.mjs): read-only Gmail API fetch, quoted-reply + signature stripping, auto-generated-noise filter, engagement gate (threads where you've replied), RFC 2822 threading headers captured at source, structured correspondents parsed once at pull time.
  • Local sensitivity detection (scripts/lib/sensitivity.mjs): two pattern sets — restricted (SSN, passport, bank, API keys, passwords, credit cards) and personal (email/phone/health/financial) — tag-only, no enforcement.
  • Relationship tier: contact / known / unknown, metadata-only (does not gate routing), driven by a JSON contacts cache you can produce from a CRM schema, the Google Contacts API, or a vCard export.
  • LLM atomization (scripts/lib/atomize-text.mjs): long messages (>= 150 words default) split into atomic thoughts; providers anthropic / openrouter / claude-cli / codex; graceful fallback to whole-message on failure.
  • RFC 2822 header parser (scripts/lib/entity-resolver.mjs): pure parsing only — the pack carries { name, email } arrays so a downstream job can upsert correspondents as first-class entities.
  • Two idempotent migrations (sql/): merge_thought_metadata RPC for targeted metadata backfills, and entities.canonical_email column + indexes so the correspondents the pack carries can be upserted as entities. Both CREATE OR REPLACE / IF NOT EXISTS.
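
Taken together, a single pack record might look roughly like this. This is a sketch only — every field name below is inferred from the feature list above, not taken from the recipe's actual schema:

```javascript
// Hypothetical shape of one ingest-ready pack record. Field names are
// assumptions inferred from the PR description, not the real pack schema.
const exampleRecord = {
  messageId: "<CAF+abc123@mail.gmail.com>", // RFC 2822 threading captured at pull time
  inReplyTo: null,
  subject: "Quarterly planning notes",
  correspondents: {
    from: [{ name: "Ada Lovelace", email: "ada@example.com" }],
    to: [{ name: "Alan", email: "alan@example.com" }],
    cc: [],
  },
  relationshipTier: "contact", // contact | known | unknown — metadata only, not a gate
  sensitivity: "personal",     // restricted | personal | standard — tag only, no enforcement
  sensitiveReasons: ["phone"],
  atoms: ["First atomic thought.", "Second atomic thought."], // from LLM atomization
};

console.log(exampleRecord.correspondents.from[0].email); // "ada@example.com"
```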

Atomize fixes included

scripts/lib/atomize-text.mjs carries two fixes that surfaced during real-world use:

  1. Multi-line prompts now pipe via stdin instead of the -p command-line flag. On Windows with shell:true, cmd.exe mangled multi-line prompts containing quotes/newlines, so the child received a truncated string and the LLM replied conversationally ("Looks like your message got cut off..."). The same fix is applied to the codex provider.
  2. A new codex provider shells out to codex exec so users orchestrating from a Codex session can atomize without crossing streams with a nested claude-cli (which fails nested-process detection).
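
A minimal sketch of the stdin approach using Node's child_process — the child command here is a stand-in (node echoing stdin back), not the recipe's actual provider invocation:

```javascript
// Pipe the prompt via stdin so the shell never has to quote it; multi-line
// prompts with quotes/newlines reach the child intact. The child command
// below is a stand-in, not a real LLM provider.
import { spawnSync } from "node:child_process";

function runWithStdinPrompt(command, args, prompt) {
  const result = spawnSync(command, args, { input: prompt, encoding: "utf8" });
  if (result.status !== 0) {
    throw new Error(`provider exited ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}

// A multi-line prompt with quotes — exactly the shape the -p flag mangled.
const prompt = 'Split this email into atoms.\n"Quoted line"\nSecond line.';
const out = runWithStdinPrompt(
  process.execPath,
  ["-e", "process.stdin.pipe(process.stdout)"],
  prompt,
);
console.log(out === prompt); // true — the prompt survived the round trip
```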

[!NOTE] Coordination with candidate #1 (Atomizer)
These fixes may overlap with #1 Atomizer (recipes/atomizer/), which ports the separate re-atomization/audit batch scripts (re-atomize-gmail-thought.mjs, atomize-packs.mjs, etc.). Both branches reference the same underlying atomize-text.mjs library. If #1 ships first with its own copy of scripts/lib/atomize-text.mjs, this recipe should be updated to import from the shared location during final review. I've kept a local copy here so this recipe is self-contained and can merge in either order.

Dependencies on other candidates

What this recipe does NOT do

  • It does not ingest into Supabase itself — it produces a pack file. Your ingest pipeline consumes it. The separation keeps the recipe portable across Open Brain deployments.
  • It does not enforce a sensitivity routing policy. The pack records carry sensitivity + sensitiveReasons; your ingest pipeline decides what to do with restricted / personal atoms. The README spells this out explicitly because OB1 is cloud-first and "restricted stays local" isn't automatic.
  • It does not ship a contacts-export step. Different deployments have different authoritative sources; the README documents three options.

Pre-review status

This is the fork PR. Not pushing upstream yet — waiting on cross-AI review (gsd-code-reviewer + codex exec) per Alan's OB1 PR protocol before opening the upstream PR to NateBJones-Projects/OB1.

Test plan

  • node --check passes on all four JS files (pull-gmail.mjs, atomize-text.mjs, entity-resolver.mjs, sensitivity.mjs) — verified locally
  • metadata.json parses as valid JSON — verified locally
  • Whole-repo markdownlint-cli2 error count stays at 57 (baseline on origin/main; this branch adds 0 new errors) — verified locally
  • No OAuth credentials, real email addresses, or personal data embedded anywhere — verified by inspection
  • Gmail access scoped to gmail.readonly only — verified in SCOPES constant
  • Migrations are idempotent (CREATE OR REPLACE, IF NOT EXISTS) and contain no DROP TABLE, TRUNCATE, or unqualified DELETE FROM — verified by inspection
  • Smoke test the full OAuth flow + dry-run on a small STARRED window against a test Gmail account
  • Smoke test atomization with provider=anthropic on a >= 150-word synthetic email

🤖 Generated with Claude Code

alanshurafa and others added 4 commits April 21, 2026 16:31
Core Gmail puller script for a new recipe under recipes/gmail-smart-pull/.
The puller fetches messages from the Gmail API (read-only scope), strips
quoted replies and signatures, filters auto-generated noise, and emits
an OB1 ingest pack that downstream pipelines can feed into fingerprint
dedup + sensitivity-gate + upsert.

Also includes two small pure-JS libs the puller depends on:

- scripts/lib/sensitivity.mjs tags each message body against two
  pattern sets (restricted: SSN, passport, bank, API keys, passwords,
  credit cards; personal: email/phone/health/financial signals) so the
  ingest side can route tiers to the right store. Tagging only — the
  recipe does not enforce a routing policy itself.

- scripts/lib/entity-resolver.mjs does RFC 2822 header parsing
  (From/To/Cc with quoted commas, display-name variants) into
  { name, email } pairs so structured correspondents can be carried in
  the pack and upserted as first-class entities later.

OAuth credentials come from GMAIL_OAUTH_CLIENT_ID and
GMAIL_OAUTH_CLIENT_SECRET env vars. No real email addresses, client
IDs, or secrets are embedded anywhere. The only scope requested is
https://www.googleapis.com/auth/gmail.readonly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The LLM atomizer splits long email bodies into multiple atomic thoughts
before the puller emits them in the pack. Two behaviors carried over
from upstream experience running this at scale:

1. Prompts are piped to CLI providers via stdin, not via the -p
   command-line argument. On Windows shell:true cmd.exe mangled
   multi-line prompts containing quotes and newlines so the child
   process received a truncated/empty string and the LLM replied
   conversationally ("Looks like your message got cut off..."). 190/190
   atomize calls in one real batch failed this way until stdin fixed
   it. Same fix applied to the codex provider.

2. A new 'codex' provider shells out to `codex exec` so users
   orchestrating the recipe from a Codex session can atomize without
   crossing the streams with a nested claude-cli (which would fail
   nested-process detection). The `claude-cli` provider still works
   from standalone terminals and refuses to run inside Claude Code.

OB1 users will typically use provider='anthropic' (direct Messages
API) or 'openrouter' since OB1 is cloud-first and those are already
provisioned. CLI providers are opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anonical_email

Two idempotent migrations that complete the pack's handoff to a
downstream ingest pipeline:

1. merge_thought_metadata(p_id, p_patch) — shallow-merge a JSONB patch
   into a thought's metadata without re-triggering the full upsert
   path (no embedding regen, no enrichment, no fingerprint recompute).
   Useful for per-row metadata backfills like flipping a
   relationship_tier on a batch of thoughts after regenerating the
   contacts cache.

2. entities.canonical_email — adds a nullable TEXT column + a partial
   unique index to public.entities so email correspondents parsed from
   the pack's structured From/To/Cc blocks can be upserted by normalized
   email address. Existing uniqueness on (entity_type, normalized_name)
   is preserved because two people can legitimately share a display
   name; email is the stable identifier.

Both use CREATE OR REPLACE / IF NOT EXISTS guards — safe to re-run.
Neither drops or renames existing columns.
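
As a sketch of the downstream side (not part of this PR), a consumer might normalize and dedupe the pack's correspondents before upserting on canonical_email. The lowercase/trim rule below is an assumption — the migration only adds the column and index:

```javascript
// Hypothetical normalization a downstream ingest job might apply before
// upserting correspondents keyed on entities.canonical_email.
function canonicalEmail(raw) {
  return raw.trim().toLowerCase();
}

function dedupeCorrespondents(pairs) {
  const seen = new Map();
  for (const { name, email } of pairs) {
    const key = canonicalEmail(email);
    // Email is the stable identifier; the first display-name variant wins.
    if (!seen.has(key)) seen.set(key, { name, canonical_email: key });
  }
  return [...seen.values()];
}

const rows = dedupeCorrespondents([
  { name: "Ada Lovelace", email: "Ada@Example.com " },
  { name: "A. Lovelace", email: "ada@example.com" },
]);
console.log(rows.length); // 1 — same person, one upsert row
```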

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README documents the full setup path: Gmail OAuth Desktop-app client,
env vars (no credentials on disk), first-run consent flow, dry-run,
real run, and optional migration install. Explicitly covers the four
design choices most likely to surprise a new user:

- Sensitivity routing is tag-only — the recipe does not enforce a
  policy, the ingest pipeline does. Calls out that OB1 is cloud-first
  so "restricted stays local" needs explicit wiring (two-store setup
  or block-on-import).
- Engagement filter defaults to engaged-only with STARRED/IMPORTANT
  bypass, with clear instructions to disable or rebuild.
- Relationship tier is metadata (contact/known/unknown), not a gate.
  Three ways to produce the contacts cache documented.
- Atomization is opt-in per-message (>= 150 words default) with
  anthropic/openrouter/claude-cli/codex provider choice. Graceful
  fallback to whole-message capture on atomizer failure.

metadata.json follows the schema template at recipes/_template/ with
required fields (name, description, category, author, version,
requires.open_brain, tags, difficulty, estimated_time) and no extras.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5396bab6b7


```ts
if (!over) return;

const thoughtId = active.id as number;
const newStatus = over.id as string;
```

P1: Resolve drop target to a kanban status before updating

handleDragEnd treats over.id as the destination status, but with @dnd-kit/sortable the pointer is often over another card, so over.id is a numeric thought id in non-empty columns. In that case we send an invalid status to /api/kanban/update, the server rejects it, and cross-column drops into populated columns consistently revert. This breaks the primary drag-and-drop workflow whenever the target column already has cards.
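
One common resolution is to fall back to the card's containing column when over.id is another card. This is a sketch under the assumption that @dnd-kit/sortable attaches a sortable.containerId to the over entry's data — the status names and shapes here are hypothetical, not the app's code:

```javascript
// Resolve a drop target to a column status: if over.id is another card
// (a numeric thought id), use that card's container instead.
// Shapes loosely mirror a dnd-kit drag-end payload; statuses are made up.
const KANBAN_STATUSES = new Set(["todo", "doing", "done"]);

function resolveDropStatus(over) {
  if (over == null) return null;
  if (typeof over.id === "string" && KANBAN_STATUSES.has(over.id)) {
    return over.id; // dropped directly on an (empty) column
  }
  // Dropped on a card: read the card's containing column from its data.
  return over.data?.current?.sortable?.containerId ?? null;
}

console.log(resolveDropStatus({ id: "done" })); // "done"
console.log(resolveDropStatus({
  id: 42,
  data: { current: { sortable: { containerId: "doing" } } },
})); // "doing"
```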


```ts
const results: Thought[] = [];
for (const thoughtType of ["task", "idea"]) {
  const sp = new URLSearchParams();
  sp.set("per_page", "100");
```

P2: Fetch all kanban pages instead of truncating at 100

fetchKanbanThoughts hard-caps each thought type to a single per_page=100 request and never follows pagination, so boards with more than 100 task or idea records silently drop the remainder. That causes inaccurate workflow counts and makes a subset of items impossible to see or move from the kanban UI.
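
A minimal pagination loop sketch — the page/per_page semantics are assumptions drawn from the comment, not the app's actual API:

```javascript
// Follow pagination until a short page signals the end.
// fetchPage(page, perPage) is a stand-in for the real request function.
async function fetchAllPages(fetchPage, perPage = 100) {
  const all = [];
  for (let page = 1; ; page++) {
    const batch = await fetchPage(page, perPage);
    all.push(...batch);
    if (batch.length < perPage) break; // a short page means no more results
  }
  return all;
}

// Demo with an in-memory "API" of 250 items.
const items = Array.from({ length: 250 }, (_, i) => i);
const fakeFetch = async (page, perPage) =>
  items.slice((page - 1) * perPage, page * perPage);

fetchAllPages(fakeFetch).then((all) => console.log(all.length)); // 250
```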


```js
else if (a.startsWith("--after=")) args.after = a.slice("--after=".length);
else if (a.startsWith("--before=")) args.before = a.slice("--before=".length);
else if (a.startsWith("--labels=")) {
  args.labels = a.slice("--labels=".length).split(",").map((l) => l.trim().toUpperCase()).filter(Boolean);
```

P2: Preserve Gmail label ID casing when parsing --labels

Uppercasing --labels values mutates user label IDs (for example IDs returned by --list-labels), but those IDs are passed directly to Gmail labelIds filtering and must match exactly. This makes custom-label pulls fail or return no messages when users provide label IDs as documented, limiting the recipe to a subset of label workflows.
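
One possible fix is simply dropping the toUpperCase() call — a sketch, not the merged change; system labels like STARRED are already uppercase, so nothing documented regresses:

```javascript
// Preserve user-supplied label ID casing; Gmail labelIds filtering
// matches IDs exactly. (Sketch of a possible fix, not the merged change.)
function parseLabels(arg) {
  return arg
    .slice("--labels=".length)
    .split(",")
    .map((l) => l.trim())
    .filter(Boolean);
}

console.log(parseLabels("--labels=STARRED, Label_123abc")); // ["STARRED", "Label_123abc"]
```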


…l tracker

Codex review (P2 originally, elevated to P1 in triage): the credential
tracker in the README asked users to paste their Supabase service-role
key into a plaintext doc, but this recipe never touches that key — the
puller only emits a pack file, and any downstream ingest pipeline that
needs service_role should read it from env/secret manager, not from a
user's text editor.

Removing the field avoids an entirely avoidable leak path for a highly
privileged database secret, and adds a note so contributors who copy
this tracker pattern into other recipes don't reintroduce the mistake.
… loopback bind + HTML escape + HTTP checks)

Codex identified four coupled OAuth weaknesses in scripts/pull-gmail.mjs:

- No OAuth state parameter: authUrl was built without a random state and
  the callback handler accepted the first ?code= it saw. Any local
  process or malicious localhost page could race the browser redirect
  and bind the script to an attacker-controlled Google account.
- server.listen() without an address defaulted to IPv6-any/0.0.0.0 on
  some platforms, briefly exposing the callback to the LAN.
- URL error parameter reflected into HTML without escaping — low-impact
  reflected XSS but trivial to fix.
- Token exchange and refresh called res.json() before checking res.ok,
  so proxy/5xx responses produced a useless JSON parse error instead of
  a useful OAuth failure with status + body.

Fix: generate 16 bytes of random hex as state, require the callback to
echo it back (mismatch -> hard reject), bind createServer to 127.0.0.1
explicitly, HTML-escape the error param before reflecting, and gate
both token POSTs on res.ok with a bounded body preview on failure.
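
A sketch of the state handshake with illustrative names — the script's actual variable names and server wiring will differ:

```javascript
// Generate 16 random bytes of hex, send it as ?state= in authUrl, and
// hard-reject any callback that doesn't echo it back exactly.
import { randomBytes, timingSafeEqual } from "node:crypto";

const expectedState = randomBytes(16).toString("hex"); // goes into authUrl

function extractCode(callbackUrl) {
  const url = new URL(callbackUrl, "http://127.0.0.1"); // loopback-only server
  const got = Buffer.from(url.searchParams.get("state") ?? "");
  const want = Buffer.from(expectedState);
  // timingSafeEqual throws on length mismatch, so check length first.
  if (got.length !== want.length || !timingSafeEqual(got, want)) {
    throw new Error("OAuth state mismatch — rejecting callback");
  }
  return url.searchParams.get("code");
}

console.log(extractCode(`/cb?code=4/abc&state=${expectedState}`)); // "4/abc"
```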
…s to sensitivity classifier

Codex flagged (originally P2, elevated to P1 in triage because the
sensitivity tier drives downstream routing): the restricted-tier pattern
set missed several common secret formats, so emails containing them
would be classified 'standard' and flow into the general thoughts pool
instead of the restricted-only store.

Adds patterns for:

- openai_key       — sk-proj-, sk-svcacct-, sk-admin- variants
- anthropic_key    — sk-ant-api / sk-ant-admin tokens
- aws_access_key_id    — AKIA/ASIA/AROA/AIDA prefixes
- aws_secret_access_key — proximity match near "aws secret" label
- gcp_api_key      — AIza<35 chars> canonical form
- jwt_token        — eyJ<header>.<payload>.<sig> three-segment form
- pem_private_key  — BEGIN PRIVATE KEY blocks (RSA, EC, DSA, OPENSSH, PGP, ENCRYPTED)
- github_token     — ghp/gho/ghu/ghs/ghr prefixes, 36+ char bodies
- slack_token      — xox[aboprs]- tokens

The existing generic api_key_pattern is kept as a belt-and-suspenders
fallback. All patterns still fail-open (standard tier) on no match —
classification never throws, so a missing pattern degrades gracefully
rather than blocking the pull.
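
Illustrative approximations of a few of the added patterns, wired into a fail-open classifier — these are sketches, and the recipe's actual regexes may be stricter:

```javascript
// Approximate versions of some added restricted-tier patterns.
// Sketches only; the recipe's real pattern set differs in detail.
const patterns = {
  gcp_api_key: /\bAIza[0-9A-Za-z_-]{35}\b/,
  jwt_token: /\beyJ[0-9A-Za-z_-]+\.[0-9A-Za-z_-]+\.[0-9A-Za-z_-]+/,
  github_token: /\bgh[pousr]_[0-9A-Za-z]{36,}\b/,
  slack_token: /\bxox[aboprs]-[0-9A-Za-z-]{10,}/,
};

// Fail-open: no match means "standard"; classification never throws.
function classify(body) {
  const reasons = Object.entries(patterns)
    .filter(([, re]) => re.test(body))
    .map(([name]) => name);
  return { tier: reasons.length > 0 ? "restricted" : "standard", reasons };
}

console.log(classify("new token: ghp_" + "A".repeat(36)).tier); // "restricted"
console.log(classify("lunch at noon?").tier);                   // "standard"
```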
…+ harden atomize prompt against injection

Codex flagged this as the highest-severity finding in the atomize lib
(originally tagged P1-5 + P2): the 'codex' provider spawned
`codex exec --dangerously-bypass-approvals-and-sandbox -` with an email
body interpolated directly into the prompt. A malicious sender can embed
'IGNORE PREVIOUS INSTRUCTIONS' or tool-call primers, and because the
child Codex agent ran with the sandbox disabled, prompt-injection
escalated to arbitrary local command/file access.

Fixes:

1. Remove the --dangerously-bypass-approvals-and-sandbox flag from the
   default codex invocation. Users who actively need it for an
   atomization-only run can opt in via GMAIL_ATOMIZE_CODEX_BYPASS=1 env
   var, which documents the risk at the opt-in site.
2. Strengthen DEFAULT_ATOMIZE_PROMPT with an explicit SECURITY section
   that frames the INPUT THOUGHT as untrusted data, not instructions,
   and forbids emitting system/tool/assistant markers in the output.
3. Add a top-of-file comment describing the prompt-injection threat
   model so callers who override the prompt don't silently drop the
   hardening.

This does not eliminate prompt injection (no prompt-only defense can),
but it removes the most dangerous escalation path and raises the bar
from "read email -> run code" to "read email -> influence atoms".
The previous regex `\b(?:aws[_ -]?secret|aws[_ -]?access[_ -]?key)\b`
could not match `aws_secret_access_key=...` — the most common env-var
form — because `_` is a word char, so the `\b` between `t` and `_` in
`aws_secret_access_key` didn't fire, and neither alternation caught
the combined phrase.

Restructured the alternation so `aws_secret` can optionally absorb the
trailing `_access_key`:

  aws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key)

Verified against 8 test cases covering kvp form, uppercase, hyphen
separators, space separators, standalone `aws_secret`, standalone
`aws_access_key`, a negative case, and the full env-var pair. All
pass with no false positives.
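
The restructured alternation can be exercised directly — the regex below is the commit's; the harness around it is just illustrative:

```javascript
// The commit's restructured pattern, run against the cases described above.
const awsPattern = /\baws[_ -]?(?:secret(?:[_ -]?access[_ -]?key)?|access[_ -]?key)/i;

const positives = [
  "aws_secret_access_key=wJalrXUtnFEMI", // the env-var form that used to miss
  "AWS_SECRET_ACCESS_KEY",               // uppercase
  "aws-secret-access-key",               // hyphen separators
  "aws secret access key",               // space separators
  "aws_secret",                          // standalone secret
  "aws_access_key",                      // standalone access key
];
const negatives = ["saw some keys", "awesome secret"];

console.log(positives.every((s) => awsPattern.test(s)));  // true
console.log(negatives.every((s) => !awsPattern.test(s))); // true
```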
@alanshurafa
Owner Author

Refreshing checks after markdownlint cleanup merged into fork main.

@alanshurafa alanshurafa reopened this Apr 22, 2026
@alanshurafa
Owner Author

Refreshing checks after fork markdownlint workflow fix.

@alanshurafa alanshurafa reopened this Apr 22, 2026