[recipes] Atomizer — generic + Gmail re-atomization toolkit by alanshurafa · Pull Request #217 · NateBJones-Projects/OB1

alanshurafa · 2026-04-21T21:16:25Z

Adds the Atomizer toolkit — a recipe for LLM-driven decomposition of large thoughts into smaller atomic facts, with Gmail re-atomization, pipeline audit, and correspondents backfill utilities.

What this adds

Generic atomizer — decomposes any large thought into atomic facts via LLM, preserving lineage through thought_edges.
Gmail re-atomizer — targeted pipeline for Gmail-sourced thoughts, with correspondents extraction and re-linking.
Pipeline audit — reports on atomization coverage, orphaned atoms, and graph integrity.
Correspondents backfill — resolves and deduplicates entities extracted from email metadata.
4 LLM providers — Anthropic direct, OpenRouter, Claude CLI, and Codex CLI are all supported behind a pluggable provider interface.

Dependencies

Assumes the enhanced-thoughts / knowledge-graph schema:

Tables: entities, thought_entities, thought_edges
RPC: upsert_thought

These are part of the schema track in open upstream PRs (see #191 enhanced-thoughts and related). The README documents the minimum DDL so the recipe can be adopted independently of those PRs landing.

Review process

Generalized from Alan's ExoCortex work (2026-04 session) and cross-AI ultra-reviewed across 3 iteration rounds using Claude gsd-code-reviewer, Codex codex exec, and security-reviewer. Approximately 20 findings resolved across 7 fix commits. Notable fixes:

Removed Codex sandbox bypass (P0 security)
Prompt-injection hardening on untrusted thought bodies
PII redaction before LLM calls
Entity-resolver race condition fix
Env-loader idempotency

Known follow-ups (not blockers)

Non-transactional re-atomize pipeline — documented with a CAUTION block in the README; an RPC-based transactional version is a future enhancement.
Windows shell: true + process-tree leak — tracked for a separate PR.
Cost-budget pre-flight for --all runs — tracked for a separate PR.

Review history

Fork pre-review PR with full review trail: alanshurafa#19 (FIXED-AND-CLEAN after 3 rounds).

Ship a community recipe for splitting compound thoughts into atomic single-topic thoughts via an LLM, plus Gmail-specific repair tooling. Components: - atomize-packs.mjs — generic pack-file atomizer with heuristic compound detection and four-provider LLM backend (Claude CLI, Codex, Anthropic, OpenRouter). - re-atomize-gmail-thought.mjs — heals Gmail imports where long bodies were stored whole; splits via the atomizer, re-inserts via upsert_thought, redirects replies_to edges, re-links correspondents. - audit-gmail-pipeline.mjs — JSON/MD report covering scale, metadata completeness, entity-graph integrity, classification distributions, and retrieval probes. - backfill-gmail-correspondents.mjs — idempotent backfill that pre-filters on author-edge presence specifically. - lib/ — shared atomize-text, entity-resolver, and Claude CLI utilities. - test-atomize.mjs — zero-setup sanity test. Ported from the author's private capture pipeline; all personal emails, internal ticket IDs, and hardcoded paths generalized. No secrets; no modifications to the core thoughts table; no DROP / TRUNCATE / unqualified DELETE. Markdownlint clean; metadata.json validates against the OB1 schema.

Codex provider was spawning `codex exec --dangerously-bypass-approvals-and-sandbox` with arbitrary user-controlled memory/email text as the prompt. A prompt injection in a hostile email body could trigger local code execution via the agent's tool access. Removed the codex provider entirely (OpenRouter, Anthropic, and claude-cli cover all use cases without tool access). Added prompt-injection hardening: wrap all user content in <INPUT>...</INPUT> delimiters with an "inert data" instruction, escape literal </INPUT> tags. Redact raw model output from error messages (gated behind ATOMIZE_DEBUG=1).

…ction atomize-packs.mjs now: - loads recipes/atomizer/.env.local resolved relative to the script (so the documented `node atomize-packs.mjs --provider=openrouter` path no longer fails with "requires OPENROUTER_API_KEY" when the key lives in .env.local) - defaults to openrouter provider (codex provider was removed) - warns when --concurrency > MAX_CONCURRENCY is clamped, instead of silent - skips memories whose memoryId matches -split-N$ or that carry metadata.atomization.parent_id, so re-runs don't double-split children - writes only a 60-char preview + fingerprint into atomization-errors.json by default (full text persists only with ATOMIZE_DEBUG_ERRORS=1) to avoid duplicating sensitive memory content

…mize - re-atomize-gmail-thought.mjs: load .env.local relative to script, add &order=id.asc, warn when hitting the default 1000-row cap on --all, wrap main in an async function with .catch() for consistent exit, document partial-failure recovery via metadata.re_atomized_from. - backfill-gmail-correspondents.mjs: script-relative .env.local; move the per-2000 progress log inside the per-thought loop so it actually fires. - audit-gmail-pipeline.mjs: script-relative .env.local; sbCount() now throws on !res.ok instead of silently returning 0 (was hiding auth/query errors as "zero findings"); use explicit jsonb aliases like `thread_id:metadata->gmail->>thread_id` so reads don't break on PostgREST version changes. - test-atomize.mjs: script-relative .env.local, drop codex provider reference, default to openrouter.

- upsertPersonByEmail orphan-adoption is now race-safe: PATCH conditionally on `canonical_email=is.null`, re-SELECT by email if the winner already adopted the orphan with a different canonical_email. Prevents two concurrent backfill workers from linking two different emails to the same entity row. - Resolver log includes only email domain by default (set ENTITY_RESOLVER_DEBUG=1 for full addresses); the 23505 fallback error drops the email and reports only the domain. - makeSbClient.call() error message strips the query string by default so PostgREST filter values (emails, thread_ids) don't leak into shared logs. Full URL available behind ENTITY_RESOLVER_DEBUG=1. - Document loadEnv constraints (UPPER_SNAKE keys, single-line values, process.env wins, caller should pass absolute script-relative path).

- Replace the "Credential Tracker" section that instructed users to paste service-role keys into a text editor with a structured table + security warning. Keeps service-role keys confined to .env.local. - Drop codex from the supported provider list + document why it was removed (prompt-injection → local-code-execution on untrusted input). - Document the 1000-row default cap on re-atomize --all, the partial- failure recovery via metadata.re_atomized_from, and the new debug env flags (ATOMIZE_DEBUG, ATOMIZE_DEBUG_ERRORS, ENTITY_RESOLVER_DEBUG). - Add recipes/atomizer/.env.example with placeholder values (uses "your-…-placeholder" strings that match the Gate's .env allowlist).

…tale refs - entity-resolver.mjs: orphan adoption now detects zero-row PATCH via Prefer: return=representation. If another worker already adopted the orphan with a DIFFERENT email, we fall through to the disambiguated insert path (case c) instead of incorrectly returning the winner's id. Extracted into tryAdoptOrDisambiguate() for clarity. - claude-cli.mjs: stderr/stdout snippets in error messages are gated behind ATOMIZE_DEBUG=1; default prints only byte counts so arbitrary user email/memory text the CLI echoed doesn't end up in logs. - re-atomize-gmail-thought.mjs: buildAtomizeOpts() pre-loads the OpenRouter key when no --provider flag is passed, because atomize-text.mjs defaults to 'openrouter'. Previously running `node re-atomize-gmail-thought.mjs --id=123` without the flag hit a spurious "opts.openrouterApiKey" error at runtime. - README.md / metadata.json: drop stale 4-provider references that still listed Codex as a supported option.

Add a non-blockquote line between two adjacent GitHub alert callouts so markdownlint stops treating them as one blockquote with a blank line inside. Restores a clean run on recipes/atomizer/README.md. The rest of the repo-wide Markdown Lint failure is pre-existing and covered by a separate cleanup PR (tracker in MEMORY, pattern matches NateBJones-Projects#161/NateBJones-Projects#215).

alanshurafa · 2026-04-22T17:21:44Z

Refreshing upstream checks after fork-side readiness cleanup.

alanshurafa added 8 commits April 21, 2026 16:32

github-actions Bot added the recipe Contribution: step-by-step recipe label Apr 21, 2026

This was referenced Apr 22, 2026

[docs] Markdownlint sweep for existing recipe/schema docs alanshurafa/OB1#22

Merged

[docs] Markdownlint sweep for existing recipe/schema docs #224

Open

[docs] Fix pre-existing markdownlint errors across 8 files

3290fca

github-actions Bot added the schema Contribution: database extension label Apr 22, 2026

alanshurafa closed this Apr 22, 2026

alanshurafa reopened this Apr 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[recipes] Atomizer — generic + Gmail re-atomization toolkit#217

[recipes] Atomizer — generic + Gmail re-atomization toolkit#217
alanshurafa wants to merge 9 commits intoNateBJones-Projects:mainfrom
alanshurafa:contrib/alanshurafa/atomizer

alanshurafa commented Apr 21, 2026

Uh oh!

alanshurafa commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alanshurafa commented Apr 21, 2026

What this adds

Dependencies

Review process

Known follow-ups (not blockers)

Review history

Uh oh!

alanshurafa commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant