[recipes] Atomizer — generic + Gmail re-atomization toolkit#217
Open
alanshurafa wants to merge 9 commits intoNateBJones-Projects:mainfrom
Open
[recipes] Atomizer — generic + Gmail re-atomization toolkit#217alanshurafa wants to merge 9 commits intoNateBJones-Projects:mainfrom
alanshurafa wants to merge 9 commits intoNateBJones-Projects:mainfrom
Conversation
Ship a community recipe for splitting compound thoughts into atomic single-topic thoughts via an LLM, plus Gmail-specific repair tooling. Components: - atomize-packs.mjs — generic pack-file atomizer with heuristic compound detection and four-provider LLM backend (Claude CLI, Codex, Anthropic, OpenRouter). - re-atomize-gmail-thought.mjs — heals Gmail imports where long bodies were stored whole; splits via the atomizer, re-inserts via upsert_thought, redirects replies_to edges, re-links correspondents. - audit-gmail-pipeline.mjs — JSON/MD report covering scale, metadata completeness, entity-graph integrity, classification distributions, and retrieval probes. - backfill-gmail-correspondents.mjs — idempotent backfill that pre-filters on author-edge presence specifically. - lib/ — shared atomize-text, entity-resolver, and Claude CLI utilities. - test-atomize.mjs — zero-setup sanity test. Ported from the author's private capture pipeline; all personal emails, internal ticket IDs, and hardcoded paths generalized. No secrets; no modifications to the core thoughts table; no DROP / TRUNCATE / unqualified DELETE. Markdownlint clean; metadata.json validates against the OB1 schema.
Codex provider was spawning `codex exec --dangerously-bypass-approvals-and-sandbox` with arbitrary user-controlled memory/email text as the prompt. A prompt injection in a hostile email body could trigger local code execution via the agent's tool access. Removed the codex provider entirely (OpenRouter, Anthropic, and claude-cli cover all use cases without tool access). Added prompt-injection hardening: wrap all user content in <INPUT>...</INPUT> delimiters with an "inert data" instruction, escape literal </INPUT> tags. Redact raw model output from error messages (gated behind ATOMIZE_DEBUG=1).
…ction atomize-packs.mjs now: - loads recipes/atomizer/.env.local resolved relative to the script (so the documented `node atomize-packs.mjs --provider=openrouter` path no longer fails with "requires OPENROUTER_API_KEY" when the key lives in .env.local) - defaults to openrouter provider (codex provider was removed) - warns when --concurrency > MAX_CONCURRENCY is clamped, instead of silent - skips memories whose memoryId matches -split-N$ or that carry metadata.atomization.parent_id, so re-runs don't double-split children - writes only a 60-char preview + fingerprint into atomization-errors.json by default (full text persists only with ATOMIZE_DEBUG_ERRORS=1) to avoid duplicating sensitive memory content
…mize - re-atomize-gmail-thought.mjs: load .env.local relative to script, add &order=id.asc, warn when hitting the default 1000-row cap on --all, wrap main in an async function with .catch() for consistent exit, document partial-failure recovery via metadata.re_atomized_from. - backfill-gmail-correspondents.mjs: script-relative .env.local; move the per-2000 progress log inside the per-thought loop so it actually fires. - audit-gmail-pipeline.mjs: script-relative .env.local; sbCount() now throws on !res.ok instead of silently returning 0 (was hiding auth/query errors as "zero findings"); use explicit jsonb aliases like `thread_id:metadata->gmail->>thread_id` so reads don't break on PostgREST version changes. - test-atomize.mjs: script-relative .env.local, drop codex provider reference, default to openrouter.
- upsertPersonByEmail orphan-adoption is now race-safe: PATCH conditionally on `canonical_email=is.null`, re-SELECT by email if the winner already adopted the orphan with a different canonical_email. Prevents two concurrent backfill workers from linking two different emails to the same entity row. - Resolver log includes only email domain by default (set ENTITY_RESOLVER_DEBUG=1 for full addresses); the 23505 fallback error drops the email and reports only the domain. - makeSbClient.call() error message strips the query string by default so PostgREST filter values (emails, thread_ids) don't leak into shared logs. Full URL available behind ENTITY_RESOLVER_DEBUG=1. - Document loadEnv constraints (UPPER_SNAKE keys, single-line values, process.env wins, caller should pass absolute script-relative path).
- Replace the "Credential Tracker" section that instructed users to paste service-role keys into a text editor with a structured table + security warning. Keeps service-role keys confined to .env.local. - Drop codex from the supported provider list + document why it was removed (prompt-injection → local-code-execution on untrusted input). - Document the 1000-row default cap on re-atomize --all, the partial- failure recovery via metadata.re_atomized_from, and the new debug env flags (ATOMIZE_DEBUG, ATOMIZE_DEBUG_ERRORS, ENTITY_RESOLVER_DEBUG). - Add recipes/atomizer/.env.example with placeholder values (uses "your-…-placeholder" strings that match the Gate's .env allowlist).
…tale refs - entity-resolver.mjs: orphan adoption now detects zero-row PATCH via Prefer: return=representation. If another worker already adopted the orphan with a DIFFERENT email, we fall through to the disambiguated insert path (case c) instead of incorrectly returning the winner's id. Extracted into tryAdoptOrDisambiguate() for clarity. - claude-cli.mjs: stderr/stdout snippets in error messages are gated behind ATOMIZE_DEBUG=1; default prints only byte counts so arbitrary user email/memory text the CLI echoed doesn't end up in logs. - re-atomize-gmail-thought.mjs: buildAtomizeOpts() pre-loads the OpenRouter key when no --provider flag is passed, because atomize-text.mjs defaults to 'openrouter'. Previously running `node re-atomize-gmail-thought.mjs --id=123` without the flag hit a spurious "opts.openrouterApiKey" error at runtime. - README.md / metadata.json: drop stale 4-provider references that still listed Codex as a supported option.
Add a non-blockquote line between two adjacent GitHub alert callouts so markdownlint stops treating them as one blockquote with a blank line inside. Restores a clean run on recipes/atomizer/README.md. The rest of the repo-wide Markdown Lint failure is pre-existing and covered by a separate cleanup PR (tracker in MEMORY, pattern matches NateBJones-Projects#161/NateBJones-Projects#215).
This was referenced Apr 22, 2026
Contributor
Author
|
Refreshing upstream checks after fork-side readiness cleanup. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds the Atomizer toolkit — a recipe for LLM-driven decomposition of large thoughts into smaller atomic facts, with Gmail re-atomization, pipeline audit, and correspondents backfill utilities.
What this adds
thought_edges.Dependencies
Assumes the enhanced-thoughts / knowledge-graph schema:
entities,thought_entities,thought_edgesupsert_thoughtThese are part of the schema track in open upstream PRs (see #191 enhanced-thoughts and related). The README documents the minimum DDL so the recipe can be adopted independently of those PRs landing.
Review process
Generalized from Alan's ExoCortex work (2026-04 session) and cross-AI ultra-reviewed across 3 iteration rounds using Claude gsd-code-reviewer, Codex
codex exec, and security-reviewer. Approximately 20 findings resolved across 7 fix commits. Notable fixes:Known follow-ups (not blockers)
shell: true+ process-tree leak — tracked for a separate PR.--allruns — tracked for a separate PR.Review history
Fork pre-review PR with full review trail: alanshurafa#19 (FIXED-AND-CLEAN after 3 rounds).