[recipes] Atomizer — generic + Gmail re-atomization toolkit#19
alanshurafa wants to merge 9 commits into main
Conversation
Ship a community recipe for splitting compound thoughts into atomic single-topic thoughts via an LLM, plus Gmail-specific repair tooling. Components:

- `atomize-packs.mjs` — generic pack-file atomizer with heuristic compound detection and a four-provider LLM backend (Claude CLI, Codex, Anthropic, OpenRouter).
- `re-atomize-gmail-thought.mjs` — heals Gmail imports where long bodies were stored whole; splits via the atomizer, re-inserts via `upsert_thought`, redirects `replies_to` edges, re-links correspondents.
- `audit-gmail-pipeline.mjs` — JSON/MD report covering scale, metadata completeness, entity-graph integrity, classification distributions, and retrieval probes.
- `backfill-gmail-correspondents.mjs` — idempotent backfill that pre-filters specifically on author-edge presence.
- `lib/` — shared atomize-text, entity-resolver, and Claude CLI utilities.
- `test-atomize.mjs` — zero-setup sanity test.

Ported from the author's private capture pipeline; all personal emails, internal ticket IDs, and hardcoded paths generalized. No secrets; no modifications to the core thoughts table; no DROP / TRUNCATE / unqualified DELETE. Markdownlint clean; metadata.json validates against the OB1 schema.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f6fd1ffd4b
```typescript
const thoughtId = active.id as number;
const newStatus = over.id as string;
```
Derive drop status from the container, not `over.id`
In handleDragEnd, newStatus is taken directly from over.id, but cards are registered as sortable items with numeric IDs (useSortable({ id: thought.id })), so dropping onto another card yields a number instead of a kanban status string. That sends invalid status values to /api/kanban/update and triggers rollbacks, making drag-and-drop fail whenever the target column is non-empty; you need to read the destination container/status ID instead of the hovered item's ID.
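The fix the review suggests could be sketched as a small resolver. The `over.data.current.sortable.containerId` shape is dnd-kit's registration data for sortable items; the helper name and the valid-status guard are illustrative, not the PR's actual code:

```typescript
// Sketch: derive the destination kanban column from the drop target.
// Cards registered via useSortable({ id: thought.id }) carry numeric ids,
// but their containing column is exposed through sortable.containerId;
// a drop onto an empty column surfaces the column id directly as over.id.
type DragOver = {
  id: string | number;
  data?: { current?: { sortable?: { containerId?: string } } };
};

function resolveDropStatus(over: DragOver, validStatuses: string[]): string | null {
  // Prefer the container id registered for the hovered card.
  const containerId = over.data?.current?.sortable?.containerId;
  if (typeof containerId === "string" && validStatuses.includes(containerId)) {
    return containerId;
  }
  // Fallback: the drop target is the column itself.
  if (typeof over.id === "string" && validStatuses.includes(over.id)) {
    return over.id;
  }
  // Numeric or unknown target: no status change, skip the API call.
  return null;
}
```

Returning `null` for unresolvable targets avoids sending invalid status values to `/api/kanban/update` in the first place.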
```typescript
sp.set("per_page", "100");
sp.set("sort", "importance");
sp.set("order", "desc");
sp.set("type", thoughtType);
if (params?.status) sp.set("status", params.status);
if (params?.exclude_restricted !== undefined)
  sp.set("exclude_restricted", String(params.exclude_restricted));
const qs = sp.toString();
const data = await apiFetch<BrowseResponse>(apiKey, `/thoughts?${qs}`);
```
Fetch all kanban pages instead of first 100 only
The kanban loader hard-caps each type query to per_page=100 and performs only one request, even though the browse response includes pagination metadata (total, page, per_page). For users with more than 100 task or idea thoughts, items beyond the first page never appear on the board or summary, so workflow state becomes incomplete; iterate pages until the full result set is collected.
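The suggested page iteration might look like the following sketch, where `fetchPage` stands in for the real `apiFetch`-backed loader and the response shape (`total` / `page` / `per_page`) follows the pagination metadata described above; names are illustrative:

```typescript
// Sketch: collect every page of a paginated browse query.
type BrowsePage<T> = { items: T[]; total: number; page: number; per_page: number };

async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<BrowsePage<T>>
): Promise<T[]> {
  const all: T[] = [];
  let page = 1;
  for (;;) {
    const res = await fetchPage(page);
    all.push(...res.items);
    const seen = res.page * res.per_page;
    // Stop once the metadata says we have everything (or a page comes back empty).
    if (seen >= res.total || res.items.length === 0) break;
    page += 1;
  }
  return all;
}
```

The kanban loader would then call this once per thought type instead of issuing a single capped request.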
```typescript
const m = /^(\[Email[^\]]*\])\s*([\s\S]*)$/.exec(content);
if (!m) return { prefix: "", body: content };
return { prefix: m[1], body: m[2] };
```
Parse the Gmail prefix robustly when the subject contains `]`
The prefix parser matches \[Email[^\]]*\], which stops at the first ]; Gmail subjects are inserted raw into this bracketed prefix by the importer, so any subject containing ] truncates the parsed prefix/body boundary. That can produce malformed prefixWithAtomTag output and incorrect re-atomized content for those messages; parsing should use a delimiter strategy that tolerates ] inside subject text (or rely on structured metadata fields).
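One delimiter strategy along the lines the review suggests: bound the prefix by the last `]` on the first line rather than the first one. This is a sketch assuming the importer writes the bracketed prefix at the start of the content with the body following; the function name is illustrative:

```typescript
// Sketch: tolerate `]` inside the subject by taking the LAST `]`
// on the first line as the prefix boundary.
function splitEmailPrefix(content: string): { prefix: string; body: string } {
  const nl = content.indexOf("\n");
  const firstLine = nl === -1 ? content : content.slice(0, nl);
  if (!firstLine.startsWith("[Email")) return { prefix: "", body: content };
  const close = firstLine.lastIndexOf("]");
  if (close === -1) return { prefix: "", body: content };
  const prefix = content.slice(0, close + 1);
  // Body is everything after the prefix, minus the separating whitespace.
  const body = content.slice(close + 1).replace(/^\s+/, "");
  return { prefix, body };
}
```

Relying on structured metadata fields (as the review also mentions) would be more robust still, since it avoids re-parsing the subject at all.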
The codex provider was spawning `codex exec --dangerously-bypass-approvals-and-sandbox` with arbitrary user-controlled memory/email text as the prompt. A prompt injection in a hostile email body could trigger local code execution via the agent's tool access.

- Removed the codex provider entirely (OpenRouter, Anthropic, and claude-cli cover all use cases without tool access).
- Added prompt-injection hardening: wrap all user content in `<INPUT>...</INPUT>` delimiters with an "inert data" instruction; escape literal `</INPUT>` tags.
- Redacted raw model output from error messages (shown only with `ATOMIZE_DEBUG=1`).
…ction

atomize-packs.mjs now:

- loads recipes/atomizer/.env.local resolved relative to the script (so the documented `node atomize-packs.mjs --provider=openrouter` path no longer fails with "requires OPENROUTER_API_KEY" when the key lives in .env.local)
- defaults to the openrouter provider (the codex provider was removed)
- warns when --concurrency > MAX_CONCURRENCY is clamped, instead of clamping silently
- skips memories whose memoryId matches `-split-N$` or that carry metadata.atomization.parent_id, so re-runs don't double-split children
- writes only a 60-char preview + fingerprint into atomization-errors.json by default (full text persists only with ATOMIZE_DEBUG_ERRORS=1) to avoid duplicating sensitive memory content
…mize

- re-atomize-gmail-thought.mjs: load .env.local relative to the script, add `&order=id.asc`, warn when hitting the default 1000-row cap on --all, wrap main in an async function with .catch() for a consistent exit, document partial-failure recovery via metadata.re_atomized_from.
- backfill-gmail-correspondents.mjs: script-relative .env.local; move the per-2000 progress log inside the per-thought loop so it actually fires.
- audit-gmail-pipeline.mjs: script-relative .env.local; sbCount() now throws on !res.ok instead of silently returning 0 (which was hiding auth/query errors as "zero findings"); use explicit jsonb aliases like `thread_id:metadata->gmail->>thread_id` so reads don't break on PostgREST version changes.
- test-atomize.mjs: script-relative .env.local, drop the codex provider reference, default to openrouter.
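The `sbCount()` fail-loud behavior could look roughly like this sketch. It assumes PostgREST's `Content-Range` response header (populated when the request sends `Prefer: count=exact`); the helper name and injected fetcher are illustrative:

```typescript
// Sketch: count rows via PostgREST, throwing instead of returning 0 on failure.
async function countRows(
  doFetch: () => Promise<{
    ok: boolean;
    status: number;
    headers: { get(name: string): string | null };
  }>
): Promise<number> {
  const res = await doFetch();
  if (!res.ok) {
    // Surface auth/query failures instead of reporting "zero findings".
    throw new Error(`count query failed with HTTP ${res.status}`);
  }
  const range = res.headers.get("content-range") ?? ""; // e.g. "0-24/3573"
  const total = Number(range.split("/")[1]);
  if (!Number.isFinite(total)) {
    throw new Error(`unparseable Content-Range: "${range}"`);
  }
  return total;
}
```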
- upsertPersonByEmail orphan adoption is now race-safe: PATCH conditionally on `canonical_email=is.null`, then re-SELECT by email if the winner already adopted the orphan with a different canonical_email. Prevents two concurrent backfill workers from linking two different emails to the same entity row.
- The resolver log includes only the email domain by default (set ENTITY_RESOLVER_DEBUG=1 for full addresses); the 23505 fallback error drops the email and reports only the domain.
- makeSbClient.call() error messages strip the query string by default so PostgREST filter values (emails, thread_ids) don't leak into shared logs. The full URL is available behind ENTITY_RESOLVER_DEBUG=1.
- Document loadEnv constraints (UPPER_SNAKE keys, single-line values, process.env wins, caller should pass an absolute script-relative path).
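The race-safe adoption decision can be sketched as pure logic over the conditional-PATCH result. With `Prefer: return=representation`, zero returned rows means another worker won the race; the types and function names here are illustrative, not the resolver's actual code:

```typescript
// Sketch: decide the outcome of an orphan-adoption attempt.
type Entity = { id: number; canonical_email: string | null };

function resolveAdoption(
  patchedRows: Entity[],
  targetEmail: string,
  refetchByEmail: () => Entity | null
): { action: "adopted" | "reuse-winner" | "insert-disambiguated"; id?: number } {
  if (patchedRows.length === 1) {
    // Our conditional PATCH won: the orphan now carries our email.
    return { action: "adopted", id: patchedRows[0].id };
  }
  // Zero rows: someone else adopted the orphan first. Re-check by email.
  const winner = refetchByEmail();
  if (winner && winner.canonical_email === targetEmail) {
    return { action: "reuse-winner", id: winner.id };
  }
  // The winner holds a DIFFERENT email: fall through to a disambiguated insert.
  return { action: "insert-disambiguated" };
}
```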
- Replace the "Credential Tracker" section that instructed users to paste service-role keys into a text editor with a structured table + security warning. Keeps service-role keys confined to .env.local.
- Drop codex from the supported provider list and document why it was removed (prompt injection → local code execution on untrusted input).
- Document the 1000-row default cap on re-atomize --all, the partial-failure recovery via metadata.re_atomized_from, and the new debug env flags (ATOMIZE_DEBUG, ATOMIZE_DEBUG_ERRORS, ENTITY_RESOLVER_DEBUG).
- Add recipes/atomizer/.env.example with placeholder values (uses "your-…-placeholder" strings that match the Gate's .env allowlist).
…tale refs

- entity-resolver.mjs: orphan adoption now detects a zero-row PATCH via `Prefer: return=representation`. If another worker already adopted the orphan with a DIFFERENT email, we fall through to the disambiguated insert path (case c) instead of incorrectly returning the winner's id. Extracted into tryAdoptOrDisambiguate() for clarity.
- claude-cli.mjs: stderr/stdout snippets in error messages are gated behind ATOMIZE_DEBUG=1; by default only byte counts are printed, so arbitrary user email/memory text the CLI echoed doesn't end up in logs.
- re-atomize-gmail-thought.mjs: buildAtomizeOpts() pre-loads the OpenRouter key when no --provider flag is passed, because atomize-text.mjs defaults to 'openrouter'. Previously, running `node re-atomize-gmail-thought.mjs --id=123` without the flag hit a spurious "opts.openrouterApiKey" error at runtime.
- README.md / metadata.json: drop stale four-provider references that still listed Codex as a supported option.
Add a non-blockquote line between two adjacent GitHub alert callouts so markdownlint stops treating them as one blockquote with a blank line inside. Restores a clean run on recipes/atomizer/README.md. The rest of the repo-wide Markdown Lint failure is pre-existing and covered by a separate cleanup PR (tracker in MEMORY, pattern matches NateBJones-Projects#161/NateBJones-Projects#215).
Refreshing checks after markdownlint cleanup merged into fork main.

Refreshing checks after fork markdownlint workflow fix.
What this adds
A new community recipe under `recipes/atomizer/` that splits compound multi-topic thoughts into atomic single-topic thoughts via an LLM, plus a Gmail-specific repair + audit toolkit.
Source: ported from my private ExoCortex capture pipeline (generic
atomization orchestrator + Gmail re-atomize / audit / correspondent-backfill
scripts). All personal emails, internal ticket IDs, and hardcoded paths
were stripped during the port.
Files
- `atomize-packs.mjs` — generic JSON pack-file atomizer. Heuristic compound detection (sentence count, enumerations, semicolons, conjunction density) followed by an LLM split. Three-provider backend: OpenRouter (default), Anthropic API, Claude CLI.
- `re-atomize-gmail-thought.mjs` — heals whole-body `gmail_export` thoughts. Parses the `[Email from X to Y | Subject ... | date]` prefix, atomizes the body, inserts atoms via `upsert_thought`, redirects `replies_to` edges, re-links correspondents, deletes the original.
- `audit-gmail-pipeline.mjs` — JSON or markdown report: scale, metadata completeness, entity-graph integrity, classification distributions, top correspondents, atom samples, retrieval probes.
- `backfill-gmail-correspondents.mjs` — idempotent backfill with an author-edge-specific pre-filter.
- `lib/` — shared `atomize-text.mjs`, `entity-resolver.mjs`, `claude-cli.mjs`.
- `test-atomize.mjs` — zero-setup sanity test.
- `.env.example` — credential template.

No `npm install` required — pure built-ins. Requires `thoughts.source_type`, `thoughts.metadata` jsonb, `entities`, `thought_entities`, `thought_edges`, plus an `upsert_thought(p_content, p_payload)` Postgres function. The README lists the minimum columns required per table.
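As a rough illustration of the heuristic compound detection named above (sentence count, enumerations, semicolons, conjunction density), a scoring function might look like this; the thresholds and weights are invented for the sketch, not the recipe's actual values:

```typescript
// Sketch: score the compound-thought signals and flag likely candidates.
function looksCompound(text: string): boolean {
  const sentences = text.split(/[.!?]+/).filter((s) => s.trim().length > 0);
  const semicolons = (text.match(/;/g) ?? []).length;
  // Enumerations: leading "1.", "2)", "-", or "*" markers on any line.
  const enumerations = (text.match(/(^|\n)\s*(\d+[.)]|[-*])\s+/g) ?? []).length;
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const conjunctions = words.filter((w) =>
    ["and", "but", "also", "plus", "then"].includes(w)
  ).length;
  const conjunctionDensity = words.length ? conjunctions / words.length : 0;

  let score = 0;
  if (sentences.length >= 3) score += 2;
  if (semicolons >= 1) score += 1;
  if (enumerations >= 2) score += 2;
  if (conjunctionDensity > 0.05) score += 1;
  return score >= 2; // only flagged thoughts are sent to the LLM for splitting
}
```

The point of the pre-filter is cost control: only thoughts the heuristic flags incur an LLM call.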
Tested
Live on my own Open Brain instance over a 350-thought STARRED Gmail corpus: 27% atomized, 100% author-edge coverage after backfill, 144 `replies_to` edges built, entity-keyed retrieval probe exact-match.
Pre-review pipeline
This branch went through an automated Phase B ultra-review (Claude gsd-code-reviewer role + Codex `codex exec` + a dedicated security pass) before opening the upstream PR. Round 1 surfaced 1 P0, 5 P1, 11 P2, and 8 P3 findings across three reviewers. Round 2 surfaced 1 new P1. Round 3 reviewed clean. All P0 / P1 findings are fixed:
- Removed the `codex` provider entirely — it shelled out with `--dangerously-bypass-approvals-and-sandbox` on arbitrary user-controlled memory/email text, which is a prompt-injection → local-code-execution primitive. Replaced with `<INPUT>` delimiter hardening for the remaining providers.
- `atomize-packs.mjs` now loads `.env.local` script-relative so the documented `node atomize-packs.mjs --provider=openrouter` path works.
- `re-atomize-gmail-thought.mjs` pre-loads `OPENROUTER_API_KEY` for the default-provider path, warns when hitting the implicit 1000-row cap, and wraps `main()` in async try/catch.
- `entity-resolver.mjs` orphan adoption is now race-safe — PATCH with `Prefer: return=representation` detects zero-row results and falls through to the disambiguated insert path instead of returning the wrong entity id.
- Resolver logs report only email domains by default (full addresses gated behind `ENTITY_RESOLVER_DEBUG=1`); Claude CLI and LLM response snippets in errors are gated behind `ATOMIZE_DEBUG=1`; `atomization-errors.json` persists only a 60-char preview + fingerprint per failure unless `ATOMIZE_DEBUG_ERRORS=1`.
ATOMIZE_DEBUG_ERRORS=1.service-role keys into a text editor was replaced with a security table
and warning to keep keys only in
.env.local.P2/P3 follow-ups (non-blocking):
- No transactional wrapper yet around the `upsert_thought` + edge-redirect + delete RPC sequence. Documented in the README's "Partial-failure recovery" section for now.
- `shell: true` + `child.kill()` can leak the underlying CLI process on Windows. Not a correctness bug; timeouts still fire.
Known architectural flag
This recipe assumes an Enhanced-Thoughts-style schema (`entities`, `thought_entities`, `thought_edges` tables + an `upsert_thought` RPC) that is not yet part of upstream OB1 `origin/main`. This is deliberate — the recipe is a stand-alone opt-in capability, not a core-schema change. The README documents the minimum DDL required, and users without those tables will see clear errors pointing at the prerequisites.
Note to reviewers
This is the pre-review fork PR. The upstream PR to `NateBJones-Projects/OB1` will be opened in a separate Phase C step. Please do not merge this fork PR — it exists so the review loop has a stable target.