[recipes] Atomizer — generic + Gmail re-atomization toolkit by alanshurafa · Pull Request #19 · alanshurafa/OB1

alanshurafa · 2026-04-21T20:33:29Z

What this adds

A new community recipe under recipes/atomizer/ that splits compound
multi-topic thoughts into atomic single-topic thoughts via an LLM, plus a
Gmail-specific repair + audit toolkit.

Source: ported from my private ExoCortex capture pipeline (generic
atomization orchestrator + Gmail re-atomize / audit / correspondent-backfill
scripts). All personal emails, internal ticket IDs, and hardcoded paths
were stripped during the port.

Files

atomize-packs.mjs — generic JSON pack-file atomizer. Heuristic compound
detection (sentence count, enumerations, semicolons, conjunction density)
followed by LLM split. Three-provider backend: OpenRouter (default),
Anthropic API, Claude CLI.
re-atomize-gmail-thought.mjs — heals whole-body gmail_export thoughts.
Parses the [Email from X to Y | Subject ... | date] prefix, atomizes the
body, inserts atoms via upsert_thought, redirects replies_to edges,
re-links correspondents, deletes the original.
audit-gmail-pipeline.mjs — JSON or markdown report: scale, metadata
completeness, entity-graph integrity, classification distributions, top
correspondents, atom samples, retrieval probes.
backfill-gmail-correspondents.mjs — idempotent backfill with
author-edge-specific pre-filter.
lib/ — shared atomize-text.mjs, entity-resolver.mjs, claude-cli.mjs.
test-atomize.mjs — zero-setup sanity test.
.env.example — credential template.

Requires

OpenRouter (recommended, default) or Anthropic API key for the LLM backend.
Node.js 18+ (no npm install — pure built-ins).
An Enhanced-Thoughts-style schema: thoughts.source_type,
thoughts.metadata jsonb, entities, thought_entities, thought_edges,
plus an upsert_thought(p_content, p_payload) Postgres function. The
README lists the minimum columns required per table.

Tested

Live on my own Open Brain instance over a 350-thought STARRED Gmail corpus:
27% atomized, 100% author-edge coverage after backfill, 144 replies_to
edges built, entity-keyed retrieval probe exact-match.

Pre-review pipeline

This branch went through an automated Phase B ultra-review (Claude
gsd-code-reviewer role + Codex codex exec + dedicated security pass)
before opening the upstream PR. Round 1 surfaced 1 P0, 5 P1, 11 P2,
8 P3 findings across three reviewers. Round 2 surfaced 1 new P1.
Round 3 reviewed clean. All P0 / P1 findings are fixed:

P0 (sec): removed codex provider entirely — it shelled out with
--dangerously-bypass-approvals-and-sandbox on arbitrary user-controlled
memory/email text, which is a prompt-injection → local-code-execution
primitive. Replaced with <INPUT> delimiter hardening for remaining
providers.
P1: atomize-packs.mjs now loads .env.local script-relative so
the documented node atomize-packs.mjs --provider=openrouter path works.
P1: re-atomize-gmail-thought.mjs pre-loads OPENROUTER_API_KEY
for the default-provider path, warns when hitting the implicit 1000-row
cap, wraps main() in async try/catch.
P1: entity-resolver.mjs orphan adoption is now race-safe — PATCH
with Prefer: return=representation detects zero-row results, falls
through to the disambiguated insert path instead of returning the wrong
entity id.
P1: PostgREST error messages strip query-string values by default
(gated behind ENTITY_RESOLVER_DEBUG=1); Claude CLI and LLM response
snippets in errors are gated behind ATOMIZE_DEBUG=1;
atomization-errors.json persists only a 60-char preview + fingerprint
per failure unless ATOMIZE_DEBUG_ERRORS=1.
P1: README "Credential Tracker" that instructed users to paste
service-role keys into a text editor was replaced with a security table
and warning to keep keys only in .env.local.

P2/P3 follow-ups (non-blocking):

Full transactional upsert_thought + edge-redirect + delete RPC.
Documented in README "Partial-failure recovery" for now.
shell: true + child.kill() can leak the underlying CLI process on
Windows. Not a correctness bug; timeouts still fire.

Known architectural flag

This recipe assumes an Enhanced-Thoughts-style schema (entities,
thought_entities, thought_edges tables + upsert_thought RPC) that is
not yet part of upstream OB1 origin/main. This is deliberate — the
recipe is a stand-alone opt-in capability, not a core-schema change. The
README documents the minimum DDL required and users without those tables
will see clear errors pointing at the prerequisites.

Note to reviewers

This is the pre-review fork PR. The upstream PR to
NateBJones-Projects/OB1 will be opened in a separate Phase C step.
Please do not merge this fork PR — it exists so the review loop has a
stable target.

Ship a community recipe for splitting compound thoughts into atomic single-topic thoughts via an LLM, plus Gmail-specific repair tooling. Components: - atomize-packs.mjs — generic pack-file atomizer with heuristic compound detection and four-provider LLM backend (Claude CLI, Codex, Anthropic, OpenRouter). - re-atomize-gmail-thought.mjs — heals Gmail imports where long bodies were stored whole; splits via the atomizer, re-inserts via upsert_thought, redirects replies_to edges, re-links correspondents. - audit-gmail-pipeline.mjs — JSON/MD report covering scale, metadata completeness, entity-graph integrity, classification distributions, and retrieval probes. - backfill-gmail-correspondents.mjs — idempotent backfill that pre-filters on author-edge presence specifically. - lib/ — shared atomize-text, entity-resolver, and Claude CLI utilities. - test-atomize.mjs — zero-setup sanity test. Ported from the author's private capture pipeline; all personal emails, internal ticket IDs, and hardcoded paths generalized. No secrets; no modifications to the core thoughts table; no DROP / TRUNCATE / unqualified DELETE. Markdownlint clean; metadata.json validates against the OB1 schema.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6fd1ffd4b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-21T20:37:44Z

+    const thoughtId = active.id as number;
+    const newStatus = over.id as string;
+


Derive drop status from container, not over.id

In handleDragEnd, newStatus is taken directly from over.id, but cards are registered as sortable items with numeric IDs (useSortable({ id: thought.id })), so dropping onto another card yields a number instead of a kanban status string. That sends invalid status values to /api/kanban/update and triggers rollbacks, making drag-and-drop fail whenever the target column is non-empty; you need to read the destination container/status ID instead of the hovered item's ID.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-21T20:37:44Z

+    sp.set("per_page", "100");
+    sp.set("sort", "importance");
+    sp.set("order", "desc");
+    sp.set("type", thoughtType);
+    if (params?.status) sp.set("status", params.status);
+    if (params?.exclude_restricted !== undefined)
+      sp.set("exclude_restricted", String(params.exclude_restricted));
+    const qs = sp.toString();
+    const data = await apiFetch<BrowseResponse>(apiKey, `/thoughts?${qs}`);


Fetch all kanban pages instead of first 100 only

The kanban loader hard-caps each type query to per_page=100 and performs only one request, even though the browse response includes pagination metadata (total, page, per_page). For users with more than 100 task or idea thoughts, items beyond the first page never appear on the board or summary, so workflow state becomes incomplete; iterate pages until the full result set is collected.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-21T20:37:44Z

+  const m = /^(\[Email[^\]]*\])\s*([\s\S]*)$/.exec(content);
+  if (!m) return { prefix: "", body: content };
+  return { prefix: m[1], body: m[2] };


Parse Gmail prefix robustly when subject contains ]

The prefix parser matches \[Email[^\]]*\], which stops at the first ]; Gmail subjects are inserted raw into this bracketed prefix by the importer, so any subject containing ] truncates the parsed prefix/body boundary. That can produce malformed prefixWithAtomTag output and incorrect re-atomized content for those messages; parsing should use a delimiter strategy that tolerates ] inside subject text (or rely on structured metadata fields).

Useful? React with 👍 / 👎.

Codex provider was spawning `codex exec --dangerously-bypass-approvals-and-sandbox` with arbitrary user-controlled memory/email text as the prompt. A prompt injection in a hostile email body could trigger local code execution via the agent's tool access. Removed the codex provider entirely (OpenRouter, Anthropic, and claude-cli cover all use cases without tool access). Added prompt-injection hardening: wrap all user content in <INPUT>...</INPUT> delimiters with an "inert data" instruction, escape literal </INPUT> tags. Redact raw model output from error messages (gated behind ATOMIZE_DEBUG=1).

…ction atomize-packs.mjs now: - loads recipes/atomizer/.env.local resolved relative to the script (so the documented `node atomize-packs.mjs --provider=openrouter` path no longer fails with "requires OPENROUTER_API_KEY" when the key lives in .env.local) - defaults to openrouter provider (codex provider was removed) - warns when --concurrency > MAX_CONCURRENCY is clamped, instead of silent - skips memories whose memoryId matches -split-N$ or that carry metadata.atomization.parent_id, so re-runs don't double-split children - writes only a 60-char preview + fingerprint into atomization-errors.json by default (full text persists only with ATOMIZE_DEBUG_ERRORS=1) to avoid duplicating sensitive memory content

…mize - re-atomize-gmail-thought.mjs: load .env.local relative to script, add &order=id.asc, warn when hitting the default 1000-row cap on --all, wrap main in an async function with .catch() for consistent exit, document partial-failure recovery via metadata.re_atomized_from. - backfill-gmail-correspondents.mjs: script-relative .env.local; move the per-2000 progress log inside the per-thought loop so it actually fires. - audit-gmail-pipeline.mjs: script-relative .env.local; sbCount() now throws on !res.ok instead of silently returning 0 (was hiding auth/query errors as "zero findings"); use explicit jsonb aliases like `thread_id:metadata->gmail->>thread_id` so reads don't break on PostgREST version changes. - test-atomize.mjs: script-relative .env.local, drop codex provider reference, default to openrouter.

- upsertPersonByEmail orphan-adoption is now race-safe: PATCH conditionally on `canonical_email=is.null`, re-SELECT by email if the winner already adopted the orphan with a different canonical_email. Prevents two concurrent backfill workers from linking two different emails to the same entity row. - Resolver log includes only email domain by default (set ENTITY_RESOLVER_DEBUG=1 for full addresses); the 23505 fallback error drops the email and reports only the domain. - makeSbClient.call() error message strips the query string by default so PostgREST filter values (emails, thread_ids) don't leak into shared logs. Full URL available behind ENTITY_RESOLVER_DEBUG=1. - Document loadEnv constraints (UPPER_SNAKE keys, single-line values, process.env wins, caller should pass absolute script-relative path).

- Replace the "Credential Tracker" section that instructed users to paste service-role keys into a text editor with a structured table + security warning. Keeps service-role keys confined to .env.local. - Drop codex from the supported provider list + document why it was removed (prompt-injection → local-code-execution on untrusted input). - Document the 1000-row default cap on re-atomize --all, the partial- failure recovery via metadata.re_atomized_from, and the new debug env flags (ATOMIZE_DEBUG, ATOMIZE_DEBUG_ERRORS, ENTITY_RESOLVER_DEBUG). - Add recipes/atomizer/.env.example with placeholder values (uses "your-…-placeholder" strings that match the Gate's .env allowlist).

…tale refs - entity-resolver.mjs: orphan adoption now detects zero-row PATCH via Prefer: return=representation. If another worker already adopted the orphan with a DIFFERENT email, we fall through to the disambiguated insert path (case c) instead of incorrectly returning the winner's id. Extracted into tryAdoptOrDisambiguate() for clarity. - claude-cli.mjs: stderr/stdout snippets in error messages are gated behind ATOMIZE_DEBUG=1; default prints only byte counts so arbitrary user email/memory text the CLI echoed doesn't end up in logs. - re-atomize-gmail-thought.mjs: buildAtomizeOpts() pre-loads the OpenRouter key when no --provider flag is passed, because atomize-text.mjs defaults to 'openrouter'. Previously running `node re-atomize-gmail-thought.mjs --id=123` without the flag hit a spurious "opts.openrouterApiKey" error at runtime. - README.md / metadata.json: drop stale 4-provider references that still listed Codex as a supported option.

Add a non-blockquote line between two adjacent GitHub alert callouts so markdownlint stops treating them as one blockquote with a blank line inside. Restores a clean run on recipes/atomizer/README.md. The rest of the repo-wide Markdown Lint failure is pre-existing and covered by a separate cleanup PR (tracker in MEMORY, pattern matches NateBJones-Projects#161/NateBJones-Projects#215).

alanshurafa · 2026-04-22T16:54:28Z

Refreshing checks after markdownlint cleanup merged into fork main.

alanshurafa · 2026-04-22T16:57:36Z

Refreshing checks after fork markdownlint workflow fix.

github-actions Bot added dashboard documentation Improvements or additions to documentation extension integration primitive recipe labels Apr 21, 2026

chatgpt-codex-connector Bot reviewed Apr 21, 2026

View reviewed changes

alanshurafa added 7 commits April 21, 2026 16:54

alanshurafa mentioned this pull request Apr 21, 2026

[recipes] Atomizer — generic + Gmail re-atomization toolkit NateBJones-Projects/OB1#217

Open

alanshurafa closed this Apr 22, 2026

alanshurafa reopened this Apr 22, 2026

alanshurafa closed this Apr 22, 2026

alanshurafa reopened this Apr 22, 2026

[docs] Fix pre-existing markdownlint errors across 8 files

3290fca

github-actions Bot added the schema label Apr 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[recipes] Atomizer — generic + Gmail re-atomization toolkit#19

[recipes] Atomizer — generic + Gmail re-atomization toolkit#19
alanshurafa wants to merge 9 commits intomainfrom
contrib/alanshurafa/atomizer

alanshurafa commented Apr 21, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 21, 2026

Uh oh!

chatgpt-codex-connector Bot Apr 21, 2026

Uh oh!

chatgpt-codex-connector Bot Apr 21, 2026

Uh oh!

alanshurafa commented Apr 22, 2026

Uh oh!

alanshurafa commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		const thoughtId = active.id as number;
		const newStatus = over.id as string;

Conversation

alanshurafa commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this adds

Files

Requires

Tested

Pre-review pipeline

Known architectural flag

Note to reviewers

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

alanshurafa commented Apr 22, 2026

Uh oh!

alanshurafa commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

alanshurafa commented Apr 21, 2026 •

edited

Loading