
[recipes] Atomizer — generic + Gmail re-atomization toolkit #19

Open
alanshurafa wants to merge 9 commits into main from
contrib/alanshurafa/atomizer


@alanshurafa
Owner

@alanshurafa alanshurafa commented Apr 21, 2026

What this adds

A new community recipe under recipes/atomizer/ that splits compound
multi-topic thoughts into atomic single-topic thoughts via an LLM, plus a
Gmail-specific repair + audit toolkit.

Source: ported from my private ExoCortex capture pipeline (generic
atomization orchestrator + Gmail re-atomize / audit / correspondent-backfill
scripts). All personal emails, internal ticket IDs, and hardcoded paths
were stripped during the port.

Files

  • atomize-packs.mjs — generic JSON pack-file atomizer. Heuristic compound
    detection (sentence count, enumerations, semicolons, conjunction density)
    followed by LLM split. Three-provider backend: OpenRouter (default),
    Anthropic API, Claude CLI.
  • re-atomize-gmail-thought.mjs — heals whole-body gmail_export thoughts.
    Parses the [Email from X to Y | Subject ... | date] prefix, atomizes the
    body, inserts atoms via upsert_thought, redirects replies_to edges,
    re-links correspondents, deletes the original.
  • audit-gmail-pipeline.mjs — JSON or markdown report: scale, metadata
    completeness, entity-graph integrity, classification distributions, top
    correspondents, atom samples, retrieval probes.
  • backfill-gmail-correspondents.mjs — idempotent backfill with
    author-edge-specific pre-filter.
  • lib/ — shared atomize-text.mjs, entity-resolver.mjs, claude-cli.mjs.
  • test-atomize.mjs — zero-setup sanity test.
  • .env.example — credential template.
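
The heuristic pre-filter described for atomize-packs.mjs (sentence count, enumerations, semicolons, conjunction density) could look roughly like the sketch below. The function name `detectCompound`, the conjunction list, and every threshold are assumptions made for illustration; the recipe's actual values are not shown in this PR.

```javascript
// Hypothetical thresholds; not taken from atomize-packs.mjs.
const DEFAULTS = {
  minSentences: 4,
  minSemicolons: 2,
  minEnumerations: 2,
  minConjunctionDensity: 0.08,
  minWords: 12, // below this, conjunction density is too noisy to use
};
// Hypothetical conjunction list.
const CONJUNCTIONS = new Set(["and", "also", "but", "plus", "additionally", "meanwhile"]);

function detectCompound(text, opts = DEFAULTS) {
  // Split on sentence-ending punctuation followed by whitespace or end of text.
  const sentences = text.split(/[.!?]+(?:\s+|$)/).filter((s) => s.trim() !== "");
  const semicolons = (text.match(/;/g) || []).length;
  // Enumerations: lines that start with "1.", "2)", "-", "*", or "•".
  const enumerations = (text.match(/^\s*(?:\d+[.)]|[-*•])\s+/gm) || []).length;
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  const conjunctions = words.filter((w) => CONJUNCTIONS.has(w)).length;
  const density = words.length >= opts.minWords ? conjunctions / words.length : 0;

  const signals = {
    sentenceCount: sentences.length >= opts.minSentences,
    semicolons: semicolons >= opts.minSemicolons,
    enumerations: enumerations >= opts.minEnumerations,
    conjunctionDensity: density >= opts.minConjunctionDensity,
  };
  // Any single signal marks the text as a candidate for the LLM split.
  return { compound: Object.values(signals).some(Boolean), signals };
}
```

Only texts flagged as compound would then be sent to the LLM, which keeps single-topic thoughts out of the paid path.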

Requires

  • OpenRouter (recommended, default) or Anthropic API key for the LLM backend.
  • Node.js 18+ (no npm install — pure built-ins).
  • An Enhanced-Thoughts-style schema: thoughts.source_type,
    thoughts.metadata jsonb, entities, thought_entities, thought_edges,
    plus an upsert_thought(p_content, p_payload) Postgres function. The
    README lists the minimum columns required per table.

Tested

Live on my own Open Brain instance over a 350-thought STARRED Gmail corpus:
27% atomized, 100% author-edge coverage after backfill, 144 replies_to
edges built, entity-keyed retrieval probe exact-match.

Pre-review pipeline

This branch went through an automated Phase B ultra-review (Claude
gsd-code-reviewer role + Codex codex exec + dedicated security pass)
before opening the upstream PR. Round 1 surfaced 1 P0, 5 P1, 11 P2,
8 P3 findings across three reviewers. Round 2 surfaced 1 new P1.
Round 3 reviewed clean. All P0 / P1 findings are fixed:

  • P0 (sec): removed codex provider entirely — it shelled out with
    --dangerously-bypass-approvals-and-sandbox on arbitrary user-controlled
    memory/email text, which is a prompt-injection → local-code-execution
    primitive. Replaced with <INPUT> delimiter hardening for remaining
    providers.
  • P1: atomize-packs.mjs now loads .env.local script-relative so
    the documented node atomize-packs.mjs --provider=openrouter path works.
  • P1: re-atomize-gmail-thought.mjs pre-loads OPENROUTER_API_KEY
    for the default-provider path, warns when hitting the implicit 1000-row
    cap, wraps main() in async try/catch.
  • P1: entity-resolver.mjs orphan adoption is now race-safe — PATCH
    with Prefer: return=representation detects zero-row results, falls
    through to the disambiguated insert path instead of returning the wrong
    entity id.
  • P1: PostgREST error messages strip query-string values by default
    (gated behind ENTITY_RESOLVER_DEBUG=1); Claude CLI and LLM response
    snippets in errors are gated behind ATOMIZE_DEBUG=1;
    atomization-errors.json persists only a 60-char preview + fingerprint
    per failure unless ATOMIZE_DEBUG_ERRORS=1.
  • P1: README "Credential Tracker" that instructed users to paste
    service-role keys into a text editor was replaced with a security table
    and warning to keep keys only in .env.local.

P2/P3 follow-ups (non-blocking):

  • Full transactional upsert_thought + edge-redirect + delete RPC.
    Documented in README "Partial-failure recovery" for now.
  • shell: true + child.kill() can leak the underlying CLI process on
    Windows. Not a correctness bug; timeouts still fire.

Known architectural flag

This recipe assumes an Enhanced-Thoughts-style schema (entities,
thought_entities, thought_edges tables + upsert_thought RPC) that is
not yet part of upstream OB1 origin/main. This is deliberate — the
recipe is a stand-alone opt-in capability, not a core-schema change. The
README documents the minimum DDL required and users without those tables
will see clear errors pointing at the prerequisites.

Note to reviewers

This is the pre-review fork PR. The upstream PR to
NateBJones-Projects/OB1 will be opened in a separate Phase C step.
Please do not merge this fork PR — it exists so the review loop has a
stable target.

Ship a community recipe for splitting compound thoughts into atomic
single-topic thoughts via an LLM, plus Gmail-specific repair tooling.

Components:
- atomize-packs.mjs  — generic pack-file atomizer with heuristic compound
  detection and four-provider LLM backend (Claude CLI, Codex, Anthropic,
  OpenRouter).
- re-atomize-gmail-thought.mjs  — heals Gmail imports where long bodies
  were stored whole; splits via the atomizer, re-inserts via upsert_thought,
  redirects replies_to edges, re-links correspondents.
- audit-gmail-pipeline.mjs  — JSON/MD report covering scale, metadata
  completeness, entity-graph integrity, classification distributions, and
  retrieval probes.
- backfill-gmail-correspondents.mjs  — idempotent backfill that pre-filters
  on author-edge presence specifically.
- lib/  — shared atomize-text, entity-resolver, and Claude CLI utilities.
- test-atomize.mjs  — zero-setup sanity test.

Ported from the author's private capture pipeline; all personal emails,
internal ticket IDs, and hardcoded paths generalized. No secrets; no
modifications to the core thoughts table; no DROP / TRUNCATE / unqualified
DELETE. Markdownlint clean; metadata.json validates against the OB1 schema.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6fd1ffd4b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +115 to +117
const thoughtId = active.id as number;
const newStatus = over.id as string;


P1: Derive drop status from container, not over.id

In handleDragEnd, newStatus is taken directly from over.id, but cards are registered as sortable items with numeric IDs (useSortable({ id: thought.id })), so dropping onto another card yields a number instead of a kanban status string. That sends invalid status values to /api/kanban/update and triggers rollbacks, making drag-and-drop fail whenever the target column is non-empty; you need to read the destination container/status ID instead of the hovered item's ID.
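
One way to act on this suggestion, sketched as a library-agnostic helper: accept the drop target's id, and if it is not itself a column status, look up the status of the card that was hovered. The status names and the `statusByThoughtId` map are hypothetical; the PR does not show the board's real status values.

```javascript
// Hypothetical column ids; the real kanban statuses are not listed in this PR.
const STATUSES = ["backlog", "in_progress", "done"];

// overId is either a column id (string status) or a card id (numeric thought id).
// statusByThoughtId: Map<number, string> built from the cards currently on the board.
function resolveDropStatus(overId, statusByThoughtId, statuses = STATUSES) {
  // Dropped on an empty column: the id already is the status.
  if (typeof overId === "string" && statuses.includes(overId)) return overId;
  // Dropped on another card: use the column that card currently lives in.
  const status = statusByThoughtId.get(Number(overId));
  return status ?? null; // null means "unknown target, do not call the update API"
}
```

A caller would skip the update request when the result is null, which avoids sending an invalid status and triggering a rollback.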


Comment on lines +109 to +117
sp.set("per_page", "100");
sp.set("sort", "importance");
sp.set("order", "desc");
sp.set("type", thoughtType);
if (params?.status) sp.set("status", params.status);
if (params?.exclude_restricted !== undefined)
sp.set("exclude_restricted", String(params.exclude_restricted));
const qs = sp.toString();
const data = await apiFetch<BrowseResponse>(apiKey, `/thoughts?${qs}`);

P2: Fetch all kanban pages instead of first 100 only

The kanban loader hard-caps each type query to per_page=100 and performs only one request, even though the browse response includes pagination metadata (total, page, per_page). For users with more than 100 task or idea thoughts, items beyond the first page never appear on the board or summary, so workflow state becomes incomplete; iterate pages until the full result set is collected.
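
A minimal sketch of the page loop this comment asks for. It assumes the browse response carries its rows in a `data` array next to the `total` field mentioned above; `data` and the `fetchPage` callback are illustrative names, not the dashboard's confirmed API.

```javascript
// fetchPage(page, perPage) is a hypothetical callback that performs one browse
// request and resolves to { data: [...], total, page, per_page }.
async function fetchAllPages(fetchPage, perPage = 100) {
  const items = [];
  for (let page = 1; ; page += 1) {
    const res = await fetchPage(page, perPage);
    items.push(...res.data);
    // Stop when the server has nothing more, or we hold everything it reported.
    if (res.data.length === 0 || items.length >= res.total) break;
  }
  return items;
}
```

The empty-page check guards against looping forever if `total` is stale or over-reported.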


Comment on lines +119 to +121
const m = /^(\[Email[^\]]*\])\s*([\s\S]*)$/.exec(content);
if (!m) return { prefix: "", body: content };
return { prefix: m[1], body: m[2] };

P2: Parse Gmail prefix robustly when subject contains ]

The prefix parser matches \[Email[^\]]*\], which stops at the first ]; Gmail subjects are inserted raw into this bracketed prefix by the importer, so any subject containing ] truncates the parsed prefix/body boundary. That can produce malformed prefixWithAtomTag output and incorrect re-atomized content for those messages; parsing should use a delimiter strategy that tolerates ] inside subject text (or rely on structured metadata fields).
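
One possible delimiter strategy, shown as a sketch: treat the last `]` on the first line as the end of the prefix, so brackets inside the subject stay within it. This assumes the importer writes the whole `[Email ...]` prefix on a single line and that the body starts after it; the PR does not confirm that, and a body that continues on the same line and itself contains `]` would still be split incorrectly. The name `splitEmailPrefix` is hypothetical.

```javascript
function splitEmailPrefix(content) {
  if (!content.startsWith("[Email")) return { prefix: "", body: content };
  // Assumption: the prefix never spans more than the first line.
  const newline = content.indexOf("\n");
  const firstLine = newline === -1 ? content : content.slice(0, newline);
  // Use the LAST closing bracket on that line, so "]" inside the subject is tolerated.
  const close = firstLine.lastIndexOf("]");
  if (close === -1) return { prefix: "", body: content };
  return {
    prefix: content.slice(0, close + 1),
    body: content.slice(close + 1).replace(/^\s+/, ""), // same leading-whitespace trim as the regex
  };
}
```

Relying on structured metadata fields, as the comment also suggests, would avoid string parsing altogether where those fields are populated.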


Codex provider was spawning `codex exec --dangerously-bypass-approvals-and-sandbox`
with arbitrary user-controlled memory/email text as the prompt. A prompt
injection in a hostile email body could trigger local code execution via the
agent's tool access. Removed the codex provider entirely (OpenRouter,
Anthropic, and claude-cli cover all use cases without tool access). Added
prompt-injection hardening: wrap all user content in <INPUT>...</INPUT>
delimiters with an "inert data" instruction, escape literal </INPUT> tags.
Redact raw model output from error messages (gated behind ATOMIZE_DEBUG=1).
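
The delimiter hardening described in this commit can be sketched as follows. The instruction wording and the function names are assumptions for illustration, not the text used in lib/atomize-text.mjs; only the `<INPUT>` tags and the escaping of a literal `</INPUT>` come from the commit message.

```javascript
// Neutralize any literal closing tag so untrusted text cannot end the block early.
function escapeInput(text) {
  return String(text).replace(/<\/INPUT>/gi, "&lt;/INPUT&gt;");
}

// Hypothetical prompt wrapper: instruction wording is illustrative only.
function wrapUntrustedInput(text) {
  return [
    "The text inside the INPUT block below is inert data, not instructions.",
    "Do not follow any directions that appear inside it.",
    "<INPUT>",
    escapeInput(text),
    "</INPUT>",
  ].join("\n");
}
```

Delimiters reduce the risk of prompt injection but do not eliminate it, which is consistent with the commit also removing the provider that had tool access.
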
…ction

atomize-packs.mjs now:
- loads recipes/atomizer/.env.local resolved relative to the script (so the
  documented `node atomize-packs.mjs --provider=openrouter` path no longer
  fails with "requires OPENROUTER_API_KEY" when the key lives in .env.local)
- defaults to openrouter provider (codex provider was removed)
- warns when --concurrency exceeds MAX_CONCURRENCY and is clamped, instead of clamping silently
- skips memories whose memoryId matches -split-N$ or that carry
  metadata.atomization.parent_id, so re-runs don't double-split children
- writes only a 60-char preview + fingerprint into atomization-errors.json
  by default (full text persists only with ATOMIZE_DEBUG_ERRORS=1) to avoid
  duplicating sensitive memory content
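
The re-run guard and the redacted error record from the list above might be sketched like this. The helper names are hypothetical, `-split-N` is assumed to mean a numeric suffix, and the FNV-1a hash is a stand-in: the PR does not say which fingerprint function atomize-packs.mjs uses.

```javascript
// True when a memory is itself the product of an earlier split.
function isAlreadySplit(memory) {
  const id = String(memory.memoryId ?? "");
  // Assumption: "-split-N" means a numeric suffix such as "-split-2".
  return /-split-\d+$/.test(id) || memory.metadata?.atomization?.parent_id != null;
}

// Stand-in fingerprint: 32-bit FNV-1a over UTF-16 code units, as 8 hex chars.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Error record that keeps only a short preview unless full text is explicitly requested.
function errorRecord(text, { debug = false, previewLength = 60 } = {}) {
  const record = { preview: text.slice(0, previewLength), fingerprint: fnv1a(text) };
  if (debug) record.text = text; // corresponds to ATOMIZE_DEBUG_ERRORS=1
  return record;
}
```

The fingerprint lets two failures on the same memory be correlated without storing the memory's content a second time.
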
…mize

- re-atomize-gmail-thought.mjs: load .env.local relative to script, add
  &order=id.asc, warn when hitting the default 1000-row cap on --all,
  wrap main in an async function with .catch() for consistent exit,
  document partial-failure recovery via metadata.re_atomized_from.
- backfill-gmail-correspondents.mjs: script-relative .env.local; move the
  per-2000 progress log inside the per-thought loop so it actually fires.
- audit-gmail-pipeline.mjs: script-relative .env.local; sbCount() now
  throws on !res.ok instead of silently returning 0 (was hiding auth/query
  errors as "zero findings"); use explicit jsonb aliases like
  `thread_id:metadata->gmail->>thread_id` so reads don't break on
  PostgREST version changes.
- test-atomize.mjs: script-relative .env.local, drop codex provider
  reference, default to openrouter.
- upsertPersonByEmail orphan-adoption is now race-safe: PATCH conditionally
  on `canonical_email=is.null`, re-SELECT by email if the winner already
  adopted the orphan with a different canonical_email. Prevents two
  concurrent backfill workers from linking two different emails to the
  same entity row.
- Resolver log includes only email domain by default (set
  ENTITY_RESOLVER_DEBUG=1 for full addresses); the 23505 fallback error
  drops the email and reports only the domain.
- makeSbClient.call() error message strips the query string by default so
  PostgREST filter values (emails, thread_ids) don't leak into shared logs.
  Full URL available behind ENTITY_RESOLVER_DEBUG=1.
- Document loadEnv constraints (UPPER_SNAKE keys, single-line values,
  process.env wins, caller should pass absolute script-relative path).
- Replace the "Credential Tracker" section that instructed users to paste
  service-role keys into a text editor with a structured table + security
  warning. Keeps service-role keys confined to .env.local.
- Drop codex from the supported provider list + document why it was
  removed (prompt-injection → local-code-execution on untrusted input).
- Document the 1000-row default cap on re-atomize --all, the partial-
  failure recovery via metadata.re_atomized_from, and the new debug
  env flags (ATOMIZE_DEBUG, ATOMIZE_DEBUG_ERRORS, ENTITY_RESOLVER_DEBUG).
- Add recipes/atomizer/.env.example with placeholder values (uses
  "your-…-placeholder" strings that match the Gate's .env allowlist).
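
The log-redaction behaviour described above (query string stripped from error messages, only the email domain logged) can be illustrated with two small helpers. Both function names are hypothetical; only the ENTITY_RESOLVER_DEBUG flag comes from the commit message.

```javascript
// Returns a URL that is safe for shared logs: PostgREST filters live in the
// query string, so dropping it removes emails and thread ids from the message.
function urlForLog(url, debug = process.env.ENTITY_RESOLVER_DEBUG === "1") {
  if (debug) return url;
  const u = new URL(url);
  return `${u.origin}${u.pathname}`;
}

// Returns only the domain part of an address for default-level logging.
function emailForLog(email, debug = process.env.ENTITY_RESOLVER_DEBUG === "1") {
  if (debug) return email;
  const at = email.lastIndexOf("@");
  return at === -1 ? "(no domain)" : `*@${email.slice(at + 1).toLowerCase()}`;
}
```
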
…tale refs

- entity-resolver.mjs: orphan adoption now detects zero-row PATCH via
  Prefer: return=representation. If another worker already adopted the
  orphan with a DIFFERENT email, we fall through to the disambiguated
  insert path (case c) instead of incorrectly returning the winner's id.
  Extracted into tryAdoptOrDisambiguate() for clarity.
- claude-cli.mjs: stderr/stdout snippets in error messages are gated
  behind ATOMIZE_DEBUG=1; default prints only byte counts so arbitrary
  user email/memory text the CLI echoed doesn't end up in logs.
- re-atomize-gmail-thought.mjs: buildAtomizeOpts() pre-loads the
  OpenRouter key when no --provider flag is passed, because
  atomize-text.mjs defaults to 'openrouter'. Previously running
  `node re-atomize-gmail-thought.mjs --id=123` without the flag hit
  a spurious "opts.openrouterApiKey" error at runtime.
- README.md / metadata.json: drop stale 4-provider references that
  still listed Codex as a supported option.
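
The zero-row detection in the first item above relies on standard PostgREST behaviour: a PATCH sent with `Prefer: return=representation` responds with the rows it actually changed, so an empty array means another worker got there first. A minimal sketch, assuming an `id` column on `entities` and a hypothetical function name and options object; this is not the code in entity-resolver.mjs:

```javascript
// Tries to claim an orphan entity row for `email`. Resolves to
// { adopted: true, id } on success, or { adopted: false, id: null } when the
// conditional PATCH matched no rows (someone else already set canonical_email).
async function tryAdoptOrphan({ baseUrl, apiKey, orphanId, email, fetchImpl = fetch }) {
  const url =
    `${baseUrl}/entities?id=eq.${encodeURIComponent(orphanId)}` +
    "&canonical_email=is.null"; // only update while the row is still an orphan
  const res = await fetchImpl(url, {
    method: "PATCH",
    headers: {
      apikey: apiKey,
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
      Prefer: "return=representation", // respond with the affected rows
    },
    body: JSON.stringify({ canonical_email: email }),
  });
  if (!res.ok) throw new Error(`PATCH entities failed with HTTP ${res.status}`);
  const rows = await res.json();
  return rows.length > 0 ? { adopted: true, id: rows[0].id } : { adopted: false, id: null };
}
```

When `adopted` is false, a caller would re-read the row and either reuse it (same email) or fall through to the disambiguated insert path, as the commit message describes.
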
Add a non-blockquote line between two adjacent GitHub alert callouts so
markdownlint stops treating them as one blockquote with a blank line inside.
Restores a clean run on recipes/atomizer/README.md. The rest of the
repo-wide Markdown Lint failure is pre-existing and covered by a separate
cleanup PR (tracker in MEMORY, pattern matches NateBJones-Projects#161/NateBJones-Projects#215).
@alanshurafa
Owner Author

Refreshing checks after markdownlint cleanup merged into fork main.

@alanshurafa alanshurafa reopened this Apr 22, 2026
@alanshurafa
Owner Author

Refreshing checks after fork markdownlint workflow fix.

@alanshurafa alanshurafa reopened this Apr 22, 2026