Skip to content

[recipes] Atomizer — generic + Gmail re-atomization toolkit#217

Open
alanshurafa wants to merge 9 commits intoNateBJones-Projects:mainfrom
alanshurafa:contrib/alanshurafa/atomizer
Open

[recipes] Atomizer — generic + Gmail re-atomization toolkit#217
alanshurafa wants to merge 9 commits intoNateBJones-Projects:mainfrom
alanshurafa:contrib/alanshurafa/atomizer

Conversation

@alanshurafa
Copy link
Copy Markdown
Contributor

Adds the Atomizer toolkit — a recipe for LLM-driven decomposition of large thoughts into smaller atomic facts, with Gmail re-atomization, pipeline audit, and correspondents backfill utilities.

What this adds

  • Generic atomizer — decomposes any large thought into atomic facts via LLM, preserving lineage through thought_edges.
  • Gmail re-atomizer — targeted pipeline for Gmail-sourced thoughts, with correspondents extraction and re-linking.
  • Pipeline audit — reports on atomization coverage, orphaned atoms, and graph integrity.
  • Correspondents backfill — resolves and deduplicates entities extracted from email metadata.
  • 4 LLM providers — Anthropic direct, OpenRouter, Claude CLI, and Codex CLI are all supported behind a pluggable provider interface.

Dependencies

Assumes the enhanced-thoughts / knowledge-graph schema:

  • Tables: entities, thought_entities, thought_edges
  • RPC: upsert_thought

These are part of the schema track in open upstream PRs (see #191 enhanced-thoughts and related). The README documents the minimum DDL so the recipe can be adopted independently of those PRs landing.

Review process

Generalized from Alan's ExoCortex work (2026-04 session) and cross-AI ultra-reviewed across 3 iteration rounds using Claude gsd-code-reviewer, Codex codex exec, and security-reviewer. Approximately 20 findings resolved across 7 fix commits. Notable fixes:

  • Removed Codex sandbox bypass (P0 security)
  • Prompt-injection hardening on untrusted thought bodies
  • PII redaction before LLM calls
  • Entity-resolver race condition fix
  • Env-loader idempotency

Known follow-ups (not blockers)

  • Non-transactional re-atomize pipeline — documented with a CAUTION block in the README; an RPC-based transactional version is a future enhancement.
  • Windows shell: true + process-tree leak — tracked for a separate PR.
  • Cost-budget pre-flight for --all runs — tracked for a separate PR.

Review history

Fork pre-review PR with full review trail: alanshurafa#19 (FIXED-AND-CLEAN after 3 rounds).

Ship a community recipe for splitting compound thoughts into atomic
single-topic thoughts via an LLM, plus Gmail-specific repair tooling.

Components:
- atomize-packs.mjs  — generic pack-file atomizer with heuristic compound
  detection and four-provider LLM backend (Claude CLI, Codex, Anthropic,
  OpenRouter).
- re-atomize-gmail-thought.mjs  — heals Gmail imports where long bodies
  were stored whole; splits via the atomizer, re-inserts via upsert_thought,
  redirects replies_to edges, re-links correspondents.
- audit-gmail-pipeline.mjs  — JSON/MD report covering scale, metadata
  completeness, entity-graph integrity, classification distributions, and
  retrieval probes.
- backfill-gmail-correspondents.mjs  — idempotent backfill that pre-filters
  on author-edge presence specifically.
- lib/  — shared atomize-text, entity-resolver, and Claude CLI utilities.
- test-atomize.mjs  — zero-setup sanity test.

Ported from the author's private capture pipeline; all personal emails,
internal ticket IDs, and hardcoded paths generalized. No secrets; no
modifications to the core thoughts table; no DROP / TRUNCATE / unqualified
DELETE. Markdownlint clean; metadata.json validates against the OB1 schema.
Codex provider was spawning `codex exec --dangerously-bypass-approvals-and-sandbox`
with arbitrary user-controlled memory/email text as the prompt. A prompt
injection in a hostile email body could trigger local code execution via the
agent's tool access. Removed the codex provider entirely (OpenRouter,
Anthropic, and claude-cli cover all use cases without tool access). Added
prompt-injection hardening: wrap all user content in <INPUT>...</INPUT>
delimiters with an "inert data" instruction, escape literal </INPUT> tags.
Redact raw model output from error messages (gated behind ATOMIZE_DEBUG=1).
…ction

atomize-packs.mjs now:
- loads recipes/atomizer/.env.local resolved relative to the script (so the
  documented `node atomize-packs.mjs --provider=openrouter` path no longer
  fails with "requires OPENROUTER_API_KEY" when the key lives in .env.local)
- defaults to openrouter provider (codex provider was removed)
- warns when --concurrency > MAX_CONCURRENCY is clamped, instead of silent
- skips memories whose memoryId matches -split-N$ or that carry
  metadata.atomization.parent_id, so re-runs don't double-split children
- writes only a 60-char preview + fingerprint into atomization-errors.json
  by default (full text persists only with ATOMIZE_DEBUG_ERRORS=1) to avoid
  duplicating sensitive memory content
…mize

- re-atomize-gmail-thought.mjs: load .env.local relative to script, add
  &order=id.asc, warn when hitting the default 1000-row cap on --all,
  wrap main in an async function with .catch() for consistent exit,
  document partial-failure recovery via metadata.re_atomized_from.
- backfill-gmail-correspondents.mjs: script-relative .env.local; move the
  per-2000 progress log inside the per-thought loop so it actually fires.
- audit-gmail-pipeline.mjs: script-relative .env.local; sbCount() now
  throws on !res.ok instead of silently returning 0 (was hiding auth/query
  errors as "zero findings"); use explicit jsonb aliases like
  `thread_id:metadata->gmail->>thread_id` so reads don't break on
  PostgREST version changes.
- test-atomize.mjs: script-relative .env.local, drop codex provider
  reference, default to openrouter.
- upsertPersonByEmail orphan-adoption is now race-safe: PATCH conditionally
  on `canonical_email=is.null`, re-SELECT by email if the winner already
  adopted the orphan with a different canonical_email. Prevents two
  concurrent backfill workers from linking two different emails to the
  same entity row.
- Resolver log includes only email domain by default (set
  ENTITY_RESOLVER_DEBUG=1 for full addresses); the 23505 fallback error
  drops the email and reports only the domain.
- makeSbClient.call() error message strips the query string by default so
  PostgREST filter values (emails, thread_ids) don't leak into shared logs.
  Full URL available behind ENTITY_RESOLVER_DEBUG=1.
- Document loadEnv constraints (UPPER_SNAKE keys, single-line values,
  process.env wins, caller should pass absolute script-relative path).
- Replace the "Credential Tracker" section that instructed users to paste
  service-role keys into a text editor with a structured table + security
  warning. Keeps service-role keys confined to .env.local.
- Drop codex from the supported provider list + document why it was
  removed (prompt-injection → local-code-execution on untrusted input).
- Document the 1000-row default cap on re-atomize --all, the partial-
  failure recovery via metadata.re_atomized_from, and the new debug
  env flags (ATOMIZE_DEBUG, ATOMIZE_DEBUG_ERRORS, ENTITY_RESOLVER_DEBUG).
- Add recipes/atomizer/.env.example with placeholder values (uses
  "your-…-placeholder" strings that match the Gate's .env allowlist).
…tale refs

- entity-resolver.mjs: orphan adoption now detects zero-row PATCH via
  Prefer: return=representation. If another worker already adopted the
  orphan with a DIFFERENT email, we fall through to the disambiguated
  insert path (case c) instead of incorrectly returning the winner's id.
  Extracted into tryAdoptOrDisambiguate() for clarity.
- claude-cli.mjs: stderr/stdout snippets in error messages are gated
  behind ATOMIZE_DEBUG=1; default prints only byte counts so arbitrary
  user email/memory text the CLI echoed doesn't end up in logs.
- re-atomize-gmail-thought.mjs: buildAtomizeOpts() pre-loads the
  OpenRouter key when no --provider flag is passed, because
  atomize-text.mjs defaults to 'openrouter'. Previously running
  `node re-atomize-gmail-thought.mjs --id=123` without the flag hit
  a spurious "opts.openrouterApiKey" error at runtime.
- README.md / metadata.json: drop stale 4-provider references that
  still listed Codex as a supported option.
Add a non-blockquote line between two adjacent GitHub alert callouts so
markdownlint stops treating them as one blockquote with a blank line inside.
Restores a clean run on recipes/atomizer/README.md. The rest of the
repo-wide Markdown Lint failure is pre-existing and covered by a separate
cleanup PR (tracker in MEMORY, pattern matches NateBJones-Projects#161/NateBJones-Projects#215).
@github-actions github-actions Bot added the schema Contribution: database extension label Apr 22, 2026
@alanshurafa
Copy link
Copy Markdown
Contributor Author

Refreshing upstream checks after fork-side readiness cleanup.

@alanshurafa alanshurafa reopened this Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

recipe Contribution: step-by-step recipe schema Contribution: database extension

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant