[integrations] Entity extraction worker#11
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a72f6cb8e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const dryRun = url.searchParams.get("dry_run") === "true"; | ||
|
|
||
| // Step 1: Claim queue items | ||
| const claimed = await claimQueueItems(limit); |
There was a problem hiding this comment.
Avoid claiming queue items during dry runs
When dry_run=true, the handler still executes claimQueueItems(limit), which updates queue rows to processing; later the dry-run branch exits without calling markComplete or markError. This means a preview request mutates production queue state and can leave items stuck in processing, so subsequent real runs will skip them until a manual reset.
Useful? React with 👍 / 👎.
| return []; | ||
| } | ||
|
|
||
| return pending; |
There was a problem hiding this comment.
Return only queue rows actually claimed
This function returns the originally selected pending rows even though the claim update is a separate statement; under concurrent workers, one worker can select rows, fail to update any of them because another worker claimed first, and still process those thoughts. That creates duplicate extraction work and can inflate graph edge support counts.
Useful? React with 👍 / 👎.
|
|
||
| if (thoughtError || !thought?.content) { | ||
| console.error(`Failed to fetch thought ${item.thought_id}:`, thoughtError); | ||
| if (!dryRun) await markError(item.thought_id, thoughtError?.message ?? "Thought not found", 0); |
There was a problem hiding this comment.
Preserve retry count on thought fetch failures
On thought lookup failure, markError is always called with attemptCount hardcoded to 0, so attempt_count is repeatedly reset to 1 instead of incrementing across retries. For missing/deleted thoughts this prevents reaching MAX_ATTEMPTS, causing perpetual requeueing instead of eventual terminal failed status.
Useful? React with 👍 / 👎.
* [recipes] Add repo learning coach recipe * [recipes] Harden repo learning coach sync and reads
…es-Projects#146) * [dashboards] Add Workflow kanban board with drag-and-drop, mobile support, and MCP progress_task tool Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [dashboards] Mobile UX fixes: modal centering, landscape layout, touch drag-and-drop - Fix modal positioning with createPortal to escape DnD transform context - Add phone landscape CSS to hide sidebar and show mobile topbar - Switch to MouseSensor + TouchSensor for proper mobile drag delay - Add touchAction pan-y for scroll + drag coexistence - Add allowedDevOrigins for mobile dev testing - Add suppressHydrationWarning for browser extension compatibility Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [dashboards] Allow pinch-to-zoom on kanban cards Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [schemas] Add workflow status tracking columns for kanban board Adds status and status_updated_at columns to the thoughts table, enabling kanban-style workflow management for task and idea types. Includes migration SQL, backfill for existing thoughts, and partial index. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [dashboards] Add Workflow kanban board with drag-and-drop and mobile support Adds a full kanban board interface for managing task and idea thoughts: - Drag-and-drop between status columns (New/Planning/Active/Review/Done) - Touch-friendly with 200ms hold delay, pinch-to-zoom enabled - Collapsible columns with localStorage persistence - Inline edit modal for status, priority, type, and content - Dashboard summary widget showing active workflow items - Mobile-first responsive layout with full-screen edit on small screens - @dnd-kit for accessible drag-and-drop (mouse + touch sensors) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [dashboards] Add delete button to kanban card edit modal Adds a Delete button in the kanban card modal footer with a confirmation banner before permanently deleting the thought. Wires up a new /api/kanban/delete route and optimistic removal from the board. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [dashboards] Make delete confirmation a separate popup dialog Replace the inline banner with a standalone centered dialog that overlays on top of the edit modal, with clear title, description, and Cancel/Delete buttons. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [dashboards] Fix deleteThought parsing empty response body The REST API returns an empty body on DELETE, but apiFetch always called res.json() causing a parse error. Inline the fetch so it skips JSON parsing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Ivan <ivan@openbrain.dev> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng (NateBJones-Projects#141) Syncs Claude Code's local memory saves to Open Brain via mcp__open-brain__capture_thought so memories are accessible from ChatGPT, Claude Desktop, Codex, and any MCP-connected client. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fix skill divergence (NateBJones-Projects#135) * [recipes] Update life-engine schema: user_id TEXT, add weekly_review/cron_state types - Changed user_id from UUID to TEXT across all 5 tables (supports Telegram chat_id as identifier without UUID padding hacks) - Added weekly_review and cron_state to briefing_type check constraint Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [recipes] Clean up Life Engine: add state table, simplify loop timing, fix skill divergence - Add life_engine_state key-value table for runtime state (cron job ID, sleep schedule) instead of overloading briefing log with cron_state type - Remove cron_state from briefing_type CHECK constraint - Simplify Dynamic Loop Timing from 6 tiers to 4 (15m/30m/60m/one-shot) - Replace duplicate embedded skill in README with pointer to life-engine-skill.md - Add user_responded update logic to Rule 7 for self-improvement engagement tracking - Add timezone note to skill time windows - Fix platform references to include Discord alongside Telegram - Add RLS comment explaining why no row policies are needed - Update metadata.json date Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [recipes] Harden Life Engine permissions: lead with settings.json allowlist, scope MCP tools - Restructure Step 6 to recommend settings.json allowlist as default (Option A) - Replace broad mcp__open-brain__* and mcp__supabase__* wildcards with specific tool names (search_thoughts, list_thoughts, execute_sql, etc.) - Include CronCreate and CronDelete in the default allowlist - Demote --dangerously-skip-permissions to Option D (testing only) - Update Quick Setup and Step 7 launch commands to use settings.json approach - Addresses HIGH finding from security audit Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [recipes] Add rain forecast to Life Engine morning briefing via Open-Meteo - Add Weather section to skill with Open-Meteo API call (free, no API key) - Include rain windows with time ranges and probability in morning briefing - Default coordinates: Portland, OR (45.52, -122.68), configurable via life_engine_state - Only show rain line when precipitation_probability >= 30% - Update schema comment to document latitude/longitude state keys Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [recipes] Add Daily Capture, portable customizations, and manual sync rule to Life Engine Backport portable customizations from installed SKILL.md into the recipe: date anchor, database note, user identity, valid briefing types, proactive chat_id, rules 9-14. Add Daily Capture prompt in evening window with capture_thought integration. Add Rule 14 requiring manual sync between recipe and installed skill files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [recipes] Fix hallucinated column name: briefings table uses 'content' not 'summary' Add explicit column reference note to prevent the LLM from hallucinating a 'summary' column on life_engine_briefings — the correct column is 'content'. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [recipes] Address PR review: Discord support, migration steps, permission docs Fixes all issues from PR NateBJones-Projects#135 review: - P1: Add Bash(date/curl) and capture_thought to README allowlist examples - P1: Make channel event handling platform-agnostic (Telegram + Discord) in skill Rules 7, 10, 11 and Channel Tools section - P1: Add upgrade migration steps to schema.sql for user_id UUID→TEXT - P2: Add CHECK constraint on delivered_via ('telegram', 'discord') - P2: Add single-user assumption comment on life_engine_state table - Bump version to 1.1.0, update date to 2026-04-01 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [recipes] Broaden Bash permission to Bash(*) — scoped patterns are fragile Scoped Bash patterns like Bash(date *) and Bash(curl -s *api.open-meteo.com*) break when the LLM varies its exact command syntax between runs, causing silent permission blocks during unattended operation. Replace with Bash(*) since Life Engine only uses benign read-only commands (date, curl) and Rule 11 prevents dangerous execution from external triggers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…teBJones-Projects#125) Replaces the empty stub with a working zero-infrastructure approach using Claude Code scheduled tasks + Open Brain MCP + Gmail MCP. Preserves the Edge Function approach as a planned future option. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…es-Projects#37) * [recipes] Vercel + Neon + Telegram alternative architecture Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [fix] Replace local MCP pattern with custom connectors (PR review feedback) Replace claude_desktop_config.json + mcp-remote bridge instructions with Claude Desktop custom connectors UI approach in both Step 8 and the Troubleshooting section, aligning with CONTRIBUTING.md Rule #14. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…BJones-Projects#171) * [recipes] ChatGPT import v2: multi-thought knowledge extraction Replace 1-3 sentence summarization with structured knowledge extraction that produces 2-5 typed thoughts per conversation (decision, preference, learning, context, brainstorm, reference) with enriched metadata. Key changes to import-chatgpt.py: - Branch resolution via current_node parent-pointer walk - Content type dispatch for 14 export message formats (voice, reasoning, web search, code) - Signal-based filtering replaces regex title matching - Session boundary detection for multi-day conversations - Semantic deduplication via match_thoughts RPC - Re-import handling with update_time/content_hash detection - Embed thought content, not [ChatGPT: title] prefix - --store-conversations for optional conversation history with pyramid summaries - --focus flag with presets (tech, strategy, personal, creative) and custom text - --openrouter-model flag for model selection - --max-words flag to skip oversized conversations (default: 50000) - Robust JSON parsing for non-OpenAI models (Anthropic, Ollama) - Accurate progress display with percentage and skip counts New files: - chatgpt_parser.py: parsing, content dispatch, filtering, session detection - schema.sql: chatgpt_conversations table with pyramid summaries and indexes All existing CLI flags preserved (--dry-run, --model ollama, --after/--before, --limit, --report, --verbose, --raw, --ingest-endpoint). * [recipes] Fix ChatGPT import filtering defaults --------- Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
NateBJones-Projects#160) * [recipes] Local Ollama embeddings — zero-cost alternative to OpenRouter Generate embeddings locally via Ollama and insert into Supabase. Keeps the existing OB1 architecture, only swaps the embedding provider. Five models tested including gte-qwen2-1.5b (1536-dim) which is drop-in compatible with the default Open Brain schema. Includes quality benchmarks comparing discrimination power across all five models. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix markdown lint errors in README Add blank lines around fenced code blocks (MD031) and merge consecutive blockquotes (MD028). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [recipes] Fix local Ollama env loading docs --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…s-Projects#150) * [docs] Fix MD028 blank line between blockquotes in getting-started guide Removes blank line between WARNING and IMPORTANT blockquotes that was failing markdownlint across all PRs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix claudeception recipe: convert multi-line YAML descriptions to single-line Multi-line descriptions (description: |) break agent routing silently. Nate's March 2026 Skills Standard requires single-line YAML descriptions for reliable semantic matching. Fixed 3 instances: the recipe's own description and 2 template examples. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [recipes] Clean up Claudeception docs formatting --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…NateBJones-Projects#148) * fix(professional-crm): remove Accept header patch causing SSE reconnect loop The Accept: text/event-stream header patch forced StreamableHTTPTransport into SSE mode on every request. Since Supabase edge functions are stateless, the SSE stream terminates immediately after each response — causing the MCP client to reconnect every ~2 seconds (~43k invocations/day). StreamableHTTPTransport is request/response by design. Removing the patch lets it respond with plain JSON, eliminating the reconnect loop entirely. * fix(professional-crm): force JSON-only Accept header to prevent SSE reconnect loop Removing text/event-stream from the Accept header before it reaches StreamableHTTPTransport prevents it from opening SSE streams. MCP clients send Accept: application/json, text/event-stream per spec -- this is what triggers SSE mode even without the original workaround. JSON-only responses close cleanly, eliminating the boot/shutdown cycle.
…ateBJones-Projects#139) * recipes: add adaptive capture classification with confidence gating * recipes: address review — fix author, OB1 types, add TypeScript implementation * recipes: incorporate GitHub edits to README, classifier prompt, and metadata * [recipes] Tighten adaptive capture setup and threshold updates --------- Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…ateBJones-Projects#133) * Add update_professional_contact tool to CRM extension Adds the ability to update existing contact fields (name, company, title, email, phone, tags, notes, follow_up_date, etc.) which was proposed in NateBJones-Projects#93 but never implemented. Only provided fields are updated, and the existing updated_at trigger handles timestamping. * Allow clearing follow_up_date by passing null or empty string Fixes the case where a follow-up date, once set, could never be cleared — leaving contacts permanently stuck in get_follow_ups_due. * [extensions] Document contact update tool --------- Co-authored-by: Matt Hallett <matthallett@gmail.com> Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…es-Projects#161) * Fix pre-existing markdownlint errors across 15 files Add blank lines around headings (MD022), fenced code blocks (MD031), and between adjacent blockquotes (MD028). Fix broken link fragment (MD051) and remove extra blank line (MD012). No content changes. These errors were blocking CI on all open PRs since the lint check runs repo-wide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [docs] Preserve README links during markdown cleanup --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…raphics (NateBJones-Projects#85) * [recipes] Infographic Generator: turn research docs into visual infographics Second recipe from @jaredirish. Part of the Open Brain Flywheel (capture-process-visualize loop, see Issue NateBJones-Projects#84). Takes any markdown doc or Open Brain thought cluster and generates professional infographic images via Gemini's free-tier API. Auto-chunks content, writes verbose prompts (300+ words each), generates PNGs with specific colors/layout/typography. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [recipes] Fix broken relative links in infographic-generator README ../brain-dump-processor/ → ../panning-for-gold/ ../auto-capture-protocol/ → ../auto-capture/ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [recipes] Address review feedback on infographic generator - Sync generate.py with working local version (cleaner error handling, fix --redo display counter bug) - Fix auto-capture link: directory doesn't exist until PR NateBJones-Projects#42 merges, so link to the PR instead of a non-existent directory Note: part.as_image() and gemini-2.5-flash-image are both valid per the official google-genai SDK docs. Reviewer concerns on those were based on outdated information. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [recipes] Fix infographic redo progress output --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
* [recipes] Add OB-Graph knowledge graph layer Adds graph database functionality for Open Brain using PostgreSQL nodes + edges with recursive CTE traversal. Includes schema, MCP server with 10 tools, and setup documentation. https://claude.ai/code/session_015Z8wCeokTMTdrVMthqzGKJ * [recipes] Clarify OB-Graph deployment setup --------- Co-authored-by: Claude <noreply@anthropic.com>
* [docs] Fix Cursor MCP connection — use native url field, not mcp-remote mcp-remote@latest now attempts OAuth client registration before sending custom headers, which breaks against Open Brain's simple key-based auth. Cursor supports remote MCP servers natively via the url field, so mcp-remote is unnecessary. Changes: - Add dedicated Cursor section to getting-started guide (7.5) and remote-mcp primitive with native url config - Update mcp-remote examples to pass key via ?key= query parameter instead of --header to avoid OAuth discovery issues - Clarify x-brain-key (core) vs x-access-key (extensions) in troubleshooting guides Made-with: Cursor * [primitives] Bring remote MCP docs in line with repo format --------- Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
* [skills] Add weekly signal diff skill pack * [skills] Fix markdownlint numbering in weekly signal diff
…rojects#181) * [recipes] Add Bring Your Own Context recipe * [recipes] Fix markdownlint regression in activation README
* [repo] Sweep fix-now backlog issues * [docs] Fix setup-guide markdownlint regression
- dry_run now uses peekQueueItems() (read-only SELECT) instead of claimQueueItems(), so items stay "pending" during preview runs - claimQueueItems() returns only rows actually claimed via .select(), preventing race conditions where concurrent workers see stale results - markError() clears started_at and worker_version when resetting to "pending" so retryable items don't appear stale in monitoring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why: Schema stores thoughts.id / entity_extraction_queue.thought_id as UUID (gen_random_uuid()), not BIGINT. TypeScript types were declaring number, which is a load-bearing lie: PostgREST returns UUIDs as strings, and any arithmetic or Number-coerce on a consumer path would produce NaN. Updated claimQueueItems, peekQueueItems, markComplete, markError, linkThoughtEntity signatures to string. Entity IDs remain number (BIGSERIAL).
Why: The worker had no upper bound on LLM spend — a misconfigured cron on a large queue could mint unbounded OpenRouter/OpenAI/Anthropic cost before anyone noticed. Added ENTITY_EXTRACTION_MAX_CALLS env (default 10000, 0 = unlimited), a module-scoped llmCallCount counter, a pre-call gate that throws ExtractionCostCapError when the cap is reached, and graceful abort in the main loop that returns remaining claimed rows to 'pending' so the next invocation can resume. Summary now reports truncated / truncated_reason / llm_calls so callers can observe the cap firing.
Why: Deno's global fetch has no body-read timeout on Supabase Edge. A stuck OpenRouter / OpenAI / Anthropic upstream would hang the invocation until the 150s platform wall-clock killed it, leaving claimed rows in 'processing' with no status update. Added fetchWithTimeout() helper that wraps AbortController around fetch, defaulting to 60s and overridable via FETCH_TIMEOUT_MS. All three LLM call sites in extractEntities now route through it. Timeouts surface as thrown errors, caught by the per-item try/catch, and flow through markError — so the standard retry-then-fail path handles them (reset to 'pending' on attempt < 5, 'failed' on cap).
Why: Thought content was interpolated directly into the LLM prompt, giving
any captured text (emails, browser history, Slack dumps) a direct channel
to override the extraction instructions — which would then flow unescaped
into entities.canonical_name, a TEXT column rendered by dashboards and
MCP tools (stored XSS vector). Four layered defenses:
1. Wrap content in <thought_content>...</thought_content> tags and tell the
model explicitly that content inside is untrusted data, not instructions.
2. Escape literal tag occurrences in content so an attacker can't forge a
close-tag and break out of the wrapper.
3. Enforce response_format: { type: "json_object" } on OpenRouter (OpenAI
already had it) so prose wrapping doesn't crash the JSON parser.
4. sanitizeEntityName() strips control chars and clips to 200 chars before
entities land in the DB — caps the blast radius of a surviving injection.
Why: Supabase Edge Functions hard-kill at 150s. With limit=50 and LLM calls averaging 3s each, cumulative latency alone could exceed the budget — killing the invocation mid-loop and leaving rows stuck in 'processing' with no recovery except the manual SQL from the README. Added startTime at invocation, INVOCATION_BUDGET_MS = 140000 (30s headroom), and a per-item gate that releases remaining claimed rows back to 'pending' so the next invocation picks them up. Surfaces as summary.truncated=true with truncated_reason='wall_clock_budget', plus elapsed_ms for monitoring.
Why: The knowledge-graph schema's queue_entity_extraction trigger re-queues
a thought when its content changes. The worker then re-runs extraction but
only upserted new thought_entities rows — it never deleted links from the
prior extraction. So editing a thought from {Alice, Bob, PostgreSQL} to
{Alice, Redis} ended up with the thought linked to all four entities. Over
time this silently corrupts the graph: edges.support_count inflates with
thoughts that no longer mention the underlying entity. Delete our own
prior links (scoped to source='entity_worker' so we don't clobber links
from other sources) before re-writing. Non-fatal if DELETE fails — we
still attempt the upserts, because missed extraction is worse than drift.
Why: The code fixes for BLOCKER-3 (ENTITY_EXTRACTION_MAX_CALLS), BLOCKER-4 (FETCH_TIMEOUT_MS), and WARNING-2 (wall-clock budget) added env knobs and summary fields (truncated, truncated_reason, llm_calls, elapsed_ms) that weren't surfaced anywhere users would see. Also documented the 'skipped' queue state per INFO-1 — the worker marks system-generated thoughts (metadata.generated_by) as skipped, and until now only the schema comment knew that. Added queue status reference and a short note that dry-run leaves the queue untouched.
6bcd022 to
322ba57
Compare
Summary
entity_extraction_queueto build a knowledge graphschemas/knowledge-graph(PR [schemas] Knowledge graph tables and extraction trigger #5) for entities, edges, thought_entities, and queue tablesWhat It Does
Processes pending items from the extraction queue in batches. For each thought, calls an LLM to extract named entities (person, project, topic, tool, organization, place) and relationships (works_on, uses, related_to, etc.), then upserts into the graph tables.
Key Features
metadata.generated_byFiles
index.ts_shared/helpers.ts_shared/config.tsREADME.mdmetadata.jsondeno.jsonTest plan
🤖 Generated with Claude Code