[integrations] Entity extraction worker #11

Open
alanshurafa wants to merge 32 commits into main from contrib/alanshurafa/entity-extraction-worker

@alanshurafa
Owner

Summary

  • Async Supabase Edge Function that drains the entity_extraction_queue to build a knowledge graph
  • Ported from ExoCortex production entity-extraction-worker with OB1 adaptations
  • Depends on schemas/knowledge-graph (PR [schemas] Knowledge graph tables and extraction trigger #5) for entities, edges, thought_entities, and queue tables

What It Does

Processes pending items from the extraction queue in batches. For each thought, calls an LLM to extract named entities (person, project, topic, tool, organization, place) and relationships (works_on, uses, related_to, etc.), then upserts into the graph tables.

Key Features

  • Batch processing with atomic queue claiming (no duplicate work)
  • Retry/backoff — up to 5 attempts before permanent failure
  • Dry-run mode — preview extractions without writing
  • Symmetric edge dedup — canonical ordering for co_occurs_with/related_to
  • System-generated skip — ignores thoughts with metadata.generated_by
  • OpenRouter-first LLM provider order (OB1 standard)
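
The symmetric edge dedup listed above could look roughly like this. The PR only says "canonical ordering", so the smaller-ID-first rule, the `canonicalizeEdge` name, and the `EdgeKey` shape are assumptions made for illustration.

```typescript
// Hypothetical sketch, not the worker's actual code.
interface EdgeKey {
  fromId: number;
  toId: number;
  relation: string;
}

// Relations the PR summary names as symmetric.
const SYMMETRIC_RELATIONS = new Set<string>(["co_occurs_with", "related_to"]);

// For symmetric relations, store the smaller entity ID first so that
// (a, b) and (b, a) map to the same row. Directed relations keep their order.
function canonicalizeEdge(fromId: number, toId: number, relation: string): EdgeKey {
  if (SYMMETRIC_RELATIONS.has(relation) && fromId > toId) {
    return { fromId: toId, toId: fromId, relation };
  }
  return { fromId, toId, relation };
}
```

With this rule, `canonicalizeEdge(7, 3, "related_to")` and `canonicalizeEdge(3, 7, "related_to")` produce the same key, while a directed relation such as `works_on` is left untouched.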

Files

| File | Lines | Purpose |
|------|-------|---------|
| index.ts | 533 | Worker with queue management, LLM extraction, graph upserts |
| _shared/helpers.ts | 770 | Shared utilities (from enhanced-mcp) |
| _shared/config.ts | 204 | Constants and types |
| README.md | 170 | Setup guide with backfill SQL, API ref, troubleshooting |
| metadata.json | 18 | OB1 contribution metadata |
| deno.json | 5 | Deno import map |

Test plan

  • Verify all gate checks pass
  • Validate metadata.json against schema
  • Confirm README has prerequisites, numbered steps, expected outcome
  • Confirm "05-tool-audit" string in README
  • Deploy to test project, enqueue thoughts, run worker, verify graph tables populated

🤖 Generated with Claude Code

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a72f6cb8e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const dryRun = url.searchParams.get("dry_run") === "true";

// Step 1: Claim queue items
const claimed = await claimQueueItems(limit);

P1: Avoid claiming queue items during dry runs

When dry_run=true, the handler still executes claimQueueItems(limit), which updates queue rows to processing; later the dry-run branch exits without calling markComplete or markError. This means a preview request mutates production queue state and can leave items stuck in processing, so subsequent real runs will skip them until a manual reset.
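
One way to address this is to give dry runs a read-only path. The sketch below uses an in-memory array as a stand-in for the queue table; `selectItems`, the `QueueItem` shape, and the function signatures are assumptions for illustration, not the worker's real code.

```typescript
// Hypothetical in-memory stand-in for entity_extraction_queue rows.
type QueueStatus = "pending" | "processing" | "failed" | "skipped";

interface QueueItem {
  thought_id: string;
  status: QueueStatus;
}

// Read-only preview: returns copies of pending rows and mutates nothing.
function peekQueueItems(queue: QueueItem[], limit: number): QueueItem[] {
  return queue
    .filter((item) => item.status === "pending")
    .slice(0, limit)
    .map((item) => ({ ...item }));
}

// Claiming flips rows to "processing", so it should only run for real runs.
function claimQueueItems(queue: QueueItem[], limit: number): QueueItem[] {
  const claimed = queue.filter((item) => item.status === "pending").slice(0, limit);
  for (const item of claimed) item.status = "processing";
  return claimed;
}

// Dry runs peek; real runs claim.
function selectItems(queue: QueueItem[], limit: number, dryRun: boolean): QueueItem[] {
  return dryRun ? peekQueueItems(queue, limit) : claimQueueItems(queue, limit);
}
```

The key property is that a preview request leaves every row in `pending`, so a later real run still sees the same work.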

return [];
}

return pending;

P1: Return only queue rows actually claimed

This function returns the originally selected pending rows even though the claim update is a separate statement; under concurrent workers, one worker can select rows, fail to update any of them because another worker claimed first, and still process those thoughts. That creates duplicate extraction work and can inflate graph edge support counts.
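
The race can be illustrated with a conditional update that reports which rows it actually changed, similar in spirit to an `UPDATE ... WHERE status = 'pending' RETURNING`. The `Row` shape, `claimed_by` field, and `claimPending` helper below are hypothetical and operate on an in-memory array rather than the real table.

```typescript
// Hypothetical in-memory model of the queue table.
interface Row {
  thought_id: string;
  status: "pending" | "processing";
  claimed_by: string | null;
}

// Flip a candidate row only if it is still pending, and return exactly the
// rows this call changed. A worker that lost the race gets an empty array
// instead of the stale rows it selected earlier.
function claimPending(table: Row[], candidateIds: string[], workerId: string): Row[] {
  const claimed: Row[] = [];
  for (const row of table) {
    if (candidateIds.includes(row.thought_id) && row.status === "pending") {
      row.status = "processing";
      row.claimed_by = workerId;
      claimed.push(row);
    }
  }
  return claimed;
}
```

If two workers select the same candidate IDs, whichever runs `claimPending` second receives nothing and therefore has nothing to process.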

if (thoughtError || !thought?.content) {
console.error(`Failed to fetch thought ${item.thought_id}:`, thoughtError);
if (!dryRun) await markError(item.thought_id, thoughtError?.message ?? "Thought not found", 0);

P2: Preserve retry count on thought fetch failures

On thought lookup failure, markError is always called with attemptCount hardcoded to 0, so attempt_count is repeatedly reset to 1 instead of incrementing across retries. For missing/deleted thoughts this prevents reaching MAX_ATTEMPTS, causing perpetual requeueing instead of eventual terminal failed status.
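
A minimal sketch of the intended behavior: derive the next count from the row's current `attempt_count` rather than a hardcoded 0. The `QueueRow` shape, the `last_error` field, and this pure-function form of `markError` are assumptions; the real function writes to the database.

```typescript
// PR summary: up to 5 attempts before permanent failure.
const MAX_ATTEMPTS = 5;

// Hypothetical row shape; last_error is an illustrative field name.
interface QueueRow {
  thought_id: string;
  status: "pending" | "processing" | "failed";
  attempt_count: number;
  last_error: string | null;
}

// Increment from the row's own attempt_count so retries accumulate.
// Below the cap the row returns to "pending"; at the cap it becomes "failed".
function markError(row: QueueRow, message: string): QueueRow {
  const attempts = row.attempt_count + 1;
  return {
    ...row,
    attempt_count: attempts,
    last_error: message,
    status: attempts >= MAX_ATTEMPTS ? "failed" : "pending",
  };
}
```

Applied repeatedly to a missing thought, the row stays `pending` for four failures and reaches `failed` on the fifth instead of being requeued indefinitely.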

@github-actions (bot) added the documentation and recipe labels Apr 6, 2026
justfinethanku and others added 26 commits April 12, 2026 21:57
* [recipes] Add repo learning coach recipe

* [recipes] Harden repo learning coach sync and reads
…es-Projects#146)

* [dashboards] Add Workflow kanban board with drag-and-drop, mobile support, and MCP progress_task tool

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [dashboards] Mobile UX fixes: modal centering, landscape layout, touch drag-and-drop

- Fix modal positioning with createPortal to escape DnD transform context
- Add phone landscape CSS to hide sidebar and show mobile topbar
- Switch to MouseSensor + TouchSensor for proper mobile drag delay
- Add touchAction pan-y for scroll + drag coexistence
- Add allowedDevOrigins for mobile dev testing
- Add suppressHydrationWarning for browser extension compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [dashboards] Allow pinch-to-zoom on kanban cards

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [schemas] Add workflow status tracking columns for kanban board

Adds status and status_updated_at columns to the thoughts table,
enabling kanban-style workflow management for task and idea types.
Includes migration SQL, backfill for existing thoughts, and partial index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [dashboards] Add Workflow kanban board with drag-and-drop and mobile support

Adds a full kanban board interface for managing task and idea thoughts:
- Drag-and-drop between status columns (New/Planning/Active/Review/Done)
- Touch-friendly with 200ms hold delay, pinch-to-zoom enabled
- Collapsible columns with localStorage persistence
- Inline edit modal for status, priority, type, and content
- Dashboard summary widget showing active workflow items
- Mobile-first responsive layout with full-screen edit on small screens
- @dnd-kit for accessible drag-and-drop (mouse + touch sensors)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [dashboards] Add delete button to kanban card edit modal

Adds a Delete button in the kanban card modal footer with a confirmation
banner before permanently deleting the thought. Wires up a new
/api/kanban/delete route and optimistic removal from the board.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [dashboards] Make delete confirmation a separate popup dialog

Replace the inline banner with a standalone centered dialog that
overlays on top of the edit modal, with clear title, description,
and Cancel/Delete buttons.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [dashboards] Fix deleteThought parsing empty response body

The REST API returns an empty body on DELETE, but apiFetch always
called res.json() causing a parse error. Inline the fetch so it
skips JSON parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Ivan <ivan@openbrain.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng (NateBJones-Projects#141)

Syncs Claude Code's local memory saves to Open Brain via
mcp__open-brain__capture_thought so memories are accessible
from ChatGPT, Claude Desktop, Codex, and any MCP-connected client.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fix skill divergence (NateBJones-Projects#135)

* [recipes] Update life-engine schema: user_id TEXT, add weekly_review/cron_state types

- Changed user_id from UUID to TEXT across all 5 tables (supports
  Telegram chat_id as identifier without UUID padding hacks)
- Added weekly_review and cron_state to briefing_type check constraint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [recipes] Clean up Life Engine: add state table, simplify loop timing, fix skill divergence

- Add life_engine_state key-value table for runtime state (cron job ID,
  sleep schedule) instead of overloading briefing log with cron_state type
- Remove cron_state from briefing_type CHECK constraint
- Simplify Dynamic Loop Timing from 6 tiers to 4 (15m/30m/60m/one-shot)
- Replace duplicate embedded skill in README with pointer to life-engine-skill.md
- Add user_responded update logic to Rule 7 for self-improvement engagement tracking
- Add timezone note to skill time windows
- Fix platform references to include Discord alongside Telegram
- Add RLS comment explaining why no row policies are needed
- Update metadata.json date

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [recipes] Harden Life Engine permissions: lead with settings.json allowlist, scope MCP tools

- Restructure Step 6 to recommend settings.json allowlist as default (Option A)
- Replace broad mcp__open-brain__* and mcp__supabase__* wildcards with
  specific tool names (search_thoughts, list_thoughts, execute_sql, etc.)
- Include CronCreate and CronDelete in the default allowlist
- Demote --dangerously-skip-permissions to Option D (testing only)
- Update Quick Setup and Step 7 launch commands to use settings.json approach
- Addresses HIGH finding from security audit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [recipes] Add rain forecast to Life Engine morning briefing via Open-Meteo

- Add Weather section to skill with Open-Meteo API call (free, no API key)
- Include rain windows with time ranges and probability in morning briefing
- Default coordinates: Portland, OR (45.52, -122.68), configurable via life_engine_state
- Only show rain line when precipitation_probability >= 30%
- Update schema comment to document latitude/longitude state keys

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [recipes] Add Daily Capture, portable customizations, and manual sync rule to Life Engine

Backport portable customizations from installed SKILL.md into the recipe:
date anchor, database note, user identity, valid briefing types, proactive
chat_id, rules 9-14. Add Daily Capture prompt in evening window with
capture_thought integration. Add Rule 14 requiring manual sync between
recipe and installed skill files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [recipes] Fix hallucinated column name: briefings table uses 'content' not 'summary'

Add explicit column reference note to prevent the LLM from hallucinating
a 'summary' column on life_engine_briefings — the correct column is 'content'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [recipes] Address PR review: Discord support, migration steps, permission docs

Fixes all issues from PR NateBJones-Projects#135 review:
- P1: Add Bash(date/curl) and capture_thought to README allowlist examples
- P1: Make channel event handling platform-agnostic (Telegram + Discord)
  in skill Rules 7, 10, 11 and Channel Tools section
- P1: Add upgrade migration steps to schema.sql for user_id UUID→TEXT
- P2: Add CHECK constraint on delivered_via ('telegram', 'discord')
- P2: Add single-user assumption comment on life_engine_state table
- Bump version to 1.1.0, update date to 2026-04-01

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [recipes] Broaden Bash permission to Bash(*) — scoped patterns are fragile

Scoped Bash patterns like Bash(date *) and Bash(curl -s *api.open-meteo.com*)
break when the LLM varies its exact command syntax between runs, causing
silent permission blocks during unattended operation. Replace with Bash(*)
since Life Engine only uses benign read-only commands (date, curl) and
Rule 11 prevents dangerous execution from external triggers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…teBJones-Projects#125)

Replaces the empty stub with a working zero-infrastructure approach
using Claude Code scheduled tasks + Open Brain MCP + Gmail MCP.
Preserves the Edge Function approach as a planned future option.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…es-Projects#37)

* [recipes] Vercel + Neon + Telegram alternative architecture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [fix] Replace local MCP pattern with custom connectors (PR review feedback)

Replace claude_desktop_config.json + mcp-remote bridge instructions with
Claude Desktop custom connectors UI approach in both Step 8 and the
Troubleshooting section, aligning with CONTRIBUTING.md Rule #14.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…BJones-Projects#171)

* [recipes] ChatGPT import v2: multi-thought knowledge extraction

Replace 1-3 sentence summarization with structured knowledge extraction
that produces 2-5 typed thoughts per conversation (decision, preference,
learning, context, brainstorm, reference) with enriched metadata.

Key changes to import-chatgpt.py:
- Branch resolution via current_node parent-pointer walk
- Content type dispatch for 14 export message formats (voice, reasoning, web search, code)
- Signal-based filtering replaces regex title matching
- Session boundary detection for multi-day conversations
- Semantic deduplication via match_thoughts RPC
- Re-import handling with update_time/content_hash detection
- Embed thought content, not [ChatGPT: title] prefix
- --store-conversations for optional conversation history with pyramid summaries
- --focus flag with presets (tech, strategy, personal, creative) and custom text
- --openrouter-model flag for model selection
- --max-words flag to skip oversized conversations (default: 50000)
- Robust JSON parsing for non-OpenAI models (Anthropic, Ollama)
- Accurate progress display with percentage and skip counts

New files:
- chatgpt_parser.py: parsing, content dispatch, filtering, session detection
- schema.sql: chatgpt_conversations table with pyramid summaries and indexes

All existing CLI flags preserved (--dry-run, --model ollama, --after/--before,
--limit, --report, --verbose, --raw, --ingest-endpoint).

* [recipes] Fix ChatGPT import filtering defaults

---------

Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
NateBJones-Projects#160)

* [recipes] Local Ollama embeddings — zero-cost alternative to OpenRouter

Generate embeddings locally via Ollama and insert into Supabase.
Keeps the existing OB1 architecture, only swaps the embedding provider.

Five models tested including gte-qwen2-1.5b (1536-dim) which is
drop-in compatible with the default Open Brain schema.

Includes quality benchmarks comparing discrimination power across
all five models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix markdown lint errors in README

Add blank lines around fenced code blocks (MD031) and merge
consecutive blockquotes (MD028).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [recipes] Fix local Ollama env loading docs

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…s-Projects#150)

* [docs] Fix MD028 blank line between blockquotes in getting-started guide

Removes blank line between WARNING and IMPORTANT blockquotes that was
failing markdownlint across all PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix claudeception recipe: convert multi-line YAML descriptions to single-line

Multi-line descriptions (description: |) break agent routing silently.
Nate's March 2026 Skills Standard requires single-line YAML descriptions
for reliable semantic matching. Fixed 3 instances: the recipe's own
description and 2 template examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [recipes] Clean up Claudeception docs formatting

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…NateBJones-Projects#148)

* fix(professional-crm): remove Accept header patch causing SSE reconnect loop

The Accept: text/event-stream header patch forced StreamableHTTPTransport
into SSE mode on every request. Since Supabase edge functions are stateless,
the SSE stream terminates immediately after each response — causing the MCP
client to reconnect every ~2 seconds (~43k invocations/day).

StreamableHTTPTransport is request/response by design. Removing the patch
lets it respond with plain JSON, eliminating the reconnect loop entirely.

* fix(professional-crm): force JSON-only Accept header to prevent SSE reconnect loop

Removing text/event-stream from the Accept header before it reaches
StreamableHTTPTransport prevents it from opening SSE streams. MCP clients
send Accept: application/json, text/event-stream per spec -- this is what
triggers SSE mode even without the original workaround.

JSON-only responses close cleanly, eliminating the boot/shutdown cycle.
…ateBJones-Projects#139)

* recipes: add adaptive capture classification with confidence gating

* recipes: address review — fix author, OB1 types, add TypeScript implementation

* recipes: incorporate GitHub edits to README, classifier prompt, and metadata

* [recipes] Tighten adaptive capture setup and threshold updates

---------

Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…ateBJones-Projects#133)

* Add update_professional_contact tool to CRM extension

Adds the ability to update existing contact fields (name, company,
title, email, phone, tags, notes, follow_up_date, etc.) which was
proposed in NateBJones-Projects#93 but never implemented. Only provided fields are
updated, and the existing updated_at trigger handles timestamping.

* Allow clearing follow_up_date by passing null or empty string

Fixes the case where a follow-up date, once set, could never be
cleared — leaving contacts permanently stuck in get_follow_ups_due.

* [extensions] Document contact update tool

---------

Co-authored-by: Matt Hallett <matthallett@gmail.com>
Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…es-Projects#161)

* Fix pre-existing markdownlint errors across 15 files

Add blank lines around headings (MD022), fenced code blocks (MD031),
and between adjacent blockquotes (MD028). Fix broken link fragment
(MD051) and remove extra blank line (MD012). No content changes.

These errors were blocking CI on all open PRs since the lint check
runs repo-wide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [docs] Preserve README links during markdown cleanup

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
…raphics (NateBJones-Projects#85)

* [recipes] Infographic Generator: turn research docs into visual infographics

Second recipe from @jaredirish. Part of the Open Brain Flywheel
(capture-process-visualize loop, see Issue NateBJones-Projects#84).

Takes any markdown doc or Open Brain thought cluster and generates
professional infographic images via Gemini's free-tier API.
Auto-chunks content, writes verbose prompts (300+ words each),
generates PNGs with specific colors/layout/typography.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [recipes] Fix broken relative links in infographic-generator README

../brain-dump-processor/ → ../panning-for-gold/
../auto-capture-protocol/ → ../auto-capture/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [recipes] Address review feedback on infographic generator

- Sync generate.py with working local version (cleaner error handling,
  fix --redo display counter bug)
- Fix auto-capture link: directory doesn't exist until PR NateBJones-Projects#42 merges,
  so link to the PR instead of a non-existent directory

Note: part.as_image() and gemini-2.5-flash-image are both valid per
the official google-genai SDK docs. Reviewer concerns on those were
based on outdated information.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [recipes] Fix infographic redo progress output

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
* [recipes] Add OB-Graph knowledge graph layer

Adds graph database functionality for Open Brain using PostgreSQL nodes + edges
with recursive CTE traversal. Includes schema, MCP server with 10 tools, and
setup documentation.

https://claude.ai/code/session_015Z8wCeokTMTdrVMthqzGKJ

* [recipes] Clarify OB-Graph deployment setup

---------

Co-authored-by: Claude <noreply@anthropic.com>
* [docs] Fix Cursor MCP connection — use native url field, not mcp-remote

mcp-remote@latest now attempts OAuth client registration before sending
custom headers, which breaks against Open Brain's simple key-based auth.
Cursor supports remote MCP servers natively via the url field, so
mcp-remote is unnecessary.

Changes:
- Add dedicated Cursor section to getting-started guide (7.5) and
  remote-mcp primitive with native url config
- Update mcp-remote examples to pass key via ?key= query parameter
  instead of --header to avoid OAuth discovery issues
- Clarify x-brain-key (core) vs x-access-key (extensions) in
  troubleshooting guides

Made-with: Cursor

* [primitives] Bring remote MCP docs in line with repo format

---------

Co-authored-by: Jonathan Edwards <justfinethanku@gmail.com>
* [skills] Add weekly signal diff skill pack

* [skills] Fix markdownlint numbering in weekly signal diff
…rojects#181)

* [recipes] Add Bring Your Own Context recipe

* [recipes] Fix markdownlint regression in activation README
* [repo] Sweep fix-now backlog issues

* [docs] Fix setup-guide markdownlint regression
- dry_run now uses peekQueueItems() (read-only SELECT) instead of
  claimQueueItems(), so items stay "pending" during preview runs
- claimQueueItems() returns only rows actually claimed via .select(),
  preventing race conditions where concurrent workers see stale results
- markError() clears started_at and worker_version when resetting to
  "pending" so retryable items don't appear stale in monitoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why: Schema stores thoughts.id / entity_extraction_queue.thought_id as UUID
(gen_random_uuid()), not BIGINT. TypeScript types were declaring number,
which is a load-bearing lie: PostgREST returns UUIDs as strings, and any
arithmetic or Number-coerce on a consumer path would produce NaN. Updated
claimQueueItems, peekQueueItems, markComplete, markError, linkThoughtEntity
signatures to string. Entity IDs remain number (BIGSERIAL).
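
To make the type change concrete, here is a small sketch of the two ID kinds and a runtime guard. The `ThoughtId`/`EntityId` aliases, `isUuid`, and `toThoughtId` are hypothetical names, not identifiers from the PR.

```typescript
// Hypothetical aliases mirroring the schema described above.
type ThoughtId = string; // UUID, arrives as a string over the REST API
type EntityId = number; // BIGSERIAL

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Runtime check that a value looks like a UUID string.
function isUuid(value: unknown): value is ThoughtId {
  return typeof value === "string" && UUID_RE.test(value);
}

// Fail loudly instead of letting a mistyped ID flow into queries.
function toThoughtId(value: unknown): ThoughtId {
  if (!isUuid(value)) {
    throw new TypeError(`Expected a UUID string, got ${typeof value}`);
  }
  return value;
}

// Coercing a UUID to a number silently yields NaN, which is why a
// `number` annotation on these IDs was misleading.
const coerced = Number("3f2b8c1e-9d4a-4b6f-8a2e-1c5d7e9f0a1b");
```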
Why: The worker had no upper bound on LLM spend — a misconfigured cron on
a large queue could mint unbounded OpenRouter/OpenAI/Anthropic cost before
anyone noticed. Added ENTITY_EXTRACTION_MAX_CALLS env (default 10000,
0 = unlimited), a module-scoped llmCallCount counter, a pre-call gate that
throws ExtractionCostCapError when the cap is reached, and graceful abort
in the main loop that returns remaining claimed rows to 'pending' so the
next invocation can resume. Summary now reports truncated / truncated_reason
/ llm_calls so callers can observe the cap firing.
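
A sketch of the pre-call gate described above. The commit describes a module-scoped `llmCallCount`; this version wraps the counter in a closure so it can be exercised in isolation. `parseMaxCalls` and `createCallGate` are hypothetical helper names, and the handling of invalid values (falling back to the default) is an assumption.

```typescript
class ExtractionCostCapError extends Error {
  constructor(cap: number) {
    super(`LLM call cap of ${cap} reached`);
    this.name = "ExtractionCostCapError";
    // Keeps instanceof working when compiled to older JS targets.
    Object.setPrototypeOf(this, ExtractionCostCapError.prototype);
  }
}

// Parse ENTITY_EXTRACTION_MAX_CALLS: 0 means unlimited; missing or invalid
// values fall back to the default (assumed behavior).
function parseMaxCalls(raw: string | undefined, fallback: number = 10000): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed >= 0 ? parsed : fallback;
}

function createCallGate(maxCalls: number) {
  let llmCallCount = 0;
  return {
    // Call immediately before each LLM request; throws once the cap is hit.
    check(): void {
      if (maxCalls > 0 && llmCallCount >= maxCalls) {
        throw new ExtractionCostCapError(maxCalls);
      }
      llmCallCount += 1;
    },
    count: (): number => llmCallCount,
  };
}
```

The main loop would catch `ExtractionCostCapError`, stop processing, and return the remaining claimed rows to `pending`, as the commit message describes.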
Why: Deno's global fetch has no body-read timeout on Supabase Edge. A
stuck OpenRouter / OpenAI / Anthropic upstream would hang the invocation
until the 150s platform wall-clock killed it, leaving claimed rows in
'processing' with no status update. Added fetchWithTimeout() helper that
wraps AbortController around fetch, defaulting to 60s and overridable via
FETCH_TIMEOUT_MS. All three LLM call sites in extractEntities now route
through it. Timeouts surface as thrown errors, caught by the per-item
try/catch, and flow through markError — so the standard retry-then-fail
path handles them (reset to 'pending' on attempt < 5, 'failed' on cap).
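
A minimal sketch of an AbortController-based timeout wrapper. This version also reads the response body before clearing the timer, so a stalled body read is covered as well; whether the PR's helper does the same is not stated. `resolveTimeoutMs`, `SimpleRequest`, and the return shape are assumptions for illustration.

```typescript
const DEFAULT_FETCH_TIMEOUT_MS = 60000;

// Resolve FETCH_TIMEOUT_MS from an env string, falling back to 60s when the
// value is missing or not a positive number (assumed behavior).
function resolveTimeoutMs(raw: string | undefined): number {
  const parsed = Number(raw);
  return raw !== undefined && Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_FETCH_TIMEOUT_MS;
}

// Narrow request shape used by this sketch only.
interface SimpleRequest {
  method?: string;
  headers?: Record<string, string>;
  body?: string;
}

async function fetchWithTimeout(
  url: string,
  init: SimpleRequest = {},
  timeoutMs: number = DEFAULT_FETCH_TIMEOUT_MS,
): Promise<{ status: number; text: string }> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { ...init, signal: controller.signal });
    // Reading the body inside the try keeps it under the same deadline.
    const text = await res.text();
    return { status: res.status, text };
  } finally {
    clearTimeout(timer);
  }
}
```

On timeout the pending `fetch` or body read rejects with an abort error, which the per-item try/catch can hand to `markError`.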
Why: Thought content was interpolated directly into the LLM prompt, giving
any captured text (emails, browser history, Slack dumps) a direct channel
to override the extraction instructions — which would then flow unescaped
into entities.canonical_name, a TEXT column rendered by dashboards and
MCP tools (stored XSS vector). Four layered defenses:

1. Wrap content in <thought_content>...</thought_content> tags and tell the
   model explicitly that content inside is untrusted data, not instructions.
2. Escape literal tag occurrences in content so an attacker can't forge a
   close-tag and break out of the wrapper.
3. Enforce response_format: { type: "json_object" } on OpenRouter (OpenAI
   already had it) so prose wrapping doesn't crash the JSON parser.
4. sanitizeEntityName() strips control chars and clips to 200 chars before
   entities land in the DB — caps the blast radius of a surviving injection.
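
Defenses 1, 2, and 4 can be sketched as pure string helpers. The commit does not say how tags are escaped, so the HTML-entity replacement here is an assumption, as are the `escapeThoughtTags` and `wrapThoughtContent` names; only `sanitizeEntityName` is named in the commit.

```typescript
const OPEN_TAG = "<thought_content>";
const CLOSE_TAG = "</thought_content>";

// Neutralize literal wrapper tags inside untrusted content so the content
// cannot close the wrapper early (escaping strategy is an assumption).
function escapeThoughtTags(content: string): string {
  return content.replace(/<(\/?)thought_content>/gi, "&lt;$1thought_content&gt;");
}

// Wrap untrusted content; the surrounding prompt would tell the model that
// everything inside the tags is data, not instructions.
function wrapThoughtContent(content: string): string {
  return `${OPEN_TAG}\n${escapeThoughtTags(content)}\n${CLOSE_TAG}`;
}

// Strip ASCII control characters and clip to 200 characters before storage.
function sanitizeEntityName(name: string): string {
  return name.replace(/[\u0000-\u001f\u007f]/g, "").trim().slice(0, 200);
}
```

After wrapping, the only real closing tag in the prompt is the one the worker added, so a forged `</thought_content>` inside a captured email stays inert text.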
Why: Supabase Edge Functions hard-kill at 150s. With limit=50 and LLM calls
averaging 3s each, cumulative latency alone could exceed the budget — killing
the invocation mid-loop and leaving rows stuck in 'processing' with no
recovery except the manual SQL from the README. Added startTime at
invocation, INVOCATION_BUDGET_MS = 140000 (10s headroom), and a per-item
gate that releases remaining claimed rows back to 'pending' so the next
invocation picks them up. Surfaces as summary.truncated=true with
truncated_reason='wall_clock_budget', plus elapsed_ms for monitoring.
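
The per-item gate can be sketched as below. The real loop is asynchronous and calls an LLM; this version is synchronous with an injectable clock so the control flow is easy to follow. `runWithBudget`, the `released` field, and the `Summary` shape are assumptions, while `INVOCATION_BUDGET_MS`, `truncated`, `truncated_reason`, and `elapsed_ms` come from the commit message.

```typescript
const INVOCATION_BUDGET_MS = 140000;

interface Summary {
  processed: number;
  truncated: boolean;
  truncated_reason?: string;
  elapsed_ms: number;
  released: string[]; // thought IDs handed back to 'pending' (hypothetical field)
}

function runWithBudget(
  thoughtIds: string[],
  processItem: (id: string) => void,
  now: () => number = Date.now,
  budgetMs: number = INVOCATION_BUDGET_MS,
): Summary {
  const startTime = now();
  const remaining = thoughtIds.slice();
  let processed = 0;
  while (remaining.length > 0) {
    // Check the budget before starting each item, not after.
    if (now() - startTime >= budgetMs) break;
    const id = remaining.shift();
    if (id === undefined) break;
    processItem(id);
    processed += 1;
  }
  const truncated = remaining.length > 0;
  return {
    processed,
    truncated,
    truncated_reason: truncated ? "wall_clock_budget" : undefined,
    elapsed_ms: now() - startTime,
    released: remaining,
  };
}
```

Anything left in `released` would be reset to `pending` so the next invocation picks it up.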
Why: The knowledge-graph schema's queue_entity_extraction trigger re-queues
a thought when its content changes. The worker then re-runs extraction but
only upserted new thought_entities rows — it never deleted links from the
prior extraction. So editing a thought from {Alice, Bob, PostgreSQL} to
{Alice, Redis} ended up with the thought linked to all four entities. Over
time this silently corrupts the graph: edges.support_count inflates with
thoughts that no longer mention the underlying entity. Delete our own
prior links (scoped to source='entity_worker' so we don't clobber links
from other sources) before re-writing. Non-fatal if DELETE fails — we
still attempt the upserts, because missed extraction is worse than drift.
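
The delete-then-rewrite step can be modeled on an in-memory list of links. `replaceWorkerLinks` and the `ThoughtEntityLink` shape are hypothetical; the `entity_worker` source value is taken from the commit message.

```typescript
// Hypothetical shape for thought_entities rows.
interface ThoughtEntityLink {
  thought_id: string;
  entity_id: number;
  source: string;
}

const WORKER_SOURCE = "entity_worker";

// Drop this worker's earlier links for one thought, keep links from other
// sources and other thoughts, then add the freshly extracted set.
function replaceWorkerLinks(
  links: ThoughtEntityLink[],
  thoughtId: string,
  entityIds: number[],
): ThoughtEntityLink[] {
  const kept = links.filter(
    (link) => !(link.thought_id === thoughtId && link.source === WORKER_SOURCE),
  );
  const fresh = entityIds.map((entity_id) => ({
    thought_id: thoughtId,
    entity_id,
    source: WORKER_SOURCE,
  }));
  return kept.concat(fresh);
}
```

In the example from the commit message, re-extracting a thought that now mentions only Alice and Redis leaves it linked to exactly those two worker-sourced entities, while a manually added link survives.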
Why: The code fixes for BLOCKER-3 (ENTITY_EXTRACTION_MAX_CALLS),
BLOCKER-4 (FETCH_TIMEOUT_MS), and WARNING-2 (wall-clock budget) added
env knobs and summary fields (truncated, truncated_reason, llm_calls,
elapsed_ms) that weren't surfaced anywhere users would see. Also
documented the 'skipped' queue state per INFO-1 — the worker marks
system-generated thoughts (metadata.generated_by) as skipped, and until
now only the schema comment knew that. Added queue status reference and
a short note that dry-run leaves the queue untouched.
@alanshurafa alanshurafa force-pushed the contrib/alanshurafa/entity-extraction-worker branch from 6bcd022 to 322ba57 Compare April 18, 2026 02:48