recipes: local-ollama-embeddings — upsert, reembed, content_fingerprint#252

Open
snapsynapse wants to merge 1 commit into NateBJones-Projects:main from snapsynapse:main
@snapsynapse
Contributor

Summary

Adds three production features to embed-local.py that make the local-embeddings recipe usable as a repeatable seed pipeline:

  • --upsert: upserts on content_fingerprint (the PostgREST equivalent of ON CONFLICT (content_fingerprint)) by sending resolution=merge-duplicates in the Prefer header. Requires a UNIQUE constraint on content_fingerprint; if the constraint is absent, the script falls back to a plain insert instead of failing.
  • --reembed + --reembed-limit N: fetches existing rows (id + content; all of them unless --reembed-limit caps the count), regenerates their embeddings via Ollama, and PATCHes only the embedding column. Useful after switching models (e.g. nomic-embed-text → mxbai-embed-large). Does not touch content or metadata.
  • content_fingerprint passthrough: JSONL inputs with a content_fingerprint key now forward the value to the insert/upsert call (previously it was silently dropped).
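
For reviewers unfamiliar with PostgREST upserts, the following is a minimal sketch of how an insert request might differ from an upsert request. It is not the code in embed-local.py: `build_insert_request`, the `thoughts` table name, and the URL and key values are placeholders chosen for illustration. The sketch only builds the request pieces and never touches the network.

```python
import json
from urllib.parse import urlencode

# Hypothetical placeholders; the real script reads its own configuration.
SUPABASE_URL = "https://example.supabase.co"
SUPABASE_KEY = "service-role-key"
TABLE = "thoughts"  # assumed table name


def build_insert_request(row, upsert=False):
    """Return (url, headers, body) for a PostgREST insert or upsert."""
    url = f"{SUPABASE_URL}/rest/v1/{TABLE}"
    prefer = ["return=minimal"]
    if upsert:
        # on_conflict names the UNIQUE column; merge-duplicates asks
        # PostgREST to update the existing row instead of rejecting it.
        url += "?" + urlencode({"on_conflict": "content_fingerprint"})
        prefer.insert(0, "resolution=merge-duplicates")
    headers = {
        "apikey": SUPABASE_KEY,
        "Authorization": f"Bearer {SUPABASE_KEY}",
        "Content-Type": "application/json",
        "Prefer": ",".join(prefer),
    }
    # The fingerprint travels in the JSON body like any other column.
    return url, headers, json.dumps(row).encode("utf-8")
```

Keeping the fingerprint in the row dict means the passthrough needs no special handling here: whatever key the JSONL line carried ends up in the request body.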

New helpers

  • _supa_headers(prefer) — extracted from inline dicts to remove duplication
  • update_embedding(row_id, embedding) — PATCH for reembed path
  • fetch_all_rows() — GET id,content ordered by created_at

Input collection change

When --reembed is active, the script skips stdin/file/arg input collection entirely (no --file or piped input required).
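
One way the skip could be wired up is sketched below. The flag names match this PR, but `build_parser`, `collect_inputs`, the positional `text` argument, and the `stdin_text` parameter are illustrative assumptions rather than the script's actual structure.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="embed-local.py")
    parser.add_argument("text", nargs="*", help="thought text (assumed positional)")
    parser.add_argument("--file")
    parser.add_argument("--upsert", action="store_true")
    parser.add_argument("--reembed", action="store_true")
    parser.add_argument("--reembed-limit", type=int, default=None, metavar="N")
    parser.add_argument("--dry-run", action="store_true")
    return parser


def collect_inputs(args, stdin_text=""):
    """Return a list of input texts, or None when reembedding."""
    if args.reembed:
        # Rows come from the database, so no file, arg, or stdin is needed.
        return None
    if args.file:
        with open(args.file, encoding="utf-8") as fh:
            return [line.strip() for line in fh if line.strip()]
    if args.text:
        return [" ".join(args.text)]
    if stdin_text.strip():
        return [stdin_text.strip()]
    raise SystemExit("error: provide text, --file, piped input, or --reembed")
```

Checking `args.reembed` first keeps the existing no-input error path intact for every other mode.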

Test plan

  • python embed-local.py --file thoughts.jsonl --upsert — confirm ON CONFLICT merge on second run
  • python embed-local.py --reembed --dry-run — confirm rows fetched, embeddings generated, no DB writes
  • python embed-local.py --reembed --reembed-limit 5 — confirm only 5 rows patched
  • python embed-local.py --file plain.txt — confirm unchanged plain-insert path still works
  • python embed-local.py (no args, no stdin) — confirm still exits with clear error
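
The two reembed checks above can also be reasoned about offline. The sketch below is a hypothetical reembed loop with the embedding and update functions injected, so `--dry-run` and `--reembed-limit` behavior can be exercised without Ollama or a database. `reembed_rows` and its parameters are invented for illustration, and the real script may apply the limit in the query rather than in Python.

```python
def reembed_rows(rows, embed, update, limit=None, dry_run=False):
    """Re-embed rows and return how many were written back.

    rows:   list of dicts with "id" and "content" keys
    embed:  callable mapping text to a vector (e.g. a call to Ollama)
    update: callable (row_id, vector) that persists the embedding
    """
    selected = rows if limit is None else rows[:limit]
    written = 0
    for row in selected:
        vector = embed(row["content"])  # always generated, even in dry-run
        if dry_run:
            continue  # embeddings are computed but nothing is written
        update(row["id"], vector)
        written += 1
    return written


# Usage with stand-in functions instead of Ollama and the database.
rows = [{"id": i, "content": f"thought {i}"} for i in range(10)]
patched = []
count = reembed_rows(
    rows,
    embed=lambda text: [float(len(text))],
    update=lambda row_id, vec: patched.append(row_id),
    limit=5,
)
```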

🤖 Generated with Claude Code

…ed-local.py

- `--upsert`: ON CONFLICT(content_fingerprint) merge-duplicates via Prefer header
- `--reembed`: fetch all existing rows and re-embed them (update embedding column only);
  useful after switching embedding models. Supports `--reembed-limit N`.
- `content_fingerprint` passthrough in JSONL input (the key was already parsed in read_thoughts_from_file)
- Refactored `ingest_thought` to accept optional fingerprint + upsert params
- Added `update_embedding(row_id, embedding)` and `fetch_all_rows()` helpers
- `_supa_headers()` extracted to avoid header dict duplication
- Input collection block skipped when `--reembed` is active (no stdin/file required)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions (bot) added the extension (Contribution: curated learning path build) and recipe (Contribution: step-by-step recipe) labels on Apr 28, 2026