[integrations] Add Readwise highlight capture (backfill + webhook + schema)#237

Open
mlava wants to merge 11 commits into NateBJones-Projects:main from mlava:contrib/mlava/readwise

Conversation

@mlava mlava commented Apr 24, 2026

Contribution Type

  • Recipe (/recipes/readwise-import)
  • Schema (/schemas/readwise-books)
  • Integration (/integrations/readwise-capture)

What does this do?

Adds end-to-end Readwise highlight capture in three cooperating pieces:

  1. schemas/readwise-books — cache table for book-level metadata (title, author, category, source, cover, highlight count) keyed by Readwise's user_book_id. Plus get_book_highlights and increment_book_highlight_count RPCs.

  2. recipes/readwise-import — Python backfill that pages through /api/v2/export/, upserts each book, and inserts highlights as thoughts with source_type='readwise'. It is idempotent and filter-aware (by book, source, category, date range, or --list-books discovery), batch-embeds highlight text, and recovers from Supabase statement timeouts by splitting the insert batch.

  3. integrations/readwise-capture — Supabase Edge Function subscribing to readwise.highlight.created. Covers every source Readwise aggregates (Kindle, Reader, Twitter, Hypothesis, Instapaper, Apple Books, podcasts, Roam, physical books via OCR, etc.). Uses the cache as a write-through lookup.

Shipping together because the integration and recipe both depend on the schema.

Requirements

  • Services: Readwise account (Free tier works — webhook creation is available on all plans)
  • Tools: Supabase CLI, Python 3.10+
  • APIs: Readwise /api/v2/export/ and /api/v2/books/, OpenRouter embeddings (existing OB dependency)

Testing

Full end-to-end test on my own Open Brain:

  • Schema: applied cleanly; idempotent.
  • Backfill: imported 11,173 highlights / 656 books in under 4 min. Filter paths (source, category, date-range, book-id, list-books) all verified. Mid-run statement_timeout exercised and recovered via batch split.
  • Webhook: deployed, registered, verified live capture from a fresh Roam→Readwise sync. Book-metadata lookup and cache upsert both verified end-to-end (including cache-miss path after DELETE FROM readwise_books WHERE book_id = X).

Follow-ups (not in this PR)

  1. Add metadata.book_source to highlight rows — parallel to book_title/book_author/book_category — so source filtering in search_thoughts doesn't need a join. Requires a small UPDATE migration.
  2. Optional: top up ~1% export-vs-highlights gap via /api/v2/highlights/ with is_discard=false.

Checklist

  • I've read CONTRIBUTING.md
  • Each contribution has a README with prerequisites, steps, and expected outcome
  • Each metadata.json has all required fields
  • Tested on my own Open Brain instance
  • No credentials or secrets included

mlava and others added 11 commits April 24, 2026 07:40
Adds a side-table of Readwise book-level metadata keyed by user_book_id,
so highlights stored in thoughts can reference a book without
denormalising title/author into every highlight row. Includes two RPCs:

- get_book_highlights: returns a book's highlights in source-location
  order so a reader can review them in the order they were encountered.
- increment_book_highlight_count: keeps num_highlights and
  last_highlight_at fresh without a COUNT over thoughts.

Required by the forthcoming readwise-capture integration and
readwise-import recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supabase Edge Function that subscribes to readwise.highlight.created
webhooks and inserts each highlight as a thought with
source_type='readwise'. Every highlight Readwise aggregates -- from
Kindle, Apple Books, Reader, Instapaper, Hypothesis, Airr/Snipd
podcasts, and the Readwise OCR app -- flows through the same event
and lands in one searchable feed.

Design choices:
- Auth via body.secret echoed by Readwise, compared against
  READWISE_WEBHOOK_SECRET.
- Other event types (Reader document events, tag updates) are ignored
  with a 200 response so subscribing more broadly doesn't break us.
- Dedup query filters on source_type first to use the index from
  enhanced-thoughts before the metadata JSONB contains check.
- Book metadata resolved via a write-through cache in readwise_books
  so repeat highlights from the same book don't re-hit the Readwise
  API.
- No per-highlight LLM metadata extraction; book title/author plus
  the highlight text are the metadata. Topic enrichment is left to
  the existing thought-enrichment recipe as a separate pass.

Depends on the readwise-books schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
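The write-through cache described above can be sketched as follows. The real handler is a TypeScript Edge Function; this is the same lookup logic restated in Python, with the cache and the Readwise API injected as callables (all names here are illustrative, not the function's actual API):

```python
def resolve_book(book_id, cache_get, cache_put, fetch_book):
    """Return book metadata, hitting the Readwise API only on a cache miss.

    cache_get/cache_put wrap the readwise_books table; fetch_book wraps
    a GET to /api/v2/books/{book_id}.
    """
    cached = cache_get(book_id)
    if cached is not None:
        return cached                # repeat highlights from this book skip the API
    book = fetch_book(book_id)       # cache miss: one API round-trip
    cache_put(book_id, book)         # write-through: store before returning
    return book
```

Because the cache is populated on first sight of a book, a burst of highlights from the same source costs one metadata lookup, not one per highlight.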
Python script that pages through Readwise's /api/v2/export/ endpoint,
upserts each book into readwise_books, batch-embeds highlight text
via OpenRouter, and inserts each highlight into thoughts with
source_type='readwise'. Idempotent on re-run via
readwise_highlight_id dedup.

Design choices:
- Uses /export/ (grouped books with nested highlights) rather than
  /highlights/ so no separate book lookup per highlight.
- Batches embeddings 100 at a time; OpenAI accepts up to 2048 per
  call but 100 keeps latency reasonable and runs ~20x faster than
  serial.
- --dry-run and --limit for first-run sanity checks before embedding
  thousands of rows.
- Periodic heartbeat every 500 highlights so long runs don't look
  hung, independent of --verbose which prints per-book.
- Respects Retry-After on 429s (Readwise caps /export/ at 20 req/min).

Captures highlights only -- not Reader reading history. Reader
documents fire through readwise.reader.* webhooks which this recipe
and the matching readwise-capture integration deliberately skip.
That split is documented in both READMEs.

Depends on the readwise-books schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
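The paging-plus-rate-limit loop the recipe describes looks roughly like this. It is a sketch, not the script itself: the HTTP call is injected so the shape is testable, and the `updatedAfter`/`pageCursor`/`nextPageCursor` names follow Readwise's export API as described above:

```python
import time

def export_pages(get, updated_after=None, sleep=time.sleep):
    """Yield result pages from /api/v2/export/, following nextPageCursor
    and honoring Retry-After on 429s (the endpoint is capped at ~20 req/min).

    `get(params)` is injected and returns (status, headers, json_body).
    """
    params = {}
    if updated_after:
        params["updatedAfter"] = updated_after
    while True:
        status, headers, body = get(dict(params))
        if status == 429:
            sleep(int(headers.get("Retry-After", 60)))
            continue                          # retry the same page
        yield body["results"]
        cursor = body.get("nextPageCursor")
        if not cursor:
            break
        params["pageCursor"] = cursor
```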
Expands import-readwise.py with a set of filters so users can run
targeted backfills instead of being forced to ingest everything at
once:

- --book-id ID (repeatable): limit to specific Readwise book_ids
- --source NAME (repeatable): limit to sources (kindle, reader, ...)
- --category NAME (repeatable): limit to categories (books, articles,
  podcasts, tweets, supplementals)
- --highlighted-after / --highlighted-before: date-range filter on
  when the highlight was made (the usual mental model)
- --updated-before: upper bound on the record update timestamp, pair
  with the existing --updated-after for reconciliation use cases
- --list-books: print a TSV of book_id/count/source/category/title
  and exit; discovery helper for grabbing --book-id values

All new filters are AND-combined and apply client-side after fetching
pages from /api/v2/export/. The export endpoint only accepts
updatedAfter server-side, so client-side is the only place additional
predicates can live. Pagination still happens across the full library,
but skipped books are a no-op -- no embedding, no insert.

Highlight-level filters exclude highlights where highlighted_at is
null (tweets, some podcast snippets) when any --highlighted-* flag
is set, since we can't place them in time.

Summary output now distinguishes three kinds of not-inserted
highlights: books skipped by filter, highlights filtered by date
range, and highlights already present.

README updated with a "Selective backfill" section that documents
every flag, explains highlighted_at vs updated semantics, and
includes example invocations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
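The AND-combined client-side filtering described above reduces to two small predicates, sketched here with field names taken from the PR description (`user_book_id`, `highlighted_at`); treat the function signatures as illustrative:

```python
def book_passes(book, book_ids=None, sources=None, categories=None):
    """AND-combine the book-level filters; any unset filter passes."""
    if book_ids and book["user_book_id"] not in book_ids:
        return False
    if sources and book.get("source") not in sources:
        return False
    if categories and book.get("category") not in categories:
        return False
    return True

def highlight_passes(h, after=None, before=None):
    """Date-range filter on highlighted_at. Highlights with no timestamp
    (tweets, some podcast snippets) are excluded once any bound is set,
    since they can't be placed in time."""
    if after is None and before is None:
        return True
    ts = h.get("highlighted_at")
    if ts is None:
        return False
    return (after is None or ts >= after) and (before is None or ts <= before)
```

ISO-8601 timestamps compare correctly as strings, so the date-range check needs no parsing.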
Two issues visible on the first real run:

1. BrokenPipeError when piping into `head`. Restore default SIGPIPE
   handling on Unix so the script exits cleanly instead of raising
   mid-print. No-op on platforms without SIGPIPE (Windows).

2. num_highlights column showed 0 for every book. The /api/v2/export/
   endpoint doesn't populate that field on book records; it's on
   /api/v2/books/ instead. Since /export/ already returns every
   highlight nested under its book, count len(book["highlights"])
   directly -- authoritative, no extra API calls, and correctly
   reports the subset count when --updated-after is set.

Renamed the column header from "num_highlights" to "highlights" to
match the new semantics, and added a trailing total-books +
total-highlights line to stderr so the summary doesn't get swallowed
by downstream pipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
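The SIGPIPE fix from item 1 is a few lines; a minimal version (function name illustrative):

```python
import signal

def restore_sigpipe():
    """Restore default SIGPIPE handling on Unix so `script | head` exits
    cleanly instead of raising BrokenPipeError mid-print. CPython installs
    its own SIGPIPE handler at startup; this puts the OS default back.
    No-op on platforms without SIGPIPE (Windows)."""
    if hasattr(signal, "SIGPIPE"):
        signal.signal(signal.SIGPIPE, signal.SIG_DFL)
```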
upsert_book was trusting book["num_highlights"] from the /export/
response, which isn't populated on that endpoint -- every backfilled
book row ended up with num_highlights=0 despite correctly imported
highlights in the thoughts table. Cosmetic (no impact on search or
get_book_highlights), but breaks any dashboard that reads the count.

Count from len(book["highlights"]) in the inline response instead.
For full backfills that's the authoritative total. On --updated-after
runs the array is a subset, so we skip the field entirely; the
upsert's ON CONFLICT DO UPDATE only touches the columns we set, so
any count from a previous full run is preserved.

Verified on a re-run for book 52113662: upsert with 47 now sets
num_highlights=47, matching the count of inserted rows in thoughts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
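The conditional-count behavior can be sketched as a payload builder. Column and field names follow the PR description; this is a sketch of the fix, not the script's exact upsert code:

```python
def book_row(book, updated_after=None):
    """Build the readwise_books upsert payload from an /export/ book record.

    book['num_highlights'] is not populated on /export/, so count the
    inline highlights array instead. On --updated-after runs that array
    is only a subset, so the count is omitted entirely; the upsert's
    ON CONFLICT DO UPDATE only touches the columns present, preserving
    any count from a previous full run.
    """
    row = {
        "book_id": book["user_book_id"],
        "title": book.get("title"),
        "author": book.get("author"),
        "category": book.get("category"),
        "source": book.get("source"),
    }
    if updated_after is None:
        row["num_highlights"] = len(book.get("highlights", []))
    return row
```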
A full backfill hit Postgres error 57014 (canceling statement due to
statement timeout) on the thoughts table insert. Supabase's default
statement_timeout for the authenticated role is ~8s, which a 100-row
batch of 1536-dim vectors can blow past when pgvector does index
maintenance on top of the insert itself.

Two changes:

1. Decouple insert batch size from embedding batch size. Embedding
   calls stay at 100 (OpenRouter API, throughput-bound). Inserts drop
   to 25 (DB-write-bound). For a typical thoughts table with an
   ivfflat/hnsw index, 25-row inserts land comfortably under the
   timeout.

2. Add insert_thoughts() that catches APIError with code 57014 and
   splits the batch in half, retrying each side. Bottoms out at
   single-row inserts if the split chain continues; raises on any
   non-timeout error so real failures aren't swallowed.

Partial runs are recoverable: rows inserted before the timeout will
be skipped on re-run via the readwise_highlight_id dedup check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
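The split-and-retry strategy from item 2 is a classic recursive bisection; a self-contained sketch (function name and the injected `insert`/`is_timeout` callables are illustrative, not the script's actual API):

```python
def insert_with_split(rows, insert, is_timeout, min_size=1):
    """Insert `rows` as one batch; on a statement-timeout error (Postgres
    code 57014), split the batch in half and retry each side, bottoming
    out at single-row inserts. Non-timeout errors propagate unchanged so
    real failures aren't swallowed."""
    if not rows:
        return
    try:
        insert(rows)
    except Exception as e:
        if not is_timeout(e) or len(rows) <= min_size:
            raise
        mid = len(rows) // 2
        insert_with_split(rows[:mid], insert, is_timeout, min_size)
        insert_with_split(rows[mid:], insert, is_timeout, min_size)
```

Row order is preserved across splits, and because re-runs dedup on readwise_highlight_id, even a raise partway through leaves a recoverable state.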
Readwise's "Test Webhook" button validates the URL by sending a
request with an empty body; req.json() crashed on the "Unexpected
end of JSON input" and returned 500, which Readwise's setup flow
treats as a failed test and blocks saving the webhook.

Three small additions in the handler:

- GET requests return 200 "readwise-capture is live" for uptime
  probes or casual curl-checks.
- Empty POST bodies return 200 "ok (empty body)" so the Readwise
  test validator can pass. Authentic event payloads always include
  secret + event_type, so nothing downstream breaks from this.
- Non-JSON POST bodies return 400 with a logged snippet (truncated
  to 500 chars) so we can diagnose if Readwise or something else
  ever sends a malformed payload.

Real events still go through the secret check and event-type filter
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
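The guard ordering above matters: the empty-body check must run before the secret check or Readwise's test probe would 401. The real handler is TypeScript; here is the same early-exit logic as a pure Python function (names illustrative) returning a (status, message) pair, or None when the payload should proceed to the secret check:

```python
import json

def classify_request(method, raw_body):
    """Early-exit guards from the webhook handler, in order."""
    if method == "GET":
        return (200, "readwise-capture is live")   # uptime probes, curl checks
    if not raw_body.strip():
        return (200, "ok (empty body)")            # Readwise's test validator
    try:
        json.loads(raw_body)
    except json.JSONDecodeError:
        return (400, "invalid JSON")               # logged with truncated snippet
    return None                                    # real event: secret check next
```

Returning 200 for empty bodies is safe because authentic event payloads always carry secret + event_type, so nothing downstream ever sees an empty body.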
The webhook creation page is at readwise.io/webhook, not
readwise.io/integrations as previously written. Verified against
Readwise's live UI during setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README instructed users to generate their own webhook secret with
openssl rand -base64 32 and paste it into both Readwise and Supabase.
In practice Readwise generates the secret itself and does not accept a
user-provided value -- the supplied field was silently ignored, so
Readwise's secret and ours never matched, and every real event 401'd
while the empty-body test passed.

Restructured Steps 2-5:

- Removed old Step 2 ("Generate a Webhook Secret"). Deferred setting
  READWISE_WEBHOOK_SECRET until we have the value Readwise generates.
- Deploy (now Step 2) only sets OPENROUTER_API_KEY and
  READWISE_ACCESS_TOKEN. Added a note that real webhooks will 401
  until Step 4 configures the secret; the test-webhook button still
  passes because the empty-body guard runs before the secret check.
- Register (now Step 3) drops the "paste a secret" instruction;
  describes the flow where Readwise shows a generated secret you then
  copy into the tracker.
- New Step 4 stores Readwise's generated secret in Supabase with a
  note that Edge Function secrets are read at runtime, so no
  redeploy is needed.
- Credential tracker moved "Webhook secret" from an input to a
  "generated during setup" value.

Verified end-to-end: after setting the Readwise-generated value as
READWISE_WEBHOOK_SECRET, webhook events start arriving and the secret
check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously stated webhooks required a paid plan ("Readwise Free
doesn't expose webhooks"). Verified against a fresh Free-tier account:
readwise.io/webhook is available and webhook creation + test succeeds
on Free plans. Reworded prerequisites and cost section to reflect
this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added labels Apr 24, 2026: integration (Contribution: MCP extension or capture source), recipe (Contribution: step-by-step recipe), schema (Contribution: database extension)