[integrations] Add Readwise highlight capture (backfill + webhook + schema) #237
Open
mlava wants to merge 11 commits into NateBJones-Projects:main from
Conversation
Adds a side-table of Readwise book-level metadata keyed by user_book_id, so highlights stored in thoughts can reference a book without denormalising title/author into every highlight row.

Includes two RPCs:

- get_book_highlights: returns a book's highlights in source-location order so a reader can review them the way they were encountered.
- increment_book_highlight_count: keeps num_highlights and last_highlight_at fresh without a COUNT over thoughts.

Required by the forthcoming readwise-capture integration and readwise-import recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
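A rough sketch of how a client would consume the two RPCs via supabase-py. The parameter name `p_user_book_id` and the env-var names are assumptions, not the schema's actual signature; match them to the migration:

```python
import os
from supabase import create_client  # supabase-py

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_ROLE_KEY"])

# Review one book's highlights in the order they were encountered in the
# source. The parameter name (p_user_book_id) is a guess at the RPC signature.
highlights = supabase.rpc("get_book_highlights", {"p_user_book_id": 52113662}).execute().data

# Keep num_highlights / last_highlight_at fresh after inserting a highlight,
# instead of running a COUNT over thoughts.
supabase.rpc("increment_book_highlight_count", {"p_user_book_id": 52113662}).execute()
```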
Supabase Edge Function that subscribes to readwise.highlight.created webhooks and inserts each highlight as a thought with source_type='readwise'. Every highlight Readwise aggregates -- from Kindle, Apple Books, Reader, Instapaper, Hypothesis, Airr/Snipd podcasts, and the Readwise OCR app -- flows through the same event and lands in one searchable feed.

Design choices:

- Auth via body.secret echoed by Readwise, compared against READWISE_WEBHOOK_SECRET.
- Other event types (Reader document events, tag updates) are ignored with a 200 response, so subscribing more broadly doesn't break us.
- The dedup query filters on source_type first, to use the index from enhanced-thoughts, before the metadata JSONB contains check (sketched below).
- Book metadata is resolved via a write-through cache in readwise_books, so repeat highlights from the same book don't re-hit the Readwise API.
- No per-highlight LLM metadata extraction; book title/author plus the highlight text are the metadata. Topic enrichment is left to the existing thought-enrichment recipe as a separate pass.

Depends on the readwise-books schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
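The function itself is Deno/TypeScript; here is the dedup predicate sketched in Python with supabase-py, just to show the filter ordering (the metadata key readwise_highlight_id comes from the import recipe's dedup):

```python
def highlight_exists(supabase, highlight_id: int) -> bool:
    """True if this Readwise highlight is already stored as a thought."""
    res = (
        supabase.table("thoughts")
        .select("id")
        # source_type filter first, so the index from enhanced-thoughts can
        # narrow the scan before the JSONB containment check below.
        .eq("source_type", "readwise")
        .contains("metadata", {"readwise_highlight_id": highlight_id})
        .limit(1)
        .execute()
    )
    return bool(res.data)
```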
Python script that pages through Readwise's /api/v2/export/ endpoint, upserts each book into readwise_books, batch-embeds highlight text via OpenRouter, and inserts each highlight into thoughts with source_type='readwise'. Idempotent on re-run via readwise_highlight_id dedup.

Design choices:

- Uses /export/ (grouped books with nested highlights) rather than /highlights/, so no separate book lookup per highlight.
- Batches embeddings 100 at a time; OpenAI accepts up to 2048 inputs per call, but 100 keeps latency reasonable and still runs ~20x faster than serial.
- --dry-run and --limit for first-run sanity checks before embedding thousands of rows.
- Periodic heartbeat every 500 highlights so long runs don't look hung, independent of --verbose, which prints per-book.
- Respects Retry-After on 429s (Readwise caps /export/ at 20 req/min).

Captures highlights only -- not Reader reading history. Reader documents fire through readwise.reader.* webhooks, which this recipe and the matching readwise-capture integration deliberately skip. That split is documented in both READMEs.

Depends on the readwise-books schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
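A minimal sketch of the paging-plus-backoff loop, assuming the documented /api/v2/export/ contract (Token auth, pageCursor/nextPageCursor, results):

```python
import time
import requests

def export_pages(token: str, updated_after: str | None = None):
    """Yield pages of books (each with nested highlights) from Readwise's export API."""
    params: dict = {"updatedAfter": updated_after} if updated_after else {}
    while True:
        resp = requests.get(
            "https://readwise.io/api/v2/export/",
            headers={"Authorization": f"Token {token}"},
            params=params,
        )
        if resp.status_code == 429:
            # Readwise caps /export/ at 20 req/min; honour its backoff hint.
            time.sleep(int(resp.headers.get("Retry-After", "60")))
            continue
        resp.raise_for_status()
        page = resp.json()
        yield page["results"]
        if not page.get("nextPageCursor"):
            return
        params["pageCursor"] = page["nextPageCursor"]
```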
Expands import-readwise.py with a set of filters so users can run targeted backfills instead of being forced to ingest everything at once:

- --book-id ID (repeatable): limit to specific Readwise book_ids
- --source NAME (repeatable): limit to sources (kindle, reader, ...)
- --category NAME (repeatable): limit to categories (books, articles, podcasts, tweets, supplementals)
- --highlighted-after / --highlighted-before: date-range filter on when the highlight was made (the usual mental model)
- --updated-before: upper bound on the record update timestamp; pair with the existing --updated-after for reconciliation use cases
- --list-books: print a TSV of book_id/count/source/category/title and exit; a discovery helper for grabbing --book-id values

All new filters are AND-combined and apply client-side after fetching pages from /api/v2/export/ (see the sketch after this message). The export endpoint only accepts updatedAfter server-side, so client-side is the only place additional predicates can live. Pagination still happens across the full library, but skipped books are a no-op -- no embedding, no insert.

Highlight-level filters exclude highlights where highlighted_at is null (tweets, some podcast snippets) when any --highlighted-* flag is set, since we can't place them in time.

Summary output now distinguishes three kinds of not-inserted highlights: books skipped by filter, highlights filtered by date range, and highlights already present.

README updated with a "Selective backfill" section that documents every flag, explains highlighted_at vs updated semantics, and includes example invocations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
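A sketch of how the AND-combined client-side predicates could look; the helper and field names here are illustrative, not the script's actual internals:

```python
def book_passes(book: dict, args) -> bool:
    # AND-combined: every set filter must pass; an unset filter always passes.
    if args.book_id and book.get("user_book_id") not in args.book_id:
        return False
    if args.source and book.get("source") not in args.source:
        return False
    if args.category and book.get("category") not in args.category:
        return False
    return True

def highlight_passes(hl: dict, args) -> bool:
    if not (args.highlighted_after or args.highlighted_before):
        return True
    ts = hl.get("highlighted_at")
    if ts is None:
        # Undated highlights (tweets, some podcast snippets) can't be
        # placed in time, so any --highlighted-* flag excludes them.
        return False
    # ISO-8601 timestamps compare correctly as strings.
    return (not args.highlighted_after or ts >= args.highlighted_after) and \
           (not args.highlighted_before or ts <= args.highlighted_before)
```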
Two issues visible on the first real run:

1. BrokenPipeError when piping into `head`. Restore default SIGPIPE handling on Unix (sketched below) so the script exits cleanly instead of raising mid-print. No-op on platforms without SIGPIPE (Windows).

2. The num_highlights column showed 0 for every book. The /api/v2/export/ endpoint doesn't populate that field on book records; it's on /api/v2/books/ instead. Since /export/ already returns every highlight nested under its book, count len(book["highlights"]) directly -- authoritative, no extra API calls, and it correctly reports the subset count when --updated-after is set.

Renamed the column header from "num_highlights" to "highlights" to match the new semantics, and added a trailing total-books + total-highlights line to stderr so the summary doesn't get swallowed by downstream pipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
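The SIGPIPE fix in point 1 amounts to a few guarded lines at script startup:

```python
import signal

# Restore default SIGPIPE handling so `import-readwise.py --list-books | head`
# exits quietly instead of raising BrokenPipeError mid-print. Windows has no
# SIGPIPE, hence the hasattr guard.
if hasattr(signal, "SIGPIPE"):
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)
```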
upsert_book was trusting book["num_highlights"] from the /export/ response, which isn't populated on that endpoint -- every backfilled book row ended up with num_highlights=0 despite correctly imported highlights in the thoughts table. Cosmetic (no impact on search or get_book_highlights), but it breaks any dashboard that reads the count.

Count from len(book["highlights"]) in the inline response instead. For full backfills that's the authoritative total. On --updated-after runs the array is a subset, so we skip the field entirely; the upsert's ON CONFLICT DO UPDATE only touches the columns we set, so any count from a previous full run is preserved.

Verified on a re-run for book 52113662: the upsert now sets num_highlights=47, matching the count of inserted rows in thoughts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
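A sketch of the corrected upsert, assuming supabase-py and illustrative column names:

```python
def upsert_book(supabase, book: dict, updated_after: str | None) -> None:
    row = {
        "user_book_id": book["user_book_id"],
        "title": book.get("title"),
        "author": book.get("author"),
    }
    if updated_after is None:
        # Full backfill: the nested highlights array is the authoritative total.
        row["num_highlights"] = len(book["highlights"])
    # On --updated-after runs the array is only a subset, so num_highlights is
    # omitted; ON CONFLICT DO UPDATE touches only the columns present in `row`,
    # preserving any count written by a previous full run.
    supabase.table("readwise_books").upsert(row, on_conflict="user_book_id").execute()
```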
A full backfill hit Postgres error 57014 (canceling statement due to statement timeout) on the thoughts table insert. Supabase's default statement_timeout for the authenticated role is ~8s, which a 100-row batch of 1536-dim vectors can blow past when pgvector does index maintenance on top of the insert itself.

Two changes:

1. Decouple insert batch size from embedding batch size. Embedding calls stay at 100 (OpenRouter API, throughput-bound). Inserts drop to 25 (DB-write-bound). For a typical thoughts table with an ivfflat/hnsw index, 25-row inserts land comfortably under the timeout.

2. Add insert_thoughts(), which catches APIError with code 57014 and splits the batch in half, retrying each side (sketched below). Bottoms out at single-row inserts if the split chain continues; raises on any non-timeout error so real failures aren't swallowed.

Partial runs are recoverable: rows inserted before the timeout are skipped on re-run via the readwise_highlight_id dedup check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
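A sketch of the split-on-timeout insert, assuming supabase-py (postgrest's APIError carries the Postgres SQLSTATE in its code attribute):

```python
from postgrest import APIError

def insert_thoughts(supabase, rows: list[dict]) -> None:
    """Insert a batch of thoughts; on statement timeout, bisect and retry."""
    if not rows:
        return
    try:
        supabase.table("thoughts").insert(rows).execute()
    except APIError as err:
        # 57014 = canceling statement due to statement timeout. Anything else
        # is a real failure and must propagate -- as must a single row that
        # still times out, since it can't be split further.
        if err.code != "57014" or len(rows) == 1:
            raise
        mid = len(rows) // 2
        insert_thoughts(supabase, rows[:mid])
        insert_thoughts(supabase, rows[mid:])
```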
Readwise's "Test Webhook" button validates the URL by sending a request with an empty body; req.json() crashed on the "Unexpected end of JSON input" and returned 500, which Readwise's setup flow treats as a failed test and blocks saving the webhook. Three small additions in the handler: - GET requests return 200 "readwise-capture is live" for uptime probes or casual curl-checks. - Empty POST bodies return 200 "ok (empty body)" so the Readwise test validator can pass. Authentic event payloads always include secret + event_type, so nothing downstream breaks from this. - Non-JSON POST bodies return 400 with a logged snippet (truncated to 500 chars) so we can diagnose if Readwise or something else ever sends a malformed payload. Real events still go through the secret check and event-type filter unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The webhook creation page is at readwise.io/webhook, not readwise.io/integrations as previously written. Verified against Readwise's live UI during setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README instructed users to generate their own webhook secret with
openssl rand -base64 32 and paste it into both Readwise and Supabase.
In practice Readwise generates the secret itself and does not accept a
user-provided value -- the supplied field was silently ignored, so
Readwise's secret and ours never matched, and every real event 401'd
while the empty-body test passed.
Restructured Steps 2-5:
- Removed old Step 2 ("Generate a Webhook Secret"). Deferred setting
READWISE_WEBHOOK_SECRET until we have the value Readwise generates.
- Deploy (now Step 2) only sets OPENROUTER_API_KEY and
READWISE_ACCESS_TOKEN. Added a note that real webhooks will 401
until Step 4 configures the secret; the test-webhook button still
passes because the empty-body guard runs before the secret check.
- Register (now Step 3) drops the "paste a secret" instruction;
describes the flow where Readwise shows a generated secret you then
copy into the tracker.
- New Step 4 stores Readwise's generated secret in Supabase with a
note that Edge Function secrets are read at runtime, so no
redeploy is needed.
- Credential tracker moved "Webhook secret" from an input to a
"generated during setup" value.
Verified end-to-end: after setting the Readwise-generated value as
READWISE_WEBHOOK_SECRET, webhook events start arriving and the secret
check passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously stated webhooks required a paid plan ("Readwise Free
doesn't expose webhooks"). Verified against a fresh Free-tier account:
readwise.io/webhook is available and webhook creation + test succeeds
on Free plans. Reworded prerequisites and cost section to reflect
this.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contribution Type
- Recipe (`/recipes/readwise-import`)
- Schema (`/schemas/readwise-books`)
- Integration (`/integrations/readwise-capture`)

What does this do?
Adds end-to-end Readwise highlight capture in three cooperating pieces:
- `schemas/readwise-books` — cache table for book-level metadata (title, author, category, source, cover, highlight count) keyed by Readwise's `user_book_id`. Plus `get_book_highlights` and `increment_book_highlight_count` RPCs.
- `recipes/readwise-import` — Python backfill that pages through `/api/v2/export/`, upserts each book, and inserts highlights as thoughts with `source_type='readwise'`. Idempotent, filter-aware (by book, source, category, date range, or `--list-books` discovery), batch-embeds, and recovers from Supabase statement timeouts by splitting the insert batch.
- `integrations/readwise-capture` — Supabase Edge Function subscribing to `readwise.highlight.created`. Covers every source Readwise aggregates (Kindle, Reader, Twitter, Hypothesis, Instapaper, Apple Books, podcasts, Roam, physical books via OCR, etc.). Uses the cache as a write-through lookup.

Shipping together because the integration and recipe both depend on the schema.
Requirements
- Readwise access token with `/api/v2/export/` and `/api/v2/books/`
- OpenRouter embeddings (existing OB dependency)

Testing
Full end-to-end test on my own Open Brain:
- Re-run after clearing a cached book (`DELETE FROM readwise_books WHERE book_id = X`).

Follow-ups (not in this PR)
- Add `metadata.book_source` to highlight rows — parallel to `book_title`/`book_author`/`book_category` — so source filtering in `search_thoughts` doesn't need a join. Requires a small UPDATE migration.
- Filter discarded highlights via `/api/v2/highlights/` with `is_discard=false`.

Checklist
- `metadata.json` has all required fields