[integrations] Add Readwise highlight capture (backfill + webhook + schema)#237

Open
mlava wants to merge 11 commits into NateBJones-Projects:main from mlava:contrib/mlava/readwise

Conversation

@mlava mlava commented Apr 24, 2026

Contribution Type

  • Recipe (/recipes/readwise-import)
  • Schema (/schemas/readwise-books)
  • Integration (/integrations/readwise-capture)

What does this do?

Adds end-to-end Readwise highlight capture in three cooperating pieces:

  1. schemas/readwise-books — cache table for book-level metadata (title, author, category, source, cover, highlight count) keyed by Readwise's user_book_id. Plus get_book_highlights and increment_book_highlight_count RPCs.

  2. recipes/readwise-import — Python backfill that pages through /api/v2/export/, upserts each book, and inserts highlights as thoughts with source_type='readwise'. It is idempotent and filter-aware (by book, source, category, date range, or --list-books discovery), batch-embeds highlight text, and recovers from Supabase statement timeouts by splitting the insert batch.

  3. integrations/readwise-capture — Supabase Edge Function subscribing to readwise.highlight.created. Covers every source Readwise aggregates (Kindle, Reader, Twitter, Hypothesis, Instapaper, Apple Books, podcasts, Roam, physical books via OCR, etc.). Uses the cache as a write-through lookup.

Shipping together because the integration and recipe both depend on the schema.

Requirements

  • Services: Readwise account (Free tier works — webhook creation is available on all plans)
  • Tools: Supabase CLI, Python 3.10+
  • APIs: Readwise /api/v2/export/ and /api/v2/books/, OpenRouter embeddings (existing OB dependency)

Testing

Full end-to-end test on my own Open Brain:

  • Schema: applied cleanly; idempotent.
  • Backfill: imported 11,173 highlights / 656 books in under 4 min. Filter paths (source, category, date-range, book-id, list-books) all verified. Mid-run statement_timeout exercised and recovered via batch split.
  • Webhook: deployed, registered, verified live capture from a fresh Roam→Readwise sync. Book-metadata lookup and cache upsert both verified end-to-end (including cache-miss path after DELETE FROM readwise_books WHERE book_id = X).

Follow-ups (not in this PR)

  1. Add metadata.book_source to highlight rows — parallel to book_title/book_author/book_category — so source filtering in search_thoughts doesn't need a join. Requires a small UPDATE migration.
  2. Optional: top up ~1% export-vs-highlights gap via /api/v2/highlights/ with is_discard=false.

Checklist

  • I've read CONTRIBUTING.md
  • Each contribution has a README with prerequisites, steps, and expected outcome
  • Each metadata.json has all required fields
  • Tested on my own Open Brain instance
  • No credentials or secrets included

mlava and others added 11 commits April 24, 2026 07:40
Adds a side-table of Readwise book-level metadata keyed by user_book_id,
so highlights stored in thoughts can reference a book without
denormalising title/author into every highlight row. Includes two RPCs:

- get_book_highlights: returns a book's highlights in source-location
  order so a reader can review them in the order they were encountered.
- increment_book_highlight_count: keeps num_highlights and
  last_highlight_at fresh without a COUNT over thoughts.

Required by the forthcoming readwise-capture integration and
readwise-import recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supabase Edge Function that subscribes to readwise.highlight.created
webhooks and inserts each highlight as a thought with
source_type='readwise'. Every highlight Readwise aggregates -- from
Kindle, Apple Books, Reader, Instapaper, Hypothesis, Airr/Snipd
podcasts, and the Readwise OCR app -- flows through the same event
and lands in one searchable feed.

Design choices:
- Auth via body.secret echoed by Readwise, compared against
  READWISE_WEBHOOK_SECRET.
- Other event types (Reader document events, tag updates) are ignored
  with a 200 response so subscribing more broadly doesn't break us.
- Dedup query filters on source_type first to use the index from
  enhanced-thoughts before the metadata JSONB contains check.
- Book metadata resolved via a write-through cache in readwise_books
  so repeat highlights from the same book don't re-hit the Readwise
  API.
- No per-highlight LLM metadata extraction; book title/author plus
  the highlight text are the metadata. Topic enrichment is left to
  the existing thought-enrichment recipe as a separate pass.

Depends on the readwise-books schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
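The write-through cache described above can be sketched as follows. The real handler is a TypeScript Edge Function; this is the same lookup logic restated in Python, with the cache and the Readwise API injected as callables (all names here are illustrative, not the function's actual API):

```python
def resolve_book(book_id, cache_get, cache_put, fetch_book):
    """Return book metadata, hitting the Readwise API only on a cache miss.

    cache_get/cache_put wrap the readwise_books table; fetch_book wraps
    a GET to /api/v2/books/{book_id}.
    """
    cached = cache_get(book_id)
    if cached is not None:
        return cached                # repeat highlights from this book skip the API
    book = fetch_book(book_id)       # cache miss: one API round-trip
    cache_put(book_id, book)         # write-through: store before returning
    return book
```

Because the cache is populated on first sight of a book, a burst of highlights from the same source costs one metadata lookup, not one per highlight.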
Python script that pages through Readwise's /api/v2/export/ endpoint,
upserts each book into readwise_books, batch-embeds highlight text
via OpenRouter, and inserts each highlight into thoughts with
source_type='readwise'. Idempotent on re-run via
readwise_highlight_id dedup.

Design choices:
- Uses /export/ (grouped books with nested highlights) rather than
  /highlights/ so no separate book lookup per highlight.
- Batches embeddings 100 at a time; OpenAI accepts up to 2048 per
  call but 100 keeps latency reasonable and runs ~20x faster than
  serial.
- --dry-run and --limit for first-run sanity checks before embedding
  thousands of rows.
- Periodic heartbeat every 500 highlights so long runs don't look
  hung, independent of --verbose which prints per-book.
- Respects Retry-After on 429s (Readwise caps /export/ at 20 req/min).

Captures highlights only -- not Reader reading history. Reader
documents fire through readwise.reader.* webhooks which this recipe
and the matching readwise-capture integration deliberately skip.
That split is documented in both READMEs.

Depends on the readwise-books schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
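The paging-plus-rate-limit loop the recipe describes looks roughly like this. It is a sketch, not the script itself: the HTTP call is injected so the shape is testable, and the `updatedAfter`/`pageCursor`/`nextPageCursor` names follow Readwise's export API as described above:

```python
import time

def export_pages(get, updated_after=None, sleep=time.sleep):
    """Yield result pages from /api/v2/export/, following nextPageCursor
    and honoring Retry-After on 429s (the endpoint is capped at ~20 req/min).

    `get(params)` is injected and returns (status, headers, json_body).
    """
    params = {}
    if updated_after:
        params["updatedAfter"] = updated_after
    while True:
        status, headers, body = get(dict(params))
        if status == 429:
            sleep(int(headers.get("Retry-After", 60)))
            continue                          # retry the same page
        yield body["results"]
        cursor = body.get("nextPageCursor")
        if not cursor:
            break
        params["pageCursor"] = cursor
```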
Expands import-readwise.py with a set of filters so users can run
targeted backfills instead of being forced to ingest everything at
once:

- --book-id ID (repeatable): limit to specific Readwise book_ids
- --source NAME (repeatable): limit to sources (kindle, reader, ...)
- --category NAME (repeatable): limit to categories (books, articles,
  podcasts, tweets, supplementals)
- --highlighted-after / --highlighted-before: date-range filter on
  when the highlight was made (the usual mental model)
- --updated-before: upper bound on the record update timestamp, pair
  with the existing --updated-after for reconciliation use cases
- --list-books: print a TSV of book_id/count/source/category/title
  and exit; discovery helper for grabbing --book-id values

All new filters are AND-combined and apply client-side after fetching
pages from /api/v2/export/. The export endpoint only accepts
updatedAfter server-side, so client-side is the only place additional
predicates can live. Pagination still happens across the full library,
but skipped books are a no-op -- no embedding, no insert.

Highlight-level filters exclude highlights where highlighted_at is
null (tweets, some podcast snippets) when any --highlighted-* flag
is set, since we can't place them in time.

Summary output now distinguishes three kinds of not-inserted
highlights: books skipped by filter, highlights filtered by date
range, and highlights already present.

README updated with a "Selective backfill" section that documents
every flag, explains highlighted_at vs updated semantics, and
includes example invocations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
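The AND-combined client-side filtering described above reduces to two small predicates, sketched here with field names taken from the PR description (`user_book_id`, `highlighted_at`); treat the function signatures as illustrative:

```python
def book_passes(book, book_ids=None, sources=None, categories=None):
    """AND-combine the book-level filters; any unset filter passes."""
    if book_ids and book["user_book_id"] not in book_ids:
        return False
    if sources and book.get("source") not in sources:
        return False
    if categories and book.get("category") not in categories:
        return False
    return True

def highlight_passes(h, after=None, before=None):
    """Date-range filter on highlighted_at. Highlights with no timestamp
    (tweets, some podcast snippets) are excluded once any bound is set,
    since they can't be placed in time."""
    if after is None and before is None:
        return True
    ts = h.get("highlighted_at")
    if ts is None:
        return False
    return (after is None or ts >= after) and (before is None or ts <= before)
```

ISO-8601 timestamps compare correctly as strings, so the date-range check needs no parsing.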
Two issues visible on the first real run:

1. BrokenPipeError when piping into `head`. Restore default SIGPIPE
   handling on Unix so the script exits cleanly instead of raising
   mid-print. No-op on platforms without SIGPIPE (Windows).

2. num_highlights column showed 0 for every book. The /api/v2/export/
   endpoint doesn't populate that field on book records; it's on
   /api/v2/books/ instead. Since /export/ already returns every
   highlight nested under its book, count len(book["highlights"])
   directly -- authoritative, no extra API calls, and correctly
   reports the subset count when --updated-after is set.

Renamed the column header from "num_highlights" to "highlights" to
match the new semantics, and added a trailing total-books +
total-highlights line to stderr so the summary doesn't get swallowed
by downstream pipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
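The SIGPIPE fix from item 1 is a few lines; a minimal version (function name illustrative):

```python
import signal

def restore_sigpipe():
    """Restore default SIGPIPE handling on Unix so `script | head` exits
    cleanly instead of raising BrokenPipeError mid-print. CPython installs
    its own SIGPIPE handler at startup; this puts the OS default back.
    No-op on platforms without SIGPIPE (Windows)."""
    if hasattr(signal, "SIGPIPE"):
        signal.signal(signal.SIGPIPE, signal.SIG_DFL)
```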
upsert_book was trusting book["num_highlights"] from the /export/
response, which isn't populated on that endpoint -- every backfilled
book row ended up with num_highlights=0 despite correctly imported
highlights in the thoughts table. Cosmetic (no impact on search or
get_book_highlights), but breaks any dashboard that reads the count.

Count from len(book["highlights"]) in the inline response instead.
For full backfills that's the authoritative total. On --updated-after
runs the array is a subset, so we skip the field entirely; the
upsert's ON CONFLICT DO UPDATE only touches the columns we set, so
any count from a previous full run is preserved.

Verified on a re-run for book 52113662: upsert with 47 now sets
num_highlights=47, matching the count of inserted rows in thoughts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
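The conditional-count behavior can be sketched as a payload builder. Column and field names follow the PR description; this is a sketch of the fix, not the script's exact upsert code:

```python
def book_row(book, updated_after=None):
    """Build the readwise_books upsert payload from an /export/ book record.

    book['num_highlights'] is not populated on /export/, so count the
    inline highlights array instead. On --updated-after runs that array
    is only a subset, so the count is omitted entirely; the upsert's
    ON CONFLICT DO UPDATE only touches the columns present, preserving
    any count from a previous full run.
    """
    row = {
        "book_id": book["user_book_id"],
        "title": book.get("title"),
        "author": book.get("author"),
        "category": book.get("category"),
        "source": book.get("source"),
    }
    if updated_after is None:
        row["num_highlights"] = len(book.get("highlights", []))
    return row
```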
A full backfill hit Postgres error 57014 (canceling statement due to
statement timeout) on the thoughts table insert. Supabase's default
statement_timeout for the authenticated role is ~8s, which a 100-row
batch of 1536-dim vectors can blow past when pgvector does index
maintenance on top of the insert itself.

Two changes:

1. Decouple insert batch size from embedding batch size. Embedding
   calls stay at 100 (OpenRouter API, throughput-bound). Inserts drop
   to 25 (DB-write-bound). For a typical thoughts table with an
   ivfflat/hnsw index, 25-row inserts land comfortably under the
   timeout.

2. Add insert_thoughts() that catches APIError with code 57014 and
   splits the batch in half, retrying each side. Bottoms out at
   single-row inserts if the split chain continues; raises on any
   non-timeout error so real failures aren't swallowed.

Partial runs are recoverable: rows inserted before the timeout will
be skipped on re-run via the readwise_highlight_id dedup check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
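The split-and-retry strategy from item 2 is a classic recursive bisection; a self-contained sketch (function name and the injected `insert`/`is_timeout` callables are illustrative, not the script's actual API):

```python
def insert_with_split(rows, insert, is_timeout, min_size=1):
    """Insert `rows` as one batch; on a statement-timeout error (Postgres
    code 57014), split the batch in half and retry each side, bottoming
    out at single-row inserts. Non-timeout errors propagate unchanged so
    real failures aren't swallowed."""
    if not rows:
        return
    try:
        insert(rows)
    except Exception as e:
        if not is_timeout(e) or len(rows) <= min_size:
            raise
        mid = len(rows) // 2
        insert_with_split(rows[:mid], insert, is_timeout, min_size)
        insert_with_split(rows[mid:], insert, is_timeout, min_size)
```

Row order is preserved across splits, and because re-runs dedup on readwise_highlight_id, even a raise partway through leaves a recoverable state.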
Readwise's "Test Webhook" button validates the URL by sending a
request with an empty body; req.json() crashed on the "Unexpected
end of JSON input" and returned 500, which Readwise's setup flow
treats as a failed test and blocks saving the webhook.

Three small additions in the handler:

- GET requests return 200 "readwise-capture is live" for uptime
  probes or casual curl-checks.
- Empty POST bodies return 200 "ok (empty body)" so the Readwise
  test validator can pass. Authentic event payloads always include
  secret + event_type, so nothing downstream breaks from this.
- Non-JSON POST bodies return 400 with a logged snippet (truncated
  to 500 chars) so we can diagnose if Readwise or something else
  ever sends a malformed payload.

Real events still go through the secret check and event-type filter
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
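The guard ordering above matters: the empty-body check must run before the secret check or Readwise's test probe would 401. The real handler is TypeScript; here is the same early-exit logic as a pure Python function (names illustrative) returning a (status, message) pair, or None when the payload should proceed to the secret check:

```python
import json

def classify_request(method, raw_body):
    """Early-exit guards from the webhook handler, in order."""
    if method == "GET":
        return (200, "readwise-capture is live")   # uptime probes, curl checks
    if not raw_body.strip():
        return (200, "ok (empty body)")            # Readwise's test validator
    try:
        json.loads(raw_body)
    except json.JSONDecodeError:
        return (400, "invalid JSON")               # logged with truncated snippet
    return None                                    # real event: secret check next
```

Returning 200 for empty bodies is safe because authentic event payloads always carry secret + event_type, so nothing downstream ever sees an empty body.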
The webhook creation page is at readwise.io/webhook, not
readwise.io/integrations as previously written. Verified against
Readwise's live UI during setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README instructed users to generate their own webhook secret with
openssl rand -base64 32 and paste it into both Readwise and Supabase.
In practice Readwise generates the secret itself and does not accept a
user-provided value -- the supplied field was silently ignored, so
Readwise's secret and ours never matched, and every real event 401'd
while the empty-body test passed.

Restructured Steps 2-5:

- Removed old Step 2 ("Generate a Webhook Secret"). Deferred setting
  READWISE_WEBHOOK_SECRET until we have the value Readwise generates.
- Deploy (now Step 2) only sets OPENROUTER_API_KEY and
  READWISE_ACCESS_TOKEN. Added a note that real webhooks will 401
  until Step 4 configures the secret; the test-webhook button still
  passes because the empty-body guard runs before the secret check.
- Register (now Step 3) drops the "paste a secret" instruction;
  describes the flow where Readwise shows a generated secret you then
  copy into the tracker.
- New Step 4 stores Readwise's generated secret in Supabase with a
  note that Edge Function secrets are read at runtime, so no
  redeploy is needed.
- Credential tracker moved "Webhook secret" from an input to a
  "generated during setup" value.

Verified end-to-end: after setting the Readwise-generated value as
READWISE_WEBHOOK_SECRET, webhook events start arriving and the secret
check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously stated webhooks required a paid plan ("Readwise Free
doesn't expose webhooks"). Verified against a fresh Free-tier account:
readwise.io/webhook is available and webhook creation + test succeeds
on Free plans. Reworded prerequisites and cost section to reflect
this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added labels Apr 24, 2026: integration (Contribution: MCP extension or capture source), recipe (Contribution: step-by-step recipe), schema (Contribution: database extension)