Context
RSS feeds typically carry a short <description> (50–300 chars) while the real article body lives at item.url. Today Distillery's poller embeds only that short description — both RelevanceScorer.score and downstream semantic search run against the blurb, not the article. That systematically undervalues feed items whose descriptions don't mention the topics the article actually covers, and produces weak embeddings for /radar and /recall against RSS-sourced entries.
Jina's Reader API (https://r.jina.ai/<url>) returns clean, LLM-ready markdown for any public URL. Because the repo already integrates Jina for embeddings (src/distillery/embedding/jina.py), adding Reader reuses the same vendor, secret, and retry patterns.
Scope is RSS feed poller only. Bookmark pipeline, /gh-sync, and the GitHub events adapter are out of scope — revisit later if needed.
Decisions
- Storage: Entry.content is overwritten with Reader markdown when available; the original RSS <description> is preserved in metadata.original_content for provenance and debugging. Embeddings and search benefit from the full text.
- Auth: Reuse the existing JINA_API_KEY env var (same secret as JinaEmbeddingProvider). If the key is absent, Reader is silently disabled and the poller behaves exactly as today.
- Trigger: Call Reader only when item.content is shorter than a configurable threshold (default 500 chars) and item.url is non-empty. This skips feeds that already publish full <content> and saves Jina tokens.
- Failure mode: Reader failures never abort a poll cycle. On an error or empty response the poller falls back to the original RSS content and logs a warning, following the same per-source isolation principle as the existing _poll_source.
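The four decisions above reduce to three small rules. The following is a minimal sketch under stated assumptions, not the poller's actual code: the function names resolve_reader_key, should_enrich, and choose_content are hypothetical and exist only for illustration.

```python
from __future__ import annotations

import os


def resolve_reader_key(env_var: str = "JINA_API_KEY") -> str | None:
    """Auth rule: return the key, or None when unset or blank (Reader disabled)."""
    key = os.environ.get(env_var, "").strip()
    return key or None


def should_enrich(content: str | None, url: str | None, min_content_chars: int = 500) -> bool:
    """Trigger rule: only short content with a non-empty URL goes to Reader."""
    return bool(url) and len(content or "") < min_content_chars


def choose_content(rss_content: str, reader_markdown: str | None) -> tuple[str, dict[str, str]]:
    """Storage and failure-mode rules: prefer Reader markdown, else keep the RSS text."""
    if reader_markdown and reader_markdown.strip():
        # The original blurb is kept for provenance and debugging.
        return reader_markdown, {"original_content": rss_content}
    return rss_content, {}
```

Keeping these as pure functions would let the trigger and fallback rules be unit-tested without any HTTP mocking.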
Design
New module: src/distillery/feeds/reader.py
JinaReaderClient modeled on JinaEmbeddingProvider (src/distillery/embedding/jina.py:35-246):
- Uses httpx.AsyncClient (the poller is async, unlike the sync embedding client).
- GET https://r.jina.ai/{quoted_url} with Authorization: Bearer <key> and Accept: text/plain headers. Default response body is markdown; no other headers are required.
- Exponential backoff (max 2 retries) on 429 / 5xx / transport errors, honoring Retry-After via the shared extract_retry_after helper in src/distillery/embedding/errors.py.
- async fetch(url: str) -> str | None returns markdown on success and None on any failure (never raises). This keeps the poller's error surface unchanged.
- Bounded concurrency via asyncio.Semaphore (default limit 5) passed in from the poller, so one source with 40 items doesn't burst 40 parallel Reader calls.
- Structured logs including url, status, attempt, duration_ms, bytes.
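The retry contract described above can be sketched without a network dependency by injecting the transport. This is an illustration only: Reply, Send, build_request, fetch_markdown, the base_delay value, and the set of characters left unquoted are all assumptions, and a real JinaReaderClient would call httpx.AsyncClient and the shared extract_retry_after helper instead.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, NamedTuple, Optional
from urllib.parse import quote

READER_BASE = "https://r.jina.ai/"


class Reply(NamedTuple):
    """Hypothetical transport result; a real client would wrap httpx.Response."""

    status: int
    text: str
    retry_after: Optional[float] = None  # parsed Retry-After, in seconds


# Hypothetical injection point: (target_url, headers) -> Reply.
Send = Callable[[str, dict], Awaitable[Reply]]


def build_request(url: str, api_key: str) -> tuple[str, dict[str, str]]:
    """Build the Reader URL and headers; the `safe` set is an assumption."""
    target = READER_BASE + quote(url, safe=":/?&=#%")
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "text/plain"}
    return target, headers


async def fetch_markdown(
    send: Send,
    url: str,
    api_key: str,
    max_retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str | None:
    """Return markdown on success, None on transport or HTTP failure."""
    target, headers = build_request(url, api_key)
    for attempt in range(max_retries + 1):
        try:
            reply: Reply | None = await send(target, headers)
        except Exception:  # a real client would catch httpx.TransportError
            reply = None
        if reply is not None and reply.status == 200:
            # An empty body counts as a failure so the poller falls back.
            return reply.text if reply.text.strip() else None
        retryable = reply is None or reply.status == 429 or reply.status >= 500
        if not retryable or attempt == max_retries:
            return None
        delay = base_delay * (2**attempt)  # exponential backoff
        if reply is not None and reply.retry_after is not None:
            delay = max(delay, reply.retry_after)  # honor Retry-After
        await sleep(delay)
    return None
```

Injecting send and sleep is a design choice made for this sketch so the retry path can be exercised instantly in tests; non-retryable statuses such as 404 return None on the first attempt.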
Config extension: src/distillery/config.py
Add a ReaderConfig dataclass nested under feeds:
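from dataclasses import dataclass


@dataclass
class ReaderConfig:
    enabled: bool = False  # opt-in per deployment
    api_key_env: str = "JINA_API_KEY"  # env var holding the Jina key
    min_content_chars: int = 500  # enrich only when content is shorter than this
    timeout_seconds: float = 30.0
    max_retries: int = 2
    concurrency: int = 5  # limit on parallel Reader calls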
Mirror existing EmbeddingConfig parsing patterns (src/distillery/config.py:55-75, 412-428). Disabled by default — opt-in per deployment via feeds.reader.enabled: true in the YAML config. Validation: min_content_chars >= 0, timeout_seconds > 0, max_retries >= 0, concurrency >= 1.
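The four validation rules could be checked along these lines. This is a sketch rather than the parser's real shape: validate_reader_config is a hypothetical name, the error strings are invented, and SimpleNamespace stands in for ReaderConfig so the example runs on its own. How the existing EmbeddingConfig parser reports errors is not shown in this plan, so returning a list of problems is an assumption.

```python
from __future__ import annotations

from types import SimpleNamespace


def validate_reader_config(cfg) -> list[str]:
    """Return the violated rules for a ReaderConfig-like object (empty list = valid)."""
    problems: list[str] = []
    if cfg.min_content_chars < 0:
        problems.append("feeds.reader.min_content_chars must be >= 0")
    if cfg.timeout_seconds <= 0:
        problems.append("feeds.reader.timeout_seconds must be > 0")
    if cfg.max_retries < 0:
        problems.append("feeds.reader.max_retries must be >= 0")
    if cfg.concurrency < 1:
        problems.append("feeds.reader.concurrency must be >= 1")
    return problems


# SimpleNamespace stands in for ReaderConfig so the sketch is self-contained.
defaults = SimpleNamespace(min_content_chars=500, timeout_seconds=30.0, max_retries=2, concurrency=5)
bad = SimpleNamespace(min_content_chars=500, timeout_seconds=0, max_retries=2, concurrency=0)

print(validate_reader_config(defaults))  # []
print(validate_reader_config(bad))
```

Note that min_content_chars = 0 is valid and effectively disables enrichment, since no content is shorter than zero characters.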
Poller integration: src/distillery/feeds/poller.py
- FeedPoller.__init__ gains an optional reader: JinaReaderClient | None. Construction paths (MCP tool + webhook) read ReaderConfig and build the client only when enabled=True and the key resolves.
- In _poll_source (src/distillery/feeds/poller.py:618-730), after adapter.fetch() and before the dedup check, call a new helper _maybe_enrich_with_reader(items) that:
  - Filters to RSS items where len(item.content or "") < min_content_chars and item.url is non-empty.
  - Gathers reader.fetch(item.url) concurrently under the semaphore.
  - Returns a dict[item_id, str] of successful enrichments.
- Pass the dict to _item_to_entry_kwargs (src/distillery/feeds/poller.py:360-419). When an enrichment exists:
  - text = reader_markdown (still piped through truncate_content).
  - metadata["original_content"] = item.content.
  - metadata["enriched_by"] = "jina-reader"; metadata["enriched_at"] = <iso timestamp>.
- Source-type gate: only apply enrichment when source.source_type == "rss". The GitHub events adapter is untouched.
- PollResult gains items_enriched and enrichment_errors. Surface them in PollerSummary so /hooks/poll responses expose Reader activity.
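The enrichment flow above might look roughly like the following. Everything here is a hedged sketch: Item is a hypothetical stand-in for the poller's feed item, the helper signatures are assumed, the error count is one plausible way to derive enrichment_errors, and truncate_content is omitted for brevity.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Awaitable, Callable, Optional


@dataclass
class Item:
    """Hypothetical stand-in for the poller's feed item."""

    item_id: str
    url: str
    content: Optional[str]


Fetch = Callable[[str], Awaitable[Optional[str]]]  # shape of reader.fetch


async def maybe_enrich_with_reader(
    items: list[Item],
    fetch: Fetch,
    semaphore: asyncio.Semaphore,
    source_type: str = "rss",
    min_content_chars: int = 500,
) -> tuple[dict[str, str], int]:
    """Return (successful enrichments by item_id, number of failed attempts)."""
    if source_type != "rss":  # source-type gate
        return {}, 0
    candidates = [i for i in items if i.url and len(i.content or "") < min_content_chars]

    async def bounded(item: Item) -> Optional[str]:
        async with semaphore:  # caps parallel Reader calls
            return await fetch(item.url)

    results = await asyncio.gather(*(bounded(i) for i in candidates))
    enriched = {i.item_id: md for i, md in zip(candidates, results) if md}
    return enriched, len(candidates) - len(enriched)


def apply_enrichment(item: Item, enriched: dict[str, str]) -> tuple[str, dict[str, object]]:
    """Pick the entry text and the extra metadata for one item."""
    markdown = enriched.get(item.item_id)
    if markdown is None:
        return item.content or "", {}  # fall back to the RSS text
    return markdown, {
        "original_content": item.content,
        "enriched_by": "jina-reader",
        "enriched_at": datetime.now(timezone.utc).isoformat(),
    }
```

Because fetch is specified to return None rather than raise, asyncio.gather needs no return_exceptions handling here; if that contract were ever relaxed, one failing item could abort the whole batch.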
Critical files
- src/distillery/feeds/reader.py: new module
- src/distillery/config.py: add ReaderConfig, extend FeedsConfig, parser + validation
- src/distillery/feeds/poller.py: wire JinaReaderClient, enrichment helper, metadata plumbing in _item_to_entry_kwargs
- src/distillery/feeds/models.py: no changes required (enrichment is poller-local state)
- src/distillery/mcp/webhooks.py: construct FeedPoller with new reader param (/hooks/poll handler at lines 271-382)
- src/distillery/mcp/server.py: same wiring for the distillery_poll MCP tool
- distillery-dev.yaml: commented feeds.reader block showing defaults
Reused utilities
- src/distillery/embedding/errors.py::extract_retry_after: already-tested Retry-After parser.
- src/distillery/feeds/truncation.py::truncate_content: applied to enriched markdown before embedding.
- Structured-log pattern from src/distillery/embedding/jina.py:181-229.
Testing
- tests/test_feeds_reader.py: unit tests for JinaReaderClient using httpx.MockTransport:
  - Happy path 200 + markdown body → returns string.
  - 429 with Retry-After → retries, succeeds, backoff honored.
  - 5xx exhausted → returns None, no exception.
  - Transport error exhausted → returns None.
  - Missing API key → factory returns None.
  - Empty response body → returns None.
- Extend poller tests:
  - Reader disabled → Entry.content unchanged, no Reader calls, items_enriched == 0.
  - Reader enabled + short RSS description → Entry.content is Reader markdown, metadata.original_content holds RSS description, items_enriched == 1.
  - Reader enabled + RSS description >= threshold → Reader skipped.
  - Reader enabled + Reader returns None → Entry.content falls back to RSS description, enrichment_errors == 1, poll does not fail.
  - source.source_type == "github" → Reader never invoked regardless of config.
- All new unit tests mock httpx (no network). Strict mypy on src/distillery/feeds/reader.py.
Verification
- pytest -m unit passes with the new tests; coverage stays >= 80%.
- mypy --strict src/distillery/ passes.
- ruff check src/ tests/ and ruff format --check src/ tests/ are clean.
- End-to-end local run:
  - Set JINA_API_KEY.
  - Set feeds.reader.enabled: true in distillery-dev.yaml.
  - Configure a known RSS source with short descriptions.
  - Run curl -X POST http://localhost:8000/hooks/poll -H "Authorization: Bearer $DISTILLERY_WEBHOOK_TOKEN"; the response JSON includes items_enriched > 0.
  - Inspect DuckDB: entries from that source have metadata.original_content populated and Entry.content substantially longer than the original RSS description.
  - Flip feeds.reader.enabled: false, repeat, and confirm zero Reader calls.