feat(feeds): enrich RSS poller content via Jina Reader API #403

@norrietaylor

Context

RSS feeds typically carry a short <description> (50–300 chars) while the real article body lives at item.url. Today Distillery's poller embeds only that short description — both RelevanceScorer.score and downstream semantic search run against the blurb, not the article. That systematically undervalues feed items whose descriptions don't mention the topics the article actually covers, and produces weak embeddings for /radar and /recall against RSS-sourced entries.

Jina's Reader API (https://r.jina.ai/<url>) returns clean, LLM-ready markdown for any public URL. Because the repo already integrates Jina for embeddings (src/distillery/embedding/jina.py), adding Reader reuses the same vendor, secret, and retry patterns.

Scope is RSS feed poller only. Bookmark pipeline, /gh-sync, and the GitHub events adapter are out of scope — revisit later if needed.

Decisions

  • Storage: Entry.content is overwritten with Reader markdown when available; the original RSS <description> is preserved in metadata.original_content for provenance and debugging. Embeddings and search benefit from the full text.
  • Auth: Reuse existing JINA_API_KEY env var (same secret as JinaEmbeddingProvider). If the key is absent, Reader is silently disabled and the poller behaves exactly as today.
  • Trigger: Call Reader only when item.content is shorter than a configurable threshold (default 500 chars) and item.url is non-empty. Skips feeds that already publish full <content> and saves Jina tokens.
  • Failure mode: Reader failures never abort a poll cycle. On an error or an empty response, the poller falls back to the original RSS content and logs a warning, following the same per-source isolation principle as the existing _poll_source.
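
The Auth and Trigger decisions above reduce to two small checks. A minimal sketch follows, assuming hypothetical helper names (resolve_api_key, should_enrich) that do not exist in the repo; only the JINA_API_KEY variable and the 500-character default come from this issue.

```python
from __future__ import annotations

import os
from typing import Mapping


def resolve_api_key(
    env_var: str = "JINA_API_KEY",
    environ: Mapping[str, str] | None = None,
) -> str | None:
    """Return the API key, or None when it is absent (Reader stays disabled)."""
    env = os.environ if environ is None else environ
    key = env.get(env_var, "").strip()
    return key or None


def should_enrich(
    content: str | None,
    url: str | None,
    min_content_chars: int = 500,
) -> bool:
    """True when an RSS item has a URL and its content is below the threshold."""
    has_url = bool(url and url.strip())
    is_short = len(content or "") < min_content_chars
    return has_url and is_short
```

With these defaults, an item whose content is exactly 500 characters is not enriched, because the comparison is strictly "shorter than".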

Design

New module: src/distillery/feeds/reader.py

JinaReaderClient modeled on JinaEmbeddingProvider (src/distillery/embedding/jina.py:35-246):

  • Uses httpx.AsyncClient (the poller is async, unlike the sync embedding client).
  • GET https://r.jina.ai/{quoted_url} with Authorization: Bearer <key> and Accept: text/plain headers. Default response body is markdown — no other headers required.
  • Exponential backoff (max 2 retries) on 429 / 5xx / transport errors, honoring Retry-After via the shared extract_retry_after helper in src/distillery/embedding/errors.py.
  • async fetch(url: str) -> str | None — returns markdown on success, None on any failure (never raises). This keeps the poller's error surface unchanged.
  • Bounded concurrency via asyncio.Semaphore (default limit 5) passed in from the poller, so one source with 40 items doesn't burst 40 parallel Reader calls.
  • Structured logs including url, status, attempt, duration_ms, bytes.
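
The retry behaviour described above could look roughly like the sketch below. To keep it runnable without a network, the HTTP layer is abstracted behind a hypothetical `send` callable; in the real module that role would be played by httpx.AsyncClient. The `Reply` type, the `reader_url` quoting rule, and the `base_delay` parameter are assumptions for illustration, and transport errors are modelled as OSError / asyncio.TimeoutError.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, NamedTuple, Optional
from urllib.parse import quote

READER_BASE = "https://r.jina.ai/"


class Reply(NamedTuple):
    """Hypothetical, transport-agnostic view of one HTTP response."""
    status: int
    body: str
    retry_after: Optional[float] = None  # seconds, already parsed from Retry-After


# Hypothetical transport: performs one GET against the given URL.
Send = Callable[[str], Awaitable[Reply]]


def reader_url(url: str) -> str:
    # Assumption: percent-encode unsafe characters but keep URL delimiters.
    return READER_BASE + quote(url, safe=":/?&=#%")


async def fetch(
    send: Send,
    url: str,
    max_retries: int = 2,
    base_delay: float = 0.5,
) -> Optional[str]:
    """Return markdown on success, None after a non-retryable or exhausted failure."""
    for attempt in range(max_retries + 1):
        reply: Optional[Reply]
        try:
            reply = await send(reader_url(url))
        except (OSError, asyncio.TimeoutError):
            reply = None  # transport error: treat as retryable
        if reply is not None:
            if reply.status == 200:
                # An empty body counts as a failure, not as an enrichment.
                return reply.body if reply.body.strip() else None
            if reply.status != 429 and reply.status < 500:
                return None  # other 4xx responses are not retried
        if attempt < max_retries:
            delay = base_delay * (2 ** attempt)  # exponential backoff
            if reply is not None and reply.retry_after is not None:
                delay = reply.retry_after  # server hint takes precedence
            await asyncio.sleep(delay)
    return None
```

Returning None instead of raising keeps the decision about fallback in the poller, which only has to check for a missing value.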

Config extension: src/distillery/config.py

Add a ReaderConfig dataclass nested under feeds:

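from dataclasses import dataclass


@dataclass
class ReaderConfig:
    enabled: bool = False              # opt-in per deployment
    api_key_env: str = "JINA_API_KEY"  # env var holding the shared Jina key
    min_content_chars: int = 500       # enrich only items shorter than this
    timeout_seconds: float = 30.0      # per-request timeout
    max_retries: int = 2               # retries on 429 / 5xx / transport errors
    concurrency: int = 5               # max parallel Reader calls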

Mirror existing EmbeddingConfig parsing patterns (src/distillery/config.py:55-75, 412-428). Disabled by default — opt-in per deployment via feeds.reader.enabled: true in the YAML config. Validation: min_content_chars >= 0, timeout_seconds > 0, max_retries >= 0, concurrency >= 1.
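
The validation rules above can be expressed directly on the dataclass. This is only a sketch: the actual parser should follow the existing EmbeddingConfig patterns, and `parse_reader_config` is a hypothetical name that stands in for whatever function reads the feeds.reader mapping from YAML.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass
class ReaderConfig:
    enabled: bool = False
    api_key_env: str = "JINA_API_KEY"
    min_content_chars: int = 500
    timeout_seconds: float = 30.0
    max_retries: int = 2
    concurrency: int = 5

    def __post_init__(self) -> None:
        # Validation rules as listed in this issue.
        if self.min_content_chars < 0:
            raise ValueError("feeds.reader.min_content_chars must be >= 0")
        if self.timeout_seconds <= 0:
            raise ValueError("feeds.reader.timeout_seconds must be > 0")
        if self.max_retries < 0:
            raise ValueError("feeds.reader.max_retries must be >= 0")
        if self.concurrency < 1:
            raise ValueError("feeds.reader.concurrency must be >= 1")


def parse_reader_config(raw: Mapping[str, Any] | None) -> ReaderConfig:
    """Build a ReaderConfig from the (possibly missing) feeds.reader mapping."""
    raw = raw or {}
    known = {f.name for f in fields(ReaderConfig)}
    unknown = set(raw) - known
    if unknown:
        # Assumption: unknown keys are rejected rather than ignored.
        raise ValueError(f"unknown feeds.reader keys: {sorted(unknown)}")
    return ReaderConfig(**raw)
```

A missing feeds.reader block yields the defaults, so existing deployments keep Reader disabled.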

Poller integration: src/distillery/feeds/poller.py

  1. FeedPoller.__init__ gains an optional reader: JinaReaderClient | None. Construction paths (MCP tool + webhook) read ReaderConfig and build the client only when enabled=True and the key resolves.
  2. In _poll_source (src/distillery/feeds/poller.py:618-730), after adapter.fetch() and before the dedup check, call a new helper _maybe_enrich_with_reader(items) that:
    • Filters to RSS items where len(item.content or "") < min_content_chars and item.url is non-empty.
    • Gathers reader.fetch(item.url) concurrently under the semaphore.
    • Returns a dict[item_id, str] of successful enrichments.
  3. Pass the dict to _item_to_entry_kwargs (src/distillery/feeds/poller.py:360-419). When an enrichment exists:
    • text = reader_markdown (still piped through truncate_content).
    • metadata["original_content"] = item.content.
    • metadata["enriched_by"] = "jina-reader"; metadata["enriched_at"] = <iso timestamp>.
  4. Source-type gate: only apply enrichment when source.source_type == "rss". GitHub events adapter untouched.
  5. PollResult gains items_enriched and enrichment_errors. Surface them in PollerSummary so /hooks/poll responses expose Reader activity.
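
Steps 2 and 3 above might be sketched as follows. `Item` is a hypothetical stand-in for the poller's feed item type, `fetch` stands in for JinaReaderClient.fetch, and the function names are illustrative. The source-type gate from step 4 is assumed to be applied by the caller before this helper runs.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Awaitable, Callable, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a fetched RSS item."""
    item_id: str
    url: str
    content: Optional[str]


# Stand-in for JinaReaderClient.fetch: markdown on success, None on failure.
Fetch = Callable[[str], Awaitable[Optional[str]]]


async def maybe_enrich(
    items: list[Item],
    fetch: Fetch,
    min_content_chars: int = 500,
    concurrency: int = 5,
) -> dict[str, str]:
    """Fetch Reader markdown for short items; return {item_id: markdown}."""
    semaphore = asyncio.Semaphore(concurrency)
    candidates = [
        item for item in items
        if item.url and len(item.content or "") < min_content_chars
    ]

    async def fetch_one(item: Item) -> tuple[str, Optional[str]]:
        async with semaphore:  # caps the number of in-flight Reader calls
            return item.item_id, await fetch(item.url)

    results = await asyncio.gather(*(fetch_one(item) for item in candidates))
    return {item_id: markdown for item_id, markdown in results if markdown}


def apply_enrichment(item: Item, enriched: dict[str, str]) -> tuple[str, dict[str, Any]]:
    """Return (text, extra_metadata) for the entry built from this item."""
    markdown = enriched.get(item.item_id)
    if markdown is None:
        return item.content or "", {}  # fall back to the RSS description
    return markdown, {
        "original_content": item.content,
        "enriched_by": "jina-reader",
        "enriched_at": datetime.now(timezone.utc).isoformat(),
    }
```

In the poller, the returned text would still pass through truncate_content before embedding, and the number of candidates minus the number of successful enrichments would feed enrichment_errors.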

Critical files

  • src/distillery/feeds/reader.py — new module
  • src/distillery/config.py — add ReaderConfig, extend FeedsConfig, parser + validation
  • src/distillery/feeds/poller.py — wire JinaReaderClient, enrichment helper, metadata plumbing in _item_to_entry_kwargs
  • src/distillery/feeds/models.py — no changes required (enrichment is poller-local state)
  • src/distillery/mcp/webhooks.py — construct FeedPoller with the new reader param (/hooks/poll handler at lines 271-382)
  • src/distillery/mcp/server.py — same wiring for the distillery_poll MCP tool
  • distillery-dev.yaml — commented feeds.reader block showing defaults

Reused utilities

  • src/distillery/embedding/errors.py::extract_retry_after — already-tested Retry-After parser.
  • src/distillery/feeds/truncation.py::truncate_content — applied to enriched markdown before embedding.
  • Structured-log pattern from src/distillery/embedding/jina.py:181-229.

Testing

  • tests/test_feeds_reader.py — unit tests for JinaReaderClient using httpx.MockTransport:
    • Happy path 200 + markdown body → returns string.
    • 429 with Retry-After → retries, succeeds, backoff honored.
    • 5xx exhausted → returns None, no exception.
    • Transport error exhausted → returns None.
    • Missing API key → factory returns None.
    • Empty response body → returns None.
  • Extend poller tests:
    • Reader disabled → Entry.content unchanged, no Reader calls, items_enriched == 0.
    • Reader enabled + short RSS description → Entry.content is Reader markdown, metadata.original_content holds RSS description, items_enriched == 1.
    • Reader enabled + RSS description >= threshold → Reader skipped.
    • Reader enabled + Reader returns None → Entry.content falls back to RSS description, enrichment_errors == 1, poll does not fail.
    • source.source_type == "github" → Reader never invoked regardless of config.

All new unit tests mock httpx (no network). Strict mypy on src/distillery/feeds/reader.py.

Verification

  1. pytest -m unit — new tests pass; coverage stays >= 80%.
  2. mypy --strict src/distillery/ — passes.
  3. ruff check src/ tests/ and ruff format --check src/ tests/ — clean.
  4. End-to-end local run:
    • Set JINA_API_KEY.
    • feeds.reader.enabled: true in distillery-dev.yaml.
    • Configure a known RSS source with short descriptions.
    • curl -X POST http://localhost:8000/hooks/poll -H "Authorization: Bearer $DISTILLERY_WEBHOOK_TOKEN". The response JSON should include items_enriched > 0.
    • Inspect DuckDB: entries from that source have metadata.original_content populated and Entry.content substantially longer than the original RSS description.
  5. Flip feeds.reader.enabled: false, repeat — confirm zero Reader calls.
