Replace uReadability with markdown.new for content extraction #156

Closed
grayodesa wants to merge 2 commits into radio-t:master from grayodesa:feature/markdown-new-content-extraction

@grayodesa
Summary

  • Problem: uReadability (ureadability.radio-t.com) often fails to extract content from websites that block server-side requests or require JS rendering, resulting in error messages or stubs being posted as summaries in the chat.
  • Solution: Replace uReadability with markdown.new — a Cloudflare-powered service with a three-tier fallback pipeline (content negotiation → Workers AI → browser rendering) that handles JS-heavy and bot-protected sites reliably.
  • Changes:
    • New MarkdownNewClient in app/bot/openai/mdnew.go implementing the existing uKeeperGetter interface (drop-in replacement)
    • Parses markdown.new response format: Title: header line + YAML frontmatter + markdown body
    • Replaced UKeeperClient wiring in main.go; removed --ur-api/--ur-token flags, added --mdnew-api (default: https://markdown.new, env: MDNEW_API)
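
The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names `parseMarkdownNewResponse` and `stripFrontmatter` come from the test plan, but their signatures and the exact response layout (an optional `Title:` first line, then an optional frontmatter block fenced by `---` lines, then the markdown body) are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// parseMarkdownNewResponse splits a raw response into a title and a markdown body.
// Assumed layout (not confirmed by the PR): an optional "Title: ..." first line,
// an optional YAML frontmatter block fenced by "---" lines, then the body.
func parseMarkdownNewResponse(raw string) (title, body string) {
	rest := strings.TrimLeft(raw, "\r\n")
	if strings.HasPrefix(rest, "Title:") {
		line, after, _ := strings.Cut(rest, "\n")
		title = strings.TrimSpace(strings.TrimPrefix(line, "Title:"))
		rest = after
	}
	return title, strings.TrimSpace(stripFrontmatter(rest))
}

// stripFrontmatter removes a leading block delimited by two "---" lines.
// Input without a complete leading block is returned unchanged.
func stripFrontmatter(s string) string {
	trimmed := strings.TrimLeft(s, "\r\n \t")
	lines := strings.Split(trimmed, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return s
	}
	for i := 1; i < len(lines); i++ {
		if strings.TrimSpace(lines[i]) == "---" {
			return strings.Join(lines[i+1:], "\n")
		}
	}
	return s // no closing delimiter: leave the text untouched
}

func main() {
	raw := "Title: Hello\n\n---\nurl: https://example.com\n---\n\n# Heading\ntext\n"
	title, body := parseMarkdownNewResponse(raw)
	fmt.Printf("title=%q\nbody=%q\n", title, body)
}
```

One limitation of this sketch: a body that legitimately opens with a `---` horizontal rule and contains a second one later would be misread as frontmatter, so a real parser would likely want a stricter check.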

Deployment notes

  • Remove UREADABILITY_API and UREADABILITY_TOKEN env vars from deployment config
  • Optionally set MDNEW_API if a different endpoint is needed (default works out of the box, no auth required)

Test plan

  • Unit tests for MarkdownNewClient.Get() with mock HTTP server
  • Unit tests for response parser (parseMarkdownNewResponse, stripFrontmatter)
  • Verified against real markdown.new with https://openai.com/index/introducing-gpt-5-4/ — title and content extracted correctly (31K chars of clean markdown)
  • All existing tests pass (go test ./app/...)
  • End-to-end: run bot locally, send RTJC message with ⚠️ + link, verify summary appears in chat

🤖 Generated with Claude Code

grayodesa and others added 2 commits March 14, 2026 13:53
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
uReadability often fails to extract content from websites that block
server-side requests. markdown.new (Cloudflare-powered) provides a
three-tier fallback pipeline (content negotiation → Workers AI →
browser rendering) that handles JS-heavy and bot-protected sites.

- add MarkdownNewClient implementing uKeeperGetter interface
- parse markdown.new response format (Title header + YAML frontmatter)
- replace UKeeperClient wiring in main.go
- remove --ur-api/--ur-token flags, add --mdnew-api flag
- add unit tests for client and response parser

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@grayodesa grayodesa requested a review from umputun as a code owner March 14, 2026 11:54
Member

@umputun umputun left a comment

thx for the PR, the idea of improving content extraction makes sense. I tested markdown.new against real URLs from news.radio-t.com and it works for most sites (4/5), failing only on X/Twitter. couple concerns though:

  1. wrong level of abstraction - this change belongs in ukeeper-readability, not in super-bot. ukeeper is the content extraction layer, super-bot is the consumer. if we improve how ukeeper extracts content, the uKeeperGetter interface stays the same and super-bot doesn't change at all

  2. markdown.new is not a cloudflare product - it was built by an independent developer on top of cloudflare APIs. no SLA, no guarantees it stays up or keeps the same response format. cloudflare has an official Browser Rendering /markdown endpoint that does the same thing - fetches any URL, renders JS, returns clean markdown. it works on the free plan (10 min/day browser time, 1 req/10 sec rate limit) and doesn't require the target site to opt in or be on cloudflare. I tested it and it actually handles X/Twitter correctly (which markdown.new can't), returns clean JSON, and doesn't need custom response parsing

  3. CLAUDE.md rewrite should be a separate PR, unrelated to the content extraction change. it also drops some useful project-specific guidelines
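
to illustrate point 1, here is a rough sketch of why the consumer does not need to change when extraction improves behind the interface. The `Get` signature below is an assumption (the real `uKeeperGetter` definition is not shown in this thread), and `summarizer` and `fakeExtractor` are hypothetical stand-ins, not types from super-bot or ukeeper-readability.

```go
package main

import (
	"errors"
	"fmt"
)

// uKeeperGetter mirrors the interface named in the PR.
// The method signature is an assumption made for this sketch.
type uKeeperGetter interface {
	Get(link string) (title, content string, err error)
}

// summarizer is a hypothetical consumer; it depends only on the interface.
type summarizer struct {
	getter uKeeperGetter
}

// Describe fetches a link through the getter and reports what was extracted.
func (s summarizer) Describe(link string) (string, error) {
	title, content, err := s.getter.Get(link)
	if err != nil {
		return "", fmt.Errorf("extract %s: %w", link, err)
	}
	return fmt.Sprintf("%s (%d chars)", title, len(content)), nil
}

// fakeExtractor is a hypothetical implementation; a real one would call
// the extraction service, whose internals can change freely.
type fakeExtractor struct {
	title, content string
	fail           bool
}

func (f fakeExtractor) Get(string) (string, string, error) {
	if f.fail {
		return "", "", errors.New("extraction failed")
	}
	return f.title, f.content, nil
}

func main() {
	s := summarizer{getter: fakeExtractor{title: "Example", content: "some text"}}
	fmt.Println(s.Describe("https://example.com"))
}
```

swapping the extraction backend then means changing only what sits behind `Get`, which is the argument for doing the work in ukeeper-readability.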

the real problem here is sites behind cloudflare protection returning "just a moment..." to ureadability. the right fix would be upgrading ukeeper to use cloudflare's Browser Rendering /markdown API (POST /accounts/{id}/browser-rendering/markdown with {"url": "..."}) - official, stable, handles JS rendering, works with any URL
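
a minimal sketch of that call in Go, under stated assumptions: the base URL `https://api.cloudflare.com/client/v4`, bearer-token auth, and a JSON response shaped like `{"success": true, "result": "<markdown>"}` are not spelled out in this thread and should be checked against the Cloudflare docs. The helper names `newMarkdownRequest` and `parseMarkdownResult` are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// cloudflareAPI is the assumed base URL of the Cloudflare REST API.
const cloudflareAPI = "https://api.cloudflare.com/client/v4"

// newMarkdownRequest builds POST {base}/accounts/{id}/browser-rendering/markdown
// with a {"url": "..."} body. Bearer-token auth is an assumption of this sketch.
func newMarkdownRequest(baseURL, accountID, token, target string) (*http.Request, error) {
	payload, err := json.Marshal(map[string]string{"url": target})
	if err != nil {
		return nil, fmt.Errorf("encode payload: %w", err)
	}
	endpoint := fmt.Sprintf("%s/accounts/%s/browser-rendering/markdown", baseURL, accountID)
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// markdownResponse is the assumed response envelope.
type markdownResponse struct {
	Success bool   `json:"success"`
	Result  string `json:"result"`
}

// parseMarkdownResult extracts the markdown text from a response body.
func parseMarkdownResult(body []byte) (string, error) {
	var resp markdownResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decode response: %w", err)
	}
	if !resp.Success {
		return "", errors.New("browser rendering request was not successful")
	}
	return resp.Result, nil
}

func main() {
	// ACCOUNT_ID and API_TOKEN are placeholders; nothing is sent over the network here.
	req, err := newMarkdownRequest(cloudflareAPI, "ACCOUNT_ID", "API_TOKEN", "https://example.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())

	md, err := parseMarkdownResult([]byte(`{"success":true,"result":"# Example"}`))
	fmt.Println(md, err)
}
```

to actually fetch a page, the request would go through an `http.Client` with a timeout, and the caller would need to respect the free-plan rate limit mentioned above.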

@umputun
Member

umputun commented Apr 12, 2026

closing this — the content extraction improvement was addressed at the proper level in ukeeper-readability, which is where this logic belongs. thx for the idea though, it pointed to a real problem.

@umputun umputun closed this Apr 12, 2026