Replace uReadability with markdown.new for content extraction #156
grayodesa wants to merge 2 commits into radio-t:master
uReadability often fails to extract content from websites that block server-side requests. markdown.new (Cloudflare-powered) provides a three-tier fallback pipeline (content negotiation → Workers AI → browser rendering) that handles JS-heavy and bot-protected sites.

- add MarkdownNewClient implementing uKeeperGetter interface
- parse markdown.new response format (Title header + YAML frontmatter)
- replace UKeeperClient wiring in main.go
- remove --ur-api/--ur-token flags, add --mdnew-api flag
- add unit tests for client and response parser

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
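The response parsing described in this commit message can be illustrated with a rough sketch. It assumes the response is a `Title:` line, followed by a YAML frontmatter block delimited by `---` lines, followed by the markdown body; that layout is inferred from the description here, not from the markdown.new service itself. The names `parsed`, `parseResponseSketch`, `stripFrontmatterSketch`, and `summarize` are hypothetical and are not the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// parsed holds the two pieces a consumer would need (hypothetical type).
type parsed struct {
	Title   string
	Content string
}

// stripFrontmatterSketch drops a leading block delimited by "---" lines.
// Input without a complete block is returned unchanged.
func stripFrontmatterSketch(s string) string {
	s = strings.TrimLeft(s, "\r\n")
	lines := strings.Split(s, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return s
	}
	for i := 1; i < len(lines); i++ {
		if strings.TrimSpace(lines[i]) == "---" {
			return strings.TrimLeft(strings.Join(lines[i+1:], "\n"), "\r\n")
		}
	}
	return s // no closing delimiter: leave the text untouched
}

// parseResponseSketch reads an optional "Title:" first line, then strips
// the frontmatter (assumed layout) and keeps the rest as content.
func parseResponseSketch(raw string) parsed {
	var p parsed
	rest := strings.TrimLeft(raw, "\r\n")
	if first, tail, _ := strings.Cut(rest, "\n"); strings.HasPrefix(first, "Title:") {
		p.Title = strings.TrimSpace(strings.TrimPrefix(first, "Title:"))
		rest = tail
	}
	p.Content = strings.TrimSpace(stripFrontmatterSketch(rest))
	return p
}

// summarize renders a one-line description of a parsed response.
func summarize(p parsed) string {
	return fmt.Sprintf("%q (%d chars)", p.Title, len(p.Content))
}

func main() {
	raw := "Title: Hello World\n\n---\nurl: https://example.com\n---\n\n# Heading\n\nBody text.\n"
	fmt.Println(summarize(parseResponseSketch(raw)))
}
```

A parser like this is tied to an undocumented text layout, so any change in the upstream format would break it silently.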
umputun left a comment
thx for the PR, the idea of improving content extraction makes sense. I tested markdown.new against real URLs from news.radio-t.com and it works for most sites (4/5), failing only on X/Twitter. couple concerns though:
- wrong level of abstraction - this change belongs in ukeeper-readability, not in super-bot. ukeeper is the content extraction layer, super-bot is the consumer. if we improve how ukeeper extracts content, the uKeeperGetter interface stays the same and super-bot doesn't change at all
- markdown.new is not a cloudflare product - it was built by an independent developer on top of cloudflare APIs. no SLA, no guarantees it stays up or keeps the same response format. cloudflare has an official Browser Rendering /markdown endpoint that does the same thing - fetches any URL, renders JS, returns clean markdown. it works on the free plan (10 min/day browser time, 1 req/10 sec rate limit) and doesn't require the target site to opt in or be on cloudflare. I tested it and it actually handles X/Twitter correctly (which markdown.new can't), returns clean JSON, and doesn't need custom response parsing
- CLAUDE.md rewrite should be a separate PR, unrelated to the content extraction change. it also drops some useful project-specific guidelines
the real problem here is sites behind cloudflare protection returning "just a moment..." to ureadability. the right fix would be upgrading ukeeper to use cloudflare's Browser Rendering /markdown API (POST /accounts/{id}/browser-rendering/markdown with {"url": "..."}) - official, stable, handles JS rendering, works with any URL
closing this - the content extraction improvement was addressed at the proper level in ukeeper-readability, which is where this logic belongs. thx for the idea though, it pointed to a real problem.
Summary
- uReadability (ureadability.radio-t.com) often fails to extract content from websites that block server-side requests or require JS rendering, resulting in error messages or stubs being posted as summaries in the chat.
- Added MarkdownNewClient in app/bot/openai/mdnew.go implementing the existing uKeeperGetter interface (drop-in replacement)
- Response parser handles the markdown.new format: Title: header line + YAML frontmatter + markdown body
- Replaced UKeeperClient wiring in main.go; removed --ur-api/--ur-token flags, added --mdnew-api (default: https://markdown.new, env: MDNEW_API)

Deployment notes
- Remove UREADABILITY_API and UREADABILITY_TOKEN env vars from deployment config
- Set MDNEW_API if a different endpoint is needed (default works out of the box, no auth required)

Test plan
- Unit tests for MarkdownNewClient.Get() with mock HTTP server
- Unit tests for the response parser (parseMarkdownNewResponse, stripFrontmatter)
- Manual check against https://openai.com/index/introducing-gpt-5-4/: title and content extracted correctly (31K chars of clean markdown)
- Full test run (go test ./app/...)

🤖 Generated with Claude Code