Caveat: this MCP server was vibecoded. Treat it with appropriate caution, review the code before using it in sensitive environments, and expect that some edge cases may need hardening.
A Model Context Protocol (MCP) server that turns web URLs into clean, LLM-friendly content. It extracts readable Markdown, HTML, or plain text from normal web pages with Mozilla Readability, and it handles public Reddit URLs through Reddit JSON by returning the post plus top comments.
This fork builds on the original work by Max Zimmer in emzimmer/server-moz-readability. The main motivation for forking was proper Reddit handling: Reddit URLs are fetched through Reddit JSON rather than treated like ordinary pages.
This fork is distributed from GitHub and is not currently published to the npm registry.
- Removes ads, navigation, footers, and other page chrome from regular web pages
- Returns content as Markdown, HTML, or plain text
- Extracts public Reddit posts and up to 20 top-level comments sorted by top without OAuth
- Returns stable metadata including format, permalink, provider type, excerpt, byline, site name, and truncation status
- Caps fetched responses and rendered output by default so large pages stay predictable
- Blocks private and local hosts by default, with environment variables for stricter or local-only policies
- Handles invalid URLs, unsupported content types, redirects, timeouts, and oversized responses with clear errors
Run directly from GitHub:
```bash
npx -y --package github:stefanstr/make-content-parsable -- make-content-parsable
```

Install into a project from GitHub:

```bash
npm install github:stefanstr/make-content-parsable
```

Run an installed copy:

```bash
npx make-content-parsable
```

Requirements:
- Node.js 18 or newer
- An MCP client that can run stdio servers
Add the server with the Codex CLI:
```bash
codex mcp add make-content-parsable -- npx -y --package github:stefanstr/make-content-parsable -- make-content-parsable
```

Or add it manually to `~/.codex/config.toml`:

```toml
[mcp_servers.make-content-parsable]
command = "npx"
args = ["-y", "--package", "github:stefanstr/make-content-parsable", "--", "make-content-parsable"]
```

With local policy controls:

```toml
[mcp_servers.make-content-parsable]
command = "npx"
args = ["-y", "--package", "github:stefanstr/make-content-parsable", "--", "make-content-parsable"]

[mcp_servers.make-content-parsable.env]
ALLOWED_HOSTS = "example.com,docs.example.com"
BLOCKED_HOSTS = "tracking.example.com"
```

Add the server to `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "make-content-parsable": {
      "command": "npx",
      "args": [
        "-y",
        "--package",
        "github:stefanstr/make-content-parsable",
        "--",
        "make-content-parsable"
      ]
    }
  }
}
```

With local policy controls:
```json
{
  "mcpServers": {
    "make-content-parsable": {
      "command": "npx",
      "args": [
        "-y",
        "--package",
        "github:stefanstr/make-content-parsable",
        "--",
        "make-content-parsable"
      ],
      "env": {
        "ALLOWED_HOSTS": "example.com,docs.example.com",
        "BLOCKED_HOSTS": "tracking.example.com"
      }
    }
  }
}
```

The server exposes one tool: `extract_web_content`.
Use `extract_web_content` when the model needs to read, summarize, quote, analyze, or extract readable content from a URL. Normal web pages are extracted with Mozilla Readability. Public Reddit URLs are fetched through Reddit JSON and return the post plus top comments.
```json
{
  "url": "https://example.com/article",
  "format": "markdown",
  "maxChars": 50000,
  "excerptMode": "start"
}
```

| Argument | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | yes | none | Full http or https URL to extract. |
| `format` | string | no | `markdown` | Output format. Supported values: `markdown`, `html`, `text`. |
| `maxChars` | integer | no | `50000` | Maximum characters returned after rendering. Use `-1` for no output cap. |
| `excerptMode` | string | no | `start` | Truncation strategy for web article output. Supported values: `start`, `best`. |
excerptMode: "start" returns the beginning of rendered content when maxChars truncates output. excerptMode: "best" applies to Markdown and text web article output and tries to choose a high-signal content window when Readability leaves boilerplate near the start. HTML output and Reddit currently use start-style truncation even when best is passed.
The MCP tool returns JSON as text:
```json
{
  "title": "Article title",
  "content": "Rendered article content...",
  "metadata": {
    "format": "markdown",
    "excerpt": "Brief summary or null",
    "byline": "Author information or null",
    "siteName": "Source website name or null",
    "truncated": false,
    "permalink": "https://example.com/article",
    "provider": {
      "type": "web"
    }
  }
}
```

For Reddit URLs, `metadata.provider.type` is `reddit`, `siteName` is `Reddit`, and provider details include `subreddit`, `postId`, `commentsIncluded`, and `commentsTotal` when Reddit supplies them.
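An illustrative Reddit metadata fragment, using the field names described above (the values are assumed, not taken from a real response):

```json
{
  "metadata": {
    "format": "markdown",
    "siteName": "Reddit",
    "truncated": false,
    "permalink": "https://www.reddit.com/r/test/comments/abc123/example_post/",
    "provider": {
      "type": "reddit",
      "subreddit": "test",
      "postId": "abc123",
      "commentsIncluded": 20,
      "commentsTotal": 57
    }
  }
}
```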
Use Markdown for most LLM workflows:
```json
{
  "url": "https://example.com/deep-work",
  "format": "markdown"
}
```

Example content:

```markdown
# Deep Work in Small Windows
Most calendars are not built for concentration...
```

Use HTML when the caller needs the cleaned article structure:

```json
{
  "url": "https://example.com/deep-work",
  "format": "html"
}
```

Example content:

```html
<article><h1>Deep Work in Small Windows</h1><p>Most calendars...</p></article>
```

Use text when the caller wants compact prose without Markdown or HTML syntax:

```json
{
  "url": "https://example.com/deep-work",
  "format": "text"
}
```

Example content:

```text
Deep Work in Small Windows Most calendars are not built for concentration...
```
Return only the first 2,000 rendered characters:
```json
{
  "url": "https://example.com/deep-work",
  "format": "markdown",
  "maxChars": 2000
}
```

Disable the rendered output cap:

```json
{
  "url": "https://example.com/deep-work",
  "format": "markdown",
  "maxChars": -1
}
```

Prefer a content-aware excerpt window when a page has boilerplate near the start:

```json
{
  "url": "https://example.com/cooling-centers/maps",
  "format": "markdown",
  "maxChars": 1200,
  "excerptMode": "best"
}
```

Fetch a public Reddit post and top comments:

```json
{
  "url": "https://www.reddit.com/r/test/comments/abc123/example_post/",
  "format": "markdown",
  "maxChars": 8000
}
```

Short Reddit URLs are supported too:

```json
{
  "url": "https://redd.it/abc123",
  "format": "text"
}
```

The server only accepts http and https URLs. It blocks local and private network targets by default, including localhost, `*.localhost`, loopback addresses, link-local addresses, RFC1918 private IPv4 ranges, unique-local IPv6 addresses, and multicast/reserved address ranges. Redirect targets and DNS-resolved addresses are checked with the same policy.
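For example, a request like the following is expected to be rejected by the default host policy before any fetch is attempted (the address below is a hypothetical RFC1918 target):

```json
{
  "url": "http://192.168.0.10/status",
  "format": "text"
}
```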
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `ALLOW_PRIVATE_HOSTS` | unset | Set to `1`, `true`, or `yes` to allow private and local hosts. Useful for trusted local testing only. |
| `ALLOWED_HOSTS` | unset | Comma-separated exact host allowlist. When set, all other hosts are rejected. |
| `BLOCKED_HOSTS` | unset | Comma-separated exact host blocklist. Blocked hosts are rejected before fetching. |
Examples:
```bash
ALLOWED_HOSTS=docs.example.com npx -y --package github:stefanstr/make-content-parsable -- make-content-parsable
ALLOW_PRIVATE_HOSTS=true npx -y --package github:stefanstr/make-content-parsable -- make-content-parsable
```

- Requests time out after 10 seconds.
- Redirects are followed manually and capped at 5 redirects.
- Fetched responses are capped at 5 MiB.
- Web pages must return `text/html` or `application/xhtml+xml`.
- Reddit JSON responses must return `application/json` or `text/json`.
- Rendered content is capped at 50,000 characters by default unless `maxChars` is set (see the example after this list).
- `maxChars` is applied after rendering, so Markdown, HTML, and text can truncate at different positions.
- `maxChars: -1` disables the rendered output cap but does not disable the 5 MiB fetch cap.
- Reddit extraction includes top-level top comments only, capped at 20 comments.
- Reddit URLs are normalized to `www.reddit.com` JSON endpoints with `raw_json=1`, `sort=top`, and `limit=20`.
- `excerptMode: "best"` is deterministic and local to rendered content; it can skip obvious boilerplate, but it is still a heuristic.
- HTML truncation can cut markup because it truncates the rendered HTML string.
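When a cap does truncate the rendered output, the `truncated` flag in the returned metadata is expected to reflect it. An illustrative metadata fragment (values assumed):

```json
{
  "metadata": {
    "format": "markdown",
    "truncated": true,
    "permalink": "https://example.com/deep-work"
  }
}
```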
Unlike simple fetch requests, this server:
- Extracts relevant content using provider-specific parsing
- Eliminates web page noise like ads, popups, and navigation menus
- Handles Reddit URLs through Reddit JSON instead of brittle page scraping
- Reduces token usage by removing unnecessary HTML and CSS
- Provides consistent formatting for better LLM processing
- Includes useful metadata about the content
- @modelcontextprotocol/sdk - MCP server support
- @mozilla/readability - Content extraction
- turndown - HTML to Markdown conversion
- jsdom - DOM parsing
- axios - HTTP requests
MIT