A native Rust implementation of the Readability algorithm for extracting the main readable content from HTML pages, stripping away navigation, ads, and other clutter.
This is a faithful port of Mozilla's Readability.js. The scoring algorithm, retry strategy, and heuristics match the upstream JavaScript library.

```toml
[dependencies]
readable-rs = "0.1.2"
```

```rust
use readable_rs::{extract, ExtractOptions};

let html = r#"
<html><body>
<article>
<h1>My Article</h1>
<p>The actual article content goes here. It needs to be long enough
to clear the default character threshold before the algorithm will
consider it a successful extraction.</p>
</article>
<nav><a href="/">Home</a><a href="/about">About</a></nav>
</body></html>
"#;

// The second argument is the page URL, used to resolve relative links.
let product = extract(html, "https://example.com/article", ExtractOptions::default());

// `content` is only present when extraction succeeded.
if let Some(content) = &product.content {
    println!("Title: {}", product.title);
    println!("Byline: {}", product.by_line);
    println!("Content: {}", content.to_string());
}
```

Given an HTML page, `extract` will:
- Strip scripts, styles, comments, and navigation boilerplate
- Detect and resolve lazy-loaded images and `<noscript>` fallbacks
- Score candidate elements by content density (comma count, text length, link density, class/id heuristics)
- Pick the highest-scoring subtree as the article body
- Clean up the result: remove ads, empty nodes, presentation attributes, and rewrite relative URLs to absolute
- Extract metadata (title, byline, site name, excerpt, publish date) from `<meta>` tags, JSON-LD, and heuristics
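
The scoring step can be illustrated with a small standalone sketch. It is modeled on the paragraph heuristic in Readability.js and is not this crate's actual code: the function names (`paragraph_score`, `link_density`, `candidate_score`) and the 25-character cutoff are assumptions made for illustration, and class/id weighting is left out.

```rust
/// Simplified per-paragraph content score (hypothetical helper).
/// Rewards comma-separated segments and text length.
fn paragraph_score(text: &str) -> f64 {
    let trimmed = text.trim();
    let chars = trimmed.chars().count();
    // Very short paragraphs contribute nothing (assumed cutoff).
    if chars < 25 {
        return 0.0;
    }
    // One base point for being a paragraph at all.
    let mut score = 1.0;
    // One point per comma-separated segment.
    score += trimmed.split(',').count() as f64;
    // Up to three extra points, one per full 100 characters.
    score += (chars / 100).min(3) as f64;
    score
}

/// Fraction of an element's text that sits inside links.
fn link_density(total_chars: usize, link_chars: usize) -> f64 {
    if total_chars == 0 {
        0.0
    } else {
        link_chars as f64 / total_chars as f64
    }
}

/// Scale a raw score down by link density, so link-heavy blocks
/// such as navigation bars rank below prose.
fn candidate_score(raw: f64, total_chars: usize, link_chars: usize) -> f64 {
    raw * (1.0 - link_density(total_chars, link_chars))
}

fn main() {
    let prose = "Readable prose tends to be long, punctuated, and mostly \
                 free of links, which is what the scorer rewards.";
    let nav = "Home About";
    println!("prose: {}", paragraph_score(prose));
    println!("nav:   {}", paragraph_score(nav));
    // A block whose text is half links keeps half of its raw score.
    println!("half-linked: {}", candidate_score(10.0, 100, 50));
}
```

In this sketch a navigation bar loses twice: its text is too short to score, and whatever score it does accumulate is scaled down by its high link density.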
If the first pass yields fewer characters than `char_threshold` (default 500), the algorithm retries with progressively relaxed options.
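
The retry loop might look roughly like the sketch below. It assumes the order Readability.js relaxes its flags in (strip unlikely candidates, then class weighting, then conditional cleaning) and falls back to the longest result when no pass clears the threshold. `Flags`, `extract_with_retries`, and the `attempt` closure are hypothetical names, not part of this crate's API.

```rust
/// Hypothetical subset of the options that the retry loop relaxes.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Flags {
    strip_unlikelys: bool,
    weight_classes: bool,
    clean_conditionally: bool,
}

/// Run `attempt` with progressively relaxed flags until its output reaches
/// `char_threshold`. If no pass clears the threshold, return the longest
/// non-empty result seen, or `None` if every pass came back empty.
fn extract_with_retries<F>(char_threshold: usize, mut attempt: F) -> Option<String>
where
    F: FnMut(Flags) -> String,
{
    let mut flags = Flags {
        strip_unlikelys: true,
        weight_classes: true,
        clean_conditionally: true,
    };
    let mut best: Option<String> = None;

    loop {
        let text = attempt(flags);
        let len = text.chars().count();
        if len >= char_threshold {
            return Some(text);
        }
        // Remember the longest attempt as a fallback.
        if len > 0 && best.as_ref().map_or(true, |b| len > b.chars().count()) {
            best = Some(text);
        }
        // Relax exactly one flag per pass.
        if flags.strip_unlikelys {
            flags.strip_unlikelys = false;
        } else if flags.weight_classes {
            flags.weight_classes = false;
        } else if flags.clean_conditionally {
            flags.clean_conditionally = false;
        } else {
            // Nothing left to relax.
            return best;
        }
    }
}

fn main() {
    // Stand-in extractor: the strict pass is too aggressive and returns
    // little text; the first relaxed pass returns enough.
    let result = extract_with_retries(500, |flags| {
        if flags.strip_unlikelys {
            "x".repeat(120)
        } else {
            "x".repeat(800)
        }
    });
    println!("{:?}", result.map(|s| s.len()));
}
```

Relaxing one flag at a time keeps the output as clean as possible: the most aggressive filter is dropped first, and the remaining cleanup stays on unless it too proves to be the obstacle.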
All options live on `ExtractOptions`:

| Field | Default | Description |
|---|---|---|
| `char_threshold` | `500` | Minimum character count for successful extraction |
| `strip_unlikelys` | `true` | Remove elements that look like navigation/ads |
| `clean_conditionally` | `true` | Remove low-density elements (few commas, high link ratio) |
| `weight_classes` | `true` | Use class/id names to adjust scoring |
| `remove_style_tags` | `true` | Strip `<style>` elements |
| `keep_classes` | `true` | Preserve CSS classes on output nodes |
| `ready_for_epub` | `false` | Apply stricter cleanup for EPUB compatibility |

Apache-2.0 — see LICENSE.