readable-rs

A native Rust implementation of the Readability algorithm for extracting the main readable content from HTML pages, stripping away navigation, ads, and other clutter.

This is a faithful port of Mozilla's Readability.js. The scoring algorithm, retry strategy, and heuristics match the upstream JavaScript library.

Installation

[dependencies]
readable-rs = "0.1.2"

Usage

use readable_rs::{extract, ExtractOptions};

let html = r#"
    <html><body>
        <article>
            <h1>My Article</h1>
            <p>The actual article content goes here. It needs to be long enough
            to clear the default character threshold before the algorithm will
            consider it a successful extraction.</p>
        </article>
        <nav><a href="/">Home</a><a href="/about">About</a></nav>
    </body></html>
"#;

let product = extract(html, "https://example.com/article", ExtractOptions::default());

if let Some(content) = &product.content {
    println!("Title:   {}", product.title);
    println!("Byline:  {}", product.by_line);
    println!("Content: {}", content.to_string());
}

How it works

Given an HTML page, extract will:

1. Strip scripts, styles, comments, and navigation boilerplate
2. Detect and resolve lazy-loaded images and <noscript> fallbacks
3. Score candidate elements by content density (comma count, text length, link density, class/id heuristics)
4. Pick the highest-scoring subtree as the article body
5. Clean up the result: remove ads, empty nodes, presentation attributes, and rewrite relative URLs to absolute
6. Extract metadata (title, byline, site name, excerpt, publish date) from <meta> tags, JSON-LD, and heuristics
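
To make the scoring step more concrete, here is a minimal sketch of a content-density score, loosely modeled on the heuristic used by upstream Readability.js. The function names (paragraph_score, link_density, adjusted_score) and the exact constants are illustrative assumptions, not part of the readable-rs API, and the crate's internals may differ.

```rust
/// Rough per-paragraph content score.
/// Assumption: constants follow the upstream Readability.js heuristic;
/// readable-rs may use different values internally.
fn paragraph_score(text: &str) -> f64 {
    let text = text.trim();
    let char_count = text.chars().count();

    // Very short paragraphs carry too little signal and are ignored.
    if char_count < 25 {
        return 0.0;
    }

    // One base point for being a paragraph at all.
    let mut score = 1.0;

    // One point per comma-separated segment: prose tends to contain commas.
    score += text.split(',').count() as f64;

    // One point per 100 characters, capped at 3.
    score += (char_count / 100).min(3) as f64;

    score
}

/// Share of an element's text that sits inside links (0.0 to 1.0).
fn link_density(total_text_len: usize, link_text_len: usize) -> f64 {
    if total_text_len == 0 {
        return 0.0;
    }
    link_text_len as f64 / total_text_len as f64
}

/// Candidate score after penalizing link-heavy elements such as menus.
fn adjusted_score(raw_score: f64, density: f64) -> f64 {
    raw_score * (1.0 - density)
}
```

Under these assumptions, a navigation block whose text is almost entirely link text ends up with a score near zero, while a long paragraph with several commas and no links keeps its full score.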

If the first pass yields fewer characters than char_threshold (default 500), the algorithm retries with progressively relaxed options.
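
The retry behaviour can be sketched as a loop that switches off one heuristic per attempt. The order used here (strip_unlikelys, then weight_classes, then clean_conditionally) and the fallback to the longest attempt are taken from upstream Readability.js and are assumed to carry over; Flags, relax, and extract_with_retries are hypothetical names for illustration, and the attempt closure stands in for one full extraction pass.

```rust
/// Heuristics that can be switched off between attempts.
/// Field names mirror `ExtractOptions`; the struct itself is a stand-in.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Flags {
    strip_unlikelys: bool,
    weight_classes: bool,
    clean_conditionally: bool,
}

/// Disable the next heuristic in line.
/// Returns false when every heuristic is already off.
fn relax(flags: &mut Flags) -> bool {
    if flags.strip_unlikelys {
        flags.strip_unlikelys = false;
    } else if flags.weight_classes {
        flags.weight_classes = false;
    } else if flags.clean_conditionally {
        flags.clean_conditionally = false;
    } else {
        return false;
    }
    true
}

/// Run `attempt` until it yields at least `char_threshold` characters.
/// Once every flag is off, fall back to the longest non-empty result seen.
fn extract_with_retries<F>(char_threshold: usize, mut attempt: F) -> Option<String>
where
    F: FnMut(Flags) -> String,
{
    let mut flags = Flags {
        strip_unlikelys: true,
        weight_classes: true,
        clean_conditionally: true,
    };
    let mut best: Option<String> = None;

    loop {
        let text = attempt(flags);
        let len = text.chars().count();
        if len >= char_threshold {
            return Some(text);
        }
        // Remember the longest non-empty attempt as a fallback.
        if len > 0 && best.as_ref().map_or(true, |b| len > b.chars().count()) {
            best = Some(text);
        }
        if !relax(&mut flags) {
            return best;
        }
    }
}
```

With three flags, this sketch makes at most four passes over the document: one with everything enabled and one after each relaxation.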

Configuration

All options live on ExtractOptions:

Field	Default	Description
`char_threshold`	500	Minimum character count for successful extraction
`strip_unlikelys`	true	Remove elements that look like navigation/ads
`clean_conditionally`	true	Remove low-density elements (few commas, high link ratio)
`weight_classes`	true	Use class/id names to adjust scoring
`remove_style_tags`	true	Strip `<style>` elements
`keep_classes`	true	Preserve CSS classes on output nodes
`ready_for_epub`	false	Apply stricter cleanup for EPUB compatibility
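
As a sketch of overriding a couple of these defaults, the snippet below uses Rust's struct update syntax. The local ExtractOptions is a stand-in that mirrors the fields and defaults from the table so the example compiles on its own; the field types, and whether the real struct exposes public fields that allow this syntax, are assumptions. The helper short_page_options is hypothetical.

```rust
/// Stand-in mirroring the documented fields and defaults.
/// Assumption: field types are `usize` and `bool`. With the crate as a
/// dependency, use `readable_rs::ExtractOptions` instead of this struct.
#[derive(Debug, Clone, PartialEq)]
struct ExtractOptions {
    char_threshold: usize,
    strip_unlikelys: bool,
    clean_conditionally: bool,
    weight_classes: bool,
    remove_style_tags: bool,
    keep_classes: bool,
    ready_for_epub: bool,
}

impl Default for ExtractOptions {
    fn default() -> Self {
        Self {
            char_threshold: 500,
            strip_unlikelys: true,
            clean_conditionally: true,
            weight_classes: true,
            remove_style_tags: true,
            keep_classes: true,
            ready_for_epub: false,
        }
    }
}

/// Options tuned for short pages: lower threshold, no CSS classes in output.
fn short_page_options() -> ExtractOptions {
    ExtractOptions {
        char_threshold: 200,
        keep_classes: false,
        // Every other field keeps its documented default.
        ..ExtractOptions::default()
    }
}
```

The resulting value would then be passed as the third argument to extract in place of ExtractOptions::default().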

License

Apache-2.0 — see LICENSE.
