Ahmed-Ali/readable-rs

readable-rs

A native Rust implementation of the Readability algorithm for extracting the main readable content from HTML pages, stripping away navigation, ads, and other clutter.

This is a faithful port of Mozilla's Readability.js. The scoring algorithm, retry strategy, and heuristics match the upstream JavaScript library.

Installation

[dependencies]
readable-rs = "0.1.2"

Usage

use readable_rs::{extract, ExtractOptions};

let html = r#"
    <html><body>
        <article>
            <h1>My Article</h1>
            <p>The actual article content goes here. It needs to be long enough
            to clear the default character threshold before the algorithm will
            consider it a successful extraction.</p>
        </article>
        <nav><a href="/">Home</a><a href="/about">About</a></nav>
    </body></html>
"#;

let product = extract(html, "https://example.com/article", ExtractOptions::default());

if let Some(content) = &product.content {
    println!("Title:   {}", product.title);
    println!("Byline:  {}", product.by_line);
    println!("Content: {}", content.to_string());
}

How it works

Given an HTML page, extract will:

  1. Strip scripts, styles, comments, and navigation boilerplate
  2. Detect and resolve lazy-loaded images and <noscript> fallbacks
  3. Score candidate elements by content density (comma count, text length, link density, class/id heuristics)
  4. Pick the highest-scoring subtree as the article body
  5. Clean up the result: remove ads, empty nodes, presentation attributes, and rewrite relative URLs to absolute
  6. Extract metadata (title, byline, site name, excerpt, publish date) from <meta> tags, JSON-LD, and heuristics
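The scoring in steps 3 and 4 can be sketched roughly as follows. This is a simplified illustration in the spirit of upstream Readability.js, not the crate's actual code: the `Block` type and the functions `link_density`, `content_score`, and `best_block` are hypothetical names, the weights are assumptions, and class/id weighting and score propagation to ancestor nodes are left out.

```rust
/// Toy stand-in for a parsed element (hypothetical, not part of readable-rs):
/// `text` is all of the element's text, `link_text` is the portion inside <a> tags.
pub struct Block<'a> {
    pub text: &'a str,
    pub link_text: &'a str,
}

/// Fraction of the block's characters that belong to links, from 0.0 to 1.0.
pub fn link_density(block: &Block) -> f64 {
    let total = block.text.chars().count();
    if total == 0 {
        return 0.0;
    }
    block.link_text.chars().count() as f64 / total as f64
}

/// One base point, one point per comma, and up to three points for length
/// (one per 100 characters), all scaled down by the link density.
pub fn content_score(block: &Block) -> f64 {
    let commas = block.text.matches(',').count() as f64;
    let length_bonus = (block.text.chars().count() / 100).min(3) as f64;
    (1.0 + commas + length_bonus) * (1.0 - link_density(block))
}

/// Index of the highest-scoring block, or `None` for an empty slice.
pub fn best_block(blocks: &[Block]) -> Option<usize> {
    blocks
        .iter()
        .enumerate()
        .max_by(|a, b| content_score(a.1).total_cmp(&content_score(b.1)))
        .map(|(index, _)| index)
}
```

Under these assumed weights, a navigation block whose text is entirely links scores zero, while a prose sentence with two commas and no links scores 3.0, so `best_block` picks the prose.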

If the first pass yields fewer characters than char_threshold (default 500), the algorithm retries with progressively relaxed options.
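A minimal sketch of such a retry loop is shown below. It assumes the relaxation order used by upstream Readability.js (stop stripping unlikely elements, then stop weighting classes, then stop conditional cleaning) and a fallback to the longest attempt when no pass clears the threshold. The `Flags` struct and `extract_with_retries` are hypothetical names, and the crate's real control flow may differ.

```rust
/// The three heuristics that get relaxed between passes (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Flags {
    pub strip_unlikelys: bool,
    pub weight_classes: bool,
    pub clean_conditionally: bool,
}

/// Runs `attempt` with all heuristics on, then disables one more heuristic
/// per pass until the extracted text reaches `char_threshold` characters.
/// If no pass gets there, returns the longest text seen.
pub fn extract_with_retries<F>(char_threshold: usize, mut attempt: F) -> String
where
    F: FnMut(Flags) -> String,
{
    let mut flags = Flags {
        strip_unlikelys: true,
        weight_classes: true,
        clean_conditionally: true,
    };
    let mut best = String::new();

    loop {
        let text = attempt(flags);
        let len = text.chars().count();
        if len >= char_threshold {
            return text;
        }
        // Remember the longest short result as a fallback.
        if len > best.chars().count() {
            best = text;
        }
        // Relax exactly one heuristic before the next pass.
        if flags.strip_unlikelys {
            flags.strip_unlikelys = false;
        } else if flags.weight_classes {
            flags.weight_classes = false;
        } else if flags.clean_conditionally {
            flags.clean_conditionally = false;
        } else {
            // Nothing left to relax.
            return best;
        }
    }
}
```

Passing the extraction step as a closure keeps the sketch independent of any HTML parser; in this model there are at most four passes per page.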

Configuration

All options live on ExtractOptions:

| Field | Default | Description |
| --- | --- | --- |
| char_threshold | 500 | Minimum character count for successful extraction |
| strip_unlikelys | true | Remove elements that look like navigation/ads |
| clean_conditionally | true | Remove low-density elements (few commas, high link ratio) |
| weight_classes | true | Use class/id names to adjust scoring |
| remove_style_tags | true | Strip <style> elements |
| keep_classes | true | Preserve CSS classes on output nodes |
| ready_for_epub | false | Apply stricter cleanup for EPUB compatibility |
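To override only a few of these fields, Rust's struct update syntax is one natural pattern. The sketch below uses a local stand-in, `ExtractOptionsSketch`, that mirrors the documented fields and defaults so the example compiles on its own. The field types and the helper `short_page_options` are assumptions, and applying the same pattern to the crate's `ExtractOptions` assumes its fields are public.

```rust
/// Local stand-in mirroring the documented fields and defaults.
/// Not the crate's definition; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractOptionsSketch {
    pub char_threshold: usize,
    pub strip_unlikelys: bool,
    pub clean_conditionally: bool,
    pub weight_classes: bool,
    pub remove_style_tags: bool,
    pub keep_classes: bool,
    pub ready_for_epub: bool,
}

impl Default for ExtractOptionsSketch {
    /// Defaults as documented: threshold 500, every flag on except `ready_for_epub`.
    fn default() -> Self {
        Self {
            char_threshold: 500,
            strip_unlikelys: true,
            clean_conditionally: true,
            weight_classes: true,
            remove_style_tags: true,
            keep_classes: true,
            ready_for_epub: false,
        }
    }
}

/// Hypothetical preset for short pages: accept shorter articles and drop
/// CSS classes from the output, leaving every other field at its default.
pub fn short_page_options() -> ExtractOptionsSketch {
    ExtractOptionsSketch {
        char_threshold: 200,
        keep_classes: false,
        ..ExtractOptionsSketch::default()
    }
}
```

Spelling out only the changed fields keeps call sites short and means any field added later would still pick up its default.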

License

Apache-2.0. See LICENSE.
