feat(convert): add content-focused extraction with boilerplate stripping #78

Merged
chaliy merged 1 commit into main from claude/issue-72-content-focus
Mar 27, 2026

chaliy commented Mar 27, 2026

What

Add strip_boilerplate() function and content_focus request field to reduce token waste from navigation, footers, and sidebars.

Why

Most web pages are 80%+ boilerplate. Agents waste LLM context tokens on nav menus, footers, and sidebars that are irrelevant to the content they're trying to understand.

How

  • New content_focus field on FetchRequest: "main" strips boilerplate, "full" (default) keeps everything
  • strip_boilerplate() strategy:
    1. If <main> or <article> exists, extract only that content
    2. If role="main" element exists, extract that
    3. Fallback: strip <nav>, <footer>, <aside>, <header> and elements with roles navigation, banner, contentinfo, complementary
  • Applied before HTML→Markdown/Text conversion in DefaultFetcher
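The three-step strategy above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code merged in this PR: the helper names (find_tag, element_span, find_role) are hypothetical, preferring <main> over <article> is an assumption, and so is returning the inner content without the wrapping tags. It does naive tag scanning with depth counting and ignores edge cases such as ">" inside attribute values, single-quoted role attributes, and self-closing tags.

```rust
/// Finds the next occurrence of `pat` (e.g. "<nav" or "</nav") that is a whole
/// tag name, so "<nav" does not match "<navbar>". `lower` is lowercased HTML.
fn find_tag(lower: &str, pat: &str, from: usize) -> Option<usize> {
    let mut pos = from;
    while let Some(rel) = lower[pos..].find(pat) {
        let start = pos + rel;
        let after = lower.as_bytes().get(start + pat.len()).copied();
        if matches!(after, Some(b'>' | b' ' | b'\t' | b'\n' | b'\r' | b'/')) {
            return Some(start);
        }
        pos = start + pat.len();
    }
    None
}

/// Given the index of an opening `<tag`, returns (content_start, content_end,
/// element_end). Nested same-name tags are handled by counting depth.
fn element_span(lower: &str, tag: &str, open: usize) -> Option<(usize, usize, usize)> {
    let open_pat = format!("<{tag}");
    let close_pat = format!("</{tag}");
    let content_start = open + lower[open..].find('>')? + 1;
    let mut depth = 1;
    let mut pos = content_start;
    loop {
        let next_close = find_tag(lower, &close_pat, pos)?;
        match find_tag(lower, &open_pat, pos) {
            Some(next_open) if next_open < next_close => {
                depth += 1;
                pos = next_open + open_pat.len();
            }
            _ => {
                depth -= 1;
                let close_end = next_close + lower[next_close..].find('>')? + 1;
                if depth == 0 {
                    return Some((content_start, next_close, close_end));
                }
                pos = close_end;
            }
        }
    }
}

/// Finds the first element carrying role="<role>"; returns (open_index, tag_name).
fn find_role(lower: &str, role: &str) -> Option<(usize, String)> {
    let at = lower.find(&format!("role=\"{role}\""))?;
    let open = lower[..at].rfind('<')?;
    let name: String = lower[open + 1..]
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric())
        .collect();
    if name.is_empty() { None } else { Some((open, name)) }
}

/// Sketch of the strategy: <main>/<article>, then role="main", then fallback removal.
pub fn strip_boilerplate(html: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    // Step 1: prefer <main>, then <article> (the ordering is an assumption).
    for tag in ["main", "article"] {
        if let Some(open) = find_tag(&lower, &format!("<{tag}"), 0) {
            if let Some((start, end, _)) = element_span(&lower, tag, open) {
                return html[start..end].to_string();
            }
        }
    }
    // Step 2: any element with role="main".
    if let Some((open, name)) = find_role(&lower, "main") {
        if let Some((start, end, _)) = element_span(&lower, &name, open) {
            return html[start..end].to_string();
        }
    }
    // Step 3: fallback, remove boilerplate elements by tag name, then by role.
    let mut out = html.to_string();
    for tag in ["nav", "footer", "aside", "header"] {
        loop {
            let lower = out.to_ascii_lowercase();
            let Some(open) = find_tag(&lower, &format!("<{tag}"), 0) else { break };
            let Some((_, _, end)) = element_span(&lower, tag, open) else { break };
            out.replace_range(open..end, "");
        }
    }
    for role in ["navigation", "banner", "contentinfo", "complementary"] {
        loop {
            let lower = out.to_ascii_lowercase();
            let Some((open, name)) = find_role(&lower, role) else { break };
            let Some((_, _, end)) = element_span(&lower, &name, open) else { break };
            out.replace_range(open..end, "");
        }
    }
    out
}
```

Under these assumptions, a page with a <main> element is reduced to that element's inner markup, and a page without one only loses its nav, footer, aside, and header blocks. An unclosed boilerplate tag is left in place rather than guessed at.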

Risk

  • Low — opt-in via content_focus: "main", default behavior unchanged
  • Handles nested tags correctly
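To illustrate the opt-in behavior, here is a small sketch of how a content_focus value might gate the stripping step. The ContentFocus enum, its parse helper, and prepare_html are hypothetical names used only for illustration; the PR does not say whether the field is a string or an enum, and rejecting unknown values is an assumption.

```rust
/// Hypothetical stand-in for the content_focus option; `Full` is the default,
/// so requests that omit the field behave as before.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ContentFocus {
    #[default]
    Full,
    Main,
}

impl ContentFocus {
    /// Parses the optional request value. A missing value or "full" keeps
    /// everything; "main" opts in to stripping. Erroring on anything else
    /// is an assumption of this sketch.
    pub fn parse(value: Option<&str>) -> Result<Self, String> {
        match value {
            None | Some("full") => Ok(Self::Full),
            Some("main") => Ok(Self::Main),
            Some(other) => Err(format!("unknown content_focus: {other}")),
        }
    }
}

/// Runs the caller-supplied stripping function only when the request opted in,
/// before any HTML to Markdown/Text conversion would take place.
pub fn prepare_html(html: &str, focus: ContentFocus, strip: impl Fn(&str) -> String) -> String {
    match focus {
        ContentFocus::Full => html.to_string(),
        ContentFocus::Main => strip(html),
    }
}
```

Passing the stripping step in as a closure keeps the sketch independent of any particular strip_boilerplate implementation.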

Checklist

  • Unit tests passed (9 strip_boilerplate tests)
  • Clippy clean
  • Docs build clean

Closes #72

Add strip_boilerplate() that removes nav/footer/aside/header and
role-based boilerplate elements. When <main> or <article> is present,
extracts only that content. New content_focus field on FetchRequest
("main" strips boilerplate, "full" or omitted keeps everything).

Closes #72
chaliy force-pushed the claude/issue-72-content-focus branch from e711f60 to a14d6d9 on March 27, 2026 03:04
chaliy merged commit 4162557 into main on Mar 27, 2026
10 checks passed
chaliy deleted the claude/issue-72-content-focus branch on March 27, 2026 03:05