feat: Cleanup HTML page to reduce token usage #1073
Open
mguella wants to merge 4 commits into ItzCrazyKns:master from
Conversation
Contributor
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/lib/agents/search/researcher/actions/scrapeURL.ts">
<violation number="1" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:53">
P2: Comment-stripping regex only matches whitespace/dot comments, so most HTML comments remain and the cleanup fails to remove common comments.</violation>
</file>
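The fix this finding points at can be sketched as follows. `stripComments` is a hypothetical helper name for illustration, not code from the PR: the key is matching any characters (including newlines) lazily between the comment delimiters, instead of a pattern that only matches whitespace/dots.

```typescript
// Hypothetical helper illustrating the fix: [\s\S] matches any character,
// including newlines, and the lazy *? stops at the nearest closing "-->".
const stripComments = (html: string): string =>
  html.replace(/<!--[\s\S]*?-->/g, "");

// A narrow pattern like /<!--[\s.]*-->/ would leave this comment in place,
// since its body contains letters, not just whitespace and dots.
const cleaned = stripComments("<p>keep</p><!-- analytics snippet --><p>me</p>");
```

The lazy quantifier also matters: a greedy `[\s\S]*` would swallow everything between the first `<!--` and the last `-->`, deleting real content that sits between two comments.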
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
- Ask questions if you need clarification on any suggestion
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Contributor
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/lib/agents/search/researcher/actions/scrapeURL.ts">
<violation number="1" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:54">
P2: Malformed regex alternative (`\s+$<`) prevents intended trailing-whitespace trimming.</violation>
<violation number="2" location="src/lib/agents/search/researcher/actions/scrapeURL.ts:55">
P2: Global whitespace stripping around tags can remove meaningful spaces and produce merged/incorrect extracted markdown text.</violation>
</file>
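Both whitespace findings can be addressed with trimming that is anchored per line rather than applied globally around tags. This is a hedged sketch, not the PR's code; `tidyWhitespace` is a hypothetical name:

```typescript
// Hypothetical whitespace cleanup that avoids merging words: trim trailing
// whitespace per line (note the valid /m-anchored $, not a malformed "$<"
// alternative) and collapse runs of blank lines, while leaving the single
// spaces between inline elements untouched.
function tidyWhitespace(text: string): string {
  return text
    .replace(/[ \t]+$/gm, "")    // trailing spaces/tabs, anchored per line
    .replace(/\n{3,}/g, "\n\n"); // collapse 3+ newlines into one blank line
}
```

Stripping whitespace adjacent to every tag is risky because in HTML a space between inline elements (e.g. `<b>bold</b> next`) is meaningful; deleting it produces merged words in the extracted Markdown.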
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Issue
As reported in #1031, some pages are consuming too many tokens.
Cause
This happens because the page is converted directly to Markdown after being fetched, so the result includes data we don't really care about (e.g. comments, styles, scripts).
Solution
There is another PR, #1035, that tries to reduce the number of tokens sent to the LLM by truncating the HTML content.
However, that approach risks deleting data we care about, especially because it truncates the content at a fixed point. HTML pages tend to include `style` and `script` before the page content, so with that approach we might end up with only `script` and `style` in the resulting text, without any of the `body` and the main page content.

This PR takes a different approach: clean up the HTML by removing things the LLM doesn't need, like comments, script tags and style tags, so we can limit token usage.
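The cleanup idea can be sketched with a minimal, dependency-free function. Note this is an illustrative sketch only: the PR itself parses the page with jsdom, which is more robust than regexes against malformed markup, and the function and regex details below are assumptions, not the PR's actual code.

```typescript
// Illustrative sketch of the cleanup approach (the PR uses jsdom instead):
// drop comments, drop whole script/style/template blocks, then collapse
// the blank lines left behind.
function cleanupHtml(html: string): string {
  return html
    .replace(/<!--[\s\S]*?-->/g, "")                              // HTML comments
    .replace(/<(script|style|template)\b[\s\S]*?<\/\1\s*>/gi, "") // tag + contents
    .replace(/\n{3,}/g, "\n\n");                                  // excess blank lines
}
```

Removing the elements before Markdown conversion, rather than truncating afterwards, means the savings come entirely from content the LLM never needed.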
Next steps
If the approach from this PR is not enough, we could parse the page with Mozilla's Readability.js to keep only the main page content.
If that is also not enough we can combine both approaches above (HTML cleanup + Readability.js) with the truncate approach from #1035.
Summary by cubic
Clean up HTML pages before Markdown conversion to cut token usage while keeping the main content. We strip comments, remove scripts/styles/templates, and trim excess whitespace for HTML responses.
New Features
- HTML responses (`Content-Type: text/html`) are now cleaned in `scrapeURL.ts` with `jsdom` before Markdown conversion, removing comments and `script`, `style`, and `template` tags.

Bug Fixes
Written for commit 43a1c35. Summary will update on new commits.