[Bug]: can_process_url() called with raw URL instead of normalized URL in deep crawl strategies #1743

@RasenGUY

Description

crawl4ai version

0.8.0

Expected Behavior

can_process_url() should receive base_url (the normalized absolute URL), not url (the raw relative href).

Current Behavior

In BestFirstCrawlingStrategy and BFSDeepCrawlStrategy, the link_discovery() method correctly normalizes relative URLs using normalize_url_for_deep_crawl(), but then passes the original raw URL to can_process_url() instead of the normalized URL. This causes valid relative URLs to be rejected with "Missing scheme or netloc" warnings.

In deep_crawling/bff_strategy.py, lines 123-131:

for link in links:
    url = link.get("href")  # Raw href (relative URL)
    base_url = normalize_url_for_deep_crawl(url, source_url)  # ✅ Normalized to absolute
    if base_url in visited:
        continue
    if not await self.can_process_url(url, new_depth):  # ❌ BUG: should be base_url
        self.stats.urls_skipped += 1
        continue

The same issue exists in deep_crawling/bfs_strategy.py around line 118.
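
To illustrate the failure in isolation, here is a minimal sketch that uses only the standard library. It assumes the validation inside can_process_url() rejects any URL that lacks a scheme or netloc, which is inferred from the warning text rather than from the crawl4ai source. `urllib.parse.urljoin` stands in for normalize_url_for_deep_crawl(), and `has_scheme_and_netloc` is a hypothetical helper, not a crawl4ai function.

```python
from urllib.parse import urljoin, urlparse


def has_scheme_and_netloc(url: str) -> bool:
    """Hypothetical stand-in for the check behind 'Missing scheme or netloc'."""
    parsed = urlparse(url)
    return bool(parsed.scheme and parsed.netloc)


source_url = "https://peoplesjewellers.com"
raw_href = "/rings/c/1650308?icid=HMP%3ATB%3ATB1%3ARINGS"

# Assumption: urljoin approximates what normalize_url_for_deep_crawl() does
# when resolving a relative href against the page it was found on.
base_url = urljoin(source_url, raw_href)

print(has_scheme_and_netloc(raw_href))  # raw relative href: rejected
print(has_scheme_and_netloc(base_url))  # normalized absolute URL: accepted
```

Under these assumptions the raw href fails the check while the normalized URL passes, which matches the behavior reported above.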

Is this reproducible?

Yes

Inputs Causing the Bug

URL(s):

Target domain: https://peoplesjewellers.com
Example rejected URLs (relative hrefs from the page):
/valentines-day-specials/c/40000558?icid=HMP%3ATOPTSB%3AVDAYSPECIALS_UPTO50SS
/rings/c/1650308?icid=HMP%3ATB%3ATB1%3ARINGS
/necklaces/c/1977391?icid=HMP%3ATB%3ATB2%3ANECKLACES
/earrings/c/1280226?icid=HMP%3ATB%3ATB3%3AEARRINGS
/?icid=PEO%3ALOGO (homepage link)
Settings used:


# Deep crawl configuration
BestFirstCrawlingStrategy(
    max_depth=50,
    include_external=False,
    max_pages=50,
    filter_chain=FilterChain([
        URLPatternFilter(patterns=[...], reverse=True),
        DomainFilter(allowed_domains=["peoplesjewellers.com"]),
        ContentTypeFilter(allowed_types=["text/html"])
    ]),
    url_scorer=KeywordRelevanceScorer(keywords=[...], weight=0.8)
)
Input data:


{
    "domain": "peoplesjewellers.com",
    "max_pages": 50
}

Steps to Reproduce

1. Deep crawl any website with relative internal links (e.g., /products, /about-us)
2. Observe warning logs like:

Invalid URL: /valentines-day-specials/c/40000558?icid=HMP%3ATOPTSB%3AVDAYSPECIALS_UPTO50SS, error: Missing scheme or netloc
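
The effect on the discovery loop can also be sketched without crawl4ai or network access. The simulation below is a simplified stand-in, not the library's implementation: `discover` and its `use_normalized` flag are hypothetical, `can_process_url` here only checks for a scheme and netloc, and `urljoin` again replaces normalize_url_for_deep_crawl().

```python
import asyncio
from urllib.parse import urljoin, urlparse


async def can_process_url(url: str, depth: int) -> bool:
    """Hypothetical validator: accepts only absolute URLs (depth is unused here)."""
    parsed = urlparse(url)
    return bool(parsed.scheme and parsed.netloc)


async def discover(hrefs, source_url, depth, use_normalized):
    """Simplified stand-in for link_discovery(); returns (accepted, skipped)."""
    visited, accepted, skipped = set(), [], 0
    for href in hrefs:
        # Assumption: urljoin approximates normalize_url_for_deep_crawl().
        base_url = urljoin(source_url, href)
        if base_url in visited:
            continue
        # use_normalized=False mimics the reported bug (raw href is validated).
        candidate = base_url if use_normalized else href
        if not await can_process_url(candidate, depth):
            skipped += 1
            continue
        visited.add(base_url)
        accepted.append(base_url)
    return accepted, skipped


HREFS = [
    "/rings/c/1650308?icid=HMP%3ATB%3ATB1%3ARINGS",
    "/necklaces/c/1977391?icid=HMP%3ATB%3ATB2%3ANECKLACES",
    "/?icid=PEO%3ALOGO",
]
SOURCE = "https://peoplesjewellers.com"

buggy = asyncio.run(discover(HREFS, SOURCE, 1, use_normalized=False))
fixed = asyncio.run(discover(HREFS, SOURCE, 1, use_normalized=True))
print("buggy:", len(buggy[0]), "accepted,", buggy[1], "skipped")
print("fixed:", len(fixed[0]), "accepted,", fixed[1], "skipped")
```

In this simulation, validating the raw href skips every relative link, while validating the normalized URL accepts all of them.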

OS

linux

Python version

3.13.11

Browser

chrome

Browser version

No response

Labels

🐞 Bug (Something isn't working)
🩺 Needs Triage (Needs attention of maintainers)