Description
crawl4ai version
0.8.0
Expected Behavior
can_process_url() should receive base_url (the normalized absolute URL), not url (the raw relative href).
Current Behavior
In BestFirstCrawlingStrategy and BFSDeepCrawlStrategy, the link_discovery() method correctly normalizes relative URLs using normalize_url_for_deep_crawl(), but then passes the original raw URL to can_process_url() instead of the normalized URL. This causes valid relative URLs to be rejected with "Missing scheme or netloc" warnings.
In deep_crawling/bff_strategy.py, lines 123-131:
for link in links:
    url = link.get("href")  # Raw href (relative URL)
    base_url = normalize_url_for_deep_crawl(url, source_url)  # ✅ Normalized to absolute
    if base_url in visited:
        continue
    if not await self.can_process_url(url, new_depth):  # ❌ BUG: should be base_url
        self.stats.urls_skipped += 1
        continue
The same issue exists in deep_crawling/bfs_strategy.py around line 118.
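To make the failure mode concrete without depending on crawl4ai internals, here is a minimal standalone sketch. The helpers `normalize`, `can_process`, and `discover` are hypothetical stand-ins, not the library's code: `normalize` uses `urllib.parse.urljoin` in place of `normalize_url_for_deep_crawl()`, and `can_process` only assumes that the real check requires a scheme and a netloc, as the warning text suggests.

```python
from urllib.parse import urljoin, urlparse


def normalize(href: str, source_url: str) -> str:
    """Hypothetical stand-in for normalize_url_for_deep_crawl():
    resolve the href against the URL of the page it was found on."""
    return urljoin(source_url, href)


def can_process(url: str) -> bool:
    """Hypothetical stand-in for the validity check behind the
    'Missing scheme or netloc' warning."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


def discover(links, source_url, visited, use_normalized=True):
    """Mirror the shape of the discovery loop and return accepted URLs.

    use_normalized=False reproduces the reported behavior (raw href is
    validated); use_normalized=True is the proposed fix.
    """
    accepted = []
    for link in links:
        url = link.get("href")
        if not url:
            continue
        base_url = normalize(url, source_url)
        if base_url in visited:
            continue
        candidate = base_url if use_normalized else url
        if not can_process(candidate):
            continue  # the real strategy would count this as skipped
        accepted.append(base_url)
    return accepted


links = [
    {"href": "/rings/c/1650308?icid=HMP%3ATB%3ATB1%3ARINGS"},
    {"href": "/?icid=PEO%3ALOGO"},
]
source = "https://peoplesjewellers.com/"

buggy = discover(links, source, set(), use_normalized=False)
fixed = discover(links, source, set(), use_normalized=True)
print(buggy)
print(fixed)
```

With the raw href, every relative link fails validation and nothing is queued; with the normalized URL, both links pass. In the strategies themselves the change would presumably be the single argument swap from `url` to `base_url` in the `can_process_url()` call.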
Is this reproducible?
Yes
Inputs Causing the Bug
URL(s):
Target domain: https://peoplesjewellers.com
Example rejected URLs (relative hrefs from the page):
/valentines-day-specials/c/40000558?icid=HMP%3ATOPTSB%3AVDAYSPECIALS_UPTO50SS
/rings/c/1650308?icid=HMP%3ATB%3ATB1%3ARINGS
/necklaces/c/1977391?icid=HMP%3ATB%3ATB2%3ANECKLACES
/earrings/c/1280226?icid=HMP%3ATB%3ATB3%3AEARRINGS
/?icid=PEO%3ALOGO (homepage link)
Settings used:
# Deep crawl configuration
BestFirstCrawlingStrategy(
    max_depth=50,
    include_external=False,
    max_pages=50,
    filter_chain=FilterChain([
        URLPatternFilter(patterns=[...], reverse=True),
        DomainFilter(allowed_domains=["peoplesjewellers.com"]),
        ContentTypeFilter(allowed_types=["text/html"])
    ]),
    url_scorer=KeywordRelevanceScorer(keywords=[...], weight=0.8)
)
Input data:
{
    "domain": "peoplesjewellers.com",
    "max_pages": 50
}
Steps to Reproduce
1. Deep crawl any website with relative internal links (e.g., /products, /about-us)
2. Observe warning logs like:
Invalid URL: /valentines-day-specials/c/40000558?icid=HMP%3ATOPTSB%3AVDAYSPECIALS_UPTO50SS, error: Missing scheme or netloc
Code snippets
OS
linux
Python version
3.13.11
Browser
chrome
Browser version
No response
Error logs & Screenshots (if applicable)
