[Bug]: can_process_url() called with raw URL instead of normalized URL in deep crawl strategies

### crawl4ai version

0.8.0

### Expected Behavior

can_process_url() should receive base_url (the normalized absolute URL), not url (the raw relative href).

### Current Behavior

In BestFirstCrawlingStrategy and BFSDeepCrawlStrategy, the link_discovery() method correctly normalizes relative URLs using normalize_url_for_deep_crawl(), but then passes the original raw URL to can_process_url() instead of the normalized URL. This causes valid relative URLs to be rejected with "Missing scheme or netloc" warnings.

In deep_crawling/bff_strategy.py, lines 123-131:

`
for link in links:
    url = link.get("href")                              # Raw href (relative URL)
    base_url = normalize_url_for_deep_crawl(url, source_url)  # ✅ Normalized to absolute
    if base_url in visited:
        continue
    if not await self.can_process_url(url, new_depth):  # ❌ BUG: should be base_url
        self.stats.urls_skipped += 1
        continue
`

The same issue exists in deep_crawling/bfs_strategy.py around line 118.

### Is this reproducible?

Yes

### Inputs Causing the Bug

```bash
URL(s):

Target domain: https://peoplesjewellers.com
Example rejected URLs (relative hrefs from the page):
/valentines-day-specials/c/40000558?icid=HMP%3ATOPTSB%3AVDAYSPECIALS_UPTO50SS
/rings/c/1650308?icid=HMP%3ATB%3ATB1%3ARINGS
/necklaces/c/1977391?icid=HMP%3ATB%3ATB2%3ANECKLACES
/earrings/c/1280226?icid=HMP%3ATB%3ATB3%3AEARRINGS
/?icid=PEO%3ALOGO (homepage link)
Settings used:


# Deep crawl configuration
BestFirstCrawlingStrategy(
    max_depth=50,
    include_external=False,
    max_pages=50,
    filter_chain=FilterChain([
        URLPatternFilter(patterns=[...], reverse=True),
        DomainFilter(allowed_domains=["peoplesjewellers.com"]),
        ContentTypeFilter(allowed_types=["text/html"])
    ]),
    url_scorer=KeywordRelevanceScorer(keywords=[...], weight=0.8)
)
Input data:


{
    "domain": "peoplesjewellers.com",
    "max_pages": 50
}
```

### Steps to Reproduce

```bash
1. Deep crawl any website with relative internal links (e.g., /products, /about-us)
2. Observe warning logs like:

Invalid URL: /valentines-day-specials/c/40000558?icid=HMP%3ATOPTSB%3AVDAYSPECIALS_UPTO50SS, error: Missing scheme or netloc
```

### Code snippets

```python

```

### OS

linux

### Python version

3.13.11

### Browser

chrome

### Browser version

_No response_

### Error logs & Screenshots (if applicable)

<img width="1111" height="191" alt="Image" src="https://github.com/user-attachments/assets/8742d8bb-a7a0-4899-a8c7-af3beb2e0af6" />

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug]: can_process_url() called with raw URL instead of normalized URL in deep crawl strategies #1743

crawl4ai version

Expected Behavior

Current Behavior

Is this reproducible?

Inputs Causing the Bug

Steps to Reproduce

Code snippets

OS

Python version

Browser

Browser version

Error logs & Screenshots (if applicable)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: can_process_url() called with raw URL instead of normalized URL in deep crawl strategies #1743

Description

crawl4ai version

Expected Behavior

Current Behavior

Is this reproducible?

Inputs Causing the Bug

Steps to Reproduce

Code snippets

OS

Python version

Browser

Browser version

Error logs & Screenshots (if applicable)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions