
[Bug]: URL Seeding fails when Common Crawl indexes are not accessible, even with the source set to "sitemap" only #1747

@ChiragBellara

Description

crawl4ai version

0.8.0

Expected Behavior

URL seeding should not fail when the source is set to "sitemap" only, even if the Common Crawl indexes are not reachable.

Current Behavior

On 01/29/2026, the Common Crawl servers were down and the index URL (https://index.commoncrawl.org/collinfo.json) was not reachable. As a result, URL seeding failed with an httpx.ConnectTimeout error. This should not happen when the source is set to "sitemap" only, because sitemap-based seeding does not require the latest Common Crawl index.
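
For context, a plain sitemap-only call looks like the sketch below. When the Common Crawl index is down, this call currently raises httpx.ConnectTimeout even though it never requests Common Crawl data (the target URL is the same one used in the reproduction):

import asyncio

from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    # Sitemap-only discovery; no Common Crawl index should be needed.
    config = SeedingConfig(source="sitemap")
    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("https://docs.crawl4ai.com/", config)
        print(f"Discovered {len(urls)} URLs")

asyncio.run(main())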

Is this reproducible?

Yes

Inputs Causing the Bug

The bug is triggered whenever the Common Crawl index URL is unreachable. Since an outage cannot be relied on, the issue can also be reproduced in code using the steps below.

Steps to Reproduce

Create an async function to perform the URL seeding task. Before calling the urls() method, monkey-patch the seeder's _latest_index method to raise an HTTP error. This simulates what would happen if the Common Crawl indexes were not reachable. The function below does exactly this.

Code snippets

import httpx
from crawl4ai import AsyncLogger, AsyncUrlSeeder, SeedingConfig

async def recreate_cc_error():
    # Sitemap-only seeding: the Common Crawl index should never be consulted.
    config = SeedingConfig(source="sitemap")

    async with AsyncUrlSeeder(logger=AsyncLogger(verbose=True)) as seeder:
        async def boom(*args, **kwargs):
            print("DEBUG: _latest_index called")
            raise httpx.ConnectTimeout("Simulated CommonCrawl outage")

        # Monkey-patch the index lookup to simulate a Common Crawl outage.
        seeder._latest_index = boom
        try:
            await seeder.urls("https://docs.crawl4ai.com/", config)
            print("PASS: _latest_index was NOT called (expected after fix).")
        except httpx.ConnectTimeout:
            print("FAIL: _latest_index WAS called even though source='sitemap'.")

OS

macOS

Python version

3.12.12

Browser

Chrome, Safari

Browser version

No response

Error logs & Screenshots (if applicable)

Traceback (most recent call last):
  File ".../site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File ".../site-packages/httpx/_transports/default.py", line 394, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File ".../site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
    raise exc from None
  File ".../site-packages/httpcore/_async/connection.py", line 124, in _connect
    stream = await self._network_backend.connect_tcp(**kwargs)
  File ".../site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectTimeout

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../crawl4ai/async_url_seeder.py", line 405, in urls
    self.index_id = await self._latest_index()
  File ".../crawl4ai/async_url_seeder.py", line 1754, in _latest_index
    j = await c.get(COLLINFO_URL, timeout=10)
  File ".../site-packages/httpx/_client.py", line 1768, in get
    return await self.request(...)
  File ".../site-packages/httpx/_transports/default.py", line 393, in handle_async_request
    raise mapped_exc(message) from exc
httpx.ConnectTimeout
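
The traceback shows that urls() resolves the latest Common Crawl index unconditionally (async_url_seeder.py, line 405), even for sitemap-only seeding. A minimal sketch of a possible guard follows; it assumes config.source is the combined source string ("sitemap", "cc", or "sitemap+cc"), so the parsing is illustrative rather than the actual implementation:

# Hypothetical guard inside AsyncUrlSeeder.urls(); only _latest_index is
# taken from the traceback above, the rest is assumed.
sources = {s.strip() for s in config.source.split("+")}
if "cc" in sources:
    # Resolve the Common Crawl index only when it will actually be used.
    self.index_id = await self._latest_index()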
