Description
crawl4ai version
0.8.0
Expected Behavior
URL seeding should not fail when the source is set to "sitemap" only, even if the Common Crawl indexes are unreachable or cannot be updated.
Current Behavior
On 01/29/2026, the Common Crawl servers were down and the indexing URL (https://index.commoncrawl.org/collinfo.json) was not accessible. This caused URL seeding to raise an httpx.ConnectTimeout error. This should not happen when the source is set to "sitemap" only, because URL seeding via sitemaps does not require the latest Common Crawl index.
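For illustration, a plain sitemap-only call like the one below (same classes and arguments as the reproduction snippet further down; assumes these are the top-level crawl4ai exports) was enough to hit the timeout while the index endpoint was down:
import asyncio
from crawl4ai import AsyncLogger, AsyncUrlSeeder, SeedingConfig

async def main():
    config = SeedingConfig(source="sitemap")  # sitemap only; Common Crawl not needed
    async with AsyncUrlSeeder(logger=AsyncLogger(verbose=True)) as seeder:
        # While index.commoncrawl.org is unreachable, this raises
        # httpx.ConnectTimeout even though the sitemap source never
        # touches the Common Crawl index.
        await seeder.urls("https://docs.crawl4ai.com/", config)

asyncio.run(main())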
Is this reproducible?
Yes
Inputs Causing the Bug
The bug only triggers while the Common Crawl indexing URL is unreachable. However, it can be reproduced in code using the steps below.
Steps to Reproduce
Create an async function that performs the URL seeding task. Before calling the urls() method, patch the seeder's _latest_index method to raise an HTTP error. This simulates what happens when the Common Crawl indexes are unreachable. Below is a snippet that does exactly this.
Code snippets
import asyncio
import httpx
from crawl4ai import AsyncLogger, AsyncUrlSeeder, SeedingConfig

async def recreate_cc_error():
    config = SeedingConfig(source="sitemap")
    async with AsyncUrlSeeder(logger=AsyncLogger(verbose=True)) as seeder:
        # Simulate the Common Crawl outage by patching _latest_index.
        async def boom(*args, **kwargs):
            print("DEBUG: _latest_index called")
            raise httpx.ConnectTimeout("Simulated CommonCrawl outage")
        seeder._latest_index = boom
        try:
            await seeder.urls("https://docs.crawl4ai.com/", config)
            print("PASS: _latest_index was NOT called (expected after fix).")
        except httpx.ConnectTimeout:
            print("FAIL: _latest_index WAS called even though source='sitemap'.")

asyncio.run(recreate_cc_error())
OS
macOS
Python version
3.12.12
Browser
Chrome, Safari
Browser version
No response
Error logs & Screenshots (if applicable)
Traceback (most recent call last):
File ".../site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
yield
File ".../site-packages/httpx/_transports/default.py", line 394, in handle_async_request
resp = await self._pool.handle_async_request(req)
File ".../site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
raise exc from None
File ".../site-packages/httpcore/_async/connection.py", line 124, in _connect
stream = await self._network_backend.connect_tcp(**kwargs)
File ".../site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ConnectTimeout
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".../crawl4ai/async_url_seeder.py", line 405, in urls
self.index_id = await self._latest_index()
File ".../crawl4ai/async_url_seeder.py", line 1754, in _latest_index
j = await c.get(COLLINFO_URL, timeout=10)
File ".../site-packages/httpx/_client.py", line 1768, in get
return await self.request(...)
File ".../site-packages/httpx/_transports/default.py", line 393, in handle_async_request
raise mapped_exc(message) from exc
httpx.ConnectTimeout
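For what it's worth, the traceback shows that urls() calls self.index_id = await self._latest_index() unconditionally (async_url_seeder.py, line 405). A minimal sketch of a guard, assuming config.source is a plain string such as "sitemap", "cc", or "sitemap+cc" (the exact internal handling may differ):
# Sketch only, inside AsyncUrlSeeder.urls(): resolve the latest
# Common Crawl index only when a Common Crawl source is requested.
if "cc" in config.source:
    self.index_id = await self._latest_index()
With a guard like this, sitemap-only seeding would never contact index.commoncrawl.org, so a Common Crawl outage could not break it.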