Description
When running a Scrapy spider through the Apify + Scrapy integration (with the specific setup shown in the reproduction below), the crawl crashes during scheduling with a pydantic_core.ValidationError. It looks like Request.model_validate(...) is invoked with None, probably because get_request(...) returns None for a request ID retrieved from the head of the request queue. Locally, with the file-system storage client, everything works; the problem occurs only on the Apify platform.
Follow-up to #404.
Reported by @honzajavorek
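The mechanics of the crash are simple to demonstrate in isolation: whatever get_request(...) returns is passed straight into Request.model_validate(...), so a None response reproduces the exact error from the log below. A minimal sketch (hypothetical, not part of the original report):

    # demo.py -- hypothetical sketch: validating None against the crawlee
    # Request model raises the same ValidationError seen in the log below.
    from crawlee import Request

    try:
        Request.model_validate(None)  # what get_request(...) returning None leads to
    except Exception as exc:
        print(exc)  # 1 validation error for Request: Input should be a valid dictionary ...

This suggests that guarding (skipping or retrying) when get_request(...) returns None for an ID listed in the queue head would avoid the crash, though why the record is missing on the platform is the underlying question.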
Reproduction
# spiders/my_spider.py
import logging

from scrapy import Spider as BaseSpider
from scrapy.http.response import Response


class MySpider(BaseSpider):
    name = 'jobs-govcz'
    category_id = '128'

    def parse(self, response: Response) -> None:
        self.log(f'Parsing page {response.url}', level=logging.INFO)
# __main__.py
from __future__ import annotations

from scrapy.utils.reactor import install_reactor

# The asyncio reactor must be installed before anything else imports Twisted.
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')

import os

from .main import main
from apify.scrapy import initialize_logging, run_scrapy_actor

os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'

if __name__ == '__main__':
    initialize_logging()
    run_scrapy_actor(main())
# main.py
from scrapy.crawler import CrawlerRunner
from scrapy.utils.defer import deferred_to_future

from .spiders import MySpider as Spider

from apify import Actor
from apify.scrapy import apply_apify_settings

START_URL = 'https://portal.isoss.gov.cz/irj/portal/anonymous/mvrest?path=/eosm-public-offer&officeLabels=%7B%7D&page=1&pageSize=100000&sortColumn=zdatzvsm&sortOrder=-1'


async def main() -> None:
    """Run the Scrapy crawler inside an Apify Actor."""
    async with Actor:
        settings = apply_apify_settings()
        crawler_runner = CrawlerRunner(settings)
        crawl_deferred = crawler_runner.crawl(Spider, start_urls=[START_URL])
        await deferred_to_future(crawl_deferred)
# settings.py
BOT_NAME = 'titlebot'
DEPTH_LIMIT = 1
LOG_LEVEL = 'INFO'
NEWSPIDER_MODULE = 'src.spiders'
ROBOTSTXT_OBEY = True
SPIDER_MODULES = ['src.spiders']
TELNETCONSOLE_ENABLED = False
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
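Since the failure happens inside the request queue client's fetch_next_request() path (see the traceback below) rather than in Scrapy itself, it may be reproducible without Scrapy at all. A hypothetical isolation sketch, assuming a run on the platform with the default Apify storage client (isolate.py and the placeholder URL are not part of the original report):

    # isolate.py -- hypothetical isolation script, no Scrapy involved.
    import asyncio

    from apify import Actor


    async def main() -> None:
        async with Actor:
            rq = await Actor.open_request_queue()
            await rq.add_request('https://example.com')  # placeholder URL
            # On the platform, this call goes through _ensure_head_is_non_empty()
            # -> _list_head(), which is where Request.model_validate(None) raises.
            request = await rq.fetch_next_request()
            Actor.log.info(f'Fetched: {request!r}')


    if __name__ == '__main__':
        asyncio.run(main())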
Log
2025-10-13T17:57:56.997Z ACTOR: Pulling container image of build AHYR0j0K7hIfXsPLV from registry.
2025-10-13T17:57:58.419Z ACTOR: Creating container.
2025-10-13T17:57:58.515Z ACTOR: Starting container.
2025-10-13T17:58:00.186Z [apify] INFO Initializing Actor...
2025-10-13T17:58:00.193Z [apify._configuration] WARN Actor is running on the Apify platform, `disable_browser_sandbox` was changed to True.
2025-10-13T17:58:00.194Z [apify] INFO System info ({"apify_sdk_version": "3.0.1", "apify_client_version": "2.1.0", "crawlee_version": "1.0.2", "python_version": "3.13.8", "os": "linux"})
2025-10-13T17:58:00.262Z [scrapy.addons] INFO Enabled addons:
2025-10-13T17:58:00.263Z [] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.321Z [scrapy.middleware] INFO Enabled extensions:
2025-10-13T17:58:00.322Z ['scrapy.extensions.corestats.CoreStats',
2025-10-13T17:58:00.323Z 'scrapy.extensions.memusage.MemoryUsage',
2025-10-13T17:58:00.324Z 'scrapy.extensions.logstats.LogStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.324Z [scrapy.crawler] INFO Overridden settings:
2025-10-13T17:58:00.325Z {'BOT_NAME': 'titlebot',
2025-10-13T17:58:00.326Z 'DEPTH_LIMIT': 1,
2025-10-13T17:58:00.326Z 'HTTPCACHE_STORAGE': 'apify.scrapy.extensions.ApifyCacheStorage',
2025-10-13T17:58:00.327Z 'LOG_LEVEL': 'INFO',
2025-10-13T17:58:00.328Z 'NEWSPIDER_MODULE': 'src.spiders',
2025-10-13T17:58:00.328Z 'ROBOTSTXT_OBEY': True,
2025-10-13T17:58:00.329Z 'SCHEDULER': 'apify.scrapy.scheduler.ApifyScheduler',
2025-10-13T17:58:00.330Z 'SPIDER_MODULES': ['src.spiders'],
2025-10-13T17:58:00.330Z 'TELNETCONSOLE_ENABLED': False}
2025-10-13T17:58:00.426Z [apify] INFO ApifyHttpProxyMiddleware is not going to be used. Object "proxyConfiguration" is probably missing in the Actor input.
2025-10-13T17:58:00.427Z [scrapy.middleware] INFO Enabled downloader middlewares:
2025-10-13T17:58:00.427Z ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
2025-10-13T17:58:00.428Z 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
2025-10-13T17:58:00.428Z 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
2025-10-13T17:58:00.429Z 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
2025-10-13T17:58:00.430Z 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
2025-10-13T17:58:00.430Z 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
2025-10-13T17:58:00.431Z 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
2025-10-13T17:58:00.432Z 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
2025-10-13T17:58:00.432Z 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
2025-10-13T17:58:00.433Z 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
2025-10-13T17:58:00.434Z 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
2025-10-13T17:58:00.434Z 'scrapy.downloadermiddlewares.stats.DownloaderStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.435Z [scrapy.middleware] INFO Enabled spider middlewares:
2025-10-13T17:58:00.436Z ['scrapy.spidermiddlewares.start.StartSpiderMiddleware',
2025-10-13T17:58:00.436Z 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
2025-10-13T17:58:00.437Z 'scrapy.spidermiddlewares.referer.RefererMiddleware',
2025-10-13T17:58:00.443Z 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
2025-10-13T17:58:00.444Z 'scrapy.spidermiddlewares.depth.DepthMiddleware'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.479Z [scrapy.middleware] INFO Enabled item pipelines:
2025-10-13T17:58:00.480Z ['apify.scrapy.pipelines.ActorDatasetPushPipeline'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.481Z [scrapy.core.engine] INFO Spider opened ({"spider": "<MySpider 'jobs-govcz' at 0x7d8a2494f380>"})
2025-10-13T17:58:00.561Z [scrapy.extensions.logstats] INFO Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<MySpider 'jobs-govcz' at 0x7d8a2494f380>"})
2025-10-13T17:58:00.701Z [apify.scrapy._async_thread] ERROR Coroutine execution raised an exception.
2025-10-13T17:58:00.701Z Traceback (most recent call last):
2025-10-13T17:58:00.702Z File "/usr/local/lib/python3.13/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-10-13T17:58:00.703Z return future.result(timeout=timeout.total_seconds())
2025-10-13T17:58:00.704Z ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.704Z File "/usr/local/lib/python3.13/concurrent/futures/_base.py", line 456, in result
2025-10-13T17:58:00.705Z return self.__get_result()
2025-10-13T17:58:00.705Z ~~~~~~~~~~~~~~~~~^^
2025-10-13T17:58:00.706Z File "/usr/local/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
2025-10-13T17:58:00.707Z raise self._exception
2025-10-13T17:58:00.707Z File "/usr/local/lib/python3.13/site-packages/crawlee/storages/_request_queue.py", line 232, in fetch_next_request
2025-10-13T17:58:00.708Z return await self._client.fetch_next_request()
2025-10-13T17:58:00.710Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.715Z File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_client.py", line 100, in fetch_next_request
2025-10-13T17:58:00.716Z return await self._implementation.fetch_next_request()
2025-10-13T17:58:00.716Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.717Z File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 205, in fetch_next_request
2025-10-13T17:58:00.717Z await self._ensure_head_is_non_empty()
2025-10-13T17:58:00.718Z File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 221, in _ensure_head_is_non_empty
2025-10-13T17:58:00.719Z await self._list_head()
2025-10-13T17:58:00.719Z File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 250, in _list_head
2025-10-13T17:58:00.720Z request = Request.model_validate(
2025-10-13T17:58:00.721Z await self._api_client.get_request(unique_key_to_request_id(request.unique_key))
2025-10-13T17:58:00.721Z )
2025-10-13T17:58:00.722Z File "/usr/local/lib/python3.13/site-packages/pydantic/main.py", line 705, in model_validate
2025-10-13T17:58:00.722Z return cls.__pydantic_validator__.validate_python(
2025-10-13T17:58:00.723Z ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
2025-10-13T17:58:00.724Z obj, strict=strict, from_attributes=from_attributes, context=context, by_alias=by_alias, by_name=by_name
2025-10-13T17:58:00.724Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.725Z )
2025-10-13T17:58:00.725Z ^
2025-10-13T17:58:00.726Z pydantic_core._pydantic_core.ValidationError: 1 validation error for Request
2025-10-13T17:58:00.726Z Input should be a valid dictionary or instance of Request [type=model_type, input_value=None, input_type=NoneType]
2025-10-13T17:58:00.727Z For further information visit https://errors.pydantic.dev/2.11/v/model_type
2025-10-13T17:58:00.728Z Traceback (most recent call last):
2025-10-13T17:58:00.733Z File "/usr/local/lib/python3.13/site-packages/apify/scrapy/scheduler.py", line 152, in next_request
2025-10-13T17:58:00.733Z apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-10-13T17:58:00.734Z File "/usr/local/lib/python3.13/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-10-13T17:58:00.735Z return future.result(timeout=timeout.total_seconds())
2025-10-13T17:58:00.735Z ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.736Z File "/usr/local/lib/python3.13/concurrent/futures/_base.py", line 456, in result
2025-10-13T17:58:00.736Z return self.__get_result()
2025-10-13T17:58:00.737Z ~~~~~~~~~~~~~~~~~^^
2025-10-13T17:58:00.737Z File "/usr/local/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
2025-10-13T17:58:00.738Z raise self._exception
2025-10-13T17:58:00.739Z File "/usr/local/lib/python3.13/site-packages/crawlee/storages/_request_queue.py", line 232, in fetch_next_request
2025-10-13T17:58:00.739Z return await self._client.fetch_next_request()
2025-10-13T17:58:00.740Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.741Z File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_client.py", line 100, in fetch_next_request
2025-10-13T17:58:00.741Z return await self._implementation.fetch_next_request()
2025-10-13T17:58:00.742Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.743Z File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 205, in fetch_next_request
2025-10-13T17:58:00.748Z await self._ensure_head_is_non_empty()
2025-10-13T17:58:00.749Z File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 221, in _ensure_head_is_non_empty
2025-10-13T17:58:00.750Z await self._list_head()
2025-10-13T17:58:00.750Z File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 250, in _list_head
2025-10-13T17:58:00.751Z request = Request.model_validate(
2025-10-13T17:58:00.752Z await self._api_client.get_request(unique_key_to_request_id(request.unique_key))
2025-10-13T17:58:00.753Z )
2025-10-13T17:58:00.753Z File "/usr/local/lib/python3.13/site-packages/pydantic/main.py", line 705, in model_validate
2025-10-13T17:58:00.754Z return cls.__pydantic_validator__.validate_python(
2025-10-13T17:58:00.754Z ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
2025-10-13T17:58:00.755Z obj, strict=strict, from_attributes=from_attributes, context=context, by_alias=by_alias, by_name=by_name
2025-10-13T17:58:00.756Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.757Z )
2025-10-13T17:58:00.757Z ^
2025-10-13T17:58:00.758Z pydantic_core._pydantic_core.ValidationError: 1 validation error for Request
2025-10-13T17:58:00.759Z Input should be a valid dictionary or instance of Request [type=model_type, input_value=None, input_type=NoneType]
2025-10-13T17:58:00.760Z For further information visit https://errors.pydantic.dev/2.11/v/model_type