SDK + Scrapy integration: RequestQueue.fetch_next_request crashes with Pydantic ValidationError #627

@vdusek

Description

When running a Scrapy spider with the Apify + Scrapy integration (using the specific setup shown in the reproduction below), the crawl crashes during scheduling with a pydantic_core.ValidationError. It looks like Request.model_validate(...) is invoked with None, probably because get_request(...) returns None for a request ID retrieved from the request queue (RQ) head.

Locally, with the file-system storage client, the crawl works; the problem occurs only on the Apify platform.
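
The failing call can be demonstrated in isolation: Pydantic raises this exact error whenever model_validate(None) is called on any model. A minimal sketch with a stand-in model (not crawlee's actual Request class):

from pydantic import BaseModel


class StubRequest(BaseModel):
    # Stand-in for crawlee's Request model; only the behavior of
    # model_validate(None) matters here.
    url: str


# Raises: pydantic_core.ValidationError: 1 validation error for StubRequest
#   Input should be a valid dictionary or instance of StubRequest
#   [type=model_type, input_value=None, input_type=NoneType]
StubRequest.model_validate(None)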

Follow-up to #404.

Reported by @honzajavorek

Reproduction

# spiders/my_spider.py
import logging

from scrapy import Spider as BaseSpider
from scrapy.http.response import Response


class MySpider(BaseSpider):
    name = 'jobs-govcz'
    category_id = '128'
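    # start_urls is intentionally absent here; it is supplied at crawl time
    # via crawler_runner.crawl(Spider, start_urls=[START_URL]) in main.py below.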

    def parse(self, response: Response) -> None:
        self.log(f'Parsing page {response.url}', level=logging.INFO)

# __main__.py
from __future__ import annotations

from scrapy.utils.reactor import install_reactor

install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
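# Note: install_reactor() runs before the remaining imports because the asyncio
# reactor must be installed before anything imports twisted.internet.reactor.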

import os

from .main import main
from apify.scrapy import initialize_logging, run_scrapy_actor

os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'


if __name__ == '__main__':
    initialize_logging()
    run_scrapy_actor(main())

# main.py
from scrapy.crawler import CrawlerRunner
from scrapy.utils.defer import deferred_to_future

from .spiders import MySpider as Spider
from apify import Actor
from apify.scrapy import apply_apify_settings

START_URL = 'https://portal.isoss.gov.cz/irj/portal/anonymous/mvrest?path=/eosm-public-offer&officeLabels=%7B%7D&page=1&pageSize=100000&sortColumn=zdatzvsm&sortOrder=-1'


async def main() -> None:
    """Run the Scrapy crawler inside an Apify Actor."""
    async with Actor:
        settings = apply_apify_settings()
        crawler_runner = CrawlerRunner(settings)
        crawl_deferred = crawler_runner.crawl(Spider, start_urls=[START_URL])
        await deferred_to_future(crawl_deferred)

# settings.py
BOT_NAME = 'titlebot'
DEPTH_LIMIT = 1
LOG_LEVEL = 'INFO'
NEWSPIDER_MODULE = 'src.spiders'
ROBOTSTXT_OBEY = True
SPIDER_MODULES = ['src.spiders']
TELNETCONSOLE_ENABLED = False
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
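
For context, the Apify-specific components come from the apply_apify_settings() call in main.py, not from this module; the setting names below appear in the "Overridden settings" and pipeline entries of the log that follows. A rough sketch of the effective overrides (the pipeline priority is an assumption; only the class name appears in the log):

from scrapy.settings import Settings

settings = Settings()
settings.setmodule('src.settings')
# Components visible in the log below:
settings.set('SCHEDULER', 'apify.scrapy.scheduler.ApifyScheduler')
settings.set('HTTPCACHE_STORAGE', 'apify.scrapy.extensions.ApifyCacheStorage')
# Priority 1000 is an assumed value; the log only shows the pipeline class.
settings.set('ITEM_PIPELINES', {'apify.scrapy.pipelines.ActorDatasetPushPipeline': 1000})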

Log

2025-10-13T17:57:56.997Z ACTOR: Pulling container image of build AHYR0j0K7hIfXsPLV from registry.
2025-10-13T17:57:58.419Z ACTOR: Creating container.
2025-10-13T17:57:58.515Z ACTOR: Starting container.
2025-10-13T17:58:00.186Z [apify] INFO  Initializing Actor...
2025-10-13T17:58:00.193Z [apify._configuration] WARN  Actor is running on the Apify platform, `disable_browser_sandbox` was changed to True.
2025-10-13T17:58:00.194Z [apify] INFO  System info ({"apify_sdk_version": "3.0.1", "apify_client_version": "2.1.0", "crawlee_version": "1.0.2", "python_version": "3.13.8", "os": "linux"})
2025-10-13T17:58:00.262Z [scrapy.addons] INFO  Enabled addons:
2025-10-13T17:58:00.263Z [] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.321Z [scrapy.middleware] INFO  Enabled extensions:
2025-10-13T17:58:00.322Z ['scrapy.extensions.corestats.CoreStats',
2025-10-13T17:58:00.323Z  'scrapy.extensions.memusage.MemoryUsage',
2025-10-13T17:58:00.324Z  'scrapy.extensions.logstats.LogStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.324Z [scrapy.crawler] INFO  Overridden settings:
2025-10-13T17:58:00.325Z {'BOT_NAME': 'titlebot',
2025-10-13T17:58:00.326Z  'DEPTH_LIMIT': 1,
2025-10-13T17:58:00.326Z  'HTTPCACHE_STORAGE': 'apify.scrapy.extensions.ApifyCacheStorage',
2025-10-13T17:58:00.327Z  'LOG_LEVEL': 'INFO',
2025-10-13T17:58:00.328Z  'NEWSPIDER_MODULE': 'src.spiders',
2025-10-13T17:58:00.328Z  'ROBOTSTXT_OBEY': True,
2025-10-13T17:58:00.329Z  'SCHEDULER': 'apify.scrapy.scheduler.ApifyScheduler',
2025-10-13T17:58:00.330Z  'SPIDER_MODULES': ['src.spiders'],
2025-10-13T17:58:00.330Z  'TELNETCONSOLE_ENABLED': False}
2025-10-13T17:58:00.426Z [apify] INFO  ApifyHttpProxyMiddleware is not going to be used. Object "proxyConfiguration" is probably missing  in the Actor input.
2025-10-13T17:58:00.427Z [scrapy.middleware] INFO  Enabled downloader middlewares:
2025-10-13T17:58:00.427Z ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
2025-10-13T17:58:00.428Z  'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
2025-10-13T17:58:00.428Z  'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
2025-10-13T17:58:00.429Z  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
2025-10-13T17:58:00.430Z  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
2025-10-13T17:58:00.430Z  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
2025-10-13T17:58:00.431Z  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
2025-10-13T17:58:00.432Z  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
2025-10-13T17:58:00.432Z  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
2025-10-13T17:58:00.433Z  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
2025-10-13T17:58:00.434Z  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
2025-10-13T17:58:00.434Z  'scrapy.downloadermiddlewares.stats.DownloaderStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.435Z [scrapy.middleware] INFO  Enabled spider middlewares:
2025-10-13T17:58:00.436Z ['scrapy.spidermiddlewares.start.StartSpiderMiddleware',
2025-10-13T17:58:00.436Z  'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
2025-10-13T17:58:00.437Z  'scrapy.spidermiddlewares.referer.RefererMiddleware',
2025-10-13T17:58:00.443Z  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
2025-10-13T17:58:00.444Z  'scrapy.spidermiddlewares.depth.DepthMiddleware'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.479Z [scrapy.middleware] INFO  Enabled item pipelines:
2025-10-13T17:58:00.480Z ['apify.scrapy.pipelines.ActorDatasetPushPipeline'] ({"crawler": "<scrapy.crawler.Crawler object at 0x7d8a2494ef90>"})
2025-10-13T17:58:00.481Z [scrapy.core.engine] INFO  Spider opened ({"spider": "<MySpider 'jobs-govcz' at 0x7d8a2494f380>"})
2025-10-13T17:58:00.561Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<MySpider 'jobs-govcz' at 0x7d8a2494f380>"})
2025-10-13T17:58:00.701Z [apify.scrapy._async_thread] ERROR Coroutine execution raised an exception.
2025-10-13T17:58:00.701Z       Traceback (most recent call last):
2025-10-13T17:58:00.702Z         File "/usr/local/lib/python3.13/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-10-13T17:58:00.703Z           return future.result(timeout=timeout.total_seconds())
2025-10-13T17:58:00.704Z                  ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.704Z         File "/usr/local/lib/python3.13/concurrent/futures/_base.py", line 456, in result
2025-10-13T17:58:00.705Z           return self.__get_result()
2025-10-13T17:58:00.705Z                  ~~~~~~~~~~~~~~~~~^^
2025-10-13T17:58:00.706Z         File "/usr/local/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
2025-10-13T17:58:00.707Z           raise self._exception
2025-10-13T17:58:00.707Z         File "/usr/local/lib/python3.13/site-packages/crawlee/storages/_request_queue.py", line 232, in fetch_next_request
2025-10-13T17:58:00.708Z           return await self._client.fetch_next_request()
2025-10-13T17:58:00.710Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.715Z         File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_client.py", line 100, in fetch_next_request
2025-10-13T17:58:00.716Z           return await self._implementation.fetch_next_request()
2025-10-13T17:58:00.716Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.717Z         File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 205, in fetch_next_request
2025-10-13T17:58:00.717Z           await self._ensure_head_is_non_empty()
2025-10-13T17:58:00.718Z         File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 221, in _ensure_head_is_non_empty
2025-10-13T17:58:00.719Z           await self._list_head()
2025-10-13T17:58:00.719Z         File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 250, in _list_head
2025-10-13T17:58:00.720Z           request = Request.model_validate(
2025-10-13T17:58:00.721Z               await self._api_client.get_request(unique_key_to_request_id(request.unique_key))
2025-10-13T17:58:00.721Z           )
2025-10-13T17:58:00.722Z         File "/usr/local/lib/python3.13/site-packages/pydantic/main.py", line 705, in model_validate
2025-10-13T17:58:00.722Z           return cls.__pydantic_validator__.validate_python(
2025-10-13T17:58:00.723Z                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
2025-10-13T17:58:00.724Z               obj, strict=strict, from_attributes=from_attributes, context=context, by_alias=by_alias, by_name=by_name
2025-10-13T17:58:00.724Z               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.725Z           )
2025-10-13T17:58:00.725Z           ^
2025-10-13T17:58:00.726Z       pydantic_core._pydantic_core.ValidationError: 1 validation error for Request
2025-10-13T17:58:00.726Z         Input should be a valid dictionary or instance of Request [type=model_type, input_value=None, input_type=NoneType]
2025-10-13T17:58:00.727Z           For further information visit https://errors.pydantic.dev/2.11/v/model_type
2025-10-13T17:58:00.728Z Traceback (most recent call last):
2025-10-13T17:58:00.733Z   File "/usr/local/lib/python3.13/site-packages/apify/scrapy/scheduler.py", line 152, in next_request
2025-10-13T17:58:00.733Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-10-13T17:58:00.734Z   File "/usr/local/lib/python3.13/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-10-13T17:58:00.735Z     return future.result(timeout=timeout.total_seconds())
2025-10-13T17:58:00.735Z            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.736Z   File "/usr/local/lib/python3.13/concurrent/futures/_base.py", line 456, in result
2025-10-13T17:58:00.736Z     return self.__get_result()
2025-10-13T17:58:00.737Z            ~~~~~~~~~~~~~~~~~^^
2025-10-13T17:58:00.737Z   File "/usr/local/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
2025-10-13T17:58:00.738Z     raise self._exception
2025-10-13T17:58:00.739Z   File "/usr/local/lib/python3.13/site-packages/crawlee/storages/_request_queue.py", line 232, in fetch_next_request
2025-10-13T17:58:00.739Z     return await self._client.fetch_next_request()
2025-10-13T17:58:00.740Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.741Z   File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_client.py", line 100, in fetch_next_request
2025-10-13T17:58:00.741Z     return await self._implementation.fetch_next_request()
2025-10-13T17:58:00.742Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.743Z   File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 205, in fetch_next_request
2025-10-13T17:58:00.748Z     await self._ensure_head_is_non_empty()
2025-10-13T17:58:00.749Z   File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 221, in _ensure_head_is_non_empty
2025-10-13T17:58:00.750Z     await self._list_head()
2025-10-13T17:58:00.750Z   File "/usr/local/lib/python3.13/site-packages/apify/storage_clients/_apify/_request_queue_single_client.py", line 250, in _list_head
2025-10-13T17:58:00.751Z     request = Request.model_validate(
2025-10-13T17:58:00.752Z         await self._api_client.get_request(unique_key_to_request_id(request.unique_key))
2025-10-13T17:58:00.753Z     )
2025-10-13T17:58:00.753Z   File "/usr/local/lib/python3.13/site-packages/pydantic/main.py", line 705, in model_validate
2025-10-13T17:58:00.754Z     return cls.__pydantic_validator__.validate_python(
2025-10-13T17:58:00.754Z            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
2025-10-13T17:58:00.755Z         obj, strict=strict, from_attributes=from_attributes, context=context, by_alias=by_alias, by_name=by_name
2025-10-13T17:58:00.756Z         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-10-13T17:58:00.757Z     )
2025-10-13T17:58:00.757Z     ^
2025-10-13T17:58:00.758Z pydantic_core._pydantic_core.ValidationError: 1 validation error for Request
2025-10-13T17:58:00.759Z   Input should be a valid dictionary or instance of Request [type=model_type, input_value=None, input_type=NoneType]
2025-10-13T17:58:00.760Z     For further information visit https://errors.pydantic.dev/2.11/v/model_type
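
Judging from the traceback, a possible mitigation would be for _list_head() to skip head entries whose full record get_request() returns as None, instead of passing None straight to Request.model_validate(). A minimal, hypothetical sketch of that guard using stand-in types (not the SDK's actual code):

from typing import Any

from pydantic import BaseModel


class StubRequest(BaseModel):
    # Stand-in for crawlee's Request model.
    url: str
    unique_key: str


def validate_head_records(records: list[dict[str, Any] | None]) -> list[StubRequest]:
    # Validate raw request records fetched for the RQ head, skipping entries
    # the API returned as None (e.g. the request was already handled, or the
    # head listing is ahead of what get_request() can see).
    requests: list[StubRequest] = []
    for record in records:
        if record is None:
            continue  # skip instead of crashing on model_validate(None)
        requests.append(StubRequest.model_validate(record))
    return requests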

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working.
    t-tooling: Issues with this label are in the ownership of the tooling team.
