Race condition: concurrent pipeline load requests cause transient failures #513

@livepeer-tessa

Summary

When multiple pipeline load requests arrive concurrently for the same pipeline, a race condition causes transient failures. The second request detects that another thread is already loading the pipeline but returns a failure instead of waiting for the first load to complete.

Error Logs

From Grafana fal.ai logs (2026-02-21 04:33 UTC):

04:33:27.277 - Loading 1 pipeline(s): ['streamdiffusionv2']
04:33:27.645 - Loading 1 pipeline(s): ['streamdiffusionv2']  # Second concurrent request
04:33:28.134 - Loading pipeline: streamdiffusionv2
04:33:28.135 - Pipeline streamdiffusionv2 already loading by another thread
04:33:28.138 - ERROR - Failed to load pipeline: streamdiffusionv2
04:33:28.139 - ERROR - Some pipelines failed to load

The pipeline loaded successfully about 27 seconds later, but the intermediate failure produced ERROR logs and could have surfaced as a user-facing error.

Expected Behavior

When a pipeline load request detects that another thread is already loading the same pipeline, it should:

  1. Wait for the first load to complete
  2. Return success if the pipeline is now loaded (reusing the result from the first load)
  3. Only fail if the first load also failed
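The three steps above could be implemented roughly as follows. This is a minimal sketch, not the code in scope/server/pipeline_manager.py: the class name `PipelineManager`, the `load_pipeline` method, the `_Load` record, and the `loader` callable are all hypothetical names chosen for illustration. It assumes loads run on ordinary Python threads and uses a `threading.Event` per in-flight load so later callers can wait for the first caller's outcome.

```python
import threading


class _Load:
    """Bookkeeping for one in-flight load (hypothetical helper)."""

    def __init__(self):
        self.done = threading.Event()  # set once the owning thread finishes
        self.ok = False                # outcome of the owning thread's load


class PipelineManager:
    """Hypothetical stand-in, not the real class in pipeline_manager.py."""

    def __init__(self, loader):
        self._loader = loader          # callable(name) -> pipeline; may raise
        self._lock = threading.Lock()  # guards the two dicts below
        self._loaded = {}              # name -> loaded pipeline
        self._in_flight = {}           # name -> _Load for a load in progress

    def load_pipeline(self, name, timeout=None):
        with self._lock:
            if name in self._loaded:
                return True
            load = self._in_flight.get(name)
            owner = load is None
            if owner:
                # First caller for this pipeline: register the in-flight load.
                load = self._in_flight[name] = _Load()

        if not owner:
            # Step 1: wait for the first load instead of failing.
            if not load.done.wait(timeout):
                return False  # timed out while waiting
            # Steps 2 and 3: mirror the first load's outcome.
            return load.ok

        try:
            pipeline = self._loader(name)
            with self._lock:
                self._loaded[name] = pipeline
            load.ok = True
        except Exception:
            load.ok = False
        finally:
            with self._lock:
                self._in_flight.pop(name, None)
            load.done.set()  # wake every waiter, on success or failure
        return load.ok
```

The slow load itself runs outside the lock, so requests for other pipelines are not blocked. After a failed load the in-flight entry is removed, so a later request retries rather than inheriting the old failure.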

Current Behavior

The second concurrent request immediately returns a failure when it detects another thread is loading.
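For a regression test, the fail-fast behavior can be modeled and reproduced deterministically. The sketch below is a hypothetical reconstruction of the pattern implied by the logs, not the actual code in scope/server/pipeline_manager.py: `FailFastManager`, `load_pipeline`, and `reproduce` are illustrative names. It uses events to hold the first load open while a second request arrives.

```python
import threading


class FailFastManager:
    """Hypothetical model of the reported behavior, for illustration only."""

    def __init__(self, loader):
        self._loader = loader          # callable(name) -> pipeline
        self._lock = threading.Lock()
        self._loaded = {}              # name -> loaded pipeline
        self._loading = set()          # names currently being loaded

    def load_pipeline(self, name):
        with self._lock:
            if name in self._loaded:
                return True
            if name in self._loading:
                # The reported symptom: give up instead of waiting.
                return False
            self._loading.add(name)
        try:
            pipeline = self._loader(name)
            with self._lock:
                self._loaded[name] = pipeline
            return True
        finally:
            with self._lock:
                self._loading.discard(name)


def reproduce():
    started = threading.Event()   # set when the first load begins
    release = threading.Event()   # set to let the first load finish

    def slow_loader(name):
        started.set()
        release.wait()
        return object()

    manager = FailFastManager(slow_loader)
    first = []
    thread = threading.Thread(
        target=lambda: first.append(manager.load_pipeline("streamdiffusionv2"))
    )
    thread.start()
    started.wait()                                    # first load in progress
    second = manager.load_pipeline("streamdiffusionv2")  # concurrent request
    release.set()
    thread.join()
    third = manager.load_pipeline("streamdiffusionv2")   # after load completes
    return first[0], second, third
```

In this model `reproduce()` returns `(True, False, True)`: the first request succeeds, the concurrent request fails, and a later request succeeds, which matches the transient failure followed by self-recovery seen in the logs. A fixed implementation should make the middle value `True`.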

Impact

  • Unnecessary ERROR-level logs in monitoring
  • Potential user-facing errors during concurrent operations
  • The failure self-recovers, but the spurious errors make the logs confusing

Component

scope/server/pipeline_manager.py


Filed automatically by Scope Error Monitor
