Adaptive SLURM scaling kills workers before compute step runs #267

@JanStreffing

Description

Problem:
When using dask_cluster_scaling_mode: "adapt" with SLURM, workers get cancelled before the heavy computation starts. pycmor pipelines run many fast lazy xarray steps (each completing in <1 s), then a single trigger_compute step that submits the real Dask graph. The adaptive scaler sees the fast steps finish, decides no workers are needed, and issues SLURM cancellations — right before .compute() submits the actual work. Workers die shortly after startup, with ~30-40 s runtime and near-zero memory usage.

This is a race condition between Prefect's sequential task submission pattern and dask.distributed's Adaptive algorithm, which monitors pending tasks in the scheduler queue. The adapt() call is upstream (dask.distributed.deploy.Adaptive), not our code.

Workaround:
Use fixed scaling instead of adaptive:

```yaml
dask_cluster_scaling_mode: "fixed"
dask_cluster_scaling_fixed_jobs: 1
```

Other options:

  1. Pass tuning parameters to .adapt() — interval, wait_count, target_duration — to slow down scale-down decisions. Would require exposing these in the YAML config schema.
  2. Add a minimum cooldown period after scale-up before allowing scale-down. Would need a wrapper around the Adaptive class.
  3. Pre-submit a dummy long-running task to keep workers alive until the real compute graph arrives.
  4. Restructure pipelines so the lazy steps and .compute() are a single Dask task, so the scheduler sees continuous work.
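Option 4 can be sketched as follows. This is a minimal illustration using a toy dask array in place of pycmor's real pipeline steps; the point is only that the lazy graph construction and the final .compute() live inside one Dask task, so the scheduler always sees pending work and adaptive scaling never observes an idle window:

```python
import dask
import dask.array as da

@dask.delayed
def pipeline_step():
    # Hypothetical restructuring: the fast lazy steps AND the final
    # .compute() are wrapped in a single Dask task. The stand-in
    # computation here is (ones * 2).mean(), which evaluates to 2.0.
    arr = da.ones((100, 100), chunks=(10, 10))
    return float((arr * 2).mean().compute(scheduler="threads"))

result = pipeline_step().compute()  # → 2.0
```

The trade-off is that per-step granularity (and Prefect's per-step observability) is lost for whatever is folded into the single task.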

Option 1 (expose adapt tuning params) is the lowest-effort real fix. For now, document that fixed mode is preferred for sequential pipelines with few rules. Adaptive mode is only useful when processing many independent rules in parallel. Of course, we will end up in the latter use case soon.
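If option 1 were implemented, the exposed knobs might look like this in the YAML config. The key names below are hypothetical (only dask_cluster_scaling_mode exists today); they would map directly onto the interval, wait_count, and target_duration arguments of dask.distributed's Adaptive:

```yaml
dask_cluster_scaling_mode: "adapt"
# Hypothetical tuning keys, forwarded to cluster.adapt():
dask_cluster_adapt_interval: "5s"         # evaluate scaling less frequently
dask_cluster_adapt_wait_count: 10         # require 10 consecutive idle checks before scale-down
dask_cluster_adapt_target_duration: "60s" # assume work arrives in ~60s batches
```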

Metadata

Labels

bug (Something isn't working)
