Adaptive SLURM scaling kills workers before compute step runs #267

@JanStreffing

Description

Problem:
When using dask_cluster_scaling_mode: "adapt" with SLURM, workers get cancelled before the heavy computation starts. pycmor pipelines run many fast lazy xarray steps (each completing in <1 s), then a single trigger_compute step that submits the real Dask graph. The adaptive scaler sees the fast steps finish, decides no workers are needed, and issues SLURM cancellations — right before .compute() submits the actual work. Workers die shortly after startup, with ~30-40 s runtime and near-zero memory usage.

This is a race condition between Prefect's sequential task submission pattern and dask.distributed's Adaptive algorithm, which monitors pending tasks in the scheduler queue. The adapt() call is upstream (dask.distributed.deploy.Adaptive), not our code.

Workaround:
Use fixed scaling instead of adaptive:

```yaml
dask_cluster_scaling_mode: "fixed"
dask_cluster_scaling_fixed_jobs: 1
```

Other options:

  1. Pass tuning parameters to .adapt() — interval, wait_count, target_duration — to slow down scale-down decisions. Would require exposing these in the YAML config schema.
  2. Add a minimum cooldown period after scale-up before allowing scale-down. Would need a wrapper around the Adaptive class.
  3. Pre-submit a dummy long-running task to keep workers alive until the real compute graph arrives.
  4. Restructure pipelines so the lazy steps and .compute() are a single Dask task, so the scheduler sees continuous work.
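Option 4 can be sketched as follows. This is a minimal illustration using a toy dask array in place of pycmor's real pipeline steps; the point is only that the lazy graph construction and the final .compute() live inside one Dask task, so the scheduler always sees pending work and adaptive scaling never observes an idle window:

```python
import dask
import dask.array as da

@dask.delayed
def pipeline_step():
    # Hypothetical restructuring: the fast lazy steps AND the final
    # .compute() are wrapped in a single Dask task. The stand-in
    # computation here is (ones * 2).mean(), which evaluates to 2.0.
    arr = da.ones((100, 100), chunks=(10, 10))
    return float((arr * 2).mean().compute(scheduler="threads"))

result = pipeline_step().compute()  # → 2.0
```

The trade-off is that per-step granularity (and Prefect's per-step observability) is lost for whatever is folded into the single task.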

Option 1 (expose adapt tuning params) is the lowest-effort real fix. For now, document that fixed mode is preferred for sequential pipelines with few rules. Adaptive mode is only useful when processing many independent rules in parallel. Of course, we will end up in the latter use case soon.
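If option 1 were implemented, the exposed knobs might look like this in the YAML config. The key names below are hypothetical (only dask_cluster_scaling_mode exists today); they would map directly onto the interval, wait_count, and target_duration arguments of dask.distributed's Adaptive:

```yaml
dask_cluster_scaling_mode: "adapt"
# Hypothetical tuning keys, forwarded to cluster.adapt():
dask_cluster_adapt_interval: "5s"         # evaluate scaling less frequently
dask_cluster_adapt_wait_count: 10         # require 10 consecutive idle checks before scale-down
dask_cluster_adapt_target_duration: "60s" # assume work arrives in ~60s batches
```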

Metadata

Labels

bug (Something isn't working)
