🎨Computational backend: Make sure the number of threads of a dask-worker is computed for autoscaling #8423

sanderegg · 2025-09-25T09:59:54Z

What do these changes do?

The autoscaling service when running in computational mode tries to understand via the dask client what are the needs for EC2 instances.

In billable systems it finds what is the EC2 instance type assigned to the task,
In non-billable systems it gets the resources assigned to the task (e.g. CPUs, RAM, ...) and then find a best fit to start the right EC2

Until this PR, the autoscaling service would estimate the resources from an EC2 using what the AWS EC2 API returns, which is not exactly what Dask workers actually have. This would sometimes create dead locks where a machine that should in theory handle the task would actually not since the dask worker "sees" a bit less memory or cpus. This PR shall correct this fact.

Also in Dask, every worker has a defined number of tasks, which are either defined via DASK_NTHREADS, or that are equivalent to the number of CPUs * DASK_NTHREADS_MULTIPLIER. That means that even if a task is supposed to use 0.1 CPU and a worker has that resource available, it will only run if a worker thread can run it. This PR will also fix this by adding both ENVs to the autoscaling service in computational mode.

Related issue/s

fixes Autoscaling: in non-billable systems the chosen machine type does not take in account the removed resources as the dask-sidecar does #6320

How to test

Dev-ops

codecov · 2025-09-25T10:11:23Z

Codecov Report

❌ Patch coverage is 96.96970% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.59%. Comparing base (0c9e189) to head (2c11b7e).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8423      +/-   ##
==========================================
+ Coverage   87.52%   88.59%   +1.06%     
==========================================
  Files        2008     1357     -651     
  Lines       78421    56719   -21702     
  Branches     1344      255    -1089     
==========================================
- Hits        68641    50252   -18389     
+ Misses       9378     6395    -2983     
+ Partials      402       72     -330

Flag	Coverage Δ
integrationtests	`60.42% <ø> (-3.57%)`	⬇️
unittests	`87.55% <96.96%> (+1.30%)`	⬆️

Components	Coverage Δ
pkg_aws_library	`94.75% <96.61%> (+1.13%)`	⬆️
pkg_celery_library	`∅ <ø> (∅)`
pkg_dask_task_models_library	`79.63% <100.00%> (+0.30%)`	⬆️
pkg_models_library	`∅ <ø> (∅)`
pkg_notifications_library	`∅ <ø> (∅)`
pkg_postgres_database	`∅ <ø> (∅)`
pkg_service_integration	`∅ <ø> (∅)`
pkg_service_library	`∅ <ø> (∅)`
pkg_settings_library	`∅ <ø> (∅)`
pkg_simcore_sdk	`84.89% <ø> (ø)`
agent	`93.10% <ø> (ø)`
api_server	`91.62% <ø> (ø)`
autoscaling	`∅ <ø> (∅)`
catalog	`92.06% <ø> (ø)`
clusters_keeper	`99.14% <ø> (ø)`
dask_sidecar	`91.60% <ø> (-0.79%)`	⬇️
datcore_adapter	`97.95% <ø> (ø)`
director	`75.81% <ø> (+0.08%)`	⬆️
director_v2	`85.29% <ø> (-5.63%)`	⬇️
dynamic_scheduler	`96.70% <ø> (-0.12%)`	⬇️
dynamic_sidecar	`90.37% <ø> (-0.07%)`	⬇️
efs_guardian	`89.83% <ø> (ø)`
invitations	`90.90% <ø> (ø)`
payments	`92.80% <ø> (ø)`
resource_usage_tracker	`91.84% <ø> (-0.38%)`	⬇️
storage	`86.61% <ø> (-0.37%)`	⬇️
webclient	`∅ <ø> (∅)`
webserver	`87.10% <ø> (+0.02%)`	⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c9e189...2c11b7e. Read the comment docs.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

mergify · 2025-09-25T10:20:56Z

🧪 CI Insights

Here's what we observed from your CI run for 2c11b7e.

❌ Job Failures

Pipeline	Job	Retries	🔍 CI Insights	📄 Logs
`CI`	`integration-tests`	`0`	View	View
	`system-tests`	`0`	View	View
	`unit-tests`	`0`	View	View

Copilot

Pull Request Overview

This PR enhances the computational backend to properly compute and track the number of threads for dask-workers in the autoscaling system. The key changes involve adding support for generic resources (particularly threads) to the Resources model and extending the Dask monitoring configuration.

Add generic resources support to the Resources model with proper arithmetic operations
Introduce DASK_NTHREADS and DASK_NTHREADS_MULTIPLIER configuration settings
Update resource comparison and computation logic to handle the new generic resources

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
services/clusters-keeper/src/simcore_service_clusters_keeper/data/docker-compose.yml	Adds DASK_NTHREADS environment variables to the compose file
services/autoscaling/src/simcore_service_autoscaling/core/settings.py	Introduces new Dask thread configuration settings
packages/aws-library/src/aws_library/ec2/_models.py	Extends Resources model with generic_resources field and related operations
services/autoscaling/src/simcore_service_autoscaling/modules/dask.py	Adds function to compute instance thread resources
services/autoscaling/src/simcore_service_autoscaling/utils/cluster_scaling.py	Updates resource comparison logic
services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py	Refactors resource mapping logic and updates tests

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

packages/aws-library/src/aws_library/ec2/_models.py

packages/aws-library/tests/test_ec2_models.py

Copilot

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py

...toscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_computational.py

Copilot

Pull Request Overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-10-19T20:30:50Z

services/autoscaling/tests/unit/test_utils_cluster_scaling.py

    assert app_settings.AUTOSCALING_EC2_INSTANCES
    machine_buffer_type = get_hot_buffer_type(random_fake_available_instances)
    _NUM_DRAINED_NODES = 20
+    assert app_settings.AUTOSCALING_EC2_INSTANCES


Duplicate assertion on AUTOSCALING_EC2_INSTANCES; the second assert at line 299 is redundant and can be removed to reduce noise.

Suggested change

assert app_settings.AUTOSCALING_EC2_INSTANCES

Copilot · 2025-10-19T20:30:51Z

packages/aws-library/src/aws_library/ec2/_models.py

    def __gt__(self, other: "Resources") -> bool:
-        return self.cpus > other.cpus or self.ram > other.ram
+        """operator for > comparison
+        if self has any resources gretaer than other, returns True (even if different resource types are smaller)


Corrected spelling of 'gretaer' to 'greater'.

Suggested change

if self has any resources gretaer than other, returns True (even if different resource types are smaller)

if self has any resources greater than other, returns True (even if different resource types are smaller)

Copilot · 2025-10-19T20:30:51Z

packages/aws-library/tests/test_ec2_models.py

+        (
+            Resources(cpus=0.1, ram=ByteSize(1), generic_resources={"GPU": "2"}),
+            Resources(cpus=0.1, ram=ByteSize(1), generic_resources={"GPU": 2}),
+            True,  # string resrouces are not comparable so "2" is considered larger


Corrected spelling of 'resrouces' to 'resources'.

Suggested change

True, # string resrouces are not comparable so "2" is considered larger

True, # string resources are not comparable so "2" is considered larger

Copilot · 2025-10-19T20:30:51Z

services/autoscaling/src/simcore_service_autoscaling/core/errors.py

 class TaskRequirementsAboveRequiredEC2InstanceTypeError(AutoscalingRuntimeError):
    msg_template: str = (
-        "Task {task} requires {instance_type} but requires {resources}. "
+        "Task {task} requires {instance_type} but requires {resources}. {resources_diff} are missing! "


Error message is grammatically unclear ('requires ... but requires ...'); consider revising to 'Task {task} specifies instance type {instance_type} but requests {resources}. {resources_diff} are missing!'.

Suggested change

"Task {task} requires {instance_type} but requires {resources}. {resources_diff} are missing! "

"Task {task} specifies instance type {instance_type} but requests {resources}. {resources_diff} are missing! "

Copilot · 2025-10-19T20:30:52Z

packages/aws-library/src/aws_library/ec2/_models.py

+        if self has greater or equal resources than other, returns True
+        This will return True only if any of the resources in self is greater or equal to other
+
+        Note that generic_resources are compared only if they are numeric
+        Non-numeric generic resources must be equal in both or only defined in self
+        to be considered greater or equal
+        """
+        if self == other:
+            return True
+        return self > other



[nitpick] The ge implementation delegates to gt when resources differ, making 'a >= b' true if any single resource is greater (even when others are smaller), which diverges from conventional >= semantics and may introduce logical errors where full dominance is expected; consider implementing per-resource '>=', including numeric generic resources, or providing a separate method (e.g. any_resource_greater) and keeping ge as strict component-wise comparison.

Suggested change

if self has greater or equal resources than other, returns True

This will return True only if any of the resources in self is greater or equal to other

Note that generic_resources are compared only if they are numeric

Non-numeric generic resources must be equal in both or only defined in self

to be considered greater or equal

"""

if self == other:

return True

return self > other

Returns True only if all resources in self are greater than or equal to those in other.

Note that generic_resources are compared only if they are numeric.

Non-numeric generic resources must be equal in both or only defined in self

to be considered greater or equal.

"""

if self.cpus < other.cpus:

return False

if self.ram < other.ram:

return False

keys = set(self.generic_resources) | set(other.generic_resources)

for k in keys:

a = self.generic_resources.get(k)

b = other.generic_resources.get(k)

if b is None:

continue # resource only in self, that's fine

if a is None:

return False # resource required in other, missing in self

if isinstance(a, int | float) and isinstance(b, int | float):

if a < b:

return False

elif a != b:

return False

return True

sonarqubecloud · 2025-10-20T06:56:38Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

sanderegg added this to the Cheops milestone Sep 25, 2025

sanderegg self-assigned this Sep 25, 2025

sanderegg added a:autoscaling autoscaling service in simcore's stack a:computational clusters labels Sep 25, 2025

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 1eb7d20 to 9b3ec9f Compare September 25, 2025 11:56

sanderegg requested a review from Copilot September 25, 2025 15:24

Copilot AI reviewed Sep 25, 2025

View reviewed changes

packages/aws-library/src/aws_library/ec2/_models.py Outdated Show resolved Hide resolved

packages/aws-library/tests/test_ec2_models.py Outdated Show resolved Hide resolved

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch 3 times, most recently from ec5ef47 to f68a40d Compare September 26, 2025 14:53

sanderegg requested a review from Copilot September 26, 2025 16:28

Copilot AI reviewed Sep 26, 2025

View reviewed changes

.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py Outdated Show resolved Hide resolved

...toscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_computational.py Outdated Show resolved Hide resolved

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch 3 times, most recently from f96a7d5 to e6c3fc7 Compare October 16, 2025 16:25

sanderegg added 13 commits October 17, 2025 16:36

disable hack

030d16b

refactor

0fe0190

added generic resource to ec2 resource model

aa02ea9

do not define gt it is weird for resources

80d4bb7

improve coverage

4c7bcfe

ruff

f35325f

add missing test

82bdf44

ruff

3778fc5

use ge operator

d8765ba

added model dump flat

adb2cfb

both direction

0aec1f3

added ENV variables for nthreads and threads multiplier

ba8d9ba

use ge operator instead of incorrect gt operator

25ba0f5

sanderegg added 19 commits October 17, 2025 16:36

re-added gt operator

f0a7944

use Required

f1fa871

improve docs

73918ff

moved docs

15a4070

added mapping

2f6b547

added mapping

35ef335

improve error

2beb0ee

make private

48dab2d

simplify

2023c10

simplify

65fba89

fix computation

43c2ab7

type

3175137

type

20ceb3a

no need to call items

90020df

revert

a9cb742

refactor

f47573a

better assert

1c6c25c

improving test

de93669

add generic instances based on the provider

83bf3b0

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 8b7db41 to 83bf3b0 Compare October 17, 2025 14:37

sanderegg added 4 commits October 17, 2025 16:39

sonar

bc79807

fix?

ee47115

improve error

d8981c9

add resource info on instance types as well

8f456d9

sanderegg requested a review from Copilot October 19, 2025 20:29

Copilot AI reviewed Oct 19, 2025

View reviewed changes

sanderegg added 3 commits October 19, 2025 22:40

mypy

8f8e966

need to be fixed

001db74

done

2c11b7e

	if self has any resources gretaer than other, returns True (even if different resource types are smaller)
	if self has any resources greater than other, returns True (even if different resource types are smaller)

	True, # string resrouces are not comparable so "2" is considered larger
	True, # string resources are not comparable so "2" is considered larger

	"Task {task} requires {instance_type} but requires {resources}. {resources_diff} are missing! "
	"Task {task} specifies instance type {instance_type} but requests {resources}. {resources_diff} are missing! "

-        if self has greater or equal resources than other, returns True
-        This will return True only if any of the resources in self is greater or equal to other
-        Note that generic_resources are compared only if they are numeric
-        Non-numeric generic resources must be equal in both or only defined in self
-        to be considered greater or equal
-        """
-        if self == other:
-            return True
-        return self > other
+        Returns True only if all resources in self are greater than or equal to those in other.
+        Note that generic_resources are compared only if they are numeric.
+        Non-numeric generic resources must be equal in both or only defined in self
+        to be considered greater or equal.
+        """
+        if self.cpus < other.cpus:
+            return False
+        if self.ram < other.ram:
+            return False
+        keys = set(self.generic_resources) | set(other.generic_resources)
+        for k in keys:
+            a = self.generic_resources.get(k)
+            b = other.generic_resources.get(k)
+            if b is None:
+                continue  # resource only in self, that's fine
+            if a is None:
+                return False  # resource required in other, missing in self
+            if isinstance(a, int | float) and isinstance(b, int | float):
+                if a < b:
+                    return False
+            elif a != b:
+                return False
+        return True

🎨Computational backend: Make sure the number of threads of a dask-worker is computed for autoscaling #8423

Are you sure you want to change the base?

🎨Computational backend: Make sure the number of threads of a dask-worker is computed for autoscaling #8423

Conversation

sanderegg commented Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What do these changes do?

Related issue/s

How to test

Dev-ops

Uh oh!

codecov bot commented Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mergify bot commented Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🧪 CI Insights

❌ Job Failures

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud bot commented Oct 20, 2025

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sanderegg commented Sep 25, 2025 •

edited

Loading

codecov bot commented Sep 25, 2025 •

edited

Loading

mergify bot commented Sep 25, 2025 •

edited

Loading