-
Notifications
You must be signed in to change notification settings - Fork 32
🎨Computational backend: Make sure the number of threads of a dask-worker is computed for autoscaling #8423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
🎨Computational backend: Make sure the number of threads of a dask-worker is computed for autoscaling #8423
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #8423 +/- ##
==========================================
+ Coverage 87.52% 88.59% +1.06%
==========================================
Files 2008 1357 -651
Lines 78421 56719 -21702
Branches 1344 255 -1089
==========================================
- Hits 68641 50252 -18389
+ Misses 9378 6395 -2983
+ Partials 402 72 -330
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
1eb7d20
to
9b3ec9f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the computational backend to properly compute and track the number of threads for dask-workers in the autoscaling system. The key changes involve adding support for generic resources (particularly threads) to the Resources model and extending the Dask monitoring configuration.
- Add generic resources support to the Resources model with proper arithmetic operations
- Introduce DASK_NTHREADS and DASK_NTHREADS_MULTIPLIER configuration settings
- Update resource comparison and computation logic to handle the new generic resources
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
services/clusters-keeper/src/simcore_service_clusters_keeper/data/docker-compose.yml | Adds DASK_NTHREADS environment variables to the compose file |
services/autoscaling/src/simcore_service_autoscaling/core/settings.py | Introduces new Dask thread configuration settings |
packages/aws-library/src/aws_library/ec2/_models.py | Extends Resources model with generic_resources field and related operations |
services/autoscaling/src/simcore_service_autoscaling/modules/dask.py | Adds function to compute instance thread resources |
services/autoscaling/src/simcore_service_autoscaling/utils/cluster_scaling.py | Updates resource comparison logic |
services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py | Refactors resource mapping logic and updates tests |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
ec5ef47
to
f68a40d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py
Outdated
Show resolved
Hide resolved
...toscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_computational.py
Outdated
Show resolved
Hide resolved
f96a7d5
to
e6c3fc7
Compare
8b7db41
to
83bf3b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
assert app_settings.AUTOSCALING_EC2_INSTANCES | ||
machine_buffer_type = get_hot_buffer_type(random_fake_available_instances) | ||
_NUM_DRAINED_NODES = 20 | ||
assert app_settings.AUTOSCALING_EC2_INSTANCES |
Copilot
AI
Oct 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate assertion on AUTOSCALING_EC2_INSTANCES; the second assert at line 299 is redundant and can be removed to reduce noise.
assert app_settings.AUTOSCALING_EC2_INSTANCES |
Copilot uses AI. Check for mistakes.
def __gt__(self, other: "Resources") -> bool: | ||
return self.cpus > other.cpus or self.ram > other.ram | ||
"""operator for > comparison | ||
if self has any resources gretaer than other, returns True (even if different resource types are smaller) |
Copilot
AI
Oct 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'gretaer' to 'greater'.
if self has any resources gretaer than other, returns True (even if different resource types are smaller) | |
if self has any resources greater than other, returns True (even if different resource types are smaller) |
Copilot uses AI. Check for mistakes.
( | ||
Resources(cpus=0.1, ram=ByteSize(1), generic_resources={"GPU": "2"}), | ||
Resources(cpus=0.1, ram=ByteSize(1), generic_resources={"GPU": 2}), | ||
True, # string resrouces are not comparable so "2" is considered larger |
Copilot
AI
Oct 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'resrouces' to 'resources'.
True, # string resrouces are not comparable so "2" is considered larger | |
True, # string resources are not comparable so "2" is considered larger |
Copilot uses AI. Check for mistakes.
class TaskRequirementsAboveRequiredEC2InstanceTypeError(AutoscalingRuntimeError): | ||
msg_template: str = ( | ||
"Task {task} requires {instance_type} but requires {resources}. " | ||
"Task {task} requires {instance_type} but requires {resources}. {resources_diff} are missing! " |
Copilot
AI
Oct 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error message is grammatically unclear ('requires ... but requires ...'); consider revising to 'Task {task} specifies instance type {instance_type} but requests {resources}. {resources_diff} are missing!'.
"Task {task} requires {instance_type} but requires {resources}. {resources_diff} are missing! " | |
"Task {task} specifies instance type {instance_type} but requests {resources}. {resources_diff} are missing! " |
Copilot uses AI. Check for mistakes.
if self has greater or equal resources than other, returns True | ||
This will return True only if any of the resources in self is greater or equal to other | ||
Note that generic_resources are compared only if they are numeric | ||
Non-numeric generic resources must be equal in both or only defined in self | ||
to be considered greater or equal | ||
""" | ||
if self == other: | ||
return True | ||
return self > other | ||
|
Copilot
AI
Oct 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The ge implementation delegates to gt when resources differ, making 'a >= b' true if any single resource is greater (even when others are smaller), which diverges from conventional >= semantics and may introduce logical errors where full dominance is expected; consider implementing per-resource '>=', including numeric generic resources, or providing a separate method (e.g. any_resource_greater) and keeping ge as strict component-wise comparison.
if self has greater or equal resources than other, returns True | |
This will return True only if any of the resources in self is greater or equal to other | |
Note that generic_resources are compared only if they are numeric | |
Non-numeric generic resources must be equal in both or only defined in self | |
to be considered greater or equal | |
""" | |
if self == other: | |
return True | |
return self > other | |
Returns True only if all resources in self are greater than or equal to those in other. | |
Note that generic_resources are compared only if they are numeric. | |
Non-numeric generic resources must be equal in both or only defined in self | |
to be considered greater or equal. | |
""" | |
if self.cpus < other.cpus: | |
return False | |
if self.ram < other.ram: | |
return False | |
keys = set(self.generic_resources) | set(other.generic_resources) | |
for k in keys: | |
a = self.generic_resources.get(k) | |
b = other.generic_resources.get(k) | |
if b is None: | |
continue # resource only in self, that's fine | |
if a is None: | |
return False # resource required in other, missing in self | |
if isinstance(a, int | float) and isinstance(b, int | float): | |
if a < b: | |
return False | |
elif a != b: | |
return False | |
return True |
Copilot uses AI. Check for mistakes.
|
What do these changes do?
The
autoscaling
service when running in computational mode tries to understand via the dask client what are the needs for EC2 instances.Until this PR, the
autoscaling
service would estimate the resources from an EC2 using what the AWS EC2 API returns, which is not exactly what Dask workers actually have. This would sometimes create dead locks where a machine that should in theory handle the task would actually not since the dask worker "sees" a bit less memory or cpus. This PR shall correct this fact.Also in Dask, every worker has a defined number of tasks, which are either defined via DASK_NTHREADS, or that are equivalent to the number of CPUs * DASK_NTHREADS_MULTIPLIER. That means that even if a task is supposed to use 0.1 CPU and a worker has that resource available, it will only run if a worker thread can run it. This PR will also fix this by adding both ENVs to the autoscaling service in computational mode.
Related issue/s
How to test
Dev-ops