Skip to content

Conversation

sanderegg
Copy link
Member

@sanderegg sanderegg commented Sep 25, 2025

What do these changes do?

The autoscaling service when running in computational mode tries to understand via the dask client what are the needs for EC2 instances.

  • In billable systems it finds what is the EC2 instance type assigned to the task,
  • In non-billable systems it gets the resources assigned to the task (e.g. CPUs, RAM, ...) and then find a best fit to start the right EC2

Until this PR, the autoscaling service would estimate the resources from an EC2 using what the AWS EC2 API returns, which is not exactly what Dask workers actually have. This would sometimes create dead locks where a machine that should in theory handle the task would actually not since the dask worker "sees" a bit less memory or cpus. This PR shall correct this fact.

Also in Dask, every worker has a defined number of tasks, which are either defined via DASK_NTHREADS, or that are equivalent to the number of CPUs * DASK_NTHREADS_MULTIPLIER. That means that even if a task is supposed to use 0.1 CPU and a worker has that resource available, it will only run if a worker thread can run it. This PR will also fix this by adding both ENVs to the autoscaling service in computational mode.

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Cheops milestone Sep 25, 2025
@sanderegg sanderegg self-assigned this Sep 25, 2025
@sanderegg sanderegg added a:autoscaling autoscaling service in simcore's stack a:computational clusters labels Sep 25, 2025
Copy link

codecov bot commented Sep 25, 2025

Codecov Report

❌ Patch coverage is 96.96970% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.59%. Comparing base (0c9e189) to head (2c11b7e).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8423      +/-   ##
==========================================
+ Coverage   87.52%   88.59%   +1.06%     
==========================================
  Files        2008     1357     -651     
  Lines       78421    56719   -21702     
  Branches     1344      255    -1089     
==========================================
- Hits        68641    50252   -18389     
+ Misses       9378     6395    -2983     
+ Partials      402       72     -330     
Flag Coverage Δ
integrationtests 60.42% <ø> (-3.57%) ⬇️
unittests 87.55% <96.96%> (+1.30%) ⬆️
Components Coverage Δ
pkg_aws_library 94.75% <96.61%> (+1.13%) ⬆️
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library 79.63% <100.00%> (+0.30%) ⬆️
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library ∅ <ø> (∅)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 84.89% <ø> (ø)
agent 93.10% <ø> (ø)
api_server 91.62% <ø> (ø)
autoscaling ∅ <ø> (∅)
catalog 92.06% <ø> (ø)
clusters_keeper 99.14% <ø> (ø)
dask_sidecar 91.60% <ø> (-0.79%) ⬇️
datcore_adapter 97.95% <ø> (ø)
director 75.81% <ø> (+0.08%) ⬆️
director_v2 85.29% <ø> (-5.63%) ⬇️
dynamic_scheduler 96.70% <ø> (-0.12%) ⬇️
dynamic_sidecar 90.37% <ø> (-0.07%) ⬇️
efs_guardian 89.83% <ø> (ø)
invitations 90.90% <ø> (ø)
payments 92.80% <ø> (ø)
resource_usage_tracker 91.84% <ø> (-0.38%) ⬇️
storage 86.61% <ø> (-0.37%) ⬇️
webclient ∅ <ø> (∅)
webserver 87.10% <ø> (+0.02%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c9e189...2c11b7e. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

mergify bot commented Sep 25, 2025

🧪 CI Insights

Here's what we observed from your CI run for 2c11b7e.

❌ Job Failures

Pipeline Job Health on master Retries 🔍 CI Insights 📄 Logs
CI integration-tests Healthy 0 View View
system-tests Broken 0 View View
unit-tests Healthy 0 View View

@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 1eb7d20 to 9b3ec9f Compare September 25, 2025 11:56
@sanderegg sanderegg requested a review from Copilot September 25, 2025 15:24
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the computational backend to properly compute and track the number of threads for dask-workers in the autoscaling system. The key changes involve adding support for generic resources (particularly threads) to the Resources model and extending the Dask monitoring configuration.

  • Add generic resources support to the Resources model with proper arithmetic operations
  • Introduce DASK_NTHREADS and DASK_NTHREADS_MULTIPLIER configuration settings
  • Update resource comparison and computation logic to handle the new generic resources

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
services/clusters-keeper/src/simcore_service_clusters_keeper/data/docker-compose.yml Adds DASK_NTHREADS environment variables to the compose file
services/autoscaling/src/simcore_service_autoscaling/core/settings.py Introduces new Dask thread configuration settings
packages/aws-library/src/aws_library/ec2/_models.py Extends Resources model with generic_resources field and related operations
services/autoscaling/src/simcore_service_autoscaling/modules/dask.py Adds function to compute instance thread resources
services/autoscaling/src/simcore_service_autoscaling/utils/cluster_scaling.py Updates resource comparison logic
services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py Refactors resource mapping logic and updates tests

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch 3 times, most recently from ec5ef47 to f68a40d Compare September 26, 2025 14:53
@sanderegg sanderegg requested a review from Copilot September 26, 2025 16:28
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch 3 times, most recently from f96a7d5 to e6c3fc7 Compare October 16, 2025 16:25
@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 8b7db41 to 83bf3b0 Compare October 17, 2025 14:37
@sanderegg sanderegg requested a review from Copilot October 19, 2025 20:29
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

assert app_settings.AUTOSCALING_EC2_INSTANCES
machine_buffer_type = get_hot_buffer_type(random_fake_available_instances)
_NUM_DRAINED_NODES = 20
assert app_settings.AUTOSCALING_EC2_INSTANCES
Copy link

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate assertion on AUTOSCALING_EC2_INSTANCES; the second assert at line 299 is redundant and can be removed to reduce noise.

Suggested change
assert app_settings.AUTOSCALING_EC2_INSTANCES

Copilot uses AI. Check for mistakes.

def __gt__(self, other: "Resources") -> bool:
return self.cpus > other.cpus or self.ram > other.ram
"""operator for > comparison
if self has any resources gretaer than other, returns True (even if different resource types are smaller)
Copy link

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'gretaer' to 'greater'.

Suggested change
if self has any resources gretaer than other, returns True (even if different resource types are smaller)
if self has any resources greater than other, returns True (even if different resource types are smaller)

Copilot uses AI. Check for mistakes.

(
Resources(cpus=0.1, ram=ByteSize(1), generic_resources={"GPU": "2"}),
Resources(cpus=0.1, ram=ByteSize(1), generic_resources={"GPU": 2}),
True, # string resrouces are not comparable so "2" is considered larger
Copy link

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'resrouces' to 'resources'.

Suggested change
True, # string resrouces are not comparable so "2" is considered larger
True, # string resources are not comparable so "2" is considered larger

Copilot uses AI. Check for mistakes.

class TaskRequirementsAboveRequiredEC2InstanceTypeError(AutoscalingRuntimeError):
msg_template: str = (
"Task {task} requires {instance_type} but requires {resources}. "
"Task {task} requires {instance_type} but requires {resources}. {resources_diff} are missing! "
Copy link

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error message is grammatically unclear ('requires ... but requires ...'); consider revising to 'Task {task} specifies instance type {instance_type} but requests {resources}. {resources_diff} are missing!'.

Suggested change
"Task {task} requires {instance_type} but requires {resources}. {resources_diff} are missing! "
"Task {task} specifies instance type {instance_type} but requests {resources}. {resources_diff} are missing! "

Copilot uses AI. Check for mistakes.

Comment on lines 48 to 58
if self has greater or equal resources than other, returns True
This will return True only if any of the resources in self is greater or equal to other
Note that generic_resources are compared only if they are numeric
Non-numeric generic resources must be equal in both or only defined in self
to be considered greater or equal
"""
if self == other:
return True
return self > other

Copy link

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The ge implementation delegates to gt when resources differ, making 'a >= b' true if any single resource is greater (even when others are smaller), which diverges from conventional >= semantics and may introduce logical errors where full dominance is expected; consider implementing per-resource '>=', including numeric generic resources, or providing a separate method (e.g. any_resource_greater) and keeping ge as strict component-wise comparison.

Suggested change
if self has greater or equal resources than other, returns True
This will return True only if any of the resources in self is greater or equal to other
Note that generic_resources are compared only if they are numeric
Non-numeric generic resources must be equal in both or only defined in self
to be considered greater or equal
"""
if self == other:
return True
return self > other
Returns True only if all resources in self are greater than or equal to those in other.
Note that generic_resources are compared only if they are numeric.
Non-numeric generic resources must be equal in both or only defined in self
to be considered greater or equal.
"""
if self.cpus < other.cpus:
return False
if self.ram < other.ram:
return False
keys = set(self.generic_resources) | set(other.generic_resources)
for k in keys:
a = self.generic_resources.get(k)
b = other.generic_resources.get(k)
if b is None:
continue # resource only in self, that's fine
if a is None:
return False # resource required in other, missing in self
if isinstance(a, int | float) and isinstance(b, int | float):
if a < b:
return False
elif a != b:
return False
return True

Copilot uses AI. Check for mistakes.

Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a:autoscaling autoscaling service in simcore's stack a:computational clusters

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Autoscaling: in non-billable systems the chosen machine type does not take in account the removed resources as the dask-sidecar does

1 participant