Find the Elastic repository #7
Open
Tracking issue
Why are the changes needed?
The `task_type` of `PytorchElasticFunctionTask` was set once during initialization and did not update when the `task_config` (specifically `nnodes`) was overridden using `with_overrides()`. As a result, tasks intended to run as single-node standalone pods (e.g., `nnodes=1`) incorrectly executed as multi-node PyTorchJobs if the original task was defined with `nnodes > 1`.
What changes were proposed in this pull request?
This PR addresses the issue by making the `task_type` property of `PytorchElasticFunctionTask` dynamic, ensuring it reflects the current `nnodes` value from the task's configuration, even after overrides.
Key changes include:
- `task_type` property: `task_type` is now a property that dynamically evaluates `_task_config.nnodes` to return `"python-task"` for single-node configurations (`nnodes=1`, `"1"`, or `"1:1"`) and `"pytorch"` for multi-node configurations.
- `__init__` and `get_custom`: modified to correctly handle both integer and string representations of `nnodes` when determining the task type and serialization logic.
- `Elastic` class docstring: updated to clarify the dynamic `task_type` behavior based on `nnodes` and its interaction with `with_overrides()`.
How was this patch tested?
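As a rough illustration of the behavior under test, the sketch below models a `task_type` that is recomputed from the current `nnodes` value on every access. It is a self-contained stand-in, not the plugin's code: `ElasticConfig`, `SketchElasticTask`, and `is_single_node` are hypothetical names, and `with_overrides` here simply swaps the stored config rather than reproducing the real override mechanism.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ElasticConfig:
    """Hypothetical stand-in for the Elastic config; only nnodes is modeled."""
    nnodes: Union[int, str] = 1


def is_single_node(nnodes: Union[int, str]) -> bool:
    """Return True for 1, "1", or "1:1" (a min:max range pinned to one node)."""
    if isinstance(nnodes, int):
        return nnodes == 1
    # String form is either "N" or "MIN:MAX"; every bound must equal 1.
    return all(int(part) == 1 for part in nnodes.split(":"))


class SketchElasticTask:
    """Hypothetical task whose task_type follows the current config."""

    def __init__(self, task_config: ElasticConfig) -> None:
        self._task_config = task_config

    @property
    def task_type(self) -> str:
        # Evaluated on each access, so a later config override is reflected.
        if is_single_node(self._task_config.nnodes):
            return "python-task"
        return "pytorch"

    def with_overrides(self, task_config: ElasticConfig) -> "SketchElasticTask":
        # Simplified assumption: an override just replaces the stored config.
        self._task_config = task_config
        return self


task = SketchElasticTask(ElasticConfig(nnodes=2))
print(task.task_type)  # pytorch
task.with_overrides(ElasticConfig(nnodes="1:1"))
print(task.task_type)  # python-task
```

Computing the value in a property, rather than caching it in `__init__`, is what lets the second `print` reflect the overridden config.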
Unit tests were added to `tests/test_elastic_task.py` to verify the fix:
- `task_type` correctly changes from `"pytorch"` to `"python-task"`.
- `task_type` correctly changes from `"python-task"` to `"pytorch"`.
- `nnodes` given as a string (e.g., `"1"`, `"1:1"`, `"1:4"`), verifying correct `task_type` determination and override behavior.
Setup process
N/A
Screenshots
N/A
Check all the applicable boxes
Related PRs
Docs link
Slack Thread