Fix Elastic PyTorch task nnodes override issue #8
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
Users reported that when overriding Elastic PyTorch tasks with
nnodes=1when the default task config wasnnodes=2, the task seemed to be stuck in multi-node mode regardless of the override.Root Cause
The issue was a timing mismatch in Flytekit's architecture:
get_custom()method is called during workflow definition/serialization time to determine whether to create a Pod (single-node) or PyTorchJob (multi-node)with_overrides(task_config=Elastic(nnodes=1))updates the task config, but this happens after the resource type has already been "baked in" during serializationnnodes=2), and runtime overrides tonnodes=1couldn't change the fundamental Kubernetes resource typeSolution
Enhanced
get_custom()Method_task_configstate which includes any overridesAdded Debugging Support
with_overrides()method with detailed logging to track override applicationnnodeschangesComprehensive Testing
Key Changes
plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py:get_custom()to properly read current task config stateflytekit/core/node.py:plugins/flytekit-kf-pytorch/tests/test_elastic_task.py:test_nnodes_override()andtest_nnodes_override_reverse()Testing
The fix includes comprehensive tests that verify:
nnodes=2tonnodes=1works (multi-node → single-node)nnodes=1tonnodes=2works (single-node → multi-node)Impact
This fix resolves the architectural limitation where task configuration overrides weren't being properly reflected during serialization. Users can now successfully override Elastic PyTorch task configurations and have them take effect properly.
Fixes the issue described in the original problem where
with_overrides(task_config=Elastic(nnodes=1))was not switching from multi-node to single-node execution mode.💻 View my work • About Codegen
⛔ Remove Codegen from PR • 🚫 Ban action checks