@codegen-sh codegen-sh bot commented Aug 7, 2025

Problem

Users reported that overriding an Elastic PyTorch task with nnodes=1, when the task's default config was nnodes=2, had no visible effect: the task appeared to stay in multi-node mode regardless of the override.

Root Cause

The issue was a timing mismatch in Flytekit's architecture:

  1. Serialization time: get_custom() is called when the workflow is defined and serialized, and it decides whether to create a Pod (single-node) or a PyTorchJob (multi-node)
  2. Override time: with_overrides(task_config=Elastic(nnodes=1)) updates the task config, but only after the resource type has already been "baked in" during serialization
  3. Result: the task was serialized as a multi-node PyTorchJob (because the original config had nnodes=2), and the subsequent override to nnodes=1 could not change the underlying Kubernetes resource type
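
The ordering problem above can be reproduced with a toy model. This is a minimal sketch, not Flytekit code: ToyElastic, ToyTask, and resource_kind are hypothetical names, and the only behavior taken from the description is that nnodes=1 maps to a Pod and anything else to a PyTorchJob.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyElastic:
    """Hypothetical stand-in for the plugin's Elastic config."""
    nnodes: int = 1


def resource_kind(config: ToyElastic) -> str:
    """Map nnodes to a resource kind, as the description states it."""
    return "Pod" if config.nnodes == 1 else "PyTorchJob"


class ToyTask:
    """Hypothetical task that stores its resource kind when serialized."""

    def __init__(self, config: ToyElastic):
        self.config = config
        self.serialized_kind: Optional[str] = None

    def serialize(self) -> str:
        # The decision is made here and stored, i.e. "baked in".
        self.serialized_kind = resource_kind(self.config)
        return self.serialized_kind


task = ToyTask(ToyElastic(nnodes=2))
baked = task.serialize()             # decided while nnodes=2
task.config = ToyElastic(nnodes=1)   # the override arrives afterwards

stale = task.serialized_kind         # still the earlier decision
fresh = task.serialize()             # only re-serializing sees the override
```

In this model the override only takes effect if serialization runs again after it, which is the mismatch the steps above describe.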

Solution

Enhanced get_custom() Method

  • Modified the method to explicitly read the current _task_config state, which includes any overrides
  • Added debug logs that record when overrides are applied and which nnodes values are used
  • Ensured the Pod vs PyTorchJob decision uses the most current configuration
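
As a rough illustration of reading the config at call time, the sketch below uses a hypothetical ToyElasticTask. The shape of the returned dictionary is invented for the example and is not the plugin's actual custom payload; only the _task_config attribute name and the nnodes-based branch come from this description.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ToyElastic:
    """Hypothetical stand-in for the plugin's Elastic config."""
    nnodes: int = 1
    nproc_per_node: int = 1


class ToyElasticTask:
    """Hypothetical task; only the _task_config name mirrors the PR text."""

    def __init__(self, task_config: ToyElastic):
        self._task_config = task_config

    def get_custom(self) -> Optional[Dict[str, Any]]:
        # Read the config at call time rather than caching it in __init__,
        # so a config replaced by an override is what gets inspected.
        config = self._task_config
        if config.nnodes == 1:
            return None  # single-node: no PyTorchJob-specific payload
        return {"elasticConfig": asdict(config)}  # made-up payload shape


task = ToyElasticTask(ToyElastic(nnodes=2, nproc_per_node=4))
multi_node_custom = task.get_custom()

task._task_config = ToyElastic(nnodes=1)  # simulate an applied override
single_node_custom = task.get_custom()
```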

Added Debugging Support

  • Added detailed logging to with_overrides() to track when overrides are applied
  • Added specific logging for Elastic PyTorch nnodes changes
  • These logs help diagnose similar issues in the future
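
One possible shape for such logging, using only the standard logging module. ToyNode, the logger name, and the message format are assumptions for illustration and do not reflect the actual code in flytekit/core/node.py.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("toy.node")  # hypothetical logger name


@dataclass
class ToyElastic:
    """Hypothetical stand-in for the plugin's Elastic config."""
    nnodes: int = 1


class ToyNode:
    """Hypothetical node holding a task config that can be overridden."""

    def __init__(self, task_config: ToyElastic):
        self.task_config = task_config

    def with_overrides(self, task_config: Optional[ToyElastic] = None) -> "ToyNode":
        if task_config is not None:
            old = getattr(self.task_config, "nnodes", None)
            new = getattr(task_config, "nnodes", None)
            if old != new:
                # Record the change so a stuck multi-node task can be traced.
                logger.debug("task_config override: nnodes %s -> %s", old, new)
            self.task_config = task_config
        return self
```

With the logger set to DEBUG, calling ToyNode(ToyElastic(nnodes=2)).with_overrides(task_config=ToyElastic(nnodes=1)) emits a single record describing the change from 2 to 1.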

Comprehensive Testing

  • Added unit tests that verify overrides take effect
  • Covered both directions: multi-node → single-node and single-node → multi-node
  • Validated that serialization produces different custom configs when overrides are applied

Key Changes

  1. plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py:

    • Enhanced get_custom() to properly read current task config state
    • Added debugging logs to track nnodes values and execution paths

  2. flytekit/core/node.py:

    • Added comprehensive debugging for task config overrides
    • Specific logging for Elastic PyTorch nnodes changes

  3. plugins/flytekit-kf-pytorch/tests/test_elastic_task.py:

    • Added test_nnodes_override() and test_nnodes_override_reverse()
    • Validates that overrides properly change serialization behavior

Testing

The fix includes comprehensive tests that verify:

  • ✅ Override from nnodes=2 to nnodes=1 works (multi-node → single-node)
  • ✅ Override from nnodes=1 to nnodes=2 works (single-node → multi-node)
  • ✅ Serialization produces different custom configs when overrides are applied
  • ✅ Original task behavior is preserved when no overrides are applied
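
The checks listed above could take roughly the following form. The two test names match the ones mentioned in this PR, but the bodies are a sketch written against a hypothetical toy task (ToyElasticTask with a made-up with_config helper), not the real tests in test_elastic_task.py.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToyElastic:
    """Hypothetical stand-in for the plugin's Elastic config."""
    nnodes: int = 1


class ToyElasticTask:
    """Hypothetical task used only to show the shape of the tests."""

    def __init__(self, task_config: ToyElastic):
        self._task_config = task_config

    def with_config(self, task_config: ToyElastic) -> "ToyElasticTask":
        # Made-up helper: return a new task and leave the original untouched.
        return ToyElasticTask(task_config)

    def get_custom(self) -> Optional[Dict[str, Any]]:
        if self._task_config.nnodes == 1:
            return None
        return {"nnodes": self._task_config.nnodes}


def test_nnodes_override():
    task = ToyElasticTask(ToyElastic(nnodes=2))
    overridden = task.with_config(ToyElastic(nnodes=1))
    assert overridden.get_custom() is None               # multi-node -> single-node
    assert task.get_custom() == {"nnodes": 2}            # original behavior preserved
    assert task.get_custom() != overridden.get_custom()  # custom configs differ


def test_nnodes_override_reverse():
    task = ToyElasticTask(ToyElastic(nnodes=1))
    overridden = task.with_config(ToyElastic(nnodes=2))
    assert overridden.get_custom() == {"nnodes": 2}      # single-node -> multi-node
    assert task.get_custom() is None                     # original behavior preserved
```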

Impact

This fix resolves the limitation where task configuration overrides were not reflected during serialization. Users can now override Elastic PyTorch task configurations and have the override take effect.

Fixes the issue described under Problem, where with_overrides(task_config=Elastic(nnodes=1)) did not switch the task from multi-node to single-node execution mode.


codegen-sh bot added 2 commits August 7, 2025 01:15
- Enhanced get_custom() method to properly read current task config state
- Added comprehensive debugging logs to track override application
- Added unit tests to verify override behavior works correctly
- Fixes issue where with_overrides(task_config=Elastic(nnodes=1)) was not
  taking effect due to timing mismatch between serialization and override

The fix ensures that when get_custom() is called during serialization,
it reads the most current _task_config which includes any overrides
applied via with_overrides(). This allows proper switching between
single-node (Pod) and multi-node (PyTorchJob) execution modes.

- Added override-aware serialization logic in translator.py
- Creates temporary task copies with overrides applied during serialization
- Ensures get_custom() sees the correct configuration without affecting original task
- Prevents shared task entity issues where overrides affect all nodes
- Added comprehensive test to verify serialization isolation
- Addresses architectural timing mismatch between override application and serialization

This fix complements the first commit by ensuring that during workflow serialization,
each node's task configuration overrides are properly applied to temporary
task copies, allowing get_custom() to make the correct Pod vs PyTorchJob
decision based on the overridden nnodes value.
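
The temporary-copy approach can be sketched as follows. This is an assumption-laden illustration: serialize_node and ToyTask are hypothetical, and the real logic in translator.py may differ. It shows only the isolation idea, where an override is applied to a shallow copy so that the shared task entity keeps its original config.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToyElastic:
    """Hypothetical stand-in for the plugin's Elastic config."""
    nnodes: int = 1


class ToyTask:
    """Hypothetical task entity shared by several workflow nodes."""

    def __init__(self, task_config: ToyElastic):
        self._task_config = task_config

    def get_custom(self) -> Optional[Dict[str, Any]]:
        if self._task_config.nnodes == 1:
            return None
        return {"nnodes": self._task_config.nnodes}


def serialize_node(task: ToyTask, override: Optional[ToyElastic] = None):
    """Serialize one node, applying its override to a temporary copy only."""
    if override is None:
        return task.get_custom()
    temp = copy.copy(task)         # shallow copy; the shared task is untouched
    temp._task_config = override   # rebinding affects the copy alone
    return temp.get_custom()


shared = ToyTask(ToyElastic(nnodes=2))
overridden_custom = serialize_node(shared, ToyElastic(nnodes=1))
default_custom = serialize_node(shared)
```

Because the override is rebound on the copy, a second node that uses the same task without an override still serializes with nnodes=2.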