@jld-adriano

Tracking issue

Why are the changes needed?

The PytorchElasticFunctionTask's task_type was set once during initialization and was not updated when the task_config (specifically nnodes) was overridden using with_overrides(). As a result, a task originally defined with nnodes > 1 and then overridden to nnodes=1 still executed as a multi-node PyTorchJob instead of running as a single-node standalone pod.

What changes were proposed in this pull request?

This PR addresses the issue by making the task_type property of PytorchElasticFunctionTask dynamic, ensuring it reflects the current nnodes value from the task's configuration, even after overrides.

Key changes include:

  1. Dynamic task_type property: The task_type is now a property that evaluates _task_config.nnodes on each access. It returns "python-task" for single-node configurations (nnodes of 1, "1", or "1:1") and "pytorch" for multi-node configurations.
  2. Updated __init__ and get_custom: Modified these methods to correctly handle both integer and string representations of nnodes when determining the task type and serialization logic.
  3. Documentation Update: Added a note to the Elastic class docstring to clarify the dynamic task_type behavior based on nnodes and its interaction with with_overrides().
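
The dynamic task_type behavior described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: ToyElasticConfig, ToyElasticTask, is_single_node, and override_config are hypothetical stand-ins (the real override goes through with_overrides()), and the parsing of string nnodes values is an assumption based on the "1", "1:1", and "1:4" examples mentioned in this description.

```python
from dataclasses import dataclass
from typing import Union


def is_single_node(nnodes: Union[int, str]) -> bool:
    """Hypothetical helper: True when nnodes describes exactly one node."""
    if isinstance(nnodes, int):
        return nnodes == 1
    # Assumption: strings are either "N" or an elastic range "MIN:MAX".
    parts = [int(p) for p in nnodes.split(":")]
    return all(p == 1 for p in parts)


@dataclass
class ToyElasticConfig:
    """Hypothetical stand-in for the Elastic task config."""
    nnodes: Union[int, str] = 1


class ToyElasticTask:
    """Hypothetical stand-in for PytorchElasticFunctionTask."""

    def __init__(self, task_config: ToyElasticConfig):
        self._task_config = task_config

    @property
    def task_type(self) -> str:
        # Evaluated on every access, so it follows the current config
        # instead of a value frozen at construction time.
        if is_single_node(self._task_config.nnodes):
            return "python-task"
        return "pytorch"

    def override_config(self, task_config: ToyElasticConfig) -> "ToyElasticTask":
        # Simplified stand-in for the effect of with_overrides(task_config=...).
        self._task_config = task_config
        return self


task = ToyElasticTask(ToyElasticConfig(nnodes=2))
before = task.task_type                      # multi-node at definition time
task.override_config(ToyElasticConfig(nnodes="1:1"))
after = task.task_type                       # re-evaluated after the override
```

Because the property reads the config each time it is accessed, no extra bookkeeping is needed when the config is swapped; a value cached in __init__ would have kept reporting the original task type.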

How was this patch tested?

Unit tests were added to tests/test_elastic_task.py to verify the fix:

  • Tests for overriding a multi-node task to a single-node task, ensuring the task_type correctly changes from "pytorch" to "python-task".
  • Tests for overriding a single-node task to a multi-node task, ensuring the task_type correctly changes from "python-task" to "pytorch".
  • Tests for handling string nnodes values (e.g., "1", "1:1", "1:4") and verifying correct task_type determination and override behavior.

Setup process

N/A

Screenshots

N/A

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link


Slack Thread

@cursor

cursor bot commented Aug 6, 2025

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.

cursoragent and others added 6 commits August 6, 2025 23:36