@codegen-sh codegen-sh bot commented Aug 7, 2025

Summary

  • Implement Option A: make the task type for Elastic PyTorch dynamic so nnodes overrides are respected at serialization time

Details

  • Override task_type property in PytorchElasticFunctionTask to return "python-task" when current task_config.nnodes denotes single-node (1 or "1" or "1:1"), else "pytorch"
  • Avoid torch imports in property; simple and resilient parsing for ints and range strings (e.g., "1:1")
  • Keeps task_type_version unchanged and preserves existing get_custom branching behavior
  • Single-pod execute path now ignores PET_NNODES and forces nnodes=1
  • Single-pod execute path pins rdzv_endpoint=127.0.0.1:0 (avoids any DNS/IPv6 localhost issues)
  • Added detailed debug logs of mode, raw nnodes string, parsed min/max, nproc_per_node, rdzv_backend, rdzv_endpoint, start_method to aid troubleshooting
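A minimal sketch of the parsing and task-type decision described in this list, written without torch or flytekit imports. The helper names parse_nnodes and derive_task_type are hypothetical and are not taken from the PR diff. Falling back to "pytorch" when the value cannot be parsed is also an assumption.

```python
from typing import Tuple, Union


def parse_nnodes(nnodes: Union[int, str]) -> Tuple[int, int]:
    """Return (min_nodes, max_nodes) for an int, "N", or "MIN:MAX" value."""
    if isinstance(nnodes, int):
        return nnodes, nnodes
    parts = str(nnodes).strip().split(":")
    if len(parts) == 1:
        n = int(parts[0])
        return n, n
    if len(parts) == 2:
        return int(parts[0]), int(parts[1])
    raise ValueError(f"unsupported nnodes value: {nnodes!r}")


def derive_task_type(nnodes: Union[int, str]) -> str:
    """Map the current nnodes value to a task type string."""
    try:
        min_nodes, max_nodes = parse_nnodes(nnodes)
    except ValueError:
        # Assumption: an unparseable value keeps the distributed type.
        return "pytorch"
    # Only an exact single-node setting (1, "1", "1:1") counts as single-pod.
    if min_nodes == 1 and max_nodes == 1:
        return "python-task"
    return "pytorch"
```

Under this sketch an elastic range such as "1:4" still resolves to "pytorch", because only the maximum and minimum both being 1 denotes single-node.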

Why

  • Previously, task_type was decided once in __init__ from the initial config. Later overrides via Node.with_overrides(task_config=Elastic(...)) update _task_config but not task_type, so tasks defined with nnodes>1 remained distributed even when an override set nnodes=1.
  • In some environments, env overrides (like PET_NNODES, PET_RDZV_ENDPOINT) can cause single-pod paths to wait for multiple nodes or use cross-pod rdzv, leading to rendezvous timeouts. We now pin nnodes and rdzv in single-pod mode.
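The stale-type problem in the first bullet can be reproduced with toy classes. This is an illustrative sketch only: ToyElastic, CachedTypeTask, and DynamicTypeTask are hypothetical stand-ins and do not use flytekit. Assigning _task_config directly stands in for what an override does.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ToyElastic:
    """Hypothetical stand-in for the Elastic task config."""
    nnodes: Union[int, str] = 1


def _type_for(nnodes: Union[int, str]) -> str:
    # Simplified check: only 1, "1", and "1:1" count as single-node.
    return "python-task" if str(nnodes).strip() in ("1", "1:1") else "pytorch"


class CachedTypeTask:
    """Old behavior: the type is computed once in __init__."""

    def __init__(self, task_config: ToyElastic):
        self._task_config = task_config
        self.task_type = _type_for(task_config.nnodes)


class DynamicTypeTask:
    """New behavior: the type is derived from the current config on each read."""

    def __init__(self, task_config: ToyElastic):
        self._task_config = task_config

    @property
    def task_type(self) -> str:
        return _type_for(self._task_config.nnodes)


cached = CachedTypeTask(ToyElastic(nnodes=2))
dynamic = DynamicTypeTask(ToyElastic(nnodes=2))

# Simulate an override that replaces the config after construction.
for task in (cached, dynamic):
    task._task_config = ToyElastic(nnodes=1)

print(cached.task_type)   # pytorch (stale)
print(dynamic.task_type)  # python-task
```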

Testing

  • Existing tests around Elastic behavior should continue to pass. The dynamic property only affects the serialized TaskTemplate type when nnodes flips between single and multi.
  • Single-pod mode now logs effective settings and no longer honors PET_NNODES>1 or external rdzv endpoints.
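One way to check the single-pod behavior is a small pure function that resolves the effective settings. This is a sketch under stated assumptions: effective_rendezvous is a hypothetical helper, and the multi-node branch (environment values taking precedence over configured ones) is a simplification of how PET_* variables are consumed, not the PR's actual code.

```python
from typing import Mapping, Tuple, Union


def effective_rendezvous(
    single_pod: bool,
    nnodes: Union[int, str],
    rdzv_endpoint: str,
    env: Mapping[str, str],
) -> Tuple[str, str]:
    """Return the (nnodes, rdzv_endpoint) pair the launcher would use."""
    if single_pod:
        # Single-pod mode ignores PET_NNODES and any external endpoint.
        return "1", "127.0.0.1:0"
    # Assumption: in multi-node mode, env overrides win over configured values.
    return (
        env.get("PET_NNODES", str(nnodes)),
        env.get("PET_RDZV_ENDPOINT", rdzv_endpoint),
    )


env = {"PET_NNODES": "4", "PET_RDZV_ENDPOINT": "other-host:29400"}
print(effective_rendezvous(True, 1, "localhost:0", env))
print(effective_rendezvous(False, 2, "master:29500", env))
```

Keeping the decision in a pure function like this makes it testable without launching torchrun or setting real environment variables.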

Notes

  • This change ensures runtime overrides (e.g., in workflows/launch plans) can flip between single-pod torchrun and multi-node PyTorchJob without redefining the task, and prevents misconfig from env vars in single-pod mode.


codegen-sh bot added 3 commits August 7, 2025 19:50
…ngle-pod (python-task) vs distributed (pytorch)

- Override task_type property in PytorchElasticFunctionTask to derive from current task_config.nnodes
- Support int and range strings (e.g., 1:1) without torch import
- Keeps task_type_version unchanged and preserves existing get_custom branching

Co-authored-by: Adriano <adriano@exa.ai>

…vous misconfig from env overrides

- When task_type resolves to python-task, ignore PET_NNODES and pass 1 to parse_min_max_nnodes
- Prevents elastic launcher waiting for >1 nodes in single-pod runs

Co-authored-by: Adriano <adriano@exa.ai>

…-pod mode

- Log mode, nnodes_str (raw), parsed min/max, nproc_per_node, rdzv_backend, rdzv_endpoint, start_method
- Derive single_pod_mode from dynamic task_type
- Use 127.0.0.1:0 for rendezvous in single-pod to avoid DNS/IPv6 localhost edge cases

Co-authored-by: Adriano <adriano@exa.ai>