Fix SGLang engine startup diagnostics and stagger init#572
alisonshao wants to merge 3 commits into `main` from …
Conversation
…zation

1. `_wait_server_healthy` now reports the exit code, signal name (e.g. `SIGKILL` for OOM), elapsed time, and URL when a server process crashes, instead of the bare "Server process terminated unexpectedly." message.
2. Rollout engines are now initialized sequentially instead of simultaneously. When all engines start at once with `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1`, transient CUDA context allocations on shared GPUs can cause OOM on machines with tight memory margins (e.g. 2.47 GB available on an H100 80GB with `mem_fraction_static=0.8`).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes

Hello @alisonshao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the robustness and stability of SGLang engine initialization, particularly in resource-constrained environments. It tackles an OOM issue during concurrent engine startup by staggering initialization, and it provides clearer diagnostics when an engine fails to start, allowing quicker identification and resolution of the underlying problem.

Highlights
Code Review
This pull request introduces two valuable improvements. First, it significantly enhances the startup diagnostics for SGLang engines by providing detailed crash information, which will make debugging much easier. Second, it mitigates potential OOM issues by staggering engine initialization, moving from a parallel to a sequential startup process. The changes are well-reasoned and correctly implemented. I have one minor suggestion to improve code style.
```python
def _raise_terminated(phase):
    elapsed = time.time() - start_time
    exitcode = getattr(process, "exitcode", None) if process is not None else None
    msg = (
        f"Server process terminated unexpectedly during {phase}. "
        f"exitcode={exitcode}, elapsed={elapsed:.1f}s, url={base_url}"
    )
    if exitcode is not None and exitcode < 0:
        import signal

        try:
            sig_name = signal.Signals(-exitcode).name
            msg += f" (killed by {sig_name})"
        except (ValueError, AttributeError):
            pass
    if exitcode == -9:
        msg += ". This is likely an OOM kill — check GPU memory availability."
    logger.error(msg)
    raise RuntimeError(msg)
```
For better code style and readability, it's recommended to place imports at the top of the function scope where they are used, rather than inside conditional blocks. Moving import signal to the beginning of the _raise_terminated function aligns with this practice and makes the dependency clearer.
Suggested change:

```diff
 def _raise_terminated(phase):
+    import signal
     elapsed = time.time() - start_time
     exitcode = getattr(process, "exitcode", None) if process is not None else None
     msg = (
         f"Server process terminated unexpectedly during {phase}. "
         f"exitcode={exitcode}, elapsed={elapsed:.1f}s, url={base_url}"
     )
     if exitcode is not None and exitcode < 0:
-        import signal
         try:
             sig_name = signal.Signals(-exitcode).name
             msg += f" (killed by {sig_name})"
         except (ValueError, AttributeError):
             pass
     if exitcode == -9:
         msg += ". This is likely an OOM kill — check GPU memory availability."
     logger.error(msg)
     raise RuntimeError(msg)
```
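As background for the negation in the code under review: Python's `multiprocessing` reports a child killed by signal N with `exitcode == -N`, which is what the `signal.Signals(-exitcode)` lookup decodes. A minimal illustration:

```python
import signal

# multiprocessing convention: a child killed by signal N has exitcode == -N,
# so an OOM-killed (SIGKILL) server process reports exitcode -9.
exitcode = -9
sig_name = signal.Signals(-exitcode).name
print(sig_name)  # SIGKILL
```

This is why the diagnostic can name the signal without any platform-specific tables: `signal.Signals` is a standard-library `IntEnum` over the host's signal numbers.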
… daemon

Previously, `NCCL_NVLS_ENABLE` was set to `"1"` whenever NVLink was detected. This force-enables NVLS (NVLink SHARP), which requires the IMEX daemon to function. On H100 machines with NVLink but without IMEX, this causes:

```
CUDA error 401 'the operation cannot be performed in the present state'
Failed to bind NVLink SHARP (NVLS) Multicast memory
```

Now we only set `NCCL_NVLS_ENABLE=0` when NVLink is absent. When NVLink is present, we leave the variable unset so NCCL auto-detects NVLS support.
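The resulting logic can be sketched as follows; the function name `configure_nvls` and the `has_nvlink` flag are illustrative, not the PR's actual identifiers:

```python
import os

def configure_nvls(has_nvlink: bool, env=None):
    """Disable NVLS only when NVLink is absent. When NVLink is present,
    leave NCCL_NVLS_ENABLE unset so NCCL auto-detects NVLS (NVLink SHARP)
    support, avoiding the IMEX-daemon failure seen with a forced "1"."""
    if env is None:
        env = os.environ
    if not has_nvlink:
        env["NCCL_NVLS_ENABLE"] = "0"
    # Deliberately no else-branch: never force-enable NVLS.
    return env
```

Compared with the old behavior (force `"1"` on NVLink detection), the only value this sketch ever writes is `"0"`, so NCCL's own probing decides whether NVLS is usable.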
Summary
- `_wait_server_healthy`: When an SGLang server process dies during startup, the error now includes the exit code, signal name (e.g. `SIGKILL` for OOM), elapsed time, and server URL. Previously it just said `"Server process terminated unexpectedly."` with no diagnostic info.
- Rollout engines are now initialized sequentially rather than simultaneously. With `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1`, simultaneous startup causes transient CUDA context allocations on shared GPUs, leading to OOM on machines with tight memory margins.

Root Cause
The 8-GPU megatron PPO test creates 4 SGLang engines (TP=2 each). All 4 were initialized simultaneously via `ray.get([engine.init.remote(...) for ...])`. Each engine subprocess can see all 8 GPUs during NCCL init, causing transient memory spikes. With `mem_fraction_static=0.8`, only ~2.47 GB remained available per H100 80GB, not enough margin for concurrent startup overhead.

Changes
- `miles/backends/sglang_utils/sglang_engine.py`: `_wait_server_healthy` now accepts a `process` param and reports exit code + signal on crash
- `miles/ray/rollout.py`: changed the parallel `ray.get([...])` to a sequential loop

Test plan
- Run `e2e/megatron/test_qwen3_4B_ppo.py` on an 8-GPU machine to verify engines start successfully
- Confirm the new crash diagnostics on a failed start (e.g. `exitcode=-9`, killed by SIGKILL)
- Confirm the external-engine path (`_init_external`) still works with the updated `_wait_server_healthy` signature
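The memory argument behind the sequential loop can be illustrated with a toy model (no Ray; the spike and steady-state numbers are made up for illustration). Concurrent startup sums every engine's transient init spike, while sequential startup pays at most one spike on top of the steady-state footprint of engines already up:

```python
INIT_SPIKE_GB = 3.0  # transient CUDA/NCCL init overhead per engine (illustrative)
STEADY_GB = 0.5      # steady-state overhead once an engine is up (illustrative)

def peak_concurrent(n_engines):
    # All engines hit their init spike at the same time.
    return n_engines * INIT_SPIKE_GB

def peak_sequential(n_engines):
    # While engine k initializes, the k engines already up hold only
    # their steady-state memory.
    return max(STEADY_GB * k + INIT_SPIKE_GB for k in range(n_engines))

print(peak_concurrent(4))  # 12.0
print(peak_sequential(4))  # 4.5
```

With only ~2.47 GB of headroom per GPU, a concurrent spike of this shape overflows while the sequential schedule stays within margin, which matches the observed SIGKILL-on-startup behavior.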