fix: DeepEP env var crash + profiling support for xPyD (on top of PR#150) #151
Open
raviguptaamd wants to merge 6 commits into ROCm:develop from raviguptaamd:ravgupta/deepseek-r1-xpyd-fixes
Commits
84dc77f [DistInf] Enable multi-node MoRI EP disaggregated inference (2P/2D+) (raviguptaamd)
34fe87f Add FULL_DECODE_ONLY CUDA graph mode support for decode nodes (raviguptaamd)
1c94375 Default MORI_SOCKET_IFNAME to eth0 for bootstrap communication (raviguptaamd)
a3f196e Address Copilot review comments on PR #221 (raviguptaamd)
8811c62 fix(patches): fail fast on download/verification failures (raviguptaamd)
0b1cc69 fix: DeepEP env var crash + add profiling support for xPyD benchmarks (raviguptaamd)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| #!/bin/bash | ||
| # apply_moriio_2pd_patches.sh — Apply vLLM PR #39276 at container startup | ||
| # ============================================================================= | ||
| # Downloads and applies the patch from vllm-project/vllm PR #39276 which adds: | ||
| # 1. engine_id collision fix (core.py, utils.py) | ||
| # 2. MoRIIOConnector multi-node DP fixes (moriio_connector.py, moriio_common.py) | ||
| # 3. MoRIIO robustness fixes (moriio_engine.py) | ||
| # | ||
| # Idempotent: already-applied patches are skipped via --forward flag. | ||
| # Once PR #39276 is merged upstream, this script becomes a no-op. | ||
| # ============================================================================= | ||
| | ||
| set -euo pipefail | ||
| | ||
| PR_NUM=39276 | ||
| PATCH_URL="https://github.com/vllm-project/vllm/pull/${PR_NUM}.patch" | ||
| PATCH_FILE="/tmp/vllm_pr_${PR_NUM}.patch" | ||
| | ||
| # Locate the vLLM installation directory | ||
| VLLM_INSTALL_DIR="" | ||
| _PYTHON_VLLM_CANDIDATE="$(python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))" 2>/dev/null || true)" | ||
| for _candidate in \ | ||
| /usr/local/lib/python3.12/dist-packages/vllm \ | ||
| /usr/local/lib/python3.*/dist-packages/vllm; do | ||
| if [ -d "$_candidate" ]; then | ||
| VLLM_INSTALL_DIR="$_candidate" | ||
| break | ||
| fi | ||
| done | ||
| | ||
| if [ -z "${VLLM_INSTALL_DIR}" ] && [ -n "${_PYTHON_VLLM_CANDIDATE}" ] && [ -d "${_PYTHON_VLLM_CANDIDATE}" ]; then | ||
| VLLM_INSTALL_DIR="${_PYTHON_VLLM_CANDIDATE}" | ||
| fi | ||
| | ||
| if [ -z "${VLLM_INSTALL_DIR}" ]; then | ||
| echo "[PR#${PR_NUM}] ERROR: Cannot find vLLM installation directory" | ||
| exit 1 | ||
| fi | ||
| | ||
| # The egg-info / dist-info root is one level up from the vllm package | ||
| VLLM_ROOT="$(dirname "${VLLM_INSTALL_DIR}")" | ||
| echo "[PR#${PR_NUM}] vLLM root: ${VLLM_ROOT}" | ||
| | ||
| # Download the patch | ||
| echo "[PR#${PR_NUM}] Downloading patch from ${PATCH_URL}..." | ||
| if ! curl -sL "${PATCH_URL}" -o "${PATCH_FILE}" 2>/dev/null; then | ||
| echo "[PR#${PR_NUM}] ERROR: Failed to download patch — check network connectivity" | ||
| exit 1 | ||
| fi | ||
| | ||
| # Verify we got a real patch file (not an HTML error page) | ||
| if ! head -1 "${PATCH_FILE}" | grep -q "^From "; then | ||
| echo "[PR#${PR_NUM}] ERROR: Downloaded file is not a valid patch" | ||
| echo "[PR#${PR_NUM}] First line: $(head -1 "${PATCH_FILE}")" | ||
| rm -f "${PATCH_FILE}" | ||
| exit 1 | ||
| fi | ||
| | ||
| PATCH_LINES=$(wc -l < "${PATCH_FILE}") | ||
| echo "[PR#${PR_NUM}] Downloaded patch: ${PATCH_LINES} lines" | ||
| | ||
| # Apply the patch | ||
| # --forward: skip already-applied hunks (idempotent) | ||
| # --reject-file=-: don't create .rej files | ||
| # -p1 strips the first path component (a/vllm/... -> vllm/...) | ||
| echo "[PR#${PR_NUM}] Applying patch to ${VLLM_ROOT}..." | ||
| cd "${VLLM_ROOT}" | ||
| | ||
| if patch -p1 --forward --reject-file=- < "${PATCH_FILE}" 2>&1; then | ||
| echo "[PR#${PR_NUM}] Patch applied successfully" | ||
| elif [ $? -eq 1 ]; then | ||
| echo "[PR#${PR_NUM}] Patch already applied or partially applied (some hunks skipped)" | ||
| else | ||
| echo "[PR#${PR_NUM}] WARNING: Patch application had errors — some fixes may not be active" | ||
| fi | ||
| | ||
| # Verify key files were patched by checking for known fix markers | ||
| echo "[PR#${PR_NUM}] Verifying patches..." | ||
| _ok=0 | ||
| _total=0 | ||
| | ||
| _check_patch() { | ||
| local file="$1" | ||
| local marker="$2" | ||
| local desc="$3" | ||
| _total=$((_total + 1)) | ||
| if [ -f "${VLLM_INSTALL_DIR}/${file}" ] && grep -q "${marker}" "${VLLM_INSTALL_DIR}/${file}" 2>/dev/null; then | ||
| echo " ✓ ${desc}" | ||
| _ok=$((_ok + 1)) | ||
| else | ||
| echo " ✗ ${desc} — marker '${marker}' not found in ${file}" | ||
| fi | ||
| } | ||
| | ||
| _check_patch "v1/engine/core.py" "dp_rank" "engine_id collision fix" | ||
| _check_patch "distributed/kv_transfer/kv_connector/v1/moriio/moriio_common.py" "data_parallel_size_local" "multi-node DP sizing" | ||
| _check_patch "distributed/kv_transfer/kv_connector/v1/moriio/moriio_connector.py" "_req_kv_params" "kv_transfer_params caching" | ||
| _check_patch "distributed/kv_transfer/kv_connector/v1/moriio/moriio_connector.py" "_is_kv_master" "child node guard" | ||
| _check_patch "distributed/kv_transfer/kv_connector/v1/moriio/moriio_engine.py" "VLLM_MORIIO_TRANSFER_TIMEOUT_S" "transfer timeout" | ||
| | ||
| echo "[PR#${PR_NUM}] Verification: ${_ok}/${_total} checks passed" | ||
| | ||
| rm -f "${PATCH_FILE}" | ||
| | ||
| if [ "${_ok}" -ne "${_total}" ]; then | ||
| echo "[PR#${PR_NUM}] ERROR: Patch verification failed — refusing to continue with partial patches" | ||
| exit 1 | ||
| fi | ||
| | ||
| echo "[PR#${PR_NUM}] Done" | ||
This script downloads and applies the patch from a live GitHub PR URL at runtime. Because the PR's patch content can change over time (force-pushes or new commits) and the download requires outbound network access, startup becomes non-reproducible and can fail in restricted clusters. Consider vendoring the patch into the repo or pinning to a specific commit/tag patch URL (and optionally verifying a checksum) so the applied changes are deterministic.
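
A minimal sketch of the checksum idea, under stated assumptions: the patch is vendored into the repo (or fetched from a pinned URL) and its SHA-256 digest was recorded when it was vendored. The helper name `verify_patch_checksum`, the vendored path, and the digest placeholder are hypothetical and not part of this PR; `sha256sum` from GNU coreutils is assumed to be available in the container.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR): succeed only if the patch file's
# SHA-256 digest matches the digest recorded when the patch was vendored.
#   $1 = path to the patch file
#   $2 = expected SHA-256 digest (hex)
# Returns 0 on match, 1 on mismatch, 2 if the file does not exist.
verify_patch_checksum() {
    local patch_file="$1"
    local expected_sha256="$2"
    local actual_sha256

    if [ ! -f "${patch_file}" ]; then
        return 2
    fi

    # sha256sum prints "<digest>  <filename>"; keep only the digest field.
    actual_sha256="$(sha256sum "${patch_file}" | awk '{print $1}')"

    [ "${actual_sha256}" = "${expected_sha256}" ]
}

# Possible use inside the startup script (path and digest are placeholders):
#   PATCH_FILE="patches/vllm_pr_39276.patch"
#   EXPECTED_SHA256="<digest recorded at vendoring time>"
#   if ! verify_patch_checksum "${PATCH_FILE}" "${EXPECTED_SHA256}"; then
#       echo "[PR#39276] ERROR: patch checksum mismatch or file missing"
#       exit 1
#   fi
```

With a check like this in front of the `patch` call, a changed or truncated patch file stops startup before any file is modified, and the marker-based verification stays in place as a second check after applying.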