Commit 48b0551

Merge pull request #68 from james-tang17/platform
Add platform related scripts, faq and known issues
2 parents 85bdc1d + 9c97fc2 commit 48b0551

18 files changed

+1155
-0
lines changed

vllm/FAQ.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
1+
# Table of Contents
2+
3+
## Installation
4+
- [Can I run the platform benchmark under a bare-metal Ubuntu environment?](#can-i-run-the-platform-benchmark-under-a-bare-metal-ubuntu-environment)
5+
- [Can I use Ubuntu 24.04 LTS as the base OS?](#can-i-use-ubuntu-2404-lts-as-the-base-os)
6+
- [Why can't I see the desktop even with Ubuntu 25.04 desktop version installed?](#why-cant-i-see-the-desktop-even-with-ubuntu-2504-desktop-version-installed)
7+
- [Can I update the kernel version or other drivers of Ubuntu to get the latest fixes?](#can-i-update-the-kernel-version-or-other-drivers-of-ubuntu-to-get-the-latest-fixes)
8+
- [Why do I need to run `native_bkc_setup.sh` before using the `vllm/platform` Docker image?](#why-do-i-need-to-run-native_bkc_setupsh-before-using-the-vllmplatform-docker-image)
9+
10+
## Hardware & Firmware
11+
- [No re-sizable BAR configuration in my BIOS. What can I do to enable B60 with a larger BAR2 size?](#no-re-sizable-bar-configuration-in-my-bios-what-can-i-do-to-enable-b60-with-a-larger-bar2-size)
12+
- [Maxsun 2x GPU Card Not Detected Behind PCIe Switch](#maxsun-2x-gpu-card-not-detected-behind-pcie-switch)
13+
14+
## Benchmarking
15+
- [Why do I see unusually high Device-to-Device bandwidth in `ze_peak` benchmark?](#why-do-i-see-unusually-high-device-to-device-bandwidth-in-ze_peak-benchmark)
16+
- [How can I verify if the benchmark data from `platform_basic_evaluation.sh` is valid?](#how-can-i-verify-if-the-benchmark-data-from-platform_basic_evaluationsh-is-valid)
17+
18+
## Tools
19+
- [Why can't I see `xpu-smi` in the `vllm` Docker image?](#why-cant-i-see-xpu-smi-in-the-vllm-docker-image)
20+
- [Why can't I see GPU utilization with `xpu-smi`?](#why-cant-i-see-gpu-utilization-with-xpu-smi)
21+
22+
---
23+
24+
# Installation
25+
26+
## Can I run the platform benchmark under a bare-metal Ubuntu environment?
27+
28+
Yes. Please contact the Intel support team to obtain an offline installer for native setup.
29+
We also plan to make the offline installer publicly available on the Intel RDC website in an upcoming release.
30+
31+
## Can I use Ubuntu 24.04 LTS as the base OS? {#can-i-use-ubuntu-2404-lts-as-the-base-os}
32+
33+
Not yet. Support for Ubuntu 24.04 LTS is planned for a future release (targeting late 2025).
34+
35+
## Why can't I see the desktop even with Ubuntu 25.04 desktop version installed? {#why-cant-i-see-the-desktop-even-with-ubuntu-2504-desktop-version-installed}
36+
37+
Some versions of Ubuntu may default to text mode (multi-user target) after installation. You can check the current mode:
38+
39+
```bash
40+
sudo systemctl get-default
41+
```
42+
43+
If it returns `multi-user.target`, you can switch to graphical mode:
44+
45+
```bash
46+
sudo systemctl set-default graphical.target
47+
sudo reboot
48+
```
49+
50+
## Can I update the kernel version or other drivers of Ubuntu to get the latest fixes?
51+
52+
To stay consistent with the validated environment, we **do not recommend updating the kernel or system packages** during the evaluation phase.
53+
Any updates may affect stability or introduce compatibility issues with pre-installed components.
54+
55+
## Why do I need to run `native_bkc_setup.sh` before using the `vllm/platform` Docker image?
56+
57+
To ensure consistent kernel and firmware behavior, `native_bkc_setup.sh` must be run on the host before using the container image. It unifies the Linux kernel version and installs the B60 GuC/HuC firmware directly on the host system.
58+
59+
---
60+
61+
# Hardware & Firmware
62+
63+
## No re-sizable BAR configuration in my BIOS. What can I do to enable B60 with a larger BAR2 size?
64+
65+
Please contact your AIB (Add-In-Board) vendor to request the latest IFWI (firmware image) with max re-sizable BAR pre-configured.
66+
This setup has been validated on Gunnir and Maxsun B60 cards.
67+
68+
## Maxsun 2x GPU Card Not Detected Behind PCIe Switch
69+
70+
Many PCIe switch firmware versions do not support PCIe bifurcation, which prevents detection of dual-GPU cards like Maxsun 2x.
71+
72+
Solution: A firmware update for the PCIe switch is required.
73+
The Broadcom PEX 89104 has been validated. Please contact your PCIe switch vendor for support or an updated firmware.
74+
75+
---
76+
77+
# Benchmarking
78+
79+
## Why do I see unusually high Device-to-Device bandwidth in `ze_peak` benchmark?
80+
81+
Export the following environment variables before running `ze_peak`. As their names indicate, they disable render buffer compression for the run:
82+
83+
```bash
84+
export NEOReadDebugKeys=1
85+
export RenderCompressedBuffersEnabled=0
86+
```
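
If exporting these variables for the whole shell session is undesirable, they can be scoped to a single command instead. The wrapper below is a minimal sketch; `run_uncompressed` is a hypothetical helper name, not part of the platform scripts, and it assumes `ze_peak` is on `PATH`.

```bash
#!/bin/bash
# Hypothetical helper: run one command with the two variables from the
# answer above set only for that command, leaving the calling shell untouched.
run_uncompressed() {
    env NEOReadDebugKeys=1 RenderCompressedBuffersEnabled=0 "$@"
}

# Example (assumes ze_peak is on PATH):
#   run_uncompressed ze_peak
```

Using `env` rather than `export` means later commands in the same session are not affected by the debug keys.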
87+
88+
## How can I verify if the benchmark data from `platform_basic_evaluation.sh` is valid?
89+
90+
Compare your results against the sample benchmark results available in:
91+
92+
```
93+
/opt/intel/multi-arc/results
94+
```
95+
96+
These data points were collected from internal evaluations using an Intel® Xeon® W5-2545X system with dual B60 GPUs.
97+
> **Disclaimer**: This reference is provided for informational purposes only and should not be interpreted as official performance indicators or guarantees. Actual results may vary depending on hardware configuration, software stack, and usage scenarios.
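
One way to make the comparison concrete is to check whether a measured value falls within a chosen percentage of the corresponding sample value. The sketch below assumes you have already extracted both numbers by hand; `within_tolerance` is a hypothetical helper, and the tolerance is your own choice rather than a threshold defined by the benchmark suite.

```bash
#!/bin/bash
# Hypothetical helper: succeed if MEASURED is within PCT percent of REFERENCE.
# All three arguments are plain numbers (integers or decimals).
within_tolerance() {
    local measured=$1 reference=$2 pct=$3
    awk -v m="$measured" -v r="$reference" -v p="$pct" 'BEGIN {
        diff = m - r
        if (diff < 0) diff = -diff
        # awk exit status 0 means "within tolerance"
        exit !(diff <= r * p / 100)
    }'
}

# Example with made-up numbers: 38.5 measured against a 40.0 reference, 10% tolerance.
if within_tolerance 38.5 40.0 10; then
    echo "within tolerance"
else
    echo "outside tolerance"
fi
```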
98+
99+
---
100+
101+
# Tools
102+
103+
## Why can't I see `xpu-smi` in the `vllm` Docker image?
104+
105+
Due to release process limitations, `xpu-smi` is currently not included in the official `vllm` Docker image.
106+
We plan to add it in the next release. In the meantime, you may install it manually using:
107+
108+
[xpu-smi 1.3.1 on GitHub](https://github.com/intel/xpumanager/releases/download/V1.3.1/xpumanager_1.3.1_20250724.061629.60921e5e_u24.04_amd64.deb)
109+
110+
## Why can't I see GPU utilization with `xpu-smi`?
111+
112+
GPU utilization metrics are not yet fully supported by `xpu-smi` in the current release.
113+
This functionality is scheduled to be added in the next release.

vllm/KNOWN_ISSUES.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
2+
# 01. System Hang During Ubuntu 25.04 Installation with B60 Card Plugged In
3+
The issue is caused by an outdated GPU GuC firmware bundled in the official Ubuntu 25.04 Desktop ISO image.
4+
5+
Workaround: Remove the B60 card before starting the Ubuntu installation, and plug it back in once the installation is complete.
6+
We are also working with the Ubuntu team to address this issue upstream.
7+
8+
# 02. Limited 33 GB/s Bi-Directional P2P Bandwidth with 1x GPU Card
9+
When using a single GPU card over an x16 PCIe connection without a PCIe switch, the observed bi-directional P2P bandwidth is limited to 33 GB/s.
10+
11+
Workaround: Change the PCIe slot configuration in BIOS from Auto/x16 to x8/x8.
12+
With this change, over 40 GB/s bi-directional P2P bandwidth can be achieved.
13+
Root cause analysis is still in progress.
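
To confirm which link width a device actually negotiated after changing the BIOS setting, the `LnkSta` line in `lspci -vv` output can be inspected. The sketch below parses that line from standard input; `link_width` is a hypothetical helper and the bus address in the usage comment is a placeholder.

```bash
#!/bin/bash
# Hypothetical helper: read `lspci -vv` output on stdin and print the
# negotiated link width (e.g. 8 or 16) from the first LnkSta line.
link_width() {
    sed -n 's/.*LnkSta:.*Width x\([0-9][0-9]*\).*/\1/p' | head -n1
}

# Example (03:00.0 is a placeholder; substitute your GPU's bus address):
#   sudo lspci -vv -s 03:00.0 | link_width
```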
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
#!/bin/bash
2+
# disable-snap-apparmor-logs.sh
3+
# Quiet AppArmor DENIED messages from snapd
4+
5+
set -e
6+
7+
CONFIG="/etc/apparmor/parser.conf"
8+
9+
echo "[1/2] Updating AppArmor config to disable audit logs..."
10+
if ! grep -q "^no-audit" "$CONFIG"; then
11+
echo "no-audit" | sudo tee -a "$CONFIG"
12+
fi
13+
14+
echo "[2/2] Restarting AppArmor..."
15+
sudo systemctl restart apparmor
16+
17+
echo "✅ AppArmor snap-confine DENIED logs have been silenced."
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
#!/bin/bash
2+
# disable-auto-upgrade.sh
3+
# Permanently disable automatic updates on Ubuntu
4+
5+
set -e
6+
7+
echo "[1/4] Disable unattended-upgrades service..."
8+
sudo systemctl stop unattended-upgrades.service || true
9+
sudo systemctl disable unattended-upgrades.service || true
10+
11+
echo "[2/4] Disable apt-daily timers..."
12+
sudo systemctl stop apt-daily.timer apt-daily-upgrade.timer || true
13+
sudo systemctl disable apt-daily.timer apt-daily-upgrade.timer || true
14+
15+
echo "[3/4] Update APT config to disable periodic upgrades..."
16+
CONFIG_FILE="/etc/apt/apt.conf.d/20auto-upgrades"
17+
if [ -f "$CONFIG_FILE" ]; then
18+
sudo sed -i 's/^\(APT::Periodic::Update-Package-Lists\).*/\1 "0";/' "$CONFIG_FILE"
19+
sudo sed -i 's/^\(APT::Periodic::Unattended-Upgrade\).*/\1 "0";/' "$CONFIG_FILE"
20+
else
21+
echo 'APT::Periodic::Update-Package-Lists "0";' | sudo tee "$CONFIG_FILE"
22+
echo 'APT::Periodic::Unattended-Upgrade "0";' | sudo tee -a "$CONFIG_FILE"
23+
fi
24+
25+
echo "[4/4] Disable snapd snap-repair timer..."
26+
sudo systemctl stop snapd.snap-repair.timer || true
27+
sudo systemctl disable snapd.snap-repair.timer || true
28+
29+
echo "✅ Automatic updates have been disabled permanently."
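
After running the script, the values written to the APT config file can be read back to confirm the change. This is a minimal sketch with a hypothetical `apt_periodic_value` helper; it only inspects the one file you pass in, so other files under `/etc/apt/apt.conf.d/` could still override it.

```bash
#!/bin/bash
# Hypothetical helper: print the quoted value of an APT::Periodic key
# from a config file such as /etc/apt/apt.conf.d/20auto-upgrades.
apt_periodic_value() {
    local file=$1 key=$2
    sed -n "s/^${key}[[:space:]]*\"\([^\"]*\)\";.*/\1/p" "$file"
}

# Example:
#   apt_periodic_value /etc/apt/apt.conf.d/20auto-upgrades APT::Periodic::Unattended-Upgrade
```

To see the effective settings across all config files, `apt-config dump | grep Periodic` is a useful cross-check.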
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
#!/bin/bash
2+
set -euo pipefail
3+
4+
is_docker() {
5+
grep -qaE 'docker|kubepods|containerd' /proc/1/cgroup && return 0
6+
[[ "$(hostname)" =~ ^[0-9a-f]{12}$ ]] && return 0
7+
return 1
8+
}
9+
10+
# Check for root privileges
11+
if [[ "$EUID" -ne 0 ]]; then
12+
echo "[ERROR] This script must be run as root."
13+
exit 1
14+
fi
15+
16+
if is_docker; then
17+
echo "[ERROR] Please run this script in a native environment, not in Docker."
18+
exit 1
19+
fi
20+
21+
# Prepare output directory
22+
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
23+
OUTDIR="sysinfo_$TIMESTAMP"
24+
mkdir -p "$OUTDIR"
25+
26+
echo "[INFO] Collecting system information into $OUTDIR..."
27+
28+
# 1. CPU governor
29+
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > "$OUTDIR/scaling_governor.txt" 2>/dev/null || echo "Not available" > "$OUTDIR/scaling_governor.txt"
30+
31+
# 2. CPU architecture
32+
lscpu > "$OUTDIR/lscpu.txt"
33+
34+
# 3. PCI topology
35+
lspci -tv > "$OUTDIR/lspci_tree.txt"
36+
lspci -vvv > "$OUTDIR/lspci_verbose.txt"
37+
38+
# 4. Kernel messages
39+
dmesg > "$OUTDIR/dmesg.txt"
40+
41+
# 5. DRI tree
42+
tree /sys/kernel/debug/dri/ > "$OUTDIR/dri_tree.txt" 2>/dev/null || echo "Not available" > "$OUTDIR/dri_tree.txt"
43+
44+
# 6. Memory usage
45+
free -h > "$OUTDIR/memory.txt"
46+
47+
# 7. Hardware info
48+
dmidecode > "$OUTDIR/dmidecode.txt"
49+
50+
# 8. libze info
51+
dpkg -l | grep libze > "$OUTDIR/libze_version.txt" || echo "Not available" > "$OUTDIR/libze_version.txt"
52+
53+
# Create tar archive first
54+
TAR_FILE="sysinfo_$TIMESTAMP.tar"
55+
XZ_FILE="$TAR_FILE.xz"
56+
57+
echo "[INFO] Creating archive $TAR_FILE..."
58+
tar -cf "$TAR_FILE" "$OUTDIR"
59+
60+
echo "[INFO] Compressing with xz -9..."
61+
xz -9 "$TAR_FILE"
62+
63+
echo "[INFO] Done. Output file: $XZ_FILE"
64+
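
One thing to watch in scripts that run under `set -euo pipefail`: `grep` exits with status 1 when nothing matches, which fails the pipeline and stops the script. A minimal sketch of a guard, using a hypothetical `grep_or_default` helper:

```bash
#!/bin/bash
# Hypothetical helper: filter stdin with grep, printing a fallback line
# instead of returning a failure status when nothing matches.
# Note that this also masks genuine grep errors (exit status 2).
grep_or_default() {
    local pattern=$1 fallback=$2
    grep -- "$pattern" || echo "$fallback"
}

# Example: no input line matches, so the fallback is printed
# and a script running under `set -e` would continue.
printf 'alpha\nbeta\n' | grep_or_default libze "Not available"
```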
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
#!/bin/bash
2+
3+
# Output header
4+
echo "Category,Version"
5+
6+
# 1. Ubuntu version
7+
UBUNTU_VERSION=$(grep '^VERSION=' /etc/os-release | cut -d '"' -f 2)
8+
echo "Ubuntu,$UBUNTU_VERSION"
9+
10+
# 2. Linux kernel version
11+
KERNEL_VERSION=$(uname -r)
12+
echo "Linux Kernel,$KERNEL_VERSION"
13+
14+
# 3. Intel GPU firmware versions from dmesg
15+
16+
# Extract GuC firmware version
17+
guc_ver=$(dmesg | grep -i 'Using GuC firmware' | head -n1 | grep -oP 'version \K[\d\.]+')
18+
if [[ -n "$guc_ver" ]]; then
19+
echo "GPU Firmware (guc),$guc_ver"
20+
else
21+
echo "GPU Firmware (guc),Not Found"
22+
fi
23+
24+
# Extract HuC firmware version
25+
huc_ver=$(dmesg | grep -i 'Using HuC firmware' | head -n1 | grep -oP 'version \K[\d\.]+')
26+
if [[ -n "$huc_ver" ]]; then
27+
echo "GPU Firmware (huc),$huc_ver"
28+
else
echo "GPU Firmware (huc),Not Found"
fi
29+
30+
# Extract DMC firmware version
31+
dmc_ver=$(dmesg | grep -i 'Finished loading DMC firmware' | head -n1 | grep -oP '\(v\K[\d\.]+')
32+
if [[ -n "$dmc_ver" ]]; then
33+
echo "GPU Firmware (dmc),$dmc_ver"
34+
else
35+
echo "GPU Firmware (dmc),Not Found"
36+
fi
37+
38+
# 4. OneAPI version (offline installed)
39+
ONEAPI_LOG=$(ls /opt/intel/oneapi/logs/installer.install.intel.oneapi.lin.basekit.product,v=* 2>/dev/null | head -n1)
40+
if [[ -n "$ONEAPI_LOG" ]]; then
41+
oneapi_ver=$(basename "$ONEAPI_LOG" | sed -n 's/.*basekit\.product,v=\(.*\)\..*/\1/p')
42+
echo "oneapi,oneapi-base-toolkit=$oneapi_ver"
43+
else
44+
echo "oneapi,oneapi-base-toolkit=Not Installed"
45+
fi
46+
47+
# 5. Parse passed-in package files
48+
for file in "$@"; do
49+
[[ ! -f "$file" ]] && continue
50+
51+
category=$(basename "$file" .txt)
52+
first=1
53+
54+
while IFS= read -r pkg; do
55+
[[ -z "$pkg" || "$pkg" =~ ^# ]] && continue
56+
57+
version=$(dpkg-query -W -f='${Version}\n' "$pkg" 2>/dev/null)
58+
version_output="$pkg=${version:-Not Installed}"
59+
60+
if [[ $first -eq 1 ]]; then
61+
echo "$category,$version_output"
62+
first=0
63+
else
64+
echo ",$version_output"
65+
fi
66+
done < "$file"
67+
done
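
The firmware lookups above rely on `grep -P`, which is not available in every `grep` build. The same extraction can be sketched with `sed`; `extract_fw_version` is a hypothetical helper, and the log line in the example is illustrative only, since real `dmesg` output differs by driver and kernel version.

```bash
#!/bin/bash
# Hypothetical helper: print the dotted version number that follows the
# word "version" on the first matching line of stdin.
extract_fw_version() {
    sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n1
}

# Illustrative log line only; not copied from a real system.
echo "[drm] Using GuC firmware from xe/example_guc.bin version 70.44.1" \
    | extract_fw_version
```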
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
1+
#!/bin/bash
2+
set -e
3+
4+
# Help message
5+
usage() {
6+
echo "Usage: $0 [-n image_name:tag]"
7+
echo "Default image name: ubuntu:25.04-custom"
8+
exit 1
9+
}
10+
11+
# Default image name
12+
IMAGE_NAME="ubuntu:25.04-custom"
13+
14+
# Parse options
15+
while getopts ":n:h" opt; do
16+
case ${opt} in
17+
n )
18+
IMAGE_NAME=$OPTARG
19+
;;
20+
h )
21+
usage
22+
;;
23+
\? )
24+
echo "Invalid option: -$OPTARG" >&2
25+
usage
26+
;;
: )
echo "Option -$OPTARG requires an argument." >&2
usage
;;
27+
esac
28+
done
29+
30+
TAR_NAME="ubuntu-2504-rootfs.tar.gz"
31+
32+
echo "[+] Image name: $IMAGE_NAME"
33+
echo "[+] Creating root filesystem archive..."
34+
35+
sudo tar --numeric-owner -czpf "$TAR_NAME" \
36+
--exclude=/proc \
37+
--exclude=/sys \
38+
--exclude=/dev \
39+
--exclude=/tmp/* \
40+
--exclude=/run/* \
41+
--exclude=/mnt \
42+
--exclude=/media \
43+
--exclude=/lost+found \
44+
--exclude=/var/tmp/* \
45+
--exclude=/home \
46+
--exclude=/root \
47+
--exclude=/etc/ssh \
48+
--exclude=/etc/hostname \
49+
--exclude=/etc/hosts \
50+
/
51+
52+
echo "[+] Archive created: $TAR_NAME"
53+
54+
echo "[+] Importing into Docker as image: $IMAGE_NAME"
55+
docker import "$TAR_NAME" "$IMAGE_NAME"
56+
57+
echo "[✔] Done!"
58+
echo "You can run the image using:"
59+
echo " docker run -it $IMAGE_NAME bash"
