From 0eaa66f69df57a0cedd76a4fb7e69d86c0cfbcc1 Mon Sep 17 00:00:00 2001
From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com>
Date: Sun, 22 Feb 2026 12:39:17 -0500
Subject: [PATCH 1/3] Revise vLLM performance comparison and test setup

Updated performance metrics and test environment details for vLLM and llama-cpp.
---
 .../08_ml_ai_hpc/08_LLM inference/03_vLLM.md | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
index 7ca9138f88..5de16d3ebb 100644
--- a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
+++ b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
@@ -4,19 +4,25 @@
 [`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving which support a wide variety of models with optimized kernels ensuring optimal utilization of GPUs.
 ## Why `vLLM`?
-We tested `vLLM` and `llama-cpp` (the inference framework behind `ollama`) on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-7B-Instruct` with `512` input and `256` output tokens.
+We tested `vLLM` and `llama-cpp` on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-0.5B-Instruct` with `256` input and `256` output tokens.
-|Inference Server|Peak Throughput|Median Latency(ms)|Recommendation +|Inference Server|Total Throughput (tok/s)|Median Latency (ms)|Recommendation |-----|-----|-----|-----| -|`vLLM`|~4689.6|~48.0|Best for Batch/Research| -|`llama-cpp`|~115.0|~280.0|Best for Single User| +|`vLLM`|~2273|~890|Best for Batch/Research| +|`llama-cpp`|~1312|~1440 |Best for Single User| ### Test Environment GPU: NVIDIA L40S -`vLLM`: 0.13.0 +Model: `Qwen2.5-0.5B-Instruct` -`Ollama` (llama-cpp backend): 0.14.2 +Precision: FP16 + +Workload: 256 input / 256 output tokens + +Concurrency: 8 + +Max requests: 64 ## vLLM Installation Instructions Create a `vLLM` directory in your /scratch directory, then install the vLLM image: From 9611817564fea0ca1b465ad8c77d4051ea84723c Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Mon, 23 Feb 2026 12:42:35 -0500 Subject: [PATCH 2/3] Update vLLM documentation with additional model details Expanded performance comparison of vLLM and llama-cpp for different models and workloads. --- docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md index 5de16d3ebb..c682ed016c 100644 --- a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md +++ b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md @@ -4,18 +4,18 @@ [`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving which support a wide variety of models with optimized kernels ensuring optimal utilization of GPUs. ## Why `vLLM`? -We tested `vLLM` and `llama-cpp` on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-0.5B-Instruct` with `256` input and `256` output tokens. +We tested `vLLM` and `llama-cpp` on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-0.5B-Instruct` and `Qwen2.5-7B-Instruct` with `256` input and `256` output tokens. 
In terms of practical usage, vLLM is better suited for high-concurrency and batch workloads, while llama-cpp is more appropriate for lightweight or single-user scenarios.
-|Inference Server|Total Throughput (tok/s)|Median Latency (ms)|Recommendation
+|Model|Inference Server|Generated Throughput (tok/s)|Median Latency (ms)|
 |-----|-----|-----|-----|
-|`vLLM`|~2273|~890|Best for Batch/Research|
-|`llama-cpp`|~1312|~1440 |Best for Single User|
+|0.5B|`vLLM`|~2273|~890|
+|0.5B|`llama-cpp`|~1312|~1440|
+|7B|`vLLM`|~354|~5780|
+|7B|`llama-cpp`|~277|~7200|
 ### Test Environment
 GPU: NVIDIA L40S
-Model: `Qwen2.5-0.5B-Instruct`
-
 Precision: FP16
 
 Workload: 256 input / 256 output tokens

From a30266f7ec32b21e0b2cd5757ff43ef358497705 Mon Sep 17 00:00:00 2001
From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com>
Date: Mon, 23 Feb 2026 13:08:51 -0500
Subject: [PATCH 3/3] Enhance vLLM section with performance details

Expanded the comparison between vLLM and llama-cpp, highlighting performance advantages and integration with the Hugging Face/Torch ecosystem.
---
 docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
index c682ed016c..0af63cdcf1 100644
--- a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
+++ b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
@@ -4,7 +4,7 @@
 [`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving which support a wide variety of models with optimized kernels ensuring optimal utilization of GPUs.
 ## Why `vLLM`?
-We tested `vLLM` and `llama-cpp` on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-0.5B-Instruct` and `Qwen2.5-7B-Instruct` with `256` input and `256` output tokens.
In terms of practical usage, vLLM is better suited for high-concurrency and batch workloads, while llama-cpp is more appropriate for lightweight or single-user scenarios.
+We tested `vLLM` and `llama-cpp` on Torch, and found that `vLLM` performs better for `Qwen2.5-0.5B-Instruct` and `Qwen2.5-7B-Instruct` with `256` input and `256` output tokens. Based on our results, `vLLM` retains a performance advantage over `llama-cpp` across both model sizes, although the relative gap narrows as the model size increases. In addition, `vLLM` integrates more naturally with the Hugging Face/Torch ecosystem in HPC environments, whereas `llama-cpp` requires more manual setup and compatibility adjustments.
 |Model|Inference Server|Generated Throughput (tok/s)|Median Latency (ms)|
 |-----|-----|-----|-----|
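As a reviewer-style note on the benchmark tables these patches add: the two reported metrics can be reduced from per-request measurements. The sketch below is illustrative only — `summarize`, the sample values, and the wall time are assumptions for the example, not the actual harness that produced the published numbers.

```python
# Illustrative: deriving "Generated Throughput (tok/s)" and "Median Latency (ms)"
# from per-request benchmark samples. Not the harness behind the tables above.
from statistics import median

def summarize(samples, wall_time_s):
    """samples: list of (latency_ms, generated_tokens) per request;
    wall_time_s: total wall-clock time for the whole run."""
    total_generated = sum(tokens for _, tokens in samples)
    throughput_tok_s = total_generated / wall_time_s  # generated tokens per second
    median_latency_ms = median(latency for latency, _ in samples)
    return throughput_tok_s, median_latency_ms

# Example shaped like the documented workload: 64 requests,
# 256 generated tokens each (latency and wall time are made up).
samples = [(890.0, 256)] * 64
tok_s, med_ms = summarize(samples, wall_time_s=7.2)
print(f"{tok_s:.0f} tok/s, median {med_ms:.0f} ms")
```

Note that throughput is computed against total wall time (so it rises with concurrency), while median latency is per request — which is why the tables can show both high aggregate throughput and sub-second medians at concurrency 8.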