From 0eaa66f69df57a0cedd76a4fb7e69d86c0cfbcc1 Mon Sep 17 00:00:00 2001
From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com>
Date: Sun, 22 Feb 2026 12:39:17 -0500
Subject: [PATCH 1/3] Revise vLLM performance comparison and test setup

Updated performance metrics and test environment details for vLLM and llama-cpp.
---
 .../08_ml_ai_hpc/08_LLM inference/03_vLLM.md | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
index 7ca9138f88..5de16d3ebb 100644
--- a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
+++ b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
@@ -4,19 +4,25 @@
 [`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving which support a wide variety of models with optimized kernels ensuring optimal utilization of GPUs.
 ## Why `vLLM`?
-We tested `vLLM` and `llama-cpp` (the inference framework behind `ollama`) on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-7B-Instruct` with `512` input and `256` output tokens.
+We tested `vLLM` and `llama-cpp` on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-0.5B-Instruct` with `256` input and `256` output tokens.
-|Inference Server|Peak Throughput|Median Latency(ms)|Recommendation +|Inference Server|Total Throughput (tok/s)|Median Latency (ms)|Recommendation |-----|-----|-----|-----| -|`vLLM`|~4689.6|~48.0|Best for Batch/Research| -|`llama-cpp`|~115.0|~280.0|Best for Single User| +|`vLLM`|~2273|~890|Best for Batch/Research| +|`llama-cpp`|~1312|~1440 |Best for Single User| ### Test Environment GPU: NVIDIA L40S -`vLLM`: 0.13.0 +Model: `Qwen2.5-0.5B-Instruct` -`Ollama` (llama-cpp backend): 0.14.2 +Precision: FP16 + +Workload: 256 input / 256 output tokens + +Concurrency: 8 + +Max requests: 64 ## vLLM Installation Instructions Create a `vLLM` directory in your /scratch directory, then install the vLLM image: From 9611817564fea0ca1b465ad8c77d4051ea84723c Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Mon, 23 Feb 2026 12:42:35 -0500 Subject: [PATCH 2/3] Update vLLM documentation with additional model details Expanded performance comparison of vLLM and llama-cpp for different models and workloads. --- docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md index 5de16d3ebb..c682ed016c 100644 --- a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md +++ b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md @@ -4,18 +4,18 @@ [`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving which support a wide variety of models with optimized kernels ensuring optimal utilization of GPUs. ## Why `vLLM`? -We tested `vLLM` and `llama-cpp` on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-0.5B-Instruct` with `256` input and `256` output tokens. +We tested `vLLM` and `llama-cpp` on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-0.5B-Instruct` and `Qwen2.5-7B-Instruct` with `256` input and `256` output tokens. 
In terms of practical usage, vLLM is better suited for high-concurrency and batch workloads, while llama-cpp is more appropriate for lightweight or single-user scenarios.
-|Inference Server|Total Throughput (tok/s)|Median Latency (ms)|Recommendation
+|Model|Inference Server|Generated Throughput (tok/s)|Median Latency (ms)|
 |-----|-----|-----|-----|
-|`vLLM`|~2273|~890|Best for Batch/Research|
-|`llama-cpp`|~1312|~1440 |Best for Single User|
+|0.5B|`vLLM`|~2273|~890|
+|0.5B|`llama-cpp`|~1312|~1440|
+|7B|`vLLM`|~354|~5780|
+|7B|`llama-cpp`|~277|~7200|
 ### Test Environment
 GPU: NVIDIA L40S
-Model: `Qwen2.5-0.5B-Instruct`
-
 Precision: FP16
 
 Workload: 256 input / 256 output tokens

From a30266f7ec32b21e0b2cd5757ff43ef358497705 Mon Sep 17 00:00:00 2001
From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com>
Date: Mon, 23 Feb 2026 13:08:51 -0500
Subject: [PATCH 3/3] Enhance vLLM section with performance details

Expanded the comparison between vLLM and llama-cpp, highlighting performance advantages and integration with the Hugging Face/Torch ecosystem.
---
 docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
index c682ed016c..0af63cdcf1 100644
--- a/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
+++ b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
@@ -4,7 +4,7 @@
 [`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving which support a wide variety of models with optimized kernels ensuring optimal utilization of GPUs.
 ## Why `vLLM`?
-We tested `vLLM` and `llama-cpp` on Torch, and found `vLLM` performs better on Torch for `Qwen2.5-0.5B-Instruct` and `Qwen2.5-7B-Instruct` with `256` input and `256` output tokens.
In terms of practical usage, vLLM is better suited for high-concurrency and batch workloads, while llama-cpp is more appropriate for lightweight or single-user scenarios.
+We tested `vLLM` and `llama-cpp` on Torch, and found that `vLLM` performs better for `Qwen2.5-0.5B-Instruct` and `Qwen2.5-7B-Instruct` with `256` input and `256` output tokens. Based on our results, `vLLM` retains a performance advantage over `llama-cpp` across both model sizes, although the relative gap narrows as the model size increases. In addition, `vLLM` integrates more naturally with the Hugging Face/Torch ecosystem in HPC environments, whereas `llama-cpp` requires more manual setup and compatibility adjustments.
 |Model|Inference Server|Generated Throughput (tok/s)|Median Latency (ms)|
 |-----|-----|-----|-----|
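As a reviewer-style note on the benchmark tables these patches add: the two reported metrics can be reduced from per-request measurements. The sketch below is illustrative only — `summarize`, the sample values, and the wall time are assumptions for the example, not the actual harness that produced the published numbers.

```python
# Illustrative: deriving "Generated Throughput (tok/s)" and "Median Latency (ms)"
# from per-request benchmark samples. Not the harness behind the tables above.
from statistics import median

def summarize(samples, wall_time_s):
    """samples: list of (latency_ms, generated_tokens) per request;
    wall_time_s: total wall-clock time for the whole run."""
    total_generated = sum(tokens for _, tokens in samples)
    throughput_tok_s = total_generated / wall_time_s  # generated tokens per second
    median_latency_ms = median(latency for latency, _ in samples)
    return throughput_tok_s, median_latency_ms

# Example shaped like the documented workload: 64 requests,
# 256 generated tokens each (latency and wall time are made up).
samples = [(890.0, 256)] * 64
tok_s, med_ms = summarize(samples, wall_time_s=7.2)
print(f"{tok_s:.0f} tok/s, median {med_ms:.0f} ms")
```

Note that throughput is computed against total wall time (so it rises with concurrency), while median latency is per request — which is why the tables can show both high aggregate throughput and sub-second medians at concurrency 8.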