# Basic LLM Inference with Hugging Face transformers

Here we provide an example of how to run a Hugging Face large language model (LLM) on the NYU Torch cluster.

# High-performance LLM inference with `vLLM`

[`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving which support a wide variety of models with optimized kernels ensuring optimal utilization of GPUs.

## Why `vLLM`?
We tested `vLLM` and `llama-cpp` (the inference framework behind `ollama`) on Torch and found that `vLLM` performs better for `Qwen2.5-7B-Instruct` with `512` input and `256` output tokens.

|Inference Server|Peak Throughput|Median Latency (ms)|Recommendation|
|-----|-----|-----|-----|
|`vLLM`|~4689.6|~48.0|Best for Batch/Research|
|`llama-cpp`|~115.0|~280.0|Best for Single User|

### Test Environment
- GPU: NVIDIA L40S
- `vLLM`: 0.13.0
- `Ollama` (`llama-cpp` backend): 0.14.2

## `vLLM` Installation Instructions
Create a `vLLM` directory in your `/scratch` directory, then pull the `vLLM` container image:
```sh
apptainer pull docker://vllm/vllm-openai:latest
```
### Avoid filling up your `$HOME` directory
To avoid exceeding your `$HOME` quota (50GB) and inode limits (30,000 files), you should redirect `vLLM`'s cache and Hugging Face's model downloads to your scratch space:
```sh
export HF_HOME=/scratch/$USER/hf_cache
export VLLM_CACHE_ROOT=/scratch/$USER/vllm_cache
```
To make these settings persistent across sessions, append them to your `~/.bashrc`:
```sh
echo "export HF_HOME=/scratch/\$USER/hf_cache" >> ~/.bashrc
echo "export VLLM_CACHE_ROOT=/scratch/\$USER/vllm_cache" >> ~/.bashrc
```

:::note
Files on `$SCRATCH` are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and `.slurm` scripts in `$HOME`!
:::

## Run `vLLM`
### Online Serving (OpenAI-Compatible API)
`vLLM` implements the OpenAI API protocol, allowing it to be a drop-in replacement for applications using OpenAI's services. By default, it starts the server at `http://localhost:8000`. You can specify the address with `--host` and `--port` arguments.
**In Terminal 1:**
Start the `vLLM` server (this example uses a Qwen model):
```sh
apptainer exec --nv vllm-openai_latest.sif vllm serve "Qwen/Qwen2.5-0.5B-Instruct"
```
When you see `Application startup complete.` in the server logs, open another terminal and log in to the same compute node as in Terminal 1.

**In Terminal 2**
```sh
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [
          {"role": "user", "content": "Hello!"}
        ]
      }'
```
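The same request can also be sent from Python. Below is a minimal sketch using only the standard library, assuming the server from Terminal 1 is running on `localhost:8000`; the helper names are illustrative:

```python
import json
import urllib.request

# Illustrative helper: builds an OpenAI-style chat completion request body,
# the protocol that vLLM's server implements.
def build_chat_request(model, user_message):
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def chat(base_url, model, user_message, timeout=60):
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    # The assistant's reply text is in choices[0].message.content.
    return reply["choices"][0]["message"]["content"]

# Example (requires the server from Terminal 1 to be running):
# print(chat("http://localhost:8000", "Qwen/Qwen2.5-0.5B-Instruct", "Hello!"))
```

Because the API is OpenAI-compatible, the official `openai` Python client can also be pointed at this base URL instead.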

### Offline Inference
If you need to process a large dataset at once without setting up a server, you can use `vLLM`'s `LLM` class.
For example, the following code downloads the `facebook/opt-125m` model from Hugging Face and runs it in `vLLM` using the default configuration.
```python
from vllm import LLM

# Initialize the vLLM engine.
llm = LLM(model="facebook/opt-125m")
```
After initializing the `LLM` instance, use the available APIs to perform model inference.
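For instance, a minimal batch generation call might look like the following sketch (prompts and sampling parameters are illustrative; running it requires a GPU node and downloads the model on first use):

```python
from vllm import LLM, SamplingParams

# Initialize the engine and choose sampling settings (values are illustrative).
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() runs all prompts through the engine as one batch.
outputs = llm.generate(["The capital of France is", "Hello, my name is"], sampling)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```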

### SGLang: A Simple Option for Offline Batch Inference
For cases where you only want to run batch inference and do not need an HTTP endpoint, SGLang provides a much simpler offline engine API than running a full `vLLM` server. It is particularly well suited to dataset processing, evaluation pipelines, and one-off large-scale inference jobs.
For more details and examples, see the official SGLang offline engine documentation here: https://docs.sglang.io/basic_usage/offline_engine_api.html
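As a rough sketch of that API (model path and sampling parameters are illustrative; check the linked documentation for the exact interface, and note that running this requires a GPU node):

```python
import sglang as sgl

# Offline batch inference sketch based on the SGLang offline engine API.
llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B-Instruct")

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, {"temperature": 0.8, "max_new_tokens": 32})
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()
```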


## `vLLM` CLI
The `vllm` command-line tool is used to run and manage `vLLM` models. You can start by viewing the help message with:
```sh
vllm --help
```
`serve` - Starts the `vLLM` OpenAI-compatible API server:
```sh
vllm serve meta-llama/Llama-2-7b-hf
```
`chat` - Generate chat completions via the running API server:
```sh
# Directly connect to localhost API without arguments
vllm chat

# Connect to a specific API server
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
vllm chat --quick "hi"
```
`complete` - Generate text completions for a given prompt via the running API server:
```sh
# Directly connect to localhost API without arguments
vllm complete
```