diff --git a/docs/hpc/08_ml_ai_hpc/LLM Inference/01_llm_inferenceoverview.md b/docs/hpc/08_ml_ai_hpc/08_LLM inference/01_llm_inferenceoverview.md
similarity index 100%
rename from docs/hpc/08_ml_ai_hpc/LLM Inference/01_llm_inferenceoverview.md
rename to docs/hpc/08_ml_ai_hpc/08_LLM inference/01_llm_inferenceoverview.md
diff --git a/docs/hpc/08_ml_ai_hpc/LLM Inference/02_run_hf_model.md b/docs/hpc/08_ml_ai_hpc/08_LLM inference/02_run_hf_model.md
similarity index 99%
rename from docs/hpc/08_ml_ai_hpc/LLM Inference/02_run_hf_model.md
rename to docs/hpc/08_ml_ai_hpc/08_LLM inference/02_run_hf_model.md
index 5db01dc19d..01cc653b0b 100644
--- a/docs/hpc/08_ml_ai_hpc/LLM Inference/02_run_hf_model.md
+++ b/docs/hpc/08_ml_ai_hpc/08_LLM inference/02_run_hf_model.md
@@ -1,4 +1,4 @@
-# Run a Hugging Face model
+# Basic LLM Inference with Hugging Face transformers

Here we provide an example of how one can run a Hugging Face Large-language model (LLM) on the NYU Torch cluster

diff --git a/docs/hpc/08_ml_ai_hpc/LLM Inference/03_vLLM.md b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
similarity index 53%
rename from docs/hpc/08_ml_ai_hpc/LLM Inference/03_vLLM.md
rename to docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
index cdc2434da8..7ca9138f88 100644
--- a/docs/hpc/08_ml_ai_hpc/LLM Inference/03_vLLM.md
+++ b/docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md
@@ -1,42 +1,50 @@
-# vLLM - A Command Line LLM Tool
+# High-performance LLM inference with `vLLM`
+
## What is vLLM?
-[vLLM](https://docs.vllm.ai/en/latest/) is a fast and easy-to-use library for LLM inference and serving.
-
-## Why vLLM?
-We tested vLLM and llama-cpp on Torch, and found vLLM performs better on Torch:
-Model: Qwen2.5-7B-Instruct
-Prompt Tokens:512
-Output Tokens: 256
-|Backend|Peak Throughput|Median Latency(ms)|Recommendation
+[`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving that supports a wide variety of models, with optimized kernels that ensure efficient GPU utilization.
+
+## Why `vLLM`?
+We tested `vLLM` and `llama-cpp` (the inference framework behind `ollama`) on Torch, and found that `vLLM` performs better for `Qwen2.5-7B-Instruct` with `512` input and `256` output tokens.
+
+|Inference Server|Peak Throughput|Median Latency (ms)|Recommendation|
|-----|-----|-----|-----|
-|vLLM|~4689.6|~48.0|Best for Batch/Research|
-|llama-cpp|~115.0|~280.0|Best for Single User|
+|`vLLM`|~4689.6|~48.0|Best for Batch/Research|
+|`llama-cpp`|~115.0|~280.0|Best for Single User|
+
+### Test Environment
+GPU: NVIDIA L40S
+
+`vLLM`: 0.13.0
+
+`Ollama` (llama-cpp backend): 0.14.2

## vLLM Installation Instructions
-Create a vLLM directory in your /scratch directory, then install the vLLM image:
+Create a `vLLM` directory in your `/scratch` directory, then pull the `vLLM` container image:
```
apptainer pull docker://vllm/vllm-openai:latest
```
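+For example, a minimal sequence could look like the following sketch (the `/scratch/$USER/vllm` path is only an illustration; any directory under your scratch space works):
+```sh
+# Create a working directory on scratch and pull the container image into it
+mkdir -p /scratch/$USER/vllm
+cd /scratch/$USER/vllm
+# apptainer names the resulting image file vllm-openai_latest.sif
+apptainer pull docker://vllm/vllm-openai:latest
+```
+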
-### Use High-Performance SCRATCH Storage
-LLMs require very fast storage. On Torch, the SCRATCH filesystem is an all-flash system designed for AI workloads, providing excellent performance.To avoid exceeding your $HOME quota (50GB) and inode limits (30,000 files), you should redirect vLLM's cache and Hugging Face's model downloads to your scratch space:
-```
+### Avoid filling up your `$HOME` directory
+To avoid exceeding your `$HOME` quota (50GB) and inode limits (30,000 files), you should redirect `vLLM`'s cache and Hugging Face's model downloads to your scratch space:
+```sh
export HF_HOME=/scratch/$USER/hf_cache
export VLLM_CACHE_ROOT=/scratch/$USER/vllm_cache
```
-You should run this to configure vLLM to always use your SCRATCH storage for consistent use:
-```
+To make these settings persistent, so that `vLLM` always uses your `$SCRATCH` storage, append them to your `~/.bashrc`:
+```sh
echo "export HF_HOME=/scratch/\$USER/hf_cache" >> ~/.bashrc
echo "export VLLM_CACHE_ROOT=/scratch/\$USER/vllm_cache" >> ~/.bashrc
```
-Note: Files on $SCRATCH are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and .slurm scripts in $HOME!
+:::note
+Files on `$SCRATCH` are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and `.slurm` scripts in `$HOME`!
+:::

## Run vLLM
### Online Serving (OpenAI-Compatible API)
-vLLM implements the OpenAI API protocol, allowing it to be a drop-in replacement for applications using OpenAI's services. By default, it starts the server at http://localhost:8000. You can specify the address with --host and --port arguments.
+`vLLM` implements the OpenAI API protocol, allowing it to be a drop-in replacement for applications using OpenAI's services. By default, it starts the server at `http://localhost:8000`. You can specify the address with the `--host` and `--port` arguments.

**In Terminal 1:**
Start vLLM server (In this example we use Qwen model):
-```
+```sh
apptainer exec --nv vllm-openai_latest.sif vllm serve "Qwen/Qwen2.5-0.5B-Instruct"
```
When you see:
@@ -46,7 +54,7 @@ Application startup complete.

Open another terminal and log in to the same computing node as in terminal 1.
**In Terminal 2**
-```
+```sh
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
@@ -58,9 +66,9 @@ curl http://localhost:8000/v1/chat/completions \
```

### Offline Inference
-If you need to process a large dataset at once without setting up a server, you can use vLLM's LLM class.
-For example, the following code downloads the facebook/opt-125m model from HuggingFace and runs it in vLLM using the default configuration.
-```
+If you need to process a large dataset at once without setting up a server, you can use `vLLM`'s `LLM` class.
+For example, the following code downloads the `facebook/opt-125m` model from Hugging Face and runs it in `vLLM` using the default configuration.
+```python
from vllm import LLM

# Initialize the vLLM engine.
@@ -68,23 +76,22 @@ llm = LLM(model="facebook/opt-125m")
```
After initializing the LLM instance, use the available APIs to perform model inference.

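+For example, a minimal batch-generation sketch using the `generate` API and `SamplingParams` might look like the following (the prompts and sampling settings are placeholders, not part of the original example):
+```python
+from vllm import LLM, SamplingParams
+
+# Placeholder prompts and sampling settings; adjust for your workload.
+prompts = ["Hello, my name is", "The capital of France is"]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
+
+llm = LLM(model="facebook/opt-125m")
+
+# generate() runs all prompts as one batch and returns one RequestOutput per prompt.
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    print(output.prompt, "->", output.outputs[0].text)
+```
+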
-### SGLang: A Simple Option for Offline Batch Inference (Supplement Material)
+### SGLang: A Simple Option for Offline Batch Inference
For cases where users only want to run batch inference and do not need an HTTP endpoint, SGLang provides a much simpler offline engine API compared to running a full vLLM server. It is particularly suitable for dataset processing, evaluation pipelines, and one-off large-scale inference jobs.
-For more details and examples, see the official SGLang offline engine documentation:
-https://docs.sglang.io/basic_usage/offline_engine_api.html
+For more details and examples, see the official SGLang offline engine documentation: https://docs.sglang.io/basic_usage/offline_engine_api.html

-## vLLM CLI
-The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
-```
+## `vLLM` CLI
+The `vllm` command-line tool is used to run and manage `vLLM` models. You can start by viewing the help message with:
+```sh
vllm --help
```

Serve - Starts the vLLM OpenAI Compatible API server.
-```
+```sh
vllm serve meta-llama/Llama-2-7b-hf
```
Chat - Generate chat completions via the running API server.
-```
+```sh
# Directly connect to localhost API without arguments
vllm chat

@@ -95,7 +102,7 @@ vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
vllm chat --quick "hi"
```
Complete - Generate text completions based on the given prompt via the running API server.
-```
+```sh
# Directly connect to localhost API without arguments
vllm complete