# Basic LLM Inference with Hugging Face transformers

Here we provide an example of how to run a Hugging Face large language model (LLM) on the NYU Torch cluster.

# High-performance LLM inference with `vLLM`

[`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving which support a wide variety of models with optimized kernels ensuring optimal utilization of GPUs.

## Why `vLLM`?
We tested `vLLM` and `llama-cpp` (the inference framework behind `ollama`) on Torch and found that `vLLM` performs better for `Qwen2.5-7B-Instruct` with `512` input and `256` output tokens.

|Inference Server|Peak Throughput|Median Latency (ms)|Recommendation|
|-----|-----|-----|-----|
|`vLLM`|~4689.6|~48.0|Best for Batch/Research|
|`llama-cpp`|~115.0|~280.0|Best for Single User|

### Test Environment
- GPU: NVIDIA L40S
- `vLLM`: 0.13.0
- `Ollama` (`llama-cpp` backend): 0.14.2

## `vLLM` Installation Instructions
Create a `vLLM` directory in your `/scratch` directory, then pull the `vLLM` container image:
```sh
apptainer pull docker://vllm/vllm-openai:latest
```
### Avoid filling up your `$HOME` directory
To avoid exceeding your `$HOME` quota (50GB) and inode limits (30,000 files), you should redirect `vLLM`'s cache and Hugging Face's model downloads to your scratch space:
```sh
export HF_HOME=/scratch/$USER/hf_cache
export VLLM_CACHE_ROOT=/scratch/$USER/vllm_cache
```
To make these settings persistent across sessions, append them to your `~/.bashrc`:
```sh
echo "export HF_HOME=/scratch/\$USER/hf_cache" >> ~/.bashrc
echo "export VLLM_CACHE_ROOT=/scratch/\$USER/vllm_cache" >> ~/.bashrc
```

:::note
Files on `$SCRATCH` are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and `.slurm` scripts in `$HOME`!
:::

## Run `vLLM`
### Online Serving (OpenAI-Compatible API)
`vLLM` implements the OpenAI API protocol, allowing it to be a drop-in replacement for applications using OpenAI's services. By default, it starts the server at `http://localhost:8000`. You can specify the address with `--host` and `--port` arguments.
**In Terminal 1:**
Start the `vLLM` server (this example uses a Qwen model):
```sh
apptainer exec --nv vllm-openai_latest.sif vllm serve "Qwen/Qwen2.5-0.5B-Instruct"
```
When you see `Application startup complete.` in the server logs, open another terminal and log in to the same compute node as in Terminal 1.

**In Terminal 2**
```sh
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [
          {"role": "user", "content": "Hello!"}
        ]
      }'
```
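The same request can also be sent from Python. Below is a minimal sketch using only the standard library, assuming the server from Terminal 1 is running on `localhost:8000`; the helper names are illustrative:

```python
import json
import urllib.request

# Illustrative helper: builds an OpenAI-style chat completion request body,
# the protocol that vLLM's server implements.
def build_chat_request(model, user_message):
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def chat(base_url, model, user_message, timeout=60):
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    # The assistant's reply text is in choices[0].message.content.
    return reply["choices"][0]["message"]["content"]

# Example (requires the server from Terminal 1 to be running):
# print(chat("http://localhost:8000", "Qwen/Qwen2.5-0.5B-Instruct", "Hello!"))
```

Because the API is OpenAI-compatible, the official `openai` Python client can also be pointed at this base URL instead.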

### Offline Inference
If you need to process a large dataset at once without setting up a server, you can use `vLLM`'s `LLM` class.
For example, the following code downloads the `facebook/opt-125m` model from Hugging Face and runs it in `vLLM` using the default configuration.
```python
from vllm import LLM

# Initialize the vLLM engine.
llm = LLM(model="facebook/opt-125m")
```
After initializing the `LLM` instance, use the available APIs to perform model inference.
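For instance, a minimal batch generation call might look like the following sketch (prompts and sampling parameters are illustrative; running it requires a GPU node and downloads the model on first use):

```python
from vllm import LLM, SamplingParams

# Initialize the engine and choose sampling settings (values are illustrative).
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() runs all prompts through the engine as one batch.
outputs = llm.generate(["The capital of France is", "Hello, my name is"], sampling)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```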

### SGLang: A Simple Option for Offline Batch Inference
For cases where you only want to run batch inference and do not need an HTTP endpoint, SGLang provides a much simpler offline engine API than running a full `vLLM` server. It is particularly well suited to dataset processing, evaluation pipelines, and one-off large-scale inference jobs.
For more details and examples, see the official SGLang offline engine documentation here: https://docs.sglang.io/basic_usage/offline_engine_api.html
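As a rough sketch of that API (model path and sampling parameters are illustrative; check the linked documentation for the exact interface, and note that running this requires a GPU node):

```python
import sglang as sgl

# Offline batch inference sketch based on the SGLang offline engine API.
llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B-Instruct")

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, {"temperature": 0.8, "max_new_tokens": 32})
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()
```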


## `vLLM` CLI
The `vllm` command-line tool is used to run and manage `vLLM` models. You can start by viewing the help message with:
```sh
vllm --help
```
`serve` - Starts the `vLLM` OpenAI-compatible API server:
```sh
vllm serve meta-llama/Llama-2-7b-hf
```
`chat` - Generate chat completions via the running API server:
```sh
# Directly connect to localhost API without arguments
vllm chat

# Connect to a specific API server
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
vllm chat --quick "hi"
```
`complete` - Generate text completions for a given prompt via the running API server:
```sh
# Directly connect to localhost API without arguments
vllm complete
```