NVIDIA LLM NIM - vLLM Benchmark

This repository, based on vLLM Benchmark, gives NVIDIA LLM NIM users an easy way to run performance tests against NIMs. It does everything vLLM Benchmark does, with NIM support bolted on. This repo is not affiliated with NVIDIA.

Features

Added support for NVIDIA LLM NIMs
Benchmark NVIDIA LLM NIMs with different concurrency levels
Automatically home in on the maximum number of users per NIM deployment at the GPU level
Measure key performance metrics:
- Requests per second
- Latency
- Tokens per second
- Time to first token
Easy to run with customizable parameters
Generates JSON output for further analysis or visualization
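
To illustrate how the per-request metrics listed above relate to one another, here is a minimal sketch that times one streamed response. It is not taken from this repository's scripts: `measure_stream` and its `clock` parameter are hypothetical names, and the sketch assumes each streamed chunk carries exactly one token, which is only an approximation.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time one streamed response (hypothetical helper, not the repo's code).

    Assumes each item in `chunks` is one generated token.
    """
    start = clock()
    first_token_at = None
    tokens = 0
    for _ in chunks:
        if first_token_at is None:
            first_token_at = clock()  # the first token has arrived
        tokens += 1
    end = clock()

    latency = end - start
    if first_token_at is None:
        # Nothing was generated, so fall back to the total latency.
        ttft = latency
    else:
        ttft = first_token_at - start
    return {
        "latency": latency,
        "time_to_first_token": ttft,
        "output_tokens": tokens,
        "tokens_per_second": tokens / latency if latency > 0 else 0.0,
    }
```

Requests per second is then the number of completed requests divided by the wall-clock duration of the whole run, rather than something measured per request.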

Requirements

Python 3.7+
openai Python package
numpy Python package

Installation

Clone this repository:

git clone https://github.com/staggeredsix/vllm-benchmark_NIM.git
cd vllm-benchmark_NIM

Install the required packages:
```
pip install openai numpy
```

Usage

python ./nim_vllm_benchmarks.py

Select the test to run. The model will load, and the script will detect the correct model name for requests. A known bug will make you select the test a second time. Manual Test lets you set the number of requests and the concurrency. Auto Test starts with a medium load and adjusts it until the returned tokens per second (TPS) drops below 12.
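
The model-name detection mentioned above could look roughly like the sketch below. This is an assumption about the approach, not the script's actual code: it relies on the endpoint exposing the OpenAI-compatible model listing, assumes version 1.x of the `openai` package, and uses a hypothetical local URL and placeholder API key. `detect_model_name` and `make_client` are illustrative names.

```python
def detect_model_name(client) -> str:
    """Return the id of the first model the endpoint reports.

    `client` is expected to be an `openai.OpenAI` instance (openai 1.x);
    `client.models.list()` queries the OpenAI-compatible model listing.
    """
    models = client.models.list()
    if not models.data:
        raise RuntimeError("endpoint reported no models")
    return models.data[0].id


def make_client(base_url: str = "http://localhost:8000/v1"):
    """Build a client for a local endpoint (URL and key are assumptions)."""
    # Imported lazily so detect_model_name can be used without the package.
    from openai import OpenAI

    # Placeholder key; assumes the local endpoint does not validate it.
    return OpenAI(base_url=base_url, api_key="not-used")
```

With a deployment running, `detect_model_name(make_client())` would return the name to pass as `model` in later requests.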

Output

The benchmark reports the current in-flight requests and the current average TPS every 10 seconds. After the benchmark ends, it reports total concurrent requests and TPS.

Auto Test will hammer the model with requests and update you every 10 seconds with in-flight requests and TPS. Once it detects that the TPS return rate has dropped to or below 12, it ends the test and reports the maximum concurrent requests. Rerun a few times for confirmation.
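
The Auto Test behaviour described above can be sketched as a simple ramp. How the real script adjusts the load is not documented here, so the linear step, the starting level, and the upper limit below are all assumptions, and `measure_tps` is a hypothetical callback standing in for an actual load run.

```python
from typing import Callable

TPS_FLOOR = 12.0  # threshold taken from the description above


def find_max_concurrency(measure_tps: Callable[[int], float],
                         start: int = 64,
                         step: int = 16,
                         limit: int = 4096) -> int:
    """Raise concurrency until measured TPS falls to or below TPS_FLOOR.

    `measure_tps(n)` is a hypothetical callback that runs the load at `n`
    concurrent requests and returns the observed tokens per second.
    Returns the highest level that stayed above the floor, or 0 if none did.
    """
    best = 0
    concurrency = start
    while concurrency <= limit:
        if measure_tps(concurrency) <= TPS_FLOOR:
            break  # throughput dropped to the floor; stop ramping
        best = concurrency
        concurrency += step
    return best
```

A linear ramp is the simplest choice; a coarser first pass followed by a finer one would find the limit with fewer load runs, at the cost of more code.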

On an A800, Llama 3 8B Instruct reached around 313 concurrent requests. This was measured over 5 runs with no specific cool-down for GPU temperatures.

The benchmark results are saved in JSON format, containing detailed metrics for each run, including:

Total requests and successful requests
Requests per second
Total output tokens
Latency (average, p50, p95, p99)
Tokens per second (average, p50, p95, p99)
Time to first token (average, p50, p95, p99)
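
As a rough illustration of the summary statistics listed above, the following sketch computes the average and the p50, p95, and p99 percentiles with numpy and prints them as JSON. The `summarize` helper, the JSON key names, and the sample latencies are made up for the example; the actual output files may use a different layout.

```python
import json

import numpy as np


def summarize(values):
    """Return average and p50/p95/p99 for a list of per-request measurements."""
    arr = np.asarray(values, dtype=float)
    return {
        "average": float(arr.mean()),
        "p50": float(np.percentile(arr, 50)),
        "p95": float(np.percentile(arr, 95)),
        "p99": float(np.percentile(arr, 99)),
    }


# Made-up latencies in seconds, for illustration only.
latencies = [0.8, 1.0, 1.2, 1.1, 3.5]
print(json.dumps({"latency": summarize(latencies)}, indent=2))
```

The same helper applies unchanged to tokens per second and time to first token, since each is just a list of per-request numbers.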

Results

Results are written to the same directory as nim_vllm_benchmarks.py.

Contributing

Contributions to improve the benchmarking scripts or add new features are welcome! Please feel free to submit pull requests or open issues for any bugs or feature requests.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

MY MODIFICATIONS

This repo is for me to work on personal projects without breaking the original while allowing anyone to benefit from the work. Any modifications I create are freely available. I won't push any changes back to the original work as I don't trust myself to not break things.

PENDING ENHANCEMENTS

Naming each run for logging and moving the logs to the results directory. General bug squashing. Benchmark chart creation.
