
Commit 06e1efb

Create README.md
1 parent e76238f commit 06e1efb

File tree

1 file changed: +39 -0 lines changed


benchmarks/README.md

Lines changed: 39 additions & 0 deletions
# Running the benchmark

This benchmarking tool runs a multi-process, throughput-oriented benchmark of Ampere-optimized llama.cpp using arbitrary model(s) provided by the user.
The benchmarking script spawns multiple parallel streams of token generation with llama.cpp and provides the user with aggregate metrics for both the prompt eval and token generation stages.
Under the hood, the _batched-bench_ tool from the upstream llama.cpp project is used in unaltered form.
The script orchestrates the benchmark inside a Docker container from the host environment, **therefore this script should not be run inside a Docker container.**
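
For context, each benchmarked case corresponds roughly to a direct _batched-bench_ run. The snippet below is only an illustrative sketch, not the exact command run.py issues; the binary name (batched-bench vs. llama-batched-bench) and the -npp/-ntg/-npl flags are assumed from the upstream batched-bench documentation and may differ across llama.cpp versions.
```bash
# Illustrative sketch only -- not the exact command issued by run.py.
# Assumed flag meanings (per upstream batched-bench docs):
#   -npp  prompt size(s)   -ntg  tokens generated per stream   -npl  number of parallel streams
./llama-batched-bench -m models/Meta-Llama-3-8B-Instruct.Q8_0.gguf \
    -c 2048 -b 2048 -ub 512 -npp 512 -ntg 128 -npl 1,2,4,8
```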

## Setup

A few dependencies need to be installed first. On Debian-based systems you can use the setup script.
```bash
sudo bash setup_deb.sh
```

## Downloading models

Any GGUF model is expected to work; if you experience trouble running your network of choice, please raise an [issue](https://github.com/AmpereComputingAI/llama.cpp/issues/new/choose).
The benchmarking script expects models to be placed under the _**llama.cpp/benchmarks/models**_ directory.
```bash
mkdir -p models
huggingface-cli download QuantFactory/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct.Q8_0.gguf --local-dir models --local-dir-use-symlinks False
```

## Benchmark

Provide the run.py Python script with the following arguments:
- -m, filename(s) of model(s) that should be available under the _**llama.cpp/benchmarks/models**_ directory; multiple models can be provided
- -t, threadpool size(s) per single process, e.g., if there are 20 threads available on the system and -t 10 is provided, 2 parallel processes will be spawned, each using 10 threads; multiple threadpools can be provided and they will be treated as separate cases to benchmark (see the sketch after the example command below)
- -b, batch size(s) to benchmark, i.e., the number of separate token generation streams handled as a single batch; multiple batch sizes can be provided and they will be treated as separate cases to benchmark
- -p, prompt size(s) to benchmark, i.e., the size of the input prompt; multiple prompt sizes can be provided and they will be treated as separate cases to benchmark
- -r, thread-range, e.g., on an 80-thread system it should be given as 0-79, unless the user wants to use just a subset of the available threads, say 16-63 (48 threads, indexed 16 to 63)
```bash
python3 run.py -m Meta-Llama-3-8B-Instruct.Q8_0.gguf -t 10 16 32 40 64 80 -b 1 2 4 8 16 32 64 -p 512 -r 0-79
```
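
To see how -t and -r interact in the command above, here is a small illustrative sketch (not part of run.py): with the thread range 0-79, i.e., 80 threads in total, each -t value implies a different number of parallel processes.
```bash
# Illustrative only, not part of run.py: how many parallel processes each
# -t value implies when the thread range is 0-79 (80 threads in total).
TOTAL_THREADS=80
for T in 10 16 32 40 64 80; do
    echo "-t ${T}: $((TOTAL_THREADS / T)) parallel process(es) of ${T} threads each"
done
```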

## Quick run on an 80-thread OCI A1 system

```bash
bash setup_deb.sh # works on Debian-based systems
bash download_models.sh # uncomment preferred models in the file; by default llama3 q8_0 will be downloaded
bash run.sh # modify to adjust the number of threads available and other parameters
```
