# Running the benchmark

This benchmarking tool runs a multi-process, throughput-oriented benchmark of Ampere-optimized llama.cpp using arbitrary model(s) provided by the user.
The benchmarking script spawns multiple parallel streams of token generation with llama.cpp and provides the user with aggregate metrics for both the prompt eval and token generation stages.
Under the hood, the _batched-bench_ script from the upstream llama.cpp project is used in unaltered form.
The script orchestrates the benchmark inside a Docker container from the host environment, **so it should not itself be run inside a Docker container.**

## Setup
A few dependencies need to be installed first. On Debian-based systems you can use the provided setup script.
```bash
sudo bash setup_deb.sh
```
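After the script finishes, you can sanity-check that the two tools the rest of this guide relies on, Docker and Python 3, are available. The exact package list installed by `setup_deb.sh` may differ; this is just a quick verification:
```bash
# Quick sanity check of the tools the benchmark flow relies on
# (assumes setup_deb.sh has installed Docker and Python 3)
docker --version   # Docker is required, since the benchmark runs in a container
python3 --version  # run.py is a Python 3 script
```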

## Downloading models
Any GGUF model is expected to work; if you experience trouble running your network of choice, please raise an [issue](https://github.com/AmpereComputingAI/llama.cpp/issues/new/choose).
The benchmarking script expects models to be placed under the _**llama.cpp/benchmarks/models**_ directory.
```bash
mkdir -p models
huggingface-cli download QuantFactory/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct.Q8_0.gguf --local-dir models --local-dir-use-symlinks False
```
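If the download succeeded, the GGUF file should now sit directly under the models directory (a quick check, assuming the default file name from the command above):
```bash
# Verify the model file landed where run.py expects it
ls -lh models/Meta-Llama-3-8B-Instruct.Q8_0.gguf
```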

## Benchmark
Provide the run.py Python script with the following arguments:
- -m, filename(s) of model(s) available under the _**llama.cpp/benchmarks/models**_ directory; multiple models can be provided
- -t, threadpool(s) per single process, e.g., if there are 20 threads available on the system and -t 10 is provided, 2 parallel processes will be spawned, each using 10 threads (see the worked example below);
  multiple threadpools can be provided and they will be treated as separate cases to benchmark
- -b, batch size(s) to benchmark, i.e., the number of separate token generation streams handled as a single batch; multiple batch sizes can be provided and they will be treated as separate cases to benchmark
- -p, prompt size(s) to benchmark, i.e., the size of the input prompt; multiple prompt sizes can be provided and they will be treated as separate cases to benchmark
- -r, thread-range, e.g., on an 80-thread system it should be 0-79, unless the user wants to use just a subset of the available threads, say 16-63 (48 threads, indexed 16 through 63)
```bash
python3 run.py -m Meta-Llama-3-8B-Instruct.Q8_0.gguf -t 10 16 32 40 64 80 -b 1 2 4 8 16 32 64 -p 512 -r 0-79
```
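To make the -t semantics concrete, here is a small illustration of how the thread-range and threadpool size determine the number of parallel llama.cpp processes. The arithmetic is inferred from the -t description above, not taken from run.py itself:
```bash
# Assumed mapping from -r/-t to process count, per the -t description above:
# 80 threads (-r 0-79) with -t 10 -> 80 / 10 = 8 parallel processes,
# each pinned to its own 10-thread slice
TOTAL_THREADS=80      # threads covered by -r 0-79
THREADS_PER_PROC=10   # one value from -t
echo "parallel processes: $(( TOTAL_THREADS / THREADS_PER_PROC ))"
```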

## Quick run on an 80-thread OCI A1 system
```bash
bash setup_deb.sh # works on Debian-based systems
bash download_models.sh # uncomment preferred models in the file; by default llama3 q8_0 will be downloaded
bash run.sh # edit to adjust the number of available threads and other parameters
```
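On machines with a different thread count, the same flow applies; only the thread-range and threadpool sizes need adjusting. For example, a plausible invocation on a 16-thread system (the parameter values here are illustrative, not tuned recommendations):
```bash
# Hypothetical run on a 16-thread machine: pools of 8 and 16 threads
python3 run.py -m Meta-Llama-3-8B-Instruct.Q8_0.gguf -t 8 16 -b 1 2 4 -p 512 -r 0-15
```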