From 4f67e6555ee8fcb48e1cccdc057e7c8290c9537c Mon Sep 17 00:00:00 2001
From: Jimin Park
Date: Mon, 8 Dec 2025 04:12:05 +0000
Subject: [PATCH 1/4] add readme file

---
 vllm/benchmarks/README.md | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 vllm/benchmarks/README.md

diff --git a/vllm/benchmarks/README.md b/vllm/benchmarks/README.md
new file mode 100644
index 000000000000..828f2ef39438
--- /dev/null
+++ b/vllm/benchmarks/README.md
@@ -0,0 +1 @@
+readme template

From 90861f1ac7a9cc4503411b0367a3ac6a9d926c9a Mon Sep 17 00:00:00 2001
From: Jimin Park
Date: Mon, 8 Dec 2025 04:16:01 +0000
Subject: [PATCH 2/4] modify readme

---
 vllm/benchmarks/README.md | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/vllm/benchmarks/README.md b/vllm/benchmarks/README.md
index 828f2ef39438..5f5237718d37 100644
--- a/vllm/benchmarks/README.md
+++ b/vllm/benchmarks/README.md
@@ -1 +1,23 @@
-readme template
+We support a high-throughput benchmark.
+
+[Benchmark multiprocessing]
+Background: TBD
+
+Implementation: TBD
+
+[Benchmark trimming]
+Background:
+To measure the peak throughput of the benchmark, the beginning of the run (where decoding batches gradually grow as they fill) and the end (where decoding batches gradually drain as requests complete) must be excluded from the overall benchmark duration.
+
+Implementation:
+Trimming is implemented on top of the response metadata collected for each request during the benchmark. The metadata includes the request send time, the time to first token, and the inter-token latencies. From these, the generation timestamp of every token is reconstructed, and only the tokens that fall within the user-specified time window are counted.
+
+Test result:
+The benchmark execution script sets the warmup time to 150 seconds and the cooldown time to 120 seconds. The trimmed results are listed at the bottom of the benchmark output, allowing us to confirm the following metrics: 1) the benchmark duration after trimming, 2) the number of tokens generated within the defined time window, 3) the output token throughput, and 4) the inter-token latency (ITL).
+Furthermore, the graph below, which visually compares the results with and without trimming, shows the specific time window from which token information was aggregated. Since each bar in the graph spans 1 second, the bar height can be read as tokens per second, which directly represents the output token throughput.
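To make the trimming procedure concrete, here is a minimal sketch of the timestamp reconstruction described in the patch above. It is illustrative only, not the benchmark's actual code, and the metadata field names (`send_time`, `ttft`, `itls`) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class RequestMetadata:
    send_time: float   # absolute time the request was sent (s)
    ttft: float        # time to first token (s)
    itls: list[float]  # inter-token latencies of the remaining tokens (s)

def token_timestamps(meta: RequestMetadata) -> list[float]:
    """Back-calculate an absolute generation timestamp for every token."""
    ts = [meta.send_time + meta.ttft]
    for itl in meta.itls:
        ts.append(ts[-1] + itl)
    return ts

def trimmed_token_count(requests: list[RequestMetadata], bench_start: float,
                        bench_end: float, warmup: float, cooldown: float) -> int:
    """Count only the tokens generated inside the trimmed time window."""
    lo, hi = bench_start + warmup, bench_end - cooldown
    return sum(1 for meta in requests
               for t in token_timestamps(meta) if lo <= t <= hi)
```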
From 995b8aba6950044ab7b33ffe0956c2fdbf237e4f Mon Sep 17 00:00:00 2001
From: Jimin Park
Date: Mon, 8 Dec 2025 04:23:00 +0000
Subject: [PATCH 3/4] fill

---
 vllm/benchmarks/README.md | 109 ++++++++++++++++++++++++++++++++++----
 1 file changed, 98 insertions(+), 11 deletions(-)

diff --git a/vllm/benchmarks/README.md b/vllm/benchmarks/README.md
index 5f5237718d37..f4eb9e4075e3 100644
--- a/vllm/benchmarks/README.md
+++ b/vllm/benchmarks/README.md
@@ -1,23 +1,110 @@
-We support a high-throughput benchmark.
+# Moreh-vllm Benchmark

-[Benchmark multiprocessing]
-Background: TBD
+In this directory, we introduce our work to support the high-throughput benchmark, along with the test results.

-Implementation: TBD
+## Benchmark multiprocessing
+- Background: TBD

-[Benchmark trimming]
-Background:
+- Implementation: TBD

+## Benchmark trimming
+- Background:
 To measure the peak throughput of the benchmark, the beginning of the run (where decoding batches gradually grow as they fill) and the end (where decoding batches gradually drain as requests complete) must be excluded from the overall benchmark duration.

-Implementation:
+- Implementation:
 Trimming is implemented on top of the response metadata collected for each request during the benchmark. The metadata includes the request send time, the time to first token, and the inter-token latencies. From these, the generation timestamp of every token is reconstructed, and only the tokens that fall within the user-specified time window are counted.

-Test result:
+## Test result
 The benchmark execution script sets the warmup time to 150 seconds and the cooldown time to 120 seconds. The trimmed results are listed at the bottom of the benchmark output, allowing us to confirm the following metrics: 1) the benchmark duration after trimming, 2) the number of tokens generated within the defined time window, 3) the output token throughput, and 4) the inter-token latency (ITL).
 Furthermore, the graph below, which visually compares the results with and without trimming, shows the specific time window from which token information was aggregated. Since each bar in the graph spans 1 second, the bar height can be read as tokens per second, which directly represents the output token throughput.
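As a reading aid for the graph just described, here is a minimal sketch of the 1-second binning (illustrative only; `token_ts` is assumed to hold the reconstructed token timestamps from the trimming step):

```python
import math
from collections import Counter

def tokens_per_second(token_ts: list[float], bench_start: float) -> Counter:
    """Histogram with 1 s bars: each bucket's count is tokens/s, i.e. the bar height."""
    return Counter(math.floor(t - bench_start) for t in token_ts)
```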
+### Test env +- Heimdall (heimdall & gateway: mi300-6) +- 1P1D (prefill: mi300-4, decode: mi300-5) + - External LB enabled (8 AsyncLLM) + - 8 Proxy servers: mi300-5 (1:1 matched to each decode ranks) + +### Test script +``` +vllm bench serve \ + --backend vllm \ + --model "deepseek-ai/DeepSeek-R1" \ + --metric-percentiles "10,25,50,75,90" \ + --percentile-metrics "itl,tps,ttft,e2el" \ + --host "mif-istio.cluster.svc.cluster.local" \ + --port 80 \ + --num-prompts 10800 \ + --max-concurrency 3600 \ + --ignore-eos \ + --ready-check-timeout-sec 0 \ + --dataset-name sharegpt \ + --dataset-path /app/dataset/ShareGPT_V3_unfiltered_cleaned_split.json\ + --sharegpt-input-len 1000 \ + --sharegpt-output-len 1000 \ + --warmup-time 150 \ # 150s + --cooldown-time 120 \ # 120s + --request-rate 50 +``` + +### Test Result +``` +============ Serving Benchmark Result ============ +Successful requests: 10800 +Maximum request concurrency: 3600 +Request rate configured (RPS): 50.00 +Benchmark duration (s): 591.09 +Total input tokens: 10800000 +Total generated tokens: 10800000 +Request throughput (req/s): 18.27 +Output token throughput (tok/s): 18271.28 +Peak output token throughput (tok/s): 23163.00 +Peak concurrent requests: 3864.00 +Total Token throughput (tok/s): 36542.55 +---------------Time to First Token---------------- +Mean TTFT (ms): 5575.28 +Median TTFT (ms): 5753.45 +P10 TTFT (ms): 1436.68 +P25 TTFT (ms): 2461.82 +P50 TTFT (ms): 5753.45 +P75 TTFT (ms): 6857.50 +P90 TTFT (ms): 10445.36 +---------------Inter-token Latency---------------- +Mean ITL (ms): 167.37 +Median ITL (ms): 168.92 +P10 ITL (ms): 128.88 +P25 ITL (ms): 158.62 +P50 ITL (ms): 168.92 +P75 ITL (ms): 177.89 +P90 ITL (ms): 196.06 +----------------End-to-end Latency---------------- +Mean E2EL (ms): 172774.31 +Median E2EL (ms): 175687.41 +P10 E2EL (ms): 156867.27 +P25 E2EL (ms): 167408.38 +P50 E2EL (ms): 175687.41 +P75 E2EL (ms): 180192.98 +P90 E2EL (ms): 181481.53 +================================================== +tip: install termplotlib and gnuplot to plot the metrics +Serving Benchmark Result after warmup before cooldown +Warm-up Time: 150.0 +Cool-down Time: 120.0 +Total counted tokens at filtering: 6437105 +Benchmark duration (s): 319.96 +Total generated tokens: 6437105 +Output token throughput (tok/s): 20118.76 +---------------Inter-token Latency---------------- +Mean ITL (ms): 174.31 +Median ITL (ms): 171.26 +P10 ITL (ms): 154.19 +P25 ITL (ms): 164.95 +P50 ITL (ms): 171.26 +P75 ITL (ms): 182.13 +P90 ITL (ms): 200.83 +================================================== +``` +### Graph +TBD From 1c487c84133ab1b4968167a74457e1898a9eab77 Mon Sep 17 00:00:00 2001 From: Jimin Park Date: Mon, 8 Dec 2025 14:05:59 +0900 Subject: [PATCH 4/4] Update README.md Signed-off-by: Jimin Park --- vllm/benchmarks/README.md | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/vllm/benchmarks/README.md b/vllm/benchmarks/README.md index f4eb9e4075e3..a3a337d42efe 100644 --- a/vllm/benchmarks/README.md +++ b/vllm/benchmarks/README.md @@ -2,13 +2,15 @@ In this directory, We introduce our works and the test result to support high-throughput benchmark. -## Benchmark multiprocessing -- Background: TBD +## Multiprocessing benchmark +- Background: +The benchmark throughput measurements provided by vllm by default operate on a single process and single thread. The problem with single-threaded operation is that, at high throughputs, the thread's processing becomes a bottleneck. 
-## Benchmark trimming
+## Result Trimming
 - Background:
 To measure the peak throughput of the benchmark, the beginning of the run (where decoding batches gradually grow as they fill) and the end (where decoding batches gradually drain as requests complete) must be excluded from the overall benchmark duration.

@@ -16,18 +18,18 @@ To measure the peak throughput of the benchmark, the beginning (where decoding
 Trimming is implemented on top of the response metadata collected for each request during the benchmark. The metadata includes the request send time, the time to first token, and the inter-token latencies. From these, the generation timestamp of every token is reconstructed, and only the tokens that fall within the user-specified time window are counted.

-## Test result
+# Test result
 The benchmark execution script sets the warmup time to 150 seconds and the cooldown time to 120 seconds. The trimmed results are listed at the bottom of the benchmark output, allowing us to confirm the following metrics: 1) the benchmark duration after trimming, 2) the number of tokens generated within the defined time window, 3) the output token throughput, and 4) the inter-token latency (ITL).
 Furthermore, the graph below, which visually compares the results with and without trimming, shows the specific time window from which token information was aggregated. Since each bar in the graph spans 1 second, the bar height can be read as tokens per second, which directly represents the output token throughput.
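A quick consistency check on the trimmed numbers in the PATCH 3 result block above (illustrative arithmetic, not benchmark code): the raw duration of 591.09 s minus the 150 s warmup and the 120 s cooldown gives 321.09 s, close to the reported 319.96 s (the window is presumably anchored to the reconstructed token timestamps rather than raw wall-clock time), and the counted tokens divided by the trimmed duration reproduce the reported throughput:

```python
tokens, duration = 6_437_105, 319.96     # from the trimmed result block above
print(f"{tokens / duration:.2f} tok/s")  # 20118.47, vs. 20118.76 reported
```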
-### Test env
-- Heimdall (heimdall & gateway: mi300-6)
-- 1P1D (prefill: mi300-4, decode: mi300-5)
-  - External LB enabled (8 AsyncLLM)
-  - 8 proxy servers: mi300-5 (matched 1:1 to the decode ranks)
+## Test env
+- Heimdall (heimdall & gateway)
+- PD disaggregation (1P1D)
+- External LB enabled on the Decode server (8 AsyncLLM processes)

-### Test script
+## Test script
 ```
+# Warmup time is set to 150.0s and cooldown time to 120.0s
 vllm bench serve \
   --backend vllm \
   --model "deepseek-ai/DeepSeek-R1" \
   --metric-percentiles "10,25,50,75,90" \
   --percentile-metrics "itl,tps,ttft,e2el" \
   --host "mif-istio.cluster.svc.cluster.local" \
   --port 80 \
   --num-prompts 10800 \
   --max-concurrency 3600 \
+  --request-rate 50 \
   --ignore-eos \
   --ready-check-timeout-sec 0 \
+  --max-connections-per-worker 432 \
   --dataset-name sharegpt \
   --dataset-path /app/dataset/ShareGPT_V3_unfiltered_cleaned_split.json \
   --sharegpt-input-len 1000 \
   --sharegpt-output-len 1000 \
-  --warmup-time 150 \ # 150s
-  --cooldown-time 120 \ # 120s
-  --request-rate 50
+  --warmup-time 150.0 \
+  --cooldown-time 120.0
 ```

-### Test Result
+## Result
 ```
 ============ Serving Benchmark Result ============
+Number of worker processes:              25
 Successful requests:                     10800
 Maximum request concurrency:             3600
 Request rate configured (RPS):           50.00