Hello,
I haven't noticed any speed improvement when testing Qwen3-Coder-30B-A3B-DFlash on the Nvidia DGX Spark and Blackwell 6000 Pro. Launch command:
python -m sglang.launch_server \
--model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
--tp-size 1 \
--dtype bfloat16 \
--attention-backend flashinfer \
--mem-fraction-static 0.75 \
--trust-remote-code
With DFlash:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 10
Successful requests: 100
Benchmark duration (s): 88.94
Total input tokens: 6145
Total input text tokens: 6145
Total generated tokens: 6108
Total generated tokens (retokenized): 6106
Request throughput (req/s): 1.12
Input token throughput (tok/s): 69.09
Output token throughput (tok/s): 68.67
Peak output token throughput (tok/s): 101.00
Peak concurrent requests: 14
Total token throughput (tok/s): 137.76
Concurrency: 9.72
Accept length: 2.20
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8646.72
Median E2E Latency (ms): 6690.29
P90 E2E Latency (ms): 17074.77
P99 E2E Latency (ms): 25213.58
---------------Time to First Token----------------
Mean TTFT (ms): 461.49
Median TTFT (ms): 386.96
P99 TTFT (ms): 1246.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 138.02
Median TPOT (ms): 142.49
P99 TPOT (ms): 261.11
---------------Inter-Token Latency----------------
Mean ITL (ms): 136.24
Median ITL (ms): 103.31
P95 ITL (ms): 326.95
P99 ITL (ms): 466.41
Max ITL (ms): 810.98
==================================================
Without DFlash:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 10
Successful requests: 100
Benchmark duration (s): 87.88
Total input tokens: 6145
Total input text tokens: 6145
Total generated tokens: 6108
Total generated tokens (retokenized): 6107
Request throughput (req/s): 1.14
Input token throughput (tok/s): 69.93
Output token throughput (tok/s): 69.51
Peak output token throughput (tok/s): 90.00
Peak concurrent requests: 13
Total token throughput (tok/s): 139.43
Concurrency: 9.78
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8590.60
Median E2E Latency (ms): 9589.81
P90 E2E Latency (ms): 14944.44
P99 E2E Latency (ms): 17193.38
---------------Time to First Token----------------
Mean TTFT (ms): 387.11
Median TTFT (ms): 379.88
P99 TTFT (ms): 786.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 135.14
Median TPOT (ms): 135.77
P99 TPOT (ms): 161.21
---------------Inter-Token Latency----------------
Mean ITL (ms): 136.54
Median ITL (ms): 122.70
P95 ITL (ms): 246.41
P99 ITL (ms): 322.74
Max ITL (ms): 514.76
==================================================
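To make the comparison easier to read, here is a small sketch that computes the relative change of a few headline metrics between the two runs above. All numbers are copied from the benchmark output; the dictionary keys and the `relative_change` helper are names made up for this illustration. The last part is a rough back-of-envelope estimate, not a measurement: it assumes each speculative step emits about "Accept length" tokens, so the time per draft-plus-verify step is roughly mean TPOT times accept length.

```python
# Figures copied from the two benchmark runs above.
with_dflash = {
    "output_tok_per_s": 68.67,
    "mean_e2e_ms": 8646.72,
    "median_e2e_ms": 6690.29,
    "p99_e2e_ms": 25213.58,
    "mean_ttft_ms": 461.49,
    "mean_tpot_ms": 138.02,
    "p99_itl_ms": 466.41,
}
baseline = {
    "output_tok_per_s": 69.51,
    "mean_e2e_ms": 8590.60,
    "median_e2e_ms": 9589.81,
    "p99_e2e_ms": 17193.38,
    "mean_ttft_ms": 387.11,
    "mean_tpot_ms": 135.14,
    "p99_itl_ms": 322.74,
}
accept_length = 2.20  # "Accept length" reported in the DFlash run


def relative_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# Per-metric change of the DFlash run relative to the baseline run.
changes = {key: relative_change(with_dflash[key], baseline[key]) for key in baseline}
for key, pct in changes.items():
    print(f"{key:>18}: {pct:+6.1f}%")

# Assumption: one speculative step yields ~accept_length tokens, so the
# implied time per step is mean TPOT * accept_length.
implied_step_ms = with_dflash["mean_tpot_ms"] * accept_length
step_cost_ratio = implied_step_ms / baseline["mean_tpot_ms"]
print(f"implied speculative step time: {implied_step_ms:.1f} ms")
print(f"step cost vs. baseline decode step: {step_cost_ratio:.2f}x")
```

Under that assumption, output throughput and mean TPOT differ by only a few percent between the runs, while each speculative step appears to cost roughly as many baseline decode steps as it yields tokens, which would explain the lack of a net speedup at this concurrency.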