Performance degradation - Qwen3-Coder-30B-A3B-DFlash #30

@shahizat

Hello,

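I see no speedup when testing Qwen3-Coder-30B-A3B-DFlash on the NVIDIA DGX Spark and the Blackwell 6000 Pro. In the runs below, output throughput is 68.67 tok/s with DFlash and 69.51 tok/s without it, and tail latency is worse with DFlash enabled. The server was launched with: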

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.75 \
    --trust-remote-code

With DFlash:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 10        
Successful requests:                     100       
Benchmark duration (s):                  88.94     
Total input tokens:                      6145      
Total input text tokens:                 6145      
Total generated tokens:                  6108      
Total generated tokens (retokenized):    6106      
Request throughput (req/s):              1.12      
Input token throughput (tok/s):          69.09     
Output token throughput (tok/s):         68.67     
Peak output token throughput (tok/s):    101.00    
Peak concurrent requests:                14        
Total token throughput (tok/s):          137.76    
Concurrency:                             9.72      
Accept length:                           2.20      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8646.72   
Median E2E Latency (ms):                 6690.29   
P90 E2E Latency (ms):                    17074.77  
P99 E2E Latency (ms):                    25213.58  
---------------Time to First Token----------------
Mean TTFT (ms):                          461.49    
Median TTFT (ms):                        386.96    
P99 TTFT (ms):                           1246.22   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          138.02    
Median TPOT (ms):                        142.49    
P99 TPOT (ms):                           261.11    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           136.24    
Median ITL (ms):                         103.31    
P95 ITL (ms):                            326.95    
P99 ITL (ms):                            466.41    
Max ITL (ms):                            810.98    
==================================================

Without DFlash (baseline):

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 10        
Successful requests:                     100       
Benchmark duration (s):                  87.88     
Total input tokens:                      6145      
Total input text tokens:                 6145      
Total generated tokens:                  6108      
Total generated tokens (retokenized):    6107      
Request throughput (req/s):              1.14      
Input token throughput (tok/s):          69.93     
Output token throughput (tok/s):         69.51     
Peak output token throughput (tok/s):    90.00     
Peak concurrent requests:                13        
Total token throughput (tok/s):          139.43    
Concurrency:                             9.78      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8590.60   
Median E2E Latency (ms):                 9589.81   
P90 E2E Latency (ms):                    14944.44  
P99 E2E Latency (ms):                    17193.38  
---------------Time to First Token----------------
Mean TTFT (ms):                          387.11    
Median TTFT (ms):                        379.88    
P99 TTFT (ms):                           786.22    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          135.14    
Median TPOT (ms):                        135.77    
P99 TPOT (ms):                           161.21    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           136.54    
Median ITL (ms):                         122.70    
P95 ITL (ms):                            246.41    
P99 ITL (ms):                            322.74    
Max ITL (ms):                            514.76    
==================================================
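
To make the comparison easier to read, here is a small sketch that computes the relative change for a few of the metrics reported above. The numbers are copied from the two reports; the dictionary keys and the `relative_change` helper are my own names, not anything from sglang. The last part is only a rough estimate: it assumes each speculative step emits "Accept length" tokens on average, so the time per step is approximately mean TPOT times accept length.

```python
# Figures copied from the two benchmark reports above.
with_dflash = {
    "output_tok_s": 68.67,
    "mean_ttft_ms": 461.49,
    "mean_tpot_ms": 138.02,
    "p99_tpot_ms": 261.11,
    "p99_e2e_ms": 25213.58,
}
baseline = {
    "output_tok_s": 69.51,
    "mean_ttft_ms": 387.11,
    "mean_tpot_ms": 135.14,
    "p99_tpot_ms": 161.21,
    "p99_e2e_ms": 17193.38,
}
ACCEPT_LENGTH = 2.20  # "Accept length" from the DFlash run


def relative_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


changes = {key: relative_change(with_dflash[key], baseline[key]) for key in baseline}

# Assumption (not measured): one draft-plus-verify step yields ACCEPT_LENGTH
# tokens on average, so its duration is roughly TPOT * ACCEPT_LENGTH.
dflash_step_ms = with_dflash["mean_tpot_ms"] * ACCEPT_LENGTH
step_cost_ratio = dflash_step_ms / baseline["mean_tpot_ms"]

for key, pct in changes.items():
    print(f"{key:>14}: {pct:+.1f}%")
print(f"estimated DFlash step time: {dflash_step_ms:.1f} ms")
print(f"step cost vs. baseline decode step: {step_cost_ratio:.2f}x")
```

If that estimate is roughly right, each DFlash step costs a bit more than twice a baseline decode step while accepting about 2.2 tokens, which would cancel out the expected gain and match the flat throughput.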
