Performance degradation - Qwen3-Coder-30B-A3B-DFlash #30

Hello,

I don't see any speedup when testing Qwen3-Coder-30B-A3B-DFlash on the NVIDIA DGX Spark and the Blackwell 6000 Pro: throughput and latency are essentially the same with and without the draft model.

Launch command (DFlash enabled):

```
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.75 \
    --trust-remote-code
```

With DFlash:

```
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 10        
Successful requests:                     100       
Benchmark duration (s):                  88.94     
Total input tokens:                      6145      
Total input text tokens:                 6145      
Total generated tokens:                  6108      
Total generated tokens (retokenized):    6106      
Request throughput (req/s):              1.12      
Input token throughput (tok/s):          69.09     
Output token throughput (tok/s):         68.67     
Peak output token throughput (tok/s):    101.00    
Peak concurrent requests:                14        
Total token throughput (tok/s):          137.76    
Concurrency:                             9.72      
Accept length:                           2.20      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8646.72   
Median E2E Latency (ms):                 6690.29   
P90 E2E Latency (ms):                    17074.77  
P99 E2E Latency (ms):                    25213.58  
---------------Time to First Token----------------
Mean TTFT (ms):                          461.49    
Median TTFT (ms):                        386.96    
P99 TTFT (ms):                           1246.22   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          138.02    
Median TPOT (ms):                        142.49    
P99 TPOT (ms):                           261.11    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           136.24    
Median ITL (ms):                         103.31    
P95 ITL (ms):                            326.95    
P99 ITL (ms):                            466.41    
Max ITL (ms):                            810.98    
==================================================
```

Without DFlash:

```
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 10        
Successful requests:                     100       
Benchmark duration (s):                  87.88     
Total input tokens:                      6145      
Total input text tokens:                 6145      
Total generated tokens:                  6108      
Total generated tokens (retokenized):    6107      
Request throughput (req/s):              1.14      
Input token throughput (tok/s):          69.93     
Output token throughput (tok/s):         69.51     
Peak output token throughput (tok/s):    90.00     
Peak concurrent requests:                13        
Total token throughput (tok/s):          139.43    
Concurrency:                             9.78      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8590.60   
Median E2E Latency (ms):                 9589.81   
P90 E2E Latency (ms):                    14944.44  
P99 E2E Latency (ms):                    17193.38  
---------------Time to First Token----------------
Mean TTFT (ms):                          387.11    
Median TTFT (ms):                        379.88    
P99 TTFT (ms):                           786.22    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          135.14    
Median TPOT (ms):                        135.77    
P99 TPOT (ms):                           161.21    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           136.54    
Median ITL (ms):                         122.70    
P95 ITL (ms):                            246.41    
P99 ITL (ms):                            322.74    
Max ITL (ms):                            514.76    
==================================================
```
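To make the comparison easier to read, here is a small sketch that computes the relative change of a few metrics between the two runs. All numbers are copied from the reports above; the dictionary keys and the `percent_change` helper are just illustrative names. The last part is a rough estimate, not a measurement: it assumes that the time of one speculative step is about mean TPOT times the reported accept length.

```python
# Values copied from the two benchmark reports above.
with_dflash = {
    "output_tok_s": 68.67,
    "mean_e2e_ms": 8646.72,
    "mean_ttft_ms": 461.49,
    "mean_tpot_ms": 138.02,
    "p99_itl_ms": 466.41,
}
baseline = {
    "output_tok_s": 69.51,
    "mean_e2e_ms": 8590.60,
    "mean_ttft_ms": 387.11,
    "mean_tpot_ms": 135.14,
    "p99_itl_ms": 322.74,
}
accept_length = 2.20  # "Accept length" from the DFlash run


def percent_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


changes = {name: percent_change(with_dflash[name], baseline[name]) for name in baseline}
for name, pct in changes.items():
    print(f"{name:>14}: {pct:+.1f}%")

# Rough estimate (assumption): one speculative step emits about
# `accept_length` tokens, so its duration is about TPOT * accept_length.
est_step_ms = with_dflash["mean_tpot_ms"] * accept_length
step_ratio = est_step_ms / baseline["mean_tpot_ms"]
print(f"estimated speculative step: {est_step_ms:.1f} ms "
      f"({step_ratio:.2f}x a baseline decode step)")
```

If that estimate is roughly right, the tokens gained per step are cancelled out by a proportionally slower step, which would explain why the output throughput is flat while the tail latencies (P99 TTFT, P99 ITL) get worse.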
