We’re attempting to reproduce the simulation results, and when comparing against vLLM 0.9.1 benchmarks we observe that the P50 latency differs by roughly 700%. Could you clarify whether vLLM v1 is supported by Vidur? If not, which framework and version does Vidur use to reproduce the published results?
Specifically, when running the example command in the README (shown below), which LLM engine should we use to validate the simulation output? Is Vidur’s simulation based on vLLM or Sarathi-Serve?
When using vLLM 0.9.1, the mooncake_conversation_trace.csv trace fails because some requests’ total token length exceeds the max_model_len = 8192 limit for Meta-Llama-3-8B. Even after scaling the token lengths down to fit and rerunning (see the clipping sketch after the command below), the simulated latency still does not match vLLM’s measurements. Which framework does Vidur currently support, and what trace/configuration settings would you recommend for reproducing the results accurately?

python -m vidur.main \
--time_limit 10800 \
--replica_config_model_name meta-llama/Meta-Llama-3-8B \
--replica_config_device h100 \
--replica_config_network_device h100_dgx \
--cluster_config_num_replicas 8 \
--replica_config_tensor_parallel_size 1 \
--replica_config_num_pipeline_stages 1 \
--request_generator_config_type synthetic \
--synthetic_request_generator_config_num_requests 128 \
--length_generator_config_type trace \
--trace_request_length_generator_config_trace_file ./data/processed_traces/mooncake_conversation_trace.csv \
--interval_generator_config_type poisson \
--poisson_request_interval_generator_config_qps 8.0 \
--global_scheduler_config_type round_robin \
--replica_scheduler_config_type vllm_v1 \
--vllm_v1_scheduler_config_chunk_size 512 \
--vllm_v1_scheduler_config_batch_size_cap 512 \
--cache_config_enable_prefix_caching
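
On the token-length point above, here is a minimal sketch of the kind of trace clipping we mean before replaying the trace against vLLM. The column names num_prefill_tokens and num_decode_tokens, as well as the output path, are assumptions about the processed trace schema and may need to be adjusted to match the actual CSV:

# clip_trace.py -- sketch only; column names are assumed, adjust to the real schema.
import pandas as pd

MAX_MODEL_LEN = 8192  # Meta-Llama-3-8B context limit

df = pd.read_csv("./data/processed_traces/mooncake_conversation_trace.csv")

# Keep only requests whose total (prefill + decode) token count fits the context window.
total_tokens = df["num_prefill_tokens"] + df["num_decode_tokens"]
clipped = df[total_tokens <= MAX_MODEL_LEN].copy()

print(f"kept {len(clipped)}/{len(df)} requests under max_model_len={MAX_MODEL_LEN}")
clipped.to_csv("./data/processed_traces/mooncake_conversation_trace_clipped.csv", index=False)

Even with a clipped trace along these lines, the simulated and measured latencies still diverge, which is why we are asking which engine and configuration the published numbers were validated against.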