
MI300X - Not seeing any difference in Round Robin vs. PrefixHash-Aware Load Balancing #507


Open
virobles opened this issue Apr 22, 2025 · 1 comment

@virobles

Just ran a few benchmarking tests similar to the repo example on an MI300X using Round Robin, LeastLoad, and PrefixCache strategies for the amd/Llama-3.1-8B-Instruct-FP8-KV model with:

Traffic request rate: 200.0
Maximum request concurrency: 800

Traffic request rate: 200.0
Maximum request concurrency: 3200

Traffic request rate: 800.0
Maximum request concurrency: 8000

Traffic request rate: 1200.0
Maximum request concurrency: 32000

I think I might be configuring Round Robin incorrectly, but I saw only a few milliseconds of difference among all three strategies for TTFT and TPOT.

#llama-3.1-8b-instruct-fp8-mi300x.yaml
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct-fp8-mi300x
spec:
  features: [TextGeneration]
  url: pvc://model-storage/amd/Llama-3.1-8B-Instruct-FP8-KV
  engine: VLLM
  image: docker.io/rocm/vllm-dev:nightly
  args:
  - --enable-prefix-caching
  - --max-model-len=16384
  - --max-num-batched-tokens=16384
  - --gpu-memory-utilization=0.95
  - --disable-log-requests
  - --kv-cache-dtype=fp8
  env:
    HIP_FORCE_DEV_KERNARG: "1"
    NCCL_MIN_NCHANNELS: "112"
    TORCH_BLAS_PREFER_HIPBLASLT: "1"
    VLLM_USE_TRITON_FLASH_ATTN: "0"
  resourceProfile: amd-gpu-mi300x:1
  minReplicas: 8
  maxReplicas: 8
#job-benchmark.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-serving
spec:
  template:
    spec:
      containers:
        - name: benchmark-serving
          image: substratusai/benchmark_serving:latest
          args:
            - --base-url=http://10.152.183.91
            - --dataset-name=sharegpt
            - --dataset-path=/app/sharegpt_16_messages_or_more.json
            - --model=llama-3.1-8b-instruct-fp8-mi300x
            - --seed=12345
            - --tokenizer=amd/Llama-3.1-8B-Instruct-FP8-KV
            - --request-rate=1200
            - --max-concurrency=32000
            - --num-prompts=32000
            - --max-conversations=800
      restartPolicy: Never
#job-benchmark-kubeai.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-serving-kubeai
spec:
  template:
    spec:
      containers:
        - name: benchmark-serving
          image: substratusai/benchmark_serving:latest
          args:
            - --base-url=http://kubeai/openai
            - --dataset-name=sharegpt
            - --dataset-path=/app/sharegpt_16_messages_or_more.json
            - --model=llama-3.1-8b-instruct-fp8-mi300x
            - --seed=12345
            - --tokenizer=amd/Llama-3.1-8B-Instruct-FP8-KV
            - --request-rate=1200
            - --max-concurrency=32000
            - --num-prompts=32000
            - --max-conversations=800
      restartPolicy: Never
#k8s-native-service.yaml
# for running vllm without KubeAI proxy/LB
apiVersion: v1
kind: Service
metadata:
  name: vllm-direct
  labels:
    app: vllm-direct
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8000
  type: ClusterIP
@samos123
Contributor

Did you patch the Model to enable PrefixHash-aware load balancing? One thing that helped us a lot was enabling Grafana and Prometheus metrics so we could see prefix cache hit rates and GPU KV cache utilization.

In general, the benefit of PrefixHash-aware load balancing is more apparent when you have bigger models and are under medium to high load.
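For background, the idea behind PrefixHash routing is to pin requests that share a prompt prefix to the same replica so its prefix cache can be reused, while a load bound keeps one hot prefix from overloading a single pod. Here is a minimal sketch of bounded-load consistent hashing with those three parameters (illustrative only, not KubeAI's actual implementation; endpoint names are made up):

```python
import hashlib
from bisect import bisect_right

# Parameter names mirror the KubeAI config fields; the algorithm below is
# an illustrative sketch, not KubeAI's code.
PREFIX_CHAR_LENGTH = 100   # only the first N chars of the prompt are hashed
REPLICATION = 256          # virtual nodes per endpoint on the hash ring
MEAN_LOAD_FACTOR = 125     # an endpoint may carry up to 125% of the mean load


def _hash(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")


class PrefixHashBalancer:
    def __init__(self, endpoints):
        self.load = {ep: 0 for ep in endpoints}
        # Place REPLICATION virtual nodes per endpoint on a sorted ring.
        ring = sorted((_hash(f"{ep}:{i}"), ep)
                      for ep in endpoints for i in range(REPLICATION))
        self.keys = [k for k, _ in ring]
        self.ring = [ep for _, ep in ring]

    def pick(self, prompt: str) -> str:
        h = _hash(prompt[:PREFIX_CHAR_LENGTH])
        i = bisect_right(self.keys, h) % len(self.ring)
        # Walk clockwise past overloaded endpoints so one hot prefix
        # cannot pile every request onto a single replica.
        mean = max(1.0, sum(self.load.values()) / len(self.load))
        for step in range(len(self.ring)):
            ep = self.ring[(i + step) % len(self.ring)]
            if self.load[ep] * 100 <= MEAN_LOAD_FACTOR * mean:
                self.load[ep] += 1
                return ep
        return self.ring[i]  # fallback if every endpoint is over the bound


balancer = PrefixHashBalancer([f"pod-{n}" for n in range(8)])
shared = "You are a helpful assistant. " * 5   # >100 chars of shared prefix
a = balancer.pick(shared + "first user turn")
b = balancer.pick(shared + "second user turn")
# a == b: same 100-char prefix, same replica (until the load bound kicks in)
```

This also suggests why Round Robin can look identical in a short benchmark: if conversations are short or replicas are lightly loaded, prefix cache hits are rare either way.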

This was a better, more realistic simulation of multi-turn chat conversations: https://github.com/substratusai/kubeai/blob/main/benchmarks/multi-turn-chat-go/runs/llama-3.1-8x-l4/run.ipynb

Images used for that benchmark: https://github.com/substratusai/kubeai/tree/main/benchmarks/multi-turn-chat-go

You need to make sure your model spec has this set to enable PrefixHash-aware load balancing:

        "loadBalancing": {
            "strategy": "PrefixHash",
            "prefixHash": {
                "meanLoadFactor": 125,
                "prefixCharLength": 100,
                "replication": 256,
            },
        },
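For example, if the Model is named as in the manifests above, the setting could be applied with a merge patch (a sketch; it assumes the KubeAI CRD exposes a `model` resource to kubectl in your cluster):

```shell
kubectl patch model llama-3.1-8b-instruct-fp8-mi300x --type merge -p '
{
  "spec": {
    "loadBalancing": {
      "strategy": "PrefixHash",
      "prefixHash": {
        "meanLoadFactor": 125,
        "prefixCharLength": 100,
        "replication": 256
      }
    }
  }
}'
```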
