
MI300X - Not seeing any difference in Round Robin vs. PrefixHash-Aware Load Balancing #507


Open
virobles opened this issue Apr 22, 2025 · 1 comment

@virobles

Just ran a few benchmarking tests similar to the repo example on an MI300X using Round Robin, LeastLoad, and PrefixCache strategies for the amd/Llama-3.1-8B-Instruct-FP8-KV model with:

Traffic request rate: 200.0
Maximum request concurrency: 800

Traffic request rate: 200.0
Maximum request concurrency: 3200

Traffic request rate: 800.0
Maximum request concurrency: 8000

Traffic request rate: 1200.0
Maximum request concurrency: 32000

I think I might be configuring Round Robin incorrectly, but I saw only a few milliseconds of difference among all three strategies for TTFT and TPOT.

#llama-3.1-8b-instruct-fp8-mi300x.yaml
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct-fp8-mi300x
spec:
  features: [TextGeneration]
  url: pvc://model-storage/amd/Llama-3.1-8B-Instruct-FP8-KV
  engine: VLLM
  image: docker.io/rocm/vllm-dev:nightly
  args:
  - --enable-prefix-caching
  - --max-model-len=16384
  - --max-num-batched-tokens=16384
  - --gpu-memory-utilization=0.95
  - --disable-log-requests
  - --kv-cache-dtype=fp8
  env:
    HIP_FORCE_DEV_KERNARG: "1"
    NCCL_MIN_NCHANNELS: "112"
    TORCH_BLAS_PREFER_HIPBLASLT: "1"
    VLLM_USE_TRITON_FLASH_ATTN: "0"
  resourceProfile: amd-gpu-mi300x:1
  minReplicas: 8
  maxReplicas: 8
#job-benchmark.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-serving
spec:
  template:
    spec:
      containers:
        - name: benchmark-serving
          image: substratusai/benchmark_serving:latest
          args:
            - --base-url=http://10.152.183.91
            - --dataset-name=sharegpt
            - --dataset-path=/app/sharegpt_16_messages_or_more.json
            - --model=llama-3.1-8b-instruct-fp8-mi300x
            - --seed=12345
            - --tokenizer=amd/Llama-3.1-8B-Instruct-FP8-KV
            - --request-rate=1200
            - --max-concurrency=32000
            - --num-prompts=32000
            - --max-conversations=800
      restartPolicy: Never
#job-benchmark-kubeai.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-serving-kubeai
spec:
  template:
    spec:
      containers:
        - name: benchmark-serving
          image: substratusai/benchmark_serving:latest
          args:
            - --base-url=http://kubeai/openai
            - --dataset-name=sharegpt
            - --dataset-path=/app/sharegpt_16_messages_or_more.json
            - --model=llama-3.1-8b-instruct-fp8-mi300x
            - --seed=12345
            - --tokenizer=amd/Llama-3.1-8B-Instruct-FP8-KV
            - --request-rate=1200
            - --max-concurrency=32000
            - --num-prompts=32000
            - --max-conversations=800
      restartPolicy: Never
#k8s-native-service.yaml
# for running vllm without KubeAI proxy/LB
apiVersion: v1
kind: Service
metadata:
  name: vllm-direct
  labels:
    app: vllm-direct
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8000
  type: ClusterIP
@samos123
Contributor

Did you patch the Model to enable PrefixHash-aware load balancing? One thing that helped us a lot was enabling Grafana and Prometheus metrics so we could see prefix cache hit rates and GPU KV cache utilization.

In general, the benefit of PrefixHash-aware load balancing is more apparent when you have bigger models and are under medium to high load.
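For background, the idea behind PrefixHash routing is to pin requests that share a prompt prefix to the same replica so its prefix cache can be reused, while a load bound keeps one hot prefix from overloading a single pod. Here is a minimal sketch of bounded-load consistent hashing with those three parameters (illustrative only, not KubeAI's actual implementation; endpoint names are made up):

```python
import hashlib
from bisect import bisect_right

# Parameter names mirror the KubeAI config fields; the algorithm below is
# an illustrative sketch, not KubeAI's code.
PREFIX_CHAR_LENGTH = 100   # only the first N chars of the prompt are hashed
REPLICATION = 256          # virtual nodes per endpoint on the hash ring
MEAN_LOAD_FACTOR = 125     # an endpoint may carry up to 125% of the mean load


def _hash(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")


class PrefixHashBalancer:
    def __init__(self, endpoints):
        self.load = {ep: 0 for ep in endpoints}
        # Place REPLICATION virtual nodes per endpoint on a sorted ring.
        ring = sorted((_hash(f"{ep}:{i}"), ep)
                      for ep in endpoints for i in range(REPLICATION))
        self.keys = [k for k, _ in ring]
        self.ring = [ep for _, ep in ring]

    def pick(self, prompt: str) -> str:
        h = _hash(prompt[:PREFIX_CHAR_LENGTH])
        i = bisect_right(self.keys, h) % len(self.ring)
        # Walk clockwise past overloaded endpoints so one hot prefix
        # cannot pile every request onto a single replica.
        mean = max(1.0, sum(self.load.values()) / len(self.load))
        for step in range(len(self.ring)):
            ep = self.ring[(i + step) % len(self.ring)]
            if self.load[ep] * 100 <= MEAN_LOAD_FACTOR * mean:
                self.load[ep] += 1
                return ep
        return self.ring[i]  # fallback if every endpoint is over the bound


balancer = PrefixHashBalancer([f"pod-{n}" for n in range(8)])
shared = "You are a helpful assistant. " * 5   # >100 chars of shared prefix
a = balancer.pick(shared + "first user turn")
b = balancer.pick(shared + "second user turn")
# a == b: same 100-char prefix, same replica (until the load bound kicks in)
```

This also suggests why Round Robin can look identical in a short benchmark: if conversations are short or replicas are lightly loaded, prefix cache hits are rare either way.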

This was a better, more realistic simulation of multi-turn chat conversations: https://github.com/substratusai/kubeai/blob/main/benchmarks/multi-turn-chat-go/runs/llama-3.1-8x-l4/run.ipynb

Images used for that benchmark: https://github.com/substratusai/kubeai/tree/main/benchmarks/multi-turn-chat-go

You need to make sure your model spec has this set to enable PrefixHash-aware load balancing:

        "loadBalancing": {
            "strategy": "PrefixHash",
            "prefixHash": {
                "meanLoadFactor": 125,
                "prefixCharLength": 100,
                "replication": 256,
            },
        },
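For example, if the Model is named as in the manifests above, the setting could be applied with a merge patch (a sketch; it assumes the KubeAI CRD exposes a `model` resource to kubectl in your cluster):

```shell
kubectl patch model llama-3.1-8b-instruct-fp8-mi300x --type merge -p '
{
  "spec": {
    "loadBalancing": {
      "strategy": "PrefixHash",
      "prefixHash": {
        "meanLoadFactor": 125,
        "prefixCharLength": 100,
        "replication": 256
      }
    }
  }
}'
```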
