Did you patch the model to enable PrefixAware load balancing? One thing that helped us a lot was to enable Grafana and Prometheus metrics so we can see prefix cache hit rates and GPU KV cache utilization.
In general, the benefit of PrefixHash-aware load balancing is more apparent when you have bigger models and are under medium to high load.
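To make the hit-rate observation concrete, here is a minimal sketch of turning two scraped counter deltas into a prefix cache hit rate for a dashboard panel. The metric names in the comments are placeholders, not the exporter's actual names — check what your serving stack exposes before wiring this up.

```python
# Sketch: derive a prefix cache hit rate from two Prometheus-style counters.
# "prefix_cache_hits" / "prefix_cache_queries" are placeholder metric names;
# substitute whatever your server actually exports.

def prefix_cache_hit_rate(hits: float, queries: float) -> float:
    """Fraction of prefix-cache lookups that hit, guarding divide-by-zero."""
    if queries <= 0:
        return 0.0
    return hits / queries

# Example: counter increases between two scrape intervals.
hits_delta = 4200.0      # increase in prefix_cache_hits
queries_delta = 6000.0   # increase in prefix_cache_queries
print(f"hit rate: {prefix_cache_hit_rate(hits_delta, queries_delta):.0%}")
# -> hit rate: 70%
```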
Just ran a few benchmarking tests, similar to the repo example, on an MI300X using the Round Robin, LeastLoad, and PrefixCache strategies with the amd/Llama-3.1-8B-Instruct-FP8-KV model:
- Traffic request rate: 200.0, maximum request concurrency: 800
- Traffic request rate: 200.0, maximum request concurrency: 3200
- Traffic request rate: 800.0, maximum request concurrency: 8000
- Traffic request rate: 1200.0, maximum request concurrency: 32000
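For reference, a sweep over those four configurations across all three strategies is 12 runs; a small generator like the one below keeps the matrix consistent. The `bench` command name and its flags are illustrative placeholders, not the repo's actual benchmark script.

```python
# Sketch: enumerate the benchmark matrix above for each routing strategy.
# The CLI name and flags are hypothetical stand-ins for the real script.
from itertools import product

configs = [(200.0, 800), (200.0, 3200), (800.0, 8000), (1200.0, 32000)]
strategies = ["roundrobin", "leastload", "prefixcache"]

commands = [
    f"bench --strategy {s} --request-rate {rate} --max-concurrency {conc}"
    for s, (rate, conc) in product(strategies, configs)
]
for cmd in commands:
    print(cmd)  # 12 runs total: 3 strategies x 4 load points
```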
I think I might be running the Round Robin setup incorrectly, but I saw only a few milliseconds of difference between all three strategies for TTFT and TPOT.
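One sanity check for why the strategies can look identical: prefix-aware routing only helps when requests share long prompt prefixes, so a benchmark with unique prompts will show near-identical TTFT across routers. Below is a minimal sketch (hypothetical, not the project's actual scheduler) of the core idea — hash the first N characters of the prompt so requests with a shared prefix land on the same replica and can reuse its KV cache.

```python
# Minimal sketch of prefix-hash-aware routing vs. round robin.
# Hypothetical stand-in for the real scheduler: the routing key is a hash
# of the prompt's leading characters, so shared prefixes stick to one pod.
import hashlib
from itertools import count

WORKERS = ["pod-0", "pod-1", "pod-2"]
PREFIX_LEN = 256  # characters of prompt used as the routing key

_rr = count()

def route_round_robin(_prompt: str) -> str:
    # Ignores content entirely; shared prefixes scatter across pods.
    return WORKERS[next(_rr) % len(WORKERS)]

def route_prefix_hash(prompt: str) -> str:
    key = hashlib.sha256(prompt[:PREFIX_LEN].encode()).digest()
    return WORKERS[int.from_bytes(key[:8], "big") % len(WORKERS)]

shared = "System: you are a helpful assistant. " * 10  # long common prefix
a = route_prefix_hash(shared + "Question A")
b = route_prefix_hash(shared + "Question B")
print(a == b)  # same prefix -> same replica, so its KV cache is reusable
```

With unique prompts (no shared prefix), both routers behave like load spreaders and the TTFT gap collapses to noise, which may be exactly what the benchmark above is measuring.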