From 83a70724d20531af7fd0dae6ab4006dab9f30de8 Mon Sep 17 00:00:00 2001 From: Anna Tchernych Date: Fri, 29 Aug 2025 17:09:34 -0700 Subject: [PATCH 1/7] Initial Draft Signed-off-by: Anna Tchernych --- README.md | 4 +- .../NNNN-epp-integration.md | 232 ++++++++++++++++++ 2 files changed, 234 insertions(+), 2 deletions(-) create mode 100644 inference-gw/NNNN-epp-integration/NNNN-epp-integration.md diff --git a/README.md b/README.md index ff3e42fb..bb258e7d 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,7 @@ explanation and details. 2. Identify a `code-owner` or `maintainer` of the DEP repository to shepard the process. -3. Create a draft PR and iterate with co-authors, Sponser +3. Create a draft PR and iterate with co-authors, Sponsor -4. When ready for review, mark as ready and work with Sponser to set a +4. When ready for review, mark as ready and work with Sponsor to set a review date. diff --git a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md new file mode 100644 index 00000000..b7d31fd0 --- /dev/null +++ b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md @@ -0,0 +1,232 @@ +# Dynamo Integration with the Gateway API Inference Extension's End Point Picker + +**Status**: Draft + +**Authors**: [Anna Tchernych](https://github.com/atchernych) + +**Category**: Architecture + +**Replaces**: [Link of previous proposal if applicable] + +**Replaced By**: [Link of previous proposal if applicable] + +**Sponsor**: [Name of code owner or maintainer to shepard process] + +**Required Reviewers**: [Names of technical leads that are required for acceptance] + +**Review Date**: [Date for review] + +**Pull Request**: [Link to Pull Request of the Proposal itself] + +**Implementation PR / Tracking Issue**: [Link to Pull Request](https://github.com/ai-dynamo/dynamo/pull/2786) + +# Summary + +This proposal outlines the integration of Dynamo components with the [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/), in particular the [EndPointPicker](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/) + +# Motivation + + +The prior [version of Dynamo Inference Gateway Integration](https://github.com/ai-dynamo/enhancements/blob/bis/inference-gw/0001-TBD-inference-gw-integration/0001-TBD-inference-gw-integration.md) leaves room for 2 enhancements to how Dynamo integrates with the EPP. +First, the EPP sends and HTTP request to the Dynamo FrontEnd to obtain the target worker instance id and the tokens. Even though the FrontEnd is deployed as a sidecar, this approach introduces additional latency. Given how well optimized the Dynamo system is, this latency would offset the performance gains provided by the highly efficient Dynamo Router. +Second, even though the Dynamo Routing call is implemented as a plugin in accordance with the plugin interface EPP provides, Dynamo cannot support other EPP-provided routing mechanisms such as Routing Filters. EPP offers more flexibility to the end user on how to route and provides a nice declarative configuration yaml file. + +## Goals + +* Expose Dynamo Router as a a Library for Go through c-bindings or a Binary Library crate. + +* Provide support for EPP standard Routing filters, Pickers and Scorers so that an end user can mix an match Dynamo Routing approach with the EPP filters, pickers and scorers through a single yaml - based config file. 
+ +* Modularize Dynamo to enable Inference Gateway API usage with workers, without relying on the Dynamo Router. + +### Non Goals + +* Change existing Dynamo worker interfaces significantly +* Change existing Dynamo worker interfaces significantly + +## Requirements + +List out any additional requirements in numbered subheadings. + +--- + +### REQ-1: Router as a Library +The router **SHOULD** incur minimal latency overhead when used as a library. + +Use all-caps, bolded terms like **MUST** and **SHOULD** when describing each requirement. See [RFC-2119](https://datatracker.ietf.org/doc/html/rfc2119) for additional information. + +--- + +### REQ-2: Support for Standard EPP Filters +The user **SHOULD** be able to use standard EPP filter- and scorer-enabled routing **alongside** the Dynamo Router. + +--- + +### REQ-3: Inference Gateway API Without Dynamo Router +The user **SHOULD** be able to use Dynamo workers directly, without requiring the Dynamo Router. + + +# Proposal + +Below proposes the C-Bindings (FFI Layer) as a solution to the latency problem. +Massaging of Dynamo metrics is proposed as a solution to enabling EPP standard filters. + +# Implementation Details + +## Library Implementation + +Two approaches are possible: + +### 1. Rust Binary Crate +Package the Dynamo Router as a Rust binary crate. +- Invoke it from Go using `exec.Command`. +- Communicate over **stdin/stdout/IPC**. + +This approach is simpler but introduces higher latency since every call crosses a **process boundary**. + +--- + +### 2. C-Bindings (FFI Layer) +Add calls to existing C-bindings and invoke them from Go (`cgo → extern "C" Rust`). + +- Expose a C-compatible FFI layer using `#[no_mangle] extern "C" fn`. +- Build the crate into a `.so` / `.a` / `.dll` and call it from Go via `cgo`. +- See [Draft PR #2786](https://github.com/ai-dynamo/dynamo/pull/2786) for a reference implementation. + +This approach has **higher maintenance overhead** but offers **lower latency**: +- Avoids process boundary overhead (no syscalls, no kernel, no sockets). +- Zero/low copy: pass `*uint32 + length` for tokens, get back a pointer, and free with a matching `free`—cheap and predictable. +- Shared runtime: initialize the Rust tokenizer/router runtime once (bindings already use `OnceCell`) and reuse it across calls. + + +## Enabling EPP filters + +EPP filters rely on Prometheus metrics. We would have to make them available in the *InferencePool* CR. +We would have to rename the metrics of interest to EPP. For example, dynamo exposes the `num_requests_waiting` metric for the number of requests that have been routed to a specific worker but are waiting in that worker's internal queue to be processed. But EPP filters expect the `vllm:num_requests_waiting` and `nv_trt_llm_request_metrics{request_type=waiting}`. For KV cache utilization EPP expects `vllm:gpu_cache_usage_perc` and `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`. +It is also notable that the tokenization can be performed in a plugin to the Inference Gateway API BBR (Body Based Routing). At the time of the proposal tokenization is coupled with the routing and tokens are returned to the Gateway along with the worker instance id. + + +## Exposing Dynamo workers without the FrontEnd. + +Feasibility needs to be evaluated. + +## Deferred to Implementation + +**\[Optional \- if not applicable omit\]** + +List out items that are under discussion but that will be resolved only during implementation / code review. 
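One such item is the exact shape of the Go-side FFI surface for the router-as-a-library option described above. The sketch below is purely illustrative: the exported symbol names, signatures, library name, and error conventions are assumptions for discussion, not the actual bindings from the draft PR, and it only links once the shared library has been built.

```go
package dynamorouter

/*
#cgo LDFLAGS: -L${SRCDIR}/lib -ldynamo_llm
// Hypothetical header: the real exported symbols live in the C-bindings crate.
#include <stdint.h>
#include <stdlib.h>

extern int32_t dynamo_router_init(const char* namespace_, const char* component);
extern int64_t dynamo_router_pick_worker(const uint32_t* tokens, uintptr_t len);
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// Init creates the shared router runtime once; the Rust side is expected to
// keep it in a OnceCell so repeated calls reuse the same tokenizer/router.
func Init(namespace, component string) error {
	ns := C.CString(namespace)
	comp := C.CString(component)
	defer C.free(unsafe.Pointer(ns))
	defer C.free(unsafe.Pointer(comp))
	if rc := C.dynamo_router_init(ns, comp); rc != 0 {
		return fmt.Errorf("dynamo router init failed: %d", rc)
	}
	return nil
}

// PickWorker passes the token ids as pointer + length (no copy of the slice)
// and gets back the chosen worker instance id, negative on failure.
func PickWorker(tokens []uint32) (int64, error) {
	if len(tokens) == 0 {
		return 0, fmt.Errorf("empty token slice")
	}
	id := C.dynamo_router_pick_worker(
		(*C.uint32_t)(unsafe.Pointer(&tokens[0])),
		C.uintptr_t(len(tokens)),
	)
	if id < 0 {
		return 0, fmt.Errorf("routing failed: %d", id)
	}
	return int64(id), nil
}
```

The property that matters here is the one stated in REQ-1: a single shared runtime initialized once, with per-request calls crossing the FFI boundary only with a token pointer and a length.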
+ + +**Release Target**: Date + +**Effort Estimate**: \ + +**Work Item(s):** \ + +**Supported API / Behavior:** + +* \ + +**Not Supported:** + +* \ + +# Related Proposals + + +* [Biswa's Proposal](https://github.com/ai-dynamo/enhancements/blob/bis/inference-gw/0001-TBD-inference-gw-integration/0001-TBD-inference-gw-integration.md) + + + +# Alternate Solutions + +**\[Required, if not applicable write N/A\]** + +N/A + +## Alt \<\1\> \ + +**Pros:** + +TODO + +**Cons:** + +TODO + +**Reason Rejected:** + +TODO + + +# Background + +## 1. Main Routing Approaches: Queue Depth & KV-Cache Utilization +Both **Dynamo** and **EPP** can route requests using **queue depth** and **KV-cache pressure**. + +- **Token awareness** + - EPP’s default approach is token-aware only *by approximation* because it relies on the **non-tokenized text** in the prompt to keep things generic. + - Dynamo, by contrast, uses a **token-aware KV algorithm**. + - The Dynamo Router runs the model’s tokenizer on the prompt, ensuring: + - Consistent tokenization with the serving runtime + - Fewer mismatches between routing and inference behavior + +--- + +## 2. EPP Pipeline Shape: *Filter → Score → Pick* +EPP’s architecture stages routing decisions explicitly: + +- **Filters** enforce hard constraints and shrink the candidate set before ranking. + *Example: drop pods with queue depth above a threshold, or require LoRA affinity.* + +- **Scorers** express preferences across the remaining pods. + *Example: prefer lower queue depth or lower KV usage.* + +- **Pickers** select the final endpoint. + *Example: max-score, round-robin, prefix-aware, etc.* + +--- + +### Standard Filters Available in EPP +EPP provides a variety of filters that can be composed via YAML configuration: + +- **DecisionTreeFilter** — try one path; if it fails, fall back to another. +- **LeastKVCacheFilter** — keep only pods in the lowest KV-usage bucket. +- **LeastQueueFilter** — keep only pods in the lowest queue-depth bucket. +- **LoRAAffinityFilter** — prefer/require pods with the target LoRA already loaded. +- **LowQueueFilter** — enforce a hard ceiling on queue depth (drop overloaded pods). + +--- + +### Declarative Configuration +EPP policies are defined in **YAML**. +For example, a `DecisionTreeFilter` can encode logic like: + +> *“If LoRA is hot, continue; else, fall back to a low-queue path.”* + +This makes policies flexible, modular, and easy to express declaratively. 
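To make the *Filter → Score → Pick* shape concrete, here is a small, self-contained Go sketch of that staging. The `Pod` type, the queue threshold, and the score weights are illustrative only and do not reflect the actual EPP plugin interfaces.

```go
package main

import "fmt"

// Pod is a simplified stand-in for an endpoint candidate; real EPP plugins
// operate on richer per-pod metric snapshots scraped from /metrics.
type Pod struct {
	Name       string
	QueueDepth int     // waiting requests on this worker
	KVUsage    float64 // fraction of KV-cache blocks in use (0..1)
}

// filterLowQueue is the "filter" stage: a hard constraint that drops pods
// whose queue depth exceeds a ceiling before any ranking happens.
func filterLowQueue(pods []Pod, maxQueue int) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.QueueDepth <= maxQueue {
			out = append(out, p)
		}
	}
	return out
}

// score is the "scorer" stage: a preference, here lower queue depth and lower
// KV usage are better. The weights are arbitrary illustrative values.
func score(p Pod) float64 {
	return -(0.5*float64(p.QueueDepth) + 0.5*p.KVUsage*100)
}

// pickMaxScore is the "picker" stage: choose the highest-scoring survivor.
func pickMaxScore(pods []Pod) (Pod, bool) {
	if len(pods) == 0 {
		return Pod{}, false
	}
	best := pods[0]
	for _, p := range pods[1:] {
		if score(p) > score(best) {
			best = p
		}
	}
	return best, true
}

func main() {
	pods := []Pod{
		{Name: "worker-a", QueueDepth: 1000, KVUsage: 0.95}, // overloaded, hot cache
		{Name: "worker-b", QueueDepth: 10, KVUsage: 0.40},   // idle, near-miss cache
	}
	candidates := filterLowQueue(pods, 100) // worker-a is dropped by the hard constraint
	if best, ok := pickMaxScore(candidates); ok {
		fmt.Println("routing to", best.Name)
	}
}
```

In this picture, the Dynamo Router would slot into the score (or pick) stage, while the standard filters keep trimming the candidate set beforehand.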
+ + + + +## References + +- [Deep Dive into Inference Extensions](https://kgateway.dev/blog/deep-dive-inference-extensions/) — good introduction +- [Smarter AI with Kubernetes Gateway API](https://kgateway.dev/blog/smarter-ai-reference-kubernetes-gateway-api/) — good overview read +- [Gateway API Guides](https://gateway-api.sigs.k8s.io/guides/) — official guides +- [Inference Extension Overview](https://gateway-api-inference-extension.sigs.k8s.io/gieps/overview) — official docs +- [kGateway Documentation](https://kgateway.dev/docs/) — official docs +- [kGateway API (gateway_extensions_types.go)](https://github.com/kgateway-dev/kgateway/blob/36969220f2cf95b262b881e52f68ae882671825d/api/v1alpha1/gateway_extensions_types.go#L19) — source code reference +- [Inference Extension Plugins (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) — plugin implementations +- [EPP Protocol Proposal (Dynamo fork)](https://github.com/atchernych/gateway-api-inference-extension-dynamo/tree/main/docs/proposals/004-endpoint-picker-protocol) — proposal document +- [Gateway API Inference Extension (upstream source)](https://github.com/kubernetes-sigs/gateway-api-inference-extension) — main repository + + +## Terminology & Definitions + +| Term | Definition | +| :---- | :---- | +| **BBR — Body-Based Routing** | A [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/gieps/overview) plugin that extracts fields (e.g., `model`) from the request body and inserts them into headers for routing decisions. | +| **EPP — Endpoint Picker Plugin** | A [plugin mechanism](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) used by the Inference Gateway API to score and select the appropriate backend endpoint (e.g., `QueueScorer`, `MaxScorePicker`, `LoRAAffinityScorer`). | + + From feaca8c9e802584d566480473a9b13d2f9f03352 Mon Sep 17 00:00:00 2001 From: Anna Tchernych Date: Fri, 29 Aug 2025 18:33:55 -0700 Subject: [PATCH 2/7] Add reasoning for the Filters Signed-off-by: Anna Tchernych --- .../NNNN-epp-integration.md | 26 +++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md index b7d31fd0..e6ab2beb 100644 --- a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md +++ b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md @@ -28,9 +28,35 @@ This proposal outlines the integration of Dynamo components with the [Gateway AP The prior [version of Dynamo Inference Gateway Integration](https://github.com/ai-dynamo/enhancements/blob/bis/inference-gw/0001-TBD-inference-gw-integration/0001-TBD-inference-gw-integration.md) leaves room for 2 enhancements to how Dynamo integrates with the EPP. + +## HTTP call overhead + First, the EPP sends and HTTP request to the Dynamo FrontEnd to obtain the target worker instance id and the tokens. Even though the FrontEnd is deployed as a sidecar, this approach introduces additional latency. Given how well optimized the Dynamo system is, this latency would offset the performance gains provided by the highly efficient Dynamo Router. + +## EPP offers richer routing control. + Second, even though the Dynamo Routing call is implemented as a plugin in accordance with the plugin interface EPP provides, Dynamo cannot support other EPP-provided routing mechanisms such as Routing Filters. 
EPP offers more flexibility to the end user on how to route and provides a nice declarative configuration yaml file. +For example, in Dynamo router there is a small chance that not the optimal worker will be picked: + We can have A perfect KV match on an overloaded worker but the request on this worker may be slower than a near-miss on an idle one. + +### Worker A (Overloaded + Perfect Cache) +- **overlap** = 100% → `prefill_blocks = 0` +- **active_blocks** = 1000 (very busy) +- **Cost** = `1.0 × 0 + 1000 = 1000` + +### Worker B (Idle + Near Miss) +- **overlap** = 80% → `prefill_blocks = 20% of request` +- **active_blocks** = 10 (idle) +- **Cost** = `1.0 × (0.2 × request_size) + 10` + +If the request is small enough, Worker A could still win despite being overloaded. +The current model only uses active_blocks but ignores the queue depth (num_requests_waiting) + + +This situation can be mitigated by setting temperature but not eliminated. +We can add the Queue aware penalty to our router. Alternatively, we can use an EPP filter **LowQueueFilter**. It enforces a hard ceiling on queue depth and will drop overloaded pods from the router consideration. + ## Goals * Expose Dynamo Router as a a Library for Go through c-bindings or a Binary Library crate. From d0b1ffd90abbef9f242ed81506ac13b72ef607f5 Mon Sep 17 00:00:00 2001 From: Anna Tchernych Date: Wed, 3 Sep 2025 16:27:39 -0700 Subject: [PATCH 3/7] Add the section about worker-gateway integration Signed-off-by: Anna Tchernych --- .../NNNN-epp-integration.md | 67 ++++++++++++++++++- 1 file changed, 65 insertions(+), 2 deletions(-) diff --git a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md index e6ab2beb..268f0f58 100644 --- a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md +++ b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md @@ -113,6 +113,9 @@ This approach is simpler but introduces higher latency since every call crosses --- ### 2. C-Bindings (FFI Layer) + +Client Request → Gateway API → Endpoint Picker → C Bindings → KV Router → Best Worker + Add calls to existing C-bindings and invoke them from Go (`cgo → extern "C" Rust`). - Expose a C-compatible FFI layer using `#[no_mangle] extern "C" fn`. @@ -131,10 +134,70 @@ EPP filters rely on Prometheus metrics. We would have to make them available in We would have to rename the metrics of interest to EPP. For example, dynamo exposes the `num_requests_waiting` metric for the number of requests that have been routed to a specific worker but are waiting in that worker's internal queue to be processed. But EPP filters expect the `vllm:num_requests_waiting` and `nv_trt_llm_request_metrics{request_type=waiting}`. For KV cache utilization EPP expects `vllm:gpu_cache_usage_perc` and `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`. It is also notable that the tokenization can be performed in a plugin to the Inference Gateway API BBR (Body Based Routing). At the time of the proposal tokenization is coupled with the routing and tokens are returned to the Gateway along with the worker instance id. 
+EPP expects the following metrics: + +LeastKVCacheFilter and LowQueueFilter (kvCacheUsagePercentageMetric): +For vLLM: vllm:gpu_cache_usage_perc +For TRTLLM: nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction} +SGLang: sglang:token_usage + +LeastQueueFilter (totalQueuedRequestsMetric): +For vLLM: vllm:num_requests_waiting +For TRTLLM: nv_trt_llm_request_metrics{request_type=waiting} +SGLang: sglang:num_queue_reqs + + + +LoRA: (Dynamo does not yet support) +--loraInfoMetric="vllm:lora_requests_info" + + + +The names can be configurable via env vars: +TOTAL_QUEUED_REQUESTS_METRIC="your_queue_metric_name" +KV_CACHE_USAGE_PERCENTAGE_METRIC="your_kv_cache_metric_name" +LORA_INFO_METRIC="your_lora_metric_name" + + + +## Exposing Dynamo Workers Without the FrontEnd + +**Flow:** +`Client Request → Gateway API → Endpoint Picker → Best Worker` + +This option relies on the routing choices provided by the **Standard Endpoint Picker**. + +--- + +### Components Needed + +1. **Keep Dynamo Runtime** + - **etcd** + - **NATS** + +2. **Create an HTTP service in front of each worker** + - Purpose: translate incoming requests into the worker’s **Dynamo endpoint call** and stream the response back. + - The service should also subscribe to the same events via NATS. + + **Implementation options:** + - **SGLang:** extend + `components/backends/sglang/src/dynamo/sglang/main.py` + - **vLLM / TRT-LLM:** create a new service + `lib/llm/src/http/gateway_sidecar.rs` + or a lightweight Python equivalent similar to `main.py` + + **The HTTP service must expose standardized endpoints:** + - `/ready` + - `/health` + - `/metrics` + + > Ensure `/health`, `/ready`, and `/metrics` return expected schemas. + > Expose worker metrics such as **queue depth** and **resource usage**. -## Exposing Dynamo workers without the FrontEnd. +3. **Deployment** + - Deploy each worker with a **sidecar HTTP service**. + - Configure the **InferencePool** to select these services. -Feasibility needs to be evaluated. ## Deferred to Implementation From fc81d62ffc7e77faa01d1d12e220197441fc4565 Mon Sep 17 00:00:00 2001 From: Anna Tchernych Date: Thu, 4 Sep 2025 14:00:36 -0700 Subject: [PATCH 4/7] details on EPP archi Signed-off-by: Anna Tchernych --- .../NNNN-epp-integration.md | 24 +++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md index 268f0f58..92319593 100644 --- a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md +++ b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md @@ -33,9 +33,9 @@ The prior [version of Dynamo Inference Gateway Integration](https://github.com/a First, the EPP sends and HTTP request to the Dynamo FrontEnd to obtain the target worker instance id and the tokens. Even though the FrontEnd is deployed as a sidecar, this approach introduces additional latency. Given how well optimized the Dynamo system is, this latency would offset the performance gains provided by the highly efficient Dynamo Router. -## EPP offers richer routing control. +## EPP offers richer routing control -Second, even though the Dynamo Routing call is implemented as a plugin in accordance with the plugin interface EPP provides, Dynamo cannot support other EPP-provided routing mechanisms such as Routing Filters. EPP offers more flexibility to the end user on how to route and provides a nice declarative configuration yaml file. 
+Second, EPP-provides additional routing mechanisms such as Routing Filters which Dynamo does not offer. EPP also provides a nice declarative configuration yaml file on how to route. For example, in Dynamo router there is a small chance that not the optimal worker will be picked: We can have A perfect KV match on an overloaded worker but the request on this worker may be slower than a near-miss on an idle one. @@ -130,6 +130,26 @@ This approach has **higher maintenance overhead** but offers **lower latency**: ## Enabling EPP filters +EPP filtering approach has a different shape: filter -> score -> pick +EPP’s architecture explicitly stages decisions: +Filters enforce hard constraints and shrink the candidate set before ranking (i.e. drop pods with queue above a threshold, require LoRA affinity). +Scorers express preferences across the remaining pods (e.g., prefer lower queue, lower KV usage). +Pickers choose the final endpoint (max-score, round-robin, prefix-aware, etc.). + +EPP offers standard filters one can pick and choose though yaml config : +DecisionTreeFilter — try one path, if failed go for a fallback. +LeastKVCacheFilter — keep pods in the lowest KV-usage bucket. +LeastQueueFilter — keep pods in the lowest queue-depth bucket. +LoraAffinityFilter — prefer/require pods with the target LoRA already loaded. +LowQueueFilter — hard ceiling on queue depth (drop overloaded pods). +Declarative config: EPP policies live in YAML. For example, a DecisionTreeFilter can encode: “If LoRA is hot, continue; else, fall back to a low-queue path.” This is nice. + +Our approach (Dynamo) +We have EPP delegate the final routing decision to the Dynamo Frontend. EPP framework allows for custom plugins and this is the approach I have taken. The Dynamo Routing call is implemented as a scorer. + +Current gap: we do not support LoRA-affinity routing in our FE path (so we can’t enforce “LoRA must be hot” as a hard gate). Everything else above remains compatible. + + EPP filters rely on Prometheus metrics. We would have to make them available in the *InferencePool* CR. We would have to rename the metrics of interest to EPP. For example, dynamo exposes the `num_requests_waiting` metric for the number of requests that have been routed to a specific worker but are waiting in that worker's internal queue to be processed. But EPP filters expect the `vllm:num_requests_waiting` and `nv_trt_llm_request_metrics{request_type=waiting}`. For KV cache utilization EPP expects `vllm:gpu_cache_usage_perc` and `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`. It is also notable that the tokenization can be performed in a plugin to the Inference Gateway API BBR (Body Based Routing). At the time of the proposal tokenization is coupled with the routing and tokens are returned to the Gateway along with the worker instance id. 
From 577728d1cb68fca9d5ba4dba7f1e40756b7bf7b3 Mon Sep 17 00:00:00 2001 From: Anna Tchernych Date: Thu, 4 Sep 2025 14:12:37 -0700 Subject: [PATCH 5/7] cleanup Signed-off-by: Anna Tchernych --- .../NNNN-epp-integration.md | 54 ++++++++++--------- 1 file changed, 30 insertions(+), 24 deletions(-) diff --git a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md index 92319593..d5f87092 100644 --- a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md +++ b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md @@ -35,7 +35,7 @@ First, the EPP sends and HTTP request to the Dynamo FrontEnd to obtain the targe ## EPP offers richer routing control -Second, EPP-provides additional routing mechanisms such as Routing Filters which Dynamo does not offer. EPP also provides a nice declarative configuration yaml file on how to route. +Second, EPP-provides additional routing mechanisms such as Routing Filters which Dynamo does not offer. For example, in Dynamo router there is a small chance that not the optimal worker will be picked: We can have A perfect KV match on an overloaded worker but the request on this worker may be slower than a near-miss on an idle one. @@ -61,7 +61,7 @@ We can add the Queue aware penalty to our router. Alternatively, we can use an E * Expose Dynamo Router as a a Library for Go through c-bindings or a Binary Library crate. -* Provide support for EPP standard Routing filters, Pickers and Scorers so that an end user can mix an match Dynamo Routing approach with the EPP filters, pickers and scorers through a single yaml - based config file. +* Provide support for EPP standard Routing filters, Pickers and Scorers so that an end user can mix and match the Dynamo Routing approach with the EPP filters, pickers and scorers. We want to support the yaml - based config file. * Modularize Dynamo to enable Inference Gateway API usage with workers, without relying on the Dynamo Router. @@ -136,13 +136,16 @@ Filters enforce hard constraints and shrink the candidate set before ranking (i. Scorers express preferences across the remaining pods (e.g., prefer lower queue, lower KV usage). Pickers choose the final endpoint (max-score, round-robin, prefix-aware, etc.). -EPP offers standard filters one can pick and choose though yaml config : -DecisionTreeFilter — try one path, if failed go for a fallback. -LeastKVCacheFilter — keep pods in the lowest KV-usage bucket. -LeastQueueFilter — keep pods in the lowest queue-depth bucket. -LoraAffinityFilter — prefer/require pods with the target LoRA already loaded. -LowQueueFilter — hard ceiling on queue depth (drop overloaded pods). -Declarative config: EPP policies live in YAML. For example, a DecisionTreeFilter can encode: “If LoRA is hot, continue; else, fall back to a low-queue path.” This is nice. +EPP offers standard filters one can pick and choose through YAML config: + +- **DecisionTreeFilter** — try one path, if failed go for a fallback. +- **LeastKVCacheFilter** — keep pods in the lowest KV-usage bucket. +- **LeastQueueFilter** — keep pods in the lowest queue-depth bucket. +- **LoraAffinityFilter** — prefer/require pods with the target LoRA already loaded. +- **LowQueueFilter** — hard ceiling on queue depth (drop overloaded pods). + + +EPP enables a declarative config: EPP policies live in YAML. For example, a DecisionTreeFilter can encode: “If LoRA is hot, continue; else, fall back to a low-queue path.” This is nice. 
**Our approach (Dynamo):** We have EPP delegate the final routing decision to the Dynamo Frontend. The EPP framework allows custom plugins, and that is the approach taken here: the Dynamo routing call is implemented as a scorer.

Current gap: we do not support LoRA-affinity routing in our FE path (so we cannot enforce "LoRA must be hot" as a hard gate). Everything else above remains compatible.

EPP filters rely on Prometheus metrics. We would have to make them available in the *InferencePool* CR on the workers.
We would also have to rename the metrics of interest to EPP or expose their names through environment variables. For example, Dynamo exposes the `num_requests_waiting` metric for the number of requests that have been routed to a specific worker but are still waiting in that worker's internal queue to be processed. EPP filters instead expect `vllm:num_requests_waiting` or `nv_trt_llm_request_metrics{request_type=waiting}`, and for KV cache utilization they expect `vllm:gpu_cache_usage_perc` or `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`.

It is also worth noting that tokenization can be performed in a plugin to the Inference Gateway API's BBR (Body-Based Routing). At the time of this proposal, tokenization is coupled with routing, and the tokens are returned to the Gateway along with the worker instance id.
EPP expects the following metrics: LeastKVCacheFilter and LowQueueFilter (kvCacheUsagePercentageMetric): -For vLLM: vllm:gpu_cache_usage_perc -For TRTLLM: nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction} -SGLang: sglang:token_usage +**For vLLM: vllm:gpu_cache_usage_perc** +**For TRTLLM: nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}** +**SGLang: sglang:token_usage** LeastQueueFilter (totalQueuedRequestsMetric): -For vLLM: vllm:num_requests_waiting -For TRTLLM: nv_trt_llm_request_metrics{request_type=waiting} -SGLang: sglang:num_queue_reqs +**For vLLM: vllm:num_requests_waiting** +**For TRTLLM: nv_trt_llm_request_metrics{request_type=waiting}** +**SGLang: sglang:num_queue_reqs** @@ -172,27 +176,29 @@ LoRA: (Dynamo does not yet support) --loraInfoMetric="vllm:lora_requests_info" - The names can be configurable via env vars: -TOTAL_QUEUED_REQUESTS_METRIC="your_queue_metric_name" -KV_CACHE_USAGE_PERCENTAGE_METRIC="your_kv_cache_metric_name" -LORA_INFO_METRIC="your_lora_metric_name" +**TOTAL_QUEUED_REQUESTS_METRIC="your_queue_metric_name"** +**KV_CACHE_USAGE_PERCENTAGE_METRIC="your_kv_cache_metric_name"** +**LORA_INFO_METRIC="your_lora_metric_name"** ## Exposing Dynamo Workers Without the FrontEnd -**Flow:** +**Current Flow:** +`Client Request → Gateway API → Endpoint Picker → Dynamo Router -> Best Worker` + +**New Flow:** `Client Request → Gateway API → Endpoint Picker → Best Worker` -This option relies on the routing choices provided by the **Standard Endpoint Picker**. +This option relies on the routing options provided by the **Standard Endpoint Picker**. --- ### Components Needed 1. **Keep Dynamo Runtime** - - **etcd** + - **ETCD** - **NATS** 2. **Create an HTTP service in front of each worker** From 9720c5860fc2c3aa6d9ae9af28835ad9d5566b6d Mon Sep 17 00:00:00 2001 From: Anna Tchernych Date: Thu, 11 Sep 2025 10:13:10 -0700 Subject: [PATCH 6/7] details on static lib Signed-off-by: Anna Tchernych --- inference-gw/NNNN-epp-integration/NNNN-epp-integration.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md index d5f87092..dfb8f3b4 100644 --- a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md +++ b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md @@ -121,6 +121,12 @@ Add calls to existing C-bindings and invoke them from Go (`cgo → extern "C" Ru - Expose a C-compatible FFI layer using `#[no_mangle] extern "C" fn`. - Build the crate into a `.so` / `.a` / `.dll` and call it from Go via `cgo`. - See [Draft PR #2786](https://github.com/ai-dynamo/dynamo/pull/2786) for a reference implementation. +- The EPP go code will instantiate the Dynamo Router with the namespace (i.e. vllm-agg) and component (i.e. backend). The router will read the model card by searching in etcd for the matching entry and read the kv cache block size. During the call `callDynamoRouter` the router will return the best worker id in the standard manner. +- The Dynamo Plugin will expose the standard router configuration values through env vars + - has_overlap_score_weight: bool, + - overlap_score_weight: f64, + - as_router_temperature: bool, + - router_temperature: f64, This approach has **higher maintenance overhead** but offers **lower latency**: - Avoids process boundary overhead (no syscalls, no kernel, no sockets). 
From 3ee7ce0928a730ddc5a51a9737c2cd19ec1674cf Mon Sep 17 00:00:00 2001 From: Anna Tchernych Date: Tue, 23 Sep 2025 18:10:10 -0700 Subject: [PATCH 7/7] diagram Signed-off-by: Anna Tchernych --- .../NNNN-epp-integration.md | 116 ++++++++++-------- 1 file changed, 66 insertions(+), 50 deletions(-) diff --git a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md index dfb8f3b4..605c7cb0 100644 --- a/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md +++ b/inference-gw/NNNN-epp-integration/NNNN-epp-integration.md @@ -1,14 +1,14 @@ -# Dynamo Integration with the Gateway API Inference Extension's End Point Picker +# Dynamo Integration with the Gateway API Inference Extension's End Point Picker -**Status**: Draft +**Status**: Draft **Authors**: [Anna Tchernych](https://github.com/atchernych) **Category**: Architecture -**Replaces**: [Link of previous proposal if applicable] +**Replaces**: [Link of previous proposal if applicable] -**Replaced By**: [Link of previous proposal if applicable] +**Replaced By**: [Link of previous proposal if applicable] **Sponsor**: [Name of code owner or maintainer to shepard process] @@ -22,11 +22,10 @@ # Summary -This proposal outlines the integration of Dynamo components with the [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/), in particular the [EndPointPicker](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/) +This proposal outlines the integration of Dynamo components with the [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/), in particular the [EndPointPicker](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/) # Motivation - The prior [version of Dynamo Inference Gateway Integration](https://github.com/ai-dynamo/enhancements/blob/bis/inference-gw/0001-TBD-inference-gw-integration/0001-TBD-inference-gw-integration.md) leaves room for 2 enhancements to how Dynamo integrates with the EPP. ## HTTP call overhead @@ -35,17 +34,19 @@ First, the EPP sends and HTTP request to the Dynamo FrontEnd to obtain the targe ## EPP offers richer routing control -Second, EPP-provides additional routing mechanisms such as Routing Filters which Dynamo does not offer. +Second, EPP-provides additional routing mechanisms such as Routing Filters which Dynamo does not offer. For example, in Dynamo router there is a small chance that not the optimal worker will be picked: We can have A perfect KV match on an overloaded worker but the request on this worker may be slower than a near-miss on an idle one. ### Worker A (Overloaded + Perfect Cache) + - **overlap** = 100% → `prefill_blocks = 0` - **active_blocks** = 1000 (very busy) - **Cost** = `1.0 × 0 + 1000 = 1000` ### Worker B (Idle + Near Miss) + - **overlap** = 80% → `prefill_blocks = 20% of request` - **active_blocks** = 10 (idle) - **Cost** = `1.0 × (0.2 × request_size) + 10` @@ -53,22 +54,20 @@ For example, in Dynamo router there is a small chance that not the optimal worke If the request is small enough, Worker A could still win despite being overloaded. The current model only uses active_blocks but ignores the queue depth (num_requests_waiting) +This situation can be mitigated by setting temperature but not eliminated. +We can add the Queue aware penalty to our router. Alternatively, we can use an EPP filter **LowQueueFilter**. 
It enforces a hard ceiling on queue depth and drops overloaded pods from the router's consideration.

## Goals

- Expose the Dynamo Router as a library for Go, through C-bindings or a binary library crate.

- Provide support for the standard EPP routing Filters, Pickers, and Scorers, so that an end user can mix and match the Dynamo routing approach with the EPP filters, pickers, and scorers. We want to support the YAML-based config file.

- Modularize Dynamo to enable Inference Gateway API usage with workers, without relying on the Dynamo Router.

### Non Goals

- Change existing Dynamo worker interfaces significantly

## Requirements

List out any additional requirements in numbered subheadings.

---

### REQ-1: Router as a Library

The router **SHOULD** incur minimal latency overhead when used as a library.

Use all-caps, bolded terms like **MUST** and **SHOULD** when describing each requirement. See [RFC-2119](https://datatracker.ietf.org/doc/html/rfc2119) for additional information.

---

### REQ-2: Support for Standard EPP Filters

The user **SHOULD** be able to use standard EPP filter- and scorer-enabled routing **alongside** the Dynamo Router.

---

### REQ-3: Inference Gateway API Without Dynamo Router

The user **SHOULD** be able to use Dynamo workers directly, without requiring the Dynamo Router.

# Proposal

This proposal puts forward C-bindings (an FFI layer) as the solution to the latency problem, and adapting Dynamo's metrics to the names EPP expects as the solution for enabling the standard EPP filters.

# Implementation Details

## Library Implementation

Two approaches are possible:

### 1. Rust Binary Crate

Package the Dynamo Router as a Rust binary crate.

- Invoke it from Go using `exec.Command`.
- Communicate over **stdin/stdout/IPC**.

This approach is simpler but introduces higher latency since every call crosses a **process boundary**.

---

### 2. C-Bindings (FFI Layer)

**Flow:** `Client Request → Gateway API → Endpoint Picker → C Bindings → KV Router → Best Worker`

Add calls to existing C-bindings and invoke them from Go (`cgo → extern "C" Rust`).

- Expose a C-compatible FFI layer using `#[no_mangle] extern "C" fn`.
- Build the crate into a `.so` / `.a` / `.dll` and call it from Go via `cgo`.
- See [Draft PR #2786](https://github.com/ai-dynamo/dynamo/pull/2786) for a reference implementation.
- The EPP Go code will instantiate the Dynamo Router with the namespace (e.g. `vllm-agg`) and component (e.g. `backend`). The router will read the model card by searching etcd for the matching entry and will read the KV cache block size.
During the `callDynamoRouter` call, the router will return the best worker id in the standard manner.
- The Dynamo Plugin will expose the standard router configuration values through env vars:
  - `has_overlap_score_weight: bool`
  - `overlap_score_weight: f64`
  - `as_router_temperature: bool`
  - `router_temperature: f64`

This approach has **higher maintenance overhead** but offers **lower latency**:

- Avoids process boundary overhead (no syscalls, no kernel, no sockets).
- Zero/low copy: pass `*uint32 + length` for tokens, get back a pointer, and free with a matching `free`—cheap and predictable.
- Shared runtime: initialize the Rust tokenizer/router runtime once (the bindings already use `OnceCell`) and reuse it across calls.

## Enabling EPP filters

EPP's filtering approach has a different shape: *Filter → Score → Pick*.
EPP's architecture explicitly stages decisions:

- Filters enforce hard constraints and shrink the candidate set before ranking (e.g., drop pods with queue depth above a threshold, require LoRA affinity).
- Scorers express preferences across the remaining pods (e.g., prefer lower queue depth, lower KV usage).
- Pickers choose the final endpoint (max-score, round-robin, prefix-aware, etc.).

EPP offers standard filters one can pick and choose through YAML config:

- **DecisionTreeFilter** — try one path; if it fails, fall back to another.
- **LeastKVCacheFilter** — keep pods in the lowest KV-usage bucket.
- **LeastQueueFilter** — keep pods in the lowest queue-depth bucket.
- **LoraAffinityFilter** — prefer/require pods with the target LoRA already loaded.
- **LowQueueFilter** — hard ceiling on queue depth (drop overloaded pods).

```mermaid
%%{init: {"theme": "default", "themeVariables": {"primaryColor": "#ffffff", "primaryTextColor": "#000000", "lineColor": "#008000"}} }%%
flowchart TB
    %% Main pipeline
    subgraph pipeline[" "]
        direction LR
        C[Filter] --> D[Score] --> E[Pick]
    end

    %% Box underneath with Frontend and Worker
    subgraph runtime[Runtime]
        FE[Frontend]
        W[Worker]
    end

    %% Force Runtime to be below pipeline
    pipeline --> runtime

    %% Arrow from Runtime to pipeline with metrics label
    runtime -.->|metrics| pipeline

    %% Style arrows to be green and visible
    linkStyle 0 stroke:#008000,stroke-width:2px
    linkStyle 1 stroke:#008000,stroke-width:2px
    linkStyle 2 stroke:transparent

```

EPP enables declarative configuration: EPP policies live in YAML. For example, a DecisionTreeFilter can encode: “If LoRA is hot, continue; else, fall back to a low-queue path.” This keeps policies flexible and easy to modify.

### Our approach (Dynamo)

We have EPP delegate the final routing decision to the Dynamo Frontend. The EPP framework allows custom plugins, and that is the approach taken here: the Dynamo routing call is implemented as a scorer.

Current gap: we do not support LoRA-affinity routing in our FE path (so we cannot enforce “LoRA must be hot” as a hard gate). Everything else above remains compatible.

EPP filters rely on Prometheus metrics. We would have to make them available in the *InferencePool* CR on the workers.
We would also have to rename the metrics of interest to EPP or expose their names through environment variables. For example, Dynamo exposes the `num_requests_waiting` metric for the number of requests that have been routed to a specific worker but are waiting in that worker's internal queue to be processed.
But EPP filters expect the `vllm:num_requests_waiting` and `nv_trt_llm_request_metrics{request_type=waiting}`. For KV cache utilization EPP expects `vllm:gpu_cache_usage_perc` and `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`. -EPP filters rely on Prometheus metrics. We would have to make them available in the *InferencePool* CR on the workers. -We would have to rename the metrics of interest to EPP or expose them through environment variables. For example, dynamo exposes the `num_requests_waiting` metric for the number of requests that have been routed to a specific worker but are waiting in that worker's internal queue to be processed. But EPP filters expect the `vllm:num_requests_waiting` and `nv_trt_llm_request_metrics{request_type=waiting}`. For KV cache utilization EPP expects `vllm:gpu_cache_usage_perc` and `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`. - -It is also notable that the tokenization can be performed in a plugin to the Inference Gateway API BBR (Body Based Routing). At the time of the proposal tokenization is coupled with the routing and tokens are returned to the Gateway along with the worker instance id. +It is also notable that the tokenization can be performed in a plugin to the Inference Gateway API BBR (Body Based Routing). At the time of the proposal tokenization is coupled with the routing and tokens are returned to the Gateway along with the worker instance id. EPP expects the following metrics: @@ -176,19 +205,14 @@ LeastQueueFilter (totalQueuedRequestsMetric): **For TRTLLM: nv_trt_llm_request_metrics{request_type=waiting}** **SGLang: sglang:num_queue_reqs** - - LoRA: (Dynamo does not yet support) --loraInfoMetric="vllm:lora_requests_info" - The names can be configurable via env vars: **TOTAL_QUEUED_REQUESTS_METRIC="your_queue_metric_name"** **KV_CACHE_USAGE_PERCENTAGE_METRIC="your_kv_cache_metric_name"** **LORA_INFO_METRIC="your_lora_metric_name"** - - ## Exposing Dynamo Workers Without the FrontEnd **Current Flow:** @@ -230,13 +254,11 @@ This option relies on the routing options provided by the **Standard Endpoint Pi - Deploy each worker with a **sidecar HTTP service**. - Configure the **InferencePool** to select these services. - ## Deferred to Implementation **\[Optional \- if not applicable omit\]** -List out items that are under discussion but that will be resolved only during implementation / code review. - +List out items that are under discussion but that will be resolved only during implementation / code review. **Release Target**: Date @@ -246,18 +268,15 @@ List out items that are under discussion but that will be resolved only during i **Supported API / Behavior:** -* \ +- \ **Not Supported:** -* \ +- \ # Related Proposals - -* [Biswa's Proposal](https://github.com/ai-dynamo/enhancements/blob/bis/inference-gw/0001-TBD-inference-gw-integration/0001-TBD-inference-gw-integration.md) - - +- [Biswa's Proposal](https://github.com/ai-dynamo/enhancements/blob/bis/inference-gw/0001-TBD-inference-gw-integration/0001-TBD-inference-gw-integration.md) # Alternate Solutions @@ -279,10 +298,10 @@ TODO TODO - # Background ## 1. Main Routing Approaches: Queue Depth & KV-Cache Utilization + Both **Dynamo** and **EPP** can route requests using **queue depth** and **KV-cache pressure**. - **Token awareness** @@ -295,6 +314,7 @@ Both **Dynamo** and **EPP** can route requests using **queue depth** and **KV-ca --- ## 2. 
EPP Pipeline Shape: *Filter → Score → Pick* + EPP’s architecture stages routing decisions explicitly: - **Filters** enforce hard constraints and shrink the candidate set before ranking. @@ -309,6 +329,7 @@ EPP’s architecture stages routing decisions explicitly: --- ### Standard Filters Available in EPP + EPP provides a variety of filters that can be composed via YAML configuration: - **DecisionTreeFilter** — try one path; if it fails, fall back to another. @@ -320,6 +341,7 @@ EPP provides a variety of filters that can be composed via YAML configuration: --- ### Declarative Configuration + EPP policies are defined in **YAML**. For example, a `DecisionTreeFilter` can encode logic like: @@ -327,9 +349,6 @@ For example, a `DecisionTreeFilter` can encode logic like: This makes policies flexible, modular, and easy to express declaratively. - - - ## References - [Deep Dive into Inference Extensions](https://kgateway.dev/blog/deep-dive-inference-extensions/) — good introduction @@ -342,12 +361,9 @@ This makes policies flexible, modular, and easy to express declaratively. - [EPP Protocol Proposal (Dynamo fork)](https://github.com/atchernych/gateway-api-inference-extension-dynamo/tree/main/docs/proposals/004-endpoint-picker-protocol) — proposal document - [Gateway API Inference Extension (upstream source)](https://github.com/kubernetes-sigs/gateway-api-inference-extension) — main repository - ## Terminology & Definitions | Term | Definition | | :---- | :---- | | **BBR — Body-Based Routing** | A [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/gieps/overview) plugin that extracts fields (e.g., `model`) from the request body and inserts them into headers for routing decisions. | | **EPP — Endpoint Picker Plugin** | A [plugin mechanism](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) used by the Inference Gateway API to score and select the appropriate backend endpoint (e.g., `QueueScorer`, `MaxScorePicker`, `LoRAAffinityScorer`). | - -