From cb69a5d49b2aca38bc219ee9567b20481950d0ba Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:11:59 -0700
Subject: [PATCH 01/10] new file

Signed-off-by: Yan Ru Pei
---
 deps/router-fault-tolerant | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 deps/router-fault-tolerant

diff --git a/deps/router-fault-tolerant b/deps/router-fault-tolerant
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/deps/router-fault-tolerant
@@ -0,0 +1 @@
+
From 8bc913720a98353122557a64382d6c4c36f89702 Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:19:08 -0700
Subject: [PATCH 02/10] summary

Signed-off-by: Yan Ru Pei
---
 deps/router-fault-tolerant | 190 +++++++++++++++++++++++++++++++++++++
 1 file changed, 190 insertions(+)

diff --git a/deps/router-fault-tolerant b/deps/router-fault-tolerant
index 8b137891..e3da4aae 100644
--- a/deps/router-fault-tolerant
+++ b/deps/router-fault-tolerant
@@ -1 +1,191 @@
 
+# Highly Available and Fault-Tolerant Router
+**Status**: Draft
+
+**Authors**: @PeaBrane
+
+**Category**: Architecture
+
+**Replaces**: N/A
+
+**Replaced By**: N/A
+
+**Sponsor**: @nnshah1
+
+**Required Reviewers**: @ryanolson
+
+**Review Date**: [Date for review]
+
+**Pull Request**: [Link to Pull Request of the Proposal itself]
+
+**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation]
+
+# Summary
+
+The overarching goal here is to have a Router design that allows for multiple Router instances to be deployed for fault tolerance.
+That is, in case one goes down, the others will still be able to function normally.
+This requires some sort of mechanism to sync the Router states periodically (either among the Router themselves or via events from the backend engines),
+and also a mechanism to "warm restart" the Router such that the Router can be brought back with up-to-date states.
+Finally, the Router should be decoupled from the (http) frontend, such that the two can be scaled independently.
+(It is more likely that the frontend handling the pre-processing / tokenization would need to scale first before the Router does.)
+
+# Motivation
+
+**\[Required\]**
+
+Describe the problem that needs to be addressed with enough detail for
+someone familiar with the project to understand. Generally one to two
+short paragraphs. Additional details can be placed in the background
+section as needed. Cover **what** the issue is and **why** it needs to
+be addressed. Link to github issues if relevant.
+
+## Goals
+
+**\[Optional \- if not applicable omit\]**
+
+List out any additional goals in bullet points. Goals may be aspirational / difficult to measure but guide the proposal.
+
+* Goal
+
+* Goal
+
+* Goal
+
+### Non Goals
+
+**\[Optional \- if not applicable omit\]**
+
+List out any items which are out of scope / specifically not required in bullet points. Indicates the scope of the proposal and issue being resolved.
+
+## Requirements
+
+**\[Optional \- if not applicable omit\]**
+
+List out any additional requirements in numbered subheadings.
+
+**\**
+
+### REQ \<\#\> \
+
+Describe the requirement in as much detail as necessary for others to understand it and how it applies to the DEP. Keep in mind that requirements should be measurable and will be used to determine if a DEP has been successfully implemented or not.
+
+Requirement names should be prefixed using a monotonically increasing number such as “REQ 1 \” followed by “REQ 2 \” and so on. Use title casing when naming requirements. Requirement names should be as descriptive as possible while remaining as terse as possible.
+
+Use all-caps, bolded terms like **MUST** and **SHOULD** when describing each requirement. See [RFC-2119](https://datatracker.ietf.org/doc/html/rfc2119) for additional information.
+
+
+# Proposal
+
+**\[Required\]**
+
+Describe the high level design / proposal. Use sub sections as needed, but start with an overview and then dig into the details. Try to provide images and diagrams to facilitate understanding.
+
+# Implementation Details
+
+**\[Optional \- if not applicable omit\]**
+
+Add additional detailed items here including interface signatures, etc. Add anything that is relevant but seems more of a detail than central to the proposal. Use sub sections / bullet points as needed. Try to provide images and diagrams to facilitate understanding. If applicable link to PR.
+
+## Deferred to Implementation
+
+**\[Optional \- if not applicable omit\]**
+
+List out items that are under discussion but that will be resolved only during implementation / code review.
+
+# Implementation Phases
+
+**\[Optional \- if not applicable omit\]**
+
+List out phases of implementation (can be single phase). Give each phase a monotonically increasing number; example “Phase 0” followed by “Phase 1” and so on. Give phases titles if it makes sense.
+
+## Phase \<\#\> \
+
+**Release Target**: Date
+
+**Effort Estimate**: \
+
+**Work Item(s):** \
+
+**Supported API / Behavior:**
+
+* \
+
+**Not Supported:**
+
+* \
+
+# Related Proposals
+
+**\[Optional \- if not applicable omit\]**
+
+* File
+
+* File
+
+* File
+
+* File
+
+* File
+
+# Alternate Solutions
+
+**\[Required, if not applicable write N/A\]**
+
+List out solutions that were considered but ultimately rejected. Consider free form \- but a possible format shown below.
+
+## Alt \<\#\> \
+
+**Pros:**
+
+\
+
+**Cons:**
+
+\
+
+**Reason Rejected:**
+
+\
+
+**Notes:**
+
+\
+
+# Background
+
+**\[Optional \- if not applicable omit\]**
+
+Add additional context and references as needed to help reviewers and authors understand the context of the problem and solution being proposed.
+
+## References
+
+**\[Optional \- if not applicable omit\]**
+
+Add additional references as needed to help reviewers and authors understand the context of the problem and solution being proposed.
+
+* \
+
+## Terminology & Definitions
+
+**\[Optional \- if not applicable omit\]**
+
+List out additional terms / definitions (lexicon). Try to keep definitions as concise as possible and use links to external resources when additional information would be useful to the reader.
+
+Keep the list of terms sorted alphabetically to ease looking up definitions by readers.
+
+| \ | \ |
+| :---- | :---- |
+| **\** | \ |
+
+## Acronyms & Abbreviations
+
+**\[Optional \- if not applicable omit\]**
+
+Provide a list of frequently used acronyms and abbreviations which are uncommon or unlikely to be known by the reader. Do not include acronyms or abbreviations which the reader is likely to be familiar with.
+
+Keep the list of acronyms and abbreviations sorted alphabetically to ease looking up definitions by readers.
+
+Do not include the full definition in the expanded meaning of an abbreviation or acronym. If the reader needs the definition, please include it in the [Terminology & Definitions](#terminology--definitions) section.
+
+**\:** \
From ac67a62b0dd6417fb4d04a43a0f9c35431a7fffb Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:20:29 -0700
Subject: [PATCH 03/10] add .md extension

Signed-off-by: Yan Ru Pei
---
 deps/{router-fault-tolerant => router-fault-tolerant.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename deps/{router-fault-tolerant => router-fault-tolerant.md} (100%)

diff --git a/deps/router-fault-tolerant b/deps/router-fault-tolerant.md
similarity index 100%
rename from deps/router-fault-tolerant
rename to deps/router-fault-tolerant.md
From abf4c2df25e8d9389da111b2fea8a565c318027b Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:33:14 -0700
Subject: [PATCH 04/10] motivation

Signed-off-by: Yan Ru Pei
---
 deps/router-fault-tolerant.md | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/deps/router-fault-tolerant.md b/deps/router-fault-tolerant.md
index e3da4aae..69ffc35b 100644
--- a/deps/router-fault-tolerant.md
+++ b/deps/router-fault-tolerant.md
@@ -31,13 +31,26 @@ Finally, the Router should be decoupled from the (http) frontend, such that the
 
 # Motivation
 
-**\[Required\]**
-
-Describe the problem that needs to be addressed with enough detail for
-someone familiar with the project to understand. Generally one to two
-short paragraphs. Additional details can be placed in the background
-section as needed. Cover **what** the issue is and **why** it needs to
-be addressed. Link to github issues if relevant.
+As context, we have iterated over two designs of the Router that worked well in their own regard.
+
+First, we had a near-stateless Router listening on backend engines for KV events and load metrics. This is good because:
+- Multiple Routers can be launched and synced naturally
+- Easier Python binding for modular components, as the Router does not hold the output SSE stream, and simply needs to return the `best_worker_id`
+But not good because:
+- The radix tree of the `KvIndexer` is still very stateful, with no warm restart mechanism
+- Huge performance hit under highly concurrent payloads, as KV / metric events cannot respond fast enough for the Router to keep track of the updated load states.
+
+Now, we have a stateful Router still listening on backend engines for KV events (can opt out of via `ApproxKvIndexer`),
+but maintains the active block states locally from the request-response cycle. This is good because:
+- The performance is good under high concurrency, because the Router never sees a stale load metric state, as we forced sequential processing of requests locally.
+- It is highly general, as the Router can now interface with any backend engine, without the need for any event communication
+But not good because:
+- Due to its high statefulness, multiple Routers cannot be perfectly in sync, as a Router only sees a subset of requests / responses
+- The Router holds the output SSE stream, so if the Router goes down, the stream will die along with it
+- Harder to have modular components to bind to Python, as we require the entirety of `KvPushRouter` to handle the request-response cycles
+
+In short, a stateless Router is better for fault-tolerance, but a stateful Router is better for optimality of routing decisions.
+The main motivation here is to have a design that incorporates the benefits of both, and eventually achieve a net win.
 
 ## Goals
 
From 40d1c608411ca2fd80b4a04829345bfbdc76275c Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:33:51 -0700
Subject: [PATCH 05/10] spacings

Signed-off-by: Yan Ru Pei
---
 deps/router-fault-tolerant.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/deps/router-fault-tolerant.md b/deps/router-fault-tolerant.md
index 69ffc35b..5b41c0ce 100644
--- a/deps/router-fault-tolerant.md
+++ b/deps/router-fault-tolerant.md
@@ -36,6 +36,7 @@ As context, we have iterated over two designs of the Router that worked well in
 First, we had a near-stateless Router listening on backend engines for KV events and load metrics. This is good because:
 - Multiple Routers can be launched and synced naturally
 - Easier Python binding for modular components, as the Router does not hold the output SSE stream, and simply needs to return the `best_worker_id`
+
 But not good because:
 - The radix tree of the `KvIndexer` is still very stateful, with no warm restart mechanism
 - Huge performance hit under highly concurrent payloads, as KV / metric events cannot respond fast enough for the Router to keep track of the updated load states.
@@ -44,6 +45,7 @@ Now, we have a stateful Router still listening on backend engines for KV events
 but maintains the active block states locally from the request-response cycle. This is good because:
 - The performance is good under high concurrency, because the Router never sees a stale load metric state, as we forced sequential processing of requests locally.
 - It is highly general, as the Router can now interface with any backend engine, without the need for any event communication
+
 But not good because:
 - Due to its high statefulness, multiple Routers cannot be perfectly in sync, as a Router only sees a subset of requests / responses
 - The Router holds the output SSE stream, so if the Router goes down, the stream will die along with it
From 99f0578b13e2983faf712b6cf84191f10d9fa93e Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:45:14 -0700
Subject: [PATCH 06/10] goals

Signed-off-by: Yan Ru Pei
---
 deps/router-fault-tolerant.md | 67 +++++++++--------------------------
 1 file changed, 17 insertions(+), 50 deletions(-)

diff --git a/deps/router-fault-tolerant.md b/deps/router-fault-tolerant.md
index 5b41c0ce..a9257b0c 100644
--- a/deps/router-fault-tolerant.md
+++ b/deps/router-fault-tolerant.md
@@ -12,7 +12,7 @@
 
 **Sponsor**: @nnshah1
 
-**Required Reviewers**: @ryanolson
+**Required Reviewers**: @ryanolson @grahamking
 
 **Review Date**: [Date for review]
 
@@ -24,7 +24,7 @@
 
 The overarching goal here is to have a Router design that allows for multiple Router instances to be deployed for fault tolerance.
 That is, in case one goes down, the others will still be able to function normally.
-This requires some sort of mechanism to sync the Router states periodically (either among the Router themselves or via events from the backend engines),
+This requires some sort of mechanism to sync the Router states periodically (either among the Routers themselves or via events from the backend engines),
 and also a mechanism to "warm restart" the Router such that the Router can be brought back with up-to-date states.
 Finally, the Router should be decoupled from the (http) frontend, such that the two can be scaled independently.
 (It is more likely that the frontend handling the pre-processing / tokenization would need to scale first before the Router does.)
@@ -33,7 +33,7 @@ Finally, the Router should be decoupled from the (http) frontend, such that the
 
 As context, we have iterated over two designs of the Router that worked well in their own regard.
 
-First, we had a near-stateless Router listening on backend engines for KV events and load metrics. This is good because:
+First, we had a **near-stateless Router** listening on backend engines for KV events and load metrics. This is good because:
 - Multiple Routers can be launched and synced naturally
 - Easier Python binding for modular components, as the Router does not hold the output SSE stream, and simply needs to return the `best_worker_id`
 
@@ -41,7 +41,7 @@ But not good because:
 - The radix tree of the `KvIndexer` is still very stateful, with no warm restart mechanism
 - Huge performance hit under highly concurrent payloads, as KV / metric events cannot respond fast enough for the Router to keep track of the updated load states.
 
-Now, we have a stateful Router still listening on backend engines for KV events (can opt out of via `ApproxKvIndexer`),
+Now, we have a **stateful Router** still listening on backend engines for KV events (can opt out of via `ApproxKvIndexer`),
 but maintains the active block states locally from the request-response cycle. This is good because:
 - The performance is good under high concurrency, because the Router never sees a stale load metric state, as we forced sequential processing of requests locally.
 - It is highly general, as the Router can now interface with any backend engine, without the need for any event communication
@@ -56,37 +56,19 @@ The main motivation here is to have a design that incorporates the benefits of b
 
 ## Goals
 
-**\[Optional \- if not applicable omit\]**
-
-List out any additional goals in bullet points. Goals may be aspirational / difficult to measure but guide the proposal.
-
-* Goal
-
-* Goal
-
-* Goal
+* The Router has to be performant over generic load balancers (e.g. round robin) under general settings, as it is now.
+* The Router has to be a separate component that can be scaled (or not-scaled) independently from the frontend.
+* Multiple Router has to be launched without losing routing optimality.
+* A Router can go down without affecting the output SSE streams.
+* A Router can come back up without losing its previous states or missing updates during the time it was down.
 
 ### Non Goals
 
-**\[Optional \- if not applicable omit\]**
-
-List out any items which are out of scope / specifically not required in bullet points. Indicates the scope of the proposal and issue being resolved.
+N/A
 
 ## Requirements
 
-**\[Optional \- if not applicable omit\]**
-
-List out any additional requirements in numbered subheadings.
-
-**\**
-
-### REQ \<\#\> \
-
-Describe the requirement in as much detail as necessary for others to understand it and how it applies to the DEP. Keep in mind that requirements should be measurable and will be used to determine if a DEP has been successfully implemented or not.
-
-Requirement names should be prefixed using a monotonically increasing number such as “REQ 1 \” followed by “REQ 2 \” and so on. Use title casing when naming requirements. Requirement names should be as descriptive as possible while remaining as terse as possible.
-
-Use all-caps, bolded terms like **MUST** and **SHOULD** when describing each requirement. See [RFC-2119](https://datatracker.ietf.org/doc/html/rfc2119) for additional information.
+N/A
 
 
 # Proposal
@@ -175,32 +157,17 @@ Add additional context and references as needed to help reviewers and authors un
 
 ## References
 
-**\[Optional \- if not applicable omit\]**
-
-Add additional references as needed to help reviewers and authors understand the context of the problem and solution being proposed.
-
-* \
+* [KV Routing](https://docs.nvidia.com/dynamo/latest/architecture/kv_cache_routing.html)
+* [KV Router Performance Tuning](https://docs.nvidia.com/dynamo/latest/guides/kv_router_perf_tuning.html)
+* [SGL's stateful Router](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)
 
 ## Terminology & Definitions
 
-**\[Optional \- if not applicable omit\]**
-
-List out additional terms / definitions (lexicon). Try to keep definitions as concise as possible and use links to external resources when additional information would be useful to the reader.
-
-Keep the list of terms sorted alphabetically to ease looking up definitions by readers.
-
 | \ | \ |
 | :---- | :---- |
-| **\** | \ |
+| **KvIndexer** | A data structure for maintaining a global view of prefix caches of all workers |
+| **Router** | A component for routing requests to backend workers that is aware of the current loads and prefix caches of each worker |
 
 ## Acronyms & Abbreviations
 
-**\[Optional \- if not applicable omit\]**
-
-Provide a list of frequently used acronyms and abbreviations which are uncommon or unlikely to be known by the reader. Do not include acronyms or abbreviations which the reader is likely to be familiar with.
-
-Keep the list of acronyms and abbreviations sorted alphabetically to ease looking up definitions by readers.
-
-Do not include the full definition in the expanded meaning of an abbreviation or acronym. If the reader needs the definition, please include it in the [Terminology & Definitions](#terminology--definitions) section.
-
-**\:** \
+N/A
From 26564d34d68f0300ac7aeea4fec07879450312ed Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:47:43 -0700
Subject: [PATCH 07/10] N/A out the background

Signed-off-by: Yan Ru Pei
---
 deps/router-fault-tolerant.md | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/deps/router-fault-tolerant.md b/deps/router-fault-tolerant.md
index a9257b0c..28a16b0f 100644
--- a/deps/router-fault-tolerant.md
+++ b/deps/router-fault-tolerant.md
@@ -70,7 +70,6 @@ N/A
 
 N/A
 
-
 # Proposal
 
 **\[Required\]**
@@ -151,9 +150,7 @@ List out solutions that were considered but ultimately rejected. Consider free f
 
 # Background
 
-**\[Optional \- if not applicable omit\]**
-
-Add additional context and references as needed to help reviewers and authors understand the context of the problem and solution being proposed.
+N/A
 
 ## References
 
From db8456570d62660ba3d63f97af75d65853130017 Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:52:09 -0700
Subject: [PATCH 08/10] subheadings in motivation

Signed-off-by: Yan Ru Pei
---
 deps/router-fault-tolerant.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/deps/router-fault-tolerant.md b/deps/router-fault-tolerant.md
index 28a16b0f..933c1283 100644
--- a/deps/router-fault-tolerant.md
+++ b/deps/router-fault-tolerant.md
@@ -33,6 +33,8 @@ Finally, the Router should be decoupled from the (http) frontend, such that the
 
 As context, we have iterated over two designs of the Router that worked well in their own regard.
 
+## Initial Design
+
 First, we had a **near-stateless Router** listening on backend engines for KV events and load metrics. This is good because:
 - Multiple Routers can be launched and synced naturally
 - Easier Python binding for modular components, as the Router does not hold the output SSE stream, and simply needs to return the `best_worker_id`
@@ -41,6 +43,8 @@ But not good because:
 - The radix tree of the `KvIndexer` is still very stateful, with no warm restart mechanism
 - Huge performance hit under highly concurrent payloads, as KV / metric events cannot respond fast enough for the Router to keep track of the updated load states.
 
+## Current Design
+
 Now, we have a **stateful Router** still listening on backend engines for KV events (can opt out of via `ApproxKvIndexer`),
 but maintains the active block states locally from the request-response cycle. This is good because:
 - The performance is good under high concurrency, because the Router never sees a stale load metric state, as we forced sequential processing of requests locally.
@@ -51,8 +55,11 @@ But not good because:
 - The Router holds the output SSE stream, so if the Router goes down, the stream will die along with it
 - Harder to have modular components to bind to Python, as we require the entirety of `KvPushRouter` to handle the request-response cycles
 
+## Future Design
+
 In short, a stateless Router is better for fault-tolerance, but a stateful Router is better for optimality of routing decisions.
-The main motivation here is to have a design that incorporates the benefits of both, and eventually achieve a net win.
+The main motivation here is to have a design that incorporates the benefits of both, and eventually achieve a net win.
+More details would be provided in the following sections.
 
 ## Goals
 
From 84b5c9d33a78408a548fbee9f1c28c06dbec92bf Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:53:04 -0700
Subject: [PATCH 09/10] merge "future design" with "goals

Signed-off-by: Yan Ru Pei
---
 deps/router-fault-tolerant.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/deps/router-fault-tolerant.md b/deps/router-fault-tolerant.md
index 933c1283..27200829 100644
--- a/deps/router-fault-tolerant.md
+++ b/deps/router-fault-tolerant.md
@@ -55,14 +55,12 @@ But not good because:
 - The Router holds the output SSE stream, so if the Router goes down, the stream will die along with it
 - Harder to have modular components to bind to Python, as we require the entirety of `KvPushRouter` to handle the request-response cycles
 
-## Future Design
+## Goals
 
 In short, a stateless Router is better for fault-tolerance, but a stateful Router is better for optimality of routing decisions.
 The main motivation here is to have a design that incorporates the benefits of both, and eventually achieve a net win.
 More details would be provided in the following sections.
 
-## Goals
-
 * The Router has to be performant over generic load balancers (e.g. round robin) under general settings, as it is now.
 * The Router has to be a separate component that can be scaled (or not-scaled) independently from the frontend.
 * Multiple Router has to be launched without losing routing optimality.
From 5ac72f4fec10fc79792fdc8a3e9e616a6e5d9a77 Mon Sep 17 00:00:00 2001
From: Yan Ru Pei
Date: Fri, 1 Aug 2025 10:53:46 -0700
Subject: [PATCH 10/10] separate paragraph for itemized goals

Signed-off-by: Yan Ru Pei
---
 deps/router-fault-tolerant.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/deps/router-fault-tolerant.md b/deps/router-fault-tolerant.md
index 27200829..3f92c0ad 100644
--- a/deps/router-fault-tolerant.md
+++ b/deps/router-fault-tolerant.md
@@ -61,6 +61,7 @@ In short, a stateless Router is better for fault-tolerance, but a stateful Route
 The main motivation here is to have a design that incorporates the benefits of both, and eventually achieve a net win.
 More details would be provided in the following sections.
 
+The overarching goals are then:
 * The Router has to be performant over generic load balancers (e.g. round robin) under general settings, as it is now.
 * The Router has to be a separate component that can be scaled (or not-scaled) independently from the frontend.
 * Multiple Router has to be launched without losing routing optimality.