SGLang Inference roadmap #44

## Roadmap & Dependencies

### Completed
- [X] ✅ Verify disagg works over EFA
- [X] ✅ 📝 Fix docs: KV cache aware router

### Active Development Paths

#### Path 1: Benchmarking → Scaling → API Development

##### Benchmarking & Performance Analysis
- [ ] 📊 Set up a lightweight benchmarking pattern
  - [ ] Verify whether EFA connections or cluster placement improve throughput between prefill and decode nodes, and whether the lack of EFA substantially or statistically significantly increases latency when transferring KV cache between instances
  - [ ] Set up a benchmark description standard to pair with configs, describing the key dimensions of each config: most importantly, the trade-off between throughput and latency, and EC2 cost per unit of throughput
- [ ] 📊 Find output tokens/sec trade-offs
- [ ] 📊 Develop better-fit scaling policies
- [ ] 📊 Create an automated configuration search and tuning mechanism
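
As one possible shape for the benchmark description standard, here is a minimal sketch of a record that travels with a config and derives EC2 cost per throughput from it. The class name, field names, and all numbers are illustrative assumptions, not measurements or an agreed schema.

```python
from dataclasses import asdict, dataclass
import json


@dataclass(frozen=True)
class BenchmarkDescription:
    """Hypothetical record paired with one benchmarked config."""

    config_name: str              # label of the config this describes
    optimized_for: str            # "throughput" or "latency"
    output_tokens_per_sec: float  # measured across the whole fleet
    p50_latency_ms: float
    p99_latency_ms: float
    instance_count: int
    hourly_cost_usd: float        # per-instance price, supplied by the caller

    def cost_per_million_output_tokens(self) -> float:
        """USD spent per one million output tokens at the measured rate."""
        tokens_per_hour = self.output_tokens_per_sec * 3600
        fleet_cost_per_hour = self.hourly_cost_usd * self.instance_count
        return fleet_cost_per_hour / tokens_per_hour * 1_000_000


# Placeholder values only, chosen to make the arithmetic easy to follow.
example = BenchmarkDescription(
    config_name="example-config",
    optimized_for="throughput",
    output_tokens_per_sec=2000.0,
    p50_latency_ms=40.0,
    p99_latency_ms=120.0,
    instance_count=2,
    hourly_cost_usd=36.0,
)

print(json.dumps(asdict(example), indent=2))
print(example.cost_per_million_output_tokens())  # 72 USD/h over 7.2M tokens/h
```

Keeping the cost figure derived rather than stored means two configs can be compared on the same footing even when instance prices change.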

##### Decision Point: EFA Bonding Requirements
- [ ] Determine if EFA bonding is needed between prefill and decode instances
  - **Option A: If yes** - Scale in monolithic placement clusters
    - Pros: Fast interconnect backbone between instances
    - Cons: Reduced reach into EC2 fleet availability; requires more additions to the upstream router logic to support multi-AZ
  - **Option B: If no** - Use placement clusters only where multi-instance TP needs them
    - Pros: Simple, broader reach into EC2 fleet availability, multi-AZ is easy
    - Cons: Potentially bottlenecks KV cache transfer between prefill and decode instances; if the benchmarking shows no bottleneck, this is not much of a con
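
To make the yes/no call repeatable, the latency comparison could be reduced to a small check like the sketch below. It uses a permutation test from the standard library; the function names, the 10% slowdown threshold, and the choice to require a result that is both substantial and statistically significant are assumptions for illustration, not an agreed criterion.

```python
import random
from statistics import mean


def permutation_p_value(a, b, rounds=5000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)


def efa_needed(with_efa_ms, without_efa_ms, alpha=0.05, min_slowdown=0.10):
    """True when dropping EFA makes KV cache transfer latency both
    substantially and statistically significantly worse (assumed rule)."""
    slowdown = mean(without_efa_ms) / mean(with_efa_ms) - 1.0
    p_value = permutation_p_value(with_efa_ms, without_efa_ms)
    return slowdown >= min_slowdown and p_value < alpha


# Made-up latency samples in milliseconds, for illustration only.
with_efa = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 10.0]
without_efa = [14.0, 13.5, 14.2, 13.8, 14.1, 13.9, 14.4, 13.7]
print(efa_needed(with_efa, without_efa))
```

Swapping `and` for `or` in `efa_needed` gives the looser reading, where either a large or a significant slowdown is enough to pick Option A.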

##### API Development (Post-Decision)
- [ ] 🔧 Add API: add/remove worker for disagg (SGLang upstream contribution)
- [ ] 🔧 Add worker draining to router (SGLang upstream contribution)
 - Important for scale-ins
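
The add/remove and draining behavior could look roughly like the in-memory sketch below. Everything here (class names, method names, the worker URL) is hypothetical and is not the SGLang router API; it only illustrates the intended semantics, where a draining worker receives no new requests and is removed once its in-flight requests finish.

```python
from dataclasses import dataclass, field
from enum import Enum


class WorkerState(Enum):
    ACTIVE = "active"
    DRAINING = "draining"


@dataclass
class Worker:
    url: str
    role: str                      # "prefill" or "decode"
    state: WorkerState = WorkerState.ACTIVE
    in_flight: int = 0


@dataclass
class WorkerRegistry:
    """Hypothetical router-side bookkeeping for disaggregated workers."""

    workers: dict = field(default_factory=dict)

    def add_worker(self, url, role):
        self.workers[url] = Worker(url=url, role=role)

    def routable(self, role):
        """Workers that may receive new requests for the given role."""
        return [
            w for w in self.workers.values()
            if w.role == role and w.state is WorkerState.ACTIVE
        ]

    def start_request(self, url):
        worker = self.workers[url]
        if worker.state is not WorkerState.ACTIVE:
            raise RuntimeError(f"{url} is draining and accepts no new requests")
        worker.in_flight += 1

    def finish_request(self, url):
        worker = self.workers[url]
        worker.in_flight -= 1
        self._remove_if_drained(worker)

    def drain_worker(self, url):
        """Stop routing to the worker; remove it once it is idle."""
        worker = self.workers[url]
        worker.state = WorkerState.DRAINING
        self._remove_if_drained(worker)

    def _remove_if_drained(self, worker):
        if worker.state is WorkerState.DRAINING and worker.in_flight == 0:
            del self.workers[worker.url]


registry = WorkerRegistry()
registry.add_worker("http://decode-0:30000", "decode")
registry.start_request("http://decode-0:30000")
registry.drain_worker("http://decode-0:30000")      # still finishing one request
print(len(registry.routable("decode")))             # no new traffic is routed
registry.finish_request("http://decode-0:30000")    # last request done
print("http://decode-0:30000" in registry.workers)  # worker has been removed
```

A scale-in would call `drain_worker` first and terminate the instance only after the worker disappears from the registry.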

#### Path 2: Future Enhancements - Hierarchical Caching

- [ ] 🔮 Test and add file-based hierarchical caching
- [ ] 🔮 Implement distributed file system across worker nodes
- [ ] 🔮 Consider worker routing and existing routing methodologies
- [ ] 🔮 Evaluate whether distributed file system caching is worth the complexity

### Standalone Tasks

#### Infrastructure
- [ ] 🏗️ Add torch compilation and caching at AMI build
  - Enables fast scale-out, since new instances can reuse the torch compile cache built into the AMI
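
A minimal sketch of how a launch script might pick up a cache baked into the AMI is shown below. `TORCHINDUCTOR_CACHE_DIR` is the environment variable TorchInductor reads for its cache location; the helper names and the idea of checking for a non-empty directory are assumptions, and a cache built on one host is only expected to help when the target instance matches its GPU type and PyTorch version.

```python
import os
import tempfile
from pathlib import Path

# TorchInductor reads its on-disk cache location from this variable.
CACHE_ENV_VAR = "TORCHINDUCTOR_CACHE_DIR"


def cache_is_warm(cache_dir):
    """True if the directory exists and holds at least one entry."""
    path = Path(cache_dir)
    return path.is_dir() and any(path.iterdir())


def launch_env(cache_dir, base_env=None):
    """Environment for the server process (hypothetical helper).

    Points TorchInductor at the baked cache only when it is populated,
    so a cold AMI falls back to the default behavior.
    """
    env = dict(os.environ if base_env is None else base_env)
    if cache_is_warm(cache_dir):
        env[CACHE_ENV_VAR] = str(Path(cache_dir))
    return env


# Demonstration with a temporary directory standing in for the AMI path.
with tempfile.TemporaryDirectory() as tmp:
    print(CACHE_ENV_VAR in launch_env(tmp, base_env={}))  # empty: cold cache
    (Path(tmp) / "artifact.bin").write_bytes(b"\0")
    print(CACHE_ENV_VAR in launch_env(tmp, base_env={}))  # populated: warm
```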

#### Documentation
- [ ] 📝 Fix docs: GPU workers diagram in README

---

```mermaid
graph TB
 %% Completed tasks at top
 I1["✅ Verify disagg works over EFA"]
 D1["✅ Fix docs: KV cache aware router"]
 
 %% Standalone tasks
 I2["🏗️ Add torch compilation and caching at AMI build"]
 D2["📝 Fix docs: GPU workers diagram"]
 
 %% Main flow: Benchmarking path
 B1["📊 Set up lightweight benchmarking pattern"]
 B1a["📊 Verify EFA impact on kvcache throughput/latency"]
 B1b["📊 Set up benchmark description standard"]
 B2["📊 Find output tokens/sec trade-offs"]
 B3["📊 Develop better-fit scaling policies"]
 B4["📊 Create automated config and tuning mechanism"]
 
 %% Decision point
 DEC{"🔍 Decision: EFA bonding needed between prefill and decode instances?"}
 
 %% Decision branches
 OPT1["✅ Yes: Monolithic placement + Fast interconnect - Reduced EC2 availability - Complex multi-AZ"]
 OPT2["❌ No: Placement for TP only + Simple, better availability + Easy multi-AZ - Potential kvcache bottleneck"]
 
 %% API Development
 A1["🔧 Add API: add/remove worker for disagg"]
 A2["🔧 Add worker draining to router"]
 
 %% Future enhancements
 F1["🔮 Test file-based hierarchical caching"]
 F2["🔮 Implement distributed file system"]
 F3["🔮 Consider worker routing methods"]
 F4["🔮 Evaluate complexity vs benefits"]
 
 %% Dependencies
 I1 --> B1
 B1 --> B1a
 B1 --> B1b
 B1 --> B2
 B1a --> DEC
 B1b --> B2
 B2 --> B3
 B2 --> B4
 B3 --> B4
 
 DEC --> OPT1
 DEC --> OPT2
 OPT1 --> A1
 OPT2 --> A1
 
 A1 --> A2
 B3 --> F3
 A1 --> F3
 A2 --> F3
 
 F1 --> F2
 F2 --> F3
 F3 --> F4
 
 %% Styling
 classDef completed fill:#4ade80,stroke:#22c55e,stroke-width:3px,color:#000
 classDef docs fill:#94a3b8,stroke:#64748b,stroke-width:2px,color:#000
 classDef api fill:#a78bfa,stroke:#8b5cf6,stroke-width:2px,color:#000
 classDef infra fill:#60a5fa,stroke:#3b82f6,stroke-width:2px,color:#000
 classDef bench fill:#fbbf24,stroke:#f59e0b,stroke-width:2px,color:#000
 classDef future fill:#fb923c,stroke:#f97316,stroke-width:2px,color:#000
 classDef decision fill:#ef4444,stroke:#dc2626,stroke-width:3px,color:#fff
 classDef option fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#000
 
 class I1,D1 completed
 class D2 docs
 class A1,A2 api
 class I2 infra
 class B1,B1a,B1b,B2,B3,B4 bench
 class F1,F2,F3,F4 future
 class DEC decision
 class OPT1,OPT2 option
```

## Legend
- ✅ **Green**: Completed tasks
- 📝 **Gray**: Documentation updates
- 🔧 **Purple**: API development
- 🏗️ **Blue**: Deployment & Infrastructure
- 📊 **Yellow**: Benchmarking & Scaling
- 🔮 **Orange**: Future Enhancements
- 🔍 **Red**: Decision point
- **Light gray**: Decision options
- **Arrows**: Dependencies (must complete source before target)
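
Because every arrow means "source before target", the diagram can be turned into one valid work order with a topological sort. The sketch below copies the edges from the mermaid graph and uses the standard-library `graphlib` module (Python 3.9+); the helper name is an assumption, and the printed order is only one of several valid orders.

```python
from graphlib import TopologicalSorter

# (source, target) pairs copied from the dependency arrows in the diagram.
EDGES = [
    ("I1", "B1"), ("B1", "B1a"), ("B1", "B1b"), ("B1", "B2"),
    ("B1a", "DEC"), ("B1b", "B2"), ("B2", "B3"), ("B2", "B4"),
    ("B3", "B4"), ("DEC", "OPT1"), ("DEC", "OPT2"), ("OPT1", "A1"),
    ("OPT2", "A1"), ("A1", "A2"), ("B3", "F3"), ("A1", "F3"),
    ("A2", "F3"), ("F1", "F2"), ("F2", "F3"), ("F3", "F4"),
]


def build_predecessors(edges):
    """Map each task to the set of tasks that must finish before it."""
    predecessors = {}
    for source, target in edges:
        predecessors.setdefault(source, set())
        predecessors.setdefault(target, set()).add(source)
    return predecessors


order = list(TopologicalSorter(build_predecessors(EDGES)).static_order())
print(" -> ".join(order))
```

`TopologicalSorter` raises `graphlib.CycleError` if an edge is added that makes the roadmap circular, which is a cheap sanity check when the diagram is edited.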

## Key Insights

### Critical Decision Point
The EFA bonding analysis will determine the entire scaling architecture approach. This decision impacts:
- API implementation complexity
- Multi-AZ support feasibility
- EC2 fleet availability reach
- Overall system complexity

### Parallel Work Streams
1. **Main Path**: Benchmarking → Decision Point → API Development
2. **Future Path**: Hierarchical Caching Investigation (file-based caching tests can start independently; the worker routing step depends on the scaling policies and API work)
3. **Independent**: Torch compilation and documentation updates

### Current Bottlenecks
- **Benchmarking completion** blocks the EFA decision point
- **EFA decision** blocks the API development approach
- **Scaling policies** inform both API design and future caching strategies
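
The blocking relationships above can be checked mechanically by walking the dependency arrows forward from an unfinished task. This is a minimal sketch using the node IDs from the mermaid diagram; the function name is an assumption.

```python
from collections import deque

# Adjacency list (task -> tasks it unblocks), copied from the diagram.
UNBLOCKS = {
    "I1": ["B1"],
    "B1": ["B1a", "B1b", "B2"],
    "B1a": ["DEC"],
    "B1b": ["B2"],
    "B2": ["B3", "B4"],
    "B3": ["B4", "F3"],
    "DEC": ["OPT1", "OPT2"],
    "OPT1": ["A1"],
    "OPT2": ["A1"],
    "A1": ["A2", "F3"],
    "A2": ["F3"],
    "F1": ["F2"],
    "F2": ["F3"],
    "F3": ["F4"],
}


def blocked_by(task, unblocks=UNBLOCKS):
    """All tasks that cannot complete until `task` is done (breadth-first)."""
    seen = set()
    queue = deque(unblocks.get(task, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(unblocks.get(current, []))
    return seen


# The EFA benchmark (B1a) gates the decision, the API work, and routing.
print(sorted(blocked_by("B1a")))
# → ['A1', 'A2', 'DEC', 'F3', 'F4', 'OPT1', 'OPT2']
```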
