SGLang Inference roadmap #44

@jonathancaevans

Roadmap & Dependencies

Completed

  • ✅ Verify disagg works over EFA
  • ✅ 📝 Fix docs: KV cache aware router

Active Development Paths

Path 1: Benchmarking → Scaling → API Development

Benchmarking & Performance Analysis
  • 📊 Setup lightweight benchmarking pattern
    • Verify whether EFA connections or placement clustering improve throughput between prefill and decode nodes, and whether the lack of EFA increases latency by a substantial or statistically significant amount when transferring kvcache between instances
    • Set up a benchmark description standard to pair with each config, describing a few dimensions of that config. Most important are the trade-off between throughput and latency, and EC2 cost per unit of throughput
  • 📊 Find output tokens/sec trade-offs
  • 📊 Develop better fit scaling policies
  • 📊 Create an automated configuration search and tuning mechanism
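
One possible shape for the benchmark description standard mentioned above, as a minimal sketch. The `BenchmarkRecord` schema, its field names, and the `pareto_front` helper are hypothetical and not part of SGLang; prices are supplied by the caller rather than looked up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRecord:
    """One benchmark run paired with the config that produced it (hypothetical schema)."""
    config_name: str             # label for the config under test
    instance_count: int          # EC2 instances in the deployment
    hourly_cost_usd: float       # price per instance-hour, supplied by the caller
    output_tokens_per_sec: float
    p50_latency_ms: float
    p99_latency_ms: float

    def cost_per_million_output_tokens(self) -> float:
        """EC2 dollars spent per one million output tokens at steady state."""
        tokens_per_hour = self.output_tokens_per_sec * 3600
        fleet_cost_per_hour = self.hourly_cost_usd * self.instance_count
        return fleet_cost_per_hour / tokens_per_hour * 1_000_000


def pareto_front(records):
    """Keep records not dominated on (higher throughput, lower p99 latency)."""
    front = []
    for r in records:
        dominated = any(
            o.output_tokens_per_sec >= r.output_tokens_per_sec
            and o.p99_latency_ms <= r.p99_latency_ms
            and (o.output_tokens_per_sec > r.output_tokens_per_sec
                 or o.p99_latency_ms < r.p99_latency_ms)
            for o in records
        )
        if not dominated:
            front.append(r)
    return front
```

Comparing configs on the Pareto front, rather than on a single number, keeps the throughput vs latency trade-off visible; cost per million tokens can then break ties between configs on the front.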
Decision Point: EFA Bonding Requirements
  • Determine if EFA bonding is needed between prefill and decode instances
    • Option A: If yes - Scale in monolithic placement clusters
      • Pros: Fast interconnect backbone for instances
      • Cons: Reduced reach into EC2 fleet availability; requires more OS additions to upstream router logic to support multi-AZ
    • Option B: If no - Scale in placement clusters only as multi-instance TP needs
      • Pros: Simple, wider reach into EC2 fleet availability, multi-AZ is easy
      • Cons: Potentially bottlenecks kvcache transfer between decode and prefill instances, though if the benchmarking above shows that it does not, this is not much of a con
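
The decision depends on whether the latency difference with and without EFA is statistically significant. A minimal sketch of one way to check that, using a permutation test on the difference in mean latency. The choice of test is an assumption, not something the roadmap prescribes, and the function name and inputs (two lists of per-request latencies) are hypothetical.

```python
import random
from statistics import mean


def permutation_test(with_efa, without_efa, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean latency.

    Returns (observed_diff, p_value), where observed_diff is
    mean(without_efa) - mean(with_efa).
    """
    rng = random.Random(seed)
    observed = mean(without_efa) - mean(with_efa)
    pooled = list(with_efa) + list(without_efa)
    n_with = len(with_efa)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the labels: under the null hypothesis EFA makes no difference.
        rng.shuffle(pooled)
        diff = mean(pooled[n_with:]) - mean(pooled[:n_with])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return observed, (extreme + 1) / (n_resamples + 1)
```

A small p-value only says the difference is unlikely to be noise; whether the observed difference is large enough to justify monolithic placement clusters is a separate judgment.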
API Development (Post-Decision)
  • 🔧 Add API: add/remove worker for disagg (SGLang upstream contribution)
  • 🔧 Add worker draining to router (SGLang upstream contribution)
    • Important for scale-ins
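
A toy illustration of the add/remove and draining semantics the two API items describe. This is not the SGLang router API; the class and method names are hypothetical, and a real router would also need locking, health checks, and separate prefill and decode pools.

```python
from collections import defaultdict


class DrainingRouter:
    """Toy round-robin router with add and drain semantics (hypothetical)."""

    def __init__(self):
        self._inflight = defaultdict(int)  # worker URL -> in-flight request count
        self._active = []                  # workers eligible for new requests
        self._draining = set()             # workers finishing existing requests
        self._next = 0

    def add_worker(self, url):
        if url not in self._active and url not in self._draining:
            self._active.append(url)

    def drain_worker(self, url):
        """Stop routing new requests to url; returns True if it is already idle."""
        if url in self._active:
            self._active.remove(url)
            self._draining.add(url)
        return self._finish_if_idle(url)

    def acquire(self):
        """Pick a worker for a new request; draining workers are never chosen."""
        if not self._active:
            raise RuntimeError("no active workers")
        url = self._active[self._next % len(self._active)]
        self._next += 1
        self._inflight[url] += 1
        return url

    def release(self, url):
        """Mark one request on url as finished; returns True if url was removed."""
        self._inflight[url] -= 1
        return self._finish_if_idle(url)

    def _finish_if_idle(self, url):
        if url in self._draining and self._inflight[url] == 0:
            self._draining.discard(url)
            del self._inflight[url]
            return True  # safe to terminate the instance now
        return False
```

On scale-in, the autoscaler would call `drain_worker` first and terminate the instance only once a call returns True, so in-flight requests are not cut off.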

Path 2: Future Enhancements - Hierarchical Caching

  • 🔮 Test and add file-based hierarchical caching
  • 🔮 Implement distributed file system across worker nodes
  • 🔮 Consider worker routing and existing routing methodologies
  • 🔮 Evaluate whether distributed file system caching is worth the complexity
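
To make the idea of a file-based tier concrete, a minimal sketch of a two-tier cache: an in-memory LRU that spills evicted entries to files and promotes them back on access. It illustrates the general pattern only and is not SGLang's hierarchical cache. The class name is hypothetical, and keys are assumed to be filename-safe strings such as prefix hashes.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TwoTierCache:
    """In-memory LRU that spills evicted entries to files (illustrative only)."""

    def __init__(self, capacity, spill_dir=None):
        self.capacity = capacity
        # Default to a fresh temp directory; a shared mount could be passed instead.
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kv-spill-")
        self.memory = OrderedDict()  # key -> value, least recently used first
        os.makedirs(self.spill_dir, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.capacity:
            old_key, old_value = self.memory.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_value, f)

    def get(self, key):
        """Return (value, tier) where tier is 'memory', 'file', or 'miss'."""
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key], "memory"
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                value = pickle.load(f)
            self.put(key, value)  # promote back into the memory tier
            return value, "file"
        return None, "miss"
```

Pointing `spill_dir` at a distributed file system is what would let one worker reuse entries another worker spilled, which is where the routing and complexity questions above come in.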

Standalone Tasks

Infrastructure

  • 🏗️ Add torch compilation and caching at AMI build
    • Enables fast scale-out, since new instances start with a warm torch compile cache
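
A minimal sketch of how the cache location could be pinned so that artifacts compiled during the AMI build are found again at boot. The helper name and the cache root path are hypothetical. The environment variables are the ones read by TorchInductor and Triton, and whether a baked cache actually hits depends on the GPU type and library versions matching between build and runtime.

```python
import os


def compile_cache_env(cache_root):
    """Environment variables that point compile caches at one directory."""
    return {
        # Directory used by TorchInductor for its on-disk compile artifacts.
        "TORCHINDUCTOR_CACHE_DIR": os.path.join(cache_root, "inductor"),
        # Turns on the Inductor FX graph cache.
        "TORCHINDUCTOR_FX_GRAPH_CACHE": "1",
        # Directory used by Triton for compiled kernels.
        "TRITON_CACHE_DIR": os.path.join(cache_root, "triton"),
    }


if __name__ == "__main__":
    # Hypothetical path; print export lines for an AMI build or boot script.
    for name, value in compile_cache_env("/opt/compile-cache").items():
        print(f"export {name}={value}")
```

The same variables would be set both in the AMI build step that runs a warm-up compile and in the service environment at boot.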

Documentation

  • 📝 Fix docs: GPU workers diagram in README

graph TB
    %% Completed tasks at top
    I1["✅ Verify disagg works over EFA"]
    D1["✅ Fix docs: KV cache aware router"]
    
    %% Standalone tasks
    I2["🏗️ Add torch compilation<br/>and caching at AMI build"]
    D2["📝 Fix docs: GPU workers diagram"]
    
    %% Main flow: Benchmarking path
    B1["📊 Setup lightweight<br/>benchmarking pattern"]
    B1a["📊 Verify EFA impact on<br/>kvcache throughput/latency"]
    B1b["📊 Set up benchmark<br/>description standard"]
    B2["📊 Find output tokens/sec<br/>trade-offs"]
    B3["📊 Develop better fit<br/>scaling policies"]
    B4["📊 Create automated config<br/>and tuning mechanism"]
    
    %% Decision point
    DEC{"🔍 Decision: EFA bonding<br/>needed between prefill<br/>and decode instances?"}
    
    %% Decision branches
    OPT1["✅ Yes: Monolithic placement<br/>+ Fast interconnect<br/>- Reduced EC2 availability<br/>- Complex multi-AZ"]
    OPT2["❌ No: Placement for TP only<br/>+ Simple, better availability<br/>+ Easy multi-AZ<br/>- Potential kvcache bottleneck"]
    
    %% API Development
    A1["🔧 Add API: add/remove<br/>worker for disagg"]
    A2["🔧 Add worker draining<br/>to router"]
    
    %% Future enhancements
    F1["🔮 Test file-based<br/>hierarchical caching"]
    F2["🔮 Implement distributed<br/>file system"]
    F3["🔮 Consider worker<br/>routing methods"]
    F4["🔮 Evaluate complexity<br/>vs benefits"]
    
    %% Dependencies
    I1 --> B1
    B1 --> B1a
    B1 --> B1b
    B1 --> B2
    B1a --> DEC
    B1b --> B2
    B2 --> B3
    B2 --> B4
    B3 --> B4
    
    DEC --> OPT1
    DEC --> OPT2
    OPT1 --> A1
    OPT2 --> A1
    
    A1 --> A2
    B3 --> F3
    A1 --> F3
    A2 --> F3
    
    F1 --> F2
    F2 --> F3
    F3 --> F4
    
    %% Styling
    classDef completed fill:#4ade80,stroke:#22c55e,stroke-width:3px,color:#000
    classDef docs fill:#94a3b8,stroke:#64748b,stroke-width:2px,color:#000
    classDef api fill:#a78bfa,stroke:#8b5cf6,stroke-width:2px,color:#000
    classDef infra fill:#60a5fa,stroke:#3b82f6,stroke-width:2px,color:#000
    classDef bench fill:#fbbf24,stroke:#f59e0b,stroke-width:2px,color:#000
    classDef future fill:#fb923c,stroke:#f97316,stroke-width:2px,color:#000
    classDef decision fill:#ef4444,stroke:#dc2626,stroke-width:3px,color:#fff
    classDef option fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#000
    
    class I1,D1 completed
    class D2 docs
    class A1,A2 api
    class I2 infra
    class B1,B1a,B1b,B2,B3,B4 bench
    class F1,F2,F3,F4 future
    class DEC decision
    class OPT1,OPT2 option

Legend

  • ✅ Green: Completed tasks
  • 📝 Gray: Documentation updates
  • 🔧 Purple: API development
  • 🏗️ Blue: Deployment & Infrastructure
  • 📊 Yellow: Benchmarking & Scaling
  • 🔮 Orange: Future Enhancements
  • Arrows: Dependencies (must complete source before target)

Key Insights

Critical Decision Point

The EFA bonding analysis will determine the entire scaling architecture approach. This decision impacts:

  • API implementation complexity
  • Multi-AZ support feasibility
  • EC2 fleet availability reach
  • Overall system complexity

Parallel Work Streams

  1. Main Path: Benchmarking → Decision Point → API Development
  2. Future Path: Hierarchical Caching Investigation (can start independently)
  3. Independent: Torch compilation and documentation updates

Current Bottlenecks

  • Benchmarking completion blocks the EFA decision point
  • EFA decision blocks the API development approach
  • Scaling policies inform both API design and future caching strategies
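
The blocking relationships can be checked mechanically. A small sketch that copies the edges from the dependency graph and reports which tasks are unblocked given a set of completed ones, using the standard-library `graphlib` module (Python 3.9+). The `ready_tasks` helper is hypothetical; the node IDs are the ones used in the graph.

```python
from graphlib import TopologicalSorter

# Edges copied from the dependency graph: source must complete before target.
EDGES = [
    ("I1", "B1"), ("B1", "B1a"), ("B1", "B1b"), ("B1", "B2"),
    ("B1a", "DEC"), ("B1b", "B2"), ("B2", "B3"), ("B2", "B4"), ("B3", "B4"),
    ("DEC", "OPT1"), ("DEC", "OPT2"), ("OPT1", "A1"), ("OPT2", "A1"),
    ("A1", "A2"), ("B3", "F3"), ("A1", "F3"), ("A2", "F3"),
    ("F1", "F2"), ("F2", "F3"), ("F3", "F4"),
]


def ready_tasks(edges, completed):
    """Return the tasks whose prerequisites are all in `completed`, sorted by ID."""
    sorter = TopologicalSorter()
    for source, target in edges:
        sorter.add(target, source)  # target depends on source
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    pending = list(sorter.get_ready())
    ready = []
    while pending:
        node = pending.pop()
        if node in completed:
            sorter.done(node)
            pending.extend(sorter.get_ready())
        else:
            ready.append(node)
    return sorted(ready)
```

With only `I1` completed, this reports `B1` and `F1` as ready, which matches the benchmarking path and the hierarchical caching path being able to start in parallel.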
