@Bihan Bihan commented Oct 15, 2025

Intro
We want to make it possible to create a gateway that adds router features (all sgl-router features, such as cache-aware routing) while keeping all the standard gateway features (such as authentication and rate limits).

For the user, using such a gateway should be very simple: set router to sglang in the gateway configuration. For example:

type: gateway
name: sglang-gateway

backend: aws
region: eu-west-1

domain: example.com
router: sglang

Everything else should look the same to the user: the same service endpoint, with authentication and rate limits working as before.

This first experimental version should bring only the minimum feature set: routing replica traffic through the router (dstack's gateway/nginx -> sglang-router -> replica workers). In the future, this may be extended with router-specific scaling metrics (such as ttft and e2e), Prefill-Decode Disaggregation, etc.

For this first experimental version, the most important thing is to find the minimal, thoroughly tested set of changes that allows embedding router: sglang without breaking any existing functionality.

Note:

  1. In this version, pip and sglang-router are installed on the gateway machine regardless of whether router: sglang is set in the gateway config. Making the installation conditional in the future would require changes in every backend that supports gateways.

  2. Modified the upstream block of src/dstack/_internal/proxy/gateway/resources/nginx/service.jinja2 to respect router: sglang in the gateway config.

upstream {{ domain }}.upstream {
    {% if router == "sglang" %}
    server 127.0.0.1:3000;  # SGLang router on the gateway
    {% else %}
    {% for replica in replicas %}
    server unix:{{ replica.socket }};  # replica {{ replica.id }}
    {% endfor %}
    {% endif %}
}
  3. Created a new nginx conf: src/dstack/_internal/proxy/gateway/resources/nginx/sglang_workers.jinja2

This nginx conf bridges HTTP to Unix sockets. dstack workers listen on Unix sockets, while sglang-router speaks HTTP, so each worker gets a local TCP port (127.0.0.1:10001, 127.0.0.1:10002, ...) that proxies to its socket. This lets the router reach every worker over HTTP.

# Worker 1
upstream sglang_worker_1_upstream {
    server unix:/tmp/tmpazynu7m5/replica.sock;
}

server {
    listen 127.0.0.1:10001;
    access_log off; # disable access logs for this internal endpoint

    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location / {
        proxy_pass http://sglang_worker_1_upstream;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Connection "";
        proxy_set_header Upgrade $http_upgrade;
    }
}

# Worker 2
upstream sglang_worker_2_upstream {
    server unix:/tmp/tmpazynu7m6/replica.sock;
}

server {
    listen 127.0.0.1:10002;
    access_log off; # disable access logs for this internal endpoint

    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location / {
        proxy_pass http://sglang_worker_2_upstream;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Connection "";
        proxy_set_header Upgrade $http_upgrade;
    }
}
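The rendering logic of the two templates above can be approximated in plain Python to see what is produced for a given set of replicas. This is a minimal sketch, not the PR's implementation: the `Replica` dataclass, the function names, and the base port 10000 (inferred from the 10001/10002 ports in the sample output) are assumptions, and the worker block is reduced to its essential directives.

```python
from dataclasses import dataclass
from typing import List, Optional

ROUTER_ADDR = "127.0.0.1:3000"  # router address from the service.jinja2 snippet
BASE_PORT = 10000  # assumption: worker N listens on BASE_PORT + N


@dataclass
class Replica:
    id: str
    socket: str  # path to the replica's Unix socket on the gateway


def render_upstream(domain: str, replicas: List[Replica], router: Optional[str]) -> str:
    """Mimic the upstream block of service.jinja2."""
    if router == "sglang":
        # All traffic goes to the router; replicas are not listed here.
        servers = [f"    server {ROUTER_ADDR};  # SGLang router on the gateway"]
    else:
        servers = [f"    server unix:{r.socket};  # replica {r.id}" for r in replicas]
    return "\n".join([f"upstream {domain}.upstream {{", *servers, "}"])


def render_worker_bridge(index: int, replica: Replica) -> str:
    """Mimic one worker entry of sglang_workers.jinja2 (TCP port -> Unix socket)."""
    name = f"sglang_worker_{index}_upstream"
    return "\n".join([
        f"# Worker {index}",
        f"upstream {name} {{",
        f"    server unix:{replica.socket};",
        "}",
        "",
        "server {",
        f"    listen 127.0.0.1:{BASE_PORT + index};",
        "    location / {",
        f"        proxy_pass http://{name};",
        "    }",
        "}",
    ])


def worker_urls(replicas: List[Replica]) -> List[str]:
    """URLs the router would be given, one per bridged replica."""
    return [f"http://127.0.0.1:{BASE_PORT + i}" for i, _ in enumerate(replicas, start=1)]
```

With two replicas, `worker_urls` yields the same two addresses that appear in the router logs later in this thread.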

@peterschmidt85 peterschmidt85 left a comment


  1. Add an example of how to test the new router
  2. Please ensure auto-scaling works (incl. downscaling to 0), and also that dstack uses the router's API to add/remove workers without restarting the gateway
  3. Only after that, refactor the code to move the sgl-router implementation into a separate SGLang-specific subclass, so that the normal gateway code contains no sgl-router-specific code, similar to how each backend encapsulates its own logic
  4. Ensure tests are working
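Adding and removing workers without a gateway restart (point 2) could look roughly like the sketch below. It rests on assumptions: that the installed router exposes `POST /add_worker?url=...` and `POST /remove_worker?url=...` (worth checking against the sgl-router version in use), and that the router listens on 127.0.0.1:3000 as in the PR description. `plan_worker_changes`, `build_worker_request`, and `sync_workers` are hypothetical helpers, not dstack code.

```python
import urllib.parse
import urllib.request
from typing import Iterable, List, Tuple

ROUTER_BASE = "http://127.0.0.1:3000"  # router address used in the PR description


def plan_worker_changes(
    current: Iterable[str], desired: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (to_add, to_remove) so that the router ends up with `desired`."""
    current_set, desired_set = set(current), set(desired)
    return sorted(desired_set - current_set), sorted(current_set - desired_set)


def build_worker_request(action: str, worker_url: str) -> urllib.request.Request:
    """Build (but do not send) a request to the assumed add/remove endpoint."""
    if action not in ("add_worker", "remove_worker"):
        raise ValueError(f"unsupported action: {action}")
    query = urllib.parse.urlencode({"url": worker_url})
    return urllib.request.Request(f"{ROUTER_BASE}/{action}?{query}", method="POST")


def sync_workers(current: Iterable[str], desired: Iterable[str]) -> None:
    """Apply the plan; scaling to zero simply removes every worker."""
    to_add, to_remove = plan_worker_changes(current, desired)
    for action, urls in (("add_worker", to_add), ("remove_worker", to_remove)):
        for url in urls:
            with urllib.request.urlopen(build_worker_request(action, url), timeout=10):
                pass  # a non-2xx response raises urllib.error.HTTPError
```

Keeping the diff computation separate from the HTTP calls makes the scale-up, scale-down, and scale-to-zero cases easy to unit test without a running router.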


Bihan commented Oct 16, 2025

@peterschmidt85

  1. Add an example of how to test the new router

Step 1
In the method get_dstack_gateway_wheel (exact path see here), replace the return value as shown in the example below.

Eg:

def get_dstack_gateway_wheel(build: str) -> str:
    channel = "release" if settings.DSTACK_RELEASE else "stgn"
    base_url = f"https://dstack-gateway-downloads.s3.amazonaws.com/{channel}"
    if build == "latest":
        r = requests.get(f"{base_url}/latest-version", timeout=5)
        r.raise_for_status()
        build = r.text.strip()
        logger.debug("Found the latest gateway build: %s", build)
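    # Original return value, disabled for testing:
    # return f"{base_url}/dstack_gateway-{build}-py3-none-any.whl"
    # Test-only override: use the wheel from the test bucket instead.
    return "https://bihan-test-bucket.s3.eu-west-1.amazonaws.com/dstack_gateway-0.0.0-py3-none-any.whl"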

Step 2

Apply the gateway config below.

type: gateway
name: sglang-gateway

backend: aws
region: eu-west-1

domain: example.com
router: sglang

Step 3
Update the DNS records for the gateway domain so that it points to the gateway.

Step 4

Apply the service config below.

type: service
name: sglang-service


image: lmsysorg/sglang:latest
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct

commands:
  - python -m sglang.launch_server --model-path $MODEL_ID --host 0.0.0.0 --port 8000 --enable-metrics

port: 8000
model: meta-llama/Llama-3.2-3B-Instruct

resources:
  gpu: 24GB

replicas: 2

Step 5
Once the /health endpoint returns 200, as shown in the logs below, the service is ready to serve requests.

Logs:

[2025-10-16 07:01:38] INFO:     Application startup complete.
[2025-10-16 07:01:38] INFO:     Uvicorn running on https://sglang-service.bihan-gateway.dstack.ai (Press CTRL+C to quit)
[2025-10-16 07:01:39] INFO:     127.0.0.1:3580 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-16 07:01:39] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-16 07:02:07] INFO:     127.0.0.1:3906 - "GET /health HTTP/1.1" 503 Service Unavailable
[2025-10-16 07:02:46] INFO:     127.0.0.1:3592 - "POST /generate HTTP/1.1" 200 OK
[2025-10-16 07:02:46] The server is fired up and ready to roll!
[2025-10-16 07:03:07] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-16 07:03:08] INFO:     127.0.0.1:3516 - "GET /health HTTP/1.1" 200 OK
[2025-10-16 07:03:08] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-16 07:03:09] INFO:     127.0.0.1:3790 - "GET /health HTTP/1.1" 200 OK
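The readiness check in this step (503 first, then 200 on /health) can be automated with a small polling loop. The sketch below is illustrative only: `http_status` and `wait_until_healthy` are hypothetical helpers, and the attempt count and interval are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(
    probe: Callable[[], int],
    attempts: int = 60,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            sleep(interval)  # no need to wait after the final attempt
    return False
```

Usage would be along the lines of `wait_until_healthy(lambda: http_status(url))`, where `url` is whichever /health endpoint is reachable from the machine running the check; this thread does not specify that URL.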

Step 6
You can then either use the dstack frontend at http://localhost:3000/projects/main/models/sglang-service
or query the service from the terminal:

curl https://sglang-service.example.com/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer <token>" \
  --data '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is Deep Learning?"
      }
    ]
  }'
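The same request can be sent from Python using only the standard library. This is a sketch mirroring the curl command above; `build_chat_request` and `send_chat_request` are hypothetical helper names, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, token: str, model: str, user_prompt: str
) -> urllib.request.Request:
    """Build the POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `send_chat_request(build_chat_request("https://sglang-service.example.com", "<token>", "meta-llama/Llama-3.2-3B-Instruct", "What is Deep Learning?"))` corresponds to the curl call, with `<token>` replaced by a real token.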

Note: You can check the sglang-router logs on the gateway: cat ~/dstack/router_logs/sgl-router.

Also, in the future we could show the sglang-router logs instead of the replica logs in the dstack CLI.

Eg:

sglang-service provisioning completed (running)
Service is published at:
  https://sglang-service.bihan-gateway.dstack.ai
Model meta-llama/Llama-3.2-3B-Instruct is published at:
  https://gateway.bihan-gateway.dstack.ai


2025-10-16 06:59:05  INFO sglang_router_rs::core::worker_manager: src/core/worker_manager.rs:1077: Waiting for 2 workers to become healthy. Unhealthy: ["http://127.0.0.1:10002", "http://127.0.0.1:10001"]
...
...
2025-10-16 07:03:08  INFO sglang_router_rs::core::worker_manager: src/core/worker_manager.rs:1111: All 2 workers are healthy: ["http://127.0.0.1:10002", "http://127.0.0.1:10001"]
...
...
2025-10-16 07:03:08  INFO sglang_router_rs::server: src/server.rs:1066: Router ready | workers: ["http://127.0.0.1:10002", "http://127.0.0.1:10001"]
2025-10-16 07:03:08  INFO sglang_router_rs::server: src/server.rs:1094: Starting server on 0.0.0.0:3000 
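If these router logs were ever consumed programmatically, the "All N workers are healthy" line could be parsed as sketched below. This assumes the log format stays exactly as in the sample above, which may not hold across sgl-router versions, and that the bracketed worker list is valid JSON (true for the plain URLs shown). `healthy_workers` is a hypothetical helper.

```python
import json
import re
from typing import List, Optional

# Pattern taken from the sample log line above; treat it as an assumption.
_HEALTHY_RE = re.compile(r"All (\d+) workers are healthy: (\[.*\])")


def healthy_workers(line: str) -> Optional[List[str]]:
    """Return the worker URLs from an 'All N workers are healthy' line, else None."""
    match = _HEALTHY_RE.search(line)
    if match is None:
        return None
    workers = json.loads(match.group(2))
    if len(workers) != int(match.group(1)):
        return None  # the reported count and the list disagree; do not trust the line
    return workers
```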


Bihan commented Oct 20, 2025

Will send a new PR

@Bihan Bihan closed this Oct 20, 2025