From 19fd13a82c76ba1d1215739d5a14b33cc3702749 Mon Sep 17 00:00:00 2001 From: JaredforReal Date: Mon, 22 Sep 2025 09:06:56 +0800 Subject: [PATCH 1/9] Fix healthcheck curl missing & Implement testing profile Signed-off-by: JaredforReal --- .github/copilot-instructions.md | 88 ++++++++++++++++++++++++++++++++ Dockerfile.extproc | 10 +++- config/config.testing.yaml | 84 ++++++++++++++++++++++++++++++ docker-compose.yml | 19 +++++++ scripts/entrypoint.sh | 12 +++++ tools/mock-vllm/Dockerfile | 16 ++++++ tools/mock-vllm/README.md | 9 ++++ tools/mock-vllm/app.py | 45 ++++++++++++++++ tools/mock-vllm/requirements.txt | 3 ++ 9 files changed, 285 insertions(+), 1 deletion(-) create mode 100644 .github/copilot-instructions.md create mode 100644 config/config.testing.yaml create mode 100644 scripts/entrypoint.sh create mode 100644 tools/mock-vllm/Dockerfile create mode 100644 tools/mock-vllm/README.md create mode 100644 tools/mock-vllm/app.py create mode 100644 tools/mock-vllm/requirements.txt diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 00000000..931ecbde --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,88 @@ +# Copilot Agent Instructions — vLLM Semantic Router + +Purpose: help AI coding agents work effectively in this repo by knowing the architecture, conventions, and non-obvious workflows. + +## Big picture + +- This is a Mixture-of-Models router for LLM requests with: Envoy External Processing (gRPC) for request routing, classification (intent/PII/security), semantic similarity caching, and tool auto-selection. +- Primary implementation is Go with a Rust ML binding (HuggingFace Candle) via CGO for embeddings/similarity. A small HTTP Classification API is exposed alongside the gRPC extproc server. + +## Core components (key files) + +- Entry point: `src/semantic-router/cmd/main.go` (starts gRPC extproc, Classification API, and Prometheus metrics) +- Envoy ExtProc server: `src/semantic-router/pkg/extproc/` (stream handlers, routing logic, request/response transforms) +- Configuration: `config/config.yaml` (routing categories, model_config, reasoning families, semantic cache backend, vLLM endpoints, tools DB, classifiers) +- Classification API: `src/semantic-router/pkg/api/server.go` (e.g., POST `/api/v1/classify/intent|pii|security|batch`) +- Config loader/utilities: `src/semantic-router/pkg/config/` (hot-reload support, endpoint selection, policy helpers) +- Cache backends: `src/semantic-router/pkg/cache/` (in-memory or Milvus; compile-time tag `milvus`) +- Tools database: `src/semantic-router/pkg/tools/` (semantic tool selection) +- Candle Rust binding (CGO): `candle-binding/` (builds native lib used for similarity) +- Tests: Go unit/integration under `src/semantic-router/pkg/**`, e2e in `e2e-tests/`, research/bench suite in `bench/` + +## How things talk to each other + +1. Client → Envoy → gRPC ExtProc (`extproc.Server`) → Router selects model/tools/reasoning and edits OpenAI-compatible request → forwards to chosen vLLM endpoint. +2. Router uses Candle embeddings for similarity cache and tool selection. +3. Classification uses either legacy ModernBERT models or auto-discovered LoRA unified classifiers (services initialize a global ClassificationService). +4. Config changes are hot-reloaded (fsnotify) without restarting the gRPC server. 
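+
+Step 4 above can be sketched as a minimal watch-and-reload loop with `fsnotify` (illustrative only; `watchConfig` and `reload` are placeholder names, and the repo's real pattern is `Server.watchConfigAndReload`, noted under Conventions below):
+
+```go
+package main
+
+import (
+	"log"
+
+	"github.com/fsnotify/fsnotify"
+)
+
+// watchConfig re-runs reload whenever the config file changes; the gRPC
+// server keeps serving throughout.
+func watchConfig(path string, reload func(string) error) error {
+	w, err := fsnotify.NewWatcher()
+	if err != nil {
+		return err
+	}
+	defer w.Close()
+	if err := w.Add(path); err != nil {
+		return err
+	}
+	for {
+		select {
+		case ev := <-w.Events:
+			// Editors often replace the file, so treat create/rename like write.
+			if ev.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Rename) != 0 {
+				if err := reload(path); err != nil {
+					// On a bad config, keep the previous router instead of crashing.
+					log.Printf("config reload failed: %v", err)
+				}
+			}
+		case err := <-w.Errors:
+			log.Printf("watch error: %v", err)
+		}
+	}
+}
+
+func main() {
+	_ = watchConfig("config/config.yaml", func(p string) error {
+		log.Printf("would rebuild the router from %s", p)
+		return nil
+	})
+}
+```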
+ +## Build / run workflows (non-obvious bits) + +- Makefile orchestrates sub-makefiles under `tools/make/` + - Build router (also builds Rust lib): `make build-router` + - Run router with config: `CONFIG_FILE=config/config.yaml make run-router` + - Run Envoy (installs func-e if missing): `make run-envoy` + - Download local models from HF Hub: `make download-models` (uses `hf download` CLI) +- Dynamic library path on macOS: prefer `DYLD_LIBRARY_PATH` to point to `candle-binding/target/release`; Linux uses `LD_LIBRARY_PATH`. The Makefile sets `LD_LIBRARY_PATH`—on macOS set `DYLD_LIBRARY_PATH` in zsh if needed. +- Ports: gRPC extproc `:50051` (flag `-port`), Classification API `:8080` (`-api-port`), Prometheus `:9190` (`-metrics-port`). +- Docker: `docker-compose.yml` spins up router + Envoy (+ optional testing profile). + +Example (zsh): + +```sh +# Build native lib + router +make build-router + +# If macOS, ensure Candle dylib is discoverable for CGO +export DYLD_LIBRARY_PATH="$PWD/candle-binding/target/release:$DYLD_LIBRARY_PATH" + +# Run router with the default config and metrics +CONFIG_FILE=config/config.yaml make run-router + +# Run Envoy (separate terminal) +make run-envoy +``` + +## Configuration patterns (edit `config/config.yaml`) + +- `categories[]` with per-category `model_scores` and reasoning flags drive model selection; `default_model` is the fallback. +- `model_config` + `reasoning_families` normalize “reasoning mode” syntax across model families (e.g., deepseek, qwen3, gpt-oss). Use `GetModelReasoningFamily()` helpers, don’t hardcode. +- `semantic_cache`: `backend_type: memory|milvus`, `similarity_threshold`, `ttl_seconds`. For Milvus, run `make start-milvus` and test with `-tags=milvus`. +- `tools`: enable semantic tool selection via `tools_db_path` (JSON), `top_k`, and threshold (defaults to BERT threshold if unset). +- `classifier`: paths to ModernBERT/LoRA models and mapping jsons; batch endpoint requires unified classifier to be available. +- `vllm_endpoints[]`: list models per endpoint; selection respects per-model `preferred_endpoints` and weights. + +## Testing + +- Go vet and tidy: `make vet` and `make check-go-mod-tidy` +- Unit tests (Go): `make test-semantic-router` (set `SKIP_MILVUS_TESTS=false` to include Milvus) or `go test -v ./...` under `src/semantic-router` +- Milvus-specific: `make test-milvus-cache` or `make test-semantic-router-milvus` (uses `-tags=milvus`) +- E2E Python tests: see `e2e-tests/README.md` (requires router+envoy running) +- Quick cURL demos: `make test-auto-prompt-reasoning`, `test-pii`, `test-tools` (hits Envoy at `http://localhost:8801/v1/chat/completions` with `model: "auto"`) + +## Conventions & tips for contributors (agents) + +- Use config accessors from `pkg/config` (e.g., endpoint selection, PII policies). Avoid duplicating selection logic. +- Prefer `services.*ClassificationService` APIs for classification; a global service may be set by auto-discovery. +- Respect streaming in ExtProc handlers and record metrics via `pkg/metrics`. +- Keep hot-reload safe: re-create `OpenAIRouter` on config changes using `Server.watchConfigAndReload` pattern. +- When adding cache/tool logic, use existing interfaces: `cache.CacheBackend`, `tools.ToolsDatabase`. 
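+
+A quick way to exercise the Classification API on the ports listed above (a sketch; the request body shape here is an assumption, check `src/semantic-router/pkg/api/server.go` for the actual schema):
+
+```sh
+# Intent classification against the HTTP API (default :8080)
+curl -s http://localhost:8080/api/v1/classify/intent \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Write a function that reverses a linked list"}'
+```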
+ +References + +- Router main: `src/semantic-router/cmd/main.go` +- ExtProc: `src/semantic-router/pkg/extproc/` +- Config: `config/config.yaml`, helpers in `src/semantic-router/pkg/config/` +- Candle binding: `candle-binding/` +- Bench: `bench/` (CLI and plots) +- Docs site: `website/` (Docusaurus) diff --git a/Dockerfile.extproc b/Dockerfile.extproc index 1ba8b45e..5925d00c 100644 --- a/Dockerfile.extproc +++ b/Dockerfile.extproc @@ -54,5 +54,13 @@ COPY config/config.yaml /app/config/ ENV LD_LIBRARY_PATH=/app/lib EXPOSE 50051 +# Install curl for healthchecks and basic diagnostics +RUN dnf -y update && \ + dnf -y install curl && \ + dnf clean all -CMD ["/app/extproc-server", "--config", "/app/config/config.yaml"] +# Copy entrypoint to allow switching config via env var CONFIG_FILE +COPY scripts/entrypoint.sh /app/entrypoint.sh +RUN chmod +x /app/entrypoint.sh + +ENTRYPOINT ["/app/entrypoint.sh"] diff --git a/config/config.testing.yaml b/config/config.testing.yaml new file mode 100644 index 00000000..0b84e0ff --- /dev/null +++ b/config/config.testing.yaml @@ -0,0 +1,84 @@ +bert_model: + model_id: sentence-transformers/all-MiniLM-L12-v2 + threshold: 0.6 + use_cpu: true + +semantic_cache: + enabled: true + backend_type: "memory" + similarity_threshold: 0.8 + max_entries: 1000 + ttl_seconds: 3600 + eviction_policy: "fifo" + +tools: + enabled: true + top_k: 3 + similarity_threshold: 0.2 + tools_db_path: "config/tools_db.json" + fallback_to_empty: true + +prompt_guard: + enabled: true + use_modernbert: true + model_id: "models/jailbreak_classifier_modernbert-base_model" + threshold: 0.7 + use_cpu: true + jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json" + +vllm_endpoints: + - name: "mock" + address: "mock-vllm" + port: 8000 + models: + - "openai/gpt-oss-20b" + weight: 1 + health_check_path: "/health" + +model_config: + "openai/gpt-oss-20b": + reasoning_family: "gpt-oss" + preferred_endpoints: ["mock"] + pii_policy: + allow_by_default: true + +categories: + - name: other + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false + +default_model: openai/gpt-oss-20b + +reasoning_families: + deepseek: + type: "chat_template_kwargs" + parameter: "thinking" + + qwen3: + type: "chat_template_kwargs" + parameter: "enable_thinking" + + gpt-oss: + type: "reasoning_effort" + parameter: "reasoning_effort" + gpt: + type: "reasoning_effort" + parameter: "reasoning_effort" + +default_reasoning_effort: high + +api: + batch_classification: + max_batch_size: 100 + concurrency_threshold: 5 + max_concurrency: 8 + metrics: + enabled: true + detailed_goroutine_tracking: true + high_resolution_timing: false + sample_rate: 1.0 + duration_buckets: + [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30] + size_buckets: [1, 2, 5, 10, 20, 50, 100, 200] diff --git a/docker-compose.yml b/docker-compose.yml index 09f7b9ad..afc7e7e1 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -13,6 +13,7 @@ services: - ./models:/app/models:ro environment: - LD_LIBRARY_PATH=/app/lib + - CONFIG_FILE=${CONFIG_FILE:-/app/config/config.yaml} networks: - semantic-network healthcheck: @@ -44,6 +45,24 @@ services: retries: 5 start_period: 10s + # Mock vLLM service for testing profile + mock-vllm: + build: + context: ./tools/mock-vllm + dockerfile: Dockerfile + container_name: mock-vllm + profiles: ["testing"] + ports: + - "8000:8000" + networks: + - semantic-network + healthcheck: + test: ["CMD", "curl", "-fsS", "http://localhost:8000/health"] 
+ interval: 10s + timeout: 5s + retries: 5 + start_period: 5s + networks: semantic-network: driver: bridge diff --git a/scripts/entrypoint.sh b/scripts/entrypoint.sh new file mode 100644 index 00000000..c0b4093a --- /dev/null +++ b/scripts/entrypoint.sh @@ -0,0 +1,12 @@ +#!/usr/bin/env bash +set -euo pipefail + +CONFIG_FILE_PATH=${CONFIG_FILE:-/app/config/config.yaml} + +if [[ ! -f "$CONFIG_FILE_PATH" ]]; then + echo "[entrypoint] Config file not found at $CONFIG_FILE_PATH" >&2 + exit 1 +fi + +echo "[entrypoint] Starting semantic-router with config: $CONFIG_FILE_PATH" +exec /app/extproc-server --config "$CONFIG_FILE_PATH" diff --git a/tools/mock-vllm/Dockerfile b/tools/mock-vllm/Dockerfile new file mode 100644 index 00000000..a3287059 --- /dev/null +++ b/tools/mock-vllm/Dockerfile @@ -0,0 +1,16 @@ +FROM python:3.11-slim + +WORKDIR /app + +RUN apt-get update && apt-get install -y --no-install-recommends \ + curl \ + && rm -rf /var/lib/apt/lists/* + +COPY requirements.txt /app/requirements.txt +RUN pip install --no-cache-dir -r /app/requirements.txt + +COPY app.py /app/app.py + +EXPOSE 8000 + +CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"] diff --git a/tools/mock-vllm/README.md b/tools/mock-vllm/README.md new file mode 100644 index 00000000..1ac7a9b8 --- /dev/null +++ b/tools/mock-vllm/README.md @@ -0,0 +1,9 @@ +# Mock vLLM (OpenAI-compatible) service + +A tiny FastAPI server that emulates minimal endpoints used by the router: + +- GET /health +- GET /v1/models +- POST /v1/chat/completions + +Intended for local testing with Docker Compose profile `testing`. diff --git a/tools/mock-vllm/app.py b/tools/mock-vllm/app.py new file mode 100644 index 00000000..c991c76f --- /dev/null +++ b/tools/mock-vllm/app.py @@ -0,0 +1,45 @@ +from fastapi import FastAPI +from pydantic import BaseModel +from typing import List, Optional + +app = FastAPI() + + +class ChatMessage(BaseModel): + role: str + content: str + + +class ChatRequest(BaseModel): + model: str + messages: List[ChatMessage] + temperature: Optional[float] = 0.2 + + +@app.get("/health") +async def health(): + return {"status": "ok"} + + +@app.get("/v1/models") +async def models(): + return {"data": [{"id": "openai/gpt-oss-20b", "object": "model"}]} + + +@app.post("/v1/chat/completions") +async def chat_completions(req: ChatRequest): + # Very simple echo-like behavior + last_user = next((m.content for m in reversed(req.messages) if m.role == "user"), "") + content = f"[mock-{req.model}] You said: {last_user}" + return { + "id": "cmpl-mock-123", + "object": "chat.completion", + "model": req.model, + "choices": [ + { + "index": 0, + "message": {"role": "assistant", "content": content}, + "finish_reason": "stop", + } + ], + } diff --git a/tools/mock-vllm/requirements.txt b/tools/mock-vllm/requirements.txt new file mode 100644 index 00000000..3971515d --- /dev/null +++ b/tools/mock-vllm/requirements.txt @@ -0,0 +1,3 @@ +fastapi==0.115.0 +uvicorn==0.30.6 +pydantic==2.9.2 From 61fb53a3a499229e86cc508ae8f16117ebf02bdd Mon Sep 17 00:00:00 2001 From: JaredforReal Date: Mon, 22 Sep 2025 10:47:38 +0800 Subject: [PATCH 2/9] fix pre-commit error Signed-off-by: JaredforReal --- .github/copilot-instructions.md | 88 --------------------------------- tools/mock-vllm/app.py | 7 ++- 2 files changed, 5 insertions(+), 90 deletions(-) delete mode 100644 .github/copilot-instructions.md diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md deleted file mode 100644 index 931ecbde..00000000 --- a/.github/copilot-instructions.md 
+++ /dev/null @@ -1,88 +0,0 @@ -# Copilot Agent Instructions — vLLM Semantic Router - -Purpose: help AI coding agents work effectively in this repo by knowing the architecture, conventions, and non-obvious workflows. - -## Big picture - -- This is a Mixture-of-Models router for LLM requests with: Envoy External Processing (gRPC) for request routing, classification (intent/PII/security), semantic similarity caching, and tool auto-selection. -- Primary implementation is Go with a Rust ML binding (HuggingFace Candle) via CGO for embeddings/similarity. A small HTTP Classification API is exposed alongside the gRPC extproc server. - -## Core components (key files) - -- Entry point: `src/semantic-router/cmd/main.go` (starts gRPC extproc, Classification API, and Prometheus metrics) -- Envoy ExtProc server: `src/semantic-router/pkg/extproc/` (stream handlers, routing logic, request/response transforms) -- Configuration: `config/config.yaml` (routing categories, model_config, reasoning families, semantic cache backend, vLLM endpoints, tools DB, classifiers) -- Classification API: `src/semantic-router/pkg/api/server.go` (e.g., POST `/api/v1/classify/intent|pii|security|batch`) -- Config loader/utilities: `src/semantic-router/pkg/config/` (hot-reload support, endpoint selection, policy helpers) -- Cache backends: `src/semantic-router/pkg/cache/` (in-memory or Milvus; compile-time tag `milvus`) -- Tools database: `src/semantic-router/pkg/tools/` (semantic tool selection) -- Candle Rust binding (CGO): `candle-binding/` (builds native lib used for similarity) -- Tests: Go unit/integration under `src/semantic-router/pkg/**`, e2e in `e2e-tests/`, research/bench suite in `bench/` - -## How things talk to each other - -1. Client → Envoy → gRPC ExtProc (`extproc.Server`) → Router selects model/tools/reasoning and edits OpenAI-compatible request → forwards to chosen vLLM endpoint. -2. Router uses Candle embeddings for similarity cache and tool selection. -3. Classification uses either legacy ModernBERT models or auto-discovered LoRA unified classifiers (services initialize a global ClassificationService). -4. Config changes are hot-reloaded (fsnotify) without restarting the gRPC server. - -## Build / run workflows (non-obvious bits) - -- Makefile orchestrates sub-makefiles under `tools/make/` - - Build router (also builds Rust lib): `make build-router` - - Run router with config: `CONFIG_FILE=config/config.yaml make run-router` - - Run Envoy (installs func-e if missing): `make run-envoy` - - Download local models from HF Hub: `make download-models` (uses `hf download` CLI) -- Dynamic library path on macOS: prefer `DYLD_LIBRARY_PATH` to point to `candle-binding/target/release`; Linux uses `LD_LIBRARY_PATH`. The Makefile sets `LD_LIBRARY_PATH`—on macOS set `DYLD_LIBRARY_PATH` in zsh if needed. -- Ports: gRPC extproc `:50051` (flag `-port`), Classification API `:8080` (`-api-port`), Prometheus `:9190` (`-metrics-port`). -- Docker: `docker-compose.yml` spins up router + Envoy (+ optional testing profile). 
- -Example (zsh): - -```sh -# Build native lib + router -make build-router - -# If macOS, ensure Candle dylib is discoverable for CGO -export DYLD_LIBRARY_PATH="$PWD/candle-binding/target/release:$DYLD_LIBRARY_PATH" - -# Run router with the default config and metrics -CONFIG_FILE=config/config.yaml make run-router - -# Run Envoy (separate terminal) -make run-envoy -``` - -## Configuration patterns (edit `config/config.yaml`) - -- `categories[]` with per-category `model_scores` and reasoning flags drive model selection; `default_model` is the fallback. -- `model_config` + `reasoning_families` normalize “reasoning mode” syntax across model families (e.g., deepseek, qwen3, gpt-oss). Use `GetModelReasoningFamily()` helpers, don’t hardcode. -- `semantic_cache`: `backend_type: memory|milvus`, `similarity_threshold`, `ttl_seconds`. For Milvus, run `make start-milvus` and test with `-tags=milvus`. -- `tools`: enable semantic tool selection via `tools_db_path` (JSON), `top_k`, and threshold (defaults to BERT threshold if unset). -- `classifier`: paths to ModernBERT/LoRA models and mapping jsons; batch endpoint requires unified classifier to be available. -- `vllm_endpoints[]`: list models per endpoint; selection respects per-model `preferred_endpoints` and weights. - -## Testing - -- Go vet and tidy: `make vet` and `make check-go-mod-tidy` -- Unit tests (Go): `make test-semantic-router` (set `SKIP_MILVUS_TESTS=false` to include Milvus) or `go test -v ./...` under `src/semantic-router` -- Milvus-specific: `make test-milvus-cache` or `make test-semantic-router-milvus` (uses `-tags=milvus`) -- E2E Python tests: see `e2e-tests/README.md` (requires router+envoy running) -- Quick cURL demos: `make test-auto-prompt-reasoning`, `test-pii`, `test-tools` (hits Envoy at `http://localhost:8801/v1/chat/completions` with `model: "auto"`) - -## Conventions & tips for contributors (agents) - -- Use config accessors from `pkg/config` (e.g., endpoint selection, PII policies). Avoid duplicating selection logic. -- Prefer `services.*ClassificationService` APIs for classification; a global service may be set by auto-discovery. -- Respect streaming in ExtProc handlers and record metrics via `pkg/metrics`. -- Keep hot-reload safe: re-create `OpenAIRouter` on config changes using `Server.watchConfigAndReload` pattern. -- When adding cache/tool logic, use existing interfaces: `cache.CacheBackend`, `tools.ToolsDatabase`. 
- -References - -- Router main: `src/semantic-router/cmd/main.go` -- ExtProc: `src/semantic-router/pkg/extproc/` -- Config: `config/config.yaml`, helpers in `src/semantic-router/pkg/config/` -- Candle binding: `candle-binding/` -- Bench: `bench/` (CLI and plots) -- Docs site: `website/` (Docusaurus) diff --git a/tools/mock-vllm/app.py b/tools/mock-vllm/app.py index c991c76f..c806f961 100644 --- a/tools/mock-vllm/app.py +++ b/tools/mock-vllm/app.py @@ -1,6 +1,7 @@ +from typing import List, Optional + from fastapi import FastAPI from pydantic import BaseModel -from typing import List, Optional app = FastAPI() @@ -29,7 +30,9 @@ async def models(): @app.post("/v1/chat/completions") async def chat_completions(req: ChatRequest): # Very simple echo-like behavior - last_user = next((m.content for m in reversed(req.messages) if m.role == "user"), "") + last_user = next( + (m.content for m in reversed(req.messages) if m.role == "user"), "" + ) content = f"[mock-{req.model}] You said: {last_user}" return { "id": "cmpl-mock-123", From 46867844fb45a6e9081b8103cbec63bdb1dde5ed Mon Sep 17 00:00:00 2001 From: JaredforReal Date: Mon, 22 Sep 2025 11:11:59 +0800 Subject: [PATCH 3/9] Added usage fields and metadata to chat_completions Signed-off-by: JaredforReal --- tools/mock-vllm/app.py | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/tools/mock-vllm/app.py b/tools/mock-vllm/app.py index c806f961..e4d02d15 100644 --- a/tools/mock-vllm/app.py +++ b/tools/mock-vllm/app.py @@ -1,3 +1,5 @@ +import math +import time from typing import List, Optional from fastapi import FastAPI @@ -34,15 +36,46 @@ async def chat_completions(req: ChatRequest): (m.content for m in reversed(req.messages) if m.role == "user"), "" ) content = f"[mock-{req.model}] You said: {last_user}" + + # Rough token estimation: ~1 token per 4 characters (ceil) + def estimate_tokens(text: str) -> int: + if not text: + return 0 + return max(1, math.ceil(len(text) / 4)) + + prompt_text = "\n".join( + m.content for m in req.messages if isinstance(m.content, str) + ) + prompt_tokens = estimate_tokens(prompt_text) + completion_tokens = estimate_tokens(content) + total_tokens = prompt_tokens + completion_tokens + + created_ts = int(time.time()) + + usage = { + "prompt_tokens": prompt_tokens, + "completion_tokens": completion_tokens, + "total_tokens": total_tokens, + # Optional details fields some clients read when using caching/reasoning + "prompt_tokens_details": {"cached_tokens": 0}, + "completion_tokens_details": {"reasoning_tokens": 0}, + } + return { "id": "cmpl-mock-123", "object": "chat.completion", + "created": created_ts, "model": req.model, + "system_fingerprint": "mock-vllm", "choices": [ { "index": 0, "message": {"role": "assistant", "content": content}, "finish_reason": "stop", + "logprobs": None, } ], + "usage": usage, + # Some SDKs look for token_usage; keep it as an alias for convenience. 
+ "token_usage": usage, } From f8a1703ec764a4a7007b39f8afb33dc59aa280fc Mon Sep 17 00:00:00 2001 From: JaredforReal Date: Mon, 22 Sep 2025 15:21:10 +0800 Subject: [PATCH 4/9] remove curl install & add mirrors for CN users Signed-off-by: JaredforReal --- Dockerfile.extproc | 13 +++++++++---- docker-compose.yml | 5 +++++ tools/mock-vllm/Dockerfile | 4 ++++ 3 files changed, 18 insertions(+), 4 deletions(-) diff --git a/Dockerfile.extproc b/Dockerfile.extproc index 5925d00c..89e66ada 100644 --- a/Dockerfile.extproc +++ b/Dockerfile.extproc @@ -24,11 +24,20 @@ FROM golang:1.24 as go-builder WORKDIR /app +# Use China-friendly Go module mirrors to avoid proxy.golang.org timeouts +ENV GOPROXY=https://goproxy.cn,direct +# Prefer a reachable checksum database in CN (or set to 'off' if still blocked) +ENV GOSUMDB=sum.golang.google.cn + # Copy Go module files first for better layer caching RUN mkdir -p src/semantic-router COPY src/semantic-router/go.mod src/semantic-router/go.sum src/semantic-router/ COPY candle-binding/go.mod candle-binding/semantic-router.go candle-binding/ +# Pre-download Go modules to leverage Docker layer caching and fail fast if mirrors are unreachable +RUN cd src/semantic-router && go mod download && \ + cd /app/candle-binding && go mod download + # Copy semantic-router source code COPY src/semantic-router/ src/semantic-router/ @@ -54,10 +63,6 @@ COPY config/config.yaml /app/config/ ENV LD_LIBRARY_PATH=/app/lib EXPOSE 50051 -# Install curl for healthchecks and basic diagnostics -RUN dnf -y update && \ - dnf -y install curl && \ - dnf clean all # Copy entrypoint to allow switching config via env var CONFIG_FILE COPY scripts/entrypoint.sh /app/entrypoint.sh diff --git a/docker-compose.yml b/docker-compose.yml index afc7e7e1..7f38cab4 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -11,9 +11,14 @@ services: volumes: - ./config:/app/config:ro - ./models:/app/models:ro + # - ~/.cache/huggingface:/root/.cache/huggingface # uncomment to persist Hugging Face cache on host (CN users) environment: - LD_LIBRARY_PATH=/app/lib - CONFIG_FILE=${CONFIG_FILE:-/app/config/config.yaml} + # The following environment variables help CN mainland users download Hugging Face models via mirrors + # - HF_HUB_ENABLE_HF_TRANSFER=1 # uncomment to enable fast transfer for HF downloads (CN users) + # - HF_ENDPOINT=https://hf-mirror.com # uncomment to use HF mirror endpoint in China + # - HUGGINGFACE_HUB_CACHE=/root/.cache/huggingface # uncomment to set HF cache directory (works with volume above) networks: - semantic-network healthcheck: diff --git a/tools/mock-vllm/Dockerfile b/tools/mock-vllm/Dockerfile index a3287059..3a7e812c 100644 --- a/tools/mock-vllm/Dockerfile +++ b/tools/mock-vllm/Dockerfile @@ -6,6 +6,10 @@ RUN apt-get update && apt-get install -y --no-install-recommends \ curl \ && rm -rf /var/lib/apt/lists/* +# Uncomment to Configure pip to use a China mirror for faster installs +# RUN python -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple && \ +# python -m pip config set global.trusted-host pypi.tuna.tsinghua.edu.cn + COPY requirements.txt /app/requirements.txt RUN pip install --no-cache-dir -r /app/requirements.txt From 8ba24aee8ec88d2a178fef220ec6a58adadbbe36 Mon Sep 17 00:00:00 2001 From: JaredforReal Date: Mon, 22 Sep 2025 16:04:29 +0800 Subject: [PATCH 5/9] Update docker quick start doc & comment config for CN user Signed-off-by: JaredforReal --- Dockerfile.extproc | 4 +- .../docs/getting-started/docker-quickstart.md | 140 ++++++++++++------ 2 files 
changed, 99 insertions(+), 45 deletions(-) diff --git a/Dockerfile.extproc b/Dockerfile.extproc index 89e66ada..652e6f89 100644 --- a/Dockerfile.extproc +++ b/Dockerfile.extproc @@ -25,9 +25,9 @@ FROM golang:1.24 as go-builder WORKDIR /app # Use China-friendly Go module mirrors to avoid proxy.golang.org timeouts -ENV GOPROXY=https://goproxy.cn,direct +# ENV GOPROXY=https://goproxy.cn,direct # Prefer a reachable checksum database in CN (or set to 'off' if still blocked) -ENV GOSUMDB=sum.golang.google.cn +# ENV GOSUMDB=sum.golang.google.cn # Copy Go module files first for better layer caching RUN mkdir -p src/semantic-router diff --git a/website/docs/getting-started/docker-quickstart.md b/website/docs/getting-started/docker-quickstart.md index e06bed44..7eae6e59 100644 --- a/website/docs/getting-started/docker-quickstart.md +++ b/website/docs/getting-started/docker-quickstart.md @@ -6,40 +6,40 @@ Run Semantic Router + Envoy locally using Docker Compose v2. - Docker Engine and Docker Compose v2 (use the `docker compose` command, not the legacy `docker-compose`) - ```bash - # Verify - docker compose version - ``` + ```bash + # Verify + docker compose version + ``` - Install Docker Compose v2 for Ubuntu(if missing), see more in [Docker Compose Plugin Installation](https://docs.docker.com/compose/install/linux/#install-using-the-repository) + Install Docker Compose v2 for Ubuntu(if missing), see more in [Docker Compose Plugin Installation](https://docs.docker.com/compose/install/linux/#install-using-the-repository) - ```bash - # Remove legacy v1 if present (optional) - sudo apt-get remove -y docker-compose || true + ```bash + # Remove legacy v1 if present (optional) + sudo apt-get remove -y docker-compose || true - sudo apt-get update - sudo apt-get install -y ca-certificates curl gnupg - sudo install -m 0755 -d /etc/apt/keyrings - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --yes --dearmor -o /etc/apt/keyrings/docker.gpg - echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null - sudo apt-get update - sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin + sudo apt-get update + sudo apt-get install -y ca-certificates curl gnupg + sudo install -m 0755 -d /etc/apt/keyrings + curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --yes --dearmor -o /etc/apt/keyrings/docker.gpg + echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null + sudo apt-get update + sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin - docker compose version - ``` + docker compose version + ``` - Ensure ports 8801, 50051, 19000 are free ## Install and Run with Docker Compose v2 -1) Clone the repo and move into it (from your workspace root): +1. Clone the repo and move into it (from your workspace root): ```bash git clone https://github.com/vllm-project/semantic-router.git cd semantic-router ``` -2) Download required models (classification models): +2. 
Download required models (classification models): ```bash make download-models @@ -53,7 +53,7 @@ This downloads the classification models used by the router: Note: The BERT similarity model defaults to a remote Hugging Face model. See Troubleshooting for offline/local usage. -3) Start the services with Docker Compose v2: +3. Start the services with Docker Compose v2: ```bash # Start core services (semantic-router + envoy) @@ -62,11 +62,12 @@ docker compose up --build # Or run in background (recommended) docker compose up --build -d -# With testing profile (includes mock vLLM) -docker compose --profile testing up --build +# With testing profile (includes mock vLLM). Use testing config to point router at the mock endpoint: +# (CONFIG_FILE is read by the router entrypoint; the file is mounted from ./config) +CONFIG_FILE=/app/config/config.testing.yaml docker compose --profile testing up --build ``` -4) Verify +4. Verify - Semantic Router (gRPC): localhost:50051 - Envoy Proxy: http://localhost:8801 @@ -90,7 +91,7 @@ docker compose down ## Troubleshooting -### 1) Router exits immediately with a Hugging Face DNS/download error +** 1. Router exits immediately with a Hugging Face DNS/download error ** Symptoms (from `docker compose logs -f semantic-router`): @@ -103,32 +104,85 @@ Why: `bert_model.model_id` in `config/config.yaml` points to a remote model (`se Fix options: - Allow network access in the container (online): + - Ensure your host can resolve DNS, or add DNS servers to the `semantic-router` service in `docker-compose.yml`: - ```yaml - services: - semantic-router: - # ... - dns: - - 1.1.1.1 - - 8.8.8.8 - ``` - + ```yaml + services: + semantic-router: + # ... + dns: + - 1.1.1.1 + - 8.8.8.8 + ``` + - If behind a proxy, set `http_proxy/https_proxy/no_proxy` env vars for the service. - Use a local copy of the model (offline): - 1. Download `sentence-transformers/all-MiniLM-L12-v2` to `./models/sentence-transformers/all-MiniLM-L12-v2/` on the host. - 2. Update `config/config.yaml` to use the local path (mounted into the container at `/app/models`): - ```yaml - bert_model: - model_id: "models/sentence-transformers/all-MiniLM-L12-v2" - threshold: 0.6 - use_cpu: true - ``` + 1. Download `sentence-transformers/all-MiniLM-L12-v2` to `./models/sentence-transformers/all-MiniLM-L12-v2/` on the host. + 2. Update `config/config.yaml` to use the local path (mounted into the container at `/app/models`): + + ```yaml + bert_model: + model_id: "models/sentence-transformers/all-MiniLM-L12-v2" + threshold: 0.6 + use_cpu: true + ``` + + 3. Recreate services: `docker compose up -d --build` + +Extra tip: If you use the testing profile, also pass the testing config so the router targets the mock service: + +```bash +CONFIG_FILE=/app/config/config.testing.yaml docker compose --profile testing up --build +``` + +** 2. Envoy/Router up but requests fail ** + +- Ensure `mock-vllm` is healthy (testing profile only): + - `docker compose ps` should show mock-vllm healthy; logs show 200 on `/health`. +- Verify the router config in use: + - Router logs print `Starting vLLM Semantic Router ExtProc with config: ...`. If it shows `/app/config/config.yaml` while testing, you forgot `CONFIG_FILE`. +- Basic smoke test via Envoy (OpenAI-compatible): + - Send a POST to `http://localhost:8801/v1/chat/completions` with `{"model":"auto", "messages":[{"role":"user","content":"hi"}]}` and check that the mock responds with `[mock-openai/gpt-oss-20b]` content when testing profile is active. + +** 3. DNS problems inside containers ** - 3. 
Recreate services: `docker compose up -d --build` +If DNS is flaky in your Docker environment, add DNS servers to the `semantic-router` service in `docker-compose.yml`: -### 2) Port already in use +```yaml +services: + semantic-router: + # ... + dns: + - 1.1.1.1 + - 8.8.8.8 +``` + +For corporate proxies, set `http_proxy`, `https_proxy`, and `no_proxy` in the service `environment`. Make sure 8801, 50051, 19000 are not bound by other processes. Adjust ports in `docker-compose.yml` if needed. + +** 4. China Mainland tips (mirrors and offline caches) ** + +If you're in CN mainland and network access to Go/Hugging Face/PyPI is slow or blocked: + +- Hugging Face models (router downloads BERT embeddings on first run): + + - Prefer using a local copy mounted via `./models` and point `bert_model.model_id` to `models/...`. + - Or mount your HF cache into the container and set cache env var (uncomment in `docker-compose.yml`): + - Volume: `~/.cache/huggingface:/root/.cache/huggingface` + - Env: `HUGGINGFACE_HUB_CACHE=/root/.cache/huggingface` + - Optional mirrors: + - `HF_ENDPOINT=https://hf-mirror.com` + - `HF_HUB_ENABLE_HF_TRANSFER=1` + +- Go modules (used during image build): + + - Already set in Dockerfile to `GOPROXY=https://goproxy.cn,direct` and `GOSUMDB=sum.golang.google.cn` for reliability. + +- PyPI (for mock-vllm image): + - You can configure pip to use a mirror (commented example in `tools/mock-vllm/Dockerfile`): + - `python -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple` + - `python -m pip config set global.trusted-host pypi.tuna.tsinghua.edu.cn` From ca525415d6e7b40d865ff6d5f97009fdd0105555 Mon Sep 17 00:00:00 2001 From: JaredforReal Date: Mon, 22 Sep 2025 16:26:09 +0800 Subject: [PATCH 6/9] clean docker-compose.yml Signed-off-by: JaredforReal --- docker-compose.yml | 5 ----- website/docs/getting-started/docker-quickstart.md | 4 ++-- 2 files changed, 2 insertions(+), 7 deletions(-) diff --git a/docker-compose.yml b/docker-compose.yml index 7f38cab4..afc7e7e1 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -11,14 +11,9 @@ services: volumes: - ./config:/app/config:ro - ./models:/app/models:ro - # - ~/.cache/huggingface:/root/.cache/huggingface # uncomment to persist Hugging Face cache on host (CN users) environment: - LD_LIBRARY_PATH=/app/lib - CONFIG_FILE=${CONFIG_FILE:-/app/config/config.yaml} - # The following environment variables help CN mainland users download Hugging Face models via mirrors - # - HF_HUB_ENABLE_HF_TRANSFER=1 # uncomment to enable fast transfer for HF downloads (CN users) - # - HF_ENDPOINT=https://hf-mirror.com # uncomment to use HF mirror endpoint in China - # - HUGGINGFACE_HUB_CACHE=/root/.cache/huggingface # uncomment to set HF cache directory (works with volume above) networks: - semantic-network healthcheck: diff --git a/website/docs/getting-started/docker-quickstart.md b/website/docs/getting-started/docker-quickstart.md index 7eae6e59..e2c5c771 100644 --- a/website/docs/getting-started/docker-quickstart.md +++ b/website/docs/getting-started/docker-quickstart.md @@ -171,7 +171,7 @@ If you're in CN mainland and network access to Go/Hugging Face/PyPI is slow or b - Hugging Face models (router downloads BERT embeddings on first run): - Prefer using a local copy mounted via `./models` and point `bert_model.model_id` to `models/...`. 
- - Or mount your HF cache into the container and set cache env var (uncomment in `docker-compose.yml`): + - Or mount your HF cache into the container and set cache env var (in `docker-compose.yml`): - Volume: `~/.cache/huggingface:/root/.cache/huggingface` - Env: `HUGGINGFACE_HUB_CACHE=/root/.cache/huggingface` - Optional mirrors: @@ -180,7 +180,7 @@ If you're in CN mainland and network access to Go/Hugging Face/PyPI is slow or b - Go modules (used during image build): - - Already set in Dockerfile to `GOPROXY=https://goproxy.cn,direct` and `GOSUMDB=sum.golang.google.cn` for reliability. + - Set in `Dockerfile`: `GOPROXY=https://goproxy.cn,direct` and `GOSUMDB=sum.golang.google.cn`. - PyPI (for mock-vllm image): - You can configure pip to use a mirror (commented example in `tools/mock-vllm/Dockerfile`): From 825157302cb102fcd3aa4bbe6f7e6c145545490d Mon Sep 17 00:00:00 2001 From: JaredforReal Date: Mon, 22 Sep 2025 21:22:18 +0800 Subject: [PATCH 7/9] modify docker-quickstart Signed-off-by: JaredforReal --- .../docs/getting-started/docker-quickstart.md | 31 +++++++------------ 1 file changed, 12 insertions(+), 19 deletions(-) diff --git a/website/docs/getting-started/docker-quickstart.md b/website/docs/getting-started/docker-quickstart.md index e2c5c771..0742b589 100644 --- a/website/docs/getting-started/docker-quickstart.md +++ b/website/docs/getting-started/docker-quickstart.md @@ -4,26 +4,19 @@ Run Semantic Router + Envoy locally using Docker Compose v2. ## Prerequisites -- Docker Engine and Docker Compose v2 (use the `docker compose` command, not the legacy `docker-compose`) +- Docker Engine, see more in [Docker Engine Installation](https://docs.docker.com/engine/install/) +- Docker Compose v2 (use the `docker compose` command, not the legacy `docker-compose`) ```bash # Verify docker compose version ``` - Install Docker Compose v2 for Ubuntu(if missing), see more in [Docker Compose Plugin Installation](https://docs.docker.com/compose/install/linux/#install-using-the-repository) + Docker Compose Installation for Ubuntu(if missing), see more in [Docker Compose Plugin Installation](https://docs.docker.com/compose/install/linux/#install-using-the-repository) ```bash - # Remove legacy v1 if present (optional) - sudo apt-get remove -y docker-compose || true - - sudo apt-get update - sudo apt-get install -y ca-certificates curl gnupg - sudo install -m 0755 -d /etc/apt/keyrings - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --yes --dearmor -o /etc/apt/keyrings/docker.gpg - echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null sudo apt-get update - sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin + sudo apt-get install -y docker-compose-plugin docker compose version ``` @@ -32,14 +25,14 @@ Run Semantic Router + Envoy locally using Docker Compose v2. ## Install and Run with Docker Compose v2 -1. Clone the repo and move into it (from your workspace root): +**1. Clone the repo and move into it (from your workspace root)** ```bash git clone https://github.com/vllm-project/semantic-router.git cd semantic-router ``` -2. Download required models (classification models): +**2. 
Download required models (classification models)** ```bash make download-models @@ -53,7 +46,7 @@ This downloads the classification models used by the router: Note: The BERT similarity model defaults to a remote Hugging Face model. See Troubleshooting for offline/local usage. -3. Start the services with Docker Compose v2: +**3. Start the services with Docker Compose v2** ```bash # Start core services (semantic-router + envoy) @@ -67,7 +60,7 @@ docker compose up --build -d CONFIG_FILE=/app/config/config.testing.yaml docker compose --profile testing up --build ``` -4. Verify +**4. Verify** - Semantic Router (gRPC): localhost:50051 - Envoy Proxy: http://localhost:8801 @@ -91,7 +84,7 @@ docker compose down ## Troubleshooting -** 1. Router exits immediately with a Hugging Face DNS/download error ** +**1. Router exits immediately with a Hugging Face DNS/download error** Symptoms (from `docker compose logs -f semantic-router`): @@ -138,7 +131,7 @@ Extra tip: If you use the testing profile, also pass the testing config so the r CONFIG_FILE=/app/config/config.testing.yaml docker compose --profile testing up --build ``` -** 2. Envoy/Router up but requests fail ** +**2. Envoy/Router up but requests fail** - Ensure `mock-vllm` is healthy (testing profile only): - `docker compose ps` should show mock-vllm healthy; logs show 200 on `/health`. @@ -147,7 +140,7 @@ CONFIG_FILE=/app/config/config.testing.yaml docker compose --profile testing up - Basic smoke test via Envoy (OpenAI-compatible): - Send a POST to `http://localhost:8801/v1/chat/completions` with `{"model":"auto", "messages":[{"role":"user","content":"hi"}]}` and check that the mock responds with `[mock-openai/gpt-oss-20b]` content when testing profile is active. -** 3. DNS problems inside containers ** +**3. DNS problems inside containers** If DNS is flaky in your Docker environment, add DNS servers to the `semantic-router` service in `docker-compose.yml`: @@ -164,7 +157,7 @@ For corporate proxies, set `http_proxy`, `https_proxy`, and `no_proxy` in the se Make sure 8801, 50051, 19000 are not bound by other processes. Adjust ports in `docker-compose.yml` if needed. -** 4. China Mainland tips (mirrors and offline caches) ** +**4. China Mainland tips (mirrors and offline caches)** If you're in CN mainland and network access to Go/Hugging Face/PyPI is slow or blocked: From 6b34904c9c6e941cce10c7bcebf01311a846e807 Mon Sep 17 00:00:00 2001 From: JaredforReal Date: Mon, 22 Sep 2025 22:16:32 +0800 Subject: [PATCH 8/9] installation for more distribution Signed-off-by: JaredforReal --- website/docs/getting-started/docker-quickstart.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/website/docs/getting-started/docker-quickstart.md b/website/docs/getting-started/docker-quickstart.md index 0742b589..5ddf5144 100644 --- a/website/docs/getting-started/docker-quickstart.md +++ b/website/docs/getting-started/docker-quickstart.md @@ -7,17 +7,18 @@ Run Semantic Router + Envoy locally using Docker Compose v2. 
 - Docker Engine, see more in [Docker Engine Installation](https://docs.docker.com/engine/install/)
 - Docker Compose v2 (use the `docker compose` command, not the legacy `docker-compose`)
 
-  ```bash
-  # Verify
-  docker compose version
-  ```
-
-  Docker Compose Installation for Ubuntu(if missing), see more in [Docker Compose Plugin Installation](https://docs.docker.com/compose/install/linux/#install-using-the-repository)
+  Docker Compose Plugin Installation (if missing), see more in [Docker Compose Plugin Installation](https://docs.docker.com/compose/install/linux/#install-using-the-repository)
 
   ```bash
+  # For Ubuntu and Debian, run:
   sudo apt-get update
   sudo apt-get install -y docker-compose-plugin
 
+  # For RPM-based distributions, run:
+  sudo yum update
+  sudo yum install docker-compose-plugin
+
+  # Verify
   docker compose version
   ```

From e50e441be64b64fb9cdd8902178e06b1c9f0f188 Mon Sep 17 00:00:00 2001
From: JaredforReal
Date: Tue, 23 Sep 2025 12:15:09 +0800
Subject: [PATCH 9/9] get rid of optimization for CN network

Signed-off-by: JaredforReal
---
 Dockerfile.extproc                            |  9 --------
 tools/mock-vllm/Dockerfile                    | 10 +++-----
 .../docs/getting-started/docker-quickstart.md | 23 -------------------
 3 files changed, 3 insertions(+), 39 deletions(-)

diff --git a/Dockerfile.extproc b/Dockerfile.extproc
index 652e6f89..72ead6e4 100644
--- a/Dockerfile.extproc
+++ b/Dockerfile.extproc
@@ -24,20 +24,11 @@ FROM golang:1.24 as go-builder
 
 WORKDIR /app
 
-# Use China-friendly Go module mirrors to avoid proxy.golang.org timeouts
-# ENV GOPROXY=https://goproxy.cn,direct
-# Prefer a reachable checksum database in CN (or set to 'off' if still blocked)
-# ENV GOSUMDB=sum.golang.google.cn
-
 # Copy Go module files first for better layer caching
 RUN mkdir -p src/semantic-router
 COPY src/semantic-router/go.mod src/semantic-router/go.sum src/semantic-router/
 COPY candle-binding/go.mod candle-binding/semantic-router.go candle-binding/
 
-# Pre-download Go modules to leverage Docker layer caching and fail fast if mirrors are unreachable
-RUN cd src/semantic-router && go mod download && \
-    cd /app/candle-binding && go mod download
-
 # Copy semantic-router source code
 COPY src/semantic-router/ src/semantic-router/
 
diff --git a/tools/mock-vllm/Dockerfile b/tools/mock-vllm/Dockerfile
index 3a7e812c..ea955b2b 100644
--- a/tools/mock-vllm/Dockerfile
+++ b/tools/mock-vllm/Dockerfile
@@ -6,14 +6,10 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     curl \
     && rm -rf /var/lib/apt/lists/*
 
-# Uncomment to Configure pip to use a China mirror for faster installs
-# RUN python -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple && \
-#     python -m pip config set global.trusted-host pypi.tuna.tsinghua.edu.cn
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
 
-COPY requirements.txt /app/requirements.txt
-RUN pip install --no-cache-dir -r /app/requirements.txt
-
-COPY app.py /app/app.py
+COPY app.py .
 
 EXPOSE 8000
 
diff --git a/website/docs/getting-started/docker-quickstart.md b/website/docs/getting-started/docker-quickstart.md
index 5ddf5144..6a517ff2 100644
--- a/website/docs/getting-started/docker-quickstart.md
+++ b/website/docs/getting-started/docker-quickstart.md
@@ -157,26 +157,3 @@ services:
 
 For corporate proxies, set `http_proxy`, `https_proxy`, and `no_proxy` in the service `environment`.
 
 Make sure 8801, 50051, 19000 are not bound by other processes. Adjust ports in `docker-compose.yml` if needed.
-
-**4. China Mainland tips (mirrors and offline caches)**
-
-If you're in CN mainland and network access to Go/Hugging Face/PyPI is slow or blocked:
-
-- Hugging Face models (router downloads BERT embeddings on first run):
-
-  - Prefer using a local copy mounted via `./models` and point `bert_model.model_id` to `models/...`.
-  - Or mount your HF cache into the container and set cache env var (in `docker-compose.yml`):
-    - Volume: `~/.cache/huggingface:/root/.cache/huggingface`
-    - Env: `HUGGINGFACE_HUB_CACHE=/root/.cache/huggingface`
-  - Optional mirrors:
-    - `HF_ENDPOINT=https://hf-mirror.com`
-    - `HF_HUB_ENABLE_HF_TRANSFER=1`
-
-- Go modules (used during image build):
-
-  - Set in `Dockerfile`: `GOPROXY=https://goproxy.cn,direct` and `GOSUMDB=sum.golang.google.cn`.
-
-- PyPI (for mock-vllm image):
-  - You can configure pip to use a mirror (commented example in `tools/mock-vllm/Dockerfile`):
-    - `python -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple`
-    - `python -m pip config set global.trusted-host pypi.tuna.tsinghua.edu.cn`
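
With the testing profile in place, an end-to-end smoke test looks like this (a sketch assembled from the quickstart's own commands; the expected reply string comes from `tools/mock-vllm/app.py`):

```bash
# Router + Envoy + mock vLLM, using the testing config
CONFIG_FILE=/app/config/config.testing.yaml docker compose --profile testing up --build -d

# OpenAI-compatible request through Envoy; with model "auto" the router
# selects the configured default (openai/gpt-oss-20b) and the mock echoes it.
curl -s http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "hi"}]}'
# Expect "[mock-openai/gpt-oss-20b] You said: hi" in choices[0].message.content
```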