Commit 0c9f039

[TPU Offload] Separate offload manager and cpu-cache backend, and code structure refactor (#1122)
Signed-off-by: Juncheng Gu <jcgu@google.com>
1 parent f6a6720 commit 0c9f039

27 files changed: +2414 −3124 lines

examples/gke/benchmarks/README.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ kubectl apply -f deploy-baseline.yaml
 
 ### Option B: vLLM with TPU Host Offload
 
-This deployment configures vLLM to use a `TPUConnector` for KV cache offload to the host CPU memory. This is specified by the `--kv-transfer-config` argument.
+This deployment configures vLLM to use a `TPUOffloadConnector` for KV cache offload to the host CPU memory. This is specified by the `--kv-transfer-config` argument.
 
 ```bash
 kubectl apply -f deploy-cpu-offload.yaml
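For readers adapting these manifests, the sketch below illustrates (it is not part of this commit) how the same connector settings could be passed to vLLM's offline Python API. The connector name, role, and module path are taken from the updated manifests; the `KVTransferConfig` import path and the `kv_transfer_config` keyword are assumptions about vLLM's Python surface.

```python
# Illustrative sketch only, not code from this commit.
# Assumptions: vllm.config.KVTransferConfig exists with these field names,
# and LLM accepts a kv_transfer_config argument (mirroring --kv-transfer-config).
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig  # assumed import path

kv_config = KVTransferConfig(
    kv_connector="TPUOffloadConnector",
    kv_role="kv_both",
    kv_connector_module_path="tpu_inference.distributed.offload.tpu_offload_connector",
)

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=8,
    enable_prefix_caching=True,
    kv_transfer_config=kv_config,  # assumed keyword; same JSON as the manifests
)

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```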

examples/gke/benchmarks/deploy-cpu-offload.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ spec:
 imagePullPolicy: Always
 command: ["/bin/sh", "-c"]
 args:
-- "vllm serve meta-llama/Llama-3.3-70B-Instruct --kv-transfer-config '{\"kv_connector\":\"TPUConnector\",\"kv_role\":\"kv_both\",\"kv_connector_module_path\":\"tpu_inference.distributed.tpu_connector_local\"}' --port 8000 --max_num_batched_tokens 2048 --enable-chunked-prefill --tensor-parallel-size 8 --seed 42 --enable_prefix_caching --gpu-memory-utilization 0.9"
+- "vllm serve meta-llama/Llama-3.3-70B-Instruct --kv-transfer-config '{\"kv_connector\":\"TPUOffloadConnector\",\"kv_role\":\"kv_both\",\"kv_connector_module_path\":\"tpu_inference.distributed.offload.tpu_offload_connector\"}' --port 8000 --max_num_batched_tokens 2048 --enable-chunked-prefill --tensor-parallel-size 8 --seed 42 --enable_prefix_caching --gpu-memory-utilization 0.9"
 env:
 - name: HUGGING_FACE_HUB_TOKEN
   valueFrom:

examples/gke/pod_tpu_commons_cpu_offload.yaml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ spec:
 - --tensor_parallel_size=8
 - --max_model_len=1024
 - --kv-transfer-config
-- '{"kv_connector":"TPUConnector","kv_connector_module_path":"tpu_inference.distributed.tpu_connector_local","kv_role":"kv_both"}'
+- '{"kv_connector":"TPUOffloadConnector","kv_connector_module_path":"tpu_inference.distributed.offload.tpu_offload_connector","kv_role":"kv_both"}'
 env:
 - name: HUGGING_FACE_HUB_TOKEN
   valueFrom:

examples/gke/pod_tpu_commons_cpu_offload_verification.yaml

Lines changed: 3 additions & 3 deletions
@@ -2,10 +2,10 @@ apiVersion: v1
 kind: Pod
 metadata:
   name: tpu-job-offline-inference
-  # This pod verifies the correctness of the TPUConnector implementation.
+  # This pod verifies the correctness of the TPUOffloadConnector implementation.
   # It runs a script that internally performs two text generations:
   # 1. A baseline run with a standard vLLM engine.
-  # 2. A test run with the TPUConnector enabled.
+  # 2. A test run with the TPUOffloadConnector enabled.
   # The pod succeeds only if the outputs from both runs are identical,
   # ensuring that the connector does not alter the model's output.
 spec:
@@ -25,7 +25,7 @@ spec:
 - --max_model_len=1024
 - --seed=42
 - --kv-transfer-config
-- '{"kv_connector":"TPUConnector","kv_connector_module_path":"tpu_inference.distributed.tpu_connector_local","kv_role":"kv_both"}'
+- '{"kv_connector":"TPUOffloadConnector","kv_connector_module_path":"tpu_inference.distributed.offload.tpu_offload_connector","kv_role":"kv_both"}'
 env:
 - name: HUGGING_FACE_HUB_TOKEN
   valueFrom:

examples/gke/pod_tpu_host_offload_unit_tests.yaml

Lines changed: 7 additions & 7 deletions
@@ -2,7 +2,7 @@ apiVersion: v1
 kind: Pod
 metadata:
   name: tpu-job-host-offload-unit-tests
-  # This pod runs the distributed unit tests for the TPUConnector
+  # This pod runs the distributed unit tests for the TPUOffloadConnector
   # and other related functionalities. It executes all tests found in the
   # tests/distributed/ directory using pytest.
 spec:
@@ -17,12 +17,12 @@ spec:
 command:
 - /bin/bash
 - -c
-- "pytest -sv tests/distributed/host_offloading_precompile_test.py"
-# - "pytest -sv tests/distributed/cpu_offloading_worker_test.py"
-# - "pytest -sv tests/distributed/cpu_offloading_cache_util_test.py"
-# - "pytest -sv tests/distributed/host_offloading_accuracy_test.py"
-# - "pytest -sv tests/distributed/local_cpu_backend_test.py"
-# - "pytest -sv tests/distributed/host_offloading_precompile_test.py"
+- "pytest -sv tests/distributed/offload/tpu_offload_cpu_backend_test.py"
+- "pytest -sv tests/distributed/offload/tpu_offload_connector_worker_test.py"
+- "pytest -sv tests/distributed/offload/tpu_offload_connector_scheduler_test.py"
+- "pytest -sv tests/distributed/offload/tpu_offload_utils_test.py"
+- "pytest -sv tests/distributed/offload/tpu_offload_manager_test.py"
+- "pytest -sv tests/distributed/offload/tpu_offload_accuracy_test.py"
 env:
 - name: HUGGING_FACE_HUB_TOKEN
   valueFrom:

examples/offline_inference_kv_cache_verification.py

Lines changed: 3 additions & 3 deletions
@@ -1,14 +1,14 @@
 # SPDX-License-Identifier: Apache-2.0
 """
-This script performs an automated correctness verification for the TPUConnector.
+This script performs an automated correctness verification for the TPUOffloadConnector.
 
 The verification works by performing a two-stage experiment for multiple prompts:
 1. Baseline Run: For each prompt, it first runs a text generation using a
    standard vLLM engine configuration without any KV cache connector. The
    output from this run is considered the "source of truth".
 
 2. Test Run: It then runs the exact same text generation, but this time
-   with the TPUConnector enabled via the `--kv-transfer-config` argument.
+   with the TPUOffloadConnector enabled via the `--kv-transfer-config` argument.
    It runs the generation twice to verify prefix caching.
 
 3. Comparison: The script compares the output from each test run against the
@@ -131,7 +131,7 @@ def main(args: dict):
     time.sleep(10)
 
     # 2. Run the test with the local tpu kv connector enabled
-    print("\n--- Running Test (with TPUConnector) ---")
+    print("\n--- Running Test (with TPUOffloadConnector) ---")
     # With the connector, we run generation twice to test the prefix cache
     test_llm, test_params = setup_llm(args)
     test_outputs = run_invocations(test_llm,
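The comparison step described in the docstring reduces to an exact-match check between the baseline and connector-enabled generations. The sketch below is illustrative only and is not the script's actual implementation: `setup_llm` and `run_invocations` appear in the diff above, but the `verify_outputs` helper and the `baseline_texts`/`test_texts` names are hypothetical, assuming each is a list of generated strings collected from the two runs.

```python
# Hypothetical helper (not from this commit): exact-match comparison of
# baseline outputs against TPUOffloadConnector-enabled outputs.
def verify_outputs(baseline_texts: list[str], test_texts: list[str]) -> bool:
    """Return True only if every test output matches its baseline output."""
    if len(baseline_texts) != len(test_texts):
        return False
    mismatches = [
        (i, base, test)
        for i, (base, test) in enumerate(zip(baseline_texts, test_texts))
        if base != test
    ]
    for i, base, test in mismatches:
        print(f"Prompt {i}: MISMATCH\n  baseline: {base!r}\n  test:     {test!r}")
    return not mismatches

# The verification pod above succeeds only when a check like this passes
# for every prompt, e.g.:
# assert verify_outputs(baseline_texts, test_texts)
```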

tests/distributed/cpu_offloading_cache_util_test.py

Lines changed: 0 additions & 129 deletions
This file was deleted.
