Fix numerical issue on hybrid kv cache allocation #1139
Conversation
This PR doesn't have any tests. Please add the following tests:
- e2e correctness test: output with and without hybrid allocation is the same
- e2e performance test: performance with the hybrid allocator is higher than without it
- unit tests for the changed Python files and the runner. We need to keep coverage above 70%, and our PRs need to come with enough tests.
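The suggested e2e correctness check could be sketched roughly as below; the helper and the dummy outputs are hypothetical, not part of the repo's test suite. The idea is to generate the same prompts with the hybrid kv cache allocator enabled and disabled, then assert the outputs are identical token-for-token.

```python
# Hedged sketch of the suggested e2e correctness test. outputs_match() is a
# hypothetical helper; the real test would feed identical prompts to the
# engine twice (hybrid allocation on, then off) and compare the generations.

def outputs_match(baseline: list[str], hybrid: list[str]) -> bool:
    """True iff every generated sequence is identical across both runs."""
    return len(baseline) == len(hybrid) and all(
        a == b for a, b in zip(baseline, hybrid)
    )

# Dummy strings stand in for real engine generations here:
same = outputs_match(["The sky is blue.", "42"], ["The sky is blue.", "42"])
diff = outputs_match(["The sky is blue."], ["The sky is green."])
```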
py4 left a comment
Does this also work for the JAX path? If not, can we also make the JAX path work?
It should be backend agnostic, but to enable it in JAX we need to modify the individual JAX models. Previously, no JAX model needed hybrid kv cache, so it wasn't enabled. The numerical issue was also reported using a vLLM model rather than flax nnx.
Signed-off-by: Chenyaaang <chenyangli@google.com>
Force-pushed from 8f5b161 to a1d07b7
With this PR, I've verified on gpt-oss that the numerical issue has been solved, and a performance issue that stemmed from the numerical issue has also been resolved.
Description
Fix a numerical issue in hybrid kv cache allocation. When hybrid kv cache is enabled, the block_id differs between kv cache groups at each allocation round, meaning different layers write to different block_ids. We therefore need to create individual attention metadata for each layer, instead of sharing the same attention metadata across every layer.
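The per-layer metadata construction described above can be sketched roughly as follows. All names here (AttentionMetadata, build_per_layer_metadata, the group/layer mappings) are hypothetical illustrations, not the actual vLLM or runner API: each kv cache group owns its own block_ids in a given allocation round, so each layer's attention metadata must be built from that layer's group rather than shared globally.

```python
# Hedged sketch, assuming hypothetical names: with hybrid kv cache, layers in
# different kv cache groups (e.g. full attention vs sliding window) receive
# different block_ids per allocation round, so attention metadata is per-layer.

from dataclasses import dataclass

@dataclass
class AttentionMetadata:
    block_ids: list[int]  # blocks this layer's kv cache writes to this round

def build_per_layer_metadata(
    kv_cache_groups: dict[str, list[int]],  # group name -> block_ids this round
    layer_to_group: dict[str, str],         # layer name -> its kv cache group
) -> dict[str, AttentionMetadata]:
    """One AttentionMetadata per layer, using that layer's own group blocks."""
    return {
        layer: AttentionMetadata(block_ids=kv_cache_groups[group])
        for layer, group in layer_to_group.items()
    }

# Example: full-attention and sliding-window layers live in different groups,
# so in the same round they get different block_ids.
groups = {"full_attn": [0, 1, 2], "sliding_window": [7, 8]}
layers = {"layer_0": "full_attn", "layer_1": "sliding_window"}
meta = build_per_layer_metadata(groups, layers)
```

Sharing one metadata object across layers would make the sliding-window layers write to the full-attention group's blocks, which matches the corruption symptom this PR fixes.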
Tests
python examples/offline_inference.py --model google/gemma-3-27b-it --tensor-parallel-size 8
Checklist
Before submitting this PR, please make sure: