[DP] Functional DP for GPT-OSS #1137
base: main
Conversation
1efb3dc to b10487a
kyuyeunk left a comment:
Please add torchax dp related unit tests.
total_repeat_length=m // mesh.shape["data"])
return (gmm_result_local + rhs_bis).astype(gmm_result_local.dtype)

gmm_result = shard_map(
Personally, I prefer not using shard_map if possible, as it is really prone to numeric errors when not used properly. With check_rep=False, unlike other ops, there isn't any safety feature that guarantees the numerics of a tensor adhere to a proper SPMD / sharding annotation.
I prefer using it only when it's really necessary (e.g., when calling a kernel).
Please modify this code to not use shard_map; you can refer to this PR where I replaced an existing use of shard_map with a regular JAX function: #590
For more context, when using shard_map with check_rep=False and something like 'out_spec=P(None)', it only annotates the tensor as having that sharding; shard_map does not introduce any collective to ensure it.
Meaning, it is possible that the output tensor's numerics are not replicated across devices and every device holds different values, because shard_map provides no guarantees - which makes it really painful to debug when there's a numeric issue.
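For readers less familiar with this failure mode, here is a minimal sketch (toy code, not from this PR; the function and the one-axis mesh are made up) of what the comment above describes:

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(jax.devices(), axis_names=("data",))

def per_shard(x):
    # Device-dependent result: each shard scales its block differently,
    # so the output is NOT actually replicated along "data".
    return x * (jax.lax.axis_index("data") + 1)

f = shard_map(per_shard, mesh=mesh,
              in_specs=P("data"),
              out_specs=P(None),   # merely labels the output as replicated
              check_rep=False)     # disables the check that would catch this
# Anything downstream that trusts the P(None) annotation may read different
# numerics depending on which device's buffer it happens to use.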
I tried to not use shard_map but did not figure out a way that doesn't involve a for loop.
The complexity here is that jnp.repeat(bias, group_size, 0) expects bias and group_size to have the same size on dimension 0, but group_size.shape[0] = DP * num_experts whereas bias.shape[0] = num_experts.
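To make the mismatch concrete, a toy example (made-up sizes, not the PR's code):

import jax.numpy as jnp

num_experts, dp = 2, 2
bias = jnp.ones((num_experts, 4))        # (num_experts, model_dim)
group_sizes = jnp.array([1, 2, 3, 1])    # dp * num_experts entries
# jnp.repeat(bias, group_sizes, 0) would raise here: group_sizes has 4 entries
# but bias has only 2 rows along axis 0, so the sizes must be reconciled first.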
if that's the case, can you do something like this?
# convert (experts, model_dim) to (experts * dp_size, model_dim)
bias = jnp.repeat(bias, dp_size, 0)
# (optional; may or may not be needed) match bias's sharding with group_size's sharding
bias = jax.lax.with_sharding_constraint(bias, P("data", "model"))
# Now bias.shape[0] and group_size.shape[0] match
rhs_bias = jnp.repeat(bias, group_size, 0)
Note that if compiler optimization works correctly, the first two lines (the jnp.repeat and the sharding constraint) will be a no-op, because the data is already present in each dp rank and we are just telling the compiler to treat it differently from that point on.
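One way to check that (a sketch; it assumes the MoE function can be lowered with representative sharded example inputs) is to look at the compiled HLO for collectives around the bias handling:

import jax

def collectives_in_hlo(fn, *example_args):
    # Lower and compile fn, then scan the HLO text for collective ops that
    # would indicate the repeat / sharding constraint is not a no-op.
    hlo = jax.jit(fn).lower(*example_args).compile().as_text()
    return [op for op in ("all-gather", "all-to-all", "collective-permute", "all-reduce")
            if op in hlo]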
Thanks Kyuyeun for this suggestion. For correctness, I have to use jnp.tile instead of jnp.repeat. However, I am noticing a performance drop (7575.50 vs 7781.92) when I do this instead of shard_map. Maybe jnp.tile is not a no-op?
rhs_bias = jnp.tile(rhs_bias, (mesh.shape["data"], 1))
# adding the sharding constraint does not make a difference
rhs_bias = jnp.repeat(rhs_bias, group_sizes, 0, total_repeat_length=m)
gmm_result = (gmm_result + rhs_bias).astype(gmm_result.dtype)
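For context on the tile-vs-repeat choice: presumably group_sizes here is the per-rank group sizes concatenated, so the expert rows have to repeat in blocks rather than element-wise. A toy illustration (made-up sizes, not the PR's code):

import jax.numpy as jnp

num_experts, dp = 2, 2
bias = jnp.array([[10.0], [20.0]])       # (num_experts, 1): one row per expert
group_sizes = jnp.array([1, 2, 3, 1])    # per-rank sizes concatenated: rank0=[1,2], rank1=[3,1]

tiled = jnp.tile(bias, (dp, 1))          # rows e0, e1, e0, e1: lines up with group_sizes
wrong = jnp.repeat(bias, dp, 0)          # rows e0, e0, e1, e1: pairs e1's count with e0's bias

rhs_bias = jnp.repeat(tiled, group_sizes, 0, total_repeat_length=int(group_sizes.sum()))
# rhs_bias rows: e0, e1, e1, e0, e0, e0, e1 (one bias row per token, per rank)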
Hmm, in theory it should be a no-op,
because bias is already replicated across TPUs along the dp axis, and combining tile/repeat with a sharding constraint just tells the TPU to treat it as a separate non-replicated tensor.
I'll do some tests locally and get back to you ASAP.
kyuyeunk left a comment:
Please add torchax dp related unit tests.
Also, please address this comment.
Added e2e model parallelism test for Llama3.1 1b for torchax.
Description
Add functional DP support for the GPT-OSS Torchax backend.
Verified baseline throughput is unchanged (5037.82); DP=2 throughput is 1.54x (7781.92).
Validated numerical correctness with offline_inference.py.
Full details: https://paste.googleplex.com/5240826907197440
Tests
https://buildkite.com/tpu-commons/tpu-inference-ci/builds/5712
https://buildkite.com/tpu-commons/tpu-inference-ci/builds/5749
Checklist
Before submitting this PR, please make sure: