NVFP4 cast/transpose without TMA by matthiasdiener · Pull Request #472 · ROCm/TransformerEngine

matthiasdiener · 2026-03-04T16:17:16Z

Description

Fixes https://github.com/ROCm/frameworks-internal/issues/15731

TODO:

Implement the remaining cases, not just forward 1D
Add tests for the other cases (2D, stochastic rounding)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

matthiasdiener · 2026-03-17T22:22:45Z

The code contains a number of ifdefs that only substitute cuda_fp4.h, __nv_fp4_e2m1, etc. with their HIP counterparts. I suggest using the custom hipification map (build_tools/hipify/custom_map.json) and removing those ifdefs from the code. The map can also be used for #include <cudaTypedefs.h>.

Thank you for the suggestion. I switched to the hipify map in 55a8c84.

wangye805

So currently we don't have any workaround for the stochastic rounding path?

build_tools/hipify/custom_map.json

tests/cpp/operator/test_cast_nvfp4_transpose.cu

tests/cpp/test_common.cu

tests/cpp/test_common.h

wangye805 · 2026-03-18T04:57:59Z

transformer_engine/common/cast/dispatch/quantize.cuh

+#ifdef __HIP_PLATFORM_AMD__
+        // If amax was not explicitly set, fall back to the scale field which
+        // holds the same value when set via set_scale().
+        NVTE_CHECK(global_amax.dptr != nullptr || output_tensor->scale.dptr != nullptr,


Is this a bug fix for upstream? If not, why do we need this specific treatment for the global amax?

Yes, I believe this is a bug in upstream.

Maybe add a comment then.

I see. Thanks!

Also, check whether upstream already has a fix. If not, I think it's okay to drop the ROCm-specific guard. What do you think @ipanfilo?

I don't think this is fixed in upstream yet. I added a comment in a607feb.

Thanks. I would like to understand more about this fix. Getting a B200 to test the NV upstream behavior is probably hard. Which cpp gtest failed due to this bug? According to the NV upstream design, should output_amax be set to the correct value? If so, we should fix the bug where that setting was missed.

The tests in test_cast_nvfp4_transpose.cu fail. I suspect this isn't tested upstream for the fallback path (which is the only path we currently have implemented). Looking at the upstream optimized kernel (in quantize_transpose_nvfp4.cuh), amax_rowwise_ptr is explicitly allowed to be null; the kernel falls back to 1.0f in that case. amax.dptr is never allocated for NVFP4 tensors in the upstream test Tensor class, and the optimized path never needs it. I changed the fix in 82af544 so that the fallback kernel matches this null-handling behavior. I believe this is still incorrect in upstream's main branch too.
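
The null handling described above can be sketched as follows. This is a minimal host-side illustration and not the kernel code from 82af544: load_global_amax_or_default is a hypothetical helper name, and only the 1.0f fallback value comes from the discussion.

```cpp
// Hypothetical helper (not from the PR): read the global amax if a buffer was
// provided, otherwise fall back to 1.0f as described in the review thread.
constexpr float load_global_amax_or_default(const float* amax_ptr) {
  return (amax_ptr != nullptr) ? *amax_ptr : 1.0f;
}
```

A caller that never allocated the amax buffer would pass nullptr and get 1.0f, instead of tripping a check that requires the pointer to be set.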

transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu

tests/cpp/operator/test_cast_nvfp4_transpose.cu

matthiasdiener · 2026-03-18T21:36:46Z

So currently we don't have any workaround for the stochastic rounding path?

I was able to implement SR via intrinsics on gfx950 in 36cf73a. I also expanded the test to use it.
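
For readers unfamiliar with stochastic rounding, here is a host-side reference sketch of the idea on the FP4 E2M1 value grid. It is not the gfx950 intrinsic path from 36cf73a. All names are hypothetical, the random sample u is supplied by the caller, and NaN inputs are not handled.

```cpp
// Non-negative magnitudes representable in FP4 E2M1.
constexpr float kE2M1Values[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Hypothetical reference function: round x to one of its two E2M1 neighbors.
// u is a uniform sample in [0, 1). The upper neighbor is chosen with
// probability (|x| - lo) / (hi - lo), so the result equals x in expectation.
// Magnitudes of 6 or more saturate to 6.
constexpr float stochastic_round_e2m1(float x, float u) {
  const float sign = (x < 0.0f) ? -1.0f : 1.0f;
  const float mag = (x < 0.0f) ? -x : x;
  if (mag >= kE2M1Values[7]) return sign * kE2M1Values[7];
  int i = 0;
  while (kE2M1Values[i + 1] <= mag) ++i;  // index of the largest grid value <= mag
  const float lo = kE2M1Values[i];
  const float hi = kE2M1Values[i + 1];
  const float p_up = (mag - lo) / (hi - lo);
  return sign * ((u < p_up) ? hi : lo);
}
```

With x = 2.5 the neighbors are 2 and 3 and each is chosen half of the time, whereas round-to-nearest would always pick the same one.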

tests/cpp/test_common.h

ipanfilo · 2026-03-18T23:31:51Z

transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu

 }

+#ifdef __HIP_PLATFORM_AMD__
+__device__ __forceinline__ fp4x4_storage_t cvt_fp32_to_fp4_4x_with_stochastic_rounding(


fp4x4_storage_t is already correctly redefined for HIP and CUDA, so there is no need for an ifdef here. Or, if you want to keep the original declaration unchanged, you can use 'using __nv_fp4x4_e2m1 = __hip_fp4x4_storage_t' on AMD.

I'm not sure this can be simplified further.
Both sides need different return types, and fp4x4_storage_t can only be used on AMD.
The map file has "__nv_fp4x2_e2m1" : "__hip_fp4x2_e2m1", so using __nv_fp4x4_e2m1 = __hip_fp4x4_storage_t would become using __hip_fp4x4_e2m1 = __hip_fp4x4_storage_t after hipification, which is a redefinition of the existing struct in amd_hip_fp4.h.

fp4x4_storage_t is declared so that the later code needs no ifdef and can use fp4x4_storage_t everywhere. If you use an ifdef here, you don't need fp4x4_storage_t and can use __hip_fp4x4_storage_t directly.
Also, why do you need __hip_fp4x4_storage_t here rather than __hip_fp4x4_e2m1, which would let hipification alone handle the resulting type?

fp4x4_storage_t is declared so that the later code needs no ifdef and can use fp4x4_storage_t everywhere. If you use an ifdef here, you don't need fp4x4_storage_t and can use __hip_fp4x4_storage_t directly.

I did this simplification in fc5af65.

Also, why do you need __hip_fp4x4_storage_t here rather than __hip_fp4x4_e2m1, which would let hipification alone handle the resulting type?

Thanks, I found another way to do the bit fiddling in this function and in the SR function that does not need __hip_fp4x4_storage_t; see 94a4e5e.
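
The thread does not show the bit manipulation from 94a4e5e. As a generic illustration of handling packed FP4 data with plain integer types rather than a vendor storage struct, here is a minimal sketch. The function names are hypothetical, and placing element 0 in the lowest nibble is an assumption made for this example.

```cpp
#include <cstdint>

// Hypothetical sketch: pack four 4-bit E2M1 codes into one 16-bit word using
// integer shifts only. Element 0 goes into the lowest nibble (an assumption).
constexpr std::uint16_t pack_fp4x4(std::uint8_t c0, std::uint8_t c1,
                                   std::uint8_t c2, std::uint8_t c3) {
  return static_cast<std::uint16_t>((c0 & 0xFu) | ((c1 & 0xFu) << 4) |
                                    ((c2 & 0xFu) << 8) | ((c3 & 0xFu) << 12));
}

// Extract the 4-bit code of element idx (0..3) from a packed word.
constexpr std::uint8_t unpack_fp4(std::uint16_t packed, int idx) {
  return static_cast<std::uint8_t>((packed >> (4 * idx)) & 0xFu);
}
```

Working on a raw 16-bit integer keeps the function signature independent of platform-specific FP4 struct types, which is the property the review comments were after.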

ipanfilo · 2026-03-18T23:35:26Z

transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu

        "FP4 cvt.rs PTX instructions are architecture-specific. "
        "Try recompiling with sm_XXXa instead of sm_XXX.");
+#else
+#ifdef __gfx950__


It may make sense to have an analogue of the ARCH_HAS_STOCHASTIC_ROUNDING define if such guarding is used in multiple places; we'll add more platforms with FP4 support later.

Added in a607feb
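
One way to centralize such a guard is sketched below. The macro and function names are illustrative and are not the ones added in a607feb. The sketch assumes that __gfx950__ is defined only when compiling device code for that target, so a plain host build takes the #else branch.

```cpp
// Hypothetical capability macro: define it once, then test it everywhere
// instead of repeating the raw architecture check.
#if defined(__gfx950__)
#define SKETCH_ARCH_HAS_FP4_STOCHASTIC_ROUNDING 1
#else
#define SKETCH_ARCH_HAS_FP4_STOCHASTIC_ROUNDING 0
#endif

// Convenience wrapper so that callers can branch with `if constexpr` or a
// normal `if` rather than with preprocessor blocks.
constexpr bool arch_has_fp4_stochastic_rounding() {
  return SKETCH_ARCH_HAS_FP4_STOCHASTIC_ROUNDING != 0;
}
```

Supporting another FP4-capable platform later would then mean extending the single #if condition rather than touching every call site.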

ipanfilo · 2026-03-18T23:57:52Z

transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu


 // for 2D block scaling, we need to reduce amax in warp
+#ifdef __HIP_PLATFORM_AMD__
+static __device__ constexpr uint64_t WARP_REDUCE_AMAX_GROUP_MASKS[8] = {


I think that with only 32 threads per wavefront actively used, the high half of the mask should be 0.

Isn't kThreadsPerWarp=32 just a logical grouping value here, not the hardware wavefront width?
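
The two readings in this exchange lead to different 64-bit masks, which the sketch below makes concrete. It assumes that each of the 8 reduction groups consists of every 8th lane; that layout and the function name are assumptions for illustration, and the sketch does not settle which reading applies to the PR.

```cpp
#include <cstdint>

// Hypothetical sketch: build the lane mask for reduction group `group`
// (0..7), where a group is assumed to contain lanes group, group + 8, ...
// `active_lanes` is the number of participating lanes (32 or 64 here).
constexpr std::uint64_t group_lane_mask(int group, int active_lanes) {
  std::uint64_t mask = 0;
  for (int lane = group; lane < active_lanes; lane += 8) {
    mask |= (std::uint64_t{1} << lane);
  }
  return mask;
}
```

If only lanes 0 to 31 of a 64-wide wavefront take part, group 0 yields 0x0000000001010101 and the high half is zero. If all 64 hardware lanes take part (two logical groups of 32), the same group yields 0x0101010101010101.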

wangye805 · 2026-03-19T04:26:31Z

So currently we don't have any workaround for the stochastic rounding path?

I was able to implement SR via intrinsics on gfx950 in 36cf73a. I also expanded the test to use it.

Great. Thanks

ipanfilo · 2026-03-20T04:45:07Z

tests/cpp/test_common.cu

-    size_t scale_dim_X = DIVUP_TO_MULTIPLE(DIVUP(last_dim, 16lu), scale_tensor_alignment_X_rowwise);
+#ifdef __HIP_PLATFORM_AMD__
+    // NVFP4 requires [128,4] padding on AMD regardless of MXFP8 alignment constants
+    constexpr size_t nvfp4_align_Y = 128;


Use nvfp4 constants from test_common.h

Thanks, done in 5a5803c.
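
The padding arithmetic in the diff above can be sketched as follows. The constant and function names are hypothetical stand-ins for the ones in test_common.h. The values 128, 4, and 16 are taken from the diff, and rounding the row count up to 128 is an assumption about how the Y alignment is applied.

```cpp
#include <cstddef>

// Ceiling division, as the DIVUP macro in the test code is assumed to do.
constexpr std::size_t divup(std::size_t x, std::size_t y) { return (x + y - 1) / y; }

// Round x up to the next multiple of m (assumed meaning of DIVUP_TO_MULTIPLE).
constexpr std::size_t divup_to_multiple(std::size_t x, std::size_t m) {
  return divup(x, m) * m;
}

// Hypothetical constants mirroring the [128, 4] padding named in the diff.
constexpr std::size_t kNvfp4AlignY = 128;
constexpr std::size_t kNvfp4AlignX = 4;
constexpr std::size_t kNvfp4BlockSize = 16;  // elements that share one scale

// Padded number of scale rows for a tensor with first_dim rows (assumption).
constexpr std::size_t nvfp4_scale_dim_y(std::size_t first_dim) {
  return divup_to_multiple(first_dim, kNvfp4AlignY);
}

// Padded number of scale columns: one scale per 16 elements, padded to 4.
constexpr std::size_t nvfp4_scale_dim_x(std::size_t last_dim) {
  return divup_to_multiple(divup(last_dim, kNvfp4BlockSize), kNvfp4AlignX);
}
```

For example, last_dim = 100 needs 7 scales per row, which pads to 8 columns.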

wangye805 and others added 14 commits February 2, 2026 14:16

[ROCm] resolve the conflicts in common dir

b8a4024

[ROCm] resolve the conflicts on jax side

0519b4b

[ROCm] resolve the conflicts on pytorch side

8f4b04d

[ROCm] resolve the conflicts in setup

e60ff21

[ROCm] resolve the cpp gtest

8bbb162

[ROCm] resolve pytorch and jax tests

f573b40

Resolve wheels and examples

pytest, example, wheels conflict resolution

eaaae94

jax and pytorch bugfix

8f94cf6

copyrights and fp8_autocast->autocast fix

bac7993

Enable test_distributed_dense.py

8ae38e8

address IFU comments

05a977a

_FormatHelperFP8 and missing file add

0385852

add use_async_d2h_group_size as a test parameter

46d382d

enable FP4 tests

15416f1

matthiasdiener self-assigned this Mar 4, 2026

matthiasdiener and others added 12 commits March 4, 2026 16:13

rough initial version

bac5096

initial working version

da24223

Addressing comments and small fixes

c03b7bb

various cleanups

c453dba

manually update runner labels

4a843ba

Comment cleanup

316dffb

Merge remote-tracking branch 'origin/IFU-dev-20251114-v2.10' into mdiener/fp4-cast-transpose

8a47bc5

only enable on gfx950

5c747bd

Update jax gemm.py

db56b8f

Merge remote-tracking branch 'origin/IFU-dev-20251114-v2.10' into mdiener/fp4-cast-transpose

b318bda

Revert "only enable on gfx950"

62eea94

This reverts commit 5c747bd.

reenable in NVTEDType

6d459ec

matthiasdiener changed the base branch from IFU-dev-20251114-v2.10 to dev March 6, 2026 19:17

matthiasdiener changed the base branch from dev to IFU-dev-20251114-v2.10 March 6, 2026 19:20

Fix dev merge conflicts

6eb2707

matthiasdiener added 2 commits March 17, 2026 16:17

adjust error message slightly

6cd6038

simplify via hipify map

55a8c84

matthiasdiener force-pushed the mdiener/fp4-cast-transpose branch from 472372b to 55a8c84 March 17, 2026 22:12

adjust more error messages

10d88bf

wangye805 requested changes Mar 18, 2026

matthiasdiener added 2 commits March 18, 2026 12:56

change disabling of header includes

b4caf6f

address review comments

511db61

matthiasdiener requested a review from ipanfilo March 18, 2026 18:28

matthiasdiener added 2 commits March 18, 2026 15:54

implement SR

36cf73a

simplify slightly

a85f68f

ipanfilo reviewed Mar 18, 2026

matthiasdiener added 2 commits March 19, 2026 11:50

Merge remote-tracking branch 'origin/dev' into mdiener/fp4-cast-transpose

f4f5ec9

address review comments

a607feb

matthiasdiener force-pushed the mdiener/fp4-cast-transpose branch from bb0712d to a607feb March 19, 2026 19:26

matthiasdiener requested review from ipanfilo and wangye805 March 19, 2026 19:27

bugfix arch SR support

ca2e444

ipanfilo reviewed Mar 20, 2026

wangye805 requested changes Mar 20, 2026

matthiasdiener added 4 commits March 20, 2026 13:09

use scale constants

5a5803c

Merge remote-tracking branch 'origin/dev' into mdiener/fp4-cast-transpose

d36ccbd

simplify to use __hip_fp4x4_storage_t directly

fc5af65

simplify storage for bit fiddling

94a4e5e

matthiasdiener requested a review from ipanfilo March 20, 2026 18:37

allow null amax in fallback kernel

82af544

matthiasdiener requested a review from wangye805 March 20, 2026 19:23

minor cleanup

56fefaf
