Skip to content

NVFP4 cast/transpose without TMA#472

Open
matthiasdiener wants to merge 55 commits intodevfrom
mdiener/fp4-cast-transpose
Open

NVFP4 cast/transpose without TMA#472
matthiasdiener wants to merge 55 commits intodevfrom
mdiener/fp4-cast-transpose

Conversation

@matthiasdiener
Copy link
Contributor

@matthiasdiener matthiasdiener commented Mar 4, 2026

Description

Fixes https://github.com/ROCm/frameworks-internal/issues/15731

TODO:

  • Implement other cases, not just fwd 1D
  • tests for other cases (2D, SR)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Mar 4, 2026
@matthiasdiener matthiasdiener changed the base branch from IFU-dev-20251114-v2.10 to dev March 6, 2026 19:17
@matthiasdiener matthiasdiener changed the base branch from dev to IFU-dev-20251114-v2.10 March 6, 2026 19:20
@matthiasdiener matthiasdiener force-pushed the mdiener/fp4-cast-transpose branch from 472372b to 55a8c84 Compare March 17, 2026 22:12
@matthiasdiener
Copy link
Contributor Author

The code contains number of ifdefs for just substitution of cuda_fp4.h __nv_fp4_e2m1, etc, with HIP counterparts. I suggest to use custom hipification map (build_tools/hipify/custom_map.json) and remove ifdefs from code. It can also be used for #include <cudaTypedefs.h>

Thank you for the suggestion. I changed to using the hipify map in 55a8c84

Copy link
Collaborator

@wangye805 wangye805 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So currently we don't have any walkaround for the stochastic rounding path?

#ifdef __HIP_PLATFORM_AMD__
// If amax was not explicitly set, fall back to the scale field which
// holds the same value when set via set_scale().
NVTE_CHECK(global_amax.dptr != nullptr || output_tensor->scale.dptr != nullptr,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a bug fix for upstream? If not, why do we need this specific treatment for global amax?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe this is a bug in upstream.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put comment then.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks!

Also, check if upstream already had an fix. If not, I think it's okay to drop the rocm specific guard. What do you think @ipanfilo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is fixed in upstream yet. I added a comment in a607feb

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I would like to understand more about this fix. Probably get a B200 to test NV upstream behavior is hard. Which cpp gtest failed due to this bug? According to NV upstream design, should the output_amax be set with the correct value? If so, we should fix the bug in the place where this setting was missed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests in test_cast_nvfp4_transpose.cu fail - I suspect this isn't tested upstream for the fallback path (which is the only path we currently have implemented). Looking at the upstream optimized kernel (in quantize_transpose_nvfp4.cuh), amax_rowwise_ptr is explicitly allowed to be null; the kernel falls back to 1.0f in that case. amax.dptr is never allocated for NVFP4 tensors in the upstream test Tensor class, and the optimized path never needs it. I changed the fix to match this null-handling behavior in the fallback kernel in 82af544. I believe this is still incorrect in upstream's main branch too.

@matthiasdiener matthiasdiener requested a review from ipanfilo March 18, 2026 18:28
@matthiasdiener
Copy link
Contributor Author

So currently we don't have any walkaround for the stochastic rounding path?

I was able to implement SR via intrinsics on gfx950 in 36cf73a. I also expanded the test to use it.

#ifdef __HIP_PLATFORM_AMD__
// If amax was not explicitly set, fall back to the scale field which
// holds the same value when set via set_scale().
NVTE_CHECK(global_amax.dptr != nullptr || output_tensor->scale.dptr != nullptr,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put comment then.

}

#ifdef __HIP_PLATFORM_AMD__
__device__ __forceinline__ fp4x4_storage_t cvt_fp32_to_fp4_4x_with_stochastic_rounding(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fp4x4_storage_t is already correctly redefined for HIP and CUDA so no need ifdef here. Or if, you want to keep original declaration unchanged, you can use 'using __nv_fp4x4_e2m1= __hip_fp4x4_storage_t' on AMD.

Copy link
Contributor Author

@matthiasdiener matthiasdiener Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this can be simplified further.
Both sides need different return types and fp4x4_storage_t can only be used on AMD.
The map file has "__nv_fp4x2_e2m1" : "__hip_fp4x2_e2m1", , so using __nv_fp4x4_e2m1 = __hip_fp4x4_storage_t would become using __hip_fp4x4_e2m1 = __hip_fp4x4_storage_t after hipification, which is a redefinition of the existing struct in amd_hip_fp4.h.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fp4x4_storage_t is declared not to have ifdef later in code but use fp4x4storage_t everywhere. If you use ifdef here, you don't need fp4x4storage_t but can use __hip_fp4x4_storage_t directly.
Also, why do you need to use __hip_fp4x4_storage_t, not __hip_fp4x4_e2m1 here which would let just using hipification for resulting type?

Copy link
Contributor Author

@matthiasdiener matthiasdiener Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fp4x4_storage_t is declared not to have ifdef later in code but use fp4x4storage_t everywhere. If you use ifdef here, you don't need fp4x4storage_t but can use __hip_fp4x4_storage_t directly.

I did this simplification in fc5af65.

Also, why do you need to use __hip_fp4x4_storage_t, not __hip_fp4x4_e2m1 here which would let just using hipification for resulting type?

Thanks, I was able to find another way to do the bit fiddling in this function and the SR function, that does not need __hip_fp4x4_storage_t in 94a4e5e.

"FP4 cvt.rs PTX instructions are architecture-specific. "
"Try recompiling with sm_XXXa instead of sm_XXX.");
#else
#ifdef __gfx950__
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may make sense to have analogue of ARCH_HAS_STOCHASTIC_ROUNDING define if such guarding is used in multiple places - we'll later add more platforms with FP4 support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a607feb


// for 2D block scaling, we need to reduce amax in warp
#ifdef __HIP_PLATFORM_AMD__
static __device__ constexpr uint64_t WARP_REDUCE_AMAX_GROUP_MASKS[8] = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with 32 threads per wavefront actively used the high half of mask should be 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't kThreadsPerWarp=32 just a logical grouping value here, not the hardware wavefront width?

@wangye805
Copy link
Collaborator

So currently we don't have any walkaround for the stochastic rounding path?

I was able to implement SR via intrinsics on gfx950 in 36cf73a. I also expanded the test to use it.

Great. Thanks

@matthiasdiener matthiasdiener force-pushed the mdiener/fp4-cast-transpose branch from bb0712d to a607feb Compare March 19, 2026 19:26
size_t scale_dim_X = DIVUP_TO_MULTIPLE(DIVUP(last_dim, 16lu), scale_tensor_alignment_X_rowwise);
#ifdef __HIP_PLATFORM_AMD__
// NVFP4 requires [128,4] padding on AMD regardless of MXFP8 alignment constants
constexpr size_t nvfp4_align_Y = 128;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use nvfp4 constants from test_common.h

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done in 5a5803c.

}

#ifdef __HIP_PLATFORM_AMD__
__device__ __forceinline__ fp4x4_storage_t cvt_fp32_to_fp4_4x_with_stochastic_rounding(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fp4x4_storage_t is declared not to have ifdef later in code but use fp4x4storage_t everywhere. If you use ifdef here, you don't need fp4x4storage_t but can use __hip_fp4x4_storage_t directly.
Also, why do you need to use __hip_fp4x4_storage_t, not __hip_fp4x4_e2m1 here which would let just using hipification for resulting type?

#ifdef __HIP_PLATFORM_AMD__
// If amax was not explicitly set, fall back to the scale field which
// holds the same value when set via set_scale().
NVTE_CHECK(global_amax.dptr != nullptr || output_tensor->scale.dptr != nullptr,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I would like to understand more about this fix. Probably get a B200 to test NV upstream behavior is hard. Which cpp gtest failed due to this bug? According to NV upstream design, should the output_amax be set with the correct value? If so, we should fix the bug in the place where this setting was missed

@matthiasdiener matthiasdiener requested a review from ipanfilo March 20, 2026 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants