Conversation
Resolve wheels and examples
Force-pushed 472372b to 55a8c84
Thank you for the suggestion. I changed to using the hipify map in 55a8c84.
wangye805 left a comment:
So currently we don't have any workaround for the stochastic rounding path?
#ifdef __HIP_PLATFORM_AMD__
  // If amax was not explicitly set, fall back to the scale field which
  // holds the same value when set via set_scale().
  NVTE_CHECK(global_amax.dptr != nullptr || output_tensor->scale.dptr != nullptr,
Is it a bug fix for upstream? If not, why do we need this specific treatment for global amax?
Yes, I believe this is a bug in upstream.
I see. Thanks!
Also, check whether upstream already has a fix. If not, I think it's okay to drop the ROCm-specific guard. What do you think @ipanfilo?
I don't think this is fixed in upstream yet. I added a comment in a607feb
Thanks. I would like to understand more about this fix. Getting a B200 to test NV upstream behavior is probably hard. Which cpp gtest failed due to this bug? According to the NV upstream design, should output_amax be set with the correct value? If so, we should fix the bug in the place where that setting was missed.
The tests in test_cast_nvfp4_transpose.cu fail; I suspect this isn't tested upstream for the fallback path (which is the only path we currently have implemented). Looking at the upstream optimized kernel (in quantize_transpose_nvfp4.cuh), amax_rowwise_ptr is explicitly allowed to be null; the kernel falls back to 1.0f in that case. amax.dptr is never allocated for NVFP4 tensors in the upstream test Tensor class, and the optimized path never needs it. I changed the fix to match this null-handling behavior in the fallback kernel in 82af544. I believe this is still incorrect in upstream's main branch too.
transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu (outdated, resolved)
I was able to implement SR via intrinsics on gfx950 in 36cf73a. I also expanded the test to use it.
transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu (resolved)
}

#ifdef __HIP_PLATFORM_AMD__
__device__ __forceinline__ fp4x4_storage_t cvt_fp32_to_fp4_4x_with_stochastic_rounding(
fp4x4_storage_t is already correctly redefined for HIP and CUDA, so there is no need for the ifdef here. Or, if you want to keep the original declaration unchanged, you can use `using __nv_fp4x4_e2m1 = __hip_fp4x4_storage_t;` on AMD.
I'm not sure this can be simplified further.
Both sides need different return types, and fp4x4_storage_t can only be used on AMD.
The map file has `"__nv_fp4x2_e2m1" : "__hip_fp4x2_e2m1"`, so `using __nv_fp4x4_e2m1 = __hip_fp4x4_storage_t` would become `using __hip_fp4x4_e2m1 = __hip_fp4x4_storage_t` after hipification, which is a redefinition of the existing struct in amd_hip_fp4.h.
fp4x4_storage_t is declared later in the code without an ifdef and is then used everywhere. If you use the ifdef here, you don't need fp4x4_storage_t and can use __hip_fp4x4_storage_t directly.
Also, why do you need __hip_fp4x4_storage_t rather than __hip_fp4x4_e2m1 here, which would let hipification produce the resulting type directly?
fp4x4_storage_t is declared later in the code without an ifdef and is then used everywhere. If you use the ifdef here, you don't need fp4x4_storage_t and can use __hip_fp4x4_storage_t directly.
I did this simplification in fc5af65.
Also, why do you need __hip_fp4x4_storage_t rather than __hip_fp4x4_e2m1 here, which would let hipification produce the resulting type directly?
Thanks, I was able to find another way to do the bit fiddling in this function and the SR function that does not need __hip_fp4x4_storage_t, in 94a4e5e.
             "FP4 cvt.rs PTX instructions are architecture-specific. "
             "Try recompiling with sm_XXXa instead of sm_XXX.");
#else
#ifdef __gfx950__
It may make sense to have an analogue of an ARCH_HAS_STOCHASTIC_ROUNDING define if such guarding is used in multiple places; we'll later add more platforms with FP4 support.
// for 2D block scaling, we need to reduce amax in warp
#ifdef __HIP_PLATFORM_AMD__
static __device__ constexpr uint64_t WARP_REDUCE_AMAX_GROUP_MASKS[8] = {
I think with only 32 threads per wavefront actively used, the high half of the mask should be 0.
Isn't kThreadsPerWarp=32 just a logical grouping value here, not the hardware wavefront width?
Great. Thanks
Force-pushed bb0712d to a607feb
tests/cpp/test_common.cu (outdated)
size_t scale_dim_X = DIVUP_TO_MULTIPLE(DIVUP(last_dim, 16lu), scale_tensor_alignment_X_rowwise);
#ifdef __HIP_PLATFORM_AMD__
// NVFP4 requires [128,4] padding on AMD regardless of MXFP8 alignment constants
constexpr size_t nvfp4_align_Y = 128;
Use the nvfp4 constants from test_common.h.
Description
Fixes https://github.com/ROCm/frameworks-internal/issues/15731