
Conversation

@IMbackK (Collaborator) commented Aug 13, 2025

Switch over to hip_bf16 from the legacy hip_bfloat16 (a version-gating sketch follows below)
Simplify the RDNA3 define
Lower the version at which we switch to the new hipBLAS API to ROCm 6.5, as this version is used by the ROCm 7.0 previews
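
For context on the first point, one common way to gate such an include on the HIP/ROCm version looks roughly like the sketch below. The header names come from ROCm, but the 6.3 cutoff and the exact macro layout are assumptions for illustration, not necessarily what this PR ends up doing.

```cpp
// Sketch only, not the actual ggml header. Assumes the cutoff discussed later in the
// thread (ROCm 6.3); HIP_VERSION is major*10000000 + minor*100000 + patch.
#include <hip/hip_version.h>

#if HIP_VERSION >= 60300000
#include <hip/hip_bf16.h>      // modern header: __hip_bfloat16 and the __hip_bfloat162 vector type
#else
#include <hip/hip_bfloat16.h>  // legacy header: hip_bfloat16 only, no native vector type
#endif
```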

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Aug 13, 2025
@IMbackK (Collaborator, Author) commented Aug 13, 2025

Looks like this doesn't compile against older ROCm; don't commit.

@IMbackK IMbackK marked this pull request as draft August 13, 2025 09:03
@IMbackK IMbackK marked this pull request as ready for review August 14, 2025 09:48
@IMbackK (Collaborator, Author) commented Aug 14, 2025

@JohannesGaessler I'm not a huge fan of the changes in 97392a5, but I see no better option. The other options we have are:

  1. restrict the supported ROCm versions to 6.3+
     1. this is just one version behind the latest
     2. we lose Debian stable, and probably others like it
  2. keep using hipbfloat16.h for now, including the bfloat162 hack

@IMbackK (Collaborator, Author) commented Aug 14, 2025

To everyone: if you do merge this, please do not squash 97392a5 into 3f662cd, as I intend to revert 97392a5 once we are in a position to drop support for ROCm < 6.3.

@JohannesGaessler (Collaborator) commented:

I agree that having some extra function for type conversions is very annoying. Unfortunately with CUDA this is already necessary on master due to FP16 <-> BF16 conversions being ambiguous. I pushed a version that consolidates the other code that needs to handle these cases so that, going forward, we need to maintain only a single version. I also simplified the code a bit: I changed the name to ggml_cuda_cast to make the lines using it a bit shorter and I swapped the order of template arguments in order to make it possible to specify only the destination type with the source type being inferred from the argument.

One more question: is there a reason why you declared the function as __host__ __device__? I don't think it's ever used in host code. (There is, I think, technically also a difference in how the code is compiled due to the use of inline instead of __forceinline__, but the compiler really should be inlining these functions.)
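
For reference, here is a minimal sketch of a destination-first cast helper in the spirit of what is described above. The helper name cast_sketch is made up for illustration; the real ggml_cuda_cast signature, specializations, and file location differ.

```cpp
// Sketch (C++17): only the destination type is spelled out, the source type is deduced.
#include <type_traits>
#include <cuda_fp16.h>  // half
#include <cuda_bf16.h>  // nv_bfloat16

template <typename dst_t, typename src_t>
__host__ __device__ inline dst_t cast_sketch(src_t x) {
    if constexpr (std::is_same_v<dst_t, src_t>) {
        return x;
    } else if constexpr (std::is_same_v<src_t, half> && std::is_same_v<dst_t, nv_bfloat16>) {
        // FP16 -> BF16 has no unambiguous implicit conversion, so go through float explicitly.
        return __float2bfloat16(__half2float(x));
    } else if constexpr (std::is_same_v<src_t, nv_bfloat16> && std::is_same_v<dst_t, half>) {
        // Same in the other direction.
        return __float2half(__bfloat162float(x));
    } else {
        return dst_t(x);
    }
}

// Usage: the destination type is explicit, the source type comes from the argument.
//   float f = cast_sketch<float>(h);   // h may be half, nv_bfloat16, float, ...
```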

@IMbackK (Collaborator, Author) commented Aug 14, 2025

I had the template parameters in that order because that is what all other functions in ggml use, but this way is fine with me.

The rest of the changes are fine with me.

While nothing currently uses this on the host side, I see no reason to restrict it to device code. Indeed, if it's just __device__, someone might look at this and conclude that the casting function is only necessary on the device side, which is not true.

The inline is obviously just there because it's header-implemented. I have no objections to making it __forceinline__, but if the compiler is not inlining these functions automatically, I don't know what to tell you.

@JohannesGaessler (Collaborator) commented:

I think __forceinline__ is incompatible with __host__; that's why I was asking. Using it would be more consistent with the rest of the code, but I don't think it's going to matter much.

@IMbackK (Collaborator, Author) commented Aug 14, 2025

__forceinline__ is fine to apply to host functions in HIP; not sure about CUDA.
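
For reference, a sketch of the two qualifier combinations being discussed. The function names are illustrative and not from the PR; whether __forceinline__ on a __host__ function is accepted by every CUDA toolchain is exactly the question left open above, while HIP accepts it per the comment.

```cpp
// Illustrative only, not the final ggml declaration.
template <typename dst_t, typename src_t>
__host__ __device__ inline dst_t cast_host_device(src_t x) { return dst_t(x); }  // callable from host and device

template <typename dst_t, typename src_t>
__device__ __forceinline__ dst_t cast_device_only(src_t x) { return dst_t(x); }  // device-only, inlining forced
```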

@IMbackK merged commit 5ba36f6 into ggml-org:master Aug 14, 2025
46 of 47 checks passed