
Conversation

@taimur-10x commented Nov 12, 2025

This PR extends the existing RISC-V Vector (RVV) floating-point support introduced in PR #15075, adding new kernels.

Summary

  • Adds a BF16 RVV flag to ggml-cpu/CMakeLists.txt to enable the zvfbfwma extension
  • Adds 6 new kernels for floating-point operations.

Newly Added Kernels

  • ggml_vec_dot_bf16
  • ggml_vec_mad_f16
  • ggml_vec_scale_f16
  • ggml_vec_dot_f16_unroll
  • ggml_cpu_bf16_to_fp32
  • ggml_cpu_fp16_to_fp32
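
For context, these kernels follow the standard RVV strip-mining pattern: `vsetvl` returns how many elements the chosen LMUL grouping handles each iteration, so the same binary runs correctly at any VLEN. The sketch below is illustrative only (it is not the code in this PR); the function name, signature, and the LMUL=2 choice are assumptions for the example, and it requires a toolchain with the zvfh extension enabled.

```c
// Illustrative only: a VLEN-agnostic f16 scale kernel in the RVV
// strip-mining style used by these kernels (not the code in this PR).
// Requires a toolchain with the zvfh extension (e.g. -march=rv64gcv_zvfh).
#include <riscv_vector.h>

static void scale_f16_example(_Float16 * y, _Float16 v, int n) {
    for (int i = 0; i < n; ) {
        // vl = number of f16 elements handled this iteration at LMUL=2
        size_t vl = __riscv_vsetvl_e16m2(n - i);
        vfloat16m2_t vy = __riscv_vle16_v_f16m2(&y[i], vl);  // load
        vy = __riscv_vfmul_vf_f16m2(vy, v, vl);              // y[i..i+vl) *= v
        __riscv_vse16_v_f16m2(&y[i], vy, vl);                // store back
        i += (int) vl;
    }
}
```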

Testing

Kernels were functionally tested on QEMU at VLENs of 128, 256, 512 and 1024 bits, over a range of input sizes.

For RISE

Additional Notes

The testing and benchmarking files will be shared in a subsequent PR:

  • test-float-fns: Functional Testing of Floating-Point Kernels
  • test-float-perf: Performance Benchmarking of Floating-Point Kernels

Benchmarks

Benchmark results on the Banana Pi BPI-F3 (VLEN=256)

Cache Hot

| Kernel | LMUL | Unroll | Input Row Size | M-Ops/s (Scalar) | M-Ops/s (Vector) | Speedup |
|---|---|---|---|---|---|---|
| ggml_vec_mad_f16 | 4 | 2 | 1024 | 1478.62 | 5461.33 | ~3.70x |
| ggml_vec_scale_f16 | 4 | 2 | 1024 | 854.98 | 3370.70 | ~3.94x |
| ggml_vec_dot_f16_unroll | 2 | 2 | 1024 | 702.21 | 8943.23 | ~12.73x |
| ggml_vec_silu_f32 | 2 | 1 | 1024 | 667.47 | 7509.33 | ~11.25x |
| ggml_cpu_fp16_to_fp32 | 2 | 2 | 1024 | 346.18 | 1446.33 | ~4.18x |

We do not have hardware to benchmark the BF16 kernels:

  • ggml_vec_dot_bf16
  • ggml_cpu_bf16_to_fp32

Kernel Benchmarking

vec_dot_f16

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_dot_f16_scalar (autovec) | - | - | - | 3687 | 2625 |
| vec_dot_f16 | 1 | 1 | No | 1708 | 375 |
| vec_dot_f16 | 1 | 2 | No | 1791 | 333 |
| vec_dot_f16 | 1 | 2 | Yes | 1833 | 333 |
| vec_dot_f16 | 1 | 4 | No | 2208 | 333 |
| vec_dot_f16 | 1 | 4 | Yes | 1999 | 375 |
| vec_dot_f16 | 1 | 8 | No | 3708 | 375 |
| vec_dot_f16 | 1 | 8 | Yes | 1791 | 375 |
| vec_dot_f16 | 2 | 1 | No | 1791 | 375 |
| vec_dot_f16 | 2 | 2 | No | 2375 | 291 |
| vec_dot_f16 | 2 | 2 | Yes | 1917 | 291 |
| vec_dot_f16 | 2 | 4 | No | 3666 | 333 |
| vec_dot_f16 | 2 | 4 | Yes | 1791 | 333 |
| vec_dot_f16 | 4 | 1 | No | 3458 | 291 |
| vec_dot_f16 | 4 | 2 | No | 3124 | 291 |
| vec_dot_f16 | 4 | 2 | Yes | 3333 | 333 |
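
To make the LMUL and unrolling columns concrete: the vectorized dot product accumulates f16 products into a widened f32 accumulator, and unrolling keeps several independent accumulators live per iteration to hide FMA latency. The sketch below is an illustrative LMUL=2, unroll=1 variant, not the merged kernel; the function name and signature are assumptions, and it uses the explicit tail-undisturbed (`_tu`) intrinsic so a short final iteration leaves earlier partial sums intact.

```c
// Illustrative only: an f16 dot product with widening f32 accumulation at
// LMUL=2 and no unrolling (not the merged kernel; names are hypothetical).
// Requires a toolchain with the zvfh extension (e.g. -march=rv64gcv_zvfh).
#include <riscv_vector.h>

static float dot_f16_example(const _Float16 * x, const _Float16 * y, int n) {
    const size_t accl = __riscv_vsetvlmax_e32m4();
    // f32 accumulator register group (e16m2 inputs widen into e32m4), zeroed
    vfloat32m4_t acc = __riscv_vfmv_v_f_f32m4(0.0f, accl);

    for (int i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e16m2(n - i);
        vfloat16m2_t vx = __riscv_vle16_v_f16m2(&x[i], vl);
        vfloat16m2_t vy = __riscv_vle16_v_f16m2(&y[i], vl);
        // acc[0..vl) += widen(vx) * widen(vy); the _tu form keeps the tail
        // elements of acc undisturbed when the final iteration is short
        acc = __riscv_vfwmacc_vv_f32m4_tu(acc, vx, vy, vl);
        i += (int) vl;
    }

    // horizontal sum of the accumulator down to a single f32
    vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, __riscv_vsetvlmax_e32m1());
    vfloat32m1_t sum  = __riscv_vfredusum_vs_f32m4_f32m1(acc, zero, accl);
    return __riscv_vfmv_f_s_f32m1_f32(sum);
}
```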

vec_mad_f16

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_mad_f16_scalar | - | - | - | 2583 | 1416 |
| vec_mad_f16 | 1 | 1 | No | 2000 | 541 |
| vec_mad_f16 | 1 | 2 | No | 2124 | 541 |
| vec_mad_f16 | 1 | 2 | Yes | 1958 | 541 |
| vec_mad_f16 | 1 | 4 | No | 1917 | 541 |
| vec_mad_f16 | 1 | 4 | Yes | 2208 | 541 |
| vec_mad_f16 | 1 | 8 | No | 2083 | 583 |
| vec_mad_f16 | 1 | 8 | Yes | 3167 | 500 |
| vec_mad_f16 | 2 | 1 | No | 1916 | 458 |
| vec_mad_f16 | 2 | 2 | No | 2000 | 458 |
| vec_mad_f16 | 2 | 2 | Yes | 2249 | 416 |
| vec_mad_f16 | 2 | 4 | No | 1833 | 458 |
| vec_mad_f16 | 2 | 4 | Yes | 3125 | 458 |
| vec_mad_f16 | 2 | 8 | No | 1874 | 500 |
| vec_mad_f16 | 2 | 8 | Yes | 2375 | 458 |
| vec_mad_f16 | 4 | 1 | No | 3292 | 375 |
| vec_mad_f16 | 4 | 2 | No | 3917 | 375 |
| vec_mad_f16 | 4 | 2 | Yes | 2250 | 375 |
| vec_mad_f16 | 4 | 4 | No | 2375 | 375 |
| vec_mad_f16 | 4 | 4 | Yes | 2417 | 375 |
| vec_mad_f16 | 8 | 1 | No | 2258 | 375 |
| vec_mad_f16 | 8 | 2 | No | 2458 | 375 |
| vec_mad_f16 | 8 | 2 | Yes | 2667 | 375 |

vec_scale_f16

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_scale_f16_scalar (autovec) | - | - | - | 1766 | 416 |
| vec_scale_f16 | 1 | 1 | No | 1750 | 416 |
| vec_scale_f16 | 1 | 2 | No | 1833 | 416 |
| vec_scale_f16 | 1 | 2 | Yes | 1750 | 416 |
| vec_scale_f16 | 1 | 4 | No | 1666 | 416 |
| vec_scale_f16 | 1 | 4 | Yes | 1833 | 416 |
| vec_scale_f16 | 1 | 8 | No | 1958 | 416 |
| vec_scale_f16 | 1 | 8 | Yes | 1375 | 416 |
| vec_scale_f16 | 2 | 1 | No | 1584 | 333 |
| vec_scale_f16 | 2 | 2 | No | 1708 | 334 |
| vec_scale_f16 | 2 | 2 | Yes | 1666 | 333 |
| vec_scale_f16 | 2 | 4 | No | 1708 | 375 |
| vec_scale_f16 | 2 | 4 | Yes | 1750 | 375 |
| vec_scale_f16 | 2 | 8 | No | 1708 | 375 |
| vec_scale_f16 | 2 | 8 | Yes | 1916 | 375 |
| vec_scale_f16 | 4 | 1 | No | 1708 | 291 |
| vec_scale_f16 | 4 | 2 | No | 1625 | 291 |
| vec_scale_f16 | 4 | 2 | Yes | 1500 | 291 |
| vec_scale_f16 | 4 | 4 | No | 1708 | 291 |
| vec_scale_f16 | 4 | 4 | Yes | 1666 | 291 |
| vec_scale_f16 | 4 | 8 | No | 1666 | 291 |
| vec_scale_f16 | 4 | 8 | Yes | 1542 | 291 |
| vec_scale_f16 | 8 | 1 | No | 1542 | 291 |
| vec_scale_f16 | 8 | 2 | No | 1666 | 291 |
| vec_scale_f16 | 8 | 2 | Yes | 1666 | 291 |
| vec_scale_f16 | 8 | 4 | No | 1625 | 291 |
| vec_scale_f16 | 8 | 4 | Yes | 1541 | 291 |

cpu_f16_to_fp32

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| cpu_f16_to_f32_scalar (autovec) | - | - | - | 2499 | 708 |
| cpu_f16_to_f32 | 1 | 1 | No | 2500 | 541 |
| cpu_f16_to_f32 | 1 | 2 | No | 2166 | 541 |
| cpu_f16_to_f32 | 1 | 2 | Yes | 2250 | 541 |
| cpu_f16_to_f32 | 1 | 4 | No | 2334 | 583 |
| cpu_f16_to_f32 | 1 | 4 | Yes | 2500 | 541 |
| cpu_f16_to_f32 | 1 | 8 | No | 2416 | 583 |
| cpu_f16_to_f32 | 1 | 8 | Yes | 2583 | 541 |
| cpu_f16_to_f32 | 2 | 1 | No | 2500 | 416 |
| cpu_f16_to_f32 | 2 | 2 | No | 2125 | 416 |
| cpu_f16_to_f32 | 2 | 2 | Yes | 2042 | 458 |
| cpu_f16_to_f32 | 2 | 4 | No | 2250 | 458 |
| cpu_f16_to_f32 | 2 | 4 | Yes | 2333 | 458 |
| cpu_f16_to_f32 | 4 | 1 | No | 2416 | 458 |
| cpu_f16_to_f32 | 4 | 2 | No | 2000 | 458 |
| cpu_f16_to_f32 | 4 | 2 | Yes | 2334 | 458 |
| cpu_f16_to_f32 | 4 | 4 | No | 2500 | 458 |
| cpu_f16_to_f32 | 4 | 4 | Yes | 2625 | 458 |

vec_dot_f16_unroll

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_dot_f16_unroll_scalar (autovec) | - | - | - | 5083 | 3333 |
| vec_dot_f16_unroll | 1 | 1 | No | 2833 | 625 |
| vec_dot_f16_unroll | 1 | 2 | No | 2958 | 500 |
| vec_dot_f16_unroll | 1 | 2 | Yes | 4124 | 500 |
| vec_dot_f16_unroll | 1 | 4 | No | 3833 | 541 |
| vec_dot_f16_unroll | 1 | 4 | Yes | 4375 | 500 |
| vec_dot_f16_unroll | 2 | 1 | No | 2750 | 500 |
| vec_dot_f16_unroll | 2 | 2 | No | 3875 | 458 |
| vec_dot_f16_unroll | 2 | 2 | Yes | 4208 | 458 |
| vec_dot_f16_unroll | 4 | 1 | No | 3583 | 500 |
| vec_dot_f16_unroll | 4 | 2 | No | 5666 | 1416 |
| vec_dot_f16_unroll | 4 | 2 | Yes | 6332 | 1416 |

vec_silu_f32

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_silu_f32_scalar (autovectorized) | - | - | - | 55959 | 54459 |
| vec_silu_f32 | 1 | - | - | 8625 | 7666 |
| vec_silu_f32 | 2 | - | - | 7041 | 6083 |
| vec_silu_f32 | 4 | - | - | 8125 | 6875 |

@luhenry commented Nov 12, 2025

Would you still have the numbers lying around for using m8 for the various kernels? IIRC the choice of LMUL was based on the best performance for cold data caches, right?

@taimur-10x

> Would you still have the numbers lying around for using m8 for the various kernels? IIRC the choice of LMUL was based on the best performance for cold data caches, right?

Yeah, we switched from a higher LMUL in favor of a lower one to better cater to the cache-cold case.

We'll share the numbers for the kernels (with the LMUL and unrolling permutations) for both cache hot and cold.

@luhenry commented Nov 13, 2025

> Would you still have the numbers lying around for using m8 for the various kernels? IIRC the choice of LMUL was based on the best performance for cold data caches, right?
>
> Yeah, we switched from a higher LMUL in favor of a lower one to better cater to the cache-cold case.
>
> We'll share the numbers for the kernels (with the LMUL and unrolling permutations) for both cache hot and cold.

As discussed on the call, the way we should choose is:

  1. prioritize the hot cache numbers, lower is better
  2. then, look at cold cache numbers

We should also pick what's currently the better number on the BananaPi, and not optimize for ideal hardware (out-of-order, more vector ports, etc.). For better or for worse, the BananaPi is what is currently commercially available, so it's the fairest target to have for RISE (no preferential treatment of some microarchitectures). Once better hardware is broadly commercially available, we will want to do another pass of optimizations.

@taimur-10x commented Nov 13, 2025

> We should also pick what's currently the better number on the BananaPi, and not optimize for ideal hardware (out-of-order, more vector ports, etc.). For better or for worse, the BananaPi is what is currently commercially available, so it's the fairest target to have for RISE (no preferential treatment of some microarchitectures). Once better hardware is broadly commercially available, we will want to do another pass of optimizations.

Sure, makes sense. We'll make the changes.

@david-baker-808 commented Nov 13, 2025 via email

@taimur-10x taimur-10x force-pushed the 10x/rvv-floating-kernels branch from e646298 to ffbab18 Compare November 14, 2025 13:05
@taimur-10x

We've made the changes as discussed yesterday. Good to go from our end.

@taimur-10x taimur-10x requested a review from luhenry November 17, 2025 08:52
@luhenry commented Nov 17, 2025

@taimur-10x can you please rebase before merging, or should it be squashed? Thank you!

@luhenry commented Nov 17, 2025

@taimur-10x I asked for a rebase, not a merge of master. A `git pull --rebase upstream master` or an equivalent `git rebase origin/master` will do. That way you can remove the "Merge master" commits.

@taimur-10x taimur-10x force-pushed the 10x/rvv-floating-kernels branch from e3be2ff to da60598 Compare November 17, 2025 10:09
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
@taimur-10x taimur-10x force-pushed the 10x/rvv-floating-kernels branch from be4fa97 to acf3e4f Compare November 17, 2025 10:18
@taimur-10x

@ludovic, rebased and squashed where required.

Should I merge this in master in this fork, and then open up a PR from there to upstream? Or should I close this PR, and open up a new one for upstream from this branch?

@luhenry luhenry changed the base branch from master to riscv November 17, 2025 10:39
@luhenry luhenry merged commit 1bda8fb into riscv Nov 17, 2025
59 of 71 checks passed
@luhenry commented Nov 17, 2025

@taimur-10x I've created a riscv branch from master, please submit an MR to upstream from that branch.

@taimur-10x

Opened here: ggml-org#17318
