
Conversation

@taimur-10x commented Nov 12, 2025

This PR extends the existing RISC-V Vector (RVV) floating-point support introduced in PR #15075, adding new kernels.

Summary

  • Adds a BF16 RVV flag to ggml-cpu/CMakeLists.txt to enable the zvfbfwma extension
  • Adds 6 new kernels for floating-point operations.

Newly Added Kernels

  • ggml_vec_dot_bf16
  • ggml_vec_mad_f16
  • ggml_vec_scale_f16
  • ggml_vec_dot_f16_unroll
  • ggml_cpu_bf16_to_fp32
  • ggml_cpu_fp16_to_fp32
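
For context, these kernels follow the standard RVV strip-mining pattern: `vsetvl` returns how many elements the chosen LMUL grouping handles each iteration, so the same binary runs correctly at any VLEN. The sketch below is illustrative only (it is not the code in this PR); the function name, signature, and the LMUL=2 choice are assumptions for the example, and it requires a toolchain with the zvfh extension enabled.

```c
// Illustrative only: a VLEN-agnostic f16 scale kernel in the RVV
// strip-mining style used by these kernels (not the code in this PR).
// Requires a toolchain with the zvfh extension (e.g. -march=rv64gcv_zvfh).
#include <riscv_vector.h>

static void scale_f16_example(_Float16 * y, _Float16 v, int n) {
    for (int i = 0; i < n; ) {
        // vl = number of f16 elements handled this iteration at LMUL=2
        size_t vl = __riscv_vsetvl_e16m2(n - i);
        vfloat16m2_t vy = __riscv_vle16_v_f16m2(&y[i], vl);  // load
        vy = __riscv_vfmul_vf_f16m2(vy, v, vl);              // y[i..i+vl) *= v
        __riscv_vse16_v_f16m2(&y[i], vy, vl);                // store back
        i += (int) vl;
    }
}
```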

Testing

Kernels were functionally tested on QEMU at VLENs of 128, 256, 512 and 1024 bits, over a range of input sizes.

For RISE

Additional Notes

The testing and benchmarking files will be shared in a subsequent PR:

  • test-float-fns: Functional Testing of Floating-Point Kernels
  • test-float-perf: Performance Benchmarking of Floating-Point Kernels

Benchmarks

Benchmark results on the Banana Pi BPI-F3 (VLEN=256)

Cache Hot

| Kernel | LMUL | Unroll | Input Row Size | M-Ops/s (Scalar) | M-Ops/s (Vector) | Speedup |
|---|---|---|---|---|---|---|
| ggml_vec_mad_f16 | 4 | 2 | 1024 | 1478.62 | 5461.33 | ~3.70x |
| ggml_vec_scale_f16 | 4 | 2 | 1024 | 854.98 | 3370.70 | ~3.94x |
| ggml_vec_dot_f16_unroll | 2 | 2 | 1024 | 702.21 | 8943.23 | ~12.73x |
| ggml_vec_silu_f32 | 2 | 1 | 1024 | 667.47 | 7509.33 | ~11.25x |
| ggml_cpu_fp16_to_fp32 | 2 | 2 | 1024 | 346.18 | 1446.33 | ~4.18x |

We do not have hardware to benchmark the BF16 kernels:

  • ggml_vec_dot_bf16
  • ggml_cpu_bf16_to_fp32

Kernel Benchmarking

vec_dot_f16

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_dot_f16_scalar (autovec) | - | - | - | 3687 | 2625 |
| vec_dot_f16 | 1 | 1 | No | 1708 | 375 |
| vec_dot_f16 | 1 | 2 | No | 1791 | 333 |
| vec_dot_f16 | 1 | 2 | Yes | 1833 | 333 |
| vec_dot_f16 | 1 | 4 | No | 2208 | 333 |
| vec_dot_f16 | 1 | 4 | Yes | 1999 | 375 |
| vec_dot_f16 | 1 | 8 | No | 3708 | 375 |
| vec_dot_f16 | 1 | 8 | Yes | 1791 | 375 |
| vec_dot_f16 | 2 | 1 | No | 1791 | 375 |
| vec_dot_f16 | 2 | 2 | No | 2375 | 291 |
| vec_dot_f16 | 2 | 2 | Yes | 1917 | 291 |
| vec_dot_f16 | 2 | 4 | No | 3666 | 333 |
| vec_dot_f16 | 2 | 4 | Yes | 1791 | 333 |
| vec_dot_f16 | 4 | 1 | No | 3458 | 291 |
| vec_dot_f16 | 4 | 2 | No | 3124 | 291 |
| vec_dot_f16 | 4 | 2 | Yes | 3333 | 333 |
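
To make the LMUL and unrolling columns concrete: the vectorized dot product accumulates f16 products into a widened f32 accumulator, and unrolling keeps several independent accumulators live per iteration to hide FMA latency. The sketch below is an illustrative LMUL=2, unroll=1 variant, not the merged kernel; the function name and signature are assumptions, and it uses the explicit tail-undisturbed (`_tu`) intrinsic so a short final iteration leaves earlier partial sums intact.

```c
// Illustrative only: an f16 dot product with widening f32 accumulation at
// LMUL=2 and no unrolling (not the merged kernel; names are hypothetical).
// Requires a toolchain with the zvfh extension (e.g. -march=rv64gcv_zvfh).
#include <riscv_vector.h>

static float dot_f16_example(const _Float16 * x, const _Float16 * y, int n) {
    const size_t accl = __riscv_vsetvlmax_e32m4();
    // f32 accumulator register group (e16m2 inputs widen into e32m4), zeroed
    vfloat32m4_t acc = __riscv_vfmv_v_f_f32m4(0.0f, accl);

    for (int i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e16m2(n - i);
        vfloat16m2_t vx = __riscv_vle16_v_f16m2(&x[i], vl);
        vfloat16m2_t vy = __riscv_vle16_v_f16m2(&y[i], vl);
        // acc[0..vl) += widen(vx) * widen(vy); the _tu form keeps the tail
        // elements of acc undisturbed when the final iteration is short
        acc = __riscv_vfwmacc_vv_f32m4_tu(acc, vx, vy, vl);
        i += (int) vl;
    }

    // horizontal sum of the accumulator down to a single f32
    vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, __riscv_vsetvlmax_e32m1());
    vfloat32m1_t sum  = __riscv_vfredusum_vs_f32m4_f32m1(acc, zero, accl);
    return __riscv_vfmv_f_s_f32m1_f32(sum);
}
```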

vec_mad_f16

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_mad_f16_scalar | - | - | - | 2583 | 1416 |
| vec_mad_f16 | 1 | 1 | No | 2000 | 541 |
| vec_mad_f16 | 1 | 2 | No | 2124 | 541 |
| vec_mad_f16 | 1 | 2 | Yes | 1958 | 541 |
| vec_mad_f16 | 1 | 4 | No | 1917 | 541 |
| vec_mad_f16 | 1 | 4 | Yes | 2208 | 541 |
| vec_mad_f16 | 1 | 8 | No | 2083 | 583 |
| vec_mad_f16 | 1 | 8 | Yes | 3167 | 500 |
| vec_mad_f16 | 2 | 1 | No | 1916 | 458 |
| vec_mad_f16 | 2 | 2 | No | 2000 | 458 |
| vec_mad_f16 | 2 | 2 | Yes | 2249 | 416 |
| vec_mad_f16 | 2 | 4 | No | 1833 | 458 |
| vec_mad_f16 | 2 | 4 | Yes | 3125 | 458 |
| vec_mad_f16 | 2 | 8 | No | 1874 | 500 |
| vec_mad_f16 | 2 | 8 | Yes | 2375 | 458 |
| vec_mad_f16 | 4 | 1 | No | 3292 | 375 |
| vec_mad_f16 | 4 | 2 | No | 3917 | 375 |
| vec_mad_f16 | 4 | 2 | Yes | 2250 | 375 |
| vec_mad_f16 | 4 | 4 | No | 2375 | 375 |
| vec_mad_f16 | 4 | 4 | Yes | 2417 | 375 |
| vec_mad_f16 | 8 | 1 | No | 2258 | 375 |
| vec_mad_f16 | 8 | 2 | No | 2458 | 375 |
| vec_mad_f16 | 8 | 2 | Yes | 2667 | 375 |

vec_scale_f16

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_scale_f16_scalar (autovec) | - | - | - | 1766 | 416 |
| vec_scale_f16 | 1 | 1 | No | 1750 | 416 |
| vec_scale_f16 | 1 | 2 | No | 1833 | 416 |
| vec_scale_f16 | 1 | 2 | Yes | 1750 | 416 |
| vec_scale_f16 | 1 | 4 | No | 1666 | 416 |
| vec_scale_f16 | 1 | 4 | Yes | 1833 | 416 |
| vec_scale_f16 | 1 | 8 | No | 1958 | 416 |
| vec_scale_f16 | 1 | 8 | Yes | 1375 | 416 |
| vec_scale_f16 | 2 | 1 | No | 1584 | 333 |
| vec_scale_f16 | 2 | 2 | No | 1708 | 334 |
| vec_scale_f16 | 2 | 2 | Yes | 1666 | 333 |
| vec_scale_f16 | 2 | 4 | No | 1708 | 375 |
| vec_scale_f16 | 2 | 4 | Yes | 1750 | 375 |
| vec_scale_f16 | 2 | 8 | No | 1708 | 375 |
| vec_scale_f16 | 2 | 8 | Yes | 1916 | 375 |
| vec_scale_f16 | 4 | 1 | No | 1708 | 291 |
| vec_scale_f16 | 4 | 2 | No | 1625 | 291 |
| vec_scale_f16 | 4 | 2 | Yes | 1500 | 291 |
| vec_scale_f16 | 4 | 4 | No | 1708 | 291 |
| vec_scale_f16 | 4 | 4 | Yes | 1666 | 291 |
| vec_scale_f16 | 4 | 8 | No | 1666 | 291 |
| vec_scale_f16 | 4 | 8 | Yes | 1542 | 291 |
| vec_scale_f16 | 8 | 1 | No | 1542 | 291 |
| vec_scale_f16 | 8 | 2 | No | 1666 | 291 |
| vec_scale_f16 | 8 | 2 | Yes | 1666 | 291 |
| vec_scale_f16 | 8 | 4 | No | 1625 | 291 |
| vec_scale_f16 | 8 | 4 | Yes | 1541 | 291 |

cpu_f16_to_fp32

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| cpu_f16_to_f32_scalar (autovec) | - | - | - | 2499 | 708 |
| cpu_f16_to_f32 | 1 | 1 | No | 2500 | 541 |
| cpu_f16_to_f32 | 1 | 2 | No | 2166 | 541 |
| cpu_f16_to_f32 | 1 | 2 | Yes | 2250 | 541 |
| cpu_f16_to_f32 | 1 | 4 | No | 2334 | 583 |
| cpu_f16_to_f32 | 1 | 4 | Yes | 2500 | 541 |
| cpu_f16_to_f32 | 1 | 8 | No | 2416 | 583 |
| cpu_f16_to_f32 | 1 | 8 | Yes | 2583 | 541 |
| cpu_f16_to_f32 | 2 | 1 | No | 2500 | 416 |
| cpu_f16_to_f32 | 2 | 2 | No | 2125 | 416 |
| cpu_f16_to_f32 | 2 | 2 | Yes | 2042 | 458 |
| cpu_f16_to_f32 | 2 | 4 | No | 2250 | 458 |
| cpu_f16_to_f32 | 2 | 4 | Yes | 2333 | 458 |
| cpu_f16_to_f32 | 4 | 1 | No | 2416 | 458 |
| cpu_f16_to_f32 | 4 | 2 | No | 2000 | 458 |
| cpu_f16_to_f32 | 4 | 2 | Yes | 2334 | 458 |
| cpu_f16_to_f32 | 4 | 4 | No | 2500 | 458 |
| cpu_f16_to_f32 | 4 | 4 | Yes | 2625 | 458 |

vec_dot_f16_unroll

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_dot_f16_unroll_scalar (autovec) | - | - | - | 5083 | 3333 |
| vec_dot_f16_unroll | 1 | 1 | No | 2833 | 625 |
| vec_dot_f16_unroll | 1 | 2 | No | 2958 | 500 |
| vec_dot_f16_unroll | 1 | 2 | Yes | 4124 | 500 |
| vec_dot_f16_unroll | 1 | 4 | No | 3833 | 541 |
| vec_dot_f16_unroll | 1 | 4 | Yes | 4375 | 500 |
| vec_dot_f16_unroll | 2 | 1 | No | 2750 | 500 |
| vec_dot_f16_unroll | 2 | 2 | No | 3875 | 458 |
| vec_dot_f16_unroll | 2 | 2 | Yes | 4208 | 458 |
| vec_dot_f16_unroll | 4 | 1 | No | 3583 | 500 |
| vec_dot_f16_unroll | 4 | 2 | No | 5666 | 1416 |
| vec_dot_f16_unroll | 4 | 2 | Yes | 6332 | 1416 |

vec_silu_f32

| Kernel | LMUL | Unrolling | Clobbering | Cache Cold (ns) | Cache Hot (ns) |
|---|---|---|---|---|---|
| vec_silu_f32_scalar (autovectorized) | - | - | - | 55959 | 54459 |
| vec_silu_f32 | 1 | - | - | 8625 | 7666 |
| vec_silu_f32 | 2 | - | - | 7041 | 6083 |
| vec_silu_f32 | 4 | - | - | 8125 | 6875 |

@luhenry commented Nov 12, 2025

Would you still have the numbers lying around for using m8 for the various kernels? IIRC the choice of LMUL was based on the best performance for cold data caches, right?

@taimur-10x

> Would you still have the numbers lying around for using m8 for the various kernels? IIRC the choice of LMUL was based on the best performance for cold data caches, right?

Yeah, we switched from a higher LMUL in favor of a lower one to better cater to the cache-cold case.

We'll share the numbers for the kernels (with the LMUL and unrolling permutations) for both cache hot and cold.

@luhenry commented Nov 13, 2025

> Would you still have the numbers lying around for using m8 for the various kernels? IIRC the choice of LMUL was based on the best performance for cold data caches, right?
>
> Yeah, we switched from a higher LMUL in favor of a lower one to better cater to the cache-cold case.
>
> We'll share the numbers for the kernels (with the LMUL and unrolling permutations) for both cache hot and cold.

As discussed on the call, the way we should choose is:

  1. prioritize the hot cache numbers, lower is better
  2. then, look at cold cache numbers

We should also pick what's currently the better number on the BananaPi, and not optimize for ideal hardware (out-of-order, more vector ports, etc.). For better or for worse, the BananaPi is what is currently commercially available, so it's the fairest target to have for RISE (no preferential treatment of some microarchitectures). Once better hardware is broadly commercially available, we will want to do another pass of optimizations.

@taimur-10x commented Nov 13, 2025

> We should also pick what's currently the better number on the BananaPi, and not optimize for ideal hardware (out-of-order, more vector ports, etc.). For better or for worse, the BananaPi is what is currently commercially available, so it's the fairest target to have for RISE (no preferential treatment of some microarchitectures). Once better hardware is broadly commercially available, we will want to do another pass of optimizations.

Sure, makes sense. We'll make the changes.

@david-baker-808 commented Nov 13, 2025 via email

@taimur-10x taimur-10x force-pushed the 10x/rvv-floating-kernels branch from e646298 to ffbab18 Compare November 14, 2025 13:05
@taimur-10x

We've made the changes as discussed yesterday. Good to go from our end.

@taimur-10x taimur-10x requested a review from luhenry November 17, 2025 08:52
@luhenry commented Nov 17, 2025

@taimur-10x can you please rebase before merging, or should it be squashed? Thank you!

@luhenry commented Nov 17, 2025

@taimur-10x I asked for a rebase, not a merge of master. A `git pull --rebase upstream master` or an equivalent `git rebase origin/master` will do. That way you can remove the "Merge master" commits.

@taimur-10x taimur-10x force-pushed the 10x/rvv-floating-kernels branch from e3be2ff to da60598 Compare November 17, 2025 10:09
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
@taimur-10x taimur-10x force-pushed the 10x/rvv-floating-kernels branch from be4fa97 to acf3e4f Compare November 17, 2025 10:18
@taimur-10x

@ludovic, rebased and squashed where required.

Should I merge this in master in this fork, and then open up a PR from there to upstream? Or should I close this PR, and open up a new one for upstream from this branch?

@luhenry luhenry changed the base branch from master to riscv November 17, 2025 10:39
@luhenry luhenry merged commit 1bda8fb into riscv Nov 17, 2025
59 of 71 checks passed
@luhenry commented Nov 17, 2025

@taimur-10x I've created a riscv branch from master, please submit an MR to upstream from that branch.

@taimur-10x

Opened here: ggml-org#17318
