Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2) #15275

Open · wants to merge 21 commits into master

Conversation

Srihari-mcw (Collaborator) commented on Aug 12, 2025

  • The PR adds a block interleaving approach for Q6_K quantization on the x64/x86 SIMD architecture
  • Initial gains were observed in prompt processing with the above changes, compared to the existing Q6_K implementation
  • The GEMM function is implemented for AVX512/AVX2, and the GEMV function is implemented for the AVX2 architecture
  • The repack_q6_K_to_q6_K_8_bl function rearranges weights from the Q6_K format into the Q6_Kx8 format (block_q6_Kx8); see the interleaving sketch below
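
As a rough illustration of the repacking step, here is a minimal sketch of the 8-way byte interleaving applied to the quant data. The helper name and signature are hypothetical and only approximate the idea; the actual repack_q6_K_to_q6_K_8_bl also reorders the scales and d values of the eight source blocks.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper, illustration only: interleave the quant bytes of
// eight source blocks by taking eight bytes at a time from each block in turn.
// bytes_per_block is assumed to be a multiple of 8.
static void interleave_8_blocks(uint8_t * dst,          // 8 * bytes_per_block bytes
                                const uint8_t * src[8], // quant bytes of 8 blocks
                                int bytes_per_block) {
    int out = 0;
    for (int off = 0; off < bytes_per_block; off += 8) {
        for (int b = 0; b < 8; b++) {
            memcpy(dst + out, src[b] + off, 8); // 8-byte group from block b
            out += 8;
        }
    }
}
```

With this layout, one contiguous 64-byte region holds the same 8-byte slice from all eight blocks, which is the property the AVX2/AVX512 kernels can exploit when processing eight interleaved blocks per iteration.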

Block Interleaving Formats

Block_Q6_Kx8:

  • Holds the data of 8 Q6_K blocks in an interleaved fashion
  • uint8 scales[128] - scales taken from the source Q6_K blocks; every 16 bytes is packed so that it holds the scales of the corresponding sub-blocks of the Q6_K structure (the original Q6_K structure has 16 sub-blocks)
  • The d values from the source Q6_K blocks are stored together in an array
  • Quant values (hbits and lbits) from the source Q6_K blocks are extracted sequentially and interleaved into groups of eight bytes
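
For reference, a minimal sketch of the two layouts: the existing ggml block_q6_K super-block and a block_q6_Kx8 built from eight of them as described above. Member names and ordering in the PR may differ; the element counts (8x the source block, with QK_K = 256) follow from the description.

```c
#include <stdint.h>

#define QK_K 256
typedef uint16_t ggml_half; // fp16 storage type used by ggml (simplified here)

// Existing ggml Q6_K super-block: 256 weights at 6 bits each.
typedef struct {
    uint8_t   ql[QK_K/2];      // lower 4 bits of the quants (128 bytes)
    uint8_t   qh[QK_K/4];      // upper 2 bits of the quants (64 bytes)
    int8_t    scales[QK_K/16]; // 16 sub-block scales, 8 bits each
    ggml_half d;               // super-block scale
} block_q6_K;

// Sketch of the interleaved format described above: data of 8 Q6_K blocks.
typedef struct {
    ggml_half d[8];                // d values of the 8 source blocks stored together
    uint8_t   ql[8 * QK_K/2];      // low bits, interleaved in groups of 8 bytes
    uint8_t   qh[8 * QK_K/4];      // high bits, interleaved in groups of 8 bytes
    uint8_t   scales[8 * QK_K/16]; // 128 scale bytes, 16 per source block
} block_q6_Kx8;
```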

Performance numbers with the llama2 7B model quantized to Q6_K are attached below.

GCC Linux:

Q6_K Model:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
|-------|------|--------|---------|---------|------|-----|---------|-----------|
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 40.22 ± 0.04 | | 79c116 - Base Commit |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 45.51 ± 0.07 | 13.15% | 3b3d551 - AVX2 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 59.81 ± 0.11 | 48.71% | 3b3d551 - AVX512 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.55 ± 0.00 | | 79c116 - Base Commit |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.29 ± 0.00 | -2.46% | 3b3d551 - AVX2 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.29 ± 0.00 | -2.46% | 3b3d551 - AVX512 Version |

GCC Version = 12.3

The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Note:

The scalar code implementation currently has an accuracy mismatch; this will be fixed in the coming days.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 12, 2025
Srihari-mcw (Collaborator, Author) commented:

The perplexity results with llama2 7B are tabulated as follows:

| model | perplexity (Final estimate PPL) | commit id |
|-------|---------------------------------|-----------|
| llama 7B Q6_K | 5.8164 +/- 0.03250 | 79c116 - Base Commit |
| llama 7B Q6_K | 5.8163 +/- 0.03250 | 3b3d551 - Updated Commit |

jukofyork (Collaborator) commented:

> AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1

Interesting that AVX512 is so much faster at prompt processing. Which of these is making the most difference?
