Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2) #15275

Open · wants to merge 21 commits into master

Conversation

Srihari-mcw (Collaborator) commented on Aug 12, 2025

  • The PR adds a block interleaving approach for Q6_K quantization on the x64/x86 SIMD architecture
  • Initial gains were observed in prompt processing with the above changes, compared to the existing Q6_K implementation
  • The GEMM function is implemented for AVX512/AVX2, and the GEMV function is implemented for the AVX2 architecture
  • The repack_q6_K_to_q6_K_8_bl function rearranges weights from the Q6_K format into the Q6_Kx8 format (block_q6_Kx8); see the interleaving sketch below
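
As a rough illustration of the repacking step, here is a minimal sketch of the 8-way byte interleaving applied to the quant data. The helper name and signature are hypothetical and only approximate the idea; the actual repack_q6_K_to_q6_K_8_bl also reorders the scales and d values of the eight source blocks.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper, illustration only: interleave the quant bytes of
// eight source blocks by taking eight bytes at a time from each block in turn.
// bytes_per_block is assumed to be a multiple of 8.
static void interleave_8_blocks(uint8_t * dst,          // 8 * bytes_per_block bytes
                                const uint8_t * src[8], // quant bytes of 8 blocks
                                int bytes_per_block) {
    int out = 0;
    for (int off = 0; off < bytes_per_block; off += 8) {
        for (int b = 0; b < 8; b++) {
            memcpy(dst + out, src[b] + off, 8); // 8-byte group from block b
            out += 8;
        }
    }
}
```

With this layout, one contiguous 64-byte region holds the same 8-byte slice from all eight blocks, which is the property the AVX2/AVX512 kernels can exploit when processing eight interleaved blocks per iteration.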

Block Interleaving Formats

Block_Q6_Kx8:

  • Holds the data of 8 Q6_K blocks in an interleaved fashion
  • uint8 scales[128] - scales taken from the source Q6_K blocks; every 16 bytes is packed so that it holds the scales of the corresponding sub-blocks of the Q6_K structure (the original Q6_K structure has 16 sub-blocks)
  • The d values from the source Q6_K blocks are stored together in an array
  • Quant values (hbits and lbits) from the source Q6_K blocks are extracted sequentially and interleaved into groups of eight bytes
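
For reference, a minimal sketch of the two layouts: the existing ggml block_q6_K super-block and a block_q6_Kx8 built from eight of them as described above. Member names and ordering in the PR may differ; the element counts (8x the source block, with QK_K = 256) follow from the description.

```c
#include <stdint.h>

#define QK_K 256
typedef uint16_t ggml_half; // fp16 storage type used by ggml (simplified here)

// Existing ggml Q6_K super-block: 256 weights at 6 bits each.
typedef struct {
    uint8_t   ql[QK_K/2];      // lower 4 bits of the quants (128 bytes)
    uint8_t   qh[QK_K/4];      // upper 2 bits of the quants (64 bytes)
    int8_t    scales[QK_K/16]; // 16 sub-block scales, 8 bits each
    ggml_half d;               // super-block scale
} block_q6_K;

// Sketch of the interleaved format described above: data of 8 Q6_K blocks.
typedef struct {
    ggml_half d[8];                // d values of the 8 source blocks stored together
    uint8_t   ql[8 * QK_K/2];      // low bits, interleaved in groups of 8 bytes
    uint8_t   qh[8 * QK_K/4];      // high bits, interleaved in groups of 8 bytes
    uint8_t   scales[8 * QK_K/16]; // 128 scale bytes, 16 per source block
} block_q6_Kx8;
```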

Performance numbers with the llama2 7B model quantized to Q6_K are attached below.

GCC Linux:

Q6_K Model:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
|-------|------|--------|---------|---------|------|-----|---------|-----------|
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 40.22 ± 0.04 | | 79c116 - Base Commit |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 45.51 ± 0.07 | 13.15% | 3b3d551 - AVX2 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 59.81 ± 0.11 | 48.71% | 3b3d551 - AVX512 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.55 ± 0.00 | | 79c116 - Base Commit |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.29 ± 0.00 | -2.46% | 3b3d551 - AVX2 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.29 ± 0.00 | -2.46% | 3b3d551 - AVX512 Version |

GCC Version = 12.3

The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Note:

The scalar code implementation currently has an accuracy mismatch; this will be fixed in the coming days.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 12, 2025
Srihari-mcw (Collaborator, Author) commented:

The perplexity results with llama2 7B are tabulated as follows:

| model | perplexity (Final estimate PPL) | commit id |
|-------|---------------------------------|-----------|
| llama 7B Q6_K | 5.8164 +/- 0.03250 | 79c116 - Base Commit |
| llama 7B Q6_K | 5.8163 +/- 0.03250 | 3b3d551 - Updated Commit |

jukofyork (Collaborator) commented:

> AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1

Interesting that AVX512 is so much faster at prompt processing. Which of these is making the most difference?
