YOSO-YAi

Per-Layer Quantization Evaluated by Energy Per Intelligence

Two Configs, Identical Perplexity, Different Joule Costs on ARM

Francisco Abner, Electrical Engineer; CEO & Founder, YOSO-YAi LLC; New Albany, Ohio



Abstract — Standard GGUF quantization applies a single quantization level uniformly across all layers. Mixed quantization assigns different bit widths to different layers — higher precision where it matters, lower precision where it doesn't. Two mixed-quantization configurations can produce identical perplexity scores yet consume different amounts of energy on ARM hardware. This paper uses Energy Per Intelligence (EPI) to evaluate per-layer quantization strategies, measuring real power consumption on a Raspberry Pi 5 cluster with the epi-meter board. We map the Pareto frontier of accuracy vs. joules/token, identify configurations that are energy-optimal but invisible to perplexity-only evaluation, and demonstrate that the cheapest token in joules is not always the one produced by the smallest model.


Table of Contents

  1. Introduction
  2. Background
  3. Research Questions
  4. The Core Insight
  5. Experimental Design
  6. Quantization Matrix
  7. Methodology
  8. Results
  9. Pareto Analysis
  10. Discussion
  11. Comparison to Prior Work
  12. Reproducibility
  13. Future Work
  14. Citation
  15. References
  16. License

1. Introduction

Quantization is the most common model compression technique. Reduce the precision of weight tensors — from FP16 to INT8, INT4, or lower — and the model gets smaller, loads faster, and (in theory) runs cheaper. The standard approach applies one quantization level uniformly: Q4_K_M means every layer is quantized to roughly 4-bit precision.

But not every layer is equally important. Attention layers in early transformer blocks shape token representations that propagate through the entire network. Feed-forward layers in late blocks refine the output but have less downstream impact. A uniform quantization level treats all layers as equal. They are not.

Mixed quantization assigns different bit widths to different layers. llama.cpp supports this via per-layer quantization maps in GGUF. The DGX Spark can produce arbitrary mixed-quant configurations. The question is: which configurations are optimal?

The standard answer uses perplexity — lower perplexity means less accuracy loss. But two configurations with identical perplexity can have different energy costs on ARM hardware. One may use more memory bandwidth (loading larger weight tensors), take longer per token, or trigger different cache behavior. Perplexity cannot see this. EPI can.


2. Background

Uniform Quantization

| GGUF Type | Bits/Weight | Relative Size | Standard Use |
|-----------|-------------|---------------|--------------|
| Q2_K   | ~2.6 | 0.33x | Aggressive compression |
| Q3_K_S | ~3.4 | 0.42x | Small, lower quality |
| Q4_K_M | ~4.8 | 0.55x | Default — best size/quality tradeoff |
| Q5_K_M | ~5.7 | 0.65x | Higher quality, larger |
| Q6_K   | ~6.6 | 0.75x | Near-FP16 quality |
| Q8_0   | ~8.5 | 1.0x  | Minimal compression |
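The bits/weight figures translate directly into approximate file sizes. As a rough sketch (using the nominal bits/weight values from the table above, and ignoring metadata and non-repeating tensors):

```python
# Approximate GGUF file size from nominal bits-per-weight (sketch only;
# real files also carry metadata and some tensors kept at higher precision).
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_S": 3.4, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}

def approx_size_gb(n_params: float, quant: str) -> float:
    """Rough file size in GB for n_params weights at the given quant type."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in BITS_PER_WEIGHT:
    print(f"{q:8s} ~{approx_size_gb(8e9, q):.1f} GB")
```

For an 8B-parameter model this puts uniform Q4_K_M at roughly 4.8 GB, in the same ballpark as the 4.1–4.5 GB mixed configs discussed below.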

Mixed Quantization

Instead of one type for all layers, assign types per layer or layer group:

{
  "layers_0_7":   "Q6_K",     // Early layers: high precision (critical)
  "layers_8_23":  "Q4_K_M",   // Middle layers: standard precision
  "layers_24_31": "Q3_K_S"    // Late layers: aggressive compression
}

Two such maps can produce the same perplexity but different:

  • Model file sizes (different total bits)
  • Memory bandwidth (loading different-sized tensors)
  • Inference latency (different compute per layer)
  • Energy consumption (different joules per token)
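The file-size difference between two maps follows from their blended bits/weight. A minimal sketch, assuming every layer holds an equal share of the weights (only roughly true for real transformer layers, where tensor sizes vary):

```python
# Blended bits/weight of a mixed-quant map (sketch; assumes equal
# parameters per layer, which is an approximation).
BITS = {"Q3_K_S": 3.4, "Q4_K_M": 4.8, "Q6_K": 6.6}

def blended_bits(layer_map: dict[str, str], n_layers: int = 32) -> float:
    total = 0.0
    for span, quant in layer_map.items():
        lo, hi = (int(x) for x in span.removeprefix("layers_").split("_"))
        total += (hi - lo + 1) * BITS[quant]
    return total / n_layers

config = {"layers_0_7": "Q6_K", "layers_8_23": "Q4_K_M", "layers_24_31": "Q3_K_S"}
print(f"blended ~{blended_bits(config):.2f} bits/weight")
```

The example map blends to about 4.9 bits/weight — close to uniform Q4_K_M in size, but with precision deliberately redistributed across depth.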

What's Missing

Existing mixed-quant evaluations use perplexity as the sole quality metric. None measures the actual energy cost on production hardware. Nobody maps the Pareto frontier of accuracy vs. joules/token for mixed-quant configurations.


3. Research Questions

| #   | Question |
|-----|----------|
| RQ1 | Do mixed-quant configurations with identical perplexity produce different EPI on ARM hardware? |
| RQ2 | What does the Pareto frontier of accuracy vs. J/token look like for per-layer quantization? |
| RQ3 | Which layers are most sensitive to quantization depth in terms of energy impact? |
| RQ4 | Is the EPI-optimal mixed-quant configuration smaller or larger than uniform Q4_K_M? |
| RQ5 | Does the energy benefit of mixed quantization come from reduced memory bandwidth, reduced compute, or both? |

4. The Core Insight

  Config A                    Config B
  ─────────                   ─────────
  Early:  Q6_K               Early:  Q4_K_M
  Middle: Q4_K_M             Middle: Q5_K_M
  Late:   Q3_K_S             Late:   Q4_K_M

  Perplexity: 7.82           Perplexity: 7.82      ← IDENTICAL
  File size:  4.1 GB         File size:  4.5 GB
  J/Token:    ???             J/Token:    ???       ← DIFFERENT?
  Accuracy:   ???             Accuracy:   ???
  EPI:        ???             EPI:        ???       ← WHICH IS BETTER?

Perplexity says these are equivalent. An electrician with a meter on the circuit might disagree. This paper finds out.
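The tiebreaker is EPI as defined in Section 7: joules/token divided by composite accuracy, lower being better. A minimal sketch with made-up placeholder numbers (not measured results):

```python
# EPI = joules/token / accuracy (lower is better). All numbers below are
# hypothetical placeholders standing in for Config A and Config B.
def epi(joules_per_token: float, accuracy: float) -> float:
    return joules_per_token / accuracy

config_a = epi(0.80, 0.62)  # hypothetical: smaller file, cheaper token
config_b = epi(0.95, 0.62)  # hypothetical: same accuracy, costlier token
print("Config A wins" if config_a < config_b else "Config B wins")
```

At identical accuracy, EPI reduces to a straight joules/token comparison — exactly the axis perplexity cannot see.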


5. Experimental Design

Strategy

  1. Baseline: Uniform quantization levels (Q2_K through Q8_0)
  2. Mixed configs: Systematic per-layer-group quantization maps
  3. Measure all on the same hardware with the same instrument
  4. Map the Pareto frontier — which configs are optimal?

Hardware

| Component | Specification |
|-----------|---------------|
| Surgery + quantization | DGX Spark (GB10, 128GB) |
| Deployment target | Pi 5 cluster (4x 16GB, distributed-llama) |
| Measurement | epi-meter board (4x ATM90E26, CT clamps) |
| Quantization tool | llama.cpp (GGUF with per-layer quant maps) |

Target Models

| Model | Parameters | Layers | Architecture |
|-------|------------|--------|--------------|
| Llama-3.1-8B | 8B | 32 | Dense |
| Qwen3-30B-A3B | 30B (3B active) | | MoE |

The dense model (Llama) provides a clean signal without MoE routing noise. The MoE model (Qwen) tests whether mixed quantization interacts with expert gating.


6. Quantization Matrix

Layer Groups

For a 32-layer model:

| Group | Layers | Role |
|-------|--------|------|
| Early  | 0–7   | Token embedding refinement, representation shaping |
| Middle | 8–23  | Core reasoning and transformation |
| Late   | 24–31 | Output refinement and prediction |

Uniform Baselines (6 configs)

| Config ID | All Layers | Expected Size |
|-----------|------------|---------------|
| uniform_q2k  | Q2_K   | Smallest |
| uniform_q3ks | Q3_K_S | |
| uniform_q4km | Q4_K_M | Standard |
| uniform_q5km | Q5_K_M | |
| uniform_q6k  | Q6_K   | |
| uniform_q8   | Q8_0   | Largest |

Mixed Configurations (18 configs)

Systematic exploration of precision allocation:

| Config ID | Early | Middle | Late | Strategy |
|-----------|-------|--------|------|----------|
| mix_high_early    | Q6_K   | Q4_K_M | Q3_K_S | Protect early, compress late |
| mix_high_late     | Q3_K_S | Q4_K_M | Q6_K   | Protect late, compress early |
| mix_high_middle   | Q3_K_S | Q6_K   | Q3_K_S | Protect middle only |
| mix_gradient_down | Q6_K   | Q5_K_M | Q4_K_M | Descending precision |
| mix_gradient_up   | Q4_K_M | Q5_K_M | Q6_K   | Ascending precision |
| mix_extreme_early | Q8_0   | Q4_K_M | Q2_K   | Max early, min late |
| mix_extreme_late  | Q2_K   | Q4_K_M | Q8_0   | Min early, max late |
| mix_bookend_high  | Q6_K   | Q3_K_S | Q6_K   | High ends, compressed middle |
| mix_bookend_low   | Q3_K_S | Q6_K   | Q3_K_S | Low ends, high middle |
| mix_q5_q4         | Q5_K_M | Q4_K_M | Q4_K_M | Slightly higher early |
| mix_q4_q5         | Q4_K_M | Q4_K_M | Q5_K_M | Slightly higher late |
| mix_q6_q4_q4      | Q6_K   | Q4_K_M | Q4_K_M | High early only |
| mix_q4_q4_q6      | Q4_K_M | Q4_K_M | Q6_K   | High late only |
| mix_q4_q6_q4      | Q4_K_M | Q6_K   | Q4_K_M | High middle only |
| mix_q5_q3_q5      | Q5_K_M | Q3_K_S | Q5_K_M | Squeeze middle |
| mix_q3_q5_q3      | Q3_K_S | Q5_K_M | Q3_K_S | Squeeze ends |
| mix_q6_q3_q4      | Q6_K   | Q3_K_S | Q4_K_M | Protect early, squeeze middle |
| mix_q4_q3_q6      | Q4_K_M | Q3_K_S | Q6_K   | Protect late, squeeze middle |

Total: 6 uniform + 18 mixed = 24 configurations per model.


7. Methodology

Per-Configuration Pipeline

1. Generate per-layer quant map (JSON)
2. Quantize model to GGUF using llama.cpp with custom quant map
3. Record file size
4. Deploy GGUF to Pi cluster (rsync)
5. Wait 60s for thermal stabilization
6. Run benchmark suite (MMLU 5-shot, ARC-C 25-shot, HellaSwag 10-shot)
7. Capture epi-meter power trace
8. Calculate EPI (epi-bench)
9. Log results with full metadata
10. Repeat 3x, report median
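The pipeline above can be sketched as a dry-run command plan. Note that every binary name and flag below is a placeholder, not this project's actual tooling — adapt them to your llama.cpp build, cluster layout, and epi-bench installation:

```python
# Dry-run sketch of the per-configuration pipeline. All commands are
# hypothetical placeholders; the real invocations depend on your setup.
def pipeline_commands(config_id: str, host: str = "pi-cluster") -> list[list[str]]:
    gguf = f"{config_id}.gguf"
    return [
        ["python", "make_quant_map.py", config_id],             # step 1 (hypothetical script)
        ["llama-quantize", "model-f16.gguf", gguf, config_id],  # step 2 (flags are placeholders)
        ["rsync", gguf, f"{host}:/models/"],                    # step 4: deploy to the cluster
        ["sleep", "60"],                                        # step 5: thermal stabilization
        ["epi-bench", "run", "--config", config_id],            # steps 6-9 (hypothetical CLI)
    ]

for cmd in pipeline_commands("mix_high_early"):
    print(" ".join(cmd))
```

Printing rather than executing the commands keeps the plan auditable; wrap each entry in `subprocess.run` once the real flags are known.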

Per-Layer Quant Map Format

{
  "config_id": "mix_high_early",
  "model": "llama-3.1-8b",
  "layer_groups": [
    {"layers": [0, 1, 2, 3, 4, 5, 6, 7], "quant": "Q6_K"},
    {"layers": [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], "quant": "Q4_K_M"},
    {"layers": [24, 25, 26, 27, 28, 29, 30, 31], "quant": "Q3_K_S"}
  ]
}
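Maps in this format can be generated programmatically rather than written by hand. A minimal sketch (the JSON schema mirrors the example above; `make_quant_map` is an illustrative helper, not part of the repo's tooling):

```python
import json

# Build a per-layer quant map in the format shown above. The function
# name and signature are illustrative, not the project's actual API.
def make_quant_map(config_id: str, model: str,
                   groups: list[tuple[range, str]]) -> dict:
    return {
        "config_id": config_id,
        "model": model,
        "layer_groups": [
            {"layers": list(r), "quant": quant} for r, quant in groups
        ],
    }

quant_map = make_quant_map(
    "mix_high_early", "llama-3.1-8b",
    [(range(0, 8), "Q6_K"), (range(8, 24), "Q4_K_M"), (range(24, 32), "Q3_K_S")],
)
print(json.dumps(quant_map, indent=2))
```

Generating all 24 maps this way keeps group boundaries consistent across configurations and makes the matrix trivially diffable.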

Measurements

| Metric | Source | Description |
|--------|--------|-------------|
| EPI | epi-bench | J/token ÷ accuracy — the primary metric |
| J/Token | epi-meter + epi-bench | Total energy ÷ total tokens |
| Accuracy (composite) | Benchmark suite | (MMLU + ARC-C + HellaSwag) / 3 |
| Perplexity | Evaluation script | Standard perplexity on held-out text |
| File size | ls -la | GGUF file size in bytes |
| Tokens/second | Benchmark runner | Inference throughput |
| Avg watts | epi-meter | Average cluster power draw |

8. Results

Status: Data collection pending. The YOSO-YAi FACTORY and epi-meter board are scheduled to be operational in May 2026.

Llama-3.1-8B Results

| Config | Early | Middle | Late | Size (GB) | Perplexity | J/Token | Accuracy | EPI |
|--------|-------|--------|------|-----------|------------|---------|----------|-----|
| uniform_q2k  | Q2_K   | Q2_K   | Q2_K   | | | | | |
| uniform_q3ks | Q3_K_S | Q3_K_S | Q3_K_S | | | | | |
| uniform_q4km | Q4_K_M | Q4_K_M | Q4_K_M | | | | | |
| uniform_q5km | Q5_K_M | Q5_K_M | Q5_K_M | | | | | |
| uniform_q6k  | Q6_K   | Q6_K   | Q6_K   | | | | | |
| uniform_q8   | Q8_0   | Q8_0   | Q8_0   | | | | | |
| mix_high_early    | Q6_K   | Q4_K_M | Q3_K_S | | | | | |
| mix_high_late     | Q3_K_S | Q4_K_M | Q6_K   | | | | | |
| mix_gradient_down | Q6_K   | Q5_K_M | Q4_K_M | | | | | |
| mix_extreme_early | Q8_0   | Q4_K_M | Q2_K   | | | | | |
| ... | ... | ... | ... | | | | | |

Perplexity-Matched Pairs

Configurations with near-identical perplexity but potentially different EPI.

| Config A | Config B | Perplexity A | Perplexity B | EPI A | EPI B | Delta |
|----------|----------|--------------|--------------|-------|-------|-------|

9. Pareto Analysis

Pending measurement data.

Planned Pareto Plot

  J/Token
    │
    │  ×uniform_q8          (high accuracy, high energy)
    │
    │      ×uniform_q6k
    │
    │         ×mix_gradient_down
    │           ×mix_high_early    ← Pareto-optimal?
    │              ×uniform_q4km
    │
    │                 ×mix_bookend_high
    │                    ×uniform_q3ks
    │
    │                          ×uniform_q2k  (low accuracy, low energy)
    │
    └────────────────────────────────────────── Accuracy
                                           1.0

The Pareto frontier connects configurations where no other config is both more accurate AND cheaper in energy. Configurations below the frontier are dominated — another config exists that is either more accurate at the same energy, or cheaper at the same accuracy.
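The dominance test is mechanical once the measurements exist. A minimal sketch, using illustrative placeholder points rather than measured data:

```python
# Find Pareto-optimal configs: a config is dominated when another config
# is at least as accurate AND at least as cheap, and strictly better on
# one axis. The example points below are placeholders, not results.
def pareto_frontier(points: dict[str, tuple[float, float]]) -> set[str]:
    """points maps config_id -> (accuracy, joules_per_token)."""
    frontier = set()
    for name, (acc, j) in points.items():
        dominated = any(
            (a2 >= acc and j2 <= j) and (a2 > acc or j2 < j)
            for other, (a2, j2) in points.items() if other != name
        )
        if not dominated:
            frontier.add(name)
    return frontier

pts = {"uniform_q8": (0.70, 1.2), "mix_high_early": (0.66, 0.7),
       "uniform_q4km": (0.65, 0.8), "uniform_q2k": (0.50, 0.5)}
print(pareto_frontier(pts))
```

In this toy example `uniform_q4km` is dominated by `mix_high_early` (more accurate and cheaper), which is exactly the kind of result RQ4 asks about.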

Key question: Are any mixed-quant configs on the Pareto frontier that no uniform config reaches?


10. Discussion

Pending measurement data.

Expected topics:

  • Why identical perplexity ≠ identical energy: Memory bandwidth on ARM Cortex-A76, cache line utilization for different tensor sizes, compute pipeline stalls
  • The electrician's framing: Quantization is impedance matching. Each layer is a circuit stage. Mixed quantization is adjusting the impedance per stage for minimum total power loss.
  • Practical guidance: Which mixed-quant strategy should a Pi cluster operator choose? Table of recommendations by use case (latency-sensitive, energy-sensitive, quality-sensitive)
  • Tool integration: How to use epi-bench to evaluate your own mixed-quant configs

11. Comparison to Prior Work

| Work | Evaluates | Metric | Hardware | Measures Energy? |
|------|-----------|--------|----------|------------------|
| llama.cpp importance matrix | Per-layer quantization | Perplexity | CPU/GPU | No |
| GPTQ per-layer | Per-layer bit allocation | Perplexity, benchmarks | GPU | No |
| AWQ group quantization | Group-level quantization | Perplexity | GPU | No |
| This paper | Per-layer-group quant on ARM | EPI, J/token, accuracy | Pi 5 cluster | Yes (epi-meter) |

12. Reproducibility

| Component | Repository |
|-----------|------------|
| EPI Framework | energy-per-intelligence |
| Measurement Board | epi-meter |
| Calculation Tooling | epi-bench |
| Raw Data | data/ in this repo |
| Quant Configs | data/quant-configs/ — all 24 per-layer quant maps as JSON |
| Surgery Code | code/surgery/ |
| Analysis Code | code/analysis/ |
| Visualization | code/visualization/ |

Community replication: use epi-bench with your own hardware. Submit results to data/community/ via PR.


13. Future Work

| Direction | Description |
|-----------|-------------|
| Per-layer (not per-group) | Individual layer quantization — 32 independent choices instead of 3 groups |
| Automated search | Gradient-free optimization of per-layer quant map to minimize EPI |
| Combined with pruning | Expert pruning + mixed quantization interaction effects on EPI |
| Dynamic quantization | Load different precision per inference phase (prefill vs. decode) |
| Cross-model | Same mixed-quant strategy across Llama, Mistral, Qwen — does it transfer? |

14. Citation

@article{abner2026mixedquantepi,
  title   = {Per-Layer Quantization Evaluated by Energy Per Intelligence},
  author  = {Abner, Francisco},
  year    = {2026},
  url     = {https://github.com/Franzabner/mixed-quant-epi},
  note    = {YOSO-YAi LLC. Data collection in progress.}
}

15. References

[1] Abner, F. "Energy Per Intelligence." YOSO-YAi LLC, 2026.
[2] Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." 2023.
[3] Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." 2024.
[4] llama.cpp quantization documentation and importance matrix.
[5] Dettmers, T., and Zettlemoyer, L. "The Case for 4-bit Precision: k-bit Inference Scaling Laws." 2023.

16. License

| Content | License |
|---------|---------|
| Paper (README, figures) | CC BY 4.0 |
| Code | MIT |
| Data | CC BY 4.0 |

Two configs. Identical perplexity. Different joule costs.

Perplexity can't see it. The meter can.

YOSO-YAi

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC
