Abstract — Standard GGUF quantization applies a single quantization level uniformly across all layers. Mixed quantization assigns different bit widths to different layers — higher precision where it matters, lower precision where it doesn't. Two mixed-quantization configurations can produce identical perplexity scores yet consume different amounts of energy on ARM hardware. This paper uses Energy Per Intelligence (EPI) to evaluate per-layer quantization strategies, measuring real power consumption on a Raspberry Pi 5 cluster with the epi-meter board. We map the Pareto frontier of accuracy vs. joules/token, identify configurations that are energy-optimal but invisible to perplexity-only evaluation, and demonstrate that the cheapest token in joules is not always the one produced by the smallest model.
- Introduction
- Background
- Research Questions
- The Core Insight
- Experimental Design
- Quantization Matrix
- Methodology
- Results
- Pareto Analysis
- Discussion
- Comparison to Prior Work
- Reproducibility
- Future Work
- Citation
- References
- License
Quantization is the most common model compression technique. Reduce the precision of weight tensors — from FP16 to INT8, INT4, or lower — and the model gets smaller, loads faster, and (in theory) runs cheaper. The standard approach applies one quantization level uniformly: Q4_K_M means every layer is quantized to roughly 4-bit precision.
But not every layer is equally important. Attention layers in early transformer blocks shape token representations that propagate through the entire network. Feed-forward layers in late blocks refine the output but have less downstream impact. A uniform quantization level treats all layers as equal. They are not.
Mixed quantization assigns different bit widths to different layers. llama.cpp supports this via per-layer quantization maps in GGUF. The DGX Spark can produce arbitrary mixed-quant configurations. The question is: which configurations are optimal?
The standard answer uses perplexity — lower perplexity means less accuracy loss. But two configurations with identical perplexity can have different energy costs on ARM hardware. One may use more memory bandwidth (loading larger weight tensors), take longer per token, or trigger different cache behavior. Perplexity cannot see this. EPI can.
| GGUF Type | Bits/Weight | Relative Size | Standard Use |
|---|---|---|---|
| Q2_K | ~2.6 | 0.33x | Aggressive compression |
| Q3_K_S | ~3.4 | 0.42x | Small, lower quality |
| Q4_K_M | ~4.8 | 0.55x | Default — best size/quality tradeoff |
| Q5_K_M | ~5.7 | 0.65x | Higher quality, larger |
| Q6_K | ~6.6 | 0.75x | Near-FP16 quality |
| Q8_0 | ~8.5 | 1.0x | Minimal compression |
Instead of one type for all layers, assign types per layer or layer group:
```json
{
  "layers_0_7":   "Q6_K",    // Early layers: high precision (critical)
  "layers_8_23":  "Q4_K_M",  // Middle layers: standard precision
  "layers_24_31": "Q3_K_S"   // Late layers: aggressive compression
}
```

Two such maps can produce the same perplexity but different:
- Model file sizes (different total bits)
- Memory bandwidth (loading different-sized tensors)
- Inference latency (different compute per layer)
- Energy consumption (different joules per token)
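The file-size effect can be estimated from the bits-per-weight table above. This is a back-of-envelope sketch that assumes parameters are spread evenly across the 32 transformer blocks and ignores embeddings and the output head; real sizes come from llama.cpp, not this arithmetic:

```python
# Rough GGUF size estimate for a mixed-quant map, using the bits-per-weight
# figures from the quantization table. Assumes an even parameter split
# across transformer blocks and ignores embeddings/output head — a sketch,
# not what llama.cpp actually produces.

BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_S": 3.4, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}

def estimate_size_gb(groups, total_params=8e9, n_layers=32):
    """groups: list of (layer_count, quant_type) pairs covering all layers."""
    params_per_layer = total_params / n_layers
    total_bits = sum(n * params_per_layer * BITS_PER_WEIGHT[q] for n, q in groups)
    return total_bits / 8 / 1e9

# mix_high_early: 8 layers Q6_K, 16 layers Q4_K_M, 8 layers Q3_K_S
print(round(estimate_size_gb([(8, "Q6_K"), (16, "Q4_K_M"), (8, "Q3_K_S")]), 2))
```

Note that this particular mix comes out slightly *larger* than uniform Q4_K_M — mixed quantization redistributes bits, it does not automatically save them.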
Existing mixed-quant evaluations use perplexity as the sole quality metric. None measures the actual energy cost on production hardware. Nobody maps the Pareto frontier of accuracy vs. joules/token for mixed-quant configurations.
| # | Question |
|---|---|
| RQ1 | Do mixed-quant configurations with identical perplexity produce different EPI on ARM hardware? |
| RQ2 | What does the Pareto frontier of accuracy vs. J/token look like for per-layer quantization? |
| RQ3 | Which layers are most sensitive to quantization depth in terms of energy impact? |
| RQ4 | Is the EPI-optimal mixed-quant configuration smaller or larger than uniform Q4_K_M? |
| RQ5 | Does the energy benefit of mixed quantization come from reduced memory bandwidth, reduced compute, or both? |
Config A Config B
───────── ─────────
Early: Q6_K Early: Q4_K_M
Middle: Q4_K_M Middle: Q5_K_M
Late: Q3_K_S Late: Q4_K_M
Perplexity: 7.82 Perplexity: 7.82 ← IDENTICAL
File size: 4.1 GB File size: 4.5 GB
J/Token: ??? J/Token: ??? ← DIFFERENT?
Accuracy: ??? Accuracy: ???
EPI: ??? EPI: ??? ← WHICH IS BETTER?
Perplexity says these are equivalent. An electrician with a meter on the circuit might disagree. This paper finds out.
- Baseline: Uniform quantization levels (Q2_K through Q8_0)
- Mixed configs: Systematic per-layer-group quantization maps
- Measure all on the same hardware with the same instrument
- Map the Pareto frontier — which configs are optimal?
| Component | Specification |
|---|---|
| Surgery + quantization | DGX Spark (GB10, 128GB) |
| Deployment target | Pi 5 cluster (4x 16GB, distributed-llama) |
| Measurement | epi-meter board (4x ATM90E26, CT clamps) |
| Quantization tool | llama.cpp (GGUF with per-layer quant maps) |
| Model | Parameters | Layers | Architecture |
|---|---|---|---|
| Llama-3.1-8B | 8B | 32 | Dense |
| Qwen3-30B-A3B | 30B (3B active) | — | MoE |
The dense model (Llama) provides a clean signal without MoE routing noise; the MoE model (Qwen) tests whether mixed quantization interacts with expert gating.
For a 32-layer model:
| Group | Layers | Role |
|---|---|---|
| Early | 0–7 | Token embedding refinement, representation shaping |
| Middle | 8–23 | Core reasoning and transformation |
| Late | 24–31 | Output refinement and prediction |
| Config ID | All Layers | Expected Size |
|---|---|---|
| `uniform_q2k` | Q2_K | Smallest |
| `uniform_q3ks` | Q3_K_S | — |
| `uniform_q4km` | Q4_K_M | Standard |
| `uniform_q5km` | Q5_K_M | — |
| `uniform_q6k` | Q6_K | — |
| `uniform_q8` | Q8_0 | Largest |
Systematic exploration of precision allocation:
| Config ID | Early | Middle | Late | Strategy |
|---|---|---|---|---|
| `mix_high_early` | Q6_K | Q4_K_M | Q3_K_S | Protect early, compress late |
| `mix_high_late` | Q3_K_S | Q4_K_M | Q6_K | Protect late, compress early |
| `mix_high_middle` | Q3_K_S | Q6_K | Q3_K_S | Protect middle only |
| `mix_gradient_down` | Q6_K | Q5_K_M | Q4_K_M | Descending precision |
| `mix_gradient_up` | Q4_K_M | Q5_K_M | Q6_K | Ascending precision |
| `mix_extreme_early` | Q8_0 | Q4_K_M | Q2_K | Max early, min late |
| `mix_extreme_late` | Q2_K | Q4_K_M | Q8_0 | Min early, max late |
| `mix_bookend_high` | Q6_K | Q3_K_S | Q6_K | High ends, compressed middle |
| `mix_bookend_low` | Q3_K_S | Q6_K | Q3_K_S | Low ends, high middle |
| `mix_q5_q4` | Q5_K_M | Q4_K_M | Q4_K_M | Slightly higher early |
| `mix_q4_q5` | Q4_K_M | Q4_K_M | Q5_K_M | Slightly higher late |
| `mix_q6_q4_q4` | Q6_K | Q4_K_M | Q4_K_M | High early only |
| `mix_q4_q4_q6` | Q4_K_M | Q4_K_M | Q6_K | High late only |
| `mix_q4_q6_q4` | Q4_K_M | Q6_K | Q4_K_M | High middle only |
| `mix_q5_q3_q5` | Q5_K_M | Q3_K_S | Q5_K_M | Squeeze middle |
| `mix_q3_q5_q3` | Q3_K_S | Q5_K_M | Q3_K_S | Squeeze ends |
| `mix_q6_q3_q4` | Q6_K | Q3_K_S | Q4_K_M | Protect early, squeeze middle |
| `mix_q4_q3_q6` | Q4_K_M | Q3_K_S | Q6_K | Protect late, squeeze middle |
Total: 6 uniform + 18 mixed = 24 configurations per model.
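Because all 18 mixed configurations share the same Early/Middle/Late group split, the per-layer quant maps can be generated rather than written by hand. A sketch (the output shape mirrors the example map in the Methodology section; the helper name and GROUPS constant are hypothetical, not code from this repo):

```python
import json

# Hypothetical generator for the per-layer quant maps used in step 1 of
# the methodology. Group boundaries follow the Early (0-7), Middle (8-23),
# Late (24-31) split defined above.

GROUPS = {"early": range(0, 8), "middle": range(8, 24), "late": range(24, 32)}

def make_quant_map(config_id, model, early, middle, late):
    quants = {"early": early, "middle": middle, "late": late}
    return {
        "config_id": config_id,
        "model": model,
        "layer_groups": [
            {"layers": list(GROUPS[g]), "quant": quants[g]}
            for g in ("early", "middle", "late")
        ],
    }

cfg = make_quant_map("mix_high_early", "llama-3.1-8b", "Q6_K", "Q4_K_M", "Q3_K_S")
print(json.dumps(cfg, indent=2))
```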
1. Generate per-layer quant map (JSON)
2. Quantize model to GGUF using llama.cpp with custom quant map
3. Record file size
4. Deploy GGUF to Pi cluster (rsync)
5. Wait 60s for thermal stabilization
6. Run benchmark suite (MMLU 5-shot, ARC-C 25-shot, HellaSwag 10-shot)
7. Capture epi-meter power trace
8. Calculate EPI (epi-bench)
9. Log results with full metadata
10. Repeat 3x, report median
```json
{
  "config_id": "mix_high_early",
  "model": "llama-3.1-8b",
  "layer_groups": [
    {"layers": [0, 1, 2, 3, 4, 5, 6, 7], "quant": "Q6_K"},
    {"layers": [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], "quant": "Q4_K_M"},
    {"layers": [24, 25, 26, 27, 28, 29, 30, 31], "quant": "Q3_K_S"}
  ]
}
```

| Metric | Source | Description |
|---|---|---|
| EPI | epi-bench | J/Token ÷ Accuracy — the primary metric |
| J/Token | epi-meter + epi-bench | Total energy ÷ total tokens |
| Accuracy (composite) | Benchmark suite | (MMLU + ARC-C + HellaSwag) / 3 |
| Perplexity | Evaluation script | Standard perplexity on held-out text |
| File size | `ls -la` | GGUF file size in bytes |
| Tokens/second | Benchmark runner | Inference throughput |
| Avg watts | epi-meter | Average cluster power draw |
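Under the definitions in the table above, the step-8 calculation reduces to two divisions. A minimal sketch (function names are illustrative; the real computation lives in epi-bench):

```python
# Sketch of the EPI computation: J/Token = total energy / total tokens,
# and EPI = J/Token ÷ composite accuracy. Lower EPI is better — fewer
# joules per unit of benchmark performance.

def joules_per_token(avg_watts: float, duration_s: float, tokens: int) -> float:
    """Total energy (W × s = J) divided by tokens generated."""
    return (avg_watts * duration_s) / tokens

def epi(j_per_token: float, mmlu: float, arc_c: float, hellaswag: float) -> float:
    """EPI = J/Token divided by the composite accuracy."""
    accuracy = (mmlu + arc_c + hellaswag) / 3
    return j_per_token / accuracy

# e.g. a 25 W cluster draw over 400 s producing 10,000 tokens:
jpt = joules_per_token(25.0, 400.0, 10_000)   # 1.0 J/token
```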
Status: Data collection pending. The YOSO-YAi FACTORY and epi-meter board are scheduled to be operational in May 2026.
| Config | Early | Middle | Late | Size (GB) | Perplexity | J/Token | Accuracy | EPI |
|---|---|---|---|---|---|---|---|---|
| `uniform_q2k` | Q2_K | Q2_K | Q2_K | — | — | — | — | — |
| `uniform_q3ks` | Q3_K_S | Q3_K_S | Q3_K_S | — | — | — | — | — |
| `uniform_q4km` | Q4_K_M | Q4_K_M | Q4_K_M | — | — | — | — | — |
| `uniform_q5km` | Q5_K_M | Q5_K_M | Q5_K_M | — | — | — | — | — |
| `uniform_q6k` | Q6_K | Q6_K | Q6_K | — | — | — | — | — |
| `uniform_q8` | Q8_0 | Q8_0 | Q8_0 | — | — | — | — | — |
| `mix_high_early` | Q6_K | Q4_K_M | Q3_K_S | — | — | — | — | — |
| `mix_high_late` | Q3_K_S | Q4_K_M | Q6_K | — | — | — | — | — |
| `mix_gradient_down` | Q6_K | Q5_K_M | Q4_K_M | — | — | — | — | — |
| `mix_extreme_early` | Q8_0 | Q4_K_M | Q2_K | — | — | — | — | — |
| ... | ... | ... | ... | — | — | — | — | — |
Configurations with near-identical perplexity but potentially different EPI.
| Config A | Config B | Perplexity A | Perplexity B | EPI A | EPI B | Delta |
|---|---|---|---|---|---|---|
| — | — | — | — | — | — | — |
Pending measurement data.
J/Token
│
│ ×uniform_q8 (high accuracy, high energy)
│
│ ×uniform_q6k
│
│ ×mix_gradient_down
│ ×mix_high_early ← Pareto-optimal?
│ ×uniform_q4km
│
│ ×mix_bookend_high
│ ×uniform_q3ks
│
│ ×uniform_q2k (low accuracy, low energy)
│
└────────────────────────────────────────── Accuracy
1.0
The Pareto frontier connects configurations where no other config is both more accurate AND cheaper in energy. Configurations below the frontier are dominated — another config exists that is either more accurate at the same energy, or cheaper at the same accuracy.
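The dominance rule above is mechanical enough to sketch directly. A minimal example with placeholder numbers (the config names come from the matrix; the accuracy and J/token figures are invented, since measurement is pending):

```python
# Minimal sketch of the dominance rule: a config is Pareto-optimal if no
# other config has at least its accuracy and at most its J/token, with at
# least one of the two strictly better.

def pareto_frontier(configs):
    """configs: list of (name, accuracy, j_per_token). Returns optimal names."""
    frontier = []
    for name, acc, jpt in configs:
        dominated = any(
            a >= acc and j <= jpt and (a > acc or j < jpt)
            for n, a, j in configs if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

configs = [
    ("uniform_q8",     0.66, 2.1),
    ("uniform_q4km",   0.62, 1.2),
    ("mix_high_early", 0.63, 1.1),  # more accurate AND cheaper than q4km
    ("uniform_q2k",    0.48, 0.7),
]
print(pareto_frontier(configs))
```

With these placeholder numbers, `uniform_q4km` is dominated by `mix_high_early` — exactly the situation RQ4 asks about.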
Key question: Are any mixed-quant configs on the Pareto frontier that no uniform config reaches?
Pending measurement data.
Expected topics:
- Why identical perplexity ≠ identical energy: Memory bandwidth on ARM Cortex-A76, cache line utilization for different tensor sizes, compute pipeline stalls
- The electrician's framing: Quantization is impedance matching. Each layer is a circuit stage. Mixed quantization is adjusting the impedance per stage for minimum total power loss.
- Practical guidance: Which mixed-quant strategy should a Pi cluster operator choose? Table of recommendations by use case (latency-sensitive, energy-sensitive, quality-sensitive)
- Tool integration: How to use epi-bench to evaluate your own mixed-quant configs
| Work | Evaluates | Metric | Hardware | Measures Energy? |
|---|---|---|---|---|
| llama.cpp importance matrix | Per-layer quantization | Perplexity | CPU/GPU | No |
| GPTQ per-layer | Per-layer bit allocation | Perplexity, benchmarks | GPU | No |
| AWQ group quantization | Group-level quantization | Perplexity | GPU | No |
| This paper | Per-layer-group quant on ARM | EPI, J/token, accuracy | Pi 5 cluster | Yes (epi-meter) |
| Component | Repository |
|---|---|
| EPI Framework | energy-per-intelligence |
| Measurement Board | epi-meter |
| Calculation Tooling | epi-bench |
| Raw Data | data/ in this repo |
| Quant Configs | data/quant-configs/ — all 24 per-layer quant maps as JSON |
| Surgery Code | code/surgery/ |
| Analysis Code | code/analysis/ |
| Visualization | code/visualization/ |
Community replication: use epi-bench with your own hardware. Submit results to data/community/ via PR.
| Direction | Description |
|---|---|
| Per-layer (not per-group) | Individual layer quantization — 32 independent choices instead of 3 groups |
| Automated search | Gradient-free optimization of per-layer quant map to minimize EPI |
| Combined with pruning | Expert pruning + mixed quantization interaction effects on EPI |
| Dynamic quantization | Load different precision per inference phase (prefill vs. decode) |
| Cross-model | Same mixed-quant strategy across Llama, Mistral, Qwen — does it transfer? |
```bibtex
@article{abner2026mixedquantepi,
  title  = {Per-Layer Quantization Evaluated by Energy Per Intelligence},
  author = {Abner, Francisco},
  year   = {2026},
  url    = {https://github.com/Franzabner/mixed-quant-epi},
  note   = {YOSO-YAi LLC. Data collection in progress.}
}
```

| # | Reference |
|---|---|
| [1] | Abner, F. "Energy Per Intelligence." YOSO-YAi LLC, 2026. GitHub |
| [2] | Frantar et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (2023) |
| [3] | Lin et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" (2024) |
| [4] | llama.cpp quantization documentation and importance matrix |
| [5] | Dettmers and Zettlemoyer. "The case for 4-bit precision: k-bit inference scaling laws" (2023) |
| Content | License |
|---|---|
| Paper (README, figures) | CC BY 4.0 |
| Code | MIT |
| Data | CC BY 4.0 |
