YOSO-YAi

Per-Layer Quantization Evaluated by Energy Per Intelligence

Two Configs, Identical Perplexity, Different Joule Costs on ARM

Francisco Abner, Electrical Engineer; CEO & Founder, YOSO-YAi LLC; New Albany, Ohio



Abstract — Standard GGUF quantization applies a single quantization level uniformly across all layers. Mixed quantization assigns different bit widths to different layers — higher precision where it matters, lower precision where it doesn't. Two mixed-quantization configurations can produce identical perplexity scores yet consume different amounts of energy on ARM hardware. This paper uses Energy Per Intelligence (EPI) to evaluate per-layer quantization strategies, measuring real power consumption on a Raspberry Pi 5 cluster with the epi-meter board. We map the Pareto frontier of accuracy vs. joules/token, identify configurations that are energy-optimal but invisible to perplexity-only evaluation, and demonstrate that the cheapest token in joules is not always the one produced by the smallest model.


Table of Contents

  1. Introduction
  2. Background
  3. Research Questions
  4. The Core Insight
  5. Experimental Design
  6. Quantization Matrix
  7. Methodology
  8. Results
  9. Pareto Analysis
  10. Discussion
  11. Comparison to Prior Work
  12. Reproducibility
  13. Future Work
  14. Citation
  15. References
  16. License

1. Introduction

Quantization is the most common model compression technique. Reduce the precision of weight tensors — from FP16 to INT8, INT4, or lower — and the model gets smaller, loads faster, and (in theory) runs cheaper. The standard approach applies one quantization level uniformly: Q4_K_M means every layer is quantized to roughly 4-bit precision.

But not every layer is equally important. Attention layers in early transformer blocks shape token representations that propagate through the entire network. Feed-forward layers in late blocks refine the output but have less downstream impact. A uniform quantization level treats all layers as equal. They are not.

Mixed quantization assigns different bit widths to different layers. llama.cpp supports this via per-layer quantization maps in GGUF. The DGX Spark can produce arbitrary mixed-quant configurations. The question is: which configurations are optimal?

The standard answer uses perplexity — lower perplexity means less accuracy loss. But two configurations with identical perplexity can have different energy costs on ARM hardware. One may use more memory bandwidth (loading larger weight tensors), take longer per token, or trigger different cache behavior. Perplexity cannot see this. EPI can.


2. Background

Uniform Quantization

| GGUF Type | Bits/Weight | Relative Size | Standard Use |
|-----------|-------------|---------------|--------------|
| Q2_K   | ~2.6 | 0.33x | Aggressive compression |
| Q3_K_S | ~3.4 | 0.42x | Small, lower quality |
| Q4_K_M | ~4.8 | 0.55x | Default — best size/quality tradeoff |
| Q5_K_M | ~5.7 | 0.65x | Higher quality, larger |
| Q6_K   | ~6.6 | 0.75x | Near-FP16 quality |
| Q8_0   | ~8.5 | 1.0x  | Minimal compression |
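The bits/weight figures translate directly into approximate file sizes. As a rough sketch (using the nominal bits/weight values from the table above, and ignoring metadata and non-repeating tensors):

```python
# Approximate GGUF file size from nominal bits-per-weight (sketch only;
# real files also carry metadata and some tensors kept at higher precision).
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_S": 3.4, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}

def approx_size_gb(n_params: float, quant: str) -> float:
    """Rough file size in GB for n_params weights at the given quant type."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in BITS_PER_WEIGHT:
    print(f"{q:8s} ~{approx_size_gb(8e9, q):.1f} GB")
```

For an 8B-parameter model this puts uniform Q4_K_M at roughly 4.8 GB, in the same ballpark as the 4.1–4.5 GB mixed configs discussed below.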

Mixed Quantization

Instead of one type for all layers, assign types per layer or layer group:

{
  "layers_0_7":   "Q6_K",     // Early layers: high precision (critical)
  "layers_8_23":  "Q4_K_M",   // Middle layers: standard precision
  "layers_24_31": "Q3_K_S"    // Late layers: aggressive compression
}

Two such maps can produce the same perplexity but different:

  • Model file sizes (different total bits)
  • Memory bandwidth (loading different-sized tensors)
  • Inference latency (different compute per layer)
  • Energy consumption (different joules per token)
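The file-size difference between two maps follows from their blended bits/weight. A minimal sketch, assuming every layer holds an equal share of the weights (only roughly true for real transformer layers, where tensor sizes vary):

```python
# Blended bits/weight of a mixed-quant map (sketch; assumes equal
# parameters per layer, which is an approximation).
BITS = {"Q3_K_S": 3.4, "Q4_K_M": 4.8, "Q6_K": 6.6}

def blended_bits(layer_map: dict[str, str], n_layers: int = 32) -> float:
    total = 0.0
    for span, quant in layer_map.items():
        lo, hi = (int(x) for x in span.removeprefix("layers_").split("_"))
        total += (hi - lo + 1) * BITS[quant]
    return total / n_layers

config = {"layers_0_7": "Q6_K", "layers_8_23": "Q4_K_M", "layers_24_31": "Q3_K_S"}
print(f"blended ~{blended_bits(config):.2f} bits/weight")
```

The example map blends to about 4.9 bits/weight — close to uniform Q4_K_M in size, but with precision deliberately redistributed across depth.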

What's Missing

Existing mixed-quant evaluations use perplexity as the sole quality metric. None measures the actual energy cost on production hardware. Nobody maps the Pareto frontier of accuracy vs. joules/token for mixed-quant configurations.


3. Research Questions

| #   | Question |
|-----|----------|
| RQ1 | Do mixed-quant configurations with identical perplexity produce different EPI on ARM hardware? |
| RQ2 | What does the Pareto frontier of accuracy vs. J/token look like for per-layer quantization? |
| RQ3 | Which layers are most sensitive to quantization depth in terms of energy impact? |
| RQ4 | Is the EPI-optimal mixed-quant configuration smaller or larger than uniform Q4_K_M? |
| RQ5 | Does the energy benefit of mixed quantization come from reduced memory bandwidth, reduced compute, or both? |

4. The Core Insight

  Config A                    Config B
  ─────────                   ─────────
  Early:  Q6_K               Early:  Q4_K_M
  Middle: Q4_K_M             Middle: Q5_K_M
  Late:   Q3_K_S             Late:   Q4_K_M

  Perplexity: 7.82           Perplexity: 7.82      ← IDENTICAL
  File size:  4.1 GB         File size:  4.5 GB
  J/Token:    ???             J/Token:    ???       ← DIFFERENT?
  Accuracy:   ???             Accuracy:   ???
  EPI:        ???             EPI:        ???       ← WHICH IS BETTER?

Perplexity says these are equivalent. An electrician with a meter on the circuit might disagree. This paper finds out.
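The tiebreaker is EPI as defined in Section 7: joules/token divided by composite accuracy, lower being better. A minimal sketch with made-up placeholder numbers (not measured results):

```python
# EPI = joules/token / accuracy (lower is better). All numbers below are
# hypothetical placeholders standing in for Config A and Config B.
def epi(joules_per_token: float, accuracy: float) -> float:
    return joules_per_token / accuracy

config_a = epi(0.80, 0.62)  # hypothetical: smaller file, cheaper token
config_b = epi(0.95, 0.62)  # hypothetical: same accuracy, costlier token
print("Config A wins" if config_a < config_b else "Config B wins")
```

At identical accuracy, EPI reduces to a straight joules/token comparison — exactly the axis perplexity cannot see.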


5. Experimental Design

Strategy

  1. Baseline: Uniform quantization levels (Q2_K through Q8_0)
  2. Mixed configs: Systematic per-layer-group quantization maps
  3. Measure all on the same hardware with the same instrument
  4. Map the Pareto frontier — which configs are optimal?

Hardware

| Component | Specification |
|-----------|---------------|
| Surgery + quantization | DGX Spark (GB10, 128GB) |
| Deployment target | Pi 5 cluster (4x 16GB, distributed-llama) |
| Measurement | epi-meter board (4x ATM90E26, CT clamps) |
| Quantization tool | llama.cpp (GGUF with per-layer quant maps) |

Target Models

| Model | Parameters | Layers | Architecture |
|-------|------------|--------|--------------|
| Llama-3.1-8B | 8B | 32 | Dense |
| Qwen3-30B-A3B | 30B (3B active) | | MoE |

The dense model (Llama) provides a clean signal without MoE routing noise. The MoE model (Qwen) tests whether mixed quantization interacts with expert gating.


6. Quantization Matrix

Layer Groups

For a 32-layer model:

| Group | Layers | Role |
|-------|--------|------|
| Early  | 0–7   | Token embedding refinement, representation shaping |
| Middle | 8–23  | Core reasoning and transformation |
| Late   | 24–31 | Output refinement and prediction |

Uniform Baselines (6 configs)

| Config ID | All Layers | Expected Size |
|-----------|------------|---------------|
| uniform_q2k  | Q2_K   | Smallest |
| uniform_q3ks | Q3_K_S | |
| uniform_q4km | Q4_K_M | Standard |
| uniform_q5km | Q5_K_M | |
| uniform_q6k  | Q6_K   | |
| uniform_q8   | Q8_0   | Largest |

Mixed Configurations (18 configs)

Systematic exploration of precision allocation:

| Config ID | Early | Middle | Late | Strategy |
|-----------|-------|--------|------|----------|
| mix_high_early    | Q6_K   | Q4_K_M | Q3_K_S | Protect early, compress late |
| mix_high_late     | Q3_K_S | Q4_K_M | Q6_K   | Protect late, compress early |
| mix_high_middle   | Q3_K_S | Q6_K   | Q3_K_S | Protect middle only |
| mix_gradient_down | Q6_K   | Q5_K_M | Q4_K_M | Descending precision |
| mix_gradient_up   | Q4_K_M | Q5_K_M | Q6_K   | Ascending precision |
| mix_extreme_early | Q8_0   | Q4_K_M | Q2_K   | Max early, min late |
| mix_extreme_late  | Q2_K   | Q4_K_M | Q8_0   | Min early, max late |
| mix_bookend_high  | Q6_K   | Q3_K_S | Q6_K   | High ends, compressed middle |
| mix_bookend_low   | Q3_K_S | Q6_K   | Q3_K_S | Low ends, high middle |
| mix_q5_q4         | Q5_K_M | Q4_K_M | Q4_K_M | Slightly higher early |
| mix_q4_q5         | Q4_K_M | Q4_K_M | Q5_K_M | Slightly higher late |
| mix_q6_q4_q4      | Q6_K   | Q4_K_M | Q4_K_M | High early only |
| mix_q4_q4_q6      | Q4_K_M | Q4_K_M | Q6_K   | High late only |
| mix_q4_q6_q4      | Q4_K_M | Q6_K   | Q4_K_M | High middle only |
| mix_q5_q3_q5      | Q5_K_M | Q3_K_S | Q5_K_M | Squeeze middle |
| mix_q3_q5_q3      | Q3_K_S | Q5_K_M | Q3_K_S | Squeeze ends |
| mix_q6_q3_q4      | Q6_K   | Q3_K_S | Q4_K_M | Protect early, squeeze middle |
| mix_q4_q3_q6      | Q4_K_M | Q3_K_S | Q6_K   | Protect late, squeeze middle |

Total: 6 uniform + 18 mixed = 24 configurations per model.


7. Methodology

Per-Configuration Pipeline

1. Generate per-layer quant map (JSON)
2. Quantize model to GGUF using llama.cpp with custom quant map
3. Record file size
4. Deploy GGUF to Pi cluster (rsync)
5. Wait 60s for thermal stabilization
6. Run benchmark suite (MMLU 5-shot, ARC-C 25-shot, HellaSwag 10-shot)
7. Capture epi-meter power trace
8. Calculate EPI (epi-bench)
9. Log results with full metadata
10. Repeat 3x, report median
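The pipeline above can be sketched as a dry-run command plan. Note that every binary name and flag below is a placeholder, not this project's actual tooling — adapt them to your llama.cpp build, cluster layout, and epi-bench installation:

```python
# Dry-run sketch of the per-configuration pipeline. All commands are
# hypothetical placeholders; the real invocations depend on your setup.
def pipeline_commands(config_id: str, host: str = "pi-cluster") -> list[list[str]]:
    gguf = f"{config_id}.gguf"
    return [
        ["python", "make_quant_map.py", config_id],             # step 1 (hypothetical script)
        ["llama-quantize", "model-f16.gguf", gguf, config_id],  # step 2 (flags are placeholders)
        ["rsync", gguf, f"{host}:/models/"],                    # step 4: deploy to the cluster
        ["sleep", "60"],                                        # step 5: thermal stabilization
        ["epi-bench", "run", "--config", config_id],            # steps 6-9 (hypothetical CLI)
    ]

for cmd in pipeline_commands("mix_high_early"):
    print(" ".join(cmd))
```

Printing rather than executing the commands keeps the plan auditable; wrap each entry in `subprocess.run` once the real flags are known.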

Per-Layer Quant Map Format

{
  "config_id": "mix_high_early",
  "model": "llama-3.1-8b",
  "layer_groups": [
    {"layers": [0, 1, 2, 3, 4, 5, 6, 7], "quant": "Q6_K"},
    {"layers": [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], "quant": "Q4_K_M"},
    {"layers": [24, 25, 26, 27, 28, 29, 30, 31], "quant": "Q3_K_S"}
  ]
}
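Maps in this format can be generated programmatically rather than written by hand. A minimal sketch (the JSON schema mirrors the example above; `make_quant_map` is an illustrative helper, not part of the repo's tooling):

```python
import json

# Build a per-layer quant map in the format shown above. The function
# name and signature are illustrative, not the project's actual API.
def make_quant_map(config_id: str, model: str,
                   groups: list[tuple[range, str]]) -> dict:
    return {
        "config_id": config_id,
        "model": model,
        "layer_groups": [
            {"layers": list(r), "quant": quant} for r, quant in groups
        ],
    }

quant_map = make_quant_map(
    "mix_high_early", "llama-3.1-8b",
    [(range(0, 8), "Q6_K"), (range(8, 24), "Q4_K_M"), (range(24, 32), "Q3_K_S")],
)
print(json.dumps(quant_map, indent=2))
```

Generating all 24 maps this way keeps group boundaries consistent across configurations and makes the matrix trivially diffable.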

Measurements

| Metric | Source | Description |
|--------|--------|-------------|
| EPI | epi-bench | J/token ÷ accuracy — the primary metric |
| J/Token | epi-meter + epi-bench | Total energy ÷ total tokens |
| Accuracy (composite) | Benchmark suite | (MMLU + ARC-C + HellaSwag) / 3 |
| Perplexity | Evaluation script | Standard perplexity on held-out text |
| File size | ls -la | GGUF file size in bytes |
| Tokens/second | Benchmark runner | Inference throughput |
| Avg watts | epi-meter | Average cluster power draw |

8. Results

Status: Data collection pending. The YOSO-YAi FACTORY and epi-meter board are scheduled to be operational in May 2026.

Llama-3.1-8B Results

| Config | Early | Middle | Late | Size (GB) | Perplexity | J/Token | Accuracy | EPI |
|--------|-------|--------|------|-----------|------------|---------|----------|-----|
| uniform_q2k  | Q2_K   | Q2_K   | Q2_K   | | | | | |
| uniform_q3ks | Q3_K_S | Q3_K_S | Q3_K_S | | | | | |
| uniform_q4km | Q4_K_M | Q4_K_M | Q4_K_M | | | | | |
| uniform_q5km | Q5_K_M | Q5_K_M | Q5_K_M | | | | | |
| uniform_q6k  | Q6_K   | Q6_K   | Q6_K   | | | | | |
| uniform_q8   | Q8_0   | Q8_0   | Q8_0   | | | | | |
| mix_high_early    | Q6_K   | Q4_K_M | Q3_K_S | | | | | |
| mix_high_late     | Q3_K_S | Q4_K_M | Q6_K   | | | | | |
| mix_gradient_down | Q6_K   | Q5_K_M | Q4_K_M | | | | | |
| mix_extreme_early | Q8_0   | Q4_K_M | Q2_K   | | | | | |
| ... | ... | ... | ... | | | | | |

Perplexity-Matched Pairs

Configurations with near-identical perplexity but potentially different EPI.

| Config A | Config B | Perplexity A | Perplexity B | EPI A | EPI B | Delta |
|----------|----------|--------------|--------------|-------|-------|-------|

9. Pareto Analysis

Pending measurement data.

Planned Pareto Plot

  J/Token
    │
    │  ×uniform_q8          (high accuracy, high energy)
    │
    │      ×uniform_q6k
    │
    │         ×mix_gradient_down
    │           ×mix_high_early    ← Pareto-optimal?
    │              ×uniform_q4km
    │
    │                 ×mix_bookend_high
    │                    ×uniform_q3ks
    │
    │                          ×uniform_q2k  (low accuracy, low energy)
    │
    └────────────────────────────────────────── Accuracy
                                           1.0

The Pareto frontier connects configurations where no other config is both more accurate AND cheaper in energy. Configurations below the frontier are dominated — another config exists that is either more accurate at the same energy, or cheaper at the same accuracy.
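The dominance test is mechanical once the measurements exist. A minimal sketch, using illustrative placeholder points rather than measured data:

```python
# Find Pareto-optimal configs: a config is dominated when another config
# is at least as accurate AND at least as cheap, and strictly better on
# one axis. The example points below are placeholders, not results.
def pareto_frontier(points: dict[str, tuple[float, float]]) -> set[str]:
    """points maps config_id -> (accuracy, joules_per_token)."""
    frontier = set()
    for name, (acc, j) in points.items():
        dominated = any(
            (a2 >= acc and j2 <= j) and (a2 > acc or j2 < j)
            for other, (a2, j2) in points.items() if other != name
        )
        if not dominated:
            frontier.add(name)
    return frontier

pts = {"uniform_q8": (0.70, 1.2), "mix_high_early": (0.66, 0.7),
       "uniform_q4km": (0.65, 0.8), "uniform_q2k": (0.50, 0.5)}
print(pareto_frontier(pts))
```

In this toy example `uniform_q4km` is dominated by `mix_high_early` (more accurate and cheaper), which is exactly the kind of result RQ4 asks about.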

Key question: Are any mixed-quant configs on the Pareto frontier that no uniform config reaches?


10. Discussion

Pending measurement data.

Expected topics:

  • Why identical perplexity ≠ identical energy: Memory bandwidth on ARM Cortex-A76, cache line utilization for different tensor sizes, compute pipeline stalls
  • The electrician's framing: Quantization is impedance matching. Each layer is a circuit stage. Mixed quantization is adjusting the impedance per stage for minimum total power loss.
  • Practical guidance: Which mixed-quant strategy should a Pi cluster operator choose? Table of recommendations by use case (latency-sensitive, energy-sensitive, quality-sensitive)
  • Tool integration: How to use epi-bench to evaluate your own mixed-quant configs

11. Comparison to Prior Work

| Work | Evaluates | Metric | Hardware | Measures Energy? |
|------|-----------|--------|----------|------------------|
| llama.cpp importance matrix | Per-layer quantization | Perplexity | CPU/GPU | No |
| GPTQ per-layer | Per-layer bit allocation | Perplexity, benchmarks | GPU | No |
| AWQ group quantization | Group-level quantization | Perplexity | GPU | No |
| This paper | Per-layer-group quant on ARM | EPI, J/token, accuracy | Pi 5 cluster | Yes (epi-meter) |

12. Reproducibility

| Component | Repository |
|-----------|------------|
| EPI Framework | energy-per-intelligence |
| Measurement Board | epi-meter |
| Calculation Tooling | epi-bench |
| Raw Data | data/ in this repo |
| Quant Configs | data/quant-configs/ — all 24 per-layer quant maps as JSON |
| Surgery Code | code/surgery/ |
| Analysis Code | code/analysis/ |
| Visualization | code/visualization/ |

Community replication: use epi-bench with your own hardware. Submit results to data/community/ via PR.


13. Future Work

| Direction | Description |
|-----------|-------------|
| Per-layer (not per-group) | Individual layer quantization — 32 independent choices instead of 3 groups |
| Automated search | Gradient-free optimization of per-layer quant map to minimize EPI |
| Combined with pruning | Expert pruning + mixed quantization interaction effects on EPI |
| Dynamic quantization | Load different precision per inference phase (prefill vs. decode) |
| Cross-model | Same mixed-quant strategy across Llama, Mistral, Qwen — does it transfer? |

14. Citation

@article{abner2026mixedquantepi,
  title   = {Per-Layer Quantization Evaluated by Energy Per Intelligence},
  author  = {Abner, Francisco},
  year    = {2026},
  url     = {https://github.com/Franzabner/mixed-quant-epi},
  note    = {YOSO-YAi LLC. Data collection in progress.}
}

15. References

[1] Abner, F. "Energy Per Intelligence." YOSO-YAi LLC, 2026.
[2] Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." 2023.
[3] Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." 2024.
[4] llama.cpp quantization documentation and importance matrix.
[5] Dettmers, T., and Zettlemoyer, L. "The Case for 4-bit Precision: k-bit Inference Scaling Laws." 2023.

16. License

| Content | License |
|---------|---------|
| Paper (README, figures) | CC BY 4.0 |
| Code | MIT |
| Data | CC BY 4.0 |

Two configs. Identical perplexity. Different joule costs.

Perplexity can't see it. The meter can.

YOSO-YAi

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC
