Abstract — The AI industry evaluates model compression by accuracy retained. The energy industry evaluates systems by efficiency delivered. This paper bridges the two with Energy Per Intelligence (EPI) — a metric that divides the energy cost of inference (joules per token) by the task accuracy of the output. We define EPI formally, instrument a Raspberry Pi 5 cluster with a custom AC-side power measurement board (the epi-meter), and establish baseline EPI measurements for unmodified open-weights models on production ARM hardware. Subsequent papers in this series apply EPI to evaluate specific model surgery techniques: expert pruning, mixed quantization, and attention head removal. All measurements, tooling, and the measurement instrument itself are open source.
- Introduction
- The Gap in Existing Research
- Defining Energy Per Intelligence
- Why Joules, Not Watts
- Measurement Infrastructure
- The epi-meter Board
- Experimental Setup
- Baseline Models
- Baseline Results
- Benchmark Selection
- EPI Across Hardware
- Reproducibility
- Limitations
- Future Work
- Citation
- Related Papers
- License
A token is a unit of electricity. Every token an AI model generates is a specific quantity of energy that flowed through a specific piece of silicon. The model's architecture decides how many joules that token costs. The quantization decides it. The number of active MoE experts decides it. Every layer, every attention head, every weight tensor is a component in a circuit, and every component draws power.
Tokenomics is energy economics. Model architecture is circuit design for intelligence. And the question every AI system should answer is:
How many joules does one unit of useful intelligence cost?
The AI industry treats tokens as abstract units of text. API companies price them like items on a shelf: dollars per million. But a token is not abstract. It is a physical event. The author of this paper started as an electrician — pulling wire on Fluor Corporation job sites, working instrumentation and controls at Tesla's Gigafactory, and wiring Meta's NightCrawler data center. He has always thought in watts. This paper applies that lens to AI models.
This document defines:
- The EPI metric that measures intelligence efficiency
- The instrumentation that captures the energy data
- The baseline measurements on production ARM hardware
- The framework for evaluating model surgery by energy outcome
Two active research communities exist. Neither connects to the other.
| Community | What They Do | What They Don't Do |
|---|---|---|
| Model Surgery — MoE-Pruner [1], NAEE [2], EEP [3], SparseGPT [4], Wanda [5] | Prune experts, remove heads, quantize weights. Measure accuracy retention (perplexity, benchmark scores). | Measure energy consumption. No joules/token. No efficiency ratio. No production hardware testing. |
| Energy Benchmarking — TokenPowerBench [6], ML.ENERGY [7], EuroMLSys 2025 [8] | Measure joules/token on stock (unmodified) models. Typically on data center GPUs (A100, H100, B200). | Perform surgery. No model modification. No efficiency ratio. No edge hardware. |
| CNN Energy Analysis — PruneEnergyAnalyzer [9] | Measure energy after pruning CNNs. Provide joules and FPS metrics. | Not LLMs. Not MoE. Not ARM. No EPI-style efficiency ratio. |
| Capability | Model Surgery Papers | Energy Benchmarks | PruneEnergyAnalyzer | This Paper |
|---|---|---|---|---|
| LLM Surgery | Yes | — | — | Yes |
| Dedicated HW Energy Measurement | — | Yes | Yes | Yes |
| Efficiency Metric (EPI) | — | — | — | Yes |
| Production ARM Hardware | — | — | — | Yes |
| Custom Measurement Instrument | — | — | — | Yes |
Nobody performs surgery on an LLM, deploys the modified model to production ARM hardware, measures the actual energy cost with dedicated instrumentation, and frames the result as an efficiency ratio of intelligence per joule. This paper occupies that gap.
EPI = J/T ÷ A
Where:
| Symbol | Name | Unit | Definition |
|---|---|---|---|
| EPI | Energy Per Intelligence | J/(token · accuracy) | The composite metric. Lower is better. |
| J/T | Joules per Token | J/token | Total energy consumed during inference ÷ number of tokens generated. |
| A | Task Accuracy | dimensionless [0, 1] | Model's score on a domain-specific benchmark, normalized to a 0–1 scale. |
E_total
EPI = ─────────────
N_tokens × A
Where:
- `E_total` = total energy consumed during the inference run (joules), measured by the epi-meter
- `N_tokens` = total tokens generated during the inference run
- `A` = benchmark accuracy score, normalized to [0, 1]
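A minimal sketch of the calculation in Python. The reference implementation lives in epi-bench; the function name here is illustrative:

```python
def epi(total_joules: float, total_tokens: int, accuracy: float) -> float:
    """Energy Per Intelligence: joules per token divided by task accuracy.

    Lower is better. `accuracy` must be normalized to (0, 1].
    """
    if total_tokens <= 0 or not (0.0 < accuracy <= 1.0):
        raise ValueError("need tokens > 0 and accuracy in (0, 1]")
    joules_per_token = total_joules / total_tokens
    return joules_per_token / accuracy

# Hypothetical run: 18 000 J over 12 000 tokens at 0.60 composite accuracy.
# J/T = 1.5 J/token, so EPI = 1.5 / 0.60 = 2.5
print(epi(18_000, 12_000, 0.60))  # → 2.5
```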
| EPI Value | Meaning |
|---|---|
| Lower | More useful intelligence per joule. A well-designed circuit. |
| Higher | More energy wasted per unit of useful output. Unnecessary load. |
| Improving | Surgery removed load without destroying capability. |
| Degrading | Surgery damaged capability more than it saved energy. |
A model that produces garbage output at very low energy cost has excellent joules/token but is useless. EPI divides by task accuracy, penalizing low-quality output. A model that burns watts on useless computation is a circuit with unnecessary load. Model surgery removes the load.
A watt is a rate of energy consumption (one joule per second). Two systems can draw the same wattage, but if one takes twice as long per token, it consumes twice the joules to produce the same output.
Joules = Power (W) × Duration (s) = Total energy cost of the token
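A worked example with hypothetical numbers makes the point concrete: at the same average draw, the slower system costs twice the joules per token.

```python
# Two systems drawing the same average power, at different speeds.
# Watts alone hides the difference; joules per token reveals it.
power_w = 25.0            # both systems: 25 W average draw

fast_tokens_per_s = 10.0  # system A
slow_tokens_per_s = 5.0   # system B: half the speed

joules_per_token_fast = power_w / fast_tokens_per_s  # 2.5 J/token
joules_per_token_slow = power_w / slow_tokens_per_s  # 5.0 J/token

print(joules_per_token_fast, joules_per_token_slow)  # 2.5 5.0
```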
| Metric | What It Captures | What It Misses |
|---|---|---|
| Watts | Instantaneous power draw | Duration. A slow model at low watts can cost more than a fast model at high watts. |
| Tokens/second | Speed | Energy. A fast model that draws 3x the power is not efficient. |
| Joules/token | Total energy per output unit | Quality. A model that produces garbage at low energy is not useful. |
| EPI | Energy per unit of useful intelligence | Nothing relevant to this analysis. |
James Prescott Joule established the relationship between heat and mechanical work in the 1840s. The unit that bears his name is the correct unit for measuring the cost of computation: total energy expended, not instantaneous rate.
All EPI measurements are performed on the YOSO-YAi FACTORY infrastructure — a research lab purpose-built for this work.
╔═══════════════════════════════════════════════════════════════════════╗
║ EPI MEASUREMENT PIPELINE ║
╠═══════════════════════════════════════════════════════════════════════╣
║ ║
║ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ ║
║ │ DGX Spark │ │ Pi Cluster │ │ epi-meter │ ║
║ │ GB10 128GB │────▶│ 4x Pi 5 │◄────│ RP2350 + 4x IC │ ║
║ │ │ │ 16GB each │ │ CT clamps (AC) │ ║
║ │ Quantize │ │ │ │ │ ║
║ │ to GGUF │ │ distributed │ │ True RMS watts │ ║
║ │ rsync to Pi │ │ -llama │ │ per node │ ║
║ └──────────────┘ └──────┬───────┘ └────────┬─────────┘ ║
║ │ │ ║
║ Benchmark │ Power trace │ ║
║ results │ (JSON/UART) │ ║
║ ▼ ▼ ║
║ ┌─────────────────────────────────┐ ║
║ │ Orchestrator Pi │ ║
║ │ 10-inch display │ ║
║ │ │ ║
║ │ SQLite: power traces │ ║
║ │ SQLite: benchmark results │ ║
║ │ Live dashboard │ ║
║ └──────────────┬──────────────────┘ ║
║ │ ║
║ SSH pull │ ║
║ ▼ ║
║ ┌──────────────────────────┐ ║
║ │ epi-bench │ ║
║ │ (on DGX or any machine) │ ║
║ │ │ ║
║ │ EPI = J/T ÷ A │ ║
║ │ Pareto plots │ ║
║ │ Results database │ ║
║ └──────────────────────────┘ ║
║ ║
╚═══════════════════════════════════════════════════════════════════════╝
| Machine | Role | Function |
|---|---|---|
| DGX Spark (GB10, 128GB, 1 PFLOP) | Surgeon + Oracle | Model surgery, GGUF quantization, deployment to Pi cluster, results analysis. |
| Pi 5 Cluster (4x 16GB, 64GB total) | Patient | Where modified models run inference on real ARM silicon. Ground truth for EPI. |
| epi-meter Board (RP2350 + metering ICs) | Instrument | Custom PCB. 4-channel AC energy metering via CT clamps. True RMS, power factor, real watts. |
| Orchestrator Pi (10-inch display) | Mission Control | Receives epi-meter data, renders live power visualization, logs to SQLite. |
DGX quantizes model to GGUF
→ rsync shards to Pi cluster
→ Pi cluster loads into distributed-llama
→ Benchmark suite runs on Pi cluster
→ epi-meter captures real watts per node via CT clamps
→ epi-meter streams JSON/UART to Orchestrator
→ Orchestrator logs to SQLite
→ epi-bench pulls traces + benchmark results
→ epi-bench calculates EPI
→ Results logged to database
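The Orchestrator's logging step above can be sketched as follows. The JSON field names are assumptions for illustration; the actual epi-meter firmware defines the real schema:

```python
import json
import sqlite3

# Hypothetical epi-meter UART line; real field names come from the firmware.
sample_line = '{"ts": 1717000000.25, "node": 2, "watts": 11.4, "pf": 0.99}'

db = sqlite3.connect(":memory:")  # the Orchestrator uses a file-backed DB
db.execute(
    "CREATE TABLE IF NOT EXISTS power_trace (ts REAL, node INTEGER, watts REAL, pf REAL)"
)

# Parse one JSON message and append it to the trace table.
msg = json.loads(sample_line)
db.execute(
    "INSERT INTO power_trace VALUES (?, ?, ?, ?)",
    (msg["ts"], msg["node"], msg["watts"], msg["pf"]),
)
db.commit()

# epi-bench later integrates energy as sum(watts * dt) over the trace.
```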
Full design files, firmware, and build guide:
Franzabner/epi-meter
The epi-meter is a custom power measurement PCB designed to instrument the Pi cluster with dedicated energy metering hardware. It is the first YOSO-YAi board design that is fully public.
The Pi 5 exposes no built-in power telemetry (there is no `nvidia-smi` equivalent), and software power estimation on ARM Linux is unreliable and does not capture full system draw. For publishable research, the measurement instrument must be independent of the system being measured.
An electrician does not trust a device's self-reported power draw. An electrician puts a meter on the circuit.
| Parameter | Specification |
|---|---|
| MCU | RP2350 (Pico 2 silicon) — same silicon as BMC in YOSO-YAi production boards |
| Energy Metering | 4x dedicated ICs (ATM90E26 / ADE7753 class) over SPI |
| Current Sensing | 4x CT clamps (SCT-013 class), one per Pi node, non-invasive |
| Voltage Sensing | One voltage divider from shared AC reference |
| Measurements | True RMS voltage + current, real power (W), power factor, reactive power (VAR) |
| Computation | Power computed in hardware by the metering IC, not the MCU |
| Sampling | Metering IC internal: 1 kHz+. RMS output at 10–50 Hz |
| Output | UART to Orchestrator Pi, 115200 baud, JSON |
| Power | USB-C from Orchestrator or separate 5V supply |
| Measurement Point | AC side, between wall outlet and each Pi node |
AC inlet measurement captures everything: PSU efficiency losses, cooling fans, the entire system draw. This is what the electricity bill reflects. This is the real cost of a token. DC measurement would require opening each device, probing internal rails, and would miss PSU overhead. AC measurement via CT clamps is non-invasive — no cutting mains, no voiding warranties.
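To illustrate why true RMS and real power matter on AC, here is a sketch that recovers real watts and power factor from synthetic sampled waveforms. This is the computation the metering IC performs in hardware; the numbers are synthetic:

```python
import math

def real_power(voltage_samples, current_samples):
    """Mean instantaneous power over the window: P = mean(v * i).
    The metering IC computes this in silicon; shown here only to
    illustrate why true-RMS metering matters for reactive loads."""
    n = len(voltage_samples)
    return sum(v * i for v, i in zip(voltage_samples, current_samples)) / n

# Synthetic 60 Hz waveforms: current lags voltage by 30 degrees.
f, fs, n = 60.0, 6000, 6000          # one full second, 100 samples/cycle
v = [170.0 * math.sin(2 * math.pi * f * k / fs) for k in range(n)]
i = [1.0 * math.sin(2 * math.pi * f * k / fs - math.pi / 6) for k in range(n)]

p = real_power(v, i)                             # real watts
v_rms = math.sqrt(sum(x * x for x in v) / n)     # true RMS voltage
i_rms = math.sqrt(sum(x * x for x in i) / n)     # true RMS current
pf = p / (v_rms * i_rms)                         # power factor
print(round(p, 1), round(pf, 3))
```

Multiplying RMS volts by RMS amps alone would overstate the real draw by the power factor; the per-sample product is what the bill reflects.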
| Component | Specification | Qty |
|---|---|---|
| Raspberry Pi 5 | 16 GB LPDDR4X, active cooling, official PSU | 4 |
| NVMe Storage | 1 TB M.2 2230 per node | 4 |
| Network | Gigabit Ethernet, PoE switch | 1 |
| Inference Engine | distributed-llama (distributed across 4 nodes) | — |
| Power Measurement | epi-meter board with CT clamps on each node's AC inlet | 1 |
| Ambient Temperature | Logged per run via Orchestrator (target: 22 ± 2 °C) | — |
1. Power on all nodes. Wait 5 minutes for thermal stabilization.
2. Load model into distributed-llama across all 4 nodes.
3. Verify serving with health check (Serving Verifier skill).
4. Begin epi-meter recording (continuous JSON stream to Orchestrator SQLite).
5. Run benchmark suite:
a. MMLU (5-shot) — broad knowledge
b. ARC-Challenge (25-shot) — reasoning
c. HellaSwag (10-shot) — commonsense
6. Record: total tokens generated, total time, per-node power traces.
7. Stop epi-meter recording.
8. Pull power traces and benchmark results from Orchestrator.
9. Calculate EPI using epi-bench.
10. Log all results with full metadata to results database.
| Factor | Control Method |
|---|---|
| Ambient temperature | HVAC-controlled room, logged per run |
| Background processes | Minimal OS services, no GUI, no competing workloads |
| Warm-up | 5-minute thermal stabilization before each measurement |
| Repetitions | Each configuration measured 3x, median reported |
| Clock stability | Pi 5 frequency governor set to performance (fixed clock) |
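The repetition control in the table above (3x, median reported) can be sketched with hypothetical EPI values:

```python
import statistics

# Hypothetical EPI results from three repetitions of one configuration.
runs = [2.61, 2.48, 2.55]

# The median resists a single outlier run (e.g. a background-process spike)
# better than the mean would.
reported = statistics.median(runs)
print(reported)  # → 2.55
```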
Baseline EPI is measured on unmodified open-weights models — no surgery, no pruning, no custom quantization. These baselines establish the reference point against which all surgical modifications are compared.
| Model | Architecture | Parameters | Quantization | Context |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE (128 experts, 8 active) | 30B total, 3B active | Q4_K_M (GGUF) | 4096 |
| Llama-3.1-8B | Dense | 8B | Q4_K_M (GGUF) | 4096 |
| Mistral-7B-v0.3 | Dense | 7.2B | Q4_K_M (GGUF) | 4096 |
| Phi-3-mini-4k | Dense | 3.8B | Q4_K_M (GGUF) | 4096 |
| Gemma-2-9B | Dense | 9.2B | Q4_K_M (GGUF) | 4096 |
Note: Final model selection may be adjusted based on distributed-llama compatibility and Pi cluster memory constraints. Models listed represent the target baseline set.
Status: Data collection pending. The YOSO-YAi FACTORY and epi-meter board are scheduled to be operational in May 2026. This section will be populated with measured data once the lab is live.
Results will be published in the following structure:
data/baseline/
├── qwen3-30b-a3b_q4km/
│ ├── run_001.json # Full run metadata
│ ├── power_trace_001.csv # Per-node power samples (timestamp, node, watts)
│ ├── benchmark_001.json # MMLU, ARC, HellaSwag scores
│ └── epi_001.json # Calculated EPI + all intermediate values
├── llama-3.1-8b_q4km/
│ └── ...
├── mistral-7b-v03_q4km/
│ └── ...
├── phi-3-mini-4k_q4km/
│ └── ...
└── gemma-2-9b_q4km/
└── ...
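Once a power trace exists, total joules follow from integrating watts over time. A sketch using trapezoidal integration, assuming the CSV columns noted in the tree above (timestamp, node, watts):

```python
import csv
import io

# Tiny inline trace standing in for a power_trace_*.csv file.
trace = io.StringIO(
    "timestamp,node,watts\n"
    "0.0,1,10.0\n"
    "0.1,1,12.0\n"
    "0.2,1,11.0\n"
)

rows = [(float(r["timestamp"]), float(r["watts"])) for r in csv.DictReader(trace)]

# Trapezoidal integration: energy = sum of (dt * average watts) per interval.
joules = sum(
    (t1 - t0) * (w0 + w1) / 2.0
    for (t0, w0), (t1, w1) in zip(rows, rows[1:])
)
print(round(joules, 3))  # 0.1 * 11.0 + 0.1 * 11.5 = 2.25 J
```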
Each `epi_*.json` file will contain the full run schema shown at the end of this document.
The following table will be filled with measured data. Values shown as `—` are pending measurement.
| Model | Quant | J/Token | MMLU | ARC-C | HSwag | Accuracy | EPI |
|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | Q4_K_M | — | — | — | — | — | — |
| Llama-3.1-8B | Q4_K_M | — | — | — | — | — | — |
| Mistral-7B-v0.3 | Q4_K_M | — | — | — | — | — | — |
| Phi-3-mini-4k | Q4_K_M | — | — | — | — | — | — |
| Gemma-2-9B | Q4_K_M | — | — | — | — | — | — |
The intelligence denominator in EPI uses established benchmarks to enable cross-study comparison:
| Benchmark | Measures | Shot Count | Why Selected |
|---|---|---|---|
| MMLU | Broad knowledge across 57 domains | 5-shot | Industry standard. Enables comparison with existing model surgery papers. |
| ARC-Challenge | Grade-school science reasoning | 25-shot | Tests reasoning capability that model surgery may degrade. |
| HellaSwag | Commonsense natural language inference | 10-shot | Sensitive to model quality degradation from aggressive pruning. |
A = w₁ × MMLU + w₂ × ARC-C + w₃ × HellaSwag
Default weights: w₁ = w₂ = w₃ = 1/3 (equal weighting). Custom weighting supported by epi-bench.
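A sketch of the composite, assuming each benchmark score is already normalized to [0, 1]:

```python
def composite_accuracy(mmlu, arc_c, hellaswag, weights=(1/3, 1/3, 1/3)):
    """Weighted composite of benchmark scores, each normalized to [0, 1].
    Weights must sum to 1 so the composite also stays in [0, 1]."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    scores = (mmlu, arc_c, hellaswag)
    return sum(w * s for w, s in zip(weights, scores))

# Equal weighting (the default above), with hypothetical scores:
print(composite_accuracy(0.66, 0.54, 0.78))  # mean of the three scores
```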
A custom electrical engineering domain benchmark is planned as an open-source evaluation suite, enabling EPI measurement specific to the domains YOSO-YAi products serve. This benchmark will be published in a separate repository when ready.
The same model produces different EPI on different hardware. The hardware is a variable in the equation, not a constant.
┌─────────────────────────────────────────────────────────────┐
│ │
│ Same Model, Same Surgery, Different Hardware = Different EPI│
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ DGX │ │ Pi 5 │ │ Pi 4 │ │
│ │ Spark │ │ Cluster │ │ (hypo.) │ │
│ │ │ │ │ │ │ │
│ │ EPI: X │ │ EPI: Y │ │ EPI: Z │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ X ≠ Y ≠ Z — Hardware determines the energy cost │
│ │
└─────────────────────────────────────────────────────────────┘
This is why EPI must be measured on the production hardware (Pi cluster), not the training hardware (DGX Spark). The DGX Spark is the surgeon — it performs the surgery. The Pi cluster is the patient — it lives with the result.
You measure the patient's outcome, not the surgeon's electricity bill.
When reporting EPI, the hardware configuration is always specified. Cross-hardware EPI comparison is meaningful only when the measurement methodology is identical (same epi-meter firmware, same measurement point, same environmental controls).
Everything needed to reproduce these measurements is open source:
| Component | Repository | What's Included |
|---|---|---|
| Measurement Instrument | epi-meter | KiCad schematic, PCB layout, Gerbers, BOM, RP2350 firmware, 3D enclosure, calibration guide |
| Calculation Tooling | epi-bench | EPI calculator, power trace parser, Pareto plotter, benchmark runner, CSV format spec |
| Raw Data | data/ in this repository | Power traces, benchmark scores, calculated EPI, full run metadata |
| Analysis Code | code/ in this repository | Visualization scripts, statistical analysis, table generators |
We invite the community to replicate measurements on their own hardware and submit results:
- Build an epi-meter (or use any calibrated AC power meter)
- Run the standardized benchmark suite from epi-bench
- Calculate EPI using the provided tooling
- Submit a pull request to `data/community/` with your results
See CONTRIBUTING.md for the submission format and quality requirements.
| Limitation | Mitigation |
|---|---|
| AC-side measurement includes PSU efficiency losses | Consistent across all runs. PSU efficiency is part of the real-world cost. |
| CT clamp accuracy (typically ±1–2%) | Calibrated against known resistive load. Error budget documented per run. |
| Pi 5 frequency scaling | Governor set to performance mode (fixed clock) for all measurements. |
| Benchmark scores are task-dependent | Multiple benchmarks with composite scoring. Custom domain benchmark planned. |
| Single lab environment | Environmental conditions logged per run. Community replication invited. |
| Small cluster (4 nodes) | Representative of target deployment hardware. Not intended to model data center scale. |
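The CT-clamp calibration mitigation above can be sketched as a per-channel scale factor; the numbers here are hypothetical:

```python
def calibration_factor(reference_watts, measured_watts):
    """Per-channel scale factor derived from a known resistive load
    measured simultaneously by a trusted reference meter."""
    return reference_watts / measured_watts

# Hypothetical: channel 1 reads 96.2 W against a 100.0 W reference load.
k = calibration_factor(100.0, 96.2)
corrected = 96.2 * k  # applying the factor recovers the reference reading
print(round(k, 4), round(corrected, 1))
```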
| Paper | Repository | Core Question |
|---|---|---|
| Expert Pruning × EPI | expert-pruning-epi | Where is the efficient operating point when dropping MoE experts? |
| Mixed Quantization × EPI | mixed-quant-epi | Two configs, identical perplexity — do they cost the same joules on ARM? |
| Head Surgery × EPI | attention-head-surgery-epi | When you remove attention heads, does the energy actually drop — or do remaining heads compensate? |
| Fine-Tune Energy Payback | Planned | kWh cost of fine-tuning vs. downstream EPI improvement. Payback period in tokens. |
| EPI Prediction | Planned | Given proposed surgery parameters, predict EPI on ARM before deploying. |
@article{abner2026epi,
title = {Energy Per Intelligence: A Metric for Evaluating Model Surgery
From the Perspective of an Electrical Engineer},
author = {Abner, Francisco},
year = {2026},
url = {https://github.com/Franzabner/energy-per-intelligence},
note = {YOSO-YAi LLC. Data collection in progress.}
}

References below include foundational work this paper builds upon and positions against.
| # | Reference | Relevance |
|---|---|---|
| [1] | MoE-Pruner: Pruning Mixture-of-Experts Large Language Models (2024) | Expert pruning methodology — measures accuracy, not energy |
| [2] | NAEE: N-gram Aware Expert Elimination (2024) | MoE expert elimination — accuracy-only evaluation |
| [3] | EEP: Expert-level Efficient Pruning for MoE (2024) | Expert pruning with efficiency claims — no joule measurement |
| [4] | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot (2023) | Weight pruning — perplexity evaluation only |
| [5] | Wanda: A Simple and Effective Pruning Approach for Large Language Models (2023) | Pruning without retraining — accuracy evaluation only |
| [6] | TokenPowerBench: Benchmarking Energy Consumption of LLM Inference (2024) | Energy benchmark on stock models — no surgery |
| [7] | ML.ENERGY Leaderboard (ongoing) | GPU-focused energy tracking — no ARM, no surgery |
| [8] | EuroMLSys 2025: Energy Evaluation of LLM Serving (2025) | Data center energy — no edge hardware, no model modification |
| [9] | PruneEnergyAnalyzer: CNN Pruning Energy Analysis (2024) | Energy after pruning — CNNs only, not LLMs, not ARM |
This work is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
You are free to share and adapt this work for any purpose, provided you give appropriate credit.
All code in this repository is licensed under the MIT License.

```jsonc
{
  "model": "qwen3-30b-a3b",
  "quantization": "Q4_K_M",
  "surgery": "none (baseline)",
  "hardware": "4x Pi 5 16GB (distributed-llama)",
  "instrument": "epi-meter v1.0",
  "measurement_point": "AC inlet per node",
  "environment": {
    "ambient_temp_c": null,          // Logged at runtime
    "frequency_governor": "performance",
    "background_load": "minimal"
  },
  "energy": {
    "total_joules": null,            // Measured by epi-meter
    "total_tokens": null,            // Counted by benchmark runner
    "joules_per_token": null,        // E_total / N_tokens
    "avg_watts_cluster": null,       // Average across all nodes
    "peak_watts_cluster": null,      // Maximum instantaneous
    "duration_seconds": null,        // Total inference time
    "kwh_total": null                // For cost context
  },
  "accuracy": {
    "mmlu_5shot": null,              // 0.0 - 1.0
    "arc_challenge_25shot": null,    // 0.0 - 1.0
    "hellaswag_10shot": null,        // 0.0 - 1.0
    "composite": null                // Weighted average, normalized
  },
  "epi": {
    "value": null,                   // J/T ÷ A — the final metric
    "joules_per_token": null,        // Numerator
    "accuracy_composite": null       // Denominator
  },
  "run_metadata": {
    "run_id": null,
    "timestamp_utc": null,
    "repetition": null,              // 1, 2, or 3
    "epi_meter_firmware": null,
    "distributed_llama_version": null
  }
}
```