Abstract — The AI industry evaluates model compression by accuracy retained. The energy industry evaluates systems by efficiency delivered. This paper bridges the two with Energy Per Intelligence (EPI) — a metric that divides the energy cost of inference (joules per token) by the task accuracy of the output. We define EPI formally, instrument a Raspberry Pi 5 cluster with a custom AC-side power measurement board (the epi-meter), and establish baseline EPI measurements for unmodified open-weights models on production ARM hardware. Subsequent papers in this series apply EPI to evaluate specific model surgery techniques: expert pruning, mixed quantization, and attention head removal. All measurements, tooling, and the measurement instrument itself are open source.
- Introduction
- The Gap in Existing Research
- Defining Energy Per Intelligence
- Why Joules, Not Watts
- Measurement Infrastructure
- The epi-meter Board
- Experimental Setup
- Baseline Models
- Baseline Results
- Benchmark Selection
- EPI Across Hardware
- Reproducibility
- Limitations
- Future Work
- Citation
- Related Papers
- License
A token is a unit of electricity. Every token an AI model generates is a specific quantity of energy that flowed through a specific piece of silicon. The model's architecture decides how many joules that token costs. The quantization decides it. The number of active MoE experts decides it. Every layer, every attention head, every weight tensor is a component in a circuit, and every component draws power.
Tokenomics is energy economics. Model architecture is circuit design for intelligence. And the question every AI system should answer is:
How many joules does one unit of useful intelligence cost?
The AI industry treats tokens as abstract units of text. API companies price them like items on a shelf: dollars per million. But a token is not abstract. It is a physical event. The author of this paper started as an electrician — pulling wire on Fluor Corporation job sites, working instrumentation and controls at Tesla's Gigafactory, and wiring Meta's NightCrawler data center. He has always thought in watts. This paper applies that lens to AI models.
This document defines:
- The EPI metric that measures intelligence efficiency
- The instrumentation that captures the energy data
- The baseline measurements on production ARM hardware
- The framework for evaluating model surgery by energy outcome
Two active research communities exist. Neither connects to the other.
| Community | What They Do | What They Don't Do |
|---|---|---|
| Model Surgery — MoE-Pruner [1], NAEE [2], EEP [3], SparseGPT [4], Wanda [5] | Prune experts, remove heads, quantize weights. Measure accuracy retention (perplexity, benchmark scores). | Measure energy consumption. No joules/token. No efficiency ratio. No production hardware testing. |
| Energy Benchmarking — TokenPowerBench [6], ML.ENERGY [7], EuroMLSys 2025 [8] | Measure joules/token on stock (unmodified) models. Typically on data center GPUs (A100, H100, B200). | Perform surgery. No model modification. No efficiency ratio. No edge hardware. |
| CNN Energy Analysis — PruneEnergyAnalyzer [9] | Measure energy after pruning CNNs. Provide joules and FPS metrics. | Not LLMs. Not MoE. Not ARM. No EPI-style efficiency ratio. |
| Capability | Model Surgery Papers | Energy Benchmarks | PruneEnergyAnalyzer | This Paper |
|---|---|---|---|---|
| LLM Surgery | Yes | — | — | Yes |
| Dedicated HW Energy Measurement | — | Yes | Yes | Yes |
| Efficiency Metric (EPI) | — | — | — | Yes |
| Production ARM Hardware | — | — | — | Yes |
| Custom Measurement Instrument | — | — | — | Yes |
Nobody performs surgery on an LLM, deploys the modified model to production ARM hardware, measures the actual energy cost with dedicated instrumentation, and frames the result as an efficiency ratio of intelligence per joule. This paper occupies that gap.
EPI = J/T ÷ A
Where:
| Symbol | Name | Unit | Definition |
|---|---|---|---|
| EPI | Energy Per Intelligence | J/(token · accuracy) | The composite metric. Lower is better. |
| J/T | Joules per Token | J/token | Total energy consumed during inference ÷ number of tokens generated. |
| A | Task Accuracy | dimensionless [0, 1] | Model's score on a domain-specific benchmark, normalized to a 0–1 scale. |
E_total
EPI = ─────────────
N_tokens × A
Where:
- `E_total` = total energy consumed during the inference run (joules), measured by the epi-meter
- `N_tokens` = total tokens generated during the inference run
- `A` = benchmark accuracy score, normalized to [0, 1]
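A minimal sketch of the calculation in Python. The reference implementation lives in epi-bench; the function name here is illustrative:

```python
def epi(total_joules: float, total_tokens: int, accuracy: float) -> float:
    """Energy Per Intelligence: joules per token divided by task accuracy.

    Lower is better. `accuracy` must be normalized to (0, 1].
    """
    if total_tokens <= 0 or not (0.0 < accuracy <= 1.0):
        raise ValueError("need tokens > 0 and accuracy in (0, 1]")
    joules_per_token = total_joules / total_tokens
    return joules_per_token / accuracy

# Hypothetical run: 18 000 J over 12 000 tokens at 0.60 composite accuracy.
# J/T = 1.5 J/token, so EPI = 1.5 / 0.60 = 2.5
print(epi(18_000, 12_000, 0.60))  # → 2.5
```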
| EPI Value | Meaning |
|---|---|
| Lower | More useful intelligence per joule. A well-designed circuit. |
| Higher | More energy wasted per unit of useful output. Unnecessary load. |
| Improving | Surgery removed load without destroying capability. |
| Degrading | Surgery damaged capability more than it saved energy. |
A model that produces garbage output at very low energy cost has excellent joules/token but is useless. EPI divides by task accuracy, penalizing low-quality output. A model that burns watts on useless computation is a circuit with unnecessary load. Model surgery removes the load.
A watt is a rate of energy consumption (one joule per second). Two systems can draw the same wattage, but if one takes twice as long per token, it consumes twice the joules to produce the same output.
Joules = Power (W) × Duration (s) = Total energy cost of the token
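A worked example with hypothetical numbers makes the point concrete: at the same average draw, the slower system costs twice the joules per token.

```python
# Two systems drawing the same average power, at different speeds.
# Watts alone hides the difference; joules per token reveals it.
power_w = 25.0            # both systems: 25 W average draw

fast_tokens_per_s = 10.0  # system A
slow_tokens_per_s = 5.0   # system B: half the speed

joules_per_token_fast = power_w / fast_tokens_per_s  # 2.5 J/token
joules_per_token_slow = power_w / slow_tokens_per_s  # 5.0 J/token

print(joules_per_token_fast, joules_per_token_slow)  # 2.5 5.0
```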
| Metric | What It Captures | What It Misses |
|---|---|---|
| Watts | Instantaneous power draw | Duration. A slow model at low watts can cost more than a fast model at high watts. |
| Tokens/second | Speed | Energy. A fast model that draws 3x the power is not efficient. |
| Joules/token | Total energy per output unit | Quality. A model that produces garbage at low energy is not useful. |
| EPI | Energy per unit of useful intelligence | Nothing relevant to this analysis. |
James Prescott Joule established the relationship between heat and mechanical work in the 1840s. The unit that bears his name is the correct unit for measuring the cost of computation: total energy expended, not instantaneous rate.
All EPI measurements are performed on the YOSO-YAi FACTORY infrastructure — a research lab purpose-built for this work.
╔═══════════════════════════════════════════════════════════════════════╗
║ EPI MEASUREMENT PIPELINE ║
╠═══════════════════════════════════════════════════════════════════════╣
║ ║
║ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ ║
║ │ DGX Spark │ │ Pi Cluster │ │ epi-meter │ ║
║ │ GB10 128GB │────▶│ 4x Pi 5 │◄────│ RP2350 + 4x IC │ ║
║ │ │ │ 16GB each │ │ CT clamps (AC) │ ║
║ │ Quantize │ │ │ │ │ ║
║ │ to GGUF │ │ distributed │ │ True RMS watts │ ║
║ │ rsync to Pi │ │ -llama │ │ per node │ ║
║ └──────────────┘ └──────┬───────┘ └────────┬─────────┘ ║
║ │ │ ║
║ Benchmark │ Power trace │ ║
║ results │ (JSON/UART) │ ║
║ ▼ ▼ ║
║ ┌─────────────────────────────────┐ ║
║ │ Orchestrator Pi │ ║
║ │ 10-inch display │ ║
║ │ │ ║
║ │ SQLite: power traces │ ║
║ │ SQLite: benchmark results │ ║
║ │ Live dashboard │ ║
║ └──────────────┬──────────────────┘ ║
║ │ ║
║ SSH pull │ ║
║ ▼ ║
║ ┌──────────────────────────┐ ║
║ │ epi-bench │ ║
║ │ (on DGX or any machine) │ ║
║ │ │ ║
║ │ EPI = J/T ÷ A │ ║
║ │ Pareto plots │ ║
║ │ Results database │ ║
║ └──────────────────────────┘ ║
║ ║
╚═══════════════════════════════════════════════════════════════════════╝
| Machine | Role | Function |
|---|---|---|
| DGX Spark (GB10, 128GB, 1 PFLOP) | Surgeon + Oracle | Model surgery, GGUF quantization, deployment to Pi cluster, results analysis. |
| Pi 5 Cluster (4x 16GB, 64GB total) | Patient | Where modified models run inference on real ARM silicon. Ground truth for EPI. |
| epi-meter Board (RP2350 + metering ICs) | Instrument | Custom PCB. 4-channel AC energy metering via CT clamps. True RMS, power factor, real watts. |
| Orchestrator Pi (10-inch display) | Mission Control | Receives epi-meter data, renders live power visualization, logs to SQLite. |
DGX quantizes model to GGUF
→ rsync shards to Pi cluster
→ Pi cluster loads into distributed-llama
→ Benchmark suite runs on Pi cluster
→ epi-meter captures real watts per node via CT clamps
→ epi-meter streams JSON/UART to Orchestrator
→ Orchestrator logs to SQLite
→ epi-bench pulls traces + benchmark results
→ epi-bench calculates EPI
→ Results logged to database
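The Orchestrator's logging step above can be sketched as follows. The JSON field names are assumptions for illustration; the actual epi-meter firmware defines the real schema:

```python
import json
import sqlite3

# Hypothetical epi-meter UART line; real field names come from the firmware.
sample_line = '{"ts": 1717000000.25, "node": 2, "watts": 11.4, "pf": 0.99}'

db = sqlite3.connect(":memory:")  # the Orchestrator uses a file-backed DB
db.execute(
    "CREATE TABLE IF NOT EXISTS power_trace (ts REAL, node INTEGER, watts REAL, pf REAL)"
)

# Parse one JSON message and append it to the trace table.
msg = json.loads(sample_line)
db.execute(
    "INSERT INTO power_trace VALUES (?, ?, ?, ?)",
    (msg["ts"], msg["node"], msg["watts"], msg["pf"]),
)
db.commit()

# epi-bench later integrates energy as sum(watts * dt) over the trace.
```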
Full design files, firmware, and build guide:
Franzabner/epi-meter
The epi-meter is a custom power measurement PCB designed to instrument the Pi cluster with dedicated energy metering hardware. It is the first YOSO-YAi board design that is fully public.
The Pi 5 exposes no built-in power telemetry (there is no `nvidia-smi` equivalent), and software power estimation on ARM Linux is unreliable and does not capture full system draw. For publishable research, the measurement instrument must be independent of the system being measured.
An electrician does not trust a device's self-reported power draw. An electrician puts a meter on the circuit.
| Parameter | Specification |
|---|---|
| MCU | RP2350 (Pico 2 silicon) — same silicon as BMC in YOSO-YAi production boards |
| Energy Metering | 4x dedicated ICs (ATM90E26 / ADE7753 class) over SPI |
| Current Sensing | 4x CT clamps (SCT-013 class), one per Pi node, non-invasive |
| Voltage Sensing | One voltage divider from shared AC reference |
| Measurements | True RMS voltage + current, real power (W), power factor, reactive power (VAR) |
| Computation | Power computed in hardware by the metering IC, not the MCU |
| Sampling | Metering IC internal: 1 kHz+. RMS output at 10–50 Hz |
| Output | UART to Orchestrator Pi, 115200 baud, JSON |
| Power | USB-C from Orchestrator or separate 5V supply |
| Measurement Point | AC side, between wall outlet and each Pi node |
AC inlet measurement captures everything: PSU efficiency losses, cooling fans, the entire system draw. This is what the electricity bill reflects. This is the real cost of a token. DC measurement would require opening each device, probing internal rails, and would miss PSU overhead. AC measurement via CT clamps is non-invasive — no cutting mains, no voiding warranties.
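To illustrate why true RMS and real power matter on AC, here is a sketch that recovers real watts and power factor from synthetic sampled waveforms. This is the computation the metering IC performs in hardware; the numbers are synthetic:

```python
import math

def real_power(voltage_samples, current_samples):
    """Mean instantaneous power over the window: P = mean(v * i).
    The metering IC computes this in silicon; shown here only to
    illustrate why true-RMS metering matters for reactive loads."""
    n = len(voltage_samples)
    return sum(v * i for v, i in zip(voltage_samples, current_samples)) / n

# Synthetic 60 Hz waveforms: current lags voltage by 30 degrees.
f, fs, n = 60.0, 6000, 6000          # one full second, 100 samples/cycle
v = [170.0 * math.sin(2 * math.pi * f * k / fs) for k in range(n)]
i = [1.0 * math.sin(2 * math.pi * f * k / fs - math.pi / 6) for k in range(n)]

p = real_power(v, i)                             # real watts
v_rms = math.sqrt(sum(x * x for x in v) / n)     # true RMS voltage
i_rms = math.sqrt(sum(x * x for x in i) / n)     # true RMS current
pf = p / (v_rms * i_rms)                         # power factor
print(round(p, 1), round(pf, 3))
```

Multiplying RMS volts by RMS amps alone would overstate the real draw by the power factor; the per-sample product is what the bill reflects.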
| Component | Specification | Qty |
|---|---|---|
| Raspberry Pi 5 | 16 GB LPDDR4X, active cooling, official PSU | 4 |
| NVMe Storage | 1 TB M.2 2230 per node | 4 |
| Network | Gigabit Ethernet, PoE switch | 1 |
| Inference Engine | distributed-llama (distributed across 4 nodes) | — |
| Power Measurement | epi-meter board with CT clamps on each node's AC inlet | 1 |
| Ambient Temperature | Logged per run via Orchestrator (target: 22 ± 2 °C) | — |
1. Power on all nodes. Wait 5 minutes for thermal stabilization.
2. Load model into distributed-llama across all 4 nodes.
3. Verify serving with health check (Serving Verifier skill).
4. Begin epi-meter recording (continuous JSON stream to Orchestrator SQLite).
5. Run benchmark suite:
a. MMLU (5-shot) — broad knowledge
b. ARC-Challenge (25-shot) — reasoning
c. HellaSwag (10-shot) — commonsense
6. Record: total tokens generated, total time, per-node power traces.
7. Stop epi-meter recording.
8. Pull power traces and benchmark results from Orchestrator.
9. Calculate EPI using epi-bench.
10. Log all results with full metadata to results database.
| Factor | Control Method |
|---|---|
| Ambient temperature | HVAC-controlled room, logged per run |
| Background processes | Minimal OS services, no GUI, no competing workloads |
| Warm-up | 5-minute thermal stabilization before each measurement |
| Repetitions | Each configuration measured 3x, median reported |
| Clock stability | Pi 5 frequency governor set to performance (fixed clock) |
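The repetition control in the table above (3x, median reported) can be sketched with hypothetical EPI values:

```python
import statistics

# Hypothetical EPI results from three repetitions of one configuration.
runs = [2.61, 2.48, 2.55]

# The median resists a single outlier run (e.g. a background-process spike)
# better than the mean would.
reported = statistics.median(runs)
print(reported)  # → 2.55
```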
Baseline EPI is measured on unmodified open-weights models — no surgery, no pruning, no custom quantization. These baselines establish the reference point against which all surgical modifications are compared.
| Model | Architecture | Parameters | Quantization | Context |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE (128 experts, 8 active) | 30B total, 3B active | Q4_K_M (GGUF) | 4096 |
| Llama-3.1-8B | Dense | 8B | Q4_K_M (GGUF) | 4096 |
| Mistral-7B-v0.3 | Dense | 7.2B | Q4_K_M (GGUF) | 4096 |
| Phi-3-mini-4k | Dense | 3.8B | Q4_K_M (GGUF) | 4096 |
| Gemma-2-9B | Dense | 9.2B | Q4_K_M (GGUF) | 4096 |
Note: Final model selection may be adjusted based on distributed-llama compatibility and Pi cluster memory constraints. Models listed represent the target baseline set.
Status: Data collection pending. The YOSO-YAi FACTORY and epi-meter board are scheduled to be operational in May 2026. This section will be populated with measured data once the lab is live.
Results will be published in the following structure:
data/baseline/
├── qwen3-30b-a3b_q4km/
│ ├── run_001.json # Full run metadata
│ ├── power_trace_001.csv # Per-node power samples (timestamp, node, watts)
│ ├── benchmark_001.json # MMLU, ARC, HellaSwag scores
│ └── epi_001.json # Calculated EPI + all intermediate values
├── llama-3.1-8b_q4km/
│ └── ...
├── mistral-7b-v03_q4km/
│ └── ...
├── phi-3-mini-4k_q4km/
│ └── ...
└── gemma-2-9b_q4km/
└── ...
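Once a power trace exists, total joules follow from integrating watts over time. A sketch using trapezoidal integration, assuming the CSV columns noted in the tree above (timestamp, node, watts):

```python
import csv
import io

# Tiny inline trace standing in for a power_trace_*.csv file.
trace = io.StringIO(
    "timestamp,node,watts\n"
    "0.0,1,10.0\n"
    "0.1,1,12.0\n"
    "0.2,1,11.0\n"
)

rows = [(float(r["timestamp"]), float(r["watts"])) for r in csv.DictReader(trace)]

# Trapezoidal integration: energy = sum of (dt * average watts) per interval.
joules = sum(
    (t1 - t0) * (w0 + w1) / 2.0
    for (t0, w0), (t1, w1) in zip(rows, rows[1:])
)
print(round(joules, 3))  # 0.1 * 11.0 + 0.1 * 11.5 = 2.25 J
```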
Each `epi_*.json` file will contain the full run schema shown at the end of this document.
The following table will be filled with measured data. Values shown as `—` are pending measurement.
| Model | Quant | J/Token | MMLU | ARC-C | HSwag | Accuracy | EPI |
|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | Q4_K_M | — | — | — | — | — | — |
| Llama-3.1-8B | Q4_K_M | — | — | — | — | — | — |
| Mistral-7B-v0.3 | Q4_K_M | — | — | — | — | — | — |
| Phi-3-mini-4k | Q4_K_M | — | — | — | — | — | — |
| Gemma-2-9B | Q4_K_M | — | — | — | — | — | — |
The intelligence denominator in EPI uses established benchmarks to enable cross-study comparison:
| Benchmark | Measures | Shot Count | Why Selected |
|---|---|---|---|
| MMLU | Broad knowledge across 57 domains | 5-shot | Industry standard. Enables comparison with existing model surgery papers. |
| ARC-Challenge | Grade-school science reasoning | 25-shot | Tests reasoning capability that model surgery may degrade. |
| HellaSwag | Commonsense natural language inference | 10-shot | Sensitive to model quality degradation from aggressive pruning. |
A = w₁ × MMLU + w₂ × ARC-C + w₃ × HellaSwag
Default weights: w₁ = w₂ = w₃ = 1/3 (equal weighting). Custom weighting supported by epi-bench.
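A sketch of the composite, assuming each benchmark score is already normalized to [0, 1]:

```python
def composite_accuracy(mmlu, arc_c, hellaswag, weights=(1/3, 1/3, 1/3)):
    """Weighted composite of benchmark scores, each normalized to [0, 1].
    Weights must sum to 1 so the composite also stays in [0, 1]."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    scores = (mmlu, arc_c, hellaswag)
    return sum(w * s for w, s in zip(weights, scores))

# Equal weighting (the default above), with hypothetical scores:
print(composite_accuracy(0.66, 0.54, 0.78))  # mean of the three scores
```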
A custom electrical engineering domain benchmark is planned as an open-source evaluation suite, enabling EPI measurement specific to the domains YOSO-YAi products serve. This benchmark will be published in a separate repository when ready.
The same model produces different EPI on different hardware. The hardware is a variable in the equation, not a constant.
┌─────────────────────────────────────────────────────────────┐
│ │
│ Same Model, Same Surgery, Different Hardware = Different EPI│
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ DGX │ │ Pi 5 │ │ Pi 4 │ │
│ │ Spark │ │ Cluster │ │ (hypo.) │ │
│ │ │ │ │ │ │ │
│ │ EPI: X │ │ EPI: Y │ │ EPI: Z │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ X ≠ Y ≠ Z — Hardware determines the energy cost │
│ │
└─────────────────────────────────────────────────────────────┘
This is why EPI must be measured on the production hardware (Pi cluster), not the training hardware (DGX Spark). The DGX Spark is the surgeon — it performs the surgery. The Pi cluster is the patient — it lives with the result.
You measure the patient's outcome, not the surgeon's electricity bill.
When reporting EPI, the hardware configuration is always specified. Cross-hardware EPI comparison is meaningful only when the measurement methodology is identical (same epi-meter firmware, same measurement point, same environmental controls).
Everything needed to reproduce these measurements is open source:
| Component | Repository | What's Included |
|---|---|---|
| Measurement Instrument | epi-meter | KiCad schematic, PCB layout, Gerbers, BOM, RP2350 firmware, 3D enclosure, calibration guide |
| Calculation Tooling | epi-bench | EPI calculator, power trace parser, Pareto plotter, benchmark runner, CSV format spec |
| Raw Data | data/ in this repository | Power traces, benchmark scores, calculated EPI, full run metadata |
| Analysis Code | code/ in this repository | Visualization scripts, statistical analysis, table generators |
We invite the community to replicate measurements on their own hardware and submit results:
- Build an epi-meter (or use any calibrated AC power meter)
- Run the standardized benchmark suite from epi-bench
- Calculate EPI using the provided tooling
- Submit a pull request to `data/community/` with your results
See CONTRIBUTING.md for the submission format and quality requirements.
| Limitation | Mitigation |
|---|---|
| AC-side measurement includes PSU efficiency losses | Consistent across all runs. PSU efficiency is part of the real-world cost. |
| CT clamp accuracy (typically ±1–2%) | Calibrated against known resistive load. Error budget documented per run. |
| Pi 5 frequency scaling | Governor set to performance mode (fixed clock) for all measurements. |
| Benchmark scores are task-dependent | Multiple benchmarks with composite scoring. Custom domain benchmark planned. |
| Single lab environment | Environmental conditions logged per run. Community replication invited. |
| Small cluster (4 nodes) | Representative of target deployment hardware. Not intended to model data center scale. |
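The CT-clamp calibration mitigation above can be sketched as a per-channel scale factor; the numbers here are hypothetical:

```python
def calibration_factor(reference_watts, measured_watts):
    """Per-channel scale factor derived from a known resistive load
    measured simultaneously by a trusted reference meter."""
    return reference_watts / measured_watts

# Hypothetical: channel 1 reads 96.2 W against a 100.0 W reference load.
k = calibration_factor(100.0, 96.2)
corrected = 96.2 * k  # applying the factor recovers the reference reading
print(round(k, 4), round(corrected, 1))
```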
| Paper | Repository | Core Question |
|---|---|---|
| Expert Pruning × EPI | expert-pruning-epi | Where is the efficient operating point when dropping MoE experts? |
| Mixed Quantization × EPI | mixed-quant-epi | Two configs, identical perplexity — do they cost the same joules on ARM? |
| Head Surgery × EPI | attention-head-surgery-epi | When you remove attention heads, does the energy actually drop — or do remaining heads compensate? |
| Fine-Tune Energy Payback | Planned | kWh cost of fine-tuning vs. downstream EPI improvement. Payback period in tokens. |
| EPI Prediction | Planned | Given proposed surgery parameters, predict EPI on ARM before deploying. |
@article{abner2026epi,
title = {Energy Per Intelligence: A Metric for Evaluating Model Surgery
From the Perspective of an Electrical Engineer},
author = {Abner, Francisco},
year = {2026},
url = {https://github.com/Franzabner/energy-per-intelligence},
note = {YOSO-YAi LLC. Data collection in progress.}
}

References below include foundational work this paper builds upon and positions against.
| # | Reference | Relevance |
|---|---|---|
| [1] | MoE-Pruner: Pruning Mixture-of-Experts Large Language Models (2024) | Expert pruning methodology — measures accuracy, not energy |
| [2] | NAEE: N-gram Aware Expert Elimination (2024) | MoE expert elimination — accuracy-only evaluation |
| [3] | EEP: Expert-level Efficient Pruning for MoE (2024) | Expert pruning with efficiency claims — no joule measurement |
| [4] | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot (2023) | Weight pruning — perplexity evaluation only |
| [5] | Wanda: A Simple and Effective Pruning Approach for Large Language Models (2023) | Pruning without retraining — accuracy evaluation only |
| [6] | TokenPowerBench: Benchmarking Energy Consumption of LLM Inference (2024) | Energy benchmark on stock models — no surgery |
| [7] | ML.ENERGY Leaderboard (ongoing) | GPU-focused energy tracking — no ARM, no surgery |
| [8] | EuroMLSys 2025: Energy Evaluation of LLM Serving (2025) | Data center energy — no edge hardware, no model modification |
| [9] | PruneEnergyAnalyzer: CNN Pruning Energy Analysis (2024) | Energy after pruning — CNNs only, not LLMs, not ARM |
This work is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
You are free to share and adapt this work for any purpose, provided you give appropriate credit.
All code in this repository is licensed under the MIT License.

```jsonc
{
  "model": "qwen3-30b-a3b",
  "quantization": "Q4_K_M",
  "surgery": "none (baseline)",
  "hardware": "4x Pi 5 16GB (distributed-llama)",
  "instrument": "epi-meter v1.0",
  "measurement_point": "AC inlet per node",
  "environment": {
    "ambient_temp_c": null,          // Logged at runtime
    "frequency_governor": "performance",
    "background_load": "minimal"
  },
  "energy": {
    "total_joules": null,            // Measured by epi-meter
    "total_tokens": null,            // Counted by benchmark runner
    "joules_per_token": null,        // E_total / N_tokens
    "avg_watts_cluster": null,       // Average across all nodes
    "peak_watts_cluster": null,      // Maximum instantaneous
    "duration_seconds": null,        // Total inference time
    "kwh_total": null                // For cost context
  },
  "accuracy": {
    "mmlu_5shot": null,              // 0.0 - 1.0
    "arc_challenge_25shot": null,    // 0.0 - 1.0
    "hellaswag_10shot": null,        // 0.0 - 1.0
    "composite": null                // Weighted average, normalized
  },
  "epi": {
    "value": null,                   // J/T ÷ A — the final metric
    "joules_per_token": null,        // Numerator
    "accuracy_composite": null       // Denominator
  },
  "run_metadata": {
    "run_id": null,
    "timestamp_utc": null,
    "repetition": null,              // 1, 2, or 3
    "epi_meter_firmware": null,
    "distributed_llama_version": null
  }
}
```