YOSO-YAi

Energy Per Intelligence

A Metric for Evaluating Model Surgery From the Perspective of an Electrical Engineer

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC
New Albany, Ohio



Abstract — The AI industry evaluates model compression by accuracy retained. The energy industry evaluates systems by efficiency delivered. This paper bridges the two with Energy Per Intelligence (EPI) — a metric that divides the energy cost of inference (joules per token) by the task accuracy of the output. We define EPI formally, instrument a Raspberry Pi 5 cluster with a custom AC-side power measurement board (the epi-meter), and establish baseline EPI measurements for unmodified open-weights models on production ARM hardware. Subsequent papers in this series apply EPI to evaluate specific model surgery techniques: expert pruning, mixed quantization, and attention head removal. All measurements, tooling, and the measurement instrument itself are open source.


Table of Contents

  1. Introduction
  2. The Gap in Existing Research
  3. Defining Energy Per Intelligence
  4. Why Joules, Not Watts
  5. Measurement Infrastructure
  6. The epi-meter Board
  7. Experimental Setup
  8. Baseline Models
  9. Baseline Results
  10. Benchmark Selection
  11. EPI Across Hardware
  12. Reproducibility
  13. Limitations
  14. Future Work
  15. Citation
  16. Related Papers
  17. License

1. Introduction

A token is a unit of electricity. Every token an AI model generates is a specific quantity of energy that flowed through a specific piece of silicon. The model's architecture decides how many joules that token costs. The quantization decides it. The number of active MoE experts decides it. Every layer, every attention head, every weight tensor is a component in a circuit, and every component draws power.

Tokenomics is energy economics. Model architecture is circuit design for intelligence. And the question every AI system should answer is:

How many joules does one unit of useful intelligence cost?

The AI industry treats tokens as abstract units of text. API companies price them like items on a shelf: dollars per million. But a token is not abstract. It is a physical event. The author of this paper started as an electrician — pulling wire on Fluor Corporation job sites, working instrumentation and controls at Tesla's Gigafactory, and wiring Meta's NightCrawler data center. He has always thought in watts. This paper applies that lens to AI models.

This document defines:

  • The EPI metric that measures intelligence efficiency
  • The instrumentation that captures the energy data
  • The baseline measurements on production ARM hardware
  • The framework for evaluating model surgery by energy outcome

2. The Gap in Existing Research

Two active research communities exist. Neither connects to the other.

| Community | What They Do | What They Don't Do |
|---|---|---|
| Model Surgery — MoE-Pruner [1], NAEE [2], EEP [3], SparseGPT [4], Wanda [5] | Prune experts, remove heads, quantize weights. Measure accuracy retention (perplexity, benchmark scores). | Measure energy consumption. No joules/token. No efficiency ratio. No production hardware testing. |
| Energy Benchmarking — TokenPowerBench [6], ML.ENERGY [7], EuroMLSys 2025 [8] | Measure joules/token on stock (unmodified) models, typically on data center GPUs (A100, H100, B200). | Perform surgery. No model modification. No efficiency ratio. No edge hardware. |
| CNN Energy Analysis — PruneEnergyAnalyzer [9] | Measure energy after pruning CNNs. Provide joules and FPS metrics. | Not LLMs. Not MoE. Not ARM. No EPI-style efficiency ratio. |

The Five-Column Gap

| Capability | Model Surgery Papers | Energy Benchmarks | PruneEnergyAnalyzer | This Paper |
|---|---|---|---|---|
| LLM Surgery | Yes | No | No | Yes |
| Dedicated HW Energy Measurement | No | Yes | Yes | Yes |
| Efficiency Metric (EPI) | No | No | No | Yes |
| Production ARM Hardware | No | No | No | Yes |
| Custom Measurement Instrument | No | No | No | Yes |

Nobody performs surgery on an LLM, deploys the modified model to production ARM hardware, measures the actual energy cost with dedicated instrumentation, and frames the result as an efficiency ratio of intelligence per joule. This paper occupies that gap.


3. Defining Energy Per Intelligence

Formal Definition

EPI = J/T ÷ A

Where:

| Symbol | Name | Unit | Definition |
|---|---|---|---|
| EPI | Energy Per Intelligence | J/(token · accuracy) | The composite metric. Lower is better. |
| J/T | Joules per Token | J/token | Total energy consumed during inference ÷ number of tokens generated. |
| A | Task Accuracy | dimensionless, [0, 1] | Model's score on a domain-specific benchmark, normalized to a 0–1 scale. |

Expanded Form

         E_total
EPI = ─────────────
       N_tokens × A

Where:

  • E_total = total energy consumed during the inference run (joules), measured by the epi-meter
  • N_tokens = total tokens generated during the inference run
  • A = benchmark accuracy score, normalized to [0, 1]
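The expanded form maps directly to code. A minimal sketch — function and variable names are illustrative, not taken from epi-bench:

```python
def epi(total_joules: float, n_tokens: int, accuracy: float) -> float:
    """Energy Per Intelligence: joules per token divided by task accuracy.

    accuracy must be normalized to (0, 1]. Lower EPI is better.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return (total_joules / n_tokens) / accuracy

# Hypothetical run: 1800 J over 600 tokens at 0.75 composite accuracy.
# J/T = 3.0 J/token, EPI = 3.0 / 0.75 = 4.0 J/(token * accuracy)
print(epi(1800.0, 600, 0.75))  # → 4.0
```

Note that the same joules/token figure yields a worse (higher) EPI as accuracy falls, which is the point of the denominator.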

Interpretation

| EPI Value | Meaning |
|---|---|
| Lower | More useful intelligence per joule. A well-designed circuit. |
| Higher | More energy wasted per unit of useful output. Unnecessary load. |
| Improving | Surgery removed load without destroying capability. |
| Degrading | Surgery damaged capability more than it saved energy. |

A model that produces garbage output at very low energy cost has excellent joules/token but is useless. EPI divides by task accuracy, penalizing low-quality output. A model that burns watts on useless computation is a circuit with unnecessary load. Model surgery removes the load.


4. Why Joules, Not Watts

A watt is a rate of energy consumption: one joule per second. Two systems can draw the same watts, but if one takes twice as long per token, it consumes twice the joules for the same output.

Joules = Power (W) × Duration (s) = total energy cost of the token

| Metric | What It Captures | What It Misses |
|---|---|---|
| Watts | Instantaneous power draw | Duration. A slow model at low watts can cost more than a fast model at high watts. |
| Tokens/second | Speed | Energy. A fast model that draws 3x the power is not efficient. |
| Joules/token | Total energy per output unit | Quality. A model that produces garbage at low energy is not useful. |
| EPI | Energy per unit of useful intelligence | Nothing relevant to this analysis. |

James Prescott Joule established the relationship between heat and mechanical work in the 1840s. The unit that bears his name is the correct unit for measuring the cost of computation: total energy expended, not instantaneous rate.
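The watts-vs-joules distinction is easy to make concrete. A hypothetical comparison (numbers invented for illustration): two nodes drawing the same average power at different generation speeds.

```python
def joules_per_token(avg_watts: float, tokens_per_second: float) -> float:
    """J/token = (J/s) / (tokens/s). Power alone says nothing; duration matters."""
    return avg_watts / tokens_per_second

# Same 12 W average draw, different speeds: the slow node costs 2.5x the energy.
print(joules_per_token(12.0, 10.0))  # fast node: 1.2 J/token
print(joules_per_token(12.0, 4.0))   # slow node: 3.0 J/token
```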


5. Measurement Infrastructure

All EPI measurements are performed on the YOSO-YAi FACTORY infrastructure — a research lab purpose-built for this work.

╔═══════════════════════════════════════════════════════════════════════╗
║                    EPI MEASUREMENT PIPELINE                         ║
╠═══════════════════════════════════════════════════════════════════════╣
║                                                                     ║
║  ┌──────────────┐     ┌──────────────┐     ┌──────────────────┐    ║
║  │  DGX Spark   │     │  Pi Cluster  │     │  epi-meter       │    ║
║  │  GB10 128GB  │────▶│  4x Pi 5     │◄────│  RP2350 + 4x IC  │    ║
║  │              │     │  16GB each   │     │  CT clamps (AC)  │    ║
║  │  Quantize    │     │              │     │                  │    ║
║  │  to GGUF     │     │  distributed │     │  True RMS watts  │    ║
║  │  rsync to Pi │     │  -llama      │     │  per node        │    ║
║  └──────────────┘     └──────┬───────┘     └────────┬─────────┘    ║
║                              │                      │              ║
║                    Benchmark │          Power trace  │              ║
║                     results  │           (JSON/UART) │              ║
║                              ▼                      ▼              ║
║                     ┌─────────────────────────────────┐            ║
║                     │     Orchestrator Pi              │            ║
║                     │     10-inch display              │            ║
║                     │                                  │            ║
║                     │  SQLite: power traces            │            ║
║                     │  SQLite: benchmark results       │            ║
║                     │  Live dashboard                  │            ║
║                     └──────────────┬──────────────────┘            ║
║                                    │                               ║
║                          SSH pull  │                               ║
║                                    ▼                               ║
║                     ┌──────────────────────────┐                   ║
║                     │  epi-bench               │                   ║
║                     │  (on DGX or any machine) │                   ║
║                     │                          │                   ║
║                     │  EPI = J/T ÷ A           │                   ║
║                     │  Pareto plots             │                   ║
║                     │  Results database         │                   ║
║                     └──────────────────────────┘                   ║
║                                                                     ║
╚═══════════════════════════════════════════════════════════════════════╝
| Machine | Role | Function |
|---|---|---|
| DGX Spark (GB10, 128GB, 1 PFLOP) | Surgeon + Oracle | Model surgery, GGUF quantization, deployment to Pi cluster, results analysis. |
| Pi 5 Cluster (4x 16GB, 64GB total) | Patient | Where modified models run inference on real ARM silicon. Ground truth for EPI. |
| epi-meter Board (RP2350 + metering ICs) | Instrument | Custom PCB. 4-channel AC energy metering via CT clamps. True RMS, power factor, real watts. |
| Orchestrator Pi (10-inch display) | Mission Control | Receives epi-meter data, renders live power visualization, logs to SQLite. |

Data Flow

DGX quantizes model to GGUF
  → rsync shards to Pi cluster
    → Pi cluster loads into distributed-llama
      → Benchmark suite runs on Pi cluster
        → epi-meter captures real watts per node via CT clamps
          → epi-meter streams JSON/UART to Orchestrator
            → Orchestrator logs to SQLite
              → epi-bench pulls traces + benchmark results
                → epi-bench calculates EPI
                  → Results logged to database

6. The epi-meter Board

Full design files, firmware, and build guide: Franzabner/epi-meter

The epi-meter is a custom power measurement PCB designed to instrument the Pi cluster with dedicated energy metering hardware. It is the first YOSO-YAi board design that is fully public.

Why Not Software Measurement?

The Pi 5 has no GPU telemetry (nvidia-smi equivalent does not exist). Software power estimation on ARM Linux is unreliable and does not capture full system draw. For publishable research, the measurement instrument must be independent of the system being measured.

An electrician does not trust a device's self-reported power draw. An electrician puts a meter on the circuit.

Specifications

| Parameter | Specification |
|---|---|
| MCU | RP2350 (Pico 2 silicon) — same silicon as the BMC in YOSO-YAi production boards |
| Energy Metering | 4x dedicated ICs (ATM90E26 / ADE7753 class) over SPI |
| Current Sensing | 4x CT clamps (SCT-013 class), one per Pi node, non-invasive |
| Voltage Sensing | One voltage divider from shared AC reference |
| Measurements | True RMS voltage + current, real power (W), power factor, reactive power (VAR) |
| Computation | Power computed in hardware by the metering IC, not the MCU |
| Sampling | Metering IC internal: 1 kHz+. RMS output at 10–50 Hz |
| Output | UART to Orchestrator Pi, 115200 baud, JSON |
| Power | USB-C from Orchestrator or separate 5V supply |
| Measurement Point | AC side, between wall outlet and each Pi node |

Why AC Side?

AC inlet measurement captures everything: PSU efficiency losses, cooling fans, the entire system draw. This is what the electricity bill reflects. This is the real cost of a token. DC measurement would require opening each device, probing internal rails, and would miss PSU overhead. AC measurement via CT clamps is non-invasive — no cutting mains, no voiding warranties.
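On the Orchestrator side, the epi-meter's power samples must be integrated over time to yield joules. A sketch of that step, assuming one JSON object per UART line with `node`, `t` (seconds), and `watts` fields — the actual frame format is whatever the epi-meter firmware emits:

```python
import json

def integrate_joules(lines):
    """Trapezoidal integration of per-node power samples into joules.

    Each line: {"node": int, "t": float, "watts": float}, t in seconds.
    Returns {node: total_joules}.
    """
    last = {}    # node -> (t, watts) of the previous sample
    joules = {}  # node -> accumulated energy
    for line in lines:
        s = json.loads(line)
        node, t, w = s["node"], s["t"], s["watts"]
        if node in last:
            t0, w0 = last[node]
            joules[node] = joules.get(node, 0.0) + 0.5 * (w0 + w) * (t - t0)
        last[node] = (t, w)
    return joules

# Hypothetical 2-second trace for one node:
stream = [
    '{"node": 0, "t": 0.0, "watts": 10.0}',
    '{"node": 0, "t": 1.0, "watts": 12.0}',
    '{"node": 0, "t": 2.0, "watts": 12.0}',
]
print(integrate_joules(stream))  # → {0: 23.0}
```

Trapezoidal integration tolerates the epi-meter's variable 10–50 Hz RMS output rate, since each interval is weighted by its actual duration.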


7. Experimental Setup

Hardware Configuration

| Component | Specification | Qty |
|---|---|---|
| Raspberry Pi 5 | 16 GB LPDDR4X, active cooling, official PSU | 4 |
| NVMe Storage | 1 TB M.2 2230 per node | 4 |
| Network | Gigabit Ethernet, PoE switch | 1 |
| Inference Engine | distributed-llama (distributed across 4 nodes) | |
| Power Measurement | epi-meter board with CT clamps on each node's AC inlet | 1 |
| Ambient Temperature | Logged per run via Orchestrator (target: 22 ± 2 °C) | |

Measurement Protocol

1. Power on all nodes. Wait 5 minutes for thermal stabilization.
2. Load model into distributed-llama across all 4 nodes.
3. Verify serving with health check (Serving Verifier skill).
4. Begin epi-meter recording (continuous JSON stream to Orchestrator SQLite).
5. Run benchmark suite:
   a. MMLU (5-shot) — broad knowledge
   b. ARC-Challenge (25-shot) — reasoning
   c. HellaSwag (10-shot) — commonsense
6. Record: total tokens generated, total time, per-node power traces.
7. Stop epi-meter recording.
8. Pull power traces and benchmark results from Orchestrator.
9. Calculate EPI using epi-bench.
10. Log all results with full metadata to results database.

Environmental Controls

| Factor | Control Method |
|---|---|
| Ambient temperature | HVAC-controlled room, logged per run |
| Background processes | Minimal OS services, no GUI, no competing workloads |
| Warm-up | 5-minute thermal stabilization before each measurement |
| Repetitions | Each configuration measured 3x, median reported |
| Clock stability | Pi 5 frequency governor set to performance (fixed clock) |
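Per the repetitions control, the reported figure for each configuration is the median of three runs. A trivial sketch (values hypothetical):

```python
import statistics

def reported_epi(repetitions: list) -> float:
    """Median of the three repetition EPIs, per the measurement protocol."""
    if len(repetitions) != 3:
        raise ValueError("protocol specifies exactly 3 repetitions")
    return statistics.median(repetitions)

print(reported_epi([4.12, 3.98, 4.05]))  # → 4.05
```

The median rather than the mean limits the influence of a single run disturbed by a background event.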

8. Baseline Models

Baseline EPI is measured on unmodified open-weights models — no surgery, no pruning, no custom quantization. These baselines establish the reference point against which all surgical modifications are compared.

| Model | Architecture | Parameters | Quantization | Context |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE (128 experts, 8 active) | 30B total, 3B active | Q4_K_M (GGUF) | 4096 |
| Llama-3.1-8B | Dense | 8B | Q4_K_M (GGUF) | 4096 |
| Mistral-7B-v0.3 | Dense | 7.2B | Q4_K_M (GGUF) | 4096 |
| Phi-3-mini-4k | Dense | 3.8B | Q4_K_M (GGUF) | 4096 |
| Gemma-2-9B | Dense | 9.2B | Q4_K_M (GGUF) | 4096 |

Note: Final model selection may be adjusted based on distributed-llama compatibility and Pi cluster memory constraints. Models listed represent the target baseline set.


9. Baseline Results

Status: Data collection pending. The YOSO-YAi FACTORY and epi-meter board are scheduled to be operational in May 2026. This section will be populated with measured data once the lab is live.

Expected Data Format

Results will be published in the following structure:

data/baseline/
├── qwen3-30b-a3b_q4km/
│   ├── run_001.json          # Full run metadata
│   ├── power_trace_001.csv   # Per-node power samples (timestamp, node, watts)
│   ├── benchmark_001.json    # MMLU, ARC, HellaSwag scores
│   └── epi_001.json          # Calculated EPI + all intermediate values
├── llama-3.1-8b_q4km/
│   └── ...
├── mistral-7b-v03_q4km/
│   └── ...
├── phi-3-mini-4k_q4km/
│   └── ...
└── gemma-2-9b_q4km/
    └── ...

Result Schema

Each epi_*.json file will contain:

{
  "model": "qwen3-30b-a3b",
  "quantization": "Q4_K_M",
  "surgery": "none (baseline)",
  "hardware": "4x Pi 5 16GB (distributed-llama)",
  "instrument": "epi-meter v1.0",
  "measurement_point": "AC inlet per node",
  "environment": {
    "ambient_temp_c": null,       // Logged at runtime
    "frequency_governor": "performance",
    "background_load": "minimal"
  },
  "energy": {
    "total_joules": null,         // Measured by epi-meter
    "total_tokens": null,         // Counted by benchmark runner
    "joules_per_token": null,     // E_total / N_tokens
    "avg_watts_cluster": null,    // Average across all nodes
    "peak_watts_cluster": null,   // Maximum instantaneous
    "duration_seconds": null,     // Total inference time
    "kwh_total": null             // For cost context
  },
  "accuracy": {
    "mmlu_5shot": null,           // 0.0 - 1.0
    "arc_challenge_25shot": null, // 0.0 - 1.0
    "hellaswag_10shot": null,     // 0.0 - 1.0
    "composite": null             // Weighted average, normalized
  },
  "epi": {
    "value": null,                // J/T ÷ A — the final metric
    "joules_per_token": null,     // Numerator
    "accuracy_composite": null    // Denominator
  },
  "run_metadata": {
    "run_id": null,
    "timestamp_utc": null,
    "repetition": null,           // 1, 2, or 3
    "epi_meter_firmware": null,
    "distributed_llama_version": null
  }
}
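Once populated, each record can be cross-checked against the Section 3 definition. A sketch — the loader is illustrative, not epi-bench code, and the values below are placeholders, not measurements:

```python
import json

def check_epi_record(record: dict, tol: float = 1e-9) -> float:
    """Recompute EPI from a result record and verify the stored value."""
    jt = record["energy"]["joules_per_token"]
    acc = record["accuracy"]["composite"]
    epi = jt / acc
    stored = record["epi"]["value"]
    if stored is not None and abs(stored - epi) > tol:
        raise ValueError(f"stored EPI {stored} != recomputed {epi}")
    return epi

# Placeholder record, trimmed to the fields the check needs:
record = json.loads("""{
  "energy":   {"joules_per_token": 3.0},
  "accuracy": {"composite": 0.75},
  "epi":      {"value": 4.0}
}""")
print(check_epi_record(record))  # → 4.0
```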

Placeholder Table

The following table will be filled with measured data; all values are pending measurement.

| Model | Quant | J/Token | MMLU | ARC-C | HSwag | Accuracy | EPI |
|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | Q4_K_M | | | | | | |
| Llama-3.1-8B | Q4_K_M | | | | | | |
| Mistral-7B-v0.3 | Q4_K_M | | | | | | |
| Phi-3-mini-4k | Q4_K_M | | | | | | |
| Gemma-2-9B | Q4_K_M | | | | | | |

10. Benchmark Selection

The intelligence denominator in EPI uses established benchmarks to enable cross-study comparison:

| Benchmark | Measures | Shot Count | Why Selected |
|---|---|---|---|
| MMLU | Broad knowledge across 57 domains | 5-shot | Industry standard. Enables comparison with existing model surgery papers. |
| ARC-Challenge | Grade-school science reasoning | 25-shot | Tests reasoning capability that model surgery may degrade. |
| HellaSwag | Commonsense natural language inference | 10-shot | Sensitive to model quality degradation from aggressive pruning. |

Composite Accuracy Score

A = w₁ × MMLU + w₂ × ARC-C + w₃ × HellaSwag

Default weights: w₁ = w₂ = w₃ = 1/3 (equal weighting). Custom weighting supported by epi-bench.
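The composite can be sketched in a few lines (function name illustrative; epi-bench's actual interface may differ):

```python
def composite_accuracy(mmlu: float, arc_c: float, hellaswag: float,
                       weights: tuple = (1/3, 1/3, 1/3)) -> float:
    """Weighted composite of benchmark scores, each already normalized to [0, 1]."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, (mmlu, arc_c, hellaswag)))

# Equal weighting (the default), with invented scores:
print(round(composite_accuracy(0.62, 0.48, 0.70), 4))  # → 0.6
```

The weights-sum check keeps the composite on the same [0, 1] scale as its inputs, so A stays a valid EPI denominator.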

Future: Domain-Specific Benchmarks

A custom electrical engineering domain benchmark is planned as an open-source evaluation suite, enabling EPI measurement specific to the domains YOSO-YAi products serve. This benchmark will be published in a separate repository when ready.


11. EPI Across Hardware

The same model produces different EPI on different hardware. The hardware is a variable in the equation, not a constant.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  Same Model, Same Surgery, Different Hardware = Different EPI│
│                                                             │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐              │
│  │ DGX     │     │ Pi 5    │     │ Pi 4    │              │
│  │ Spark   │     │ Cluster │     │ (hypo.) │              │
│  │         │     │         │     │         │              │
│  │ EPI: X  │     │ EPI: Y  │     │ EPI: Z  │              │
│  └─────────┘     └─────────┘     └─────────┘              │
│                                                             │
│  X ≠ Y ≠ Z — Hardware determines the energy cost           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

This is why EPI must be measured on the production hardware (Pi cluster), not the training hardware (DGX Spark). The DGX Spark is the surgeon — it performs the surgery. The Pi cluster is the patient — it lives with the result.

You measure the patient's outcome, not the surgeon's electricity bill.

When reporting EPI, the hardware configuration is always specified. Cross-hardware EPI comparison is meaningful only when the measurement methodology is identical (same epi-meter firmware, same measurement point, same environmental controls).


12. Reproducibility

Everything needed to reproduce these measurements is open source:

| Component | Repository | What's Included |
|---|---|---|
| Measurement Instrument | epi-meter | KiCad schematic, PCB layout, Gerbers, BOM, RP2350 firmware, 3D enclosure, calibration guide |
| Calculation Tooling | epi-bench | EPI calculator, power trace parser, Pareto plotter, benchmark runner, CSV format spec |
| Raw Data | data/ in this repository | Power traces, benchmark scores, calculated EPI, full run metadata |
| Analysis Code | code/ in this repository | Visualization scripts, statistical analysis, table generators |

Community Submissions

We invite the community to replicate measurements on their own hardware and submit results:

  1. Build an epi-meter (or use any calibrated AC power meter)
  2. Run the standardized benchmark suite from epi-bench
  3. Calculate EPI using the provided tooling
  4. Submit a pull request to data/community/ with your results

See CONTRIBUTING.md for the submission format and quality requirements.


13. Limitations

| Limitation | Mitigation |
|---|---|
| AC-side measurement includes PSU efficiency losses | Consistent across all runs. PSU efficiency is part of the real-world cost. |
| CT clamp accuracy (typically ±1–2%) | Calibrated against known resistive load. Error budget documented per run. |
| Pi 5 frequency scaling | Governor set to performance mode (fixed clock) for all measurements. |
| Benchmark scores are task-dependent | Multiple benchmarks with composite scoring. Custom domain benchmark planned. |
| Single lab environment | Environmental conditions logged per run. Community replication invited. |
| Small cluster (4 nodes) | Representative of target deployment hardware. Not intended to model data center scale. |

14. Future Work

| Paper | Repository | Core Question |
|---|---|---|
| Expert Pruning × EPI | expert-pruning-epi | Where is the efficient operating point when dropping MoE experts? |
| Mixed Quantization × EPI | mixed-quant-epi | Two configs, identical perplexity — do they cost the same joules on ARM? |
| Head Surgery × EPI | attention-head-surgery-epi | When you remove attention heads, does the energy actually drop — or do remaining heads compensate? |
| Fine-Tune Energy Payback | Planned | kWh cost of fine-tuning vs. downstream EPI improvement. Payback period in tokens. |
| EPI Prediction | Planned | Given proposed surgery parameters, predict EPI on ARM before deploying. |

15. Citation

@article{abner2026epi,
  title   = {Energy Per Intelligence: A Metric for Evaluating Model Surgery
             From the Perspective of an Electrical Engineer},
  author  = {Abner, Francisco},
  year    = {2026},
  url     = {https://github.com/Franzabner/energy-per-intelligence},
  note    = {YOSO-YAi LLC. Data collection in progress.}
}

16. Related Papers

References below include foundational work this paper builds upon and positions against.

| # | Reference | Relevance |
|---|---|---|
| [1] | MoE-Pruner: Pruning Mixture-of-Experts Large Language Models (2024) | Expert pruning methodology — measures accuracy, not energy |
| [2] | NAEE: N-gram Aware Expert Elimination (2024) | MoE expert elimination — accuracy-only evaluation |
| [3] | EEP: Expert-level Efficient Pruning for MoE (2024) | Expert pruning with efficiency claims — no joule measurement |
| [4] | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot (2023) | Weight pruning — perplexity evaluation only |
| [5] | Wanda: A Simple and Effective Pruning Approach for Large Language Models (2023) | Pruning without retraining — accuracy evaluation only |
| [6] | TokenPowerBench: Benchmarking Energy Consumption of LLM Inference (2024) | Energy benchmark on stock models — no surgery |
| [7] | ML.ENERGY Leaderboard (ongoing) | GPU-focused energy tracking — no ARM, no surgery |
| [8] | EuroMLSys 2025: Energy Evaluation of LLM Serving (2025) | Data center energy — no edge hardware, no model modification |
| [9] | PruneEnergyAnalyzer: CNN Pruning Energy Analysis (2024) | Energy after pruning — CNNs only, not LLMs, not ARM |

17. License

This work is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

You are free to share and adapt this work for any purpose, provided you give appropriate credit.

All code in this repository is licensed under the MIT License.


The future of labor is compute. Measured in energy per intelligence.

YOSO-YAi

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC

New Albany, Ohio · 2026
