Abstract — Mixture-of-Experts (MoE) models activate only a subset of their parameters per token, making them natural candidates for expert pruning. Existing work evaluates pruning by accuracy retained — perplexity, benchmark scores. None measures the energy cost of the result on production hardware. This paper applies Energy Per Intelligence (EPI) to expert pruning: we systematically drop 1 through N experts from an MoE model, deploy each variant to a Raspberry Pi 5 cluster, measure real power consumption with the epi-meter board, and calculate EPI for every configuration. The result is an EPI curve — a plot of EPI vs. experts removed — that reveals the efficient operating point: the depth of pruning that minimizes the energy cost per unit of useful intelligence. We compare this operating point to the accuracy-only optimal identified by prior work and show that the two do not always coincide.
- Introduction
- Background
- Research Questions
- Experimental Design
- Surgery Matrix
- Target Model
- Methodology
- Expected Results
- Results
- Analysis
- Discussion
- Comparison to Prior Work
- Reproducibility
- Future Work
- Citation
- References
- License
An MoE model is a circuit with redundant paths. Each expert is a sub-network that activates for specific inputs. A model with 128 experts but only 8 active per token has 120 experts idle at any given moment — but they still occupy memory, influence routing, and shape the model's capacity distribution.
Expert pruning removes experts permanently. The pruned model is smaller, loads faster, and — the hypothesis — consumes less energy per token. But does the energy saving justify the accuracy loss? At what point does removing one more expert cost more intelligence than it saves in joules?
Existing expert pruning papers (MoE-Pruner [1], NAEE [2], EEP [3]) answer the accuracy question: how many experts can you drop before the model degrades? None of them answers the energy question: how many experts should you drop to minimize the energy cost per unit of useful intelligence?
This paper answers the energy question. We use EPI — Energy Per Intelligence — as the metric, the epi-meter as the measurement instrument, and a Raspberry Pi 5 cluster as the production hardware target. The result is an EPI curve that identifies the efficient operating point — the pruning depth that minimizes EPI.
MoE models use a gating network (router) to select a subset of experts for each token. For a model with E total experts and K active experts per token:
Token → Router → Select K of E experts → Weighted sum of expert outputs
Benefits: parameter efficiency (large total capacity, small active compute). Cost: routing overhead, memory for all experts, load imbalance.
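The routing step above can be sketched in a few lines of simplified Python. This is an illustration, not the Qwen3 implementation: the router matrix, expert callables, and dimensions are all placeholders.

```python
import numpy as np

def moe_forward(token_emb, router_W, experts, k=8):
    """Top-K gating: route one token through K of E experts.

    token_emb: (d,) token embedding
    router_W:  (E, d) router weight matrix, one row per expert
    experts:   list of E callables, each mapping (d,) -> (d,)
    """
    logits = router_W @ token_emb                      # (E,) router scores
    top_k = np.argsort(logits)[-k:]                    # indices of the K best experts
    gates = np.exp(logits[top_k])
    gates /= gates.sum()                               # softmax over the selected K
    # Weighted sum of the K expert outputs; the other E-K experts stay idle
    return sum(g * experts[i](token_emb) for g, i in zip(gates, top_k))

# Toy example matching Qwen3-30B-A3B's ratio: E=128 experts, K=8 active
rng = np.random.default_rng(0)
E, d = 128, 16
router_W = rng.normal(size=(E, d))
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(E)]
out = moe_forward(rng.normal(size=d), router_W, experts, k=8)
assert out.shape == (d,)
```

Note that only K expert sub-networks execute per token, which is exactly why the remaining E-K experts are candidates for permanent removal.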
Expert pruning removes entire expert sub-networks from the model:
| Method | Strategy | Evaluation |
|---|---|---|
| MoE-Pruner [1] | Prune by router frequency (least-used experts) | Perplexity, downstream accuracy |
| NAEE [2] | N-gram aware elimination (prune by token context) | Perplexity, benchmark scores |
| EEP [3] | Expert-level efficiency pruning | Accuracy retention per parameter removed |
Gap: All three measure accuracy. None measures energy. None uses EPI. None tests on ARM hardware.
EPI = Joules per Token / Task Accuracy
Defined in energy-per-intelligence [4]. Lower EPI = more useful intelligence per joule. Measured with the epi-meter on a Pi 5 cluster.
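As a concrete illustration of the metric (the numbers below are made up, not measurements):

```python
def epi(joules_per_token: float, accuracy: float) -> float:
    """EPI = Joules per Token / Task Accuracy. Lower is better."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return joules_per_token / accuracy

# Illustrative only: 2.0 J/token at 65% composite accuracy
print(epi(2.0, 0.65))  # ~3.08 J per unit of accuracy
```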
| # | Question |
|---|---|
| RQ1 | How does EPI change as experts are removed from an MoE model? |
| RQ2 | Where is the EPI minimum — the efficient operating point? |
| RQ3 | Does the EPI-optimal pruning depth differ from the accuracy-optimal pruning depth? |
| RQ4 | Does the location of the EPI minimum vary across layer groups (early, middle, late)? |
| RQ5 | Does the energy saving from expert removal come from reduced computation, reduced memory bandwidth, or both? |
The central output of this paper is the EPI curve: a plot of EPI (Y-axis) vs. number of experts removed (X-axis). The shape reveals the tradeoff:
EPI
│
│ ╲
│ ╲ ← Accuracy drops faster than energy saves
│ ╲
│ ╲___
│ ╲__ ← Efficient operating point (EPI minimum)
│ ╲
│ ╲
│ ╲ ← Energy still dropping but accuracy
│ ╲ collapses — EPI rises
│ ╲
└──────────────────────── Experts Removed
0 1 2 3 4 5 6 7 8
The EPI minimum is the point where removing one more expert costs more intelligence (accuracy) than it saves in energy (joules). This is the efficient operating point — and it is invisible to accuracy-only analysis.
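In differential terms (treating energy E(n) and accuracy A(n) as smooth functions of pruning depth n, an idealization of the discrete curve above), the minimum of EPI = E/A sits where the relative rates of change balance:

```latex
\frac{d}{dn}\!\left(\frac{E}{A}\right)
= \frac{E'A - E A'}{A^{2}} = 0
\quad\Longrightarrow\quad
\frac{E'}{E} = \frac{A'}{A}
```

That is, at the efficient operating point the fractional energy saved by removing one more expert exactly equals the fractional accuracy lost.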
| Component | Specification |
|---|---|
| Surgery platform | DGX Spark (GB10, 128GB, 1 PFLOP) |
| Deployment target | Pi 5 cluster (4x 16GB, distributed-llama) |
| Measurement | epi-meter board (RP2350, 4x ATM90E26, CT clamps) |
| Orchestrator | Pi 5 with 10-inch display, SQLite logging |
| Quantization | Q4_K_M (GGUF) via llama.cpp |
The surgery matrix is executed autonomously by NemoClaw (YOSO-YAi's autonomous systems engineer on the DGX Spark) via n8n workflows:
CEO triggers via Telegram
→ n8n generates surgery matrix
→ For each configuration:
→ NemoClaw performs surgery (mergekit + safetensors)
→ Validate modified model
→ Quantize to GGUF (llama.cpp)
→ Deploy to Pi cluster (rsync)
→ Wait 60s for stabilization
→ Run benchmark suite
→ Collect epi-meter power trace
→ Calculate EPI (epi-bench)
→ Log results
→ Generate summary report
→ Send to CEO via Telegram
Each surgery run is one combination of parameters. The full matrix systematically explores expert removal depth and layer targeting.
| Parameter | Values | Count |
|---|---|---|
| Experts dropped | 1, 2, 3, 4, 5, 6, 7, 8 | 8 |
| Layer targeting | all layers, early only (0–25%), late only (75–100%) | 3 |
| Total runs | 8 × 3 | 24 |
Plus 1 baseline (zero experts removed) = 25 total configurations.
| Run | Experts Dropped | Target Layers | Surgery ID |
|---|---|---|---|
| 0 | 0 (baseline) | — | baseline |
| 1 | 1 | all | drop1_all |
| 2 | 1 | early | drop1_early |
| 3 | 1 | late | drop1_late |
| 4 | 2 | all | drop2_all |
| 5 | 2 | early | drop2_early |
| 6 | 2 | late | drop2_late |
| 7 | 3 | all | drop3_all |
| ... | ... | ... | ... |
| 22 | 8 | all | drop8_all |
| 23 | 8 | early | drop8_early |
| 24 | 8 | late | drop8_late |
| Parameter | Value |
|---|---|
| Model | Qwen3-30B-A3B |
| Architecture | MoE (Mixture of Experts) |
| Total parameters | 30B |
| Active parameters | 3B per token |
| Total experts | 128 |
| Active experts | 8 per token |
| Router | Top-K gating |
| Quantization | Q4_K_M (GGUF) |
| Inference engine | distributed-llama |
| Deployment | 4x Pi 5 16GB |
- MoE architecture with a large number of experts (128) — ample room for pruning exploration
- High expert-to-active ratio (128:8 = 16:1) — most experts are idle per token
- Fits on Pi cluster after Q4_K_M quantization — production-relevant deployment target
- Active community — results are immediately relevant to the Qwen ecosystem
For each run in the matrix:
- Load model — Open the Qwen3-30B-A3B safetensors on DGX Spark
- Identify targets — Select experts to drop based on router frequency analysis
- Perform surgery — Remove expert weight tensors using mergekit/safetensors
- Validate — Load modified model, run sanity prompts, verify coherent output
- Quantize — Convert to Q4_K_M GGUF using llama.cpp
- Deploy — rsync GGUF shards to Pi cluster
- Stabilize — Wait 60 seconds for thermal equilibrium
- Benchmark — Run MMLU (5-shot), ARC-Challenge (25-shot), HellaSwag (10-shot)
- Measure — Capture epi-meter power trace for entire benchmark duration
- Calculate — Compute EPI using epi-bench
- Log — Store all results with full metadata
Experts are ranked by router frequency — the fraction of tokens for which each expert is selected. Least-frequently-used experts are dropped first. This follows the approach of MoE-Pruner [1] to enable direct comparison.
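The ranking step can be sketched as follows, assuming a `router_counts` array of per-expert selection counts collected from a calibration pass (the name and shape are assumptions for illustration; the actual scripts live in `code/surgery/`):

```python
import numpy as np

def experts_to_drop(router_counts, n_drop):
    """Rank experts by router frequency; return the n_drop least-used indices.

    router_counts: (E,) array where router_counts[i] is how many calibration
    tokens selected expert i. Stable sort breaks ties by expert index.
    """
    freq = router_counts / router_counts.sum()   # selection frequency per expert
    order = np.argsort(freq, kind="stable")      # ascending: least-used first
    return order[:n_drop].tolist()

# Toy calibration: 128 experts with skewed usage
rng = np.random.default_rng(1)
counts = rng.integers(1, 1000, size=128)
drop = experts_to_drop(counts, 4)
assert len(drop) == 4
assert counts[drop[0]] == counts.min()           # least-used expert goes first
```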
| Benchmark | Shots | Measures |
|---|---|---|
| MMLU | 5 | Broad knowledge (57 domains) |
| ARC-Challenge | 25 | Reasoning |
| HellaSwag | 10 | Commonsense inference |
Composite accuracy: A = (MMLU + ARC-C + HellaSwag) / 3
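With illustrative scores (not measurements), the composite is a plain mean:

```python
mmlu, arc_c, hellaswag = 0.70, 0.60, 0.75   # illustrative scores only
A = (mmlu + arc_c + hellaswag) / 3
print(round(A, 4))  # 0.6833
```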
- Instrument: epi-meter board (4x ATM90E26, CT clamps on each Pi node's AC inlet)
- Measurement point: AC side — captures total system draw including PSU losses
- Sampling: 1 Hz JSON telemetry to Orchestrator SQLite
- Repetitions: 3 per configuration, median reported
- Environmental: Fixed CPU frequency governor (`performance`), ambient temperature logged
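One way the 1 Hz trace could be reduced to J/token is trapezoidal integration of the power samples over the benchmark window; this is a sketch under that assumption, and the real reduction lives in epi-bench:

```python
import numpy as np

def joules_per_token(power_w, timestamps_s, tokens_generated):
    """Integrate a sampled power trace (watts at given timestamps) into
    total joules via the trapezoid rule, then divide by tokens produced."""
    dt = np.diff(timestamps_s)
    energy_j = float(np.sum((power_w[1:] + power_w[:-1]) / 2 * dt))
    return energy_j / tokens_generated

# Toy trace: steady 25 W sampled at 1 Hz for 10 s while generating 50 tokens
t = np.arange(0.0, 10.0, 1.0)
p = np.full_like(t, 25.0)
print(joules_per_token(p, t, 50))  # 225 J / 50 tokens = 4.5 J/token
```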
Hypotheses to be validated or refuted by measurement.
H1 (U-shaped EPI curve): Removing experts initially improves EPI (energy drops faster than accuracy). Past a critical point, accuracy collapses while energy savings plateau — EPI rises. The curve has a minimum.
H2 (EPI-optimal ≠ accuracy-optimal): The pruning depth that retains maximum accuracy is not the same as the pruning depth that minimizes EPI. Accuracy-only analysis would recommend a shallower prune than the energy-optimal point.
H3 (late layers prune cheaper): Dropping experts from late layers (near the output) saves more energy per unit of accuracy lost than dropping from early layers. Late layers may have more redundant computation.
H4 (diminishing energy returns): Removing the first expert saves more energy than removing the eighth. Energy savings per expert removed decrease as the remaining experts compensate.
Status: Data collection pending. The YOSO-YAi FACTORY and epi-meter board are scheduled to be operational in May 2026. This section will be populated with measured data.
| Experts Dropped | J/Token | MMLU | ARC-C | HSwag | Accuracy | EPI | vs Baseline |
|---|---|---|---|---|---|---|---|
| 0 (baseline) | — | — | — | — | — | — | — |
| 1 | — | — | — | — | — | — | —% |
| 2 | — | — | — | — | — | — | —% |
| 3 | — | — | — | — | — | — | —% |
| 4 | — | — | — | — | — | — | —% |
| 5 | — | — | — | — | — | — | —% |
| 6 | — | — | — | — | — | — | —% |
| 7 | — | — | — | — | — | — | —% |
| 8 | — | — | — | — | — | — | —% |
| Experts Dropped | J/Token | Accuracy | EPI | vs Baseline |
|---|---|---|---|---|
| 1 | — | — | — | —% |
| 2 | — | — | — | —% |
| ... | — | — | — | —% |
| 8 | — | — | — | —% |
| Experts Dropped | J/Token | Accuracy | EPI | vs Baseline |
|---|---|---|---|---|
| 1 | — | — | — | —% |
| 2 | — | — | — | —% |
| ... | — | — | — | —% |
| 8 | — | — | — | —% |
| Configuration | Best EPI | Experts Dropped | Layer Target | EPI Improvement |
|---|---|---|---|---|
| Overall best | — | — | — | —% |
| Best (all layers) | — | — | all | —% |
| Best (early only) | — | — | early | —% |
| Best (late only) | — | — | late | —% |
Pending measurement data.
Planned analyses:
- EPI curve shape — Is it U-shaped as hypothesized? Where is the minimum?
- Accuracy vs. EPI divergence — Plot accuracy-optimal vs. EPI-optimal. Do they differ?
- Layer group comparison — Overlay EPI curves for all/early/late layer targeting
- Power trace analysis — Do pruned models show lower per-node power, shorter inference time, or both?
- Pareto frontier — Plot accuracy vs. J/token for all 25 configurations. Which are Pareto-optimal?
- Energy decomposition — Attribute energy savings to computation vs. memory bandwidth
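The Pareto step could be sketched as follows. The helper name is hypothetical; a configuration is Pareto-optimal if no other configuration has both higher accuracy and lower J/token:

```python
def pareto_optimal(points):
    """points: list of (accuracy, joules_per_token) tuples.
    Returns indices of configurations not dominated by any other point
    (dominated = some other point has >= accuracy and <= J/token,
    with at least one strict inequality)."""
    keep = []
    for i, (a_i, j_i) in enumerate(points):
        dominated = any(
            (a_k >= a_i and j_k <= j_i) and (a_k > a_i or j_k < j_i)
            for k, (a_k, j_k) in enumerate(points) if k != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Toy data: (accuracy, J/token) for three configurations
pts = [(0.70, 2.0), (0.68, 1.5), (0.60, 1.9)]
print(pareto_optimal(pts))  # [0, 1]: the third point is dominated by the second
```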
Pending measurement data.
Expected discussion topics:
- The electrician's perspective: What does the EPI curve look like as a circuit diagram? Each expert is a component drawing power. Removing components reduces load — but the remaining circuit must compensate.
- Practical recommendation: Which pruning depth should a practitioner choose for deployment on resource-constrained ARM hardware?
- MoE routing after pruning: Does the router adapt its distribution after expert removal? Does this affect energy?
- Generalizability: Would a different MoE model (Mixtral, DeepSeek-V3) show the same EPI curve shape?
| Paper | Metric | Hardware | Measures Energy? | Measures EPI? |
|---|---|---|---|---|
| MoE-Pruner [1] | Perplexity, benchmarks | GPU (A100/H100) | No | No |
| NAEE [2] | Perplexity, benchmarks | GPU | No | No |
| EEP [3] | Accuracy/parameter | GPU | No | No |
| This paper | EPI, J/token, accuracy | Pi 5 cluster (ARM) | Yes (epi-meter) | Yes |
The key difference: prior work asks "how much accuracy can we keep?" This paper asks "how much useful intelligence per joule can we get?"
| Component | Repository | Contents |
|---|---|---|
| EPI Framework | energy-per-intelligence | Metric definition, measurement protocol |
| Measurement Board | epi-meter | KiCad schematic, firmware, BOM, build guide |
| Calculation Tooling | epi-bench | EPI calculator, Pareto plotter, power trace tools |
| Raw Data | data/ in this repo | Power traces, benchmarks, EPI results per configuration |
| Surgery Code | code/surgery/ in this repo | Expert identification, pruning scripts |
| Analysis Code | code/analysis/ in this repo | EPI curve analysis, statistical comparisons |
| Visualization | code/visualization/ in this repo | Plot generation scripts |
Challenge the results:
- Replicate the surgery on the same model using the provided scripts
- Deploy to your own hardware (Pi cluster, x86, GPU — any platform)
- Measure power with your own instrument (epi-meter or calibrated AC meter)
- Calculate EPI using epi-bench
- Submit to `data/community/` via pull request
- Open a Discussion with your findings
| Direction | Description |
|---|---|
| Other MoE models | Apply the same matrix to Mixtral-8x7B, DeepSeek-V3, DBRX |
| Combined surgery | Expert pruning + quantization depth interaction on EPI |
| Router retraining | Fine-tune the router after pruning — does it redistribute load and lower EPI? |
| Dynamic expert selection | Vary active expert count (K) and measure EPI — is K=6 better than K=8 on ARM? |
| EPI prediction | Train the DGX to predict EPI from surgery parameters before deploying |
@article{abner2026expertpruningepi,
title = {Dropping MoE Experts and Measuring Energy Per Intelligence:
Where Is the Efficient Operating Point?},
author = {Abner, Francisco},
year = {2026},
url = {https://github.com/Franzabner/expert-pruning-epi},
note = {YOSO-YAi LLC. Data collection in progress.}
}

| # | Reference |
|---|---|
| [1] | MoE-Pruner: Pruning Mixture-of-Experts Large Language Models (2024) |
| [2] | NAEE: N-gram Aware Expert Elimination for MoE Models (2024) |
| [3] | EEP: Expert-level Efficient Pruning for Mixture-of-Experts (2024) |
| [4] | Abner, F. "Energy Per Intelligence: A Metric for Evaluating Model Surgery From the Perspective of an Electrical Engineer." YOSO-YAi LLC, 2026. GitHub |
| [5] | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot (Frantar & Alistarh, 2023) |
| [6] | Qwen Technical Report (Qwen Team, 2024) |
| Content | License |
|---|---|
| Paper (README, figures) | CC BY 4.0 |
| Code (surgery, analysis, visualization) | MIT |
| Data (power traces, benchmarks, EPI results) | CC BY 4.0 |
