YOSO-YAi

Dropping MoE Experts and Measuring Energy Per Intelligence

Where Is the Efficient Operating Point?

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC, New Albany, Ohio



Abstract — Mixture-of-Experts (MoE) models activate only a subset of their parameters per token, making them natural candidates for expert pruning. Existing work evaluates pruning by accuracy retained — perplexity, benchmark scores. None measures the energy cost of the result on production hardware. This paper applies Energy Per Intelligence (EPI) to expert pruning: we systematically drop 1 through N experts from an MoE model, deploy each variant to a Raspberry Pi 5 cluster, measure real power consumption with the epi-meter board, and calculate EPI for every configuration. The result is an EPI curve — a plot of EPI vs. experts removed — that reveals the efficient operating point: the depth of pruning that minimizes the energy cost per unit of useful intelligence. We compare this operating point to the accuracy-only optimal identified by prior work and show that the two do not always coincide.


Table of Contents

  1. Introduction
  2. Background
  3. Research Questions
  4. Experimental Design
  5. Surgery Matrix
  6. Target Model
  7. Methodology
  8. Expected Results
  9. Results
  10. Analysis
  11. Discussion
  12. Comparison to Prior Work
  13. Reproducibility
  14. Future Work
  15. Citation
  16. References
  17. License

1. Introduction

An MoE model is a circuit with redundant paths. Each expert is a sub-network that activates for specific inputs. A model with 128 experts but only 8 active per token has 120 experts idle at any given moment — but they still occupy memory, influence routing, and shape the model's capacity distribution.

Expert pruning removes experts permanently. The pruned model is smaller, loads faster, and — the hypothesis — consumes less energy per token. But does the energy saving justify the accuracy loss? At what point does removing one more expert cost more intelligence than it saves in joules?

Existing expert pruning papers (MoE-Pruner [1], NAEE [2], EEP [3]) answer the accuracy question: how many experts can you drop before the model degrades? None of them answers the energy question: how many experts should you drop to minimize the energy cost per unit of useful intelligence?

This paper answers the energy question. We use EPI — Energy Per Intelligence — as the metric, the epi-meter as the measurement instrument, and a Raspberry Pi 5 cluster as the production hardware target. The result is an EPI curve that identifies the efficient operating point — the pruning depth that minimizes EPI.


2. Background

Mixture-of-Experts Architecture

MoE models use a gating network (router) to select a subset of experts for each token. For a model with E total experts and K active experts per token:

Token → Router → Select K of E experts → Weighted sum of expert outputs

Benefits: parameter efficiency (large total capacity, small active compute). Cost: routing overhead, memory for all experts, load imbalance.
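The top-K routing described above can be sketched in a few lines. This is a minimal illustration, not Qwen3's actual module layout — the router weights, expert callables, and dimensions here are toy placeholders:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=8):
    """Route one token through the top-K experts of an MoE layer.

    x: (d,) token hidden state; router_w: (E, d) router weights;
    experts: list of E callables, each mapping (d,) -> (d,).
    Illustrative only -- real MoE layers batch tokens and experts.
    """
    logits = router_w @ x                     # one routing logit per expert
    top_k = np.argsort(logits)[-k:]           # indices of the K highest logits
    gates = np.exp(logits[top_k])
    gates /= gates.sum()                      # softmax over the selected K only
    # Weighted sum of the K active experts' outputs; E-K experts stay idle
    return sum(g * experts[e](x) for g, e in zip(gates, top_k))

# Toy example: E=4 experts, d=3, K=2 active per token
rng = np.random.default_rng(0)
router_w = rng.normal(size=(4, 3))
experts = [lambda x, W=rng.normal(size=(3, 3)): W @ x for _ in range(4)]
y = moe_forward(rng.normal(size=3), router_w, experts, k=2)
print(y.shape)  # (3,)
```

The `E - K` unselected experts contribute nothing to the output for this token, which is exactly what makes permanent removal plausible.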

Expert Pruning

Expert pruning removes entire expert sub-networks from the model:

| Method | Strategy | Evaluation |
|---|---|---|
| MoE-Pruner [1] | Prune by router frequency (least-used experts) | Perplexity, downstream accuracy |
| NAEE [2] | N-gram aware elimination (prune by token context) | Perplexity, benchmark scores |
| EEP [3] | Expert-level efficiency pruning | Accuracy retention per parameter removed |

Gap: All three measure accuracy. None measures energy. None uses EPI. None tests on ARM hardware.

EPI (Energy Per Intelligence)

EPI = Joules per Token / Task Accuracy

Defined in energy-per-intelligence. Lower EPI = more useful intelligence per joule. Measured with the epi-meter on a Pi 5 cluster.
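The metric is a single division, which a short sketch makes concrete. The numbers below are hypothetical, not measurements (none exist yet for this study); `epi` is an illustrative helper, not the actual epi-bench API:

```python
def epi(joules_per_token: float, accuracy: float) -> float:
    """EPI = Joules per token / task accuracy. Lower is better."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in (0, 1]")
    return joules_per_token / accuracy

# Hypothetical configurations: pruning trades accuracy for energy
baseline = epi(joules_per_token=0.80, accuracy=0.70)   # ~1.14 J per unit
pruned   = epi(joules_per_token=0.60, accuracy=0.65)   # ~0.92 J per unit
print(pruned < baseline)  # True: EPI improved despite the accuracy drop
```

The example shows why EPI can reward a prune that accuracy-only analysis would penalize: the 5-point accuracy loss is outweighed by the 25% energy saving.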


3. Research Questions

| # | Question |
|---|---|
| RQ1 | How does EPI change as experts are removed from an MoE model? |
| RQ2 | Where is the EPI minimum — the efficient operating point? |
| RQ3 | Does the EPI-optimal pruning depth differ from the accuracy-optimal pruning depth? |
| RQ4 | Does the location of the EPI minimum vary across layer groups (early, middle, late)? |
| RQ5 | Does the energy saving from expert removal come from reduced computation, reduced memory bandwidth, or both? |

4. Experimental Design

The EPI Curve

The central output of this paper is the EPI curve: a plot of EPI (Y-axis) vs. number of experts removed (X-axis). The shape reveals the tradeoff:

 EPI
  │
  │   ╲
  │    ╲         ← Accuracy drops faster than energy saves
  │     ╲
  │      ╲___
  │          ╲__ ← Efficient operating point (EPI minimum)
  │             ╲
  │              ╲
  │               ╲  ← Energy still dropping but accuracy
  │                ╲    collapses — EPI rises
  │                 ╲
  └──────────────────────── Experts Removed
  0   1   2   3   4   5   6   7   8

The EPI minimum is the point where removing one more expert costs more intelligence (accuracy) than it saves in energy (joules). This is the efficient operating point — and it is invisible to accuracy-only analysis.
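Once the curve is measured, locating the operating point is an argmin over the configurations. A minimal sketch, with a hypothetical curve shaped like the diagram above (the tuples and values are placeholders, not data):

```python
def efficient_operating_point(curve):
    """curve: list of (experts_removed, joules_per_token, accuracy).
    Returns the experts-removed value that minimizes EPI = J/token / accuracy."""
    return min(curve, key=lambda row: row[1] / row[2])[0]

# Hypothetical U-shaped curve: energy falls steadily, accuracy collapses late
curve = [
    (0, 0.80, 0.70),
    (1, 0.70, 0.69),
    (2, 0.62, 0.67),   # EPI minimum: marginal energy saving still beats loss
    (3, 0.60, 0.60),
    (4, 0.58, 0.50),   # accuracy collapse dominates; EPI rises again
]
print(efficient_operating_point(curve))  # 2
```

Past the minimum, each additional expert removed shaves a few millijoules but costs disproportionate accuracy, so the ratio turns back upward.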

Hardware

| Component | Specification |
|---|---|
| Surgery platform | DGX Spark (GB10, 128GB, 1 PFLOP) |
| Deployment target | Pi 5 cluster (4x 16GB, distributed-llama) |
| Measurement | epi-meter board (RP2350, 4x ATM90E26, CT clamps) |
| Orchestrator | Pi 5 with 10-inch display, SQLite logging |
| Quantization | Q4_K_M (GGUF) via llama.cpp |

Automation

The surgery matrix is executed autonomously by NemoClaw (YOSO-YAi's autonomous systems engineer on the DGX Spark) via n8n workflows:

CEO triggers via Telegram
  → n8n generates surgery matrix
    → For each configuration:
      → NemoClaw performs surgery (mergekit + safetensors)
        → Validate modified model
          → Quantize to GGUF (llama.cpp)
            → Deploy to Pi cluster (rsync)
              → Wait 60s for stabilization
                → Run benchmark suite
                  → Collect epi-meter power trace
                    → Calculate EPI (epi-bench)
                      → Log results
    → Generate summary report
      → Send to CEO via Telegram

5. Surgery Matrix

Each surgery run is one combination of parameters. The full matrix systematically explores expert removal depth and layer targeting.

Matrix Parameters

| Parameter | Values | Count |
|---|---|---|
| Experts dropped | 1, 2, 3, 4, 5, 6, 7, 8 | 8 |
| Layer targeting | all layers, early only (0–25%), late only (75–100%) | 3 |
| Total runs | 8 × 3 | 24 |

Plus 1 baseline (zero experts removed) = 25 total configurations.
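The matrix is a plain Cartesian product plus the baseline, so it can be generated mechanically. A sketch (the dict keys and `surgery_id` naming follow the run matrix's convention; the real n8n workflow may structure this differently):

```python
from itertools import product

depths = range(1, 9)                        # experts dropped: 1..8
layer_targets = ["all", "early", "late"]    # layer-group targeting

# Baseline first, then every (depth, target) combination
matrix = [{"surgery_id": "baseline", "experts_dropped": 0, "layers": None}]
matrix += [
    {"surgery_id": f"drop{d}_{t}", "experts_dropped": d, "layers": t}
    for d, t in product(depths, layer_targets)
]
print(len(matrix))  # 25
```

The iteration order (`drop1_all`, `drop1_early`, `drop1_late`, `drop2_all`, …) matches the run numbering used in the matrix.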

Run Matrix

| Run | Experts Dropped | Target Layers | Surgery ID |
|---|---|---|---|
| 0 | 0 (baseline) | — | baseline |
| 1 | 1 | all | drop1_all |
| 2 | 1 | early | drop1_early |
| 3 | 1 | late | drop1_late |
| 4 | 2 | all | drop2_all |
| 5 | 2 | early | drop2_early |
| 6 | 2 | late | drop2_late |
| 7 | 3 | all | drop3_all |
| ... | ... | ... | ... |
| 22 | 8 | all | drop8_all |
| 23 | 8 | early | drop8_early |
| 24 | 8 | late | drop8_late |

6. Target Model

Primary: Qwen3-30B-A3B

| Parameter | Value |
|---|---|
| Model | Qwen3-30B-A3B |
| Architecture | MoE (Mixture of Experts) |
| Total parameters | 30B |
| Active parameters | 3B per token |
| Total experts | 128 |
| Active experts | 8 per token |
| Router | Top-K gating |
| Quantization | Q4_K_M (GGUF) |
| Inference engine | distributed-llama |
| Deployment | 4x Pi 5 16GB |

Why This Model

  • MoE architecture with a large number of experts (128) — ample room for pruning exploration
  • High expert-to-active ratio (128:8 = 16:1) — most experts are idle per token
  • Fits on Pi cluster after Q4_K_M quantization — production-relevant deployment target
  • Active community — results are immediately relevant to the Qwen ecosystem

7. Methodology

Surgery Procedure

For each run in the matrix:

  1. Load model — Open the Qwen3-30B-A3B safetensors on DGX Spark
  2. Identify targets — Select experts to drop based on router frequency analysis
  3. Perform surgery — Remove expert weight tensors using mergekit/safetensors
  4. Validate — Load modified model, run sanity prompts, verify coherent output
  5. Quantize — Convert to Q4_K_M GGUF using llama.cpp
  6. Deploy — rsync GGUF shards to Pi cluster
  7. Stabilize — Wait 60 seconds for thermal equilibrium
  8. Benchmark — Run MMLU (5-shot), ARC-Challenge (25-shot), HellaSwag (10-shot)
  9. Measure — Capture epi-meter power trace for entire benchmark duration
  10. Calculate — Compute EPI using epi-bench
  11. Log — Store all results with full metadata

Pruning Strategy

Experts are ranked by router frequency — the fraction of tokens for which each expert is selected. Least-frequently-used experts are dropped first. This follows the approach of MoE-Pruner [1] to enable direct comparison.
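The ranking step can be sketched as a frequency count over a calibration corpus. The input format here (a flat log of selected expert indices) is a simplification; the real pipeline would read router activations from the model during calibration inference:

```python
from collections import Counter

def least_used_experts(routing_log, n_drop, num_experts=128):
    """Rank experts by router selection frequency; return the n_drop
    least-used ones, which are dropped first (MoE-Pruner's strategy).

    routing_log: iterable of expert indices, one entry per
    token-expert activation across the calibration corpus.
    """
    counts = Counter({e: 0 for e in range(num_experts)})  # include never-fired experts
    counts.update(routing_log)
    # Sort ascending by frequency; ties broken by expert index
    ranked = sorted(counts, key=lambda e: (counts[e], e))
    return ranked[:n_drop]

# Toy log over 4 experts: expert 3 never fires, expert 0 fires once
log = [1, 2, 1, 2, 2, 0, 1, 1]
print(least_used_experts(log, n_drop=2, num_experts=4))  # [3, 0]
```

Seeding the counter with every expert index matters: an expert the router never selects would otherwise be missing from the count entirely, yet it is precisely the strongest pruning candidate.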

Benchmarks

| Benchmark | Shots | Measures |
|---|---|---|
| MMLU | 5 | Broad knowledge (57 domains) |
| ARC-Challenge | 25 | Reasoning |
| HellaSwag | 10 | Commonsense inference |

Composite accuracy: A = (MMLU + ARC-C + HellaSwag) / 3
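As a worked instance of the composite (scores here are hypothetical fractions, not results):

```python
def composite_accuracy(mmlu, arc_c, hellaswag):
    """Unweighted mean of the three benchmark accuracies, each in [0, 1]."""
    return (mmlu + arc_c + hellaswag) / 3

a = composite_accuracy(0.70, 0.60, 0.80)  # hypothetical per-benchmark scores
print(round(a, 3))  # 0.7
```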

Measurement

  • Instrument: epi-meter board (4x ATM90E26, CT clamps on each Pi node's AC inlet)
  • Measurement point: AC side — captures total system draw including PSU losses
  • Sampling: 1 Hz JSON telemetry to Orchestrator SQLite
  • Repetitions: 3 per configuration, median reported
  • Environmental: Fixed CPU frequency governor (performance), ambient temperature logged

8. Expected Results

Hypotheses to be validated or refuted by measurement.

H1: EPI Curve Is U-Shaped

Removing experts initially improves EPI (energy drops faster than accuracy). Past a critical point, accuracy collapses while energy savings plateau — EPI rises. The curve has a minimum.

H2: Accuracy-Optimal ≠ EPI-Optimal

The pruning depth that retains maximum accuracy is not the same as the pruning depth that minimizes EPI. Accuracy-only analysis would recommend a shallower prune than the energy-optimal point.

H3: Late-Layer Pruning Is More Energy-Efficient

Dropping experts from late layers (near the output) saves more energy per unit of accuracy lost than dropping from early layers. Late layers may have more redundant computation.

H4: Energy Savings Are Sub-Linear

Removing the first expert saves more energy than removing the eighth. Energy savings per expert removed decrease as the remaining experts compensate.


9. Results

Status: Data collection pending. The YOSO-YAi FACTORY and epi-meter board are scheduled to be operational in May 2026. This section will be populated with measured data.

EPI Curve (All Layers)

| Experts Dropped | J/Token | MMLU | ARC-C | HSwag | Accuracy | EPI | vs Baseline |
|---|---|---|---|---|---|---|---|
| 0 (baseline) | | | | | | | |
| 1 | | | | | | | —% |
| 2 | | | | | | | —% |
| 3 | | | | | | | —% |
| 4 | | | | | | | —% |
| 5 | | | | | | | —% |
| 6 | | | | | | | —% |
| 7 | | | | | | | —% |
| 8 | | | | | | | —% |

EPI Curve (Early Layers Only)

| Experts Dropped | J/Token | Accuracy | EPI | vs Baseline |
|---|---|---|---|---|
| 1 | | | | —% |
| 2 | | | | —% |
| ... | | | | —% |
| 8 | | | | —% |

EPI Curve (Late Layers Only)

| Experts Dropped | J/Token | Accuracy | EPI | vs Baseline |
|---|---|---|---|---|
| 1 | | | | —% |
| 2 | | | | —% |
| ... | | | | —% |
| 8 | | | | —% |

Summary

| Configuration | Best EPI | Experts Dropped | Layer Target | EPI Improvement |
|---|---|---|---|---|
| Overall best | | | | —% |
| Best (all layers) | | | all | —% |
| Best (early only) | | | early | —% |
| Best (late only) | | | late | —% |

10. Analysis

Pending measurement data.

Planned analyses:

  1. EPI curve shape — Is it U-shaped as hypothesized? Where is the minimum?
  2. Accuracy vs. EPI divergence — Plot accuracy-optimal vs. EPI-optimal. Do they differ?
  3. Layer group comparison — Overlay EPI curves for all/early/late layer targeting
  4. Power trace analysis — Do pruned models show lower per-node power, shorter inference time, or both?
  5. Pareto frontier — Plot accuracy vs. J/token for all 25 configurations. Which are Pareto-optimal?
  6. Energy decomposition — Attribute energy savings to computation vs. memory bandwidth
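The Pareto analysis in item 5 can be sketched directly: a configuration is Pareto-optimal if no other configuration is at least as accurate and at least as cheap in energy, with one strict improvement. The points below are hypothetical placeholders, not results:

```python
def pareto_frontier(points):
    """points: list of (accuracy, joules_per_token) per configuration.
    Returns the subset not dominated by any other point."""
    frontier = []
    for i, (acc, joules) in enumerate(points):
        dominated = any(
            (a2 >= acc and j2 <= joules) and (a2 > acc or j2 < joules)
            for k, (a2, j2) in enumerate(points) if k != i
        )
        if not dominated:
            frontier.append((acc, joules))
    return frontier

# Hypothetical (accuracy, J/token) pairs for five configurations
pts = [(0.70, 0.80), (0.69, 0.70), (0.67, 0.62), (0.50, 0.60), (0.66, 0.75)]
print(pareto_frontier(pts))
```

Here `(0.66, 0.75)` falls off the frontier because `(0.67, 0.62)` beats it on both axes; the EPI minimum always lies somewhere on this frontier, but the frontier alone does not say where.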

11. Discussion

Pending measurement data.

Expected discussion topics:

  • The electrician's perspective: What does the EPI curve look like as a circuit diagram? Each expert is a component drawing power. Removing components reduces load — but the remaining circuit must compensate.
  • Practical recommendation: Which pruning depth should a practitioner choose for deployment on resource-constrained ARM hardware?
  • MoE routing after pruning: Does the router adapt its distribution after expert removal? Does this affect energy?
  • Generalizability: Would a different MoE model (Mixtral, DeepSeek-V3) show the same EPI curve shape?

12. Comparison to Prior Work

| Paper | Metric | Hardware | Measures Energy? | Measures EPI? |
|---|---|---|---|---|
| MoE-Pruner [1] | Perplexity, benchmarks | GPU (A100/H100) | No | No |
| NAEE [2] | Perplexity, benchmarks | GPU | No | No |
| EEP [3] | Accuracy/parameter | GPU | No | No |
| This paper | EPI, J/token, accuracy | Pi 5 cluster (ARM) | Yes (epi-meter) | Yes |

The key difference: prior work asks "how much accuracy can we keep?" This paper asks "how much useful intelligence per joule can we get?"


13. Reproducibility

| Component | Repository | Contents |
|---|---|---|
| EPI Framework | energy-per-intelligence | Metric definition, measurement protocol |
| Measurement Board | epi-meter | KiCad schematic, firmware, BOM, build guide |
| Calculation Tooling | epi-bench | EPI calculator, Pareto plotter, power trace tools |
| Raw Data | data/ in this repo | Power traces, benchmarks, EPI results per configuration |
| Surgery Code | code/surgery/ in this repo | Expert identification, pruning scripts |
| Analysis Code | code/analysis/ in this repo | EPI curve analysis, statistical comparisons |
| Visualization | code/visualization/ in this repo | Plot generation scripts |

Community Replication

Challenge the results:

  1. Replicate the surgery on the same model using the provided scripts
  2. Deploy to your own hardware (Pi cluster, x86, GPU — any platform)
  3. Measure power with your own instrument (epi-meter or calibrated AC meter)
  4. Calculate EPI using epi-bench
  5. Submit to data/community/ via pull request
  6. Open a Discussion with your findings

14. Future Work

| Direction | Description |
|---|---|
| Other MoE models | Apply the same matrix to Mixtral-8x7B, DeepSeek-V3, DBRX |
| Combined surgery | Expert pruning + quantization depth interaction on EPI |
| Router retraining | Fine-tune the router after pruning — does it redistribute load and lower EPI? |
| Dynamic expert selection | Vary active expert count (K) and measure EPI — is K=6 better than K=8 on ARM? |
| EPI prediction | Train the DGX to predict EPI from surgery parameters before deploying |

15. Citation

@article{abner2026expertpruningepi,
  title   = {Dropping MoE Experts and Measuring Energy Per Intelligence:
             Where Is the Efficient Operating Point?},
  author  = {Abner, Francisco},
  year    = {2026},
  url     = {https://github.com/Franzabner/expert-pruning-epi},
  note    = {YOSO-YAi LLC. Data collection in progress.}
}

16. References

| # | Reference |
|---|---|
| [1] | MoE-Pruner: Pruning Mixture-of-Experts Large Language Models (2024) |
| [2] | NAEE: N-gram Aware Expert Elimination for MoE Models (2024) |
| [3] | EEP: Expert-level Efficient Pruning for Mixture-of-Experts (2024) |
| [4] | Abner, F. "Energy Per Intelligence: A Metric for Evaluating Model Surgery From the Perspective of an Electrical Engineer." YOSO-YAi LLC, 2026. |
| [5] | Frantar, E. & Alistarh, D. "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." 2023. |
| [6] | Qwen Team. "Qwen Technical Report." 2024. |

17. License

| Content | License |
|---|---|
| Paper (README, figures) | CC BY 4.0 |
| Code (surgery, analysis, visualization) | MIT |
| Data (power traces, benchmarks, EPI results) | CC BY 4.0 |

The future of labor is compute. Measured in energy per intelligence.

Where is the efficient operating point? Measure, don't guess.

YOSO-YAi

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC
