YOSO-YAi

Dropping MoE Experts and Measuring Energy Per Intelligence

Where Is the Efficient Operating Point?

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC, New Albany, Ohio



Abstract — Mixture-of-Experts (MoE) models activate only a subset of their parameters per token, making them natural candidates for expert pruning. Existing work evaluates pruning by accuracy retained — perplexity, benchmark scores. None measures the energy cost of the result on production hardware. This paper applies Energy Per Intelligence (EPI) to expert pruning: we systematically drop 1 through N experts from an MoE model, deploy each variant to a Raspberry Pi 5 cluster, measure real power consumption with the epi-meter board, and calculate EPI for every configuration. The result is an EPI curve — a plot of EPI vs. experts removed — that reveals the efficient operating point: the depth of pruning that minimizes the energy cost per unit of useful intelligence. We compare this operating point to the accuracy-only optimal identified by prior work and show that the two do not always coincide.


Table of Contents

  1. Introduction
  2. Background
  3. Research Questions
  4. Experimental Design
  5. Surgery Matrix
  6. Target Model
  7. Methodology
  8. Expected Results
  9. Results
  10. Analysis
  11. Discussion
  12. Comparison to Prior Work
  13. Reproducibility
  14. Future Work
  15. Citation
  16. References
  17. License

1. Introduction

An MoE model is a circuit with redundant paths. Each expert is a sub-network that activates for specific inputs. A model with 128 experts but only 8 active per token has 120 experts idle at any given moment — but they still occupy memory, influence routing, and shape the model's capacity distribution.

Expert pruning removes experts permanently. The pruned model is smaller, loads faster, and — the hypothesis — consumes less energy per token. But does the energy saving justify the accuracy loss? At what point does removing one more expert cost more intelligence than it saves in joules?

Existing expert pruning papers (MoE-Pruner [1], NAEE [2], EEP [3]) answer the accuracy question: how many experts can you drop before the model degrades? None of them answers the energy question: how many experts should you drop to minimize the energy cost per unit of useful intelligence?

This paper answers the energy question. We use EPI — Energy Per Intelligence — as the metric, the epi-meter as the measurement instrument, and a Raspberry Pi 5 cluster as the production hardware target. The result is an EPI curve that identifies the efficient operating point — the pruning depth that minimizes EPI.


2. Background

Mixture-of-Experts Architecture

MoE models use a gating network (router) to select a subset of experts for each token. For a model with E total experts and K active experts per token:

Token → Router → Select K of E experts → Weighted sum of expert outputs

Benefits: parameter efficiency (large total capacity, small active compute). Cost: routing overhead, memory for all experts, load imbalance.
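The top-K routing described above can be sketched in a few lines. This is a minimal illustration, not Qwen3's actual module layout — the router weights, expert callables, and dimensions here are toy placeholders:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=8):
    """Route one token through the top-K experts of an MoE layer.

    x: (d,) token hidden state; router_w: (E, d) router weights;
    experts: list of E callables, each mapping (d,) -> (d,).
    Illustrative only -- real MoE layers batch tokens and experts.
    """
    logits = router_w @ x                     # one routing logit per expert
    top_k = np.argsort(logits)[-k:]           # indices of the K highest logits
    gates = np.exp(logits[top_k])
    gates /= gates.sum()                      # softmax over the selected K only
    # Weighted sum of the K active experts' outputs; E-K experts stay idle
    return sum(g * experts[e](x) for g, e in zip(gates, top_k))

# Toy example: E=4 experts, d=3, K=2 active per token
rng = np.random.default_rng(0)
router_w = rng.normal(size=(4, 3))
experts = [lambda x, W=rng.normal(size=(3, 3)): W @ x for _ in range(4)]
y = moe_forward(rng.normal(size=3), router_w, experts, k=2)
print(y.shape)  # (3,)
```

The `E - K` unselected experts contribute nothing to the output for this token, which is exactly what makes permanent removal plausible.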

Expert Pruning

Expert pruning removes entire expert sub-networks from the model:

| Method | Strategy | Evaluation |
|---|---|---|
| MoE-Pruner [1] | Prune by router frequency (least-used experts) | Perplexity, downstream accuracy |
| NAEE [2] | N-gram aware elimination (prune by token context) | Perplexity, benchmark scores |
| EEP [3] | Expert-level efficiency pruning | Accuracy retention per parameter removed |

Gap: All three measure accuracy. None measures energy. None uses EPI. None tests on ARM hardware.

EPI (Energy Per Intelligence)

EPI = Joules per Token / Task Accuracy

Defined in energy-per-intelligence. Lower EPI = more useful intelligence per joule. Measured with the epi-meter on a Pi 5 cluster.
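The metric is a single division, which a short sketch makes concrete. The numbers below are hypothetical, not measurements (none exist yet for this study); `epi` is an illustrative helper, not the actual epi-bench API:

```python
def epi(joules_per_token: float, accuracy: float) -> float:
    """EPI = Joules per token / task accuracy. Lower is better."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in (0, 1]")
    return joules_per_token / accuracy

# Hypothetical configurations: pruning trades accuracy for energy
baseline = epi(joules_per_token=0.80, accuracy=0.70)   # ~1.14 J per unit
pruned   = epi(joules_per_token=0.60, accuracy=0.65)   # ~0.92 J per unit
print(pruned < baseline)  # True: EPI improved despite the accuracy drop
```

The example shows why EPI can reward a prune that accuracy-only analysis would penalize: the 5-point accuracy loss is outweighed by the 25% energy saving.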


3. Research Questions

| # | Question |
|---|---|
| RQ1 | How does EPI change as experts are removed from an MoE model? |
| RQ2 | Where is the EPI minimum — the efficient operating point? |
| RQ3 | Does the EPI-optimal pruning depth differ from the accuracy-optimal pruning depth? |
| RQ4 | Does the location of the EPI minimum vary across layer groups (early, middle, late)? |
| RQ5 | Does the energy saving from expert removal come from reduced computation, reduced memory bandwidth, or both? |

4. Experimental Design

The EPI Curve

The central output of this paper is the EPI curve: a plot of EPI (Y-axis) vs. number of experts removed (X-axis). The shape reveals the tradeoff:

 EPI
  │
  │   ╲
  │    ╲         ← Accuracy drops faster than energy saves
  │     ╲
  │      ╲___
  │          ╲__ ← Efficient operating point (EPI minimum)
  │             ╲
  │              ╲
  │               ╲  ← Energy still dropping but accuracy
  │                ╲    collapses — EPI rises
  │                 ╲
  └──────────────────────── Experts Removed
  0   1   2   3   4   5   6   7   8

The EPI minimum is the point where removing one more expert costs more intelligence (accuracy) than it saves in energy (joules). This is the efficient operating point — and it is invisible to accuracy-only analysis.
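Once the curve is measured, locating the operating point is an argmin over the configurations. A minimal sketch, with a hypothetical curve shaped like the diagram above (the tuples and values are placeholders, not data):

```python
def efficient_operating_point(curve):
    """curve: list of (experts_removed, joules_per_token, accuracy).
    Returns the experts-removed value that minimizes EPI = J/token / accuracy."""
    return min(curve, key=lambda row: row[1] / row[2])[0]

# Hypothetical U-shaped curve: energy falls steadily, accuracy collapses late
curve = [
    (0, 0.80, 0.70),
    (1, 0.70, 0.69),
    (2, 0.62, 0.67),   # EPI minimum: marginal energy saving still beats loss
    (3, 0.60, 0.60),
    (4, 0.58, 0.50),   # accuracy collapse dominates; EPI rises again
]
print(efficient_operating_point(curve))  # 2
```

Past the minimum, each additional expert removed shaves a few millijoules but costs disproportionate accuracy, so the ratio turns back upward.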

Hardware

| Component | Specification |
|---|---|
| Surgery platform | DGX Spark (GB10, 128GB, 1 PFLOP) |
| Deployment target | Pi 5 cluster (4x 16GB, distributed-llama) |
| Measurement | epi-meter board (RP2350, 4x ATM90E26, CT clamps) |
| Orchestrator | Pi 5 with 10-inch display, SQLite logging |
| Quantization | Q4_K_M (GGUF) via llama.cpp |

Automation

The surgery matrix is executed autonomously by NemoClaw (YOSO-YAi's autonomous systems engineer on the DGX Spark) via n8n workflows:

CEO triggers via Telegram
  → n8n generates surgery matrix
    → For each configuration:
      → NemoClaw performs surgery (mergekit + safetensors)
        → Validate modified model
          → Quantize to GGUF (llama.cpp)
            → Deploy to Pi cluster (rsync)
              → Wait 60s for stabilization
                → Run benchmark suite
                  → Collect epi-meter power trace
                    → Calculate EPI (epi-bench)
                      → Log results
    → Generate summary report
      → Send to CEO via Telegram

5. Surgery Matrix

Each surgery run is one combination of parameters. The full matrix systematically explores expert removal depth and layer targeting.

Matrix Parameters

| Parameter | Values | Count |
|---|---|---|
| Experts dropped | 1, 2, 3, 4, 5, 6, 7, 8 | 8 |
| Layer targeting | all layers, early only (0–25%), late only (75–100%) | 3 |
| Total runs | 8 × 3 | 24 |

Plus 1 baseline (zero experts removed) = 25 total configurations.
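The matrix is a plain Cartesian product plus the baseline, so it can be generated mechanically. A sketch (the dict keys and `surgery_id` naming follow the run matrix's convention; the real n8n workflow may structure this differently):

```python
from itertools import product

depths = range(1, 9)                        # experts dropped: 1..8
layer_targets = ["all", "early", "late"]    # layer-group targeting

# Baseline first, then every (depth, target) combination
matrix = [{"surgery_id": "baseline", "experts_dropped": 0, "layers": None}]
matrix += [
    {"surgery_id": f"drop{d}_{t}", "experts_dropped": d, "layers": t}
    for d, t in product(depths, layer_targets)
]
print(len(matrix))  # 25
```

The iteration order (`drop1_all`, `drop1_early`, `drop1_late`, `drop2_all`, …) matches the run numbering used in the matrix.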

Run Matrix

| Run | Experts Dropped | Target Layers | Surgery ID |
|---|---|---|---|
| 0 | 0 (baseline) | — | baseline |
| 1 | 1 | all | drop1_all |
| 2 | 1 | early | drop1_early |
| 3 | 1 | late | drop1_late |
| 4 | 2 | all | drop2_all |
| 5 | 2 | early | drop2_early |
| 6 | 2 | late | drop2_late |
| 7 | 3 | all | drop3_all |
| ... | ... | ... | ... |
| 22 | 8 | all | drop8_all |
| 23 | 8 | early | drop8_early |
| 24 | 8 | late | drop8_late |

6. Target Model

Primary: Qwen3-30B-A3B

| Parameter | Value |
|---|---|
| Model | Qwen3-30B-A3B |
| Architecture | MoE (Mixture of Experts) |
| Total parameters | 30B |
| Active parameters | 3B per token |
| Total experts | 128 |
| Active experts | 8 per token |
| Router | Top-K gating |
| Quantization | Q4_K_M (GGUF) |
| Inference engine | distributed-llama |
| Deployment | 4x Pi 5 16GB |

Why This Model

  • MoE architecture with a large number of experts (128) — ample room for pruning exploration
  • High expert-to-active ratio (128:8 = 16:1) — most experts are idle per token
  • Fits on Pi cluster after Q4_K_M quantization — production-relevant deployment target
  • Active community — results are immediately relevant to the Qwen ecosystem

7. Methodology

Surgery Procedure

For each run in the matrix:

  1. Load model — Open the Qwen3-30B-A3B safetensors on DGX Spark
  2. Identify targets — Select experts to drop based on router frequency analysis
  3. Perform surgery — Remove expert weight tensors using mergekit/safetensors
  4. Validate — Load modified model, run sanity prompts, verify coherent output
  5. Quantize — Convert to Q4_K_M GGUF using llama.cpp
  6. Deploy — rsync GGUF shards to Pi cluster
  7. Stabilize — Wait 60 seconds for thermal equilibrium
  8. Benchmark — Run MMLU (5-shot), ARC-Challenge (25-shot), HellaSwag (10-shot)
  9. Measure — Capture epi-meter power trace for entire benchmark duration
  10. Calculate — Compute EPI using epi-bench
  11. Log — Store all results with full metadata

Pruning Strategy

Experts are ranked by router frequency — the fraction of tokens for which each expert is selected. Least-frequently-used experts are dropped first. This follows the approach of MoE-Pruner [1] to enable direct comparison.
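The ranking step can be sketched as a frequency count over a calibration corpus. The input format here (a flat log of selected expert indices) is a simplification; the real pipeline would read router activations from the model during calibration inference:

```python
from collections import Counter

def least_used_experts(routing_log, n_drop, num_experts=128):
    """Rank experts by router selection frequency; return the n_drop
    least-used ones, which are dropped first (MoE-Pruner's strategy).

    routing_log: iterable of expert indices, one entry per
    token-expert activation across the calibration corpus.
    """
    counts = Counter({e: 0 for e in range(num_experts)})  # include never-fired experts
    counts.update(routing_log)
    # Sort ascending by frequency; ties broken by expert index
    ranked = sorted(counts, key=lambda e: (counts[e], e))
    return ranked[:n_drop]

# Toy log over 4 experts: expert 3 never fires, expert 0 fires once
log = [1, 2, 1, 2, 2, 0, 1, 1]
print(least_used_experts(log, n_drop=2, num_experts=4))  # [3, 0]
```

Seeding the counter with every expert index matters: an expert the router never selects would otherwise be missing from the count entirely, yet it is precisely the strongest pruning candidate.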

Benchmarks

| Benchmark | Shots | Measures |
|---|---|---|
| MMLU | 5 | Broad knowledge (57 domains) |
| ARC-Challenge | 25 | Reasoning |
| HellaSwag | 10 | Commonsense inference |

Composite accuracy: A = (MMLU + ARC-C + HellaSwag) / 3
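As a worked instance of the composite (scores here are hypothetical fractions, not results):

```python
def composite_accuracy(mmlu, arc_c, hellaswag):
    """Unweighted mean of the three benchmark accuracies, each in [0, 1]."""
    return (mmlu + arc_c + hellaswag) / 3

a = composite_accuracy(0.70, 0.60, 0.80)  # hypothetical per-benchmark scores
print(round(a, 3))  # 0.7
```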

Measurement

  • Instrument: epi-meter board (4x ATM90E26, CT clamps on each Pi node's AC inlet)
  • Measurement point: AC side — captures total system draw including PSU losses
  • Sampling: 1 Hz JSON telemetry to Orchestrator SQLite
  • Repetitions: 3 per configuration, median reported
  • Environmental: Fixed CPU frequency governor (performance), ambient temperature logged

8. Expected Results

Hypotheses to be validated or refuted by measurement.

H1: EPI Curve Is U-Shaped

Removing experts initially improves EPI (energy drops faster than accuracy). Past a critical point, accuracy collapses while energy savings plateau — EPI rises. The curve has a minimum.

H2: Accuracy-Optimal ≠ EPI-Optimal

The pruning depth that retains maximum accuracy is not the same as the pruning depth that minimizes EPI. Accuracy-only analysis would recommend a shallower prune than the energy-optimal point.

H3: Late-Layer Pruning Is More Energy-Efficient

Dropping experts from late layers (near the output) saves more energy per unit of accuracy lost than dropping from early layers. Late layers may have more redundant computation.

H4: Energy Savings Are Sub-Linear

Removing the first expert saves more energy than removing the eighth. Energy savings per expert removed decrease as the remaining experts compensate.


9. Results

Status: Data collection pending. The YOSO-YAi FACTORY and epi-meter board are scheduled to be operational in May 2026. This section will be populated with measured data.

EPI Curve (All Layers)

| Experts Dropped | J/Token | MMLU | ARC-C | HSwag | Accuracy | EPI | vs Baseline |
|---|---|---|---|---|---|---|---|
| 0 (baseline) | | | | | | | |
| 1 | | | | | | | —% |
| 2 | | | | | | | —% |
| 3 | | | | | | | —% |
| 4 | | | | | | | —% |
| 5 | | | | | | | —% |
| 6 | | | | | | | —% |
| 7 | | | | | | | —% |
| 8 | | | | | | | —% |

EPI Curve (Early Layers Only)

| Experts Dropped | J/Token | Accuracy | EPI | vs Baseline |
|---|---|---|---|---|
| 1 | | | | —% |
| 2 | | | | —% |
| ... | | | | —% |
| 8 | | | | —% |

EPI Curve (Late Layers Only)

| Experts Dropped | J/Token | Accuracy | EPI | vs Baseline |
|---|---|---|---|---|
| 1 | | | | —% |
| 2 | | | | —% |
| ... | | | | —% |
| 8 | | | | —% |

Summary

| Configuration | Best EPI | Experts Dropped | Layer Target | EPI Improvement |
|---|---|---|---|---|
| Overall best | | | | —% |
| Best (all layers) | | | all | —% |
| Best (early only) | | | early | —% |
| Best (late only) | | | late | —% |

10. Analysis

Pending measurement data.

Planned analyses:

  1. EPI curve shape — Is it U-shaped as hypothesized? Where is the minimum?
  2. Accuracy vs. EPI divergence — Plot accuracy-optimal vs. EPI-optimal. Do they differ?
  3. Layer group comparison — Overlay EPI curves for all/early/late layer targeting
  4. Power trace analysis — Do pruned models show lower per-node power, shorter inference time, or both?
  5. Pareto frontier — Plot accuracy vs. J/token for all 25 configurations. Which are Pareto-optimal?
  6. Energy decomposition — Attribute energy savings to computation vs. memory bandwidth
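The Pareto analysis in item 5 can be sketched directly: a configuration is Pareto-optimal if no other configuration is at least as accurate and at least as cheap in energy, with one strict improvement. The points below are hypothetical placeholders, not results:

```python
def pareto_frontier(points):
    """points: list of (accuracy, joules_per_token) per configuration.
    Returns the subset not dominated by any other point."""
    frontier = []
    for i, (acc, joules) in enumerate(points):
        dominated = any(
            (a2 >= acc and j2 <= joules) and (a2 > acc or j2 < joules)
            for k, (a2, j2) in enumerate(points) if k != i
        )
        if not dominated:
            frontier.append((acc, joules))
    return frontier

# Hypothetical (accuracy, J/token) pairs for five configurations
pts = [(0.70, 0.80), (0.69, 0.70), (0.67, 0.62), (0.50, 0.60), (0.66, 0.75)]
print(pareto_frontier(pts))
```

Here `(0.66, 0.75)` falls off the frontier because `(0.67, 0.62)` beats it on both axes; the EPI minimum always lies somewhere on this frontier, but the frontier alone does not say where.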

11. Discussion

Pending measurement data.

Expected discussion topics:

  • The electrician's perspective: What does the EPI curve look like as a circuit diagram? Each expert is a component drawing power. Removing components reduces load — but the remaining circuit must compensate.
  • Practical recommendation: Which pruning depth should a practitioner choose for deployment on resource-constrained ARM hardware?
  • MoE routing after pruning: Does the router adapt its distribution after expert removal? Does this affect energy?
  • Generalizability: Would a different MoE model (Mixtral, DeepSeek-V3) show the same EPI curve shape?

12. Comparison to Prior Work

| Paper | Metric | Hardware | Measures Energy? | Measures EPI? |
|---|---|---|---|---|
| MoE-Pruner [1] | Perplexity, benchmarks | GPU (A100/H100) | No | No |
| NAEE [2] | Perplexity, benchmarks | GPU | No | No |
| EEP [3] | Accuracy/parameter | GPU | No | No |
| This paper | EPI, J/token, accuracy | Pi 5 cluster (ARM) | Yes (epi-meter) | Yes |

The key difference: prior work asks "how much accuracy can we keep?" This paper asks "how much useful intelligence per joule can we get?"


13. Reproducibility

| Component | Repository | Contents |
|---|---|---|
| EPI Framework | energy-per-intelligence | Metric definition, measurement protocol |
| Measurement Board | epi-meter | KiCad schematic, firmware, BOM, build guide |
| Calculation Tooling | epi-bench | EPI calculator, Pareto plotter, power trace tools |
| Raw Data | data/ in this repo | Power traces, benchmarks, EPI results per configuration |
| Surgery Code | code/surgery/ in this repo | Expert identification, pruning scripts |
| Analysis Code | code/analysis/ in this repo | EPI curve analysis, statistical comparisons |
| Visualization | code/visualization/ in this repo | Plot generation scripts |

Community Replication

Challenge the results:

  1. Replicate the surgery on the same model using the provided scripts
  2. Deploy to your own hardware (Pi cluster, x86, GPU — any platform)
  3. Measure power with your own instrument (epi-meter or calibrated AC meter)
  4. Calculate EPI using epi-bench
  5. Submit to data/community/ via pull request
  6. Open a Discussion with your findings

14. Future Work

| Direction | Description |
|---|---|
| Other MoE models | Apply the same matrix to Mixtral-8x7B, DeepSeek-V3, DBRX |
| Combined surgery | Expert pruning + quantization depth interaction on EPI |
| Router retraining | Fine-tune the router after pruning — does it redistribute load and lower EPI? |
| Dynamic expert selection | Vary active expert count (K) and measure EPI — is K=6 better than K=8 on ARM? |
| EPI prediction | Train the DGX to predict EPI from surgery parameters before deploying |

15. Citation

@article{abner2026expertpruningepi,
  title   = {Dropping MoE Experts and Measuring Energy Per Intelligence:
             Where Is the Efficient Operating Point?},
  author  = {Abner, Francisco},
  year    = {2026},
  url     = {https://github.com/Franzabner/expert-pruning-epi},
  note    = {YOSO-YAi LLC. Data collection in progress.}
}

16. References

| # | Reference |
|---|---|
| [1] | MoE-Pruner: Pruning Mixture-of-Experts Large Language Models (2024) |
| [2] | NAEE: N-gram Aware Expert Elimination for MoE Models (2024) |
| [3] | EEP: Expert-level Efficient Pruning for Mixture-of-Experts (2024) |
| [4] | Abner, F. "Energy Per Intelligence: A Metric for Evaluating Model Surgery From the Perspective of an Electrical Engineer." YOSO-YAi LLC, 2026. |
| [5] | Frantar, E. & Alistarh, D. "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." 2023. |
| [6] | Qwen Team. "Qwen Technical Report." 2024. |

17. License

| Content | License |
|---|---|
| Paper (README, figures) | CC BY 4.0 |
| Code (surgery, analysis, visualization) | MIT |
| Data (power traces, benchmarks, EPI results) | CC BY 4.0 |

The future of labor is compute. Measured in energy per intelligence.

Where is the efficient operating point? Measure, don't guess.

YOSO-YAi

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC
