Inference optimization for transformer models: two independent techniques that can be used separately or stacked together.
Tested on Qwen2.5-14B running on Apple Silicon via MLX.
```
forge-edge-repo/
├── README.md
├── requirements.txt
├── infer_forge_edge_stacked.py       # Run Stage 1 + Stage 2A together (benchmark)
├── chat_forge_edge_v5.py             # Interactive chat with both stages active
│
├── word_router/                      # Stage 1 — Surprisal Router
│   ├── train_router_8b.py            # Train the router MLP on cached hidden states
│   ├── forge_edge_poc_v2.py          # Proof-of-concept / earlier training variant
│   ├── infer_forge_edge_qwen14b.py   # Inference engine with router gating
│   └── benchmark_4.py                # Per-token benchmark with router decision logging
│
└── ablation/                         # Stage 2A — MLP Lobotomy
    ├── profile_ffn_sparsity.py       # Analyse FFN activation sparsity per layer
    ├── benchmark_mlp_ablation_mlx.py # Benchmark ablated model vs baseline
    ├── lobotomy_qwen2.5_14b_instruct_bf16.json # Layer sensitivity ranking (14B bf16)
    ├── ablation_benchmark_mlx.json   # Benchmark results
    └── ablation_mlx_v2.json          # v2 benchmark results
```
Transformer token usage follows a Zipfian distribution: roughly 20% of the vocabulary accounts for ~95% of all generated tokens. The router exploits this.
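As a rough sanity check on that claim: even an ideal Zipf distribution with exponent 1 over a 32K vocabulary puts ~85% of probability mass in the top 20% of ranks, and real model output is typically more skewed than that. A quick sketch (pure Python, illustrative numbers only):

```python
# Probability mass covered by the top 20% of ranks under an ideal
# Zipf distribution, p(r) ∝ 1/r, over a 32K-token vocabulary.
VOCAB = 32_768

weights = [1.0 / r for r in range(1, VOCAB + 1)]
total = sum(weights)

top_20pct = int(0.2 * VOCAB)  # ~6.5K highest-frequency tokens
coverage = sum(weights[:top_20pct]) / total

print(f"top 20% of vocab covers {coverage:.1%} of token mass")
```

Generated text concentrates even harder on common tokens, which is why the common head can answer most steps.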
At each generation step, a small MLP (1.6M params) reads the hidden state and predicts whether the next token will be rare. If not, the model skips the expensive full-vocabulary matmul (hidden × W_vocab, 32K tokens) and uses a pre-sliced common head covering only ~6,400 tokens instead — a ~5× cheaper operation.
```
hidden state
     │
     ├──► Router MLP ──► P(rare) > threshold?
     │                          │
     │              Yes         │          No
     │        ┌─────────────────┴──────────────┐
     │        ▼                                ▼
     │   Full head (32K)              Common head (~6.4K)
     │   expensive                    ~5× cheaper
     └──────────────────────────────────────────
```
Result: ~20% latency reduction with <0.5% quality loss.
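The gating step can be sketched in a few lines of NumPy (shapes, router sizes, and the `common_ids` placeholder here are illustrative; the real implementation runs on MLX with the trained 1.6M-param router and the token set from `common_ids.npy`):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, COMMON = 64, 32_768, 6_400

W_vocab = rng.standard_normal((VOCAB, HIDDEN)).astype(np.float32)

# Pre-sliced common head: rows of W_vocab for the common token set.
# A simple range is a placeholder; the real ids come from training.
common_ids = np.arange(COMMON)
W_common = W_vocab[common_ids]

# Tiny router MLP ending in a sigmoid (sizes illustrative, not the real 1.6M).
W1 = 0.05 * rng.standard_normal((128, HIDDEN)).astype(np.float32)
W2 = 0.05 * rng.standard_normal(128).astype(np.float32)

def router_p_rare(h):
    """Predicted probability that the next token falls outside the common set."""
    return 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ h, 0.0))))

def next_token(h, threshold=0.75):
    if router_p_rare(h) > threshold:
        logits = W_vocab @ h                # full head: 32K-row matmul
        return int(np.argmax(logits))
    logits = W_common @ h                   # common head: ~5× cheaper
    return int(common_ids[int(np.argmax(logits))])  # map back to vocab ids
```

Because `W_common` is just a row slice of `W_vocab`, common-head logits are exactly equal to the corresponding full-head logits; quality loss comes only from the rare steps the router misroutes.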
Not all transformer layers contribute equally. A sensitivity analysis (profile_ffn_sparsity.py) identifies which FFN/MLP blocks can be zeroed out with minimal impact on output quality. These are then replaced at load time with a ZeroMLP that returns zeros, keeping the residual stream intact but skipping the computation entirely.
The cut list for Qwen2.5-14B bf16 is pre-computed in lobotomy_qwen2.5_14b_instruct_bf16.json (11 layers removed).
Result: ~15–20% additional latency reduction with <1% quality loss.
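The load-time replacement can be sketched as follows (a framework-agnostic NumPy stand-in with a toy block; the repo does the equivalent with MLX modules and the cut list from the JSON file):

```python
import numpy as np

class ZeroMLP:
    """Drop-in FFN replacement: contributes nothing to the residual stream."""
    def __call__(self, x):
        return np.zeros_like(x)

class Block:
    """Toy transformer block: residual + FFN (attention omitted for brevity)."""
    def __init__(self, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = (0.1 * rng.standard_normal((hidden, hidden))).astype(np.float32)
        self.mlp = lambda x: np.tanh(x @ self.W)

    def __call__(self, x):
        return x + self.mlp(x)            # residual stream stays intact

def apply_lobotomy(blocks, cut_layers):
    """Replace the FFN of each listed layer with ZeroMLP at load time."""
    for i in cut_layers:
        blocks[i].mlp = ZeroMLP()
    return blocks
```

An ablated block becomes the identity on the residual stream, so downstream layers still see a well-formed input while the FFN matmuls are skipped entirely.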
Both stages compose cleanly. infer_forge_edge_stacked.py and chat_forge_edge_v5.py apply Stage 2A at load time and Stage 1 at each generation step.
Combined result: ~35–40% throughput improvement, ~30–40% energy reduction, <2% quality loss.
```
pip install -r requirements.txt
```

Requires Python 3.10+ and Apple Silicon (MLX). For NVIDIA GPUs, the router training and inference scripts also support PyTorch checkpoints.
Phase 1 caches backbone hidden states (~15 min, one-time). Phase 2 trains the router MLP (~2 min).
```
python word_router/train_router_8b.py

# Skip Phase 1 if cache already exists
python word_router/train_router_8b.py --skip-cache

# More training steps
python word_router/train_router_8b.py --max-steps 20000
```

A checkpoint is saved to `./checkpoints_8b/step_XXXXXXX/` containing:

- `router.safetensors` — router weights
- `common_ids.npy` — the common token set
- `meta.txt` — training metadata
Profiles which FFN layers are safest to cut based on activation sparsity (Gini coefficient / Zipfian structure).
```
python ablation/profile_ffn_sparsity.py --model mlx-community/Qwen2.5-14B-Instruct-bf16
```

Output: per-layer fire rates, Gini coefficients, estimated bandwidth savings. Use the results to build a lobotomy JSON.
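One way to turn the profiler output into a cut list might look like this. The fire-rate values and the JSON layout are assumptions for illustration; match the field names to what `profile_ffn_sparsity.py` and the loader actually use:

```python
import json

# Hypothetical profiler output: per-layer fraction of FFN units that fire.
fire_rates = {0: 0.62, 1: 0.55, 2: 0.08, 3: 0.41, 4: 0.05, 5: 0.07}

# Cut the N layers whose FFNs fire least often (sparsest = safest to ablate).
N = 3
cut_layers = sorted(fire_rates, key=fire_rates.get)[:N]

# Assumed JSON shape; write this string to e.g. lobotomy_custom.json.
config = json.dumps({"cut_layers": sorted(cut_layers)}, indent=2)
print(sorted(cut_layers))  # layers 2, 4, 5 in this toy example
```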
```
# Router vs baseline comparison
python word_router/benchmark_4.py --mode both \
    --checkpoint ./qwen14b \
    --max-tokens 200

# Router only
python word_router/benchmark_4.py --mode fe \
    --checkpoint ./qwen14b \
    --threshold 0.75
```

```
# Router only
python infer_forge_edge_stacked.py \
    --checkpoint ./qwen14b \
    --max-tokens 500

# Router + Lobotomy (Stage 1 + 2A)
python infer_forge_edge_stacked.py \
    --checkpoint ./qwen14b \
    --lobotomy \
    --max-tokens 500
```

```
python chat_forge_edge_v5.py \
    --model mlx-community/Qwen2.5-14B-Instruct-bf16 \
    --checkpoint ./qwen14b/step_0015000 \
    --lobotomy ablation/lobotomy_qwen2.5_14b_instruct_bf16.json \
    --threshold 0.8
```

Type `exit` or `quit` to end the session. Throughput (tok/s) is printed after each response.
| Configuration | Throughput | Latency | Power | Quality loss |
|---|---|---|---|---|
| Baseline | ~25 tok/s | ~40 ms/t | ~110 W | — |
| + Stage 1 (router) | ~30 tok/s | ~33 ms/t | ~100 W | <0.5% |
| + Stage 2A (lobotomy) | ~34 tok/s | ~29 ms/t | ~85 W | <1% |
| Stacked (1 + 2A) | ~40 tok/s | ~25 ms/t | ~70 W | <2% |
Pre-trained router checkpoints for Qwen2.5-14B are in `./qwen14b/`. Place your trained checkpoints in the same structure:
```
qwen14b/
└── step_0015000/
    ├── router.safetensors
    ├── common_ids.npy
    └── meta.txt
```