Commits (30):

- `752a3be` Add Project Scope & Intent notice to README (claude, Mar 3, 2026)
- `1b792fc` Merge pull request #15 from maderix/claude/add-readme-scope-notice-EL9sS (maderix, Mar 3, 2026)
- `2b3b7ae` Fix token sampling underflow on short datasets (TastyHeadphones, Mar 3, 2026)
- `ebac5dd` Python Bridge+Memory leak fix+More functions (vipuldivyanshu92, Mar 3, 2026)
- `65cfc32` optimize singleton token params in generate_text (guitared, Mar 3, 2026)
- `b8f09a6` fix non-interactive session error and sudo password input for powerme… (guitared, Mar 3, 2026)
- `a14ce09` Capitalize doc header (guitared, Mar 3, 2026)
- `c330774` Merge PR #19: Bridge API + ANE classifier/softmax/rmsnorm_bwd offload… (maderix, Mar 3, 2026)
- `cb474e1` Add dynamic weight training pipeline — 110ms/step without recompilation (maderix, Mar 3, 2026)
- `4c14ed0` CLI fixes + --no-ane-extras flag + README benchmark table (maderix, Mar 3, 2026)
- `3c1aae6` Merge dynamic training pipeline + CLI fixes + benchmark comparison (Mar 3, 2026)
- `443194b` Dashboard v2: live stats, JSON parsing, all three pipelines (Mar 3, 2026)
- `d3d0030` Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory… (Mar 3, 2026)
- `c04168e` Add --data path support for static training pipelines (nabbilkhan, Mar 3, 2026)
- `541bf4e` fix: correctness & safety improvements (Mar 2, 2026)
- `0d9e139` Fix docs: add training data download instructions (04cb, Mar 4, 2026)
- `4a6f3e4` Revise README for clarity and project details (maderix, Mar 4, 2026)
- `3efa27d` Merge pull request #17 from TastyHeadphones/tastyheadphones/short-dat… (maderix, Mar 4, 2026)
- `37939c8` Merge pull request #34 from 04cb/fix/docs-add-training-data-link (maderix, Mar 4, 2026)
- `7fbb912` Merge pull request #20 from guitared/main (maderix, Mar 4, 2026)
- `44309b7` Merge pull request #27 from jskromer/fix/macos26-inmemory-benchmarks (maderix, Mar 4, 2026)
- `032f866` Merge pull request #29 from nabbilkhan/contrib/fix-training-data-paths (maderix, Mar 4, 2026)
- `05fc8f8` Merge pull request #31 from alvgeppetto-debug/fix/safety-correctness (maderix, Mar 4, 2026)
- `e986572` Replace assert() with non-fatal bounds checks on token IDs (Mar 4, 2026)
- `050bc4f` Add cross-generation ANE benchmark report from issue #3 (Mar 4, 2026)
- `1a7d884` Add NE core counts, clarify FP16 vs rated TOPS methodology (Mar 4, 2026)
- `efcf193` Add model config to benchmark report, update README with current results (Mar 4, 2026)
- `a12f080` Add pipeline scaffolding for multi-group ANE training (codegen-sh[bot], Mar 2, 2026)
- `f486dda` Address review feedback: configurable headroom, mmap hardening, unit … (codegen-sh[bot], Mar 2, 2026)
- `5d5ea41` Fix 6 review issues: checkpoint counting bug, headroom/memory consist… (codegen-sh[bot], Mar 3, 2026)
79 changes: 72 additions & 7 deletions README.md
@@ -2,15 +2,67 @@

Training neural networks directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. No CoreML training APIs, no Metal, no GPU — pure ANE compute.

## Project Scope & Intent

I'm genuinely grateful for all the attention this project has received — I never expected a weekend research hack to blow up like this. Thank you to everyone who starred, forked, ran benchmarks on their own hardware, and shared the work. It means a lot.

That said, I want to set clear expectations about what this project is and isn't.

This is a **research project**, not a production framework.

The goal was to demonstrate that **training on the Apple Neural Engine — and potentially other NPUs — is possible**, and that the barrier has always been software support, not hardware capability. The ANE is a remarkably capable piece of silicon that Apple restricts to inference-only use through CoreML. This project bypasses that restriction using reverse-engineered private APIs to show what's possible when you give the hardware a chance.

### What This Project Is

- A proof of concept for ANE training via `_ANEClient` and `_ANECompiler` private APIs
- A set of benchmarks documenting real ANE performance characteristics (throughput, power, SRAM behavior)
- A reference for anyone exploring direct ANE access outside CoreML
- Research code that I update when I find something interesting

### What This Project Is Not

- A maintained framework or library
- A replacement for CoreML, MLX, llama.cpp, or any production inference stack
- A path to training large models on consumer hardware (yet)

### On The Hype

Some coverage of this project has overstated its implications. To be clear:

- Training works, but utilization is low (~5-9% of peak), and significant engineering challenges remain
- Many element-wise operations still fall back to CPU
- This does **not** replace GPU training for anything beyond small research models today

The honest results — including all limitations — are documented in the accompanying articles:
- [Part 1: Reverse Engineering](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine)
- [Part 2: Benchmarks](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615)

### On Maintenance

I don't intend to grow this into a large community project. My focus is on original research (compiler infrastructure for edge AI optimization), and maintaining an open-source framework takes time away from that.

That said:
- I'll keep pushing updates when I discover something interesting
- Bug fixes and benchmark contributions (especially on hardware I don't own) are welcome
- Feature requests will likely go unaddressed — but feel free to fork
- PRs will be merged at a relatively slow pace; otherwise I become the bottleneck for community growth around this tech

### Fork it, build on it

This is MIT licensed for a reason. Everyone now has access to AI-assisted development tools that can adapt and extend code in hours. If this project is useful to you — take it, modify it, build something better. If you do something cool with it, I'd love to hear about it. If, in the future, the community decides to maintain a single source-of-truth repo, I fully support that.

---

## What This Is

A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.

**Current results (M4, single transformer layer, dim=768, seq=512):**
- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained)
- 6 ANE kernel dispatches per training step
**Current results — Stories110M (12-layer, dim=768, seq=256, 109M params):**
- Static pipeline: **91 ms/step** (M3 Ultra), **106 ms/step** (M4)
- Dynamic pipeline: **110 ms/step**, no recompilation
- 72 ANE kernels per step (static), 9 shared kernels (dynamic)
- All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)
- Adam optimizer, gradient accumulation, checkpoint/resume
- Adam optimizer, gradient accumulation, checkpoint/resume via exec() restart
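The checkpoint/resume-via-`exec()` mechanism can be sketched as follows. This is a hypothetical Python illustration (the project itself is not Python); `ane_ckpt.bin`, `MAX_COMPILES`, and the helper functions are invented names, and a real checkpoint would also persist weights and optimizer state:

```python
import os
import struct
import sys

CKPT = "ane_ckpt.bin"   # hypothetical checkpoint path
MAX_COMPILES = 110      # stay safely under the observed ~119-compile limit

def save_checkpoint(path: str, step: int) -> None:
    """Persist the resume point (real code would also dump weights/optimizer)."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", step))

def load_checkpoint(path: str) -> int:
    if not os.path.exists(path):
        return 0
    with open(path, "rb") as f:
        return struct.unpack("<I", f.read(4))[0]

def train_with_restart(total_steps: int) -> None:
    step = load_checkpoint(CKPT)
    compiles = 0
    while step < total_steps:
        step += 1            # one training step; recompiles leak compiler state
        compiles += 1
        if compiles >= MAX_COMPILES and step < total_steps:
            save_checkpoint(CKPT, step)
            # Replace the process image: the OS reclaims everything the ANE
            # compiler leaked, and the new process resumes from the checkpoint.
            os.execv(sys.executable, [sys.executable, *sys.argv])
```

The key design point is that `exec()` keeps the same PID and command line, so an outer launcher sees one continuous run while the leaked compiler resources are reclaimed at each restart.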

## Architecture

@@ -59,6 +111,14 @@ Key optimizations:
└── Makefile
```

## Training Data

Training requires pretokenized TinyStories data. To download:
```bash
cd training && bash download_data.sh
```
See [training/README.md](training/README.md) for detailed training instructions.

## Building

Requires macOS 15+ on Apple Silicon (tested on M4).
@@ -87,8 +147,8 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve

- **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE)
- **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint
- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling
- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP
- **Compile overhead** — Static pipeline recompiles 60+ kernels every 10 steps (~3.7s); dynamic pipeline avoids this
- **Low utilization** — Training sustains ~1-2 TFLOPS out of 15.8+ peak due to CPU fallbacks and I/O overhead
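The three-stage SDPA decomposition described above can be sketched in NumPy. This is an illustrative single-head sketch of the math, not the project's MIL graph:

```python
import numpy as np

def causal_sdpa_decomposed(q, k, v):
    """Causal attention split into the three ANE-friendly stages:
    1) scores = Q @ K^T / sqrt(d)      (matmul)
    2) probs  = softmax(scores + M)    (add + softmax, M = additive causal mask)
    3) out    = probs @ V              (matmul)
    """
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)                 # stage 1: Q @ K^T
    mask = np.triu(np.full((T, T), -np.inf), k=1)   # -inf above the diagonal
    z = scores + mask                               # stage 2: add mask...
    z = z - z.max(axis=-1, keepdims=True)           # ...then stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs @ v                                # stage 3: scores @ V
```

Because the mask is folded in as an elementwise add before the softmax, no dedicated `attn_mask` input is needed, which is what lets all three stages stay on the ANE.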

## Performance History

@@ -104,8 +164,13 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve

## Disclaimer

This project is independent research into Apple Neural Engine architecture. It uses undocumented APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.
This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.

## License

MIT — see [LICENSE](LICENSE)

---

*Built by a human + Claude, one weekend at a time.*

163 changes: 163 additions & 0 deletions benchmarks/ANE_BENCHMARK_REPORT.md
@@ -0,0 +1,163 @@
# Apple Neural Engine — Cross-Generation Benchmark Report

Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3).

## Model Configuration

All training benchmarks use **Stories110M** — a Llama2-architecture transformer:

```
Parameter Value
────────────────────────
Architecture Llama2 (RoPE, SwiGLU, RMSNorm, GQA-ready)
Layers 12
Dimension 768
Hidden (FFN) 2048
Heads 12
Vocab 32000 (Llama 2 BPE)
Sequence 256
Total Params 109.53M (84.95M transformer + 24.58M embedding)
Training Data TinyStories (~20M tokens, pretokenized)
Optimizer Adam (lr=1e-4 to 3e-4, b1=0.9, b2=0.999)
Precision FP16 on ANE, FP32 on CPU
```

Kernels per step (static pipeline): 72 (60 weight-bearing + 12 static sdpaBwd2).
Forward: sdpaFwd + ffnW13 + ffnW2 per layer. Backward: ffnBwdW2t + ffnBwdW13t + wotBwd + sdpaBwd1 + sdpaBwd2 + qkvBwd per layer. Weight gradients (dW) via `cblas_sgemm` on CPU.
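As a sanity check, the parameter totals in the table can be reconstructed from the config. The per-layer split below is my assumption of a standard Llama2 block (Wq/Wk/Wv/Wo, SwiGLU W1/W2/W3, two RMSNorms), not taken from the repo:

```python
# Config values from the table above; per-layer breakdown is assumed.
dim, hidden, vocab, layers = 768, 2048, 32000, 12

attn  = 4 * dim * dim                    # Wq, Wk, Wv, Wo
ffn   = 2 * dim * hidden + hidden * dim  # W1, W3 (SwiGLU gate/up) + W2 (down)
norms = 2 * dim                          # attention + FFN RMSNorm weights
per_layer = attn + ffn + norms

transformer = layers * per_layer         # 84,953,088  ->  84.95M
embedding   = vocab * dim                # 24,576,000  ->  24.58M
total       = transformer + embedding    # 109,529,088 -> 109.53M
print(f"{transformer/1e6:.2f}M + {embedding/1e6:.2f}M = {total/1e6:.2f}M")
```

The reconstruction lands exactly on the reported 84.95M + 24.58M = 109.53M, which suggests the embedding is untied from the output head in this count.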

## Training Performance (Static Pipeline)

```
Chip ms/step ANE ms Compile/10 ANE TFLOPS Util% Contributor
─────────────────────────────────────────────────────────────────────────────────
M1 Pro 148-163 32-35 7.9-8.5s 0.57-0.63 3.6-4.0 @moriwang
M1 Max 143-167 35-45 ~7.1s 0.54-0.65 3.4-4.1 @andyg5000
M3 Ultra* 91 ~10 ~3.7s 0.88 5.6 (repo ref)
M4 Pro 69-73 8.9 ~3.5s 1.28 8.1 @srt54558
M4 Max 64 10.2 ~3.5s 1.45 9.2 @SethBurkart123
M5 101-120 9.1-9.8 3.2-3.4s 0.77-0.91 4.9-5.8 @GitBubble
```

*M3 Ultra = reference platform this project was developed on.
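A hedged reading of the Util% column: every row matches (measured TFLOPS / 15.8) × 100, including the M3 Ultra row (0.88/15.8 ≈ 5.6%, not 0.88/31.6 ≈ 2.8%), so the column appears to be normalized against a fixed 15.8 TFLOPS FP16 reference rather than each chip's own rated peak. `REF_TFLOPS` below is my inference, not a documented constant:

```python
REF_TFLOPS = 15.8  # inferred common FP16 reference for the Util% column
rows = {  # chip: (measured TFLOPS, reported Util%), low end of each range
    "M1 Pro":   (0.57, 3.6),
    "M3 Ultra": (0.88, 5.6),
    "M4 Pro":   (1.28, 8.1),
    "M4 Max":   (1.45, 9.2),
    "M5":       (0.77, 4.9),
}
for chip, (tflops, util) in rows.items():
    # Each reported Util% reproduces to within rounding error.
    assert abs(tflops / REF_TFLOPS * 100 - util) < 0.1, chip
```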

## Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)

```
Chip NE Cores FP16 TFLOPS (measured) Rated TOPS (Apple spec*)
────────────────────────────────────────────────────────────────────────────
M1 Pro 16 FAIL 11 (MIL compat issue)
M1 Max 16 FAIL 11 (MIL compat issue)
M3 Pro 16 9.98 15.8
M3 Ultra 32 - 31.6 (ref platform)
M4 Pro 16 12.57 38
M4 Max 16 10.93 38
M5 16 12.17 not disclosed
M5 (other) 16 12.44 not disclosed
```

*Apple's "Rated TOPS" changed methodology across generations — M1/M3 report FP16,
M4 reports INT8/mixed-precision peak. The numbers are not directly comparable across
generations. Use the measured FP16 TFLOPS column for apples-to-apples comparison.
All chips have 16 NE cores except Ultra variants (32 cores, two dies via UltraFusion).
Max variants share the same 16-core NE as Pro — the M4 Max vs M4 Pro TFLOPS difference
is run-to-run variance, not hardware.*

## Comparative Chart

```
ANE Training Speed (ms/step, lower is better)
══════════════════════════════════════════════════════════════

M1 Pro ████████████████████████████████████████░░░░ 148-163 ms
M1 Max ██████████████████████████████████████░░░░░░ 143-167 ms
M3 Ultra ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 91 ms
M4 Pro ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 69-73 ms
M4 Max ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 64 ms
M5 ████████████████████████░░░░░░░░░░░░░░░░░░░░ 101-120 ms

0 50 100 150 200


Peak ANE Throughput (TFLOPS, higher is better)
══════════════════════════════════════════════════════════════

M1 Pro FAIL (MIL compat)
M1 Max FAIL (MIL compat)
M3 Pro ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░ 9.98
M4 Pro ████████████████████████████████░░░░░░░░░░░░░ 12.57
M4 Max ██████████████████████░░░░░░░░░░░░░░░░░░░░░░ 10.93
M5 █████████████████████████░░░░░░░░░░░░░░░░░░░ 12.17

0 3 6 9 12 15 18


ANE Sustained Throughput (TFLOPS, 5s window)
══════════════════════════════════════════════════════════════

M3 Pro ██████████████████████████████████████████████ 15.04 (95.2%)

0 3 6 9 12 15 18
(Only M3 Pro submitted sustained benchmark)
```

## Key Findings

### M1/M1 Pro/M1 Max
- **Standalone benchmarks fail** — `ane_mil_gen.h` single-blob weight format rejected
- **Training works** via `stories_mil.h` (separate per-matrix weight blobs)
- ANE compiler handles weight blobs differently from M4+
- Training at 148-167 ms/step, ~0.6 TFLOPS

### M3 Pro
- **Only ch=512 compiles** — 52 channel values tested (1-4096), only 512 accepted
- Fixed 512-wide lane structure in SRAM tiling
- **Peak: 16.77 TFLOPS** (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048
- **Sustained: 15.04 TFLOPS** over 5 seconds (95.2% utilization)
- Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement)

### M4 Pro / M4 Max
- Flexible channel support (256/384/512/768+)
- M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step
- M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall)
- `sram_probe` and `inmem_bench` fail on M4 Pro (same MIL compat issue)

### M5
- Training works out of the box with existing `program(1.3)` MIL
- Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra)
- Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro)
- ANE appears to be same H16 family as M4
- **M5 Pro/Max not yet benchmarked** — Fusion Architecture may change ANE behavior

### Cross-Generation MIL Compatibility

```
Feature M1 M3 M4 M5
─────────────────────────────────────────────────────────
program(1.3) / ios18 PARTIAL YES YES YES
Single-blob weights FAIL YES YES YES
Per-matrix weight blobs YES YES YES YES
Channel flexibility ? ch=512 FLEX FLEX
BLOBFILE offset refs FAIL YES YES YES
```

## macOS Compatibility Issues

- **macOS 26.x** — `[MLModel compileModelAtURL:]` broken for standalone benchmarks
(fixed in PR #27: switched to in-memory MIL compilation)
- **macOS 15.x** — Works for all M-series with correct MIL format
- M1 generation requires `stories_mil.h` path, not `ane_mil_gen.h`

## How to Contribute

Run on your hardware and post results to [Issue #3](https://github.com/maderix/ANE/issues/3):

```bash
cd training && make train_large
./train_large ane_stories110M_ckpt.bin 256 20 1e-4
```

Include: chip model, macOS version, full output with JSON lines.

---
*Report compiled 2026-03-04 from community submissions.*
*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton*
113 changes: 113 additions & 0 deletions benchmarks/community_results.json
@@ -0,0 +1,113 @@
{
"report_date": "2026-03-04",
"source": "https://github.com/maderix/ANE/issues/3",
"model": "Stories110M (12-layer transformer, 109M params)",
"config": {"dim": 768, "hidden": 2048, "heads": 12, "seq": 256, "vocab": 32000, "layers": 12},
"training_results": [
{
"chip": "M1 Pro",
"cores": "10-core CPU",
"ram_gb": 32,
"macos": "15.0",
"ms_per_step": [148, 163],
"ane_ms": [32, 35],
"compile_ms": [7900, 8500],
"ane_tflops": [0.57, 0.63],
"ane_util_pct": [3.6, 4.0],
"benchmarks_pass": false,
"notes": "Standalone benchmarks fail (MIL compat). Training works via stories_mil.h.",
"contributor": "moriwang"
},
{
"chip": "M1 Max",
"cores": "10-core CPU",
"ram_gb": 64,
"macos": "15.6.1",
"ms_per_step": [143, 167],
"ane_ms": [35, 45],
"compile_ms": [7100, 7100],
"ane_tflops": [0.54, 0.65],
"ane_util_pct": [3.4, 4.1],
"benchmarks_pass": false,
"notes": "Same MIL compat issue as M1 Pro.",
"contributor": "andyg5000"
},
{
"chip": "M3 Pro",
"cores": "12-core CPU",
"ram_gb": 36,
"macos": "15.7.4",
"peak_tflops": 16.77,
"sustained_tflops": 15.04,
"sustained_util_pct": 95.2,
"channel_constraint": "ch=512 only",
"notes": "Only ch=512 compiles. 52 values tested. Peak at 128x conv 512ch sp2048.",
"contributor": "D-Ogi"
},
{
"chip": "M4 Pro",
"cores": "unknown",
"ram_gb": null,
"macos": null,
"ms_per_step": [69, 73],
"ane_ms": [8.9, 8.9],
"compile_ms": [3465, 3465],
"ane_tflops": [1.28, 1.28],
"ane_util_pct": [8.1, 8.1],
"peak_tflops_inmem": 12.57,
"notes": "sram_probe and inmem_bench fail. inmem_peak and training work.",
"contributor": "srt54558"
},
{
"chip": "M4 Max",
"cores": "unknown",
"ram_gb": null,
"macos": null,
"ms_per_step": [64, 64],
"ane_ms": [10.2, 10.2],
"compile_ms": [3531, 3531],
"ane_tflops": [1.45, 1.45],
"ane_util_pct": [9.2, 9.2],
"peak_tflops_inmem": 10.93,
"notes": "Fastest training ms/step overall.",
"contributor": "SethBurkart123"
},
{
"chip": "M5",
"cores": "10-core (4P+6E)",
"ram_gb": 16,
"macos": "26.3",
"ms_per_step": [101, 120],
"ane_ms": [9.1, 9.8],
"compile_ms": [3200, 3400],
"ane_tflops": [0.77, 0.91],
"ane_util_pct": [4.9, 5.8],
"peak_tflops_inmem": 12.44,
"notes": "H16 ANE family (same as M4). Training works with existing program(1.3) MIL.",
"contributor": "GitBubble"
},
{
"chip": "M5",
"cores": "unknown",
"ram_gb": 32,
"macos": "26.4",
"peak_tflops_inmem": 12.17,
"notes": "inmem_peak only, no training data submitted.",
"contributor": "elijah-pelton"
}
],
"neural_engine_specs": {
"M1": {"ne_cores": 16, "rated_tops": 11},
"M1_Max": {"ne_cores": 16, "rated_tops": 11},
"M1_Ultra": {"ne_cores": 32, "rated_tops": 22},
"M2": {"ne_cores": 16, "rated_tops": 15.8},
"M2_Max": {"ne_cores": 16, "rated_tops": 15.8},
"M2_Ultra": {"ne_cores": 32, "rated_tops": 31.6},
"M3": {"ne_cores": 16, "rated_tops": 15.8},
"M3_Max": {"ne_cores": 16, "rated_tops": 15.8},
"M3_Ultra": {"ne_cores": 32, "rated_tops": 31.6},
"M4": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"},
"M4_Max": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"},
"M5": {"ne_cores": 16, "rated_tops": null, "estimated_tops": 19}
}
}
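A short hypothetical consumer of this file, ranking chips by midpoint training speed. The `benchmarks/community_results.json` path comes from the diff header above; the inline fallback subset is mine so the sketch runs standalone:

```python
import json
import pathlib

# Inline subset mirroring benchmarks/community_results.json, used when the
# file is not present in the working directory.
INLINE = {"training_results": [
    {"chip": "M1 Pro", "ms_per_step": [148, 163]},
    {"chip": "M4 Pro", "ms_per_step": [69, 73]},
    {"chip": "M4 Max", "ms_per_step": [64, 64]},
]}

path = pathlib.Path("benchmarks/community_results.json")
data = json.loads(path.read_text()) if path.exists() else INLINE

# Rank by midpoint ms/step; entries without training numbers (peak-only
# submissions) are skipped.
ranked = sorted(
    (r for r in data["training_results"] if "ms_per_step" in r),
    key=lambda r: sum(r["ms_per_step"]) / 2,
)
for r in ranked:
    lo, hi = r["ms_per_step"]
    print(f"{r['chip']:8s} {(lo + hi) / 2:.1f} ms/step")
```

Each range is stored as a `[low, high]` pair, so the midpoint is simply their mean; peak-only entries (e.g. the M3 Pro submission) carry no `ms_per_step` key and fall out of the ranking naturally.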