Commits (30):

- `752a3be` Add Project Scope & Intent notice to README (claude, Mar 3, 2026)
- `1b792fc` Merge pull request #15 from maderix/claude/add-readme-scope-notice-EL9sS (maderix, Mar 3, 2026)
- `2b3b7ae` Fix token sampling underflow on short datasets (TastyHeadphones, Mar 3, 2026)
- `ebac5dd` Python Bridge+Memory leak fix+More functions (vipuldivyanshu92, Mar 3, 2026)
- `65cfc32` optimize singleton token params in generate_text (guitared, Mar 3, 2026)
- `b8f09a6` fix non-interactive session error and sudo password input for powerme… (guitared, Mar 3, 2026)
- `a14ce09` Capitalize doc header (guitared, Mar 3, 2026)
- `c330774` Merge PR #19: Bridge API + ANE classifier/softmax/rmsnorm_bwd offload… (maderix, Mar 3, 2026)
- `cb474e1` Add dynamic weight training pipeline — 110ms/step without recompilation (maderix, Mar 3, 2026)
- `4c14ed0` CLI fixes + --no-ane-extras flag + README benchmark table (maderix, Mar 3, 2026)
- `3c1aae6` Merge dynamic training pipeline + CLI fixes + benchmark comparison (Mar 3, 2026)
- `443194b` Dashboard v2: live stats, JSON parsing, all three pipelines (Mar 3, 2026)
- `d3d0030` Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory… (Mar 3, 2026)
- `c04168e` Add --data path support for static training pipelines (nabbilkhan, Mar 3, 2026)
- `541bf4e` fix: correctness & safety improvements (Mar 2, 2026)
- `0d9e139` Fix docs: add training data download instructions (04cb, Mar 4, 2026)
- `4a6f3e4` Revise README for clarity and project details (maderix, Mar 4, 2026)
- `3efa27d` Merge pull request #17 from TastyHeadphones/tastyheadphones/short-dat… (maderix, Mar 4, 2026)
- `37939c8` Merge pull request #34 from 04cb/fix/docs-add-training-data-link (maderix, Mar 4, 2026)
- `7fbb912` Merge pull request #20 from guitared/main (maderix, Mar 4, 2026)
- `44309b7` Merge pull request #27 from jskromer/fix/macos26-inmemory-benchmarks (maderix, Mar 4, 2026)
- `032f866` Merge pull request #29 from nabbilkhan/contrib/fix-training-data-paths (maderix, Mar 4, 2026)
- `05fc8f8` Merge pull request #31 from alvgeppetto-debug/fix/safety-correctness (maderix, Mar 4, 2026)
- `e986572` Replace assert() with non-fatal bounds checks on token IDs (Mar 4, 2026)
- `050bc4f` Add cross-generation ANE benchmark report from issue #3 (Mar 4, 2026)
- `1a7d884` Add NE core counts, clarify FP16 vs rated TOPS methodology (Mar 4, 2026)
- `efcf193` Add model config to benchmark report, update README with current results (Mar 4, 2026)
- `a12f080` Add pipeline scaffolding for multi-group ANE training (codegen-sh[bot], Mar 2, 2026)
- `f486dda` Address review feedback: configurable headroom, mmap hardening, unit … (codegen-sh[bot], Mar 2, 2026)
- `5d5ea41` Fix 6 review issues: checkpoint counting bug, headroom/memory consist… (codegen-sh[bot], Mar 3, 2026)
79 changes: 72 additions & 7 deletions README.md
@@ -2,15 +2,67 @@

Training neural networks directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. No CoreML training APIs, no Metal, no GPU — pure ANE compute.

## Project Scope & Intent

I'm genuinely grateful for all the attention this project has received — I never expected a weekend research hack to blow up like this. Thank you to everyone who starred, forked, ran benchmarks on their own hardware, and shared the work. It means a lot.

That said, I want to set clear expectations about what this project is and isn't.

This is a **research project**, not a production framework.

The goal was to demonstrate that **training on the Apple Neural Engine — and potentially other NPUs — is possible**, and that the barrier has always been software support, not hardware capability. The ANE is a remarkably capable piece of silicon that Apple restricts to inference-only use through CoreML. This project bypasses that restriction using reverse-engineered private APIs to show what's possible when you give the hardware a chance.

### What This Project Is

- A proof of concept for ANE training via `_ANEClient` and `_ANECompiler` private APIs
- A set of benchmarks documenting real ANE performance characteristics (throughput, power, SRAM behavior)
- A reference for anyone exploring direct ANE access outside CoreML
- Research code that I update when I find something interesting

### What This Project Is Not

- A maintained framework or library
- A replacement for CoreML, MLX, llama.cpp, or any production inference stack
- A path to training large models on consumer hardware (yet)

### On The Hype

Some coverage of this project has overstated its implications. To be clear:

- Training works, but utilization is low (~5-9% of peak), and significant engineering challenges remain
- Many element-wise operations still fall back to CPU
- This does **not** replace GPU training for anything beyond small research models today

The honest results — including all limitations — are documented in the accompanying articles:
- [Part 1: Reverse Engineering](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine)
- [Part 2: Benchmarks](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615)

### On Maintenance

I don't intend to grow this into a large community project. My focus is on original research (compiler infrastructure for edge AI optimization), and maintaining an open-source framework takes time away from that.

That said:
- I'll keep pushing updates when I discover something interesting
- Bug fixes and benchmark contributions (especially on hardware I don't own) are welcome
- Feature requests will likely go unaddressed — but feel free to fork
- PRs will be merged at a relatively slow pace; otherwise I become the bottleneck for community growth around this tech

### Fork it, build on it

This is MIT licensed for a reason. Everyone now has access to AI-assisted development tools that can adapt and extend code in hours. If this project is useful to you — take it, modify it, build something better. If you do something cool with it, I'd love to hear about it. If, in the future, the community decides to maintain a single source-of-truth repo, I fully support that.

---

## What This Is

A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.

**Current results (M4, single transformer layer, dim=768, seq=512):**
- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained)
- 6 ANE kernel dispatches per training step
**Current results — Stories110M (12-layer, dim=768, seq=256, 109M params):**
- Static pipeline: **91 ms/step** (M3 Ultra), **106 ms/step** (M4)
- Dynamic pipeline: **110 ms/step**, no recompilation
- 72 ANE kernels per step (static), 9 shared kernels (dynamic)
- All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)
- Adam optimizer, gradient accumulation, checkpoint/resume
- Adam optimizer, gradient accumulation, checkpoint/resume via exec() restart
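The checkpoint/resume-via-`exec()` mechanism can be sketched as follows. This is a hypothetical Python illustration (the project itself is not Python); `ane_ckpt.bin`, `MAX_COMPILES`, and the helper functions are invented names, and a real checkpoint would also persist weights and optimizer state:

```python
import os
import struct
import sys

CKPT = "ane_ckpt.bin"   # hypothetical checkpoint path
MAX_COMPILES = 110      # stay safely under the observed ~119-compile limit

def save_checkpoint(path: str, step: int) -> None:
    """Persist the resume point (real code would also dump weights/optimizer)."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", step))

def load_checkpoint(path: str) -> int:
    if not os.path.exists(path):
        return 0
    with open(path, "rb") as f:
        return struct.unpack("<I", f.read(4))[0]

def train_with_restart(total_steps: int) -> None:
    step = load_checkpoint(CKPT)
    compiles = 0
    while step < total_steps:
        step += 1            # one training step; recompiles leak compiler state
        compiles += 1
        if compiles >= MAX_COMPILES and step < total_steps:
            save_checkpoint(CKPT, step)
            # Replace the process image: the OS reclaims everything the ANE
            # compiler leaked, and the new process resumes from the checkpoint.
            os.execv(sys.executable, [sys.executable, *sys.argv])
```

The key design point is that `exec()` keeps the same PID and command line, so an outer launcher sees one continuous run while the leaked compiler resources are reclaimed at each restart.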

## Architecture

@@ -59,6 +111,14 @@ Key optimizations:
└── Makefile
```

## Training Data

Training requires pretokenized TinyStories data. To download:
```bash
cd training && bash download_data.sh
```
See [training/README.md](training/README.md) for detailed training instructions.

## Building

Requires macOS 15+ on Apple Silicon (tested on M4).
@@ -87,8 +147,8 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve

- **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE)
- **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint
- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling
- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP
- **Compile overhead** — Static pipeline recompiles 60+ kernels every 10 steps (~3.7s); dynamic pipeline avoids this
- **Low utilization** — Training sustains ~1-2 TFLOPS out of 15.8+ peak due to CPU fallbacks and I/O overhead
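The three-stage SDPA decomposition described above can be sketched in NumPy. This is an illustrative single-head sketch of the math, not the project's MIL graph:

```python
import numpy as np

def causal_sdpa_decomposed(q, k, v):
    """Causal attention split into the three ANE-friendly stages:
    1) scores = Q @ K^T / sqrt(d)      (matmul)
    2) probs  = softmax(scores + M)    (add + softmax, M = additive causal mask)
    3) out    = probs @ V              (matmul)
    """
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)                 # stage 1: Q @ K^T
    mask = np.triu(np.full((T, T), -np.inf), k=1)   # -inf above the diagonal
    z = scores + mask                               # stage 2: add mask...
    z = z - z.max(axis=-1, keepdims=True)           # ...then stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs @ v                                # stage 3: scores @ V
```

Because the mask is folded in as an elementwise add before the softmax, no dedicated `attn_mask` input is needed, which is what lets all three stages stay on the ANE.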

## Performance History

@@ -104,8 +164,13 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve

## Disclaimer

This project is independent research into Apple Neural Engine architecture. It uses undocumented APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.
This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.

## License

MIT — see [LICENSE](LICENSE)

---

*Built by a human + Claude, one weekend at a time.*

163 changes: 163 additions & 0 deletions benchmarks/ANE_BENCHMARK_REPORT.md
@@ -0,0 +1,163 @@
# Apple Neural Engine — Cross-Generation Benchmark Report

Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3).

## Model Configuration

All training benchmarks use **Stories110M** — a Llama2-architecture transformer:

```
Parameter Value
────────────────────────
Architecture Llama2 (RoPE, SwiGLU, RMSNorm, GQA-ready)
Layers 12
Dimension 768
Hidden (FFN) 2048
Heads 12
Vocab 32000 (Llama 2 BPE)
Sequence 256
Total Params 109.53M (84.95M transformer + 24.58M embedding)
Training Data TinyStories (~20M tokens, pretokenized)
Optimizer Adam (lr=1e-4 to 3e-4, b1=0.9, b2=0.999)
Precision FP16 on ANE, FP32 on CPU
```

Kernels per step (static pipeline): 72 (60 weight-bearing + 12 static sdpaBwd2).
Forward: sdpaFwd + ffnW13 + ffnW2 per layer. Backward: ffnBwdW2t + ffnBwdW13t + wotBwd + sdpaBwd1 + sdpaBwd2 + qkvBwd per layer. Weight gradients (dW) via `cblas_sgemm` on CPU.
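As a sanity check, the parameter totals in the table can be reconstructed from the config. The per-layer split below is my assumption of a standard Llama2 block (Wq/Wk/Wv/Wo, SwiGLU W1/W2/W3, two RMSNorms), not taken from the repo:

```python
# Config values from the table above; per-layer breakdown is assumed.
dim, hidden, vocab, layers = 768, 2048, 32000, 12

attn  = 4 * dim * dim                    # Wq, Wk, Wv, Wo
ffn   = 2 * dim * hidden + hidden * dim  # W1, W3 (SwiGLU gate/up) + W2 (down)
norms = 2 * dim                          # attention + FFN RMSNorm weights
per_layer = attn + ffn + norms

transformer = layers * per_layer         # 84,953,088  ->  84.95M
embedding   = vocab * dim                # 24,576,000  ->  24.58M
total       = transformer + embedding    # 109,529,088 -> 109.53M
print(f"{transformer/1e6:.2f}M + {embedding/1e6:.2f}M = {total/1e6:.2f}M")
```

The reconstruction lands exactly on the reported 84.95M + 24.58M = 109.53M, which suggests the embedding is untied from the output head in this count.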

## Training Performance (Static Pipeline)

```
Chip ms/step ANE ms Compile/10 ANE TFLOPS Util% Contributor
─────────────────────────────────────────────────────────────────────────────────
M1 Pro 148-163 32-35 7.9-8.5s 0.57-0.63 3.6-4.0 @moriwang
M1 Max 143-167 35-45 ~7.1s 0.54-0.65 3.4-4.1 @andyg5000
M3 Ultra* 91 ~10 ~3.7s 0.88 5.6 (repo ref)
M4 Pro 69-73 8.9 ~3.5s 1.28 8.1 @srt54558
M4 Max 64 10.2 ~3.5s 1.45 9.2 @SethBurkart123
M5 101-120 9.1-9.8 3.2-3.4s 0.77-0.91 4.9-5.8 @GitBubble
```

*M3 Ultra = reference platform this project was developed on.
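A hedged reading of the Util% column: every row matches (measured TFLOPS / 15.8) × 100, including the M3 Ultra row (0.88/15.8 ≈ 5.6%, not 0.88/31.6 ≈ 2.8%), so the column appears to be normalized against a fixed 15.8 TFLOPS FP16 reference rather than each chip's own rated peak. `REF_TFLOPS` below is my inference, not a documented constant:

```python
REF_TFLOPS = 15.8  # inferred common FP16 reference for the Util% column
rows = {  # chip: (measured TFLOPS, reported Util%), low end of each range
    "M1 Pro":   (0.57, 3.6),
    "M3 Ultra": (0.88, 5.6),
    "M4 Pro":   (1.28, 8.1),
    "M4 Max":   (1.45, 9.2),
    "M5":       (0.77, 4.9),
}
for chip, (tflops, util) in rows.items():
    # Each reported Util% reproduces to within rounding error.
    assert abs(tflops / REF_TFLOPS * 100 - util) < 0.1, chip
```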

## Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)

```
Chip NE Cores FP16 TFLOPS (measured) Rated TOPS (Apple spec*)
────────────────────────────────────────────────────────────────────────────
M1 Pro 16 FAIL 11 (MIL compat issue)
M1 Max 16 FAIL 11 (MIL compat issue)
M3 Pro 16 9.98 15.8
M3 Ultra 32 - 31.6 (ref platform)
M4 Pro 16 12.57 38
M4 Max 16 10.93 38
M5 16 12.17 not disclosed
M5 (other) 16 12.44 not disclosed
```

*Apple's "Rated TOPS" changed methodology across generations — M1/M3 report FP16,
M4 reports INT8/mixed-precision peak. The numbers are not directly comparable across
generations. Use the measured FP16 TFLOPS column for apples-to-apples comparison.
All chips have 16 NE cores except Ultra variants (32 cores, two dies via UltraFusion).
Max variants share the same 16-core NE as Pro — the M4 Max vs M4 Pro TFLOPS difference
is run-to-run variance, not hardware.*

## Comparative Chart

```
ANE Training Speed (ms/step, lower is better)
══════════════════════════════════════════════════════════════

M1 Pro ████████████████████████████████████████░░░░ 148-163 ms
M1 Max ██████████████████████████████████████░░░░░░ 143-167 ms
M3 Ultra ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 91 ms
M4 Pro ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 69-73 ms
M4 Max ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 64 ms
M5 ████████████████████████░░░░░░░░░░░░░░░░░░░░ 101-120 ms

0 50 100 150 200


Peak ANE Throughput (TFLOPS, higher is better)
══════════════════════════════════════════════════════════════

M1 Pro FAIL (MIL compat)
M1 Max FAIL (MIL compat)
M3 Pro ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░ 9.98
M4 Pro ████████████████████████████████░░░░░░░░░░░░░ 12.57
M4 Max ██████████████████████░░░░░░░░░░░░░░░░░░░░░░ 10.93
M5 █████████████████████████░░░░░░░░░░░░░░░░░░░ 12.17

0 3 6 9 12 15 18


ANE Sustained Throughput (TFLOPS, 5s window)
══════════════════════════════════════════════════════════════

M3 Pro ██████████████████████████████████████████████ 15.04 (95.2%)

0 3 6 9 12 15 18
(Only M3 Pro submitted sustained benchmark)
```

## Key Findings

### M1/M1 Pro/M1 Max
- **Standalone benchmarks fail** — `ane_mil_gen.h` single-blob weight format rejected
- **Training works** via `stories_mil.h` (separate per-matrix weight blobs)
- ANE compiler handles weight blobs differently from M4+
- Training at 148-167 ms/step, ~0.6 TFLOPS

### M3 Pro
- **Only ch=512 compiles** — 52 channel values tested (1-4096), only 512 accepted
- Fixed 512-wide lane structure in SRAM tiling
- **Peak: 16.77 TFLOPS** (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048
- **Sustained: 15.04 TFLOPS** over 5 seconds (95.2% utilization)
- Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement)

### M4 Pro / M4 Max
- Flexible channel support (256/384/512/768+)
- M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step
- M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall)
- `sram_probe` and `inmem_bench` fail on M4 Pro (same MIL compat issue)

### M5
- Training works out of the box with existing `program(1.3)` MIL
- Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra)
- Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro)
- ANE appears to be same H16 family as M4
- **M5 Pro/Max not yet benchmarked** — Fusion Architecture may change ANE behavior

### Cross-Generation MIL Compatibility

```
Feature M1 M3 M4 M5
─────────────────────────────────────────────────────────
program(1.3) / ios18 PARTIAL YES YES YES
Single-blob weights FAIL YES YES YES
Per-matrix weight blobs YES YES YES YES
Channel flexibility ? ch=512 FLEX FLEX
BLOBFILE offset refs FAIL YES YES YES
```

## macOS Compatibility Issues

- **macOS 26.x** — `[MLModel compileModelAtURL:]` broken for standalone benchmarks
(fixed in PR #27: switched to in-memory MIL compilation)
- **macOS 15.x** — Works for all M-series with correct MIL format
- M1 generation requires `stories_mil.h` path, not `ane_mil_gen.h`

## How to Contribute

Run on your hardware and post results to [Issue #3](https://github.com/maderix/ANE/issues/3):

```bash
cd training && make train_large
./train_large ane_stories110M_ckpt.bin 256 20 1e-4
```

Include: chip model, macOS version, full output with JSON lines.

---
*Report compiled 2026-03-04 from community submissions.*
*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton*
113 changes: 113 additions & 0 deletions benchmarks/community_results.json
@@ -0,0 +1,113 @@
{
"report_date": "2026-03-04",
"source": "https://github.com/maderix/ANE/issues/3",
"model": "Stories110M (12-layer transformer, 109M params)",
"config": {"dim": 768, "hidden": 2048, "heads": 12, "seq": 256, "vocab": 32000, "layers": 12},
"training_results": [
{
"chip": "M1 Pro",
"cores": "10-core CPU",
"ram_gb": 32,
"macos": "15.0",
"ms_per_step": [148, 163],
"ane_ms": [32, 35],
"compile_ms": [7900, 8500],
"ane_tflops": [0.57, 0.63],
"ane_util_pct": [3.6, 4.0],
"benchmarks_pass": false,
"notes": "Standalone benchmarks fail (MIL compat). Training works via stories_mil.h.",
"contributor": "moriwang"
},
{
"chip": "M1 Max",
"cores": "10-core CPU",
"ram_gb": 64,
"macos": "15.6.1",
"ms_per_step": [143, 167],
"ane_ms": [35, 45],
"compile_ms": [7100, 7100],
"ane_tflops": [0.54, 0.65],
"ane_util_pct": [3.4, 4.1],
"benchmarks_pass": false,
"notes": "Same MIL compat issue as M1 Pro.",
"contributor": "andyg5000"
},
{
"chip": "M3 Pro",
"cores": "12-core CPU",
"ram_gb": 36,
"macos": "15.7.4",
"peak_tflops": 16.77,
"sustained_tflops": 15.04,
"sustained_util_pct": 95.2,
"channel_constraint": "ch=512 only",
"notes": "Only ch=512 compiles. 52 values tested. Peak at 128x conv 512ch sp2048.",
"contributor": "D-Ogi"
},
{
"chip": "M4 Pro",
"cores": "unknown",
"ram_gb": null,
"macos": null,
"ms_per_step": [69, 73],
"ane_ms": [8.9, 8.9],
"compile_ms": [3465, 3465],
"ane_tflops": [1.28, 1.28],
"ane_util_pct": [8.1, 8.1],
"peak_tflops_inmem": 12.57,
"notes": "sram_probe and inmem_bench fail. inmem_peak and training work.",
"contributor": "srt54558"
},
{
"chip": "M4 Max",
"cores": "unknown",
"ram_gb": null,
"macos": null,
"ms_per_step": [64, 64],
"ane_ms": [10.2, 10.2],
"compile_ms": [3531, 3531],
"ane_tflops": [1.45, 1.45],
"ane_util_pct": [9.2, 9.2],
"peak_tflops_inmem": 10.93,
"notes": "Fastest training ms/step overall.",
"contributor": "SethBurkart123"
},
{
"chip": "M5",
"cores": "10-core (4P+6E)",
"ram_gb": 16,
"macos": "26.3",
"ms_per_step": [101, 120],
"ane_ms": [9.1, 9.8],
"compile_ms": [3200, 3400],
"ane_tflops": [0.77, 0.91],
"ane_util_pct": [4.9, 5.8],
"peak_tflops_inmem": 12.44,
"notes": "H16 ANE family (same as M4). Training works with existing program(1.3) MIL.",
"contributor": "GitBubble"
},
{
"chip": "M5",
"cores": "unknown",
"ram_gb": 32,
"macos": "26.4",
"peak_tflops_inmem": 12.17,
"notes": "inmem_peak only, no training data submitted.",
"contributor": "elijah-pelton"
}
],
"neural_engine_specs": {
"M1": {"ne_cores": 16, "rated_tops": 11},
"M1_Max": {"ne_cores": 16, "rated_tops": 11},
"M1_Ultra": {"ne_cores": 32, "rated_tops": 22},
"M2": {"ne_cores": 16, "rated_tops": 15.8},
"M2_Max": {"ne_cores": 16, "rated_tops": 15.8},
"M2_Ultra": {"ne_cores": 32, "rated_tops": 31.6},
"M3": {"ne_cores": 16, "rated_tops": 15.8},
"M3_Max": {"ne_cores": 16, "rated_tops": 15.8},
"M3_Ultra": {"ne_cores": 32, "rated_tops": 31.6},
"M4": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"},
"M4_Max": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"},
"M5": {"ne_cores": 16, "rated_tops": null, "estimated_tops": 19}
}
}
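A short hypothetical consumer of this file, ranking chips by midpoint training speed. The `benchmarks/community_results.json` path comes from the diff header above; the inline fallback subset is mine so the sketch runs standalone:

```python
import json
import pathlib

# Inline subset mirroring benchmarks/community_results.json, used when the
# file is not present in the working directory.
INLINE = {"training_results": [
    {"chip": "M1 Pro", "ms_per_step": [148, 163]},
    {"chip": "M4 Pro", "ms_per_step": [69, 73]},
    {"chip": "M4 Max", "ms_per_step": [64, 64]},
]}

path = pathlib.Path("benchmarks/community_results.json")
data = json.loads(path.read_text()) if path.exists() else INLINE

# Rank by midpoint ms/step; entries without training numbers (peak-only
# submissions) are skipped.
ranked = sorted(
    (r for r in data["training_results"] if "ms_per_step" in r),
    key=lambda r: sum(r["ms_per_step"]) / 2,
)
for r in ranked:
    lo, hi = r["ms_per_step"]
    print(f"{r['chip']:8s} {(lo + hi) / 2:.1f} ms/step")
```

Each range is stored as a `[low, high]` pair, so the midpoint is simply their mean; peak-only entries (e.g. the M3 Pro submission) carry no `ms_per_step` key and fall out of the ranking naturally.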