diff --git a/README.md b/README.md index d2c7bb2..ed2362d 100644 --- a/README.md +++ b/README.md @@ -2,15 +2,67 @@ Training neural networks directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. No CoreML training APIs, no Metal, no GPU — pure ANE compute. +## Project Scope & Intent + +I'm genuinely grateful for all the attention this project has received — I never expected a weekend research hack to blow up like this. Thank you to everyone who starred, forked, ran benchmarks on their own hardware, and shared the work. It means a lot. + +That said, I want to set clear expectations about what this project is and isn't. + +This is a **research project**, not a production framework. + +The goal was to demonstrate that **training on the Apple Neural Engine — and potentially other NPUs — is possible**, and that the barrier has always been software support, not hardware capability. The ANE is a remarkably capable piece of silicon that Apple restricts to inference-only use through CoreML. This project bypasses that restriction using reverse-engineered private APIs to show what's possible when you give the hardware a chance. + +### What This Project Is + +- A proof of concept for ANE training via `_ANEClient` and `_ANECompiler` private APIs +- A set of benchmarks documenting real ANE performance characteristics (throughput, power, SRAM behavior) +- A reference for anyone exploring direct ANE access outside CoreML +- Research code that I update when I find something interesting + +### What This Project Is Not + +- A maintained framework or library +- A replacement for CoreML, MLX, llama.cpp, or any production inference stack +- A path to training large models on consumer hardware (yet) + +### On The Hype + +Some coverage of this project has overstated its implications. 
To be clear:
+
+- Training works, but utilization is low (~5-9% of peak) with significant engineering challenges remaining
+- Many element-wise operations still fall back to CPU
+- This does **not** replace GPU training for anything beyond small research models today
+
+The honest results — including all limitations — are documented in the accompanying articles:
+- [Part 1: Reverse Engineering](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine)
+- [Part 2: Benchmarks](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615)
+
+### On Maintenance
+
+I don't intend to grow this into a large community project. My focus is on original research (compiler infrastructure for edge AI optimization), and maintaining an open-source framework takes time away from that.
+
+That said:
+- I'll keep pushing updates when I discover something interesting
+- Bug fixes and benchmark contributions (especially on hardware I don't own) are welcome
+- Feature requests will likely go unaddressed — but feel free to fork
+- PRs will be merged at a relatively slow pace; otherwise I become the bottleneck for community growth around this tech
+
+### Fork it, build on it
+
+This is MIT licensed for a reason. Everyone now has access to AI-assisted development tools that can adapt and extend code in hours. If this project is useful to you — take it, modify it, build something better. If you do something cool with it, I'd love to hear about it. If, in the future, the community decides to maintain a single source-of-truth repo, I fully support that.
+
+---
+
 ## What This Is
 
 A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. 
This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware. -**Current results (M4, single transformer layer, dim=768, seq=512):** -- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained) -- 6 ANE kernel dispatches per training step +**Current results — Stories110M (12-layer, dim=768, seq=256, 109M params):** +- Static pipeline: **91 ms/step** (M3 Ultra), **106 ms/step** (M4) +- Dynamic pipeline: **110 ms/step**, no recompilation +- 72 ANE kernels per step (static), 9 shared kernels (dynamic) - All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas) -- Adam optimizer, gradient accumulation, checkpoint/resume +- Adam optimizer, gradient accumulation, checkpoint/resume via exec() restart ## Architecture @@ -59,6 +111,14 @@ Key optimizations: └── Makefile ``` +## Training Data + +Training requires pretokenized TinyStories data. To download: +```bash +cd training && bash download_data.sh +``` +See [training/README.md](training/README.md) for detailed training instructions. + ## Building Requires macOS 15+ on Apple Silicon (tested on M4). @@ -87,8 +147,8 @@ No external dependencies. 
Uses only system frameworks + private ANE APIs resolve - **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE) - **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint -- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling -- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP +- **Compile overhead** — Static pipeline recompiles 60+ kernels every 10 steps (~3.7s); dynamic pipeline avoids this +- **Low utilization** — Training sustains ~1-2 TFLOPS out of 15.8+ peak due to CPU fallbacks and I/O overhead ## Performance History @@ -104,8 +164,13 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve ## Disclaimer -This project is independent research into Apple Neural Engine architecture. It uses undocumented APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk. +This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk. 
## License MIT — see [LICENSE](LICENSE) + +--- + +*Built by a human + Claude, one weekend at a time.* + diff --git a/benchmarks/ANE_BENCHMARK_REPORT.md b/benchmarks/ANE_BENCHMARK_REPORT.md new file mode 100644 index 0000000..b7095a0 --- /dev/null +++ b/benchmarks/ANE_BENCHMARK_REPORT.md @@ -0,0 +1,163 @@ +# Apple Neural Engine — Cross-Generation Benchmark Report + +Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3). + +## Model Configuration + +All training benchmarks use **Stories110M** — a Llama2-architecture transformer: + +``` +Parameter Value +──────────────────────── +Architecture Llama2 (RoPE, SwiGLU, RMSNorm, GQA-ready) +Layers 12 +Dimension 768 +Hidden (FFN) 2048 +Heads 12 +Vocab 32000 (Llama 2 BPE) +Sequence 256 +Total Params 109.53M (84.95M transformer + 24.58M embedding) +Training Data TinyStories (~20M tokens, pretokenized) +Optimizer Adam (lr=1e-4 to 3e-4, b1=0.9, b2=0.999) +Precision FP16 on ANE, FP32 on CPU +``` + +Kernels per step (static pipeline): 72 (60 weight-bearing + 12 static sdpaBwd2). +Forward: sdpaFwd + ffnW13 + ffnW2 per layer. Backward: ffnBwdW2t + ffnBwdW13t + wotBwd + sdpaBwd1 + sdpaBwd2 + qkvBwd per layer. Weight gradients (dW) via `cblas_sgemm` on CPU. + +## Training Performance (Static Pipeline) + +``` +Chip ms/step ANE ms Compile/10 ANE TFLOPS Util% Contributor +───────────────────────────────────────────────────────────────────────────────── +M1 Pro 148-163 32-35 7.9-8.5s 0.57-0.63 3.6-4.0 @moriwang +M1 Max 143-167 35-45 ~7.1s 0.54-0.65 3.4-4.1 @andyg5000 +M3 Ultra* 91 ~10 ~3.7s 0.88 5.6 (repo ref) +M4 Pro 69-73 8.9 ~3.5s 1.28 8.1 @srt54558 +M4 Max 64 10.2 ~3.5s 1.45 9.2 @SethBurkart123 +M5 101-120 9.1-9.8 3.2-3.4s 0.77-0.91 4.9-5.8 @GitBubble +``` + +*M3 Ultra = reference platform this project was developed on. 
+ +## Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64) + +``` +Chip NE Cores FP16 TFLOPS (measured) Rated TOPS (Apple spec*) +──────────────────────────────────────────────────────────────────────────── +M1 Pro 16 FAIL 11 (MIL compat issue) +M1 Max 16 FAIL 11 (MIL compat issue) +M3 Pro 16 9.98 15.8 +M3 Ultra 32 - 31.6 (ref platform) +M4 Pro 16 12.57 38 +M4 Max 16 10.93 38 +M5 16 12.17 not disclosed +M5 (other) 16 12.44 not disclosed +``` + +*Apple's "Rated TOPS" changed methodology across generations — M1/M3 report FP16, +M4 reports INT8/mixed-precision peak. The numbers are not directly comparable across +generations. Use the measured FP16 TFLOPS column for apples-to-apples comparison. +All chips have 16 NE cores except Ultra variants (32 cores, two dies via UltraFusion). +Max variants share the same 16-core NE as Pro — the M4 Max vs M4 Pro TFLOPS difference +is run-to-run variance, not hardware.* + +## Comparative Chart + +``` +ANE Training Speed (ms/step, lower is better) +══════════════════════════════════════════════════════════════ + +M1 Pro ████████████████████████████████████████░░░░ 148-163 ms +M1 Max ██████████████████████████████████████░░░░░░ 143-167 ms +M3 Ultra ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 91 ms +M4 Pro ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 69-73 ms +M4 Max ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 64 ms +M5 ████████████████████████░░░░░░░░░░░░░░░░░░░░ 101-120 ms + + 0 50 100 150 200 + + +Peak ANE Throughput (TFLOPS, higher is better) +══════════════════════════════════════════════════════════════ + +M1 Pro FAIL (MIL compat) +M1 Max FAIL (MIL compat) +M3 Pro ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░ 9.98 +M4 Pro ████████████████████████████████░░░░░░░░░░░░░ 12.57 +M4 Max ██████████████████████░░░░░░░░░░░░░░░░░░░░░░ 10.93 +M5 █████████████████████████░░░░░░░░░░░░░░░░░░░ 12.17 + + 0 3 6 9 12 15 18 + + +ANE Sustained Throughput (TFLOPS, 5s window) +══════════════════════════════════════════════════════════════ + +M3 Pro 
██████████████████████████████████████████████ 15.04 (95.2%) + + 0 3 6 9 12 15 18 + (Only M3 Pro submitted sustained benchmark) +``` + +## Key Findings + +### M1/M1 Pro/M1 Max +- **Standalone benchmarks fail** — `ane_mil_gen.h` single-blob weight format rejected +- **Training works** via `stories_mil.h` (separate per-matrix weight blobs) +- ANE compiler handles weight blobs differently from M4+ +- Training at 148-167 ms/step, ~0.6 TFLOPS + +### M3 Pro +- **Only ch=512 compiles** — 52 channel values tested (1-4096), only 512 accepted +- Fixed 512-wide lane structure in SRAM tiling +- **Peak: 16.77 TFLOPS** (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048 +- **Sustained: 15.04 TFLOPS** over 5 seconds (95.2% utilization) +- Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement) + +### M4 Pro / M4 Max +- Flexible channel support (256/384/512/768+) +- M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step +- M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall) +- `sram_probe` and `inmem_bench` fail on M4 Pro (same MIL compat issue) + +### M5 +- Training works out of the box with existing `program(1.3)` MIL +- Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra) +- Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro) +- ANE appears to be same H16 family as M4 +- **M5 Pro/Max not yet benchmarked** — Fusion Architecture may change ANE behavior + +### Cross-Generation MIL Compatibility + +``` +Feature M1 M3 M4 M5 +───────────────────────────────────────────────────────── +program(1.3) / ios18 PARTIAL YES YES YES +Single-blob weights FAIL YES YES YES +Per-matrix weight blobs YES YES YES YES +Channel flexibility ? 
ch=512 FLEX FLEX +BLOBFILE offset refs FAIL YES YES YES +``` + +## macOS Compatibility Issues + +- **macOS 26.x** — `[MLModel compileModelAtURL:]` broken for standalone benchmarks + (fixed in PR #27: switched to in-memory MIL compilation) +- **macOS 15.x** — Works for all M-series with correct MIL format +- M1 generation requires `stories_mil.h` path, not `ane_mil_gen.h` + +## How to Contribute + +Run on your hardware and post results to [Issue #3](https://github.com/maderix/ANE/issues/3): + +```bash +cd training && make train_large +./train_large ane_stories110M_ckpt.bin 256 20 1e-4 +``` + +Include: chip model, macOS version, full output with JSON lines. + +--- +*Report compiled 2026-03-04 from community submissions.* +*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton* diff --git a/benchmarks/community_results.json b/benchmarks/community_results.json new file mode 100644 index 0000000..e975925 --- /dev/null +++ b/benchmarks/community_results.json @@ -0,0 +1,113 @@ +{ + "report_date": "2026-03-04", + "source": "https://github.com/maderix/ANE/issues/3", + "model": "Stories110M (12-layer transformer, 109M params)", + "config": {"dim": 768, "hidden": 2048, "heads": 12, "seq": 256, "vocab": 32000, "layers": 12}, + "training_results": [ + { + "chip": "M1 Pro", + "cores": "10-core CPU", + "ram_gb": 32, + "macos": "15.0", + "ms_per_step": [148, 163], + "ane_ms": [32, 35], + "compile_ms": [7900, 8500], + "ane_tflops": [0.57, 0.63], + "ane_util_pct": [3.6, 4.0], + "benchmarks_pass": false, + "notes": "Standalone benchmarks fail (MIL compat). 
Training works via stories_mil.h.", + "contributor": "moriwang" + }, + { + "chip": "M1 Max", + "cores": "10-core CPU", + "ram_gb": 64, + "macos": "15.6.1", + "ms_per_step": [143, 167], + "ane_ms": [35, 45], + "compile_ms": [7100, 7100], + "ane_tflops": [0.54, 0.65], + "ane_util_pct": [3.4, 4.1], + "benchmarks_pass": false, + "notes": "Same MIL compat issue as M1 Pro.", + "contributor": "andyg5000" + }, + { + "chip": "M3 Pro", + "cores": "12-core CPU", + "ram_gb": 36, + "macos": "15.7.4", + "peak_tflops": 16.77, + "sustained_tflops": 15.04, + "sustained_util_pct": 95.2, + "channel_constraint": "ch=512 only", + "notes": "Only ch=512 compiles. 52 values tested. Peak at 128x conv 512ch sp2048.", + "contributor": "D-Ogi" + }, + { + "chip": "M4 Pro", + "cores": "unknown", + "ram_gb": null, + "macos": null, + "ms_per_step": [69, 73], + "ane_ms": [8.9, 8.9], + "compile_ms": [3465, 3465], + "ane_tflops": [1.28, 1.28], + "ane_util_pct": [8.1, 8.1], + "peak_tflops_inmem": 12.57, + "notes": "sram_probe and inmem_bench fail. inmem_peak and training work.", + "contributor": "srt54558" + }, + { + "chip": "M4 Max", + "cores": "unknown", + "ram_gb": null, + "macos": null, + "ms_per_step": [64, 64], + "ane_ms": [10.2, 10.2], + "compile_ms": [3531, 3531], + "ane_tflops": [1.45, 1.45], + "ane_util_pct": [9.2, 9.2], + "peak_tflops_inmem": 10.93, + "notes": "Fastest training ms/step overall.", + "contributor": "SethBurkart123" + }, + { + "chip": "M5", + "cores": "10-core (4P+6E)", + "ram_gb": 16, + "macos": "26.3", + "ms_per_step": [101, 120], + "ane_ms": [9.1, 9.8], + "compile_ms": [3200, 3400], + "ane_tflops": [0.77, 0.91], + "ane_util_pct": [4.9, 5.8], + "peak_tflops_inmem": 12.44, + "notes": "H16 ANE family (same as M4). 
Training works with existing program(1.3) MIL.", + "contributor": "GitBubble" + }, + { + "chip": "M5", + "cores": "unknown", + "ram_gb": 32, + "macos": "26.4", + "peak_tflops_inmem": 12.17, + "notes": "inmem_peak only, no training data submitted.", + "contributor": "elijah-pelton" + } + ], + "neural_engine_specs": { + "M1": {"ne_cores": 16, "rated_tops": 11}, + "M1_Max": {"ne_cores": 16, "rated_tops": 11}, + "M1_Ultra": {"ne_cores": 32, "rated_tops": 22}, + "M2": {"ne_cores": 16, "rated_tops": 15.8}, + "M2_Max": {"ne_cores": 16, "rated_tops": 15.8}, + "M2_Ultra": {"ne_cores": 32, "rated_tops": 31.6}, + "M3": {"ne_cores": 16, "rated_tops": 15.8}, + "M3_Max": {"ne_cores": 16, "rated_tops": 15.8}, + "M3_Ultra": {"ne_cores": 32, "rated_tops": 31.6}, + "M4": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"}, + "M4_Max": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"}, + "M5": {"ne_cores": 16, "rated_tops": null, "estimated_tops": 19} + } +} diff --git a/bridge/Makefile b/bridge/Makefile new file mode 100644 index 0000000..753d749 --- /dev/null +++ b/bridge/Makefile @@ -0,0 +1,17 @@ +CC = xcrun clang +CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc -fPIC +FRAMEWORKS = -framework Foundation -framework IOSurface -ldl +TARGET = libane_bridge.dylib + +all: $(TARGET) + +$(TARGET): ane_bridge.m ane_bridge.h + $(CC) $(CFLAGS) -dynamiclib -o $@ ane_bridge.m $(FRAMEWORKS) + +test: test_bridge.m ane_bridge.h $(TARGET) + $(CC) $(CFLAGS) -o test_bridge test_bridge.m -L. 
-lane_bridge $(FRAMEWORKS)
+
+clean:
+	rm -f $(TARGET) test_bridge
+
+.PHONY: all clean test
diff --git a/bridge/ane_bridge.h b/bridge/ane_bridge.h
new file mode 100644
index 0000000..3e8ff47
--- /dev/null
+++ b/bridge/ane_bridge.h
@@ -0,0 +1,87 @@
+// ane_bridge.h — C-callable bridge to ANE private APIs for Python ctypes
+// Wraps _ANEInMemoryModel via private AppleNeuralEngine.framework
+
+#ifndef ANE_BRIDGE_H
+#define ANE_BRIDGE_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// Opaque kernel handle
+typedef struct ANEKernelHandle ANEKernelHandle;
+
+// Initialize ANE runtime (load private framework, resolve classes)
+// Returns 0 on success, -1 on failure
+int ane_bridge_init(void);
+
+// Compile a MIL program with weight blobs into an ANE kernel
+// mil_text: UTF-8 MIL program text
+// mil_len: length of MIL text
+// weight_data: raw weight blob (can be NULL)
+// weight_len: length of weight blob
+// n_inputs: number of input tensors
+// input_sizes: array of byte sizes for each input
+// n_outputs: number of output tensors
+// output_sizes: array of byte sizes for each output
+// Returns kernel handle or NULL on failure
+ANEKernelHandle *ane_bridge_compile(const char *mil_text, size_t mil_len,
+                                    const uint8_t *weight_data, size_t weight_len,
+                                    int n_inputs, const size_t *input_sizes,
+                                    int n_outputs, const size_t *output_sizes);
+
+// Compile with multiple named weight files (for transformer kernels)
+// weight_names: array of weight file paths (e.g. 
"@model_path/weights/wq.bin") +// weight_datas: array of weight data pointers +// weight_lens: array of weight data lengths +// n_weights: number of weight files +ANEKernelHandle *ane_bridge_compile_multi_weights( + const char *mil_text, size_t mil_len, + const char **weight_names, const uint8_t **weight_datas, + const size_t *weight_lens, int n_weights, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes); + +// Evaluate (run) a compiled kernel on ANE +// Returns true on success +bool ane_bridge_eval(ANEKernelHandle *kernel); + +// Write data to kernel input tensor +void ane_bridge_write_input(ANEKernelHandle *kernel, int idx, + const void *data, size_t bytes); + +// Read data from kernel output tensor +void ane_bridge_read_output(ANEKernelHandle *kernel, int idx, + void *data, size_t bytes); + +// Free a compiled kernel and all associated resources +void ane_bridge_free(ANEKernelHandle *kernel); + +// Get compile count (for exec() restart budgeting) +int ane_bridge_get_compile_count(void); + +// Reset compile count +void ane_bridge_reset_compile_count(void); + +// Build a weight blob in ANE format (128-byte header + fp16 data) +// src: float32 weights [rows x cols] +// Returns allocated buffer and sets out_len. Caller must free(). 
+uint8_t *ane_bridge_build_weight_blob(const float *src, int rows, int cols,
+                                      size_t *out_len);
+
+// Build a transposed weight blob in ANE format
+uint8_t *ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols,
+                                                 size_t *out_len);
+
+// Free a blob allocated by ane_bridge_build_weight_blob*
+void ane_bridge_free_blob(void *ptr);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif // ANE_BRIDGE_H
diff --git a/bridge/ane_bridge.m b/bridge/ane_bridge.m
new file mode 100644
index 0000000..2b27ddc
--- /dev/null
+++ b/bridge/ane_bridge.m
@@ -0,0 +1,328 @@
+// ane_bridge.m — Objective-C implementation of ANE bridge for Python ctypes
+// Wraps _ANEInMemoryModel private APIs into C-callable functions
+
+#import <Foundation/Foundation.h>
+#import <IOSurface/IOSurface.h>
+#import <objc/message.h>
+#import <dlfcn.h>
+#import <unistd.h>
+#include "ane_bridge.h"
+
+// --- Private class references ---
+static Class g_ANEDesc = nil;
+static Class g_ANEInMem = nil;
+static Class g_ANEReq = nil;
+static Class g_ANEIO = nil;
+static bool g_initialized = false;
+static int g_compile_count = 0;
+
+// --- Kernel handle struct ---
+struct ANEKernelHandle {
+    id model;               // _ANEInMemoryModel
+    IOSurfaceRef *ioInputs;
+    IOSurfaceRef *ioOutputs;
+    id request;             // _ANERequest
+    NSString *tmpDir;
+    int nInputs, nOutputs;
+    size_t *inputBytes;
+    size_t *outputBytes;
+};
+
+// --- Public API ---
+
+int ane_bridge_init(void) {
+    if (g_initialized) return 0;
+
+    void *handle = dlopen(
+        "/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine",
+        RTLD_NOW);
+    if (!handle) {
+        fprintf(stderr, "ane_bridge: Failed to load AppleNeuralEngine.framework\n");
+        return -1;
+    }
+
+    g_ANEDesc = NSClassFromString(@"_ANEInMemoryModelDescriptor");
+    g_ANEInMem = NSClassFromString(@"_ANEInMemoryModel");
+    g_ANEReq = NSClassFromString(@"_ANERequest");
+    g_ANEIO = NSClassFromString(@"_ANEIOSurfaceObject");
+
+    if (!g_ANEDesc || !g_ANEInMem || !g_ANEReq || !g_ANEIO) {
+        fprintf(stderr, "ane_bridge: Failed to resolve ANE private classes\n");
+        return -1;
+    }
+
+    
g_initialized = true; + g_compile_count = 0; + return 0; +} + +static IOSurfaceRef create_surface(size_t bytes) { + return IOSurfaceCreate((__bridge CFDictionaryRef)@{ + (id)kIOSurfaceWidth: @(bytes), + (id)kIOSurfaceHeight: @1, + (id)kIOSurfaceBytesPerElement: @1, + (id)kIOSurfaceBytesPerRow: @(bytes), + (id)kIOSurfaceAllocSize: @(bytes), + (id)kIOSurfacePixelFormat: @0 + }); +} + +ANEKernelHandle *ane_bridge_compile_multi_weights( + const char *mil_text, size_t mil_len, + const char **weight_names, const uint8_t **weight_datas, + const size_t *weight_lens, int n_weights, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes) +{ + @autoreleasepool { + if (!g_initialized) { + fprintf(stderr, "ane_bridge: Not initialized\n"); + return NULL; + } + + NSData *milData = [NSData dataWithBytes:mil_text length:mil_len]; + NSError *e = nil; + + // Build weight dictionary + NSMutableDictionary *wdict = [NSMutableDictionary dictionary]; + for (int i = 0; i < n_weights; i++) { + NSString *name = [NSString stringWithUTF8String:weight_names[i]]; + NSData *data = [NSData dataWithBytes:weight_datas[i] length:weight_lens[i]]; + wdict[name] = @{@"offset": @0, @"data": data}; + } + + id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)( + g_ANEDesc, @selector(modelWithMILText:weights:optionsPlist:), + milData, wdict.count > 0 ? 
wdict : nil, nil); + if (!desc) { + fprintf(stderr, "ane_bridge: modelWithMILText failed\n"); + return NULL; + } + + id mdl = ((id(*)(Class,SEL,id))objc_msgSend)( + g_ANEInMem, @selector(inMemoryModelWithDescriptor:), desc); + if (!mdl) { + fprintf(stderr, "ane_bridge: inMemoryModelWithDescriptor failed\n"); + return NULL; + } + + // Pre-populate temp dir + id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier)); + NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx]; + NSFileManager *fm = [NSFileManager defaultManager]; + [fm createDirectoryAtPath:[td stringByAppendingPathComponent:@"weights"] + withIntermediateDirectories:YES attributes:nil error:nil]; + [milData writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES]; + + for (int i = 0; i < n_weights; i++) { + NSString *name = [NSString stringWithUTF8String:weight_names[i]]; + // Extract filename from path like "@model_path/weights/wq.bin" -> "weights/wq.bin" + NSString *relPath = name; + if ([name hasPrefix:@"@model_path/"]) { + relPath = [name substringFromIndex:12]; + } + NSString *fullPath = [td stringByAppendingPathComponent:relPath]; + NSString *dir = [fullPath stringByDeletingLastPathComponent]; + [fm createDirectoryAtPath:dir withIntermediateDirectories:YES attributes:nil error:nil]; + NSData *data = [NSData dataWithBytes:weight_datas[i] length:weight_lens[i]]; + [data writeToFile:fullPath atomically:YES]; + } + + // Compile + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + mdl, @selector(compileWithQoS:options:error:), 21, @{}, &e)) { + fprintf(stderr, "ane_bridge: ANE compile failed: %s\n", + e ? 
[[e description] UTF8String] : "unknown"); + [fm removeItemAtPath:td error:nil]; + return NULL; + } + + // Load (with one retry after a brief pause for ANE slot reclamation) + BOOL loaded = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e); + if (!loaded) { + fprintf(stderr, "ane_bridge: ANE load failed (retrying in 100ms): %s\n", + e ? [[e description] UTF8String] : "unknown"); + usleep(100000); // 100ms + e = nil; + loaded = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e); + } + if (!loaded) { + fprintf(stderr, "ane_bridge: ANE load failed after retry: %s\n", + e ? [[e description] UTF8String] : "unknown"); + [fm removeItemAtPath:td error:nil]; + return NULL; + } + + g_compile_count++; + + // Create kernel handle + ANEKernelHandle *k = (ANEKernelHandle *)calloc(1, sizeof(ANEKernelHandle)); + k->model = mdl; + k->tmpDir = td; + k->nInputs = n_inputs; + k->nOutputs = n_outputs; + k->inputBytes = (size_t *)malloc(n_inputs * sizeof(size_t)); + k->outputBytes = (size_t *)malloc(n_outputs * sizeof(size_t)); + memcpy(k->inputBytes, input_sizes, n_inputs * sizeof(size_t)); + memcpy(k->outputBytes, output_sizes, n_outputs * sizeof(size_t)); + + // Create IOSurfaces + k->ioInputs = (IOSurfaceRef *)malloc(n_inputs * sizeof(IOSurfaceRef)); + k->ioOutputs = (IOSurfaceRef *)malloc(n_outputs * sizeof(IOSurfaceRef)); + for (int i = 0; i < n_inputs; i++) + k->ioInputs[i] = create_surface(input_sizes[i]); + for (int i = 0; i < n_outputs; i++) + k->ioOutputs[i] = create_surface(output_sizes[i]); + + // Build request + NSMutableArray *wIns = [NSMutableArray arrayWithCapacity:n_inputs]; + NSMutableArray *iIdx = [NSMutableArray arrayWithCapacity:n_inputs]; + for (int i = 0; i < n_inputs; i++) { + [wIns addObject:((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)( + g_ANEIO, @selector(objectWithIOSurface:), k->ioInputs[i])]; + [iIdx addObject:@(i)]; + } + 
NSMutableArray *wOuts = [NSMutableArray arrayWithCapacity:n_outputs]; + NSMutableArray *oIdx = [NSMutableArray arrayWithCapacity:n_outputs]; + for (int i = 0; i < n_outputs; i++) { + [wOuts addObject:((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)( + g_ANEIO, @selector(objectWithIOSurface:), k->ioOutputs[i])]; + [oIdx addObject:@(i)]; + } + k->request = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)( + g_ANEReq, + @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), + wIns, iIdx, wOuts, oIdx, nil, nil, @0); + + return k; + } +} + +ANEKernelHandle *ane_bridge_compile(const char *mil_text, size_t mil_len, + const uint8_t *weight_data, size_t weight_len, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes) { + if (weight_data && weight_len > 0) { + const char *name = "@model_path/weights/weight.bin"; + return ane_bridge_compile_multi_weights( + mil_text, mil_len, + &name, &weight_data, &weight_len, 1, + n_inputs, input_sizes, + n_outputs, output_sizes); + } else { + return ane_bridge_compile_multi_weights( + mil_text, mil_len, + NULL, NULL, NULL, 0, + n_inputs, input_sizes, + n_outputs, output_sizes); + } +} + +bool ane_bridge_eval(ANEKernelHandle *kernel) { + @autoreleasepool { + if (!kernel || !kernel->model) return false; + NSError *e = nil; + return ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + kernel->model, @selector(evaluateWithQoS:options:request:error:), + 21, @{}, kernel->request, &e); + } +} + +void ane_bridge_write_input(ANEKernelHandle *kernel, int idx, + const void *data, size_t bytes) { + if (!kernel || idx < 0 || idx >= kernel->nInputs) return; + IOSurfaceLock(kernel->ioInputs[idx], 0, NULL); + memcpy(IOSurfaceGetBaseAddress(kernel->ioInputs[idx]), data, bytes); + IOSurfaceUnlock(kernel->ioInputs[idx], 0, NULL); +} + +void ane_bridge_read_output(ANEKernelHandle *kernel, int idx, + void *data, size_t bytes) { + if (!kernel || idx < 0 || idx >= 
kernel->nOutputs) return; + IOSurfaceLock(kernel->ioOutputs[idx], kIOSurfaceLockReadOnly, NULL); + memcpy(data, IOSurfaceGetBaseAddress(kernel->ioOutputs[idx]), bytes); + IOSurfaceUnlock(kernel->ioOutputs[idx], kIOSurfaceLockReadOnly, NULL); +} + +void ane_bridge_free(ANEKernelHandle *kernel) { + @autoreleasepool { + if (!kernel) return; + NSError *e = nil; + if (kernel->model) { + ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)( + kernel->model, @selector(unloadWithQoS:error:), 21, &e); + } + for (int i = 0; i < kernel->nInputs; i++) + if (kernel->ioInputs[i]) CFRelease(kernel->ioInputs[i]); + for (int i = 0; i < kernel->nOutputs; i++) + if (kernel->ioOutputs[i]) CFRelease(kernel->ioOutputs[i]); + if (kernel->tmpDir) { + [[NSFileManager defaultManager] removeItemAtPath:kernel->tmpDir error:nil]; + } + free(kernel->ioInputs); + free(kernel->ioOutputs); + free(kernel->inputBytes); + free(kernel->outputBytes); + + // Explicitly nil Objective-C objects to trigger ARC release before freeing struct + kernel->model = nil; + kernel->request = nil; + kernel->tmpDir = nil; + + free(kernel); + } +} + +int ane_bridge_get_compile_count(void) { + return g_compile_count; +} + +void ane_bridge_reset_compile_count(void) { + g_compile_count = 0; +} + +uint8_t *ane_bridge_build_weight_blob(const float *src, int rows, int cols, + size_t *out_len) { + int wsize = rows * cols * 2; // fp16 + int total = 128 + wsize; + uint8_t *buf = (uint8_t *)calloc(total, 1); + + // ANE blob header + buf[0] = 0x01; buf[4] = 0x02; + buf[64] = 0xEF; buf[65] = 0xBE; buf[66] = 0xAD; buf[67] = 0xDE; + buf[68] = 0x01; + *(uint32_t*)(buf + 72) = wsize; + *(uint32_t*)(buf + 80) = 128; + + // Convert float32 -> float16 + _Float16 *fp16 = (_Float16 *)(buf + 128); + for (int i = 0; i < rows * cols; i++) { + fp16[i] = (_Float16)src[i]; + } + + *out_len = total; + return buf; +} + +uint8_t *ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols, + size_t *out_len) { + int wsize = rows 
* cols * 2;
+    int total = 128 + wsize;
+    uint8_t *buf = (uint8_t *)calloc(total, 1);
+
+    buf[0] = 0x01; buf[4] = 0x02;
+    buf[64] = 0xEF; buf[65] = 0xBE; buf[66] = 0xAD; buf[67] = 0xDE;
+    buf[68] = 0x01;
+    *(uint32_t*)(buf + 72) = wsize;
+    *(uint32_t*)(buf + 80) = 128;
+
+    _Float16 *fp16 = (_Float16 *)(buf + 128);
+    for (int i = 0; i < rows; i++)
+        for (int j = 0; j < cols; j++)
+            fp16[j * rows + i] = (_Float16)src[i * cols + j];
+
+    *out_len = total;
+    return buf;
+}
diff --git a/bridge/libane_bridge.dylib b/bridge/libane_bridge.dylib
new file mode 100755
index 0000000..72acc32
Binary files /dev/null and b/bridge/libane_bridge.dylib differ
diff --git a/inmem_bench.m b/inmem_bench.m
index 8a5af33..51bd0aa 100644
--- a/inmem_bench.m
+++ b/inmem_bench.m
@@ -1,5 +1,4 @@
 #import <Foundation/Foundation.h>
-#import <CoreML/CoreML.h>
 #import <IOSurface/IOSurface.h>
 #import <objc/message.h>
 #import <mach/mach_time.h>
@@ -9,18 +8,45 @@
 static mach_timebase_info_data_t g_tb;
 static double ticksToMs(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
 
+static NSData *buildWeightBlob(int ch) {
+    NSUInteger wsize = (NSUInteger)ch * ch * 2;
+    NSUInteger total = 64 + 64 + wsize;
+    uint8_t *buf = calloc(total, 1);
+    buf[0] = 0x01; buf[4] = 0x02;
+    uint8_t *chunk = buf + 64;
+    chunk[0]=0xEF; chunk[1]=0xBE; chunk[2]=0xAD; chunk[3]=0xDE;
+    chunk[4]=0x01; chunk[10]=0x08;
+    uint16_t *fp16 = (uint16_t*)(chunk + 64);
+    for (NSUInteger j = 0; j < (NSUInteger)ch * ch; j++)
+        fp16[j] = (arc4random() & 0x03FF) | 0x2000;
+    return [NSData dataWithBytesNoCopy:buf length:total freeWhenDone:YES];
+}
+
+static NSString *genMIL(int ch, int sp) {
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:@"program(1.3)\n[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, {\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, {\"coremltools-version\", \"9.0\"}})]\n{\n"];
+    [m appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", ch, sp];
+    [m appendString:
+        @"  string c_pad_type = const()[name = string(\"c_pad_type\"), val = string(\"valid\")];\n"
+        @"  tensor<int32, [2]> c_strides = const()[name = string(\"c_strides\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  tensor<int32, [4]> c_pad = const()[name = string(\"c_pad\"), val = tensor<int32, [4]>([0, 0, 0, 0])];\n"
+        @"  tensor<int32, [2]> c_dilations = const()[name = string(\"c_dilations\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  int32 c_groups = const()[name = string(\"c_groups\"), val = int32(1)];\n"
+        @"  string to_fp16 = const()[name = string(\"to_fp16\"), val = string(\"fp16\")];\n"];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = to_fp16, x = x)[name = string(\"cast_in\")];\n", ch, sp];
+    [m appendFormat:@"  tensor<fp16, [%d, %d, 1, 1]> W = const()[name = string(\"W\"), val = tensor<fp16, [%d, %d, 1, 1]>(BLOBFILE(path = string(\"@model_path/weights/weight.bin\"), offset = uint64(64)))];\n", ch, ch, ch, ch];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> y16 = conv(dilations = c_dilations, groups = c_groups, pad = c_pad, pad_type = c_pad_type, strides = c_strides, weight = W, x = x16)[name = string(\"conv\")];\n", ch, sp];
+    [m appendString:@"  string to_fp32 = const()[name = string(\"to_fp32\"), val = string(\"fp32\")];\n"];
+    [m appendFormat:@"  tensor<fp32, [1, %d, 1, %d]> y = cast(dtype = to_fp32, x = y16)[name = string(\"cast_out\")];\n", ch, sp];
+    [m appendString:@" } -> (y);\n}\n"];
+    return m;
+}
+
 double benchInMem(int ch, int sp) {
     @autoreleasepool {
         NSError *e = nil;
-        NSString *path = [NSString stringWithFormat:@"/tmp/ane_sram_%dch_%dsp.mlpackage", ch, sp];
-        NSURL *compiled = [MLModel compileModelAtURL:[NSURL fileURLWithPath:path] error:&e];
-        if (e) return -1;
-
-        NSData *milData = [[NSString stringWithContentsOfFile:
-            [[compiled path] stringByAppendingPathComponent:@"model.mil"]
-            encoding:NSUTF8StringEncoding error:nil] dataUsingEncoding:NSUTF8StringEncoding];
-        NSData *weightBlob = [NSData dataWithContentsOfFile:
-            [[compiled path] stringByAppendingPathComponent:@"weights/weight.bin"]];
+        NSData *milData = [[genMIL(ch, sp) dataUsingEncoding:NSUTF8StringEncoding] copy];
+        NSData *wb = buildWeightBlob(ch);
 
         Class Desc = NSClassFromString(@"_ANEInMemoryModelDescriptor");
         Class IMM =
NSClassFromString(@"_ANEInMemoryModel");
@@ -28,7 +54,7 @@
         Class AIO = NSClassFromString(@"_ANEIOSurfaceObject");
 
         NSDictionary *wdict = @{
-            @"@model_path/weights/weight.bin": @{@"offset": @64, @"data": weightBlob}
+            @"@model_path/weights/weight.bin": @{@"offset": @0, @"data": wb}
         };
         id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(
             Desc, @selector(modelWithMILText:weights:optionsPlist:),
@@ -43,7 +69,7 @@
         [fm createDirectoryAtPath:[tmpDir stringByAppendingPathComponent:@"weights"]
             withIntermediateDirectories:YES attributes:nil error:nil];
         [milData writeToFile:[tmpDir stringByAppendingPathComponent:@"model.mil"] atomically:YES];
-        [weightBlob writeToFile:[tmpDir stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES];
+        [wb writeToFile:[tmpDir stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES];
 
         BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(
             model, @selector(compileWithQoS:options:error:), 21, @{}, &e);
diff --git a/sram_bench.m b/sram_bench.m
index 9dc3a35..85b46d5 100644
--- a/sram_bench.m
+++ b/sram_bench.m
@@ -1,5 +1,4 @@
 #import <Foundation/Foundation.h>
-#import <CoreML/CoreML.h>
 #import <IOSurface/IOSurface.h>
 #import <objc/message.h>
 #import <mach/mach_time.h>
@@ -8,25 +7,79 @@
 static mach_timebase_info_data_t g_tb;
 static double ticksToMs(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
 
-static id g_client;
-static Class AM, AR, AIO;
-double bench(const char *path, int ch, int sp) {
+static NSData *buildWeightBlob(int ch) {
+    NSUInteger wsize = (NSUInteger)ch * ch * 2;
+    NSUInteger total = 64 + 64 + wsize;
+    uint8_t *buf = calloc(total, 1);
+    buf[0] = 0x01; buf[4] = 0x02;
+    uint8_t *chunk = buf + 64;
+    chunk[0]=0xEF; chunk[1]=0xBE; chunk[2]=0xAD; chunk[3]=0xDE;
+    chunk[4]=0x01; chunk[10]=0x08;
+    uint16_t *fp16 = (uint16_t*)(chunk + 64);
+    for (NSUInteger j = 0; j < (NSUInteger)ch * ch; j++)
+        fp16[j] = (arc4random() & 0x03FF) | 0x2000;
+    return [NSData dataWithBytesNoCopy:buf length:total freeWhenDone:YES];
+}
+
+static NSString *genMIL(int ch, int sp) {
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:@"program(1.3)\n[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, {\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, {\"coremltools-version\", \"9.0\"}})]\n{\n"];
+    [m appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", ch, sp];
+    [m appendString:
+        @"  string c_pad_type = const()[name = string(\"c_pad_type\"), val = string(\"valid\")];\n"
+        @"  tensor<int32, [2]> c_strides = const()[name = string(\"c_strides\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  tensor<int32, [4]> c_pad = const()[name = string(\"c_pad\"), val = tensor<int32, [4]>([0, 0, 0, 0])];\n"
+        @"  tensor<int32, [2]> c_dilations = const()[name = string(\"c_dilations\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  int32 c_groups = const()[name = string(\"c_groups\"), val = int32(1)];\n"
+        @"  string to_fp16 = const()[name = string(\"to_fp16\"), val = string(\"fp16\")];\n"];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = to_fp16, x = x)[name = string(\"cast_in\")];\n", ch, sp];
+    [m appendFormat:@"  tensor<fp16, [%d, %d, 1, 1]> W = const()[name = string(\"W\"), val = tensor<fp16, [%d, %d, 1, 1]>(BLOBFILE(path = string(\"@model_path/weights/weight.bin\"), offset = uint64(64)))];\n", ch, ch, ch, ch];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> y16 = conv(dilations = c_dilations, groups = c_groups, pad = c_pad, pad_type = c_pad_type, strides = c_strides, weight = W, x = x16)[name = string(\"conv\")];\n", ch, sp];
+    [m appendString:@"  string to_fp32 = const()[name = string(\"to_fp32\"), val = string(\"fp32\")];\n"];
+    [m appendFormat:@"  tensor<fp32, [1, %d, 1, %d]> y = cast(dtype = to_fp32, x = y16)[name = string(\"cast_out\")];\n", ch, sp];
+    [m appendString:@" } -> (y);\n}\n"];
+    return m;
+}
+
+double bench(int ch, int sp) {
     @autoreleasepool {
         NSError *e = nil;
-        NSURL *compiled = [MLModel compileModelAtURL:
-            [NSURL fileURLWithPath:[NSString stringWithUTF8String:path]] error:&e];
-        if (e) return -1;
-        id model = ((id(*)(Class,SEL,id,id))objc_msgSend)(AM, @selector(modelAtURL:key:), compiled, @"s");
-        BOOL ok = ((BOOL(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)(
-            g_client,
@selector(compileModel:options:qos:error:), model, - @{@"kANEFModelType":@"kANEFModelMIL",@"kANEFNetPlistFilenameKey":@"model.mil"}, 21, &e); - if (!ok) return -2; - ok = ((BOOL(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(loadModel:options:qos:error:), model, @{}, 21, &e); - if (!ok) return -3; - - NSUInteger bytes = ch * sp * 4; // FP32 input + NSData *milData = [[genMIL(ch, sp) dataUsingEncoding:NSUTF8StringEncoding] copy]; + NSData *wb = buildWeightBlob(ch); + + Class D = NSClassFromString(@"_ANEInMemoryModelDescriptor"); + Class I = NSClassFromString(@"_ANEInMemoryModel"); + Class AR = NSClassFromString(@"_ANERequest"); + Class AIO = NSClassFromString(@"_ANEIOSurfaceObject"); + + id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)( + D, @selector(modelWithMILText:weights:optionsPlist:), + milData, @{@"@model_path/weights/weight.bin": @{@"offset": @0, @"data": wb}}, nil); + if (!desc) return -2; + + id model = ((id(*)(Class,SEL,id))objc_msgSend)( + I, @selector(inMemoryModelWithDescriptor:), desc); + if (!model) return -3; + + id hexId = ((id(*)(id,SEL))objc_msgSend)(model, @selector(hexStringIdentifier)); + NSString *tmpDir = [NSTemporaryDirectory() stringByAppendingPathComponent:hexId]; + NSFileManager *fm = [NSFileManager defaultManager]; + [fm createDirectoryAtPath:[tmpDir stringByAppendingPathComponent:@"weights"] + withIntermediateDirectories:YES attributes:nil error:nil]; + [milData writeToFile:[tmpDir stringByAppendingPathComponent:@"model.mil"] atomically:YES]; + [wb writeToFile:[tmpDir stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES]; + + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + model, @selector(compileWithQoS:options:error:), 21, @{}, &e)) { + [fm removeItemAtPath:tmpDir error:nil]; return -4; + } + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + model, @selector(loadWithQoS:options:error:), 21, @{}, &e)) { + [fm removeItemAtPath:tmpDir error:nil]; return 
-5; + } + + NSUInteger bytes = ch * sp * 4; IOSurfaceRef ioIn = IOSurfaceCreate((__bridge CFDictionaryRef)@{ (id)kIOSurfaceWidth:@(bytes),(id)kIOSurfaceHeight:@1, (id)kIOSurfaceBytesPerElement:@1,(id)kIOSurfaceBytesPerRow:@(bytes), @@ -35,7 +88,6 @@ (id)kIOSurfaceWidth:@(bytes),(id)kIOSurfaceHeight:@1, (id)kIOSurfaceBytesPerElement:@1,(id)kIOSurfaceBytesPerRow:@(bytes), (id)kIOSurfaceAllocSize:@(bytes),(id)kIOSurfacePixelFormat:@0}); - id wIn = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(AIO, @selector(objectWithIOSurface:), ioIn); id wOut = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(AIO, @selector(objectWithIOSurface:), ioOut); id req = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(AR, @@ -43,19 +95,20 @@ @[wIn], @[@0], @[wOut], @[@0], nil, nil, @0); for (int i = 0; i < 5; i++) - ((BOOL(*)(id,SEL,id,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(evaluateWithModel:options:request:qos:error:), model, @{}, req, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); int iters = 30; uint64_t t0 = mach_absolute_time(); for (int i = 0; i < iters; i++) - ((BOOL(*)(id,SEL,id,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(evaluateWithModel:options:request:qos:error:), model, @{}, req, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); double ms = ticksToMs(mach_absolute_time() - t0) / iters; - ((void(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(unloadModel:options:qos:error:), model, @{}, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)( + model, @selector(unloadWithQoS:error:), 21, &e); CFRelease(ioIn); CFRelease(ioOut); + [fm removeItemAtPath:tmpDir error:nil]; return ms; } } @@ -63,10 +116,6 @@ int main() { mach_timebase_info(&g_tb); 
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
-    g_client = [NSClassFromString(@"_ANEClient") performSelector:@selector(sharedConnection)];
-    AM = NSClassFromString(@"_ANEModel");
-    AR = NSClassFromString(@"_ANERequest");
-    AIO = NSClassFromString(@"_ANEIOSurfaceObject");
 
     printf("=== ANE SRAM Probe: 1x1 Conv with Increasing Weight Size ===\n\n");
     printf("%-25s %8s %8s %8s %10s %8s\n", "Config", "W (MB)", "Act(MB)", "Tot(MB)", "ms/eval", "TFLOPS");
@@ -82,9 +131,7 @@ int main() {
         double tot = w_mb + 2 * a_mb;
         double gflop = 2.0 * ch * ch * sp / 1e9;
-        char path[256];
-        snprintf(path, sizeof(path), "/tmp/ane_sram_%dch_%dsp.mlpackage", ch, sp);
-        double ms = bench(path, ch, sp);
+        double ms = bench(ch, sp);
         double tflops = (ms > 0) ? gflop / ms : -1;
         char label[64];
diff --git a/sram_probe.m b/sram_probe.m
index 0766187..4ca4df6 100644
--- a/sram_probe.m
+++ b/sram_probe.m
@@ -1,5 +1,4 @@
 #import <Foundation/Foundation.h>
-#import <CoreML/CoreML.h>
 #import <IOSurface/IOSurface.h>
 #import <objc/message.h>
 #import <mach/mach_time.h>
@@ -8,20 +7,78 @@
 static mach_timebase_info_data_t g_tb;
 static double ticksToMs(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
 
-static id g_client;
 static Class AM, AR, AIO;
-double bench(const char *path, int ch, int sp) {
+static NSData *buildWeightBlob(int ch) {
+    NSUInteger wsize = (NSUInteger)ch * ch * 2;
+    NSUInteger total = 64 + 64 + wsize;
+    uint8_t *buf = calloc(total, 1);
+    buf[0] = 0x01; buf[4] = 0x02;
+    uint8_t *chunk = buf + 64;
+    chunk[0]=0xEF; chunk[1]=0xBE; chunk[2]=0xAD; chunk[3]=0xDE;
+    chunk[4]=0x01; chunk[10]=0x08;
+    uint16_t *fp16 = (uint16_t*)(chunk + 64);
+    for (NSUInteger j = 0; j < (NSUInteger)ch * ch; j++)
+        fp16[j] = (arc4random() & 0x03FF) | 0x2000;
+    return [NSData dataWithBytesNoCopy:buf length:total freeWhenDone:YES];
+}
+
+static NSString *genMIL(int ch, int sp) {
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:@"program(1.3)\n[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, {\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, {\"coremltools-version\", \"9.0\"}})]\n{\n"];
+    [m appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", ch, sp];
+    [m appendString:
+        @"  string c_pad_type = const()[name = string(\"c_pad_type\"), val = string(\"valid\")];\n"
+        @"  tensor<int32, [2]> c_strides = const()[name = string(\"c_strides\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  tensor<int32, [4]> c_pad = const()[name = string(\"c_pad\"), val = tensor<int32, [4]>([0, 0, 0, 0])];\n"
+        @"  tensor<int32, [2]> c_dilations = const()[name = string(\"c_dilations\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  int32 c_groups = const()[name = string(\"c_groups\"), val = int32(1)];\n"
+        @"  string to_fp16 = const()[name = string(\"to_fp16\"), val = string(\"fp16\")];\n"];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = to_fp16, x = x)[name = string(\"cast_in\")];\n", ch, sp];
+    [m appendFormat:@"  tensor<fp16, [%d, %d, 1, 1]> W = const()[name = string(\"W\"), val = tensor<fp16, [%d, %d, 1, 1]>(BLOBFILE(path = string(\"@model_path/weights/weight.bin\"), offset = uint64(64)))];\n", ch, ch, ch, ch];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> y16 = conv(dilations = c_dilations, groups = c_groups, pad = c_pad, pad_type = c_pad_type, strides = c_strides, weight = W, x = x16)[name = string(\"conv\")];\n", ch, sp];
+    [m appendString:@"  string to_fp32 = const()[name = string(\"to_fp32\"), val = string(\"fp32\")];\n"];
+    [m appendFormat:@"  tensor<fp32, [1, %d, 1, %d]> y = cast(dtype = to_fp32, x = y16)[name = string(\"cast_out\")];\n", ch, sp];
+    [m appendString:@" } -> (y);\n}\n"];
+    return m;
+}
+
+double bench(int ch, int sp) {
     @autoreleasepool {
         NSError *e = nil;
-        NSURL *compiled = [MLModel compileModelAtURL:
-            [NSURL fileURLWithPath:[NSString stringWithUTF8String:path]] error:&e];
-        if (e) return -1;
-        id model = ((id(*)(Class,SEL,id,id))objc_msgSend)(AM, @selector(modelAtURL:key:), compiled, @"s");
-        ((BOOL(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)(
-            g_client, @selector(compileModel:options:qos:error:), model,
-            @{@"kANEFModelType":@"kANEFModelMIL",@"kANEFNetPlistFilenameKey":@"model.mil"}, 21, &e);
-        
((BOOL(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(loadModel:options:qos:error:), model, @{}, 21, &e); + NSData *milData = [[genMIL(ch, sp) dataUsingEncoding:NSUTF8StringEncoding] copy]; + NSData *wb = buildWeightBlob(ch); + + Class D = NSClassFromString(@"_ANEInMemoryModelDescriptor"); + Class I = NSClassFromString(@"_ANEInMemoryModel"); + Class AR = NSClassFromString(@"_ANERequest"); + Class AIO = NSClassFromString(@"_ANEIOSurfaceObject"); + + id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)( + D, @selector(modelWithMILText:weights:optionsPlist:), + milData, @{@"@model_path/weights/weight.bin": @{@"offset": @0, @"data": wb}}, nil); + if (!desc) return -2; + + id model = ((id(*)(Class,SEL,id))objc_msgSend)( + I, @selector(inMemoryModelWithDescriptor:), desc); + if (!model) return -3; + + id hexId = ((id(*)(id,SEL))objc_msgSend)(model, @selector(hexStringIdentifier)); + NSString *tmpDir = [NSTemporaryDirectory() stringByAppendingPathComponent:hexId]; + NSFileManager *fm = [NSFileManager defaultManager]; + [fm createDirectoryAtPath:[tmpDir stringByAppendingPathComponent:@"weights"] + withIntermediateDirectories:YES attributes:nil error:nil]; + [milData writeToFile:[tmpDir stringByAppendingPathComponent:@"model.mil"] atomically:YES]; + [wb writeToFile:[tmpDir stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES]; + + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + model, @selector(compileWithQoS:options:error:), 21, @{}, &e)) { + [fm removeItemAtPath:tmpDir error:nil]; return -4; + } + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + model, @selector(loadWithQoS:options:error:), 21, @{}, &e)) { + [fm removeItemAtPath:tmpDir error:nil]; return -5; + } + NSUInteger bytes = ch * sp * 4; IOSurfaceRef ioIn = IOSurfaceCreate((__bridge CFDictionaryRef)@{ (id)kIOSurfaceWidth:@(bytes),(id)kIOSurfaceHeight:@1, @@ -36,18 +93,22 @@ id req = 
((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(AR, @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), @[wIn], @[@0], @[wOut], @[@0], nil, nil, @0); + for (int i = 0; i < 5; i++) - ((BOOL(*)(id,SEL,id,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(evaluateWithModel:options:request:qos:error:), model, @{}, req, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); + int iters = 50; uint64_t t0 = mach_absolute_time(); for (int i = 0; i < iters; i++) - ((BOOL(*)(id,SEL,id,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(evaluateWithModel:options:request:qos:error:), model, @{}, req, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); double ms = ticksToMs(mach_absolute_time() - t0) / iters; - ((void(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(unloadModel:options:qos:error:), model, @{}, 21, &e); + + ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)( + model, @selector(unloadWithQoS:error:), 21, &e); CFRelease(ioIn); CFRelease(ioOut); + [fm removeItemAtPath:tmpDir error:nil]; return ms; } } @@ -55,9 +116,6 @@ int main() { mach_timebase_info(&g_tb); dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW); - g_client = [NSClassFromString(@"_ANEClient") performSelector:@selector(sharedConnection)]; - AM = NSClassFromString(@"_ANEModel"); AR = NSClassFromString(@"_ANERequest"); - AIO = NSClassFromString(@"_ANEIOSurfaceObject"); printf("=== ANE SRAM Fine Probe (weights only vary, spatial=64) ===\n\n"); printf("%-12s %8s %10s %8s %12s\n", "Channels", "W (MB)", "ms/eval", "TFLOPS", "GFLOPS/MB"); @@ -70,9 +128,7 @@ int main() { int ch = chs[i], sp = sps[i]; double w_mb = (double)ch * ch * 2 / 1024 / 1024; double 
gf = 2.0 * ch * ch * sp / 1e9; - char path[256]; - snprintf(path, sizeof(path), "/tmp/ane_sram_%dch_%dsp.mlpackage", ch, sp); - double ms = bench(path, ch, sp); + double ms = bench(ch, sp); double tf = (ms > 0) ? gf / ms : 0; double eff = (ms > 0) ? tf * 1000 / w_mb : 0; printf("%6d ch %7.1f %8.3f ms %7.2f %10.1f %s\n", diff --git a/training/Makefile b/training/Makefile index 9cc9e34..74b9211 100644 --- a/training/Makefile +++ b/training/Makefile @@ -1,36 +1,58 @@ -CC = xcrun clang -CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc -FRAMEWORKS = -framework Foundation -framework CoreML -framework IOSurface -LDFLAGS = $(FRAMEWORKS) -ldl - -HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h - -train: train.m ane_runtime.h ane_mil_gen.h model.h forward.h backward.h - $(CC) $(CFLAGS) -o $@ train.m $(LDFLAGS) - -train_large: train_large.m $(HEADERS_LARGE) - $(CC) $(CFLAGS) -o $@ train_large.m $(LDFLAGS) -framework Accelerate - -PROBES = test_weight_reload test_perf_stats test_qos_sweep test_ane_advanced - -test_weight_reload: test_weight_reload.m - $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) - -test_perf_stats: test_perf_stats.m - $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) - -test_qos_sweep: test_qos_sweep.m - $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) - -test_ane_advanced: test_ane_advanced.m - $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) - -probes: $(PROBES) - -tokenize: - python3 tokenize.py - -clean: - rm -f train train_large $(PROBES) - -.PHONY: clean tokenize probes +CC = xcrun clang +CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc +FRAMEWORKS = -framework Foundation -framework CoreML -framework IOSurface +LDFLAGS = $(FRAMEWORKS) -ldl + +HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h + +HEADERS_ANE = $(HEADERS_LARGE) ane_rmsnorm_bwd.h ane_classifier.h + +HEADERS_PIPELINE = model_config.h pipeline.h gradient_checkpoint.h + +train: train.m ane_runtime.h ane_mil_gen.h model.h forward.h backward.h + $(CC) $(CFLAGS) -o $@ 
train.m $(LDFLAGS) + +train_large: train_large.m $(HEADERS_LARGE) + $(CC) $(CFLAGS) -o $@ train_large.m $(LDFLAGS) -framework Accelerate + +train_large_ane: train_large_ane.m $(HEADERS_ANE) + $(CC) $(CFLAGS) -o $@ train_large_ane.m $(LDFLAGS) -framework Accelerate + +train_pipeline: train_pipeline.m $(HEADERS_PIPELINE) + $(CC) $(CFLAGS) -o $@ train_pipeline.m $(LDFLAGS) -framework Accelerate + +train_pipeline_live: train_pipeline.m $(HEADERS_PIPELINE) $(HEADERS_LARGE) + $(CC) $(CFLAGS) -DANE_LIVE -o train_pipeline train_pipeline.m $(LDFLAGS) -framework Accelerate + +PROBES = test_weight_reload test_perf_stats test_qos_sweep test_ane_advanced + +test_rmsnorm_bwd: test_rmsnorm_bwd.m $(HEADERS_ANE) + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Accelerate + +test_classifier: test_classifier.m $(HEADERS_ANE) + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Accelerate + +test_weight_reload: test_weight_reload.m + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) + +test_perf_stats: test_perf_stats.m + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) + +test_qos_sweep: test_qos_sweep.m + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) + +test_ane_advanced: test_ane_advanced.m + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) + +probes: $(PROBES) + +tokenize: + python3 tokenize.py + +test_pipeline_unit: test_pipeline_unit.c $(HEADERS_PIPELINE) + cc -O2 -Wall -o $@ $< -lm + +clean: + rm -f train train_large train_large_ane train_pipeline test_pipeline_unit $(PROBES) test_rmsnorm_bwd test_classifier + +.PHONY: clean tokenize probes diff --git a/training/README.md b/training/README.md index 53edbb9..a3f33eb 100644 --- a/training/README.md +++ b/training/README.md @@ -8,62 +8,136 @@ Training a 109M-parameter Llama2-architecture transformer (Stories110M) directly - **Model**: Stories110M — dim=768, hidden=2048, heads=12, layers=12, vocab=32000, seq=256 - **109.53M params** (84.95M transformer + 24.58M embedding) -- **72 ANE kernels** per compile (60 weight-bearing, 12 weight-free sdpaBwd2) -- **6 kernel types per layer**: 
fwdAttn, fwdFFN, ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd +- **SDPA causal mask workaround**: ANE hardware ignores attn_mask — decompose into Q@K^T (ANE conv) + mask+softmax (CPU) + scores@V (ANE conv) -## Performance +## Three Training Pipelines -| Component | Time (ms/step) | -|-----------|---------------| -| ANE eval | 9.6 | -| IO (fp16 conversion) | 4.1 | -| Classifier (cblas) | 9.1 | -| Cross-entropy + residuals | 14.4 | -| RMSNorm | 0.1 | -| **Total** | **107 ms/step** | +### 1. Static Baseline (`train_large`) +Original pipeline. Weights baked as constants in MIL kernels — recompile every 10 steps via `exec()` restart. + +- 60 weight-bearing + 12 weight-free kernels = 72 per compile batch +- Classifier + softmax + RMSNorm backward on CPU +- **106.7 ms/step**, 7.6s compile per restart + +### 2. Static + ANE Extras (`train_large_ane`) — PR#19 +Offloads classifier forward (32K conv), softmax, final RMSNorm, and RMSNorm backward to ANE. Bridge API for C-callable ANE access. + +- 86 kernels per compile batch (+24 rmsnorm_bwd, +1 classifier, +1 finalRms) +- **91.8 ms/step** (14% faster), 9.6s compile per restart +- Use `--no-ane-extras` to disable and fall back to CPU (for debugging) + +### 3. Dynamic Weight Pipeline (`training_dynamic/`) +Weights passed via IOSurface spatial dimension — compile 9 kernels once at startup, no recompilation needed. 
+ +- 9 shared kernels across all 12 layers +- **111 ms/step**, 0.4s one-time compile +- No exec() restart, no compile limit issues + +## Performance Comparison (20 Steps) + +| | Static Baseline | PR#19 + ANE extras | PR#19 no extras | Dynamic | +|---|---|---|---|---| +| **Wall time** | **10.1s** | **11.7s** | **10.7s** | **~2.6s** | +| Compile | 7.6s (75.7%) | 9.6s (81.6%) | 7.5s (69.7%) | 0.4s (15%) | +| Train | 2.1s (21.2%) | 1.8s (15.6%) | 2.9s (27.4%) | 2.2s (85%) | +| **ms/step** | **106.7** | **91.8** | **147.0** | **111** | +| Kernels/restart | 72 | 86 | 60 | 9 (once) | +| ANE TFLOPS | 0.87 | 1.15 | 0.72 | — | +| Total TFLOPS | 1.63 | 1.90 | 1.19 | — | + +**Key insights:** +- Dynamic wins on wall time for any practical run length (3.9x faster at 20 steps) +- PR#19 has the best per-step throughput (92ms) but compile overhead dominates short runs +- Static restarts every 10 steps, so dynamic's zero-recompile advantage compounds ## Files | File | Description | |------|-------------| -| `train_large.m` | Main training loop — 12-layer forward/backward, checkpoint, exec() restart | -| `stories_config.h` | Model config, structs, alloc helpers | +| `train_large.m` | Static baseline — 72 kernels, classifier/softmax on CPU | +| `train_large_ane.m` | PR#19 — 86 kernels, classifier/softmax/rmsnorm_bwd on ANE | +| `training_dynamic/train.m` | Dynamic pipeline — 9 kernels, weights via IOSurface | +| `training_dynamic/mil_dynamic.h` | MIL generators for dynamic weight kernels | +| `training_dynamic/config.h` | Model config (DIM=768, HIDDEN=2048, etc.) 
| +| `training_dynamic/io.h` | IOSurface I/O + MIL compilation helpers | +| `training_dynamic/cpu_ops.h` | CPU ops (SiLU backward, cross-entropy, Adam) | +| `stories_config.h` | Static pipeline config, structs, alloc helpers | | `stories_io.h` | IOSurface I/O, NEON fp16 conversion, kernel compile/eval | -| `stories_mil.h` | MIL program generators for all 6 ANE kernel types | -| `stories_cpu_ops.h` | vDSP-vectorized RMSNorm, cross-entropy, Adam, embedding ops | -| `dashboard.py` | TUI dashboard — loss curve, power/CPU/memory graphs, text generation | -| `tokenize.py` | Extract pretokenized TinyStories data | +| `stories_mil.h` | MIL generators for static pipeline (6 kernel types) | +| `stories_cpu_ops.h` | vDSP-vectorized RMSNorm, cross-entropy, Adam | +| `ane_classifier.h` | ANE classifier fwd (32K conv), softmax kernels | +| `ane_rmsnorm_bwd.h` | ANE rmsnorm backward kernel | +| `dashboard.py` | TUI dashboard — loss curve, power/CPU/memory graphs | | `Makefile` | Build targets | -## How it works - -1. **Forward pass**: Each layer runs fwdAttn (QKV + SDPA + Wo) and fwdFFN (W1 + SiLU(W3) + W2) on ANE via MIL-compiled kernels. Final RMSNorm + classifier matmul on CPU (cblas). +## Usage -2. **Backward pass**: Reverse layer order. ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd on ANE. Weight gradients (dW) via async cblas_sgemm on CPU. RMSNorm backward via vDSP. +### 1. Download Training Data -3. **Compile budget**: ANE has a ~119 compile limit per process. With 72 kernels per batch, we run 10 accumulation steps then `exec()` restart with checkpoint resume. +```bash +bash download_data.sh +``` -4. **Data**: Real TinyStories text (20M tokens), mmap'd uint16 token IDs, random position sampling per step. +Downloads pretokenized TinyStories (Llama 2 BPE, 32K vocab) from HuggingFace. Produces `tinystories_data00.bin` (~41 MB, ~20M tokens). -## Usage +### 2. 
Build & Train ```bash -# Extract tokenized data -python3 tokenize.py +# Static baseline (classifier + softmax on CPU) +make train_large +./train_large stories110M.bin 256 100 1e-4 +./train_large --model stories110M.bin --steps 100 --lr 1e-4 +./train_large --data ./tinystories_data00.bin --steps 100 --lr 1e-4 + +# PR#19: ANE-offloaded classifier + softmax + rmsnorm_bwd +make train_large_ane +./train_large_ane stories110M.bin 256 100 1e-4 +./train_large_ane --no-ane-extras --steps 100 # disable ANE extras +./train_large_ane --data ./tinystories_data00.bin --steps 100 --lr 1e-4 + +# Dynamic pipeline (no recompilation) +cd training_dynamic && make train +./train --scratch # train from random init +./train # resume from checkpoint +./train --steps 200 --lr 1e-4 # custom steps/lr +``` -# Build and train -make train_large -./train_large # fresh start -./train_large --resume # resume from checkpoint +**CLI flags (`train_large` / `train_large_ane`):** +- `--steps N` (default 10000) +- `--lr F` (default 3e-4) +- `--model PATH` — pretrained weights file +- `--data PATH` — tokenized TinyStories `.bin` file (default: `tinystories_data00.bin`) +- `--ckpt PATH` — checkpoint file (preserved across exec() restarts) +- `--resume` — resume from checkpoint +- `--no-ane-extras` — (train_large_ane only) disable ANE classifier/softmax/rmsnorm_bwd -# Monitor with dashboard +### 3. Monitor with Dashboard + +```bash pip install blessed psutil numpy -python3 dashboard.py --resume # needs sudo for powermetrics +sudo python3 dashboard.py # static pipeline +sudo python3 dashboard.py --dynamic # dynamic pipeline +``` + +### 4. 
Benchmarking + +All programs print an **Efficiency Report** at completion: + +``` +=== Efficiency Report === +Total steps: 20 +Wall time: 11738 ms (11.7 s) +Compile time: 9583 ms (81.6%) +Train time: 1835 ms (15.6%) +Avg train: 91.8 ms/step +ANE TFLOPS: 1.15 sustained ``` -## Key techniques +## Key Techniques -- **NEON vectorized fp16<->fp32**: ARM NEON intrinsics for fast IOSurface data transfer +- **NEON vectorized fp16↔fp32**: ARM NEON intrinsics for fast IOSurface data transfer - **vDSP cross-entropy**: `vDSP_mtrans` + `vvexpf` + `vDSP_sve` — 8x faster than scalar - **Async weight gradients**: cblas_sgemm dispatched to background queue, overlapped with ANE -- **SDPA causal mask workaround**: ANE hardware ignores attn_mask, so we decompose attention into Q@K^T (ANE conv) + mask+softmax (CPU) + scores@V (ANE conv) +- **Vocab compaction** (dynamic): 32K → 9.2K active tokens, 3.5x reduction in classifier work +- **Dynamic weight packing**: Activations + weights concatenated in IOSurface spatial dimension — one kernel serves all 12 layers +- **exec() restart**: Workaround for ANE ~119 compile limit per process diff --git a/training/ane_classifier.h b/training/ane_classifier.h new file mode 100644 index 0000000..1b1b0e8 --- /dev/null +++ b/training/ane_classifier.h @@ -0,0 +1,102 @@ +// ane_classifier.h — MIL generators for classifier matmul and softmax on ANE +// Replaces classifier cblas_sgemm and cross-entropy softmax from CPU +#pragma once +#include "stories_mil.h" + +// ============================================================ +// Classifier forward: logits = embed @ x_final +// embed: [VOCAB, DIM] baked as conv weight [VOCAB, DIM, 1, 1] +// x: [1, DIM, 1, SEQ] input +// out: [1, VOCAB, 1, SEQ] logits +// +// VOCAB=32000 output channels — this is the largest conv we've attempted. +// If it fails, we'll need to tile into smaller chunks. 
+// ============================================================ +static NSString *gen_classifier_fwd(void) { + NSMutableString *m = [NSMutableString string]; + [m appendString:MIL_HDR]; + [m appendFormat:@" func main(tensor x) {\n", DIM, SEQ]; + [m appendString:@CONV_CONST]; + [m appendFormat:@" tensor We = const()[name=string(\"We\"), " + "val=tensor(BLOBFILE(path=string(\"@model_path/weights/embed.bin\"), offset=uint64(64)))];\n", + VOCAB, DIM, VOCAB, DIM]; + [m appendFormat:@" tensor out = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=We,x=x)[name=string(\"cls\")];\n", VOCAB, SEQ]; + [m appendString:@" } -> (out);\n}\n"]; + return m; +} + +// ============================================================ +// Classifier backward: dx = embed^T @ dlogits +// ANE rejects conv with 32000 input channels. +// Use matmul instead: reshape dlogits to [1, VOCAB, SEQ], +// bake embed^T as [1, DIM, VOCAB], matmul → [1, DIM, SEQ], +// reshape back to [1, DIM, 1, SEQ]. +// ============================================================ +static NSString *gen_classifier_bwd(void) { + NSMutableString *m = [NSMutableString string]; + [m appendString:MIL_HDR]; + [m appendFormat:@" func main(tensor dl) {\n", VOCAB, SEQ]; + // Reshape dlogits from [1, VOCAB, 1, SEQ] to [1, VOCAB, SEQ] + [m appendFormat:@" tensor sh3 = const()[name=string(\"sh3\"), val=tensor([1,%d,%d])];\n", VOCAB, SEQ]; + [m appendFormat:@" tensor dl3 = reshape(shape=sh3,x=dl)[name=string(\"rdl\")];\n", VOCAB, SEQ]; + // embed_t as baked constant [1, DIM, VOCAB] + [m appendFormat:@" tensor Wet = const()[name=string(\"Wet\"), " + "val=tensor(BLOBFILE(path=string(\"@model_path/weights/embed_t.bin\"), offset=uint64(64)))];\n", + DIM, VOCAB, DIM, VOCAB]; + // matmul: [1, DIM, VOCAB] @ [1, VOCAB, SEQ] -> [1, DIM, SEQ] + [m appendString:@" bool bF = const()[name=string(\"bF\"), val=bool(false)];\n"]; + [m appendFormat:@" tensor dx3 = 
matmul(transpose_x=bF,transpose_y=bF,x=Wet,y=dl3)[name=string(\"mm\")];\n", DIM, SEQ];
+    // Reshape back to [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<int32, [4]> sh4 = const()[name=string(\"sh4\"), val=tensor<int32, [4]>([1,%d,1,%d])];\n", DIM, SEQ];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> out = reshape(shape=sh4,x=dx3)[name=string(\"out\")];\n", DIM, SEQ];
+    [m appendString:@" } -> (out);\n}\n"];
+    return m;
+}
+
+// ============================================================
+// Softmax over VOCAB dimension (channel axis) for cross-entropy
+// Input: logits [1, VOCAB, 1, SEQ]
+// Output: probs [1, VOCAB, 1, SEQ]
+//
+// softmax(x, axis=1) = exp(x - max(x)) / sum(exp(x - max(x)))
+//
+// Note: After getting probs from ANE, the NLL loss + gradient
+// (prob[target] -= 1.0) are done on CPU since they need target indexing.
+// ============================================================
+static NSString *gen_softmax_vocab(void) {
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:MIL_HDR];
+    [m appendFormat:@" func main(tensor<fp16, [1,%d,1,%d]> x) {\n", VOCAB, SEQ];
+    [m appendString:@" int32 ax = const()[name=string(\"ax\"), val=int32(1)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> out = softmax(axis=ax,x=x)[name=string(\"sm\")];\n", VOCAB, SEQ];
+    [m appendString:@" } -> (out);\n}\n"];
+    return m;
+}
+
+// ============================================================
+// Final RMSNorm on ANE (replaces CPU rmsnorm for final layer)
+// Input: x [1, DIM, 1, SEQ]
+// Baked: rms_final weights [DIM]
+// Output: xn [1, DIM, 1, SEQ]
+// ============================================================
+static NSString *gen_final_rmsnorm(void) {
+    float invd = 1.0f/(float)DIM;
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:MIL_HDR];
+    [m appendFormat:@" func main(tensor<fp16, [1,%d,1,%d]> x) {\n", DIM, SEQ];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> sq = mul(x=x,y=x)[name=string(\"sq\")];\n", DIM, SEQ];
+    [m appendFormat:@" tensor<int32, [1]> rax = const()[name=string(\"rax\"), val=tensor<int32, [1]>([1])];\n"];
+    [m appendFormat:@" bool kd = 
const()[name=string(\"kd\"), val=bool(true)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss = reduce_sum(x=sq,axes=rax,keep_dims=kd)[name=string(\"ss\")];\n", SEQ];
+    [m appendFormat:@" fp16 invd = const()[name=string(\"invd\"), val=fp16(%f)];\n", invd];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss2 = mul(x=ss,y=invd)[name=string(\"ss2\")];\n", SEQ];
+    [m appendFormat:@" fp16 eps = const()[name=string(\"eps\"), val=fp16(0.00001)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss3 = add(x=ss2,y=eps)[name=string(\"ss3\")];\n", SEQ];
+    [m appendFormat:@" fp16 nhalf = const()[name=string(\"nhalf\"), val=fp16(-0.5)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> rrms = pow(x=ss3,y=nhalf)[name=string(\"rrms\")];\n", SEQ];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> xr = mul(x=x,y=rrms)[name=string(\"xr\")];\n", DIM, SEQ];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,1]> rw = const()[name=string(\"rw\"), val=tensor<fp16, [1,%d,1,1]>(BLOBFILE(path=string(\"@model_path/weights/rms_w.bin\"), offset=uint64(64)))];\n", DIM, DIM];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> out = mul(x=xr,y=rw)[name=string(\"out\")];\n", DIM, SEQ];
+    [m appendString:@" } -> (out);\n}\n"];
+    return m;
+}
diff --git a/training/ane_rmsnorm_bwd.h b/training/ane_rmsnorm_bwd.h
new file mode 100644
index 0000000..eb51896
--- /dev/null
+++ b/training/ane_rmsnorm_bwd.h
@@ -0,0 +1,78 @@
+// ane_rmsnorm_bwd.h — MIL generator for RMSNorm backward on ANE
+// Replaces CPU rmsnorm_bwd() from stories_cpu_ops.h
+//
+// RMSNorm forward: xn = x * rrms * w, where rrms = 1/sqrt(mean(x²) + eps)
+// RMSNorm backward: dx = rrms * (dy*w - x * sum(dy*w*x) * invd * rrms²)
+//
+// Input: concat(dy, x) as [1, 2*DIM, 1, SEQ]
+// Baked: RMSNorm weights w [1, DIM, 1, 1] as BLOBFILE
+// Output: dx [1, DIM, 1, SEQ]
+//
+// Note: dw (weight gradient) stays on CPU — it requires reduce_sum over SEQ
+// and accumulation across steps, which is cheap and better done on CPU.
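The backward expression is easy to get wrong (the weight w multiplies dy, but not the x correction term). As a sanity check, here is a small standalone sketch in plain Python, with toy dimensions and not part of the build, that compares the dx formula this graph implements against central finite differences:

```python
import math

D = 4          # toy feature dim (the real kernel uses DIM)
EPS = 1e-5

def rmsnorm(x, w):
    """Forward: y_i = x_i * rrms * w_i, rrms = 1/sqrt(mean(x^2) + eps)."""
    rrms = 1.0 / math.sqrt(sum(v * v for v in x) / D + EPS)
    return [xi * rrms * wi for xi, wi in zip(x, w)]

def rmsnorm_bwd(dy, x, w):
    """Backward as built above: dx = rrms * (dy*w - x * sum(dy*w*x)/D * rrms^2)."""
    rrms = 1.0 / math.sqrt(sum(v * v for v in x) / D + EPS)
    dot = sum(d * wi * xi for d, wi, xi in zip(dy, w, x))
    return [rrms * (d * wi - xi * dot / D * rrms * rrms)
            for d, wi, xi in zip(dy, w, x)]

x  = [0.3, -1.2, 0.7, 2.0]
w  = [1.1, 0.9, 1.0, 0.8]
dy = [0.5, -0.25, 1.0, 0.1]
dx = rmsnorm_bwd(dy, x, w)

# Central finite differences on the scalar L = sum(dy * rmsnorm(x, w))
h = 1e-6
for i in range(D):
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    lp = sum(d * y for d, y in zip(dy, rmsnorm(xp, w)))
    lm = sum(d * y for d, y in zip(dy, rmsnorm(xm, w)))
    fd = (lp - lm) / (2 * h)
    assert abs(dx[i] - fd) < 1e-6, (i, dx[i], fd)
```

The dw gradient is deliberately omitted here, matching the header's decision to keep it on CPU.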
+#pragma once
+#include "stories_mil.h"
+
+// Generate MIL for RMSNorm backward
+// Input: concat(dy, x) [1, 2*DIM, 1, SEQ]
+// Baked weights: rms_w [DIM] — the RMSNorm scale weights
+// Output: dx [1, DIM, 1, SEQ]
+static NSString *gen_rmsnorm_bwd(void) {
+    float invd = 1.0f / (float)DIM;
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:MIL_HDR];
+
+    // Input: concat of dy and x along channel dimension
+    [m appendFormat:@" func main(tensor<fp16, [1,%d,1,%d]> inp) {\n", 2*DIM, SEQ];
+
+    // Slice out dy [1, DIM, 1, SEQ] and x [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<int32, [4]> sz = const()[name=string(\"sz\"), val=tensor<int32, [4]>([1,%d,1,%d])];\n", DIM, SEQ];
+    [m appendString:@" tensor<int32, [4]> b0 = const()[name=string(\"b0\"), val=tensor<int32, [4]>([0,0,0,0])];\n"];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> dy = slice_by_size(x=inp,begin=b0,size=sz)[name=string(\"sdy\")];\n", DIM, SEQ];
+    [m appendFormat:@" tensor<int32, [4]> b1 = const()[name=string(\"b1\"), val=tensor<int32, [4]>([0,%d,0,0])];\n", DIM];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> x = slice_by_size(x=inp,begin=b1,size=sz)[name=string(\"sx\")];\n", DIM, SEQ];
+
+    // Step 1: Compute rrms = 1/sqrt(mean(x²) + eps)
+    // sq = x * x
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> sq = mul(x=x,y=x)[name=string(\"sq\")];\n", DIM, SEQ];
+    // ss = sum(sq, axis=1, keepdims=true) → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<int32, [1]> rax = const()[name=string(\"rax\"), val=tensor<int32, [1]>([1])];\n"];
+    [m appendFormat:@" bool kd = const()[name=string(\"kd\"), val=bool(true)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss = reduce_sum(x=sq,axes=rax,keep_dims=kd)[name=string(\"ss\")];\n", SEQ];
+    // ss2 = ss * invd + eps
+    [m appendFormat:@" fp16 invd = const()[name=string(\"invd\"), val=fp16(%f)];\n", invd];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss2 = mul(x=ss,y=invd)[name=string(\"ss2\")];\n", SEQ];
+    [m appendFormat:@" fp16 eps = const()[name=string(\"eps\"), val=fp16(0.00001)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss3 = add(x=ss2,y=eps)[name=string(\"ss3\")];\n", SEQ];
+    // rrms = pow(ss3, -0.5) → [1,1,1,SEQ]
+    [m appendFormat:@" fp16 nhalf = 
const()[name=string(\"nhalf\"), val=fp16(-0.5)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> rrms = pow(x=ss3,y=nhalf)[name=string(\"rrms\")];\n", SEQ];
+
+    // Step 2: Load RMSNorm weights w [1, DIM, 1, 1]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,1]> w = const()[name=string(\"w\"), val=tensor<fp16, [1,%d,1,1]>(BLOBFILE(path=string(\"@model_path/weights/rms_w.bin\"), offset=uint64(64)))];\n", DIM, DIM];
+
+    // Step 3: Compute dot = sum(dy * w * x, axis=1) * invd * rrms²
+    // dyw = dy * w → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> dyw = mul(x=dy,y=w)[name=string(\"dyw\")];\n", DIM, SEQ];
+    // dywx = dyw * x → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> dywx = mul(x=dyw,y=x)[name=string(\"dywx\")];\n", DIM, SEQ];
+    // dot_sum = sum(dywx, axis=1, keepdims=true) → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> dot_sum = reduce_sum(x=dywx,axes=rax,keep_dims=kd)[name=string(\"ds\")];\n", SEQ];
+    // dot_scaled = dot_sum * invd → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> dot_sc = mul(x=dot_sum,y=invd)[name=string(\"dsc\")];\n", SEQ];
+    // rrms_sq = rrms * rrms → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> rrms2 = mul(x=rrms,y=rrms)[name=string(\"rr2\")];\n", SEQ];
+    // coeff = dot_scaled * rrms_sq → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> coeff = mul(x=dot_sc,y=rrms2)[name=string(\"cof\")];\n", SEQ];
+
+    // Step 4: dx = (dy * w - x * coeff) * rrms
+    // x_coeff = x * coeff → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> xc = mul(x=x,y=coeff)[name=string(\"xc\")];\n", DIM, SEQ];
+    // diff = dyw - xc → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> diff = sub(x=dyw,y=xc)[name=string(\"dif\")];\n", DIM, SEQ];
+    // dx = diff * rrms → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> out = mul(x=diff,y=rrms)[name=string(\"out\")];\n", DIM, SEQ];
+
+    [m appendString:@" } -> (out);\n}\n"];
+    return m;
+}
diff --git a/training/ane_runtime.h b/training/ane_runtime.h
index 585d0f0..58bcb79 100644
--- a/training/ane_runtime.h
+++ b/training/ane_runtime.h
@@ -141,9 +141,14 @@ static void ane_read_output(ANEKernel *k, int idx, void *data, size_t 
bytes) { static bool ane_eval(ANEKernel *k) { NSError *e = nil; - return ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( k->model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, k->request, &e); + if (!ok) { + fprintf(stderr, "ANE eval failed: %s\n", + e ? [[e description] UTF8String] : "unknown error"); + } + return ok; } static void ane_free(ANEKernel *k) { diff --git a/training/dashboard.py b/training/dashboard.py index a3a1503..18203d7 100644 --- a/training/dashboard.py +++ b/training/dashboard.py @@ -1,6 +1,6 @@ """TUI dashboard for ANE training (train_large). Uses blessed for terminal UI.""" -import argparse, fcntl, math, os, re, select, signal, struct, subprocess, sys, time, threading +import argparse, fcntl, json, math, os, re, select, signal, struct, subprocess, sys, time, threading from collections import deque from pathlib import Path @@ -20,7 +20,9 @@ DIM, HIDDEN, HEADS, SEQ, VOCAB, NLAYERS = 768, 2048, 12, 256, 32000, 12 HD = DIM // HEADS -CKPT_PATH = 'ane_stories110M_ckpt.bin' +CKPT_PATH_STATIC = 'ane_stories110M_ckpt.bin' +CKPT_PATH_DYNAMIC = 'training_dynamic/ane_stories110M_dyn_ckpt.bin' +CKPT_PATH = CKPT_PATH_STATIC # set in main() based on --dynamic TOKENIZER_PATH = str(Path(__file__).resolve().parent.parent.parent / 'assets' / 'models' / 'tokenizer.bin') @@ -56,6 +58,9 @@ def __init__(self): self.mem_mb_history = deque(maxlen=300) self.proc_mem_mb_history = deque(maxlen=300) self.train_pid = None + self.step_timestamps = [] # (step, time.monotonic()) for running ms/step + self.train_start = None # wall clock when first step seen + self.compile_ms = 0.0 # total compile time S = State() @@ -142,7 +147,7 @@ def softmax(x): e = np.exp(x) return e / np.sum(e) -def generate_text(W, tok, max_tokens=64, temperature=0.8): +def generate_text(W, max_tokens=64, temperature=0.8): tokenizer = get_tokenizer() if tokenizer is None: return '[no tokenizer]' @@ 
-244,7 +249,7 @@ def generation_thread(): with S.gen_lock: S.gen_status = 'idle' continue - text = generate_text(W, get_tokenizer(), max_tokens=64, temperature=0.8) + text = generate_text(W, max_tokens=64, temperature=0.8) with S.gen_lock: S.gen_text = text S.gen_step = S.step @@ -278,23 +283,69 @@ def sysmetrics_thread(): RE_CONFIG = re.compile(r'dim=(\d+) hidden=(\d+) heads=(\d+) seq=(\d+) vocab=(\d+) layers=(\d+)') RE_PARAMS = re.compile(r'Params: ([\d.]+)M \(transformer ([\d.]+)M \+ embed ([\d.]+)M\)') RE_KERNELS = re.compile(r'Kernels: (\d+).*?(\d+) weight-bearing') +RE_KERNELS_DYN = re.compile(r'Kernels: (\d+) compiled, (\d+) weight-bearing') RE_ACCUM = re.compile(r'Accum (\d+).*LR=([\d.e+-]+)') -RE_STEP = re.compile(r'step\s+(\d+)\s+loss=([\d.]+)') +RE_STEP = re.compile(r'step\s+(\d+)\s+loss=([\d.]+)(?:\s+lr=([\d.e+-]+))?(?:\s+([\d.]+)ms/step)?') RE_BATCH = re.compile(r'\[batch (\d+): compile=([\d.]+)ms train=([\d.]+)ms \(([\d.]+)ms/step\) compiles=(\d+)\]') RE_TIMING = re.compile(r'ane=([\d.]+) io=([\d.]+) cls=([\d.]+) elem=([\d.]+) rms=([\d.]+) cblas_wait=([\d.]+)') +RE_TIMING_DYN = re.compile(r'ane_fwd=([\d.]+) io_fwd=([\d.]+) rms=([\d.]+) ane_bwd=([\d.]+) io_bwd=([\d.]+) silu=([\d.]+) rms_bwd=([\d.]+) cls=([\d.]+) cblas_wait=([\d.]+) dw_copy=([\d.]+)') RE_RESTART = re.compile(r'\[exec\(\) restart step (\d+)') RE_RESUME = re.compile(r'\[RESUMED step (\d+), loss=([\d.]+)\]') RE_FLOPS = re.compile(r'FLOPs/step: fwd=([\d.]+)M bwd_dx=([\d.]+)M bwd_dW=([\d.]+)M sdpa_bwd=([\d.]+)M total=([\d.]+)M') RE_ANE_FLOPS = re.compile(r'ANE FLOPs/step: ([\d.]+)M') RE_ANE_TFLOPS = re.compile(r'ANE TFLOPS:\s+([\d.]+)') RE_ANE_UTIL = re.compile(r'ANE utilization:\s+([\d.]+)%') -RE_EFFICIENCY = re.compile(r'(Total steps|Wall time|Compile time|Train time|Avg compile|Avg train|ANE TFLOPS|Total TFLOPS|ANE utilization):?\s+(.+)') +RE_EFFICIENCY = re.compile(r'(Total steps|Wall time|Compile time|Compile|Train time|Avg compile|Avg train|ANE TFLOPS|Total TFLOPS|ANE 
utilization):?\s+(.+)') +RE_COMPILED = re.compile(r'Compiled (\d+) kernels in (\d+)ms') RE_ANE_POWER = re.compile(r'ANE Power:\s+([\d.]+)\s*mW') RE_CPU_POWER = re.compile(r'CPU Power:\s+([\d.]+)\s*mW') RE_GPU_POWER = re.compile(r'GPU Power:\s+([\d.]+)\s*mW') def parse_line(line): S.logs.append(line) + # Parse JSON lines from static pipeline ({"type":"step",...} or {"type":"batch",...}) + stripped = line.strip() + if stripped.startswith('{'): + try: + j = json.loads(stripped) + jt = j.get('type') + if jt == 'step': + S.step, S.loss = j['step'], j['loss'] + S.loss_history.append((S.step, S.loss)) + S.best_loss = min(S.best_loss, S.loss) + S.compiles = j.get('compiles', S.compiles) + now = time.monotonic() + if S.train_start is None: + S.train_start = now + S.step_timestamps.append((S.step, now)) + if len(S.step_timestamps) >= 2: + dt = S.step_timestamps[-1][1] - S.step_timestamps[-2][1] + if dt > 0: + S.ms_per_step = dt * 1000 + # Extract component timing from JSON + ct = {} + for k in ('t_ane', 't_io', 't_cls', 't_elem', 't_rms', 't_cblas_wait'): + if k in j: + ct[k[2:]] = j[k] # strip 't_' prefix + if ct: + S.component_timing = ct + return + elif jt == 'batch': + S.batch_num = j.get('batch', S.batch_num) + compile_ms = j.get('compile_ms', 0) + train_ms = j.get('train_ms', 0) + S.ms_per_step = j.get('ms_per_step', S.ms_per_step) + S.compile_ms += compile_ms + S.compile_pct = 100 * S.compile_ms / (S.compile_ms + train_ms) if S.compile_ms + train_ms > 0 else 0 + return + elif jt == 'perf': + if 'ane_tflops' in j: + S.flops['ane_tflops'] = j['ane_tflops'] + if 'ane_util_pct' in j: + S.flops['ane_util'] = j['ane_util_pct'] + return + except (json.JSONDecodeError, KeyError): + pass m = RE_CONFIG.search(line) if m: S.model_config = dict(zip(['dim', 'hidden', 'heads', 'seq', 'vocab', 'layers'], map(int, m.groups()))) @@ -303,7 +354,7 @@ def parse_line(line): if m: S.params = {'total': float(m[1]), 'transformer': float(m[2]), 'embed': float(m[3])} return - m = 
RE_KERNELS.search(line) + m = RE_KERNELS_DYN.search(line) or RE_KERNELS.search(line) if m: S.kernels = {'total': int(m[1]), 'weight_bearing': int(m[2])} return @@ -323,6 +374,18 @@ def parse_line(line): m = RE_STEP.search(line) if m: S.step, S.loss = int(m[1]), float(m[2]) + if m[3]: + S.training['lr'] = m[3] + if m[4]: + S.ms_per_step = float(m[4]) + now = time.monotonic() + if S.train_start is None: + S.train_start = now + S.step_timestamps.append((S.step, now)) + if not m[4] and len(S.step_timestamps) >= 2: + dt = S.step_timestamps[-1][1] - S.step_timestamps[-2][1] + if dt > 0: + S.ms_per_step = dt * 1000 S.loss_history.append((S.step, S.loss)) S.best_loss = min(S.best_loss, S.loss) return @@ -334,6 +397,16 @@ def parse_line(line): S.compiles = int(m[5]) S.compile_pct = 100 * compile_ms / (compile_ms + train_ms) if compile_ms + train_ms > 0 else 0 return + m = RE_TIMING_DYN.search(line) + if m: + vals = list(map(float, m.groups())) + S.component_timing = { + 'ane_fwd': vals[0], 'io_fwd': vals[1], 'rms': vals[2], + 'ane_bwd': vals[3], 'io_bwd': vals[4], 'silu': vals[5], + 'rms_bwd': vals[6], 'cls': vals[7], 'cblas_wait': vals[8], 'dw_copy': vals[9], + '_dynamic': True + } + return m = RE_TIMING.search(line) if m: S.component_timing = dict(zip(['ane', 'io', 'cls', 'elem', 'rms', 'cblas_wait'], map(float, m.groups()))) @@ -346,6 +419,11 @@ def parse_line(line): if m: S.flops['ane_util'] = float(m[1]) return + m = RE_COMPILED.search(line) + if m: + S.compiles = int(m[1]) + S.compile_ms += float(m[2]) + return m = RE_EFFICIENCY.search(line) if m: S.efficiency[m[1].strip()] = m[2].strip() @@ -514,23 +592,49 @@ def put(y, x, text, style=''): # Training stats (right panel) sr = row step_str = f'{S.step}' + (f'/{S.total_steps}' if S.total_steps and S.total_steps < 999999 else '') - put(sr, mid_x + 1, f' Step: {step_str} Loss: {S.loss:.4f}' if S.loss else ' Step: --', term.yellow) + # Elapsed time + elapsed = 0.0 + if S.train_start: + elapsed = time.monotonic() - 
S.train_start + elapsed_str = f'{elapsed:.1f}s' if elapsed < 60 else f'{elapsed/60:.1f}m' + put(sr, mid_x + 1, f' Step: {step_str} Loss: {S.loss:.4f} [{elapsed_str}]' if S.loss else ' Step: --', term.yellow) sr += 1 - put(sr, mid_x + 1, f' Best: {S.best_loss:.4f} ms/step: {S.ms_per_step:.1f}' if S.best_loss < float('inf') else ' Best: --') + # ms/step + steps/sec + sps = 1000.0 / S.ms_per_step if S.ms_per_step > 0 else 0 + put(sr, mid_x + 1, f' Best: {S.best_loss:.4f} {S.ms_per_step:.1f}ms/step ({sps:.1f} steps/s)' if S.best_loss < float('inf') else ' Best: --') sr += 1 + # TFLOPS ane_tflops = S.flops.get('ane_tflops', 0) ane_util = S.flops.get('ane_util', 0) + total_tflops = 0 + if S.ms_per_step > 0 and S.flops.get('ane', 0) > 0: + if not ane_tflops: + ane_tflops = (S.flops['ane'] * 1e6) / (S.ms_per_step * 1e-3) / 1e12 + total_tflops = (S.flops.get('total', 0) * 1e6) / (S.ms_per_step * 1e-3) / 1e12 + if not ane_util and ane_tflops: + ane_util = 100.0 * ane_tflops / 15.8 + compile_str = f' Compile: {S.compile_ms/1000:.1f}s' if S.compile_ms > 0 else '' if ane_tflops: - put(sr, mid_x + 1, f' ANE: {ane_tflops:.2f}T Compile: {S.compile_pct:.0f}% Util: {ane_util:.1f}%') - else: - put(sr, mid_x + 1, f' Compile: {S.compile_pct:.0f}%') + tflops_str = f' ANE: {ane_tflops:.2f}T' + if total_tflops: + tflops_str += f' Total: {total_tflops:.2f}T' + tflops_str += f' Util: {ane_util:.1f}%{compile_str}' + put(sr, mid_x + 1, tflops_str) + elif compile_str: + put(sr, mid_x + 1, f'{compile_str}') sr += 1 ct = S.component_timing if ct: - put(sr, mid_x + 1, f' ane={ct.get("ane", 0):.1f} io={ct.get("io", 0):.1f} cls={ct.get("cls", 0):.1f} elem={ct.get("elem", 0):.1f}') - sr += 1 - put(sr, mid_x + 1, f' rms={ct.get("rms", 0):.1f} cblas_wait={ct.get("cblas_wait", 0):.1f} ms/step') - sr += 1 + if ct.get('_dynamic'): + put(sr, mid_x + 1, f' fwd={ct.get("ane_fwd",0):.1f} bwd={ct.get("ane_bwd",0):.1f} io={ct.get("io_fwd",0)+ct.get("io_bwd",0):.1f} silu={ct.get("silu",0):.1f}') + sr += 1 + 
put(sr, mid_x + 1, f' cls={ct.get("cls",0):.1f} rms={ct.get("rms",0)+ct.get("rms_bwd",0):.1f} dw={ct.get("dw_copy",0):.1f} ms/step') + sr += 1 + else: + put(sr, mid_x + 1, f' ane={ct.get("ane", 0):.1f} io={ct.get("io", 0):.1f} cls={ct.get("cls", 0):.1f} elem={ct.get("elem", 0):.1f}') + sr += 1 + put(sr, mid_x + 1, f' rms={ct.get("rms", 0):.1f} cblas_wait={ct.get("cblas_wait", 0):.1f} ms/step') + sr += 1 pw = S.power if any(pw.values()): put(sr, mid_x + 1, '\u2500 Power ' + '\u2500' * max(0, right_w - 9), term.cyan) @@ -659,10 +763,24 @@ def set_nonblock(fd): fl = fcntl.fcntl(fd, fcntl.F_GETFL) fcntl.fcntl(fd, fcntl.F_SETFL, fl | os.O_NONBLOCK) -def spawn_training(resume=False, steps=10000): - cmd = 'make train_large 2>&1 && ./train_large' +def spawn_training(resume=False, steps=10000, dynamic=False, ane=False, scratch=False, + lr=None, accum=None, no_ane_extras=False): + if dynamic: + cmd = 'cd training_dynamic && make 2>&1 && ./train' + elif ane: + cmd = 'make train_large_ane 2>&1 && ./train_large_ane' + else: + cmd = 'make train_large 2>&1 && ./train_large' if resume: cmd += ' --resume' + if scratch and dynamic: + cmd += ' --scratch' + if lr is not None: + cmd += f' --lr {lr}' + if accum is not None and dynamic: + cmd += f' --accum {accum}' + if no_ane_extras and ane: + cmd += ' --no-ane-extras' cmd += f' --steps {steps}' proc = subprocess.Popen( ['bash', '-c', cmd], @@ -672,6 +790,8 @@ def spawn_training(resume=False, steps=10000): return proc def spawn_powermetrics(): + if not sys.stdin.isatty(): + return None try: proc = subprocess.Popen( ['sudo', 'powermetrics', '--samplers', 'cpu_power,gpu_power,ane_power', '-i', '1000'], @@ -684,6 +804,12 @@ def spawn_powermetrics(): def main(): parser = argparse.ArgumentParser(description='ANE Training Dashboard (stories110M)') parser.add_argument('--resume', action='store_true', help='Resume from checkpoint') + parser.add_argument('--dynamic', action='store_true', help='Dynamic weight pipeline (training_dynamic/)') + 
parser.add_argument('--ane', action='store_true', help='PR#19: ANE-offloaded classifier/softmax/rmsnorm_bwd') + parser.add_argument('--no-ane-extras', action='store_true', help='Disable ANE extras (use with --ane)') + parser.add_argument('--scratch', action='store_true', help='Train from scratch (random init)') + parser.add_argument('--lr', type=float, default=None, help='Learning rate') + parser.add_argument('--accum', type=int, default=None, help='Gradient accumulation steps') parser.add_argument('--infinite', action='store_true', help='Train indefinitely') parser.add_argument('--no-powermetrics', action='store_true') parser.add_argument('--no-generate', action='store_true', help='Disable text generation') @@ -694,10 +820,15 @@ def main(): args.steps = 999999999 S.total_steps = args.steps + global CKPT_PATH + CKPT_PATH = CKPT_PATH_DYNAMIC if args.dynamic else CKPT_PATH_STATIC + term = Terminal() procs = [] - train_proc = spawn_training(resume=args.resume, steps=args.steps) + train_proc = spawn_training(resume=args.resume, steps=args.steps, dynamic=args.dynamic, + scratch=args.scratch, lr=args.lr, accum=args.accum, + ane=args.ane, no_ane_extras=args.no_ane_extras) S.train_pid = train_proc.pid procs.append(train_proc) @@ -837,7 +968,9 @@ def on_resize(*a): if train_proc: train_proc.terminate() train_proc.wait() - train_proc = spawn_training(resume=True, steps=args.steps) + train_proc = spawn_training(resume=True, steps=args.steps, dynamic=args.dynamic, + lr=args.lr, accum=args.accum, + ane=args.ane, no_ane_extras=args.no_ane_extras) S.train_pid = train_proc.pid procs = [p for p in procs if p.poll() is None] procs.append(train_proc) @@ -851,7 +984,7 @@ def force_gen(): try: W = load_weights_from_ckpt(CKPT_PATH) if W: - text = generate_text(W, get_tokenizer(), max_tokens=64, temperature=0.8) + text = generate_text(W, max_tokens=64, temperature=0.8) with S.gen_lock: S.gen_text = text S.gen_step = S.step diff --git a/training/download_data.sh 
b/training/download_data.sh new file mode 100755 index 0000000..2d27d96 --- /dev/null +++ b/training/download_data.sh @@ -0,0 +1,91 @@ +#!/bin/bash +# Download pretokenized TinyStories data for ANE training +# Format: flat uint16 token IDs (Llama2 BPE, 32K vocab) +# Source: enio/TinyStories on HuggingFace (pretokenized with karpathy/llama2.c) +# +# The tar.gz contains data00.bin..data49.bin (50 shards). +# We extract only data00.bin and rename it to tinystories_data00.bin. + +set -e + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +OUTPUT="$SCRIPT_DIR/tinystories_data00.bin" + +if [ -f "$OUTPUT" ]; then + SIZE=$(stat -f%z "$OUTPUT" 2>/dev/null || stat -c%s "$OUTPUT" 2>/dev/null) + TOKENS=$((SIZE / 2)) + echo "$OUTPUT already exists ($TOKENS tokens, $(echo "scale=1; $SIZE/1000000" | bc) MB)" + exit 0 +fi + +TAR_URL="https://huggingface.co/datasets/enio/TinyStories/resolve/main/tok32000/TinyStories_tok32000.tar.gz?download=true" +TAR_FILE="$SCRIPT_DIR/TinyStories_tok32000.tar.gz" + +echo "=== TinyStories Data Download ===" +echo "Downloading pretokenized TinyStories (32K vocab, ~993 MB)..." +echo " Source: enio/TinyStories on HuggingFace" +echo " This will take a few minutes depending on your connection." +echo "" + +# Download the tar.gz +if [ ! -f "$TAR_FILE" ]; then + if command -v curl &>/dev/null; then + curl -L --progress-bar -o "$TAR_FILE" "$TAR_URL" + elif command -v wget &>/dev/null; then + wget --show-progress -O "$TAR_FILE" "$TAR_URL" + else + echo "Error: need curl or wget" + exit 1 + fi +else + echo "Tar file already downloaded, skipping..." +fi + +# Verify it's actually a gzip file (not an error page) +if ! file "$TAR_FILE" | grep -q "gzip"; then + echo "Error: Downloaded file is not a valid gzip archive." + echo "Content: $(head -c 100 "$TAR_FILE")" + rm -f "$TAR_FILE" + exit 1 +fi + +echo "" +echo "Extracting data00.bin from archive..." 
+ +# List what's in the archive to find the right path +DATA_FILE=$(tar tzf "$TAR_FILE" 2>/dev/null | grep 'data00\.bin' | head -1) +if [ -z "$DATA_FILE" ]; then + echo "Error: data00.bin not found in archive. Contents:" + tar tzf "$TAR_FILE" | head -20 + exit 1 +fi +echo " Found: $DATA_FILE" + +# Extract just data00.bin +tar xzf "$TAR_FILE" -C "$SCRIPT_DIR" "$DATA_FILE" + +# Move to expected location (might be in a subdirectory) +EXTRACTED="$SCRIPT_DIR/$DATA_FILE" +if [ "$EXTRACTED" != "$OUTPUT" ]; then + mv "$EXTRACTED" "$OUTPUT" + # Clean up any extracted subdirectories + rmdir "$(dirname "$EXTRACTED")" 2>/dev/null || true +fi + +# Clean up tar.gz to save disk space +echo "Cleaning up archive..." +rm -f "$TAR_FILE" + +SIZE=$(stat -f%z "$OUTPUT" 2>/dev/null || stat -c%s "$OUTPUT" 2>/dev/null) +TOKENS=$((SIZE / 2)) +echo "" +echo "Done: $OUTPUT" +echo " $TOKENS tokens ($(echo "scale=1; $SIZE/1000000" | bc) MB)" + +# Sanity check +python3 -c " +import struct +with open('$OUTPUT', 'rb') as f: + tokens = struct.unpack('<10H', f.read(20)) + print(f'First 10 tokens: {tokens}') +" 2>/dev/null || true diff --git a/training/forward.h b/training/forward.h index adcf898..1a2a31f 100644 --- a/training/forward.h +++ b/training/forward.h @@ -7,7 +7,7 @@ // ANE conv eval: input [S, in_dim] row-major → transpose to [in_dim, S] channels-first // ANE computes conv(W, x) with baked W → output [out_dim, S] // Transpose back to [S, out_dim] row-major -static void ane_conv_eval(ANEKernel *kernel, const float *x, float *y, +static bool ane_conv_eval(ANEKernel *kernel, const float *x, float *y, int S, int in_dim, int out_dim) { float *x_t = (float*)malloc(S * in_dim * sizeof(float)); for (int t = 0; t < S; t++) @@ -15,7 +15,11 @@ static void ane_conv_eval(ANEKernel *kernel, const float *x, float *y, x_t[i*S + t] = x[t*in_dim + i]; ane_write_input(kernel, 0, x_t, S * in_dim * sizeof(float)); - ane_eval(kernel); + bool ok = ane_eval(kernel); + if (!ok) { + free(x_t); + return false; + } 
float *y_t = (float*)malloc(S * out_dim * sizeof(float)); ane_read_output(kernel, 0, y_t, S * out_dim * sizeof(float)); @@ -25,6 +29,7 @@ static void ane_conv_eval(ANEKernel *kernel, const float *x, float *y, y[t*out_dim + i] = y_t[i*S + t]; free(x_t); free(y_t); + return true; } // CPU matmul fallback: y = W @ x, W[out_dim, in_dim], x[S, in_dim] → y[S, out_dim] diff --git a/training/gradient_checkpoint.h b/training/gradient_checkpoint.h new file mode 100644 index 0000000..29a5aa0 --- /dev/null +++ b/training/gradient_checkpoint.h @@ -0,0 +1,165 @@ +// gradient_checkpoint.h — Activation checkpointing for deep models +// Trades compute for memory: recompute forward activations during backward +// instead of storing all layers' activations simultaneously +#pragma once +#include "model_config.h" + +// ===== Checkpoint policies ===== + +typedef enum { + CKPT_ALL, // save all layers' activations (current behavior) + CKPT_BOUNDARY, // save only group boundary activations, recompute within group + CKPT_SQRT, // save every √N layers (optimal memory/compute tradeoff) + CKPT_EVERY_N, // save every N-th layer (configurable) + CKPT_NONE // save nothing, recompute everything (max memory savings) +} CheckpointPolicy; + +typedef struct { + CheckpointPolicy policy; + int interval; // for CKPT_EVERY_N: save every N layers + int n_layers; // total layers in model + int n_groups; // layer groups in pipeline + int layers_per_group; // layers per group (from pipeline plan) + // Derived + int n_checkpointed; // how many layers have saved activations + bool *is_saved; // per-layer: true if activation is saved (not recomputed) +} CheckpointManager; + +// ===== Initialization ===== + +// custom_interval: used for CKPT_EVERY_N (pass 0 for default=4, ignored for other policies) +static CheckpointManager checkpoint_init(CheckpointPolicy policy, const ModelConfig *cfg, + const PipelinePlan *plan, int custom_interval) { + CheckpointManager cm = {0}; + cm.policy = policy; + cm.n_layers = 
cfg->dims.n_layers; + cm.n_groups = plan->n_groups; + cm.layers_per_group = (plan->n_groups > 0) ? plan->groups[0].n_layers : cfg->dims.n_layers; + cm.is_saved = (bool *)calloc(cfg->dims.n_layers, sizeof(bool)); + + switch (policy) { + case CKPT_ALL: + for (int i = 0; i < cm.n_layers; i++) cm.is_saved[i] = true; + break; + + case CKPT_BOUNDARY: + for (int g = 0; g < plan->n_groups; g++) { + cm.is_saved[plan->groups[g].start_layer] = true; + } + cm.is_saved[cm.n_layers - 1] = true; + break; + + case CKPT_SQRT: { + int interval = (int)sqrtf((float)cm.n_layers); + if (interval < 1) interval = 1; + cm.interval = interval; + for (int i = 0; i < cm.n_layers; i += interval) cm.is_saved[i] = true; + cm.is_saved[cm.n_layers - 1] = true; + break; + } + + case CKPT_EVERY_N: + cm.interval = (custom_interval > 0) ? custom_interval : 4; + for (int i = 0; i < cm.n_layers; i += cm.interval) cm.is_saved[i] = true; + cm.is_saved[cm.n_layers - 1] = true; + break; + + case CKPT_NONE: + cm.is_saved[0] = true; + break; + } + + // Count actual saved layers — single source of truth, no fragile arithmetic + cm.n_checkpointed = 0; + for (int i = 0; i < cm.n_layers; i++) { + if (cm.is_saved[i]) cm.n_checkpointed++; + } + + return cm; +} + +static void checkpoint_free(CheckpointManager *cm) { + free(cm->is_saved); + cm->is_saved = NULL; +} + +// ===== Query functions ===== + +// Should we save this layer's activations during forward pass? +static bool checkpoint_should_save(const CheckpointManager *cm, int layer_idx) { + if (layer_idx < 0 || layer_idx >= cm->n_layers) return false; + return cm->is_saved[layer_idx]; +} + +// Does this layer need forward recompute during backward pass? 
+static bool checkpoint_needs_recompute(const CheckpointManager *cm, int layer_idx) { + return !checkpoint_should_save(cm, layer_idx); +} + +// Find the nearest saved checkpoint before this layer (for recompute starting point) +static int checkpoint_nearest_saved_before(const CheckpointManager *cm, int layer_idx) { + for (int i = layer_idx; i >= 0; i--) { + if (cm->is_saved[i]) return i; + } + return 0; // fallback to first layer +} + +// How many layers need recompute between the nearest checkpoint and this layer? +static int checkpoint_recompute_depth(const CheckpointManager *cm, int layer_idx) { + int saved = checkpoint_nearest_saved_before(cm, layer_idx); + return layer_idx - saved; +} + +// ===== Memory estimation ===== + +// Memory for saved activations only (bytes) +static size_t checkpoint_saved_memory(const CheckpointManager *cm, const ModelDims *d) { + return (size_t)cm->n_checkpointed * layer_activation_bytes(d); +} + +// Memory savings vs. saving all layers (bytes) +static size_t checkpoint_memory_saved(const CheckpointManager *cm, const ModelDims *d) { + size_t all = (size_t)cm->n_layers * layer_activation_bytes(d); + size_t used = checkpoint_saved_memory(cm, d); + return all - used; +} + +// Extra forward FLOPs due to recompute (fraction of 1.0) +static double checkpoint_recompute_overhead(const CheckpointManager *cm) { + int recomputed = cm->n_layers - cm->n_checkpointed; + return (double)recomputed / (double)cm->n_layers; +} + +// ===== Pretty-print ===== + +static const char *checkpoint_policy_name(CheckpointPolicy p) { + switch (p) { + case CKPT_ALL: return "ALL"; + case CKPT_BOUNDARY: return "BOUNDARY"; + case CKPT_SQRT: return "SQRT"; + case CKPT_EVERY_N: return "EVERY_N"; + case CKPT_NONE: return "NONE"; + default: return "UNKNOWN"; + } +} + +static void checkpoint_print(const CheckpointManager *cm, const ModelDims *d) { + printf("=== Checkpoint Policy: %s ===\n", checkpoint_policy_name(cm->policy)); + printf(" %d/%d layers checkpointed", 
cm->n_checkpointed, cm->n_layers); + if (cm->policy == CKPT_SQRT || cm->policy == CKPT_EVERY_N) + printf(" (interval=%d)", cm->interval); + printf("\n"); + printf(" Activation memory: %.1fMB (saved) / %.1fMB (all)\n", + checkpoint_saved_memory(cm, d) / 1e6, + (double)cm->n_layers * layer_activation_bytes(d) / 1e6); + printf(" Memory savings: %.1fMB (%.0f%%)\n", + checkpoint_memory_saved(cm, d) / 1e6, + 100.0 * checkpoint_memory_saved(cm, d) / ((double)cm->n_layers * layer_activation_bytes(d))); + printf(" Recompute overhead: %.0f%% extra forward FLOPs\n", + 100.0 * checkpoint_recompute_overhead(cm)); + printf(" Saved layers: "); + for (int i = 0; i < cm->n_layers; i++) { + if (cm->is_saved[i]) printf("%d ", i); + } + printf("\n"); +} diff --git a/training/model.h b/training/model.h index 6cee52f..4e68ebc 100644 --- a/training/model.h +++ b/training/model.h @@ -78,7 +78,10 @@ typedef struct { static int model_load_weights(Model *m, const char *path) { FILE *f = fopen(path, "rb"); if (!f) { fprintf(stderr, "Cannot open %s\n", path); return -1; } - fread(&m->cfg, sizeof(Config), 1, f); + if (fread(&m->cfg, sizeof(Config), 1, f) != 1) { + fprintf(stderr, "ERROR: failed to read config from %s\n", path); + fclose(f); return -1; + } bool shared = m->cfg.vocab_size > 0; if (m->cfg.vocab_size < 0) m->cfg.vocab_size = -m->cfg.vocab_size; @@ -89,7 +92,10 @@ static int model_load_weights(Model *m, const char *path) { int d = m->cfg.dim, hd = m->cfg.hidden_dim, nl = m->cfg.n_layers, vs = m->cfg.vocab_size; m->token_embedding = (float*)malloc(vs * d * sizeof(float)); - fread(m->token_embedding, sizeof(float), vs * d, f); + if (fread(m->token_embedding, sizeof(float), vs * d, f) != (size_t)(vs * d)) { + fprintf(stderr, "ERROR: short read on token_embedding (file truncated?)\n"); + fclose(f); return -1; + } float *rms_att_all = (float*)malloc(nl * d * sizeof(float)); float *wq_all = (float*)malloc(nl * d * d * sizeof(float)); @@ -101,15 +107,24 @@ static int 
model_load_weights(Model *m, const char *path) { float *w2_all = (float*)malloc(nl * d * hd * sizeof(float)); float *w3_all = (float*)malloc(nl * hd * d * sizeof(float)); - fread(rms_att_all, sizeof(float), nl * d, f); - fread(wq_all, sizeof(float), nl * d * d, f); - fread(wk_all, sizeof(float), nl * d * d, f); - fread(wv_all, sizeof(float), nl * d * d, f); - fread(wo_all, sizeof(float), nl * d * d, f); - fread(rms_ffn_all, sizeof(float), nl * d, f); - fread(w1_all, sizeof(float), nl * hd * d, f); - fread(w2_all, sizeof(float), nl * d * hd, f); - fread(w3_all, sizeof(float), nl * hd * d, f); + #define FREAD_CHECK(buf, count, file, label) do { \ + size_t _n = fread(buf, sizeof(float), count, file); \ + if (_n != (size_t)(count)) { \ + fprintf(stderr, "ERROR: short read on %s: got %zu, expected %zu (file truncated?)\n", \ + label, _n, (size_t)(count)); \ + fclose(file); return -1; \ + } \ + } while(0) + + FREAD_CHECK(rms_att_all, nl * d, f, "rms_att"); + FREAD_CHECK(wq_all, nl * d * d, f, "wq"); + FREAD_CHECK(wk_all, nl * d * d, f, "wk"); + FREAD_CHECK(wv_all, nl * d * d, f, "wv"); + FREAD_CHECK(wo_all, nl * d * d, f, "wo"); + FREAD_CHECK(rms_ffn_all, nl * d, f, "rms_ffn"); + FREAD_CHECK(w1_all, nl * hd * d, f, "w1"); + FREAD_CHECK(w2_all, nl * d * hd, f, "w2"); + FREAD_CHECK(w3_all, nl * hd * d, f, "w3"); for (int l = 0; l < nl; l++) { m->rms_att_w[l] = (float*)malloc(d * sizeof(float)); @@ -135,14 +150,15 @@ static int model_load_weights(Model *m, const char *path) { free(rms_ffn_all); free(w1_all); free(w2_all); free(w3_all); m->rms_final_w = (float*)malloc(d * sizeof(float)); - fread(m->rms_final_w, sizeof(float), d, f); + FREAD_CHECK(m->rms_final_w, d, f, "rms_final"); if (shared) { m->wcls = m->token_embedding; } else { m->wcls = (float*)malloc(vs * d * sizeof(float)); - fread(m->wcls, sizeof(float), vs * d, f); + FREAD_CHECK(m->wcls, vs * d, f, "wcls"); } + #undef FREAD_CHECK fclose(f); return 0; } @@ -188,32 +204,45 @@ static int model_compile_kernels(Model 
*m, int seq_len) { return 0; } -// Recompile all kernels after weight update — unload all first to avoid ANE model limit +// Recompile all kernels after weight update — compile new first, then swap static int model_recompile_kernels(Model *m) { int d = m->cfg.dim, hd = m->cfg.hidden_dim, vs = m->cfg.vocab_size; int S = m->seq_len; - // Phase 1: unload+free all + + // Phase 1: compile new kernels into temporaries + ANEKernel *new_q[N_LAYERS], *new_k[N_LAYERS], *new_v[N_LAYERS], *new_o[N_LAYERS]; + ANEKernel *new_w1[N_LAYERS], *new_w2[N_LAYERS], *new_w3[N_LAYERS]; for (int l = 0; l < N_LAYERS; l++) { - ane_free(m->kern_q[l]); ane_free(m->kern_k[l]); ane_free(m->kern_v[l]); ane_free(m->kern_o[l]); - ane_free(m->kern_w1[l]); ane_free(m->kern_w2[l]); ane_free(m->kern_w3[l]); - m->kern_q[l]=m->kern_k[l]=m->kern_v[l]=m->kern_o[l]=NULL; - m->kern_w1[l]=m->kern_w2[l]=m->kern_w3[l]=NULL; + new_q[l] = compile_conv_kernel(m->wq[l], d, d, S); + new_k[l] = compile_conv_kernel(m->wk[l], d, d, S); + new_v[l] = compile_conv_kernel(m->wv[l], d, d, S); + new_o[l] = compile_conv_kernel(m->wo[l], d, d, S); + new_w1[l] = compile_conv_kernel(m->w1[l], d, hd, S); + new_w2[l] = compile_conv_kernel(m->w2[l], hd, d, S); + new_w3[l] = compile_conv_kernel(m->w3[l], d, hd, S); + if (!new_q[l] || !new_k[l] || !new_v[l] || !new_o[l] || + !new_w1[l] || !new_w2[l] || !new_w3[l]) { + // Cleanup partially compiled new kernels + for (int i = 0; i <= l; i++) { + ane_free(new_q[i]); ane_free(new_k[i]); ane_free(new_v[i]); ane_free(new_o[i]); + ane_free(new_w1[i]); ane_free(new_w2[i]); ane_free(new_w3[i]); + } + fprintf(stderr, "Recompile failed at layer %d, keeping old kernels\n", l); + return -1; + } } - if (m->kern_cls) { ane_free(m->kern_cls); m->kern_cls=NULL; } - // Phase 2: recompile all + ANEKernel *new_cls = compile_conv_kernel(m->wcls, d, vs, S); + + // Phase 2: all compiles succeeded — swap and free old for (int l = 0; l < N_LAYERS; l++) { - m->kern_q[l] = compile_conv_kernel(m->wq[l], d, d, 
S); - m->kern_k[l] = compile_conv_kernel(m->wk[l], d, d, S); - m->kern_v[l] = compile_conv_kernel(m->wv[l], d, d, S); - m->kern_o[l] = compile_conv_kernel(m->wo[l], d, d, S); - m->kern_w1[l] = compile_conv_kernel(m->w1[l], d, hd, S); - m->kern_w2[l] = compile_conv_kernel(m->w2[l], hd, d, S); - m->kern_w3[l] = compile_conv_kernel(m->w3[l], d, hd, S); - if (!m->kern_q[l] || !m->kern_k[l] || !m->kern_v[l] || !m->kern_o[l] || - !m->kern_w1[l] || !m->kern_w2[l] || !m->kern_w3[l]) return -1; + ane_free(m->kern_q[l]); ane_free(m->kern_k[l]); ane_free(m->kern_v[l]); ane_free(m->kern_o[l]); + ane_free(m->kern_w1[l]); ane_free(m->kern_w2[l]); ane_free(m->kern_w3[l]); + m->kern_q[l] = new_q[l]; m->kern_k[l] = new_k[l]; + m->kern_v[l] = new_v[l]; m->kern_o[l] = new_o[l]; + m->kern_w1[l] = new_w1[l]; m->kern_w2[l] = new_w2[l]; m->kern_w3[l] = new_w3[l]; } - m->kern_cls = compile_conv_kernel(m->wcls, d, vs, S); - // cls may fail for large vocab — that's OK, forward uses CPU fallback + if (m->kern_cls) ane_free(m->kern_cls); + m->kern_cls = new_cls; // may be NULL for large vocab — forward uses CPU fallback return 0; } diff --git a/training/model_config.h b/training/model_config.h new file mode 100644 index 0000000..fa8be1c --- /dev/null +++ b/training/model_config.h @@ -0,0 +1,314 @@ +// model_config.h — Parameterized model configuration for pipeline training +// Replaces hardcoded #defines with portable structs + preset configs +#pragma once +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <stddef.h> + +// ===== Model configuration ===== + +typedef struct { + int dim; // model dimension (embedding/residual width) + int hidden_dim; // FFN hidden dimension + int n_heads; // number of attention heads + int n_kv_heads; // number of KV heads (for GQA; == n_heads for MHA) + int n_layers; // total transformer layers + int vocab_size; // vocabulary size + int seq_len; // maximum sequence length + // Derived (computed by model_dims_init) + int head_dim; // dim / n_heads + int kv_dim; // head_dim * 
n_kv_heads + int score_ch; // n_heads * seq_len (attention score channels for SDPA bwd) +} ModelDims; + +typedef struct { + int compile_budget; // max ANE compilations per process (~119) + int kernels_per_layer; // weight-bearing kernels per layer (currently 5) + int static_per_layer; // weight-free kernels per layer (sdpaBwd2 = 1) + int accum_steps; // gradient accumulation steps per compile batch + float headroom_pct; // safety margin as fraction of budget (0.0-1.0, default 0.10) +} CompileConfig; + +typedef struct { + ModelDims dims; + CompileConfig compile; + const char *name; // human-readable model name +} ModelConfig; + +// ===== Layer group for pipeline scheduling ===== + +typedef struct { + int start_layer; // first layer index (inclusive) + int end_layer; // last layer index (exclusive) + int n_layers; // end_layer - start_layer + int weight_kernels; // weight-bearing kernels in this group + int static_kernels; // weight-free kernels in this group + int total_kernels; // weight_kernels + static_kernels +} LayerGroup; + +typedef struct { + LayerGroup *groups; + int n_groups; + int total_exec_restarts; // estimated exec() restarts per training step +} PipelinePlan; + +// ===== Derived dimension helpers ===== + +static void model_dims_init(ModelDims *d) { + d->head_dim = (d->n_heads > 0) ? 
d->dim / d->n_heads : 0; + d->kv_dim = d->head_dim * d->n_kv_heads; + d->score_ch = d->n_heads * d->seq_len; +} + +// ===== Per-layer memory sizes (bytes) ===== + +// Weight sizes in floats (fp32) +static inline size_t wq_size(const ModelDims *d) { return (size_t)d->dim * d->dim; } +static inline size_t wo_size(const ModelDims *d) { return (size_t)d->dim * d->dim; } +static inline size_t w1_size(const ModelDims *d) { return (size_t)d->hidden_dim * d->dim; } +static inline size_t w2_size(const ModelDims *d) { return (size_t)d->dim * d->hidden_dim; } +static inline size_t w3_size(const ModelDims *d) { return (size_t)d->hidden_dim * d->dim; } + +static inline size_t layer_weight_floats(const ModelDims *d) { + return 4 * wq_size(d) // Wq, Wk, Wv, Wo + + w1_size(d) + w2_size(d) + w3_size(d) // W1, W2, W3 + + 2 * (size_t)d->dim; // rms_att, rms_ffn +} + +static inline size_t layer_weight_bytes(const ModelDims *d) { + return layer_weight_floats(d) * sizeof(float); +} + +// Adam state: 2x weight size (m + v vectors) +static inline size_t layer_adam_bytes(const ModelDims *d) { + return 2 * layer_weight_bytes(d); +} + +// Activation buffers per layer (saved for backward) +static inline size_t layer_activation_floats(const ModelDims *d) { + int S = d->seq_len, D = d->dim, H = d->hidden_dim; + // layer_in, xnorm, Q, K, V, attn_out, o_out, x2, x2norm = 9 * D*S + // h1, h3, silu_out = 3 * H*S + // ffn_out = D*S + return (size_t)(10 * D * S + 3 * H * S); +} + +static inline size_t layer_activation_bytes(const ModelDims *d) { + return layer_activation_floats(d) * sizeof(float); +} + +// Gradient accumulators per layer +static inline size_t layer_gradient_bytes(const ModelDims *d) { + return layer_weight_bytes(d); // same layout as weights +} + +// Total model memory (weights + adam + activations + gradients for all layers) +static inline size_t total_model_bytes(const ModelConfig *cfg) { + const ModelDims *d = &cfg->dims; + size_t per_layer = layer_weight_bytes(d) + 
layer_adam_bytes(d) + + layer_activation_bytes(d) + layer_gradient_bytes(d); + size_t global = (size_t)d->dim * sizeof(float) // rms_final + + (size_t)d->vocab_size * d->dim * sizeof(float) // embed + + (size_t)d->dim * 2 * sizeof(float) // rms_final adam + + (size_t)d->vocab_size * d->dim * 2 * sizeof(float) // embed adam + + (size_t)d->dim * sizeof(float) // rms_final grad + + (size_t)d->vocab_size * d->dim * sizeof(float); // embed grad + return per_layer * d->n_layers + global; +} + +// ===== Pipeline planning ===== + +// Compute how many layers can fit in one compile batch +static int max_layers_per_compile(const CompileConfig *cc) { + float headroom = (cc->headroom_pct > 0.0f && cc->headroom_pct < 1.0f) + ? cc->headroom_pct : 0.10f; + int usable = (int)(cc->compile_budget * (1.0f - headroom)); + int per_layer = cc->kernels_per_layer + cc->static_per_layer; + if (per_layer <= 0) return 1; + return usable / per_layer; +} + +// Compute optimal layer groups for a model given compile budget +// Returns a PipelinePlan (caller must free plan.groups) +static PipelinePlan compute_pipeline_plan(const ModelConfig *cfg) { + PipelinePlan plan = {0}; + int max_per = max_layers_per_compile(&cfg->compile); + if (max_per <= 0) max_per = 1; + + // Clamp to actual layer count + int group_size = (max_per < cfg->dims.n_layers) ? 
max_per : cfg->dims.n_layers; + + plan.n_groups = (cfg->dims.n_layers + group_size - 1) / group_size; + plan.groups = (LayerGroup *)calloc(plan.n_groups, sizeof(LayerGroup)); + + int kpl = cfg->compile.kernels_per_layer; + int spl = cfg->compile.static_per_layer; + + for (int g = 0; g < plan.n_groups; g++) { + LayerGroup *lg = &plan.groups[g]; + lg->start_layer = g * group_size; + lg->end_layer = lg->start_layer + group_size; + if (lg->end_layer > cfg->dims.n_layers) + lg->end_layer = cfg->dims.n_layers; + lg->n_layers = lg->end_layer - lg->start_layer; + lg->weight_kernels = lg->n_layers * kpl; + lg->static_kernels = lg->n_layers * spl; + lg->total_kernels = lg->weight_kernels + lg->static_kernels; + } + + // Each compile batch needs one exec() restart (except possibly the last) + // Forward: n_groups compiles. Backward: n_groups compiles. + // Per training step: forward + backward = 2 * n_groups compile batches + // Each batch may need exec() restart. Worst case: + plan.total_exec_restarts = 2 * plan.n_groups; + + return plan; +} + +static void pipeline_plan_free(PipelinePlan *plan) { + free(plan->groups); + plan->groups = NULL; + plan->n_groups = 0; +} + +// ===== Pretty-print plan ===== + +static void pipeline_plan_print(const ModelConfig *cfg, const PipelinePlan *plan) { + printf("=== Pipeline Plan: %s ===\n", cfg->name); + printf(" %d layers | dim=%d hidden=%d heads=%d seq=%d vocab=%d\n", + cfg->dims.n_layers, cfg->dims.dim, cfg->dims.hidden_dim, + cfg->dims.n_heads, cfg->dims.seq_len, cfg->dims.vocab_size); + printf(" Compile budget: %d | %d weight-kernels/layer + %d static/layer\n", + cfg->compile.compile_budget, cfg->compile.kernels_per_layer, + cfg->compile.static_per_layer); + printf(" %d layer group(s):\n", plan->n_groups); + for (int g = 0; g < plan->n_groups; g++) { + const LayerGroup *lg = &plan->groups[g]; + printf(" Group %d: layers [%d..%d) — %d layers, %d kernels (%d weight + %d static)\n", + g, lg->start_layer, lg->end_layer, lg->n_layers, + 
lg->total_kernels, lg->weight_kernels, lg->static_kernels); + } + printf(" Est. exec() restarts per step: %d\n", plan->total_exec_restarts); + printf(" Memory per layer: weights=%.1fMB adam=%.1fMB acts=%.1fMB grads=%.1fMB\n", + layer_weight_bytes(&cfg->dims)/1e6, layer_adam_bytes(&cfg->dims)/1e6, + layer_activation_bytes(&cfg->dims)/1e6, layer_gradient_bytes(&cfg->dims)/1e6); + printf(" Total model state: %.1fMB\n", total_model_bytes(cfg)/1e6); +} + +// ===== FLOP estimation per step ===== + +static inline double flops_per_step(const ModelConfig *cfg) { + const ModelDims *d = &cfg->dims; + int N = d->n_layers, D = d->dim, H = d->hidden_dim, S = d->seq_len; + int HD = d->head_dim, NH = d->n_heads; + // Forward: 4 linear projections (QKV+O) + 3 FFN projections per layer + double fwd = N * (4.0*2*D*D*S + 2.0*2*D*H*S + 2.0*H*D*S); + // Backward dx same flops as forward + double bwd_dx = fwd; + // Backward dW same flops as forward + double bwd_dw = fwd; + // SDPA backward (attention score computation) + double sdpa = N * 2.0 * NH * 5 * S * S * HD; + // Classifier (forward + backward) + double cls = 3.0 * 2.0 * d->vocab_size * D * S; + return fwd + bwd_dx + bwd_dw + sdpa + cls; +} + +static inline double ane_flops_per_step(const ModelConfig *cfg) { + const ModelDims *d = &cfg->dims; + int N = d->n_layers, D = d->dim, H = d->hidden_dim, S = d->seq_len; + int HD = d->head_dim, NH = d->n_heads; + double fwd = N * (4.0*2*D*D*S + 2.0*2*D*H*S + 2.0*H*D*S); + double bwd_dx = fwd; + double sdpa = N * 2.0 * NH * 5 * S * S * HD; + return fwd + bwd_dx + sdpa; // dW is on CPU (cblas) +} + +// ===== Model presets ===== + +static ModelConfig model_config_stories110m(void) { + ModelConfig cfg = {0}; + cfg.name = "Stories110M"; + cfg.dims = (ModelDims){ + .dim = 768, .hidden_dim = 2048, .n_heads = 12, + .n_kv_heads = 12, .n_layers = 12, .vocab_size = 32000, .seq_len = 256 + }; + cfg.compile = (CompileConfig){ + .compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, 
.accum_steps = 10, .headroom_pct = 0.10f + }; + model_dims_init(&cfg.dims); + return cfg; +} + +static ModelConfig model_config_stories42m(void) { + ModelConfig cfg = {0}; + cfg.name = "Stories42M"; + cfg.dims = (ModelDims){ + .dim = 512, .hidden_dim = 1376, .n_heads = 8, + .n_kv_heads = 8, .n_layers = 8, .vocab_size = 32000, .seq_len = 256 + }; + cfg.compile = (CompileConfig){ + .compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .accum_steps = 10, .headroom_pct = 0.10f + }; + model_dims_init(&cfg.dims); + return cfg; +} + +static ModelConfig model_config_llama_1b(void) { + ModelConfig cfg = {0}; + cfg.name = "LLaMA-1.1B"; + cfg.dims = (ModelDims){ + .dim = 2048, .hidden_dim = 5504, .n_heads = 16, + .n_kv_heads = 16, .n_layers = 22, .vocab_size = 32000, .seq_len = 512 + }; + cfg.compile = (CompileConfig){ + .compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .accum_steps = 4, .headroom_pct = 0.10f + }; + model_dims_init(&cfg.dims); + return cfg; +} + +static ModelConfig model_config_llama_7b(void) { + ModelConfig cfg = {0}; + cfg.name = "LLaMA-7B"; + cfg.dims = (ModelDims){ + .dim = 4096, .hidden_dim = 11008, .n_heads = 32, + .n_kv_heads = 32, .n_layers = 32, .vocab_size = 32000, .seq_len = 512 + }; + cfg.compile = (CompileConfig){ + .compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .accum_steps = 2, .headroom_pct = 0.10f + }; + model_dims_init(&cfg.dims); + return cfg; +} + +// Parse config from command-line (returns preset, caller can override) +static ModelConfig model_config_from_args(int argc, char *argv[]) { + ModelConfig cfg = model_config_stories110m(); // default + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--model") == 0 && i+1 < argc) { + const char *name = argv[++i]; + if (strcmp(name, "stories42m") == 0) cfg = model_config_stories42m(); + else if (strcmp(name, "stories110m") == 0) cfg = model_config_stories110m(); + else if (strcmp(name, "llama1b") == 0) cfg = 
model_config_llama_1b(); + else if (strcmp(name, "llama7b") == 0) cfg = model_config_llama_7b(); + else fprintf(stderr, "Unknown model: %s (using stories110m)\n", name); + } + else if (strcmp(argv[i], "--dim") == 0 && i+1 < argc) cfg.dims.dim = atoi(argv[++i]); + else if (strcmp(argv[i], "--hidden") == 0 && i+1 < argc) cfg.dims.hidden_dim = atoi(argv[++i]); + else if (strcmp(argv[i], "--heads") == 0 && i+1 < argc) cfg.dims.n_heads = atoi(argv[++i]); + else if (strcmp(argv[i], "--layers") == 0 && i+1 < argc) cfg.dims.n_layers = atoi(argv[++i]); + else if (strcmp(argv[i], "--seq") == 0 && i+1 < argc) cfg.dims.seq_len = atoi(argv[++i]); + else if (strcmp(argv[i], "--vocab") == 0 && i+1 < argc) cfg.dims.vocab_size = atoi(argv[++i]); + else if (strcmp(argv[i], "--budget") == 0 && i+1 < argc) cfg.compile.compile_budget = atoi(argv[++i]); + else if (strcmp(argv[i], "--accum") == 0 && i+1 < argc) cfg.compile.accum_steps = atoi(argv[++i]); + else if (strcmp(argv[i], "--headroom") == 0 && i+1 < argc) cfg.compile.headroom_pct = atof(argv[++i]); + } + model_dims_init(&cfg.dims); + return cfg; +} diff --git a/training/pipeline.h b/training/pipeline.h new file mode 100644 index 0000000..d3ceb95 --- /dev/null +++ b/training/pipeline.h @@ -0,0 +1,515 @@ +// pipeline.h — Layer-group scheduling and mmap state for multi-group ANE training +// Manages compile budgets, exec() restarts, and cross-exec shared tensor state +#pragma once +#include "model_config.h" +#include <stdbool.h> +#include <fcntl.h> +#include <sys/mman.h> +#include <sys/stat.h> +#include <unistd.h> + +// ===== Compile budget tracker ===== + +typedef struct { + int budget; // max compilations allowed + int used; // compilations consumed so far + int headroom; // safety margin (budget * 0.1) +} CompileBudget; + +static CompileBudget budget_init(const CompileConfig *cc) { + CompileBudget b; + b.budget = cc->compile_budget; + b.used = 0; + float pct = (cc->headroom_pct > 0.0f && cc->headroom_pct < 1.0f) + ? 
cc->headroom_pct : 0.10f; + b.headroom = (int)(cc->compile_budget * pct); + return b; +} + +static bool budget_can_fit(const CompileBudget *b, int n_kernels) { + return (b->used + n_kernels) <= (b->budget - b->headroom); +} + +static void budget_consume(CompileBudget *b, int n_kernels) { + b->used += n_kernels; +} + +static bool budget_needs_restart(const CompileBudget *b) { + return b->used >= (b->budget - b->headroom); +} + +static int budget_remaining(const CompileBudget *b) { + int r = b->budget - b->headroom - b->used; + return (r > 0) ? r : 0; +} + +// ===== Pipeline execution phases ===== + +typedef enum { + PHASE_INIT = 0, + PHASE_FORWARD, // running forward pass through layer groups + PHASE_BACKWARD, // running backward pass through layer groups (reverse) + PHASE_WEIGHT_UPDATE, // Adam step on accumulated gradients + PHASE_DONE // training step complete +} PipelinePhase; + +typedef enum { + ACTION_COMPILE_GROUP, // compile kernels for current layer group + ACTION_RUN_FORWARD_GROUP, // execute forward pass for compiled group + ACTION_RUN_BACKWARD_GROUP, // execute backward pass for compiled group + ACTION_EXEC_RESTART, // save state and exec() to reset compile budget + ACTION_WEIGHT_UPDATE, // run optimizer on all layers + ACTION_STEP_DONE, // training step complete + ACTION_ERROR // something went wrong +} PipelineAction; + +// ===== Scheduler state ===== + +typedef struct { + ModelConfig config; + PipelinePlan plan; + CompileBudget budget; + + PipelinePhase phase; + int current_group; // index into plan.groups + int current_step; // training step number + int accum_step; // gradient accumulation step within batch + int total_steps; // total training steps requested + float learning_rate; + float last_loss; + + // Flags + bool group_compiled; // whether current group's kernels are compiled + bool needs_restart; // whether we need exec() before next group +} PipelineScheduler; + +static PipelineScheduler pipeline_scheduler_init(ModelConfig config, int 
total_steps, float lr) { + PipelineScheduler s = {0}; + s.config = config; + s.plan = compute_pipeline_plan(&config); + s.budget = budget_init(&config.compile); + s.phase = PHASE_FORWARD; + s.current_group = 0; + s.current_step = 0; + s.accum_step = 0; + s.total_steps = total_steps; + s.learning_rate = lr; + s.last_loss = 999.0f; + s.group_compiled = false; + s.needs_restart = false; + return s; +} + +// Get the next action the training loop should take +static PipelineAction pipeline_next_action(PipelineScheduler *s) { + if (s->current_step >= s->total_steps) + return ACTION_STEP_DONE; + + switch (s->phase) { + case PHASE_FORWARD: + if (s->current_group >= s->plan.n_groups) { + // Forward pass complete for all groups — start backward + s->phase = PHASE_BACKWARD; + s->current_group = s->plan.n_groups - 1; + s->group_compiled = false; + return pipeline_next_action(s); + } + if (!s->group_compiled) { + // Check if we have compile budget for this group + LayerGroup *lg = &s->plan.groups[s->current_group]; + if (!budget_can_fit(&s->budget, lg->total_kernels)) { + s->needs_restart = true; + return ACTION_EXEC_RESTART; + } + return ACTION_COMPILE_GROUP; + } + return ACTION_RUN_FORWARD_GROUP; + + case PHASE_BACKWARD: + if (s->current_group < 0) { + // Backward complete — weight update + s->phase = PHASE_WEIGHT_UPDATE; + return ACTION_WEIGHT_UPDATE; + } + if (!s->group_compiled) { + LayerGroup *lg = &s->plan.groups[s->current_group]; + if (!budget_can_fit(&s->budget, lg->total_kernels)) { + s->needs_restart = true; + return ACTION_EXEC_RESTART; + } + return ACTION_COMPILE_GROUP; + } + return ACTION_RUN_BACKWARD_GROUP; + + case PHASE_WEIGHT_UPDATE: + return ACTION_WEIGHT_UPDATE; + + case PHASE_DONE: + return ACTION_STEP_DONE; + + default: + return ACTION_ERROR; + } +} + +// Called after successfully compiling a layer group's kernels +static void pipeline_group_compiled(PipelineScheduler *s) { + LayerGroup *lg = &s->plan.groups[s->current_group]; + budget_consume(&s->budget, 
lg->total_kernels); + s->group_compiled = true; +} + +// Called after successfully running forward for current group +static void pipeline_forward_group_done(PipelineScheduler *s) { + s->current_group++; + s->group_compiled = false; +} + +// Called after successfully running backward for current group +static void pipeline_backward_group_done(PipelineScheduler *s) { + s->current_group--; + s->group_compiled = false; +} + +// Called after weight update completes +static void pipeline_weight_update_done(PipelineScheduler *s) { + s->accum_step++; + if (s->accum_step >= s->config.compile.accum_steps) { + s->accum_step = 0; + s->current_step++; + } + // Reset for next forward pass + s->phase = PHASE_FORWARD; + s->current_group = 0; + s->group_compiled = false; +} + +// ===== mmap-based cross-exec state ===== +// +// Layout: [Header][weights: all layers][adam: all layers][grads: all layers][acts: all layers][Global state] +// All tensors stored as fp32. The mmap file persists across exec() restarts. + +#define MMAP_SENTINEL 0x414E4550 // "ANEP" — file format identifier +#define MMAP_VERSION 1 + +typedef struct { + int sentinel; // MMAP_SENTINEL for file identification + int version; + int n_layers; + int dim; + int hidden_dim; + int n_heads; + int vocab_size; + int seq_len; + // Scheduler state (for exec restart) + int phase; + int current_group; + int current_step; + int accum_step; + int total_steps; + int compile_count; // compiles used in current process + int adam_t; // Adam timestep + float learning_rate; + float last_loss; + // Offsets into mmap (bytes from base) + size_t layer_weights_offset; // start of per-layer weight data + size_t layer_adam_offset; // start of per-layer adam state + size_t layer_grads_offset; // start of per-layer gradient accumulators + size_t layer_acts_offset; // start of per-layer activation checkpoints + size_t global_offset; // start of global state (rms_final, embed, etc.) 
+ size_t total_size; // total mmap size + int pad[4]; // alignment +} MmapHeader; + +typedef struct { + int fd; + void *base; + size_t size; + MmapHeader *header; + const char *path; +} MmapState; + +// Compute mmap layout for a given config +static size_t mmap_compute_size(const ModelConfig *cfg) { + const ModelDims *d = &cfg->dims; + size_t header = sizeof(MmapHeader); + // Round up to page boundary + header = (header + 4095) & ~(size_t)4095; + + size_t per_layer_weights = layer_weight_bytes(d); + size_t per_layer_adam = layer_adam_bytes(d); + size_t per_layer_grads = layer_gradient_bytes(d); + size_t per_layer_acts = layer_activation_bytes(d); + + size_t all_layers = (size_t)d->n_layers * (per_layer_weights + per_layer_adam + per_layer_grads + per_layer_acts); + + // Global: rms_final + embed + their adam states + embed gradients + size_t global = (size_t)d->dim * 4 // rms_final + + (size_t)d->vocab_size * d->dim * 4 // embed + + (size_t)d->dim * 2 * 4 // rms_final adam (m+v) + + (size_t)d->vocab_size * d->dim * 2 * 4 // embed adam + + (size_t)d->dim * 4 // rms_final grad + + (size_t)d->vocab_size * d->dim * 4; // embed grad + + return header + all_layers + global; +} + +// Create a new mmap state file +static MmapState *mmap_state_create(const char *path, const ModelConfig *cfg) { + size_t total = mmap_compute_size(cfg); + int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644); + if (fd < 0) { perror("mmap_state_create: open"); return NULL; } + if (ftruncate(fd, total) < 0) { perror("mmap_state_create: ftruncate"); close(fd); return NULL; } + + void *base = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (base == MAP_FAILED) { perror("mmap_state_create: mmap"); close(fd); return NULL; } + + MmapState *ms = (MmapState *)calloc(1, sizeof(MmapState)); + if (!ms) { perror("mmap_state_create: calloc"); munmap(base, total); close(fd); return NULL; } + ms->fd = fd; + ms->base = base; + ms->size = total; + ms->path = path; + ms->header = (MmapHeader 
*)base; + + // Initialize header + MmapHeader *h = ms->header; + h->sentinel = MMAP_SENTINEL; + h->version = MMAP_VERSION; + h->n_layers = cfg->dims.n_layers; + h->dim = cfg->dims.dim; + h->hidden_dim = cfg->dims.hidden_dim; + h->n_heads = cfg->dims.n_heads; + h->vocab_size = cfg->dims.vocab_size; + h->seq_len = cfg->dims.seq_len; + + // Compute offsets + size_t header_end = (sizeof(MmapHeader) + 4095) & ~(size_t)4095; + const ModelDims *d = &cfg->dims; + size_t pw = layer_weight_bytes(d); + size_t pa = layer_adam_bytes(d); + size_t pg = layer_gradient_bytes(d); + size_t pact = layer_activation_bytes(d); + + h->layer_weights_offset = header_end; + h->layer_adam_offset = h->layer_weights_offset + (size_t)d->n_layers * pw; + h->layer_grads_offset = h->layer_adam_offset + (size_t)d->n_layers * pa; + h->layer_acts_offset = h->layer_grads_offset + (size_t)d->n_layers * pg; + h->global_offset = h->layer_acts_offset + (size_t)d->n_layers * pact; + h->total_size = total; + + return ms; +} + +// Reopen existing mmap state (after exec() restart) +static MmapState *mmap_state_open(const char *path) { + int fd = open(path, O_RDWR); + if (fd < 0) { perror("mmap_state_open: open"); return NULL; } + struct stat st; + if (fstat(fd, &st) < 0) { perror("mmap_state_open: fstat"); close(fd); return NULL; } + + void *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (base == MAP_FAILED) { perror("mmap_state_open: mmap"); close(fd); return NULL; } + + if ((size_t)st.st_size < sizeof(MmapHeader)) { + fprintf(stderr, "mmap_state_open: file too small (%lld bytes)\n", (long long)st.st_size); + munmap(base, st.st_size); + close(fd); + return NULL; + } + + MmapHeader *h = (MmapHeader *)base; + if (h->sentinel != MMAP_SENTINEL || h->version != MMAP_VERSION) { + fprintf(stderr, "mmap_state_open: invalid header (sentinel=0x%08x version=%d)\n", + h->sentinel, h->version); + munmap(base, st.st_size); + close(fd); + return NULL; + } + + if (h->total_size != 0 && 
(size_t)st.st_size < h->total_size) { + fprintf(stderr, "mmap_state_open: file truncated (expected %zu, got %lld)\n", + h->total_size, (long long)st.st_size); + munmap(base, st.st_size); + close(fd); + return NULL; + } + + MmapState *ms = (MmapState *)calloc(1, sizeof(MmapState)); + if (!ms) { perror("mmap_state_open: calloc"); munmap(base, st.st_size); close(fd); return NULL; } + ms->fd = fd; + ms->base = base; + ms->size = st.st_size; + ms->path = path; + ms->header = h; + return ms; +} + +// Close and unmap (does NOT delete the file) +static void mmap_state_close(MmapState *ms) { + if (!ms) return; + if (ms->base && ms->base != MAP_FAILED) { + if (msync(ms->base, ms->size, MS_SYNC) < 0) perror("mmap_state_close: msync"); + if (munmap(ms->base, ms->size) < 0) perror("mmap_state_close: munmap"); + } + if (ms->fd >= 0) close(ms->fd); + free(ms); +} + +// Delete the mmap file (call after training completes) +static void mmap_state_destroy(MmapState *ms) { + if (!ms) return; + const char *p = ms->path; + mmap_state_close(ms); + unlink(p); +} + +// ===== Typed accessors into mmap regions ===== + +// Reconstruct ModelDims from mmap header (avoids repeating in each accessor) +static inline ModelDims mmap_dims(const MmapState *ms) { + return (ModelDims){ + .dim = ms->header->dim, .hidden_dim = ms->header->hidden_dim, + .n_heads = ms->header->n_heads, .vocab_size = ms->header->vocab_size, + .seq_len = ms->header->seq_len + }; +} + +// Get pointer to layer L's weights in mmap (NULL if out of bounds) +static float *mmap_layer_weights(MmapState *ms, int layer) { + if (!ms || layer < 0 || layer >= ms->header->n_layers) return NULL; + ModelDims d = mmap_dims(ms); + return (float *)((char *)ms->base + ms->header->layer_weights_offset + + (size_t)layer * layer_weight_bytes(&d)); +} + +// Get pointer to layer L's adam state in mmap (NULL if out of bounds) +static float *mmap_layer_adam(MmapState *ms, int layer) { + if (!ms || layer < 0 || layer >= ms->header->n_layers) return 
NULL; + ModelDims d = mmap_dims(ms); + return (float *)((char *)ms->base + ms->header->layer_adam_offset + + (size_t)layer * layer_adam_bytes(&d)); +} + +// Get pointer to layer L's gradient accumulators in mmap (NULL if out of bounds) +static float *mmap_layer_grads(MmapState *ms, int layer) { + if (!ms || layer < 0 || layer >= ms->header->n_layers) return NULL; + ModelDims d = mmap_dims(ms); + return (float *)((char *)ms->base + ms->header->layer_grads_offset + + (size_t)layer * layer_gradient_bytes(&d)); +} + +// Get pointer to layer L's activation checkpoint in mmap (NULL if out of bounds) +static float *mmap_layer_acts(MmapState *ms, int layer) { + if (!ms || layer < 0 || layer >= ms->header->n_layers) return NULL; + ModelDims d = mmap_dims(ms); + return (float *)((char *)ms->base + ms->header->layer_acts_offset + + (size_t)layer * layer_activation_bytes(&d)); +} + +// Get pointer to global state region (rms_final, embed, etc.) +static float *mmap_global(MmapState *ms) { + return (float *)((char *)ms->base + ms->header->global_offset); +} + +// ===== Save/restore scheduler state to/from mmap header ===== + +static void pipeline_save_to_mmap(const PipelineScheduler *s, MmapState *ms) { + MmapHeader *h = ms->header; + h->phase = (int)s->phase; + h->current_group = s->current_group; + h->current_step = s->current_step; + h->accum_step = s->accum_step; + h->total_steps = s->total_steps; + h->learning_rate = s->learning_rate; + h->last_loss = s->last_loss; + msync(ms->base, sizeof(MmapHeader), MS_SYNC); +} + +static void pipeline_restore_from_mmap(PipelineScheduler *s, const MmapState *ms) { + const MmapHeader *h = ms->header; + s->phase = (PipelinePhase)h->phase; + s->current_group = h->current_group; + s->current_step = h->current_step; + s->accum_step = h->accum_step; + s->total_steps = h->total_steps; + s->learning_rate = h->learning_rate; + s->last_loss = h->last_loss; + // Reset compile budget (new process after exec) + s->budget = 
budget_init(&s->config.compile); + s->group_compiled = false; + s->needs_restart = false; +} + +// ===== exec() restart with mmap persistence ===== + +// Call this when ACTION_EXEC_RESTART is returned. +// Saves scheduler state to mmap, syncs, and exec()s. +// Does not return on success. +static void pipeline_exec_restart(PipelineScheduler *s, MmapState *ms, char *argv[]) { + pipeline_save_to_mmap(s, ms); + printf("[pipeline] exec() restart: step=%d phase=%d group=%d compiles=%d\n", + s->current_step, s->phase, s->current_group, s->budget.used); + fflush(stdout); + + // Sync all mmap data before exec + msync(ms->base, ms->size, MS_SYNC); + + // exec with --pipeline-resume flag + execl(argv[0], argv[0], "--pipeline-resume", ms->path, NULL); + perror("pipeline_exec_restart: execl"); +} + +// Resume from exec() restart. Returns true if this is a resume. +static bool pipeline_check_resume(int argc, char *argv[], PipelineScheduler *s, MmapState **ms_out) { + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--pipeline-resume") == 0 && i+1 < argc) { + const char *mmap_path = argv[i+1]; + MmapState *ms = mmap_state_open(mmap_path); + if (!ms) { + fprintf(stderr, "[pipeline] Failed to reopen mmap at %s\n", mmap_path); + return false; + } + pipeline_restore_from_mmap(s, ms); + *ms_out = ms; + printf("[pipeline] Resumed: step=%d phase=%d group=%d\n", + s->current_step, s->phase, s->current_group); + return true; + } + } + return false; +} + +// ===== Pipeline pretty-print helpers ===== + +static const char *phase_name(PipelinePhase p) { + switch (p) { + case PHASE_INIT: return "INIT"; + case PHASE_FORWARD: return "FORWARD"; + case PHASE_BACKWARD: return "BACKWARD"; + case PHASE_WEIGHT_UPDATE: return "WEIGHT_UPDATE"; + case PHASE_DONE: return "DONE"; + default: return "UNKNOWN"; + } +} + +static const char *action_name(PipelineAction a) { + switch (a) { + case ACTION_COMPILE_GROUP: return "COMPILE_GROUP"; + case ACTION_RUN_FORWARD_GROUP: return "RUN_FORWARD_GROUP"; + 
case ACTION_RUN_BACKWARD_GROUP: return "RUN_BACKWARD_GROUP"; + case ACTION_EXEC_RESTART: return "EXEC_RESTART"; + case ACTION_WEIGHT_UPDATE: return "WEIGHT_UPDATE"; + case ACTION_STEP_DONE: return "STEP_DONE"; + case ACTION_ERROR: return "ERROR"; + default: return "UNKNOWN"; + } +} + +static void pipeline_print_status(const PipelineScheduler *s) { + printf("[pipeline] step=%d/%d accum=%d/%d phase=%s group=%d/%d budget=%d/%d\n", + s->current_step, s->total_steps, + s->accum_step, s->config.compile.accum_steps, + phase_name(s->phase), s->current_group, s->plan.n_groups, + s->budget.used, s->budget.budget); +} diff --git a/training/stories_cpu_ops.h b/training/stories_cpu_ops.h index c9f2cfa..cd103c5 100644 --- a/training/stories_cpu_ops.h +++ b/training/stories_cpu_ops.h @@ -2,14 +2,13 @@ #pragma once #include "stories_config.h" -static float *g_rms_tmp = NULL; static void rmsnorm(float *out, const float *x, const float *w, int d, int S) { - if (!g_rms_tmp) g_rms_tmp = (float*)malloc(S*4); + float *rms_tmp = (float*)malloc(S * sizeof(float)); float *ss = (float*)calloc(S, sizeof(float)); for (int i=0; i= V) { fprintf(stderr, "WARN: target token %d out of vocab range [0,%d), skipping\n", tgt, V); continue; } total_loss -= logf(row[tgt] + 1e-10f); // gradient: softmax - one_hot, then /S row[tgt] -= 1.0f; @@ -112,6 +112,7 @@ static float cross_entropy_loss(float *dlogits, const float *logits, const uint1 static void embed_lookup(float *x, const float *embed, const uint16_t *tokens, int dim, int seq) { for (int t = 0; t < seq; t++) { int tok = tokens[t]; + if (tok < 0 || tok >= VOCAB) { fprintf(stderr, "WARN: token %d out of range [0,%d)\n", tok, VOCAB); continue; } for (int d = 0; d < dim; d++) { x[d*seq + t] = embed[tok*dim + d]; } @@ -122,6 +123,7 @@ static void embed_lookup(float *x, const float *embed, const uint16_t *tokens, i static void embed_backward(float *d_embed, const float *dx, const uint16_t *tokens, int dim, int seq) { for (int t = 0; t < seq; t++) { int 
tok = tokens[t]; + if (tok < 0 || tok >= VOCAB) { continue; } for (int d = 0; d < dim; d++) { d_embed[tok*dim + d] += dx[d*seq + t]; } diff --git a/training/test_classifier.m b/training/test_classifier.m new file mode 100644 index 0000000..363e46e --- /dev/null +++ b/training/test_classifier.m @@ -0,0 +1,255 @@ +// test_classifier.m — Test classifier matmul (32000 channels) and softmax on ANE +// This tests the riskiest operations: VOCAB-sized conv and softmax +// Build: xcrun clang -O2 -framework Foundation -framework IOSurface \ +// -framework CoreML -framework Accelerate -ldl -lobjc \ +// -o test_classifier test_classifier.m +#include "ane_classifier.h" +#include "stories_cpu_ops.h" + +int main(void) { + @autoreleasepool { + setbuf(stdout, NULL); + ane_init(); + mach_timebase_info(&g_tb); + + printf("=== Test: Classifier + Softmax on ANE ===\n"); + printf("DIM=%d SEQ=%d VOCAB=%d\n\n", DIM, SEQ, VOCAB); + + // ======== Test 1: Final RMSNorm ======== + printf("--- Test 1: Final RMSNorm on ANE ---\n"); + { + float *x = (float*)malloc(DIM * SEQ * 4); + float *w = (float*)malloc(DIM * 4); + float *out_cpu = (float*)malloc(DIM * SEQ * 4); + float *out_ane = (float*)malloc(DIM * SEQ * 4); + srand48(42); + for (int i = 0; i < DIM * SEQ; i++) x[i] = (float)(drand48() * 2 - 1); + for (int i = 0; i < DIM; i++) w[i] = (float)(drand48() * 0.5 + 0.75); + + rmsnorm(out_cpu, x, w, DIM, SEQ); + + Kern *kern = compile_kern_mil_w(gen_final_rmsnorm(), (@{ + @"@model_path/weights/rms_w.bin": @{@"offset":@0, @"data":build_blob(w, 1, DIM)}, + }), DIM*SEQ*2, DIM*SEQ*2); + + if (!kern) { printf("FAIL: Final RMSNorm compile failed\n"); return 1; } + printf("Compile OK\n"); + + io_write_fp16(kern->ioIn, x, DIM, SEQ); + ane_eval(kern); + io_read_fp16(kern->ioOut, out_ane, 0, DIM, SEQ); + + float max_err = 0; + for (int i = 0; i < DIM*SEQ; i++) { + float e = fabsf(out_cpu[i] - out_ane[i]); + if (e > max_err) max_err = e; + } + printf("Max error: %.6f %s\n\n", max_err, max_err < 0.05 ? 
"PASS ✅" : "FAIL ❌"); + free_kern(kern); + free(x); free(w); free(out_cpu); free(out_ane); + } + + // ======== Test 2: Classifier forward (32000-channel conv) ======== + printf("--- Test 2: Classifier Forward (VOCAB=%d channel conv) ---\n", VOCAB); + { + float *x_final = (float*)malloc(DIM * SEQ * 4); + float *embed = (float*)malloc((size_t)VOCAB * DIM * 4); + float *logits_cpu = (float*)malloc((size_t)VOCAB * SEQ * 4); + float *logits_ane = (float*)malloc((size_t)VOCAB * SEQ * 4); + + srand48(123); + for (int i = 0; i < DIM * SEQ; i++) x_final[i] = (float)(drand48() * 2 - 1) * 0.1f; + for (size_t i = 0; i < (size_t)VOCAB * DIM; i++) embed[i] = (float)(drand48() * 2 - 1) * 0.02f; + + // CPU reference: logits = embed @ x_final + // logits[v, t] = sum_d embed[v,d] * x_final[d,t] + // embed is [VOCAB, DIM] row-major, x_final is [DIM, SEQ] channel-first + uint64_t t0 = mach_absolute_time(); + cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, + VOCAB, SEQ, DIM, 1.0f, + embed, DIM, x_final, SEQ, 0.0f, logits_cpu, SEQ); + uint64_t t1 = mach_absolute_time(); + printf("CPU cblas_sgemm: %.2f ms\n", tb_ms(t1-t0)); + + // ANE: build weight blob for embed [VOCAB, DIM] + printf("Building embed blob (%.1f MB fp16)...\n", (float)VOCAB*DIM*2/1e6); + NSData *embed_blob = build_blob(embed, VOCAB, DIM); + + printf("Compiling classifier kernel...\n"); + t0 = mach_absolute_time(); + Kern *cls = compile_kern_mil_w(gen_classifier_fwd(), (@{ + @"@model_path/weights/embed.bin": @{@"offset":@0, @"data":embed_blob}, + }), DIM*SEQ*2, VOCAB*SEQ*2); + t1 = mach_absolute_time(); + + if (!cls) { + printf("FAIL: Classifier compile failed (32000 channels too large for ANE)\n"); + printf("This confirms tiling is needed.\n\n"); + } else { + printf("Compile OK in %.0f ms (compiles=%d)\n", tb_ms(t1-t0), g_compile_count); + + io_write_fp16(cls->ioIn, x_final, DIM, SEQ); + t0 = mach_absolute_time(); + ane_eval(cls); + t1 = mach_absolute_time(); + printf("ANE eval: %.2f ms\n", tb_ms(t1-t0)); + + // 
Read back and compare (sample — full read would be 32000*256*4 = 32MB) + io_read_fp16(cls->ioOut, logits_ane, 0, VOCAB, SEQ); + + float max_err = 0, sum_err = 0; + int cnt = 0; + for (int v = 0; v < VOCAB; v++) { + for (int t = 0; t < SEQ; t++) { + int idx = v*SEQ + t; + float e = fabsf(logits_cpu[idx] - logits_ane[idx]); + sum_err += e; + cnt++; + if (e > max_err) max_err = e; + } + } + printf("Max error: %.6f Mean error: %.6f %s\n", + max_err, sum_err/cnt, max_err < 1.0 ? "PASS ✅" : "FAIL ❌"); + + // Benchmark + int N = 10; + t0 = mach_absolute_time(); + for (int i = 0; i < N; i++) ane_eval(cls); + t1 = mach_absolute_time(); + printf("Benchmark: %d evals in %.2f ms (%.2f ms/eval)\n\n", N, tb_ms(t1-t0), tb_ms(t1-t0)/N); + free_kern(cls); + } + free(x_final); free(embed); free(logits_cpu); free(logits_ane); + } + + // ======== Test 3: Softmax over VOCAB dimension ======== + printf("--- Test 3: Softmax over VOCAB=%d ---\n", VOCAB); + { + float *logits = (float*)malloc((size_t)VOCAB * SEQ * 4); + float *probs_cpu = (float*)malloc((size_t)VOCAB * SEQ * 4); + float *probs_ane = (float*)malloc((size_t)VOCAB * SEQ * 4); + + srand48(999); + for (size_t i = 0; i < (size_t)VOCAB * SEQ; i++) + logits[i] = (float)(drand48() * 10 - 5); + + // CPU reference softmax (per position, over vocab) + // logits is [VOCAB, SEQ] channel-first + uint64_t t0 = mach_absolute_time(); + for (int t = 0; t < SEQ; t++) { + float maxv = -1e30f; + for (int v = 0; v < VOCAB; v++) { + float val = logits[v*SEQ+t]; + if (val > maxv) maxv = val; + } + float sum = 0; + for (int v = 0; v < VOCAB; v++) { + probs_cpu[v*SEQ+t] = expf(logits[v*SEQ+t] - maxv); + sum += probs_cpu[v*SEQ+t]; + } + for (int v = 0; v < VOCAB; v++) probs_cpu[v*SEQ+t] /= sum; + } + uint64_t t1 = mach_absolute_time(); + printf("CPU softmax: %.2f ms\n", tb_ms(t1-t0)); + + printf("Compiling softmax kernel...\n"); + int sm_bytes = VOCAB * SEQ * 2; + Kern *sm = compile_kern_mil_w(gen_softmax_vocab(), @{}, sm_bytes, sm_bytes); + + if 
(!sm) { + printf("FAIL: Softmax compile failed\n\n"); + } else { + printf("Compile OK\n"); + + io_write_fp16(sm->ioIn, logits, VOCAB, SEQ); + t0 = mach_absolute_time(); + ane_eval(sm); + t1 = mach_absolute_time(); + printf("ANE eval: %.2f ms\n", tb_ms(t1-t0)); + + io_read_fp16(sm->ioOut, probs_ane, 0, VOCAB, SEQ); + + // Check: probs should sum to ~1.0 per position + float max_err = 0; + for (int t = 0; t < 4; t++) { + float sum_cpu = 0, sum_ane = 0; + for (int v = 0; v < VOCAB; v++) { + sum_cpu += probs_cpu[v*SEQ+t]; + sum_ane += probs_ane[v*SEQ+t]; + float e = fabsf(probs_cpu[v*SEQ+t] - probs_ane[v*SEQ+t]); + if (e > max_err) max_err = e; + } + printf(" pos %d: CPU sum=%.4f ANE sum=%.4f\n", t, sum_cpu, sum_ane); + } + printf("Max error (first 4 positions): %.6f %s\n", + max_err, max_err < 0.01 ? "PASS ✅" : "FAIL ❌"); + + int N = 10; + t0 = mach_absolute_time(); + for (int i = 0; i < N; i++) ane_eval(sm); + t1 = mach_absolute_time(); + printf("Benchmark: %d evals in %.2f ms (%.2f ms/eval)\n\n", N, tb_ms(t1-t0), tb_ms(t1-t0)/N); + free_kern(sm); + } + free(logits); free(probs_cpu); free(probs_ane); + } + + // ======== Test 4: Classifier backward ======== + printf("--- Test 4: Classifier Backward (DIM=%d from VOCAB=%d) ---\n", DIM, VOCAB); + { + float *dlogits = (float*)malloc((size_t)VOCAB * SEQ * 4); + float *embed = (float*)malloc((size_t)VOCAB * DIM * 4); + float *dx_cpu = (float*)malloc(DIM * SEQ * 4); + float *dx_ane = (float*)malloc(DIM * SEQ * 4); + + srand48(456); + for (size_t i = 0; i < (size_t)VOCAB * SEQ; i++) dlogits[i] = (float)(drand48() * 2 - 1) * 0.01f; + for (size_t i = 0; i < (size_t)VOCAB * DIM; i++) embed[i] = (float)(drand48() * 2 - 1) * 0.02f; + + // CPU: dx = embed^T @ dlogits + uint64_t t0 = mach_absolute_time(); + cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans, + DIM, SEQ, VOCAB, 1.0f, + embed, DIM, dlogits, SEQ, 0.0f, dx_cpu, SEQ); + uint64_t t1 = mach_absolute_time(); + printf("CPU cblas_sgemm: %.2f ms\n", tb_ms(t1-t0)); + + // 
Build transposed embed blob + NSData *embed_t_blob = build_blob_t(embed, VOCAB, DIM); + + printf("Compiling classifier backward...\n"); + Kern *clsb = compile_kern_mil_w(gen_classifier_bwd(), (@{ + @"@model_path/weights/embed_t.bin": @{@"offset":@0, @"data":embed_t_blob}, + }), VOCAB*SEQ*2, DIM*SEQ*2); + + if (!clsb) { + printf("FAIL: Classifier backward compile failed\n\n"); + } else { + printf("Compile OK\n"); + + io_write_fp16(clsb->ioIn, dlogits, VOCAB, SEQ); + t0 = mach_absolute_time(); + ane_eval(clsb); + t1 = mach_absolute_time(); + printf("ANE eval: %.2f ms\n", tb_ms(t1-t0)); + + io_read_fp16(clsb->ioOut, dx_ane, 0, DIM, SEQ); + + float max_err = 0, sum_err = 0; + for (int i = 0; i < DIM*SEQ; i++) { + float e = fabsf(dx_cpu[i] - dx_ane[i]); + sum_err += e; + if (e > max_err) max_err = e; + } + printf("Max error: %.6f Mean error: %.6f %s\n\n", + max_err, sum_err/(DIM*SEQ), max_err < 1.0 ? "PASS ✅" : "FAIL ❌"); + free_kern(clsb); + } + free(dlogits); free(embed); free(dx_cpu); free(dx_ane); + } + + printf("=== All tests complete ===\n"); + printf("Total ANE compiles used: %d\n", g_compile_count); + return 0; + } +} diff --git a/training/test_dynamic_matmul.m b/training/test_dynamic_matmul.m new file mode 100644 index 0000000..72addbd --- /dev/null +++ b/training/test_dynamic_matmul.m @@ -0,0 +1,333 @@ +// test_dynamic_matmul.m — Benchmark dynamic matmul on ANE (no recompile) +// Layout: input [1, D, 1, S+D] — activations in sp[0:S], weight rows in sp[S:S+D] +// MIL: slice → reshape → matmul → reshape → output +#import <Foundation/Foundation.h> +#import <CoreML/CoreML.h> +#import <IOSurface/IOSurface.h> +#import <Accelerate/Accelerate.h> +#import <objc/runtime.h> +#import <objc/message.h> +#include <math.h> +#include <mach/mach_time.h> + +#include "stories_io.h" + +// Generate MIL for y = x @ W where both come from input IOSurface +// Input: [1, IC, 1, SEQ+OC] fp32 +// sp[0:SEQ] = activations x[IC, SEQ] +// sp[SEQ:SEQ+OC] = weight W[IC, OC] (each channel d holds W[d, :]) +// Output: [1, OC, 1, SEQ] fp32 +static NSString *gen_dynamic_matmul_mil(int ic, int oc, int seq) { + NSMutableString *m = [NSMutableString 
string]; + [m appendString:@"program(1.3)\n" + "[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, " + "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, " + "{\"coremltools-version\", \"9.0\"}})]\n{\n"]; + int sp_total = seq + oc; + [m appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", ic, sp_total]; + // Cast to fp16 + [m appendString:@" string to16 = const()[name = string(\"to16\"), val = string(\"fp16\")];\n"]; + [m appendFormat:@" tensor<fp16, [1, %d, 1, %d]> xh = cast(dtype = to16, x = x)[name = string(\"cin\")];\n", ic, sp_total]; + // Slice activations [1, IC, 1, SEQ] + [m appendString:@" tensor<int32, [4]> ba = const()[name = string(\"ba\"), val = tensor<int32, [4]>([0,0,0,0])];\n"]; + [m appendFormat:@" tensor<int32, [4]> sa = const()[name = string(\"sa\"), val = tensor<int32, [4]>([1,%d,1,%d])];\n", ic, seq]; + [m appendFormat:@" tensor<fp16, [1, %d, 1, %d]> act = slice_by_size(x=xh,begin=ba,size=sa)[name=string(\"act\")];\n", ic, seq]; + // Slice weight [1, IC, 1, OC] + [m appendFormat:@" tensor<int32, [4]> bw = const()[name = string(\"bw\"), val = tensor<int32, [4]>([0,0,0,%d])];\n", seq]; + [m appendFormat:@" tensor<int32, [4]> sw = const()[name = string(\"sw\"), val = tensor<int32, [4]>([1,%d,1,%d])];\n", ic, oc]; + [m appendFormat:@" tensor<fp16, [1, %d, 1, %d]> wt = slice_by_size(x=xh,begin=bw,size=sw)[name=string(\"wt\")];\n", ic, oc]; + // Reshape act: [1,IC,1,SEQ] → [1,1,IC,SEQ] → transpose → [1,1,SEQ,IC] + [m appendFormat:@" tensor<int32, [4]> ra = const()[name = string(\"ra\"), val = tensor<int32, [4]>([1,1,%d,%d])];\n", ic, seq]; + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> a2 = reshape(shape=ra,x=act)[name=string(\"a2\")];\n", ic, seq]; + [m appendString:@" tensor<int32, [4]> pm = const()[name = string(\"pm\"), val = tensor<int32, [4]>([0,1,3,2])];\n"]; + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> a3 = transpose(perm=pm,x=a2)[name=string(\"a3\")];\n", seq, ic]; + // Reshape weight: [1,IC,1,OC] → [1,1,IC,OC] + [m appendFormat:@" tensor<int32, [4]> rw = const()[name = string(\"rw\"), val = tensor<int32, [4]>([1,1,%d,%d])];\n", ic, oc]; + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> W = reshape(shape=rw,x=wt)[name=string(\"W\")];\n", ic, oc]; + // matmul: [1,1,SEQ,IC] @ [1,1,IC,OC] → 
[1,1,SEQ,OC] + [m appendString:@" bool bF = const()[name = string(\"bF\"), val = bool(false)];\n"]; + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> yh = matmul(transpose_x=bF,transpose_y=bF,x=a3,y=W)[name=string(\"mm\")];\n", seq, oc]; + // Reshape+transpose back: [1,1,SEQ,OC] → transpose → [1,1,OC,SEQ] → reshape → [1,OC,1,SEQ] + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> yt = transpose(perm=pm,x=yh)[name=string(\"yt\")];\n", oc, seq]; + [m appendFormat:@" tensor<int32, [4]> ro = const()[name = string(\"ro\"), val = tensor<int32, [4]>([1,%d,1,%d])];\n", oc, seq]; + [m appendFormat:@" tensor<fp16, [1, %d, 1, %d]> yr = reshape(shape=ro,x=yt)[name=string(\"yr\")];\n", oc, seq]; + // Cast back to fp32 + [m appendString:@" string to32 = const()[name = string(\"to32\"), val = string(\"fp32\")];\n"]; + [m appendFormat:@" tensor<fp32, [1, %d, 1, %d]> y = cast(dtype = to32, x = yr)[name = string(\"cout\")];\n", oc, seq]; + [m appendString:@" } -> (y);\n}\n"]; + return m; +} + +// Tiled version: splits OC into tiles, each tile is a separate kernel +// For W[IC, OC], tile along OC: each tile handles W[:, t*T:(t+1)*T] +// Input per tile: [1, IC, 1, SEQ+T] +// Output per tile: [1, T, 1, SEQ] +typedef struct { + Kern **tiles; + int n_tiles, tile_oc, ic, oc, seq; +} TiledMatmul; + +static TiledMatmul *compile_tiled_matmul(int ic, int oc, int tile_oc, int seq) { + TiledMatmul *tm = (TiledMatmul*)calloc(1, sizeof(TiledMatmul)); + tm->ic = ic; tm->oc = oc; tm->seq = seq; tm->tile_oc = tile_oc; + tm->n_tiles = (oc + tile_oc - 1) / tile_oc; + tm->tiles = (Kern**)calloc(tm->n_tiles, sizeof(Kern*)); + for (int t = 0; t < tm->n_tiles; t++) { + int this_oc = (t == tm->n_tiles-1 && oc % tile_oc) ? 
(oc % tile_oc) : tile_oc; + NSString *mil = gen_dynamic_matmul_mil(ic, this_oc, seq); + int in_bytes = ic * (seq + this_oc) * 4; + int out_bytes = this_oc * seq * 4; + tm->tiles[t] = compile_kern_mil_w(mil, @{}, in_bytes, out_bytes); + if (!tm->tiles[t]) { printf("Tile %d compile FAIL\n", t); return NULL; } + } + return tm; +} + +// Write activations + weight tile into IOSurface +// act: [IC, SEQ] column-major (channel-first) +// W: [IC, OC] — full weight matrix, we extract the tile +static void write_tile_input(TiledMatmul *tm, int tile_idx, const float *act, const float *W) { + Kern *k = tm->tiles[tile_idx]; + int ic = tm->ic, seq = tm->seq, toc = tm->tile_oc; + int oc_off = tile_idx * toc; + int this_oc = (tile_idx == tm->n_tiles-1 && tm->oc % toc) ? (tm->oc % toc) : toc; + + IOSurfaceLock(k->ioIn, 0, NULL); + float *buf = (float*)IOSurfaceGetBaseAddress(k->ioIn); + // Activations: buf[d * (seq+this_oc) + t] = act[d * seq + t] + for (int d = 0; d < ic; d++) { + memcpy(buf + d*(seq+this_oc), act + d*seq, seq*sizeof(float)); + // Weight: buf[d * (seq+this_oc) + seq + c] = W[d * oc + oc_off + c] + for (int c = 0; c < this_oc; c++) + buf[d*(seq+this_oc) + seq + c] = W[d*tm->oc + oc_off + c]; + } + IOSurfaceUnlock(k->ioIn, 0, NULL); +} + +// Read tile output into full output buffer +static void read_tile_output(TiledMatmul *tm, int tile_idx, float *out) { + Kern *k = tm->tiles[tile_idx]; + int seq = tm->seq, toc = tm->tile_oc; + int oc_off = tile_idx * toc; + int this_oc = (tile_idx == tm->n_tiles-1 && tm->oc % toc) ? 
(tm->oc % toc) : toc; + + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + float *obuf = (float*)IOSurfaceGetBaseAddress(k->ioOut); + for (int c = 0; c < this_oc; c++) + memcpy(out + (oc_off+c)*seq, obuf + c*seq, seq*sizeof(float)); + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); +} + +int main(int argc, char **argv) { + @autoreleasepool { + mach_timebase_info(&g_tb); + ane_init(); + + // === Test 1: Single 64×64 dynamic matmul (correctness) === + printf("=== Test 1: 64×64 dynamic matmul correctness ===\n"); + { + int D = 64, S = 64; + NSString *mil = gen_dynamic_matmul_mil(D, D, S); + int in_b = D * (S+D) * 4, out_b = D * S * 4; + Kern *k = compile_kern_mil_w(mil, @{}, in_b, out_b); + if (!k) { printf("FAIL\n"); return 1; } + + // Identity test + IOSurfaceLock(k->ioIn, 0, NULL); + float *inp = (float*)IOSurfaceGetBaseAddress(k->ioIn); + memset(inp, 0, in_b); + for (int d = 0; d < D; d++) + for (int s = 0; s < S; s++) + inp[d*(S+D) + s] = (float)(d*S + s) * 0.001f; + for (int d = 0; d < D; d++) + for (int c = 0; c < D; c++) + inp[d*(S+D) + S + c] = (d == c) ? 1.0f : 0.0f; + IOSurfaceUnlock(k->ioIn, 0, NULL); + + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + float *out = (float*)IOSurfaceGetBaseAddress(k->ioOut); + float me = 0; + for (int d = 0; d < D; d++) + for (int s = 0; s < S; s++) { + float e = fabsf(out[d*S+s] - inp[d*(S+D)+s]); + if (e > me) me = e; + } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Identity: max_err=%.6f %s\n", me, me < 0.01 ? "PASS" : "FAIL"); + + // 2× test + IOSurfaceLock(k->ioIn, 0, NULL); + for (int d = 0; d < D; d++) + for (int c = 0; c < D; c++) + inp[d*(S+D) + S + c] = (d == c) ? 
2.0f : 0.0f; + IOSurfaceUnlock(k->ioIn, 0, NULL); + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + float sr = 0; int cnt = 0; + for (int i = 0; i < D*S; i++) + if (fabsf(inp[i/(S)*((S)+D) + i%S]) > 0.001f) { sr += out[i]/inp[i/S*(S+D)+i%S]; cnt++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("2× W: ratio=%.3f %s\n\n", cnt?sr/cnt:0, fabsf(sr/cnt-2.0f)<0.1?"PASS":"FAIL"); + free_kern(k); + } + + // === Test 2: 768×768 single kernel (if it compiles) === + printf("=== Test 2: 768×768 single dynamic matmul ===\n"); + { + int D = 768, S = 256; + int sp_total = S + D; // 256 + 768 = 1024 + int in_b = D * sp_total * 4; // 768 * 1024 * 4 = 3.1MB + int out_b = D * S * 4; // 768 * 256 * 4 = 786KB + printf("IOSurface: in=%.1fMB out=%.1fKB\n", in_b/1e6, out_b/1e3); + + NSString *mil = gen_dynamic_matmul_mil(D, D, S); + uint64_t t0 = mach_absolute_time(); + Kern *k = compile_kern_mil_w(mil, @{}, in_b, out_b); + double compile_ms = tb_ms(mach_absolute_time() - t0); + if (!k) { printf("768×768 compile FAIL\n"); } + else { + printf("Compile: %.1fms\n", compile_ms); + // Random weights + float *act = (float*)calloc(D*S, sizeof(float)); + float *W = (float*)calloc(D*D, sizeof(float)); + for (int i = 0; i < D*S; i++) act[i] = ((float)arc4random() / UINT32_MAX - 0.5f) * 0.1f; + for (int i = 0; i < D*D; i++) W[i] = ((float)arc4random() / UINT32_MAX - 0.5f) * 0.01f; + + // Write to IOSurface + IOSurfaceLock(k->ioIn, 0, NULL); + float *inp = (float*)IOSurfaceGetBaseAddress(k->ioIn); + for (int d = 0; d < D; d++) { + memcpy(inp + d*(S+D), act + d*S, S*4); + memcpy(inp + d*(S+D) + S, W + d*D, D*4); + } + IOSurfaceUnlock(k->ioIn, 0, NULL); + + // Warmup + for (int i = 0; i < 3; i++) ane_eval(k); + + // Benchmark + int iters = 50; + t0 = mach_absolute_time(); + for (int i = 0; i < iters; i++) ane_eval(k); + double total_ms = tb_ms(mach_absolute_time() - t0); + double per_eval = total_ms / iters; + double flops = 2.0 * D * D * S; // matmul 
FLOPs + double gflops = flops / (per_eval * 1e6); + printf("768×768×256 matmul: %.3fms/eval %.1f GFLOP/s\n", per_eval, gflops); + + // Benchmark with IO write (simulating weight update) + t0 = mach_absolute_time(); + for (int i = 0; i < iters; i++) { + IOSurfaceLock(k->ioIn, 0, NULL); + float *p = (float*)IOSurfaceGetBaseAddress(k->ioIn); + for (int d = 0; d < D; d++) + memcpy(p + d*(S+D) + S, W + d*D, D*4); + IOSurfaceUnlock(k->ioIn, 0, NULL); + ane_eval(k); + } + total_ms = tb_ms(mach_absolute_time() - t0); + per_eval = total_ms / iters; + gflops = flops / (per_eval * 1e6); + printf("With weight IO: %.3fms/eval %.1f GFLOP/s\n", per_eval, gflops); + + free(act); free(W); free_kern(k); + } + } + + // === Test 3: Tiled matmul benchmark === + int tile_sizes[] = {64, 128, 256, 384, 768}; + int n_tiles_test = sizeof(tile_sizes)/sizeof(tile_sizes[0]); + printf("\n=== Test 3: Tiled 768×768 matmul (varying tile_oc) ===\n"); + printf("%-10s %-8s %-10s %-12s %-10s\n", "tile_oc", "tiles", "compile", "eval/ms", "GFLOP/s"); + { + int D = 768, S = 256; + float *act = (float*)calloc(D*S, sizeof(float)); + float *W = (float*)calloc(D*D, sizeof(float)); + float *out_full = (float*)calloc(D*S, sizeof(float)); + for (int i = 0; i < D*S; i++) act[i] = ((float)arc4random() / UINT32_MAX - 0.5f) * 0.1f; + for (int i = 0; i < D*D; i++) W[i] = ((float)arc4random() / UINT32_MAX - 0.5f) * 0.01f; + + for (int ti = 0; ti < n_tiles_test; ti++) { + int T = tile_sizes[ti]; + if (T > D) continue; + uint64_t t0 = mach_absolute_time(); + TiledMatmul *tm = compile_tiled_matmul(D, D, T, S); + double compile_ms = tb_ms(mach_absolute_time() - t0); + if (!tm) { printf("%-10d FAIL\n", T); continue; } + + // Warmup + for (int w = 0; w < 2; w++) { + for (int t = 0; t < tm->n_tiles; t++) { + write_tile_input(tm, t, act, W); + ane_eval(tm->tiles[t]); + } + } + + // Benchmark (with IO) + int iters = 20; + t0 = mach_absolute_time(); + for (int i = 0; i < iters; i++) { + for (int t = 0; t < tm->n_tiles; t++) { 
+ write_tile_input(tm, t, act, W); + ane_eval(tm->tiles[t]); + read_tile_output(tm, t, out_full); + } + } + double total_ms = tb_ms(mach_absolute_time() - t0); + double per_matmul = total_ms / iters; + double flops = 2.0 * D * D * S; + double gflops = flops / (per_matmul * 1e6); + printf("%-10d %-8d %-10.0fms %-12.3fms %-10.1f\n", + T, tm->n_tiles, compile_ms, per_matmul, gflops); + + for (int t = 0; t < tm->n_tiles; t++) free_kern(tm->tiles[t]); + free(tm->tiles); free(tm); + } + + // === Correctness check: compare with cblas === + printf("\n=== Correctness: dynamic matmul vs cblas_sgemm ===\n"); + { + int T = 768; // full, no tiling + TiledMatmul *tm = compile_tiled_matmul(D, D, T, S); + if (tm) { + write_tile_input(tm, 0, act, W); + ane_eval(tm->tiles[0]); + read_tile_output(tm, 0, out_full); + + // Reference: cblas y = act^T @ W → y[s,oc] = sum_d act[d,s]*W[d,oc] + // act is [D,S] col-major, W is [D,D] row-major + // We want out[oc,s] = sum_d act[d,s] * W[d,oc] + // = W^T @ act where W^T is [D,D] and act is [D,S] → out is [D,S] + float *ref = (float*)calloc(D*S, sizeof(float)); + // out[oc*S+s] = sum_d W[d*D+oc] * act[d*S+s] + // This is: (W^T) @ act in column-major: M=D,N=S,K=D + // cblas: C = alpha*A*B + beta*C + // A=W^T [D×D], B=act [D×S], C=ref [D×S] + cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans, + D, S, D, 1.0f, W, D, act, D, 0.0f, ref, D); + float me = 0; + for (int i = 0; i < D*S; i++) { + float e = fabsf(out_full[i] - ref[i]); + if (e > me) me = e; + } + printf("vs cblas: max_err=%.6f %s\n", me, me < 1.0 ? 
"PASS" : "FAIL"); + free(ref); + for (int t = 0; t < tm->n_tiles; t++) free_kern(tm->tiles[t]); + free(tm->tiles); free(tm); + } + } + + free(act); free(W); free(out_full); + } + + // === Summary for training === + printf("\n=== Summary ===\n"); + printf("Stories110M: 12 layers × 10 matmuls/layer = 120 matmuls/step\n"); + printf("Sizes: Wq/Wk/Wv/Wo [768,768], W1/W3 [2048,768], W2 [768,2048]\n"); + printf("With dynamic weights: compile once, update IOSurface every step\n"); + + printf("\nDone.\n"); + } + return 0; +} diff --git a/training/test_pipeline_unit.c b/training/test_pipeline_unit.c new file mode 100644 index 0000000..dcdd9e5 --- /dev/null +++ b/training/test_pipeline_unit.c @@ -0,0 +1,448 @@ +// test_pipeline_unit.c — Unit tests for pipeline scheduler + checkpoint manager +// Pure C, no ANE dependency. Validates state machine transitions and checkpoint logic. +// Build: cc -O2 -o test_pipeline_unit test_pipeline_unit.c -lm +// Run: ./test_pipeline_unit +#include +#include +#include +#include +#include +#include + +// Stub out mmap/exec dependencies — we only test the pure logic +#define _PIPELINE_SKIP_MMAP 1 + +#include "model_config.h" +#include "gradient_checkpoint.h" + +// ===== Test helpers ===== + +static int tests_run = 0; +static int tests_passed = 0; + +#define TEST(name) do { \ + tests_run++; \ + printf(" %-50s", name); \ +} while(0) + +#define PASS() do { tests_passed++; printf("PASS\n"); } while(0) +#define FAIL(msg) do { printf("FAIL: %s\n", msg); } while(0) + +#define ASSERT_EQ(a, b, msg) do { \ + if ((a) != (b)) { FAIL(msg); printf(" got %d, expected %d\n", (int)(a), (int)(b)); return; } \ +} while(0) + +#define ASSERT_TRUE(cond, msg) do { \ + if (!(cond)) { FAIL(msg); return; } \ +} while(0) + +// ===== model_config.h tests ===== + +static void test_dims_init(void) { + TEST("model_dims_init computes derived fields"); + ModelDims d = {.dim = 768, .n_heads = 12, .n_kv_heads = 12, .seq_len = 256}; + model_dims_init(&d); + ASSERT_EQ(d.head_dim, 
64, "head_dim = dim / n_heads"); + ASSERT_EQ(d.kv_dim, 768, "kv_dim = head_dim * n_kv_heads"); + ASSERT_EQ(d.score_ch, 12 * 256, "score_ch = n_heads * seq_len"); + PASS(); +} + +static void test_stories110m_preset(void) { + TEST("Stories110M preset"); + ModelConfig cfg = model_config_stories110m(); + ASSERT_EQ(cfg.dims.dim, 768, "dim"); + ASSERT_EQ(cfg.dims.n_layers, 12, "n_layers"); + ASSERT_EQ(cfg.dims.n_heads, 12, "n_heads"); + ASSERT_EQ(cfg.compile.compile_budget, 119, "compile_budget"); + ASSERT_TRUE(cfg.compile.headroom_pct > 0.0f, "headroom > 0"); + PASS(); +} + +static void test_llama7b_preset(void) { + TEST("LLaMA-7B preset"); + ModelConfig cfg = model_config_llama_7b(); + ASSERT_EQ(cfg.dims.dim, 4096, "dim"); + ASSERT_EQ(cfg.dims.n_layers, 32, "n_layers"); + ASSERT_EQ(cfg.dims.hidden_dim, 11008, "hidden_dim"); + PASS(); +} + +static void test_layer_memory_nonzero(void) { + TEST("Per-layer memory sizes are nonzero"); + ModelConfig cfg = model_config_stories110m(); + ASSERT_TRUE(layer_weight_bytes(&cfg.dims) > 0, "weight bytes"); + ASSERT_TRUE(layer_adam_bytes(&cfg.dims) > 0, "adam bytes"); + ASSERT_TRUE(layer_activation_bytes(&cfg.dims) > 0, "activation bytes"); + ASSERT_TRUE(layer_gradient_bytes(&cfg.dims) > 0, "gradient bytes"); + ASSERT_TRUE(total_model_bytes(&cfg) > 0, "total model bytes"); + PASS(); +} + +static void test_adam_is_2x_weights(void) { + TEST("Adam state = 2x weight size"); + ModelConfig cfg = model_config_stories110m(); + ASSERT_EQ(layer_adam_bytes(&cfg.dims), 2 * layer_weight_bytes(&cfg.dims), "adam = 2 * weights"); + PASS(); +} + +// ===== Pipeline planning tests ===== + +static void test_max_layers_per_compile(void) { + TEST("max_layers_per_compile respects budget"); + CompileConfig cc = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 0.10f}; + int max = max_layers_per_compile(&cc); + // usable = floor(119 * 0.9) = 107, per_layer = 6, max = 107/6 = 17 + ASSERT_EQ(max, 17, "max layers = 17 for 
budget=119, 6 kernels/layer, 10% headroom"); + PASS(); +} + +static void test_configurable_headroom(void) { + TEST("Configurable headroom changes max layers"); + CompileConfig cc5 = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 0.05f}; + CompileConfig cc20 = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 0.20f}; + int max5 = max_layers_per_compile(&cc5); // floor(119*0.95/6) = 18 + int max20 = max_layers_per_compile(&cc20); // floor(119*0.80/6) = 15 + ASSERT_TRUE(max5 > max20, "5% headroom fits more layers than 20%"); + ASSERT_EQ(max5, 18, "5% headroom: 18 layers"); + ASSERT_EQ(max20, 15, "20% headroom: 15 layers"); + PASS(); +} + +static void test_invalid_headroom_defaults(void) { + TEST("Invalid headroom falls back to 10%"); + CompileConfig cc_neg = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = -0.5f}; + CompileConfig cc_over = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 1.5f}; + CompileConfig cc_def = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 0.10f}; + ASSERT_EQ(max_layers_per_compile(&cc_neg), max_layers_per_compile(&cc_def), + "negative headroom -> default"); + ASSERT_EQ(max_layers_per_compile(&cc_over), max_layers_per_compile(&cc_def), + "headroom > 1.0 -> default"); + PASS(); +} + +static void test_plan_stories110m(void) { + TEST("Stories110M fits in 1 group"); + ModelConfig cfg = model_config_stories110m(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_EQ(plan.n_groups, 1, "1 group"); + ASSERT_EQ(plan.groups[0].start_layer, 0, "starts at 0"); + ASSERT_EQ(plan.groups[0].end_layer, 12, "ends at 12"); + ASSERT_EQ(plan.groups[0].n_layers, 12, "12 layers"); + ASSERT_EQ(plan.groups[0].total_kernels, 72, "72 total kernels"); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_plan_llama7b_multiple_groups(void) { + 
TEST("LLaMA-7B needs multiple groups"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_TRUE(plan.n_groups >= 2, "at least 2 groups for 32 layers"); + // Verify all layers covered + int total_layers = 0; + for (int g = 0; g < plan.n_groups; g++) { + total_layers += plan.groups[g].n_layers; + ASSERT_TRUE(plan.groups[g].n_layers > 0, "no empty groups"); + } + ASSERT_EQ(total_layers, 32, "all 32 layers covered"); + // Verify contiguous + for (int g = 1; g < plan.n_groups; g++) { + ASSERT_EQ(plan.groups[g].start_layer, plan.groups[g-1].end_layer, "contiguous groups"); + } + pipeline_plan_free(&plan); + PASS(); +} + +static void test_plan_kernel_budget(void) { + TEST("No group exceeds compile budget"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + int usable = (int)(cfg.compile.compile_budget * (1.0f - cfg.compile.headroom_pct)); + for (int g = 0; g < plan.n_groups; g++) { + ASSERT_TRUE(plan.groups[g].total_kernels <= usable, + "group kernel count <= usable budget"); + } + pipeline_plan_free(&plan); + PASS(); +} + +// ===== Gradient checkpoint tests ===== + +static void test_ckpt_all_saves_everything(void) { + TEST("CKPT_ALL saves all layers"); + ModelConfig cfg = model_config_stories110m(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_ALL, &cfg, &plan, 0); + ASSERT_EQ(cm.n_checkpointed, 12, "12 layers saved"); + for (int i = 0; i < 12; i++) { + ASSERT_TRUE(checkpoint_should_save(&cm, i), "every layer saved"); + ASSERT_TRUE(!checkpoint_needs_recompute(&cm, i), "no recompute needed"); + } + ASSERT_TRUE(checkpoint_recompute_overhead(&cm) < 0.001, "zero overhead"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_none_saves_minimum(void) { + TEST("CKPT_NONE saves only layer 0"); + ModelConfig cfg = model_config_stories110m(); + PipelinePlan plan = 
compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_NONE, &cfg, &plan, 0); + ASSERT_EQ(cm.n_checkpointed, 1, "only 1 layer saved"); + ASSERT_TRUE(checkpoint_should_save(&cm, 0), "layer 0 saved"); + ASSERT_TRUE(checkpoint_needs_recompute(&cm, 5), "layer 5 needs recompute"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_sqrt_interval(void) { + TEST("CKPT_SQRT uses sqrt(N) interval"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_SQRT, &cfg, &plan, 0); + int expected_interval = (int)sqrtf(32.0f); // 5 + ASSERT_EQ(cm.interval, expected_interval, "interval = sqrt(32) = 5"); + // Layer 0 always saved, then 5, 10, 15, 20, 25, 30, 31 + ASSERT_TRUE(checkpoint_should_save(&cm, 0), "layer 0 saved"); + ASSERT_TRUE(checkpoint_should_save(&cm, 5), "layer 5 saved"); + ASSERT_TRUE(!checkpoint_should_save(&cm, 3), "layer 3 not saved"); + ASSERT_TRUE(checkpoint_should_save(&cm, 31), "last layer saved"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_boundary(void) { + TEST("CKPT_BOUNDARY saves group edges"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_BOUNDARY, &cfg, &plan, 0); + // First layer of each group + last layer overall + for (int g = 0; g < plan.n_groups; g++) { + ASSERT_TRUE(checkpoint_should_save(&cm, plan.groups[g].start_layer), + "group start layer saved"); + } + ASSERT_TRUE(checkpoint_should_save(&cm, 31), "last layer saved"); + // Middle of first group should not be saved + if (plan.groups[0].n_layers > 2) { + int mid = plan.groups[0].start_layer + plan.groups[0].n_layers / 2; + ASSERT_TRUE(checkpoint_needs_recompute(&cm, mid), "mid-group needs recompute"); + } + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void 
test_ckpt_memory_savings(void) { + TEST("Checkpoint memory savings are positive for non-ALL policies"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + + CheckpointManager cm_all = checkpoint_init(CKPT_ALL, &cfg, &plan, 0); + CheckpointManager cm_sqrt = checkpoint_init(CKPT_SQRT, &cfg, &plan, 0); + CheckpointManager cm_none = checkpoint_init(CKPT_NONE, &cfg, &plan, 0); + + size_t saved_sqrt = checkpoint_memory_saved(&cm_sqrt, &cfg.dims); + size_t saved_none = checkpoint_memory_saved(&cm_none, &cfg.dims); + + ASSERT_TRUE(saved_sqrt > 0, "SQRT saves memory"); + ASSERT_TRUE(saved_none > saved_sqrt, "NONE saves more than SQRT"); + ASSERT_EQ(checkpoint_memory_saved(&cm_all, &cfg.dims), 0, "ALL saves nothing"); + + checkpoint_free(&cm_all); + checkpoint_free(&cm_sqrt); + checkpoint_free(&cm_none); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_recompute_depth(void) { + TEST("Recompute depth counts layers from nearest checkpoint"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_SQRT, &cfg, &plan, 0); + // With interval=5: checkpoints at 0, 5, 10, 15, 20, 25, 30, 31 + // Layer 3: nearest saved before = 0, depth = 3 + ASSERT_EQ(checkpoint_recompute_depth(&cm, 3), 3, "depth from layer 0 to 3"); + // Layer 7: nearest saved before = 5, depth = 2 + ASSERT_EQ(checkpoint_recompute_depth(&cm, 7), 2, "depth from layer 5 to 7"); + // Layer 5: nearest saved = 5, depth = 0 + ASSERT_EQ(checkpoint_recompute_depth(&cm, 5), 0, "checkpointed layer = 0 depth"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_out_of_bounds(void) { + TEST("Checkpoint queries handle out-of-bounds gracefully"); + ModelConfig cfg = model_config_stories110m(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_ALL, &cfg, &plan, 0); + 
ASSERT_TRUE(!checkpoint_should_save(&cm, -1), "negative index returns false"); + ASSERT_TRUE(!checkpoint_should_save(&cm, 100), "over-max index returns false"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_every_n_custom_interval(void) { + TEST("CKPT_EVERY_N respects custom_interval parameter"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm3 = checkpoint_init(CKPT_EVERY_N, &cfg, &plan, 3); + CheckpointManager cm8 = checkpoint_init(CKPT_EVERY_N, &cfg, &plan, 8); + ASSERT_EQ(cm3.interval, 3, "interval=3 when custom_interval=3"); + ASSERT_EQ(cm8.interval, 8, "interval=8 when custom_interval=8"); + ASSERT_TRUE(cm3.n_checkpointed > cm8.n_checkpointed, + "shorter interval = more checkpoints"); + // Verify layer 0 and last layer always saved + ASSERT_TRUE(checkpoint_should_save(&cm3, 0), "layer 0 saved (interval=3)"); + ASSERT_TRUE(checkpoint_should_save(&cm3, 31), "last layer saved (interval=3)"); + ASSERT_TRUE(checkpoint_should_save(&cm8, 0), "layer 0 saved (interval=8)"); + ASSERT_TRUE(checkpoint_should_save(&cm8, 31), "last layer saved (interval=8)"); + checkpoint_free(&cm3); + checkpoint_free(&cm8); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_n_checkpointed_accuracy(void) { + TEST("n_checkpointed matches actual is_saved bit count"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointPolicy policies[] = {CKPT_ALL, CKPT_BOUNDARY, CKPT_SQRT, CKPT_EVERY_N, CKPT_NONE}; + for (int p = 0; p < 5; p++) { + CheckpointManager cm = checkpoint_init(policies[p], &cfg, &plan, 0); + int actual = 0; + for (int i = 0; i < cm.n_layers; i++) { + if (cm.is_saved[i]) actual++; + } + ASSERT_EQ(cm.n_checkpointed, actual, "n_checkpointed matches is_saved count"); + checkpoint_free(&cm); + } + pipeline_plan_free(&plan); + PASS(); +} + +static void test_dims_init_zero_heads(void) { + 
TEST("model_dims_init guards divide-by-zero on n_heads=0"); + ModelDims d = {.dim = 768, .n_heads = 0, .n_kv_heads = 0, .seq_len = 256}; + model_dims_init(&d); + ASSERT_EQ(d.head_dim, 0, "head_dim=0 when n_heads=0"); + ASSERT_EQ(d.kv_dim, 0, "kv_dim=0 when n_heads=0"); + ASSERT_EQ(d.score_ch, 0, "score_ch=0 when n_heads=0"); + PASS(); +} + +// ===== FLOP estimation tests ===== + +static void test_flops_nonzero(void) { + TEST("FLOP estimates are nonzero and ANE < total"); + ModelConfig cfg = model_config_stories110m(); + double total = flops_per_step(&cfg); + double ane = ane_flops_per_step(&cfg); + ASSERT_TRUE(total > 0, "total FLOPs > 0"); + ASSERT_TRUE(ane > 0, "ANE FLOPs > 0"); + ASSERT_TRUE(ane < total, "ANE FLOPs < total (dW is on CPU)"); + PASS(); +} + +static void test_flops_scale_with_layers(void) { + TEST("FLOPs scale roughly linearly with layer count"); + ModelConfig cfg12 = model_config_stories110m(); + ModelConfig cfg8 = model_config_stories42m(); + double f12 = flops_per_step(&cfg12); + double f8 = flops_per_step(&cfg8); + // Not exact linear due to different dims, but 12-layer should be >8-layer + ASSERT_TRUE(f12 > f8, "12 layers > 8 layers"); + PASS(); +} + +// ===== Pipeline plan edge cases ===== + +static void test_plan_single_layer(void) { + TEST("Single-layer model = 1 group"); + ModelConfig cfg = model_config_stories110m(); + cfg.dims.n_layers = 1; + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_EQ(plan.n_groups, 1, "1 group"); + ASSERT_EQ(plan.groups[0].n_layers, 1, "1 layer in group"); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_plan_exact_budget_fit(void) { + TEST("Layers that exactly fill budget = 1 group"); + ModelConfig cfg = model_config_stories110m(); + // 17 layers * 6 kernels = 102 <= 107 usable (10% headroom on 119) + cfg.dims.n_layers = 17; + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_EQ(plan.n_groups, 1, "17 layers fit in 1 group"); + pipeline_plan_free(&plan); + PASS(); +} + +static void 
test_plan_one_over_budget(void) { + TEST("One layer over budget = 2 groups"); + ModelConfig cfg = model_config_stories110m(); + // 18 layers * 6 kernels = 108 > 107 usable -> 2 groups + cfg.dims.n_layers = 18; + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_EQ(plan.n_groups, 2, "18 layers = 2 groups"); + int total = plan.groups[0].n_layers + plan.groups[1].n_layers; + ASSERT_EQ(total, 18, "all layers covered"); + pipeline_plan_free(&plan); + PASS(); +} + +// ===== Main ===== + +int main(void) { + printf("=== Pipeline Unit Tests ===\n\n"); + + printf("[model_config.h]\n"); + test_dims_init(); + test_stories110m_preset(); + test_llama7b_preset(); + test_layer_memory_nonzero(); + test_adam_is_2x_weights(); + + printf("\n[pipeline planning]\n"); + test_max_layers_per_compile(); + test_configurable_headroom(); + test_invalid_headroom_defaults(); + test_plan_stories110m(); + test_plan_llama7b_multiple_groups(); + test_plan_kernel_budget(); + test_plan_single_layer(); + test_plan_exact_budget_fit(); + test_plan_one_over_budget(); + + printf("\n[gradient_checkpoint.h]\n"); + test_ckpt_all_saves_everything(); + test_ckpt_none_saves_minimum(); + test_ckpt_sqrt_interval(); + test_ckpt_boundary(); + test_ckpt_memory_savings(); + test_ckpt_recompute_depth(); + test_ckpt_out_of_bounds(); + test_ckpt_every_n_custom_interval(); + test_ckpt_n_checkpointed_accuracy(); + test_dims_init_zero_heads(); + + printf("\n[FLOP estimation]\n"); + test_flops_nonzero(); + test_flops_scale_with_layers(); + + printf("\n=== Results: %d/%d passed ===\n", tests_passed, tests_run); + return (tests_passed == tests_run) ? 
0 : 1; +} diff --git a/training/test_rmsnorm_bwd.m b/training/test_rmsnorm_bwd.m new file mode 100644 index 0000000..9014e53 --- /dev/null +++ b/training/test_rmsnorm_bwd.m @@ -0,0 +1,123 @@ +// test_rmsnorm_bwd.m — Test RMSNorm backward ANE kernel vs CPU reference +// Build: xcrun clang -O2 -framework Foundation -framework IOSurface \ +// -framework CoreML -framework Accelerate -ldl -lobjc \ +// -o test_rmsnorm_bwd test_rmsnorm_bwd.m +#include "ane_rmsnorm_bwd.h" +#include "stories_cpu_ops.h" + +int main(void) { + @autoreleasepool { + setbuf(stdout, NULL); + ane_init(); + mach_timebase_info(&g_tb); + + printf("=== Test: RMSNorm Backward on ANE ===\n"); + printf("DIM=%d SEQ=%d\n\n", DIM, SEQ); + + // Allocate test data + float *x = (float*)malloc(DIM * SEQ * 4); + float *dy = (float*)malloc(DIM * SEQ * 4); + float *w = (float*)malloc(DIM * 4); + float *dx_cpu = (float*)calloc(DIM * SEQ, 4); + float *dw_cpu = (float*)calloc(DIM, 4); + float *dx_ane = (float*)malloc(DIM * SEQ * 4); + + // Random init (channel-first [DIM, SEQ]) + srand48(42); + for (int i = 0; i < DIM * SEQ; i++) { + x[i] = (float)(drand48() * 2 - 1) * 0.5f; + dy[i] = (float)(drand48() * 2 - 1) * 0.1f; + } + for (int i = 0; i < DIM; i++) { + w[i] = (float)(drand48() * 0.5 + 0.75); // close to 1.0 + } + + // === CPU Reference === + uint64_t t0 = mach_absolute_time(); + rmsnorm_bwd(dx_cpu, dw_cpu, dy, x, w, DIM, SEQ); + uint64_t t1 = mach_absolute_time(); + printf("CPU rmsnorm_bwd: %.2f ms\n", tb_ms(t1 - t0)); + + // === ANE Kernel === + printf("Compiling ANE rmsnorm_bwd kernel...\n"); + NSString *mil = gen_rmsnorm_bwd(); + + // Build weight blob for RMSNorm weights + NSData *rms_blob = build_blob(w, 1, DIM); + + int in_bytes = 2 * DIM * SEQ * 2; // concat(dy, x) in fp16 + int out_bytes = DIM * SEQ * 2; // dx in fp16 + + Kern *kern = compile_kern_mil_w(mil, (@{ + @"@model_path/weights/rms_w.bin": @{@"offset":@0, @"data":rms_blob}, + }), in_bytes, out_bytes); + + if (!kern) { + printf("FAIL: ANE kernel 
compilation failed!\n"); + return 1; + } + printf("Compile OK (compiles=%d)\n", g_compile_count); + + // Write input: concat(dy, x) into ioIn + // dy goes at channel offset 0, x goes at channel offset DIM + io_write_fp16_at(kern->ioIn, 0, dy, DIM, SEQ); + io_write_fp16_at(kern->ioIn, DIM, x, DIM, SEQ); + + // Evaluate + t0 = mach_absolute_time(); + ane_eval(kern); + t1 = mach_absolute_time(); + printf("ANE eval: %.3f ms\n", tb_ms(t1 - t0)); + + // Read output + io_read_fp16(kern->ioOut, dx_ane, 0, DIM, SEQ); + + // === Compare === + float max_err = 0, sum_err = 0; + int max_i = 0, max_j = 0; + for (int i = 0; i < DIM; i++) { + for (int j = 0; j < SEQ; j++) { + int idx = i * SEQ + j; + float err = fabsf(dx_cpu[idx] - dx_ane[idx]); + sum_err += err; + if (err > max_err) { + max_err = err; + max_i = i; max_j = j; + } + } + } + float mean_err = sum_err / (DIM * SEQ); + + printf("\n=== Results ===\n"); + printf("Max absolute error: %.6f at [%d,%d] (CPU=%.6f ANE=%.6f)\n", + max_err, max_i, max_j, dx_cpu[max_i*SEQ+max_j], dx_ane[max_i*SEQ+max_j]); + printf("Mean absolute error: %.6f\n", mean_err); + + // Sample outputs + printf("\nSample dx values (first 4 channels, first 4 positions):\n"); + printf("%-6s %-12s %-12s %-10s\n", "Idx", "CPU", "ANE", "Error"); + for (int i = 0; i < 4 && i < DIM; i++) { + for (int j = 0; j < 4 && j < SEQ; j++) { + int idx = i * SEQ + j; + printf("[%d,%d] %-12.6f %-12.6f %-10.6f\n", + i, j, dx_cpu[idx], dx_ane[idx], fabsf(dx_cpu[idx] - dx_ane[idx])); + } + } + + // Benchmark: multiple evals + int N = 100; + t0 = mach_absolute_time(); + for (int i = 0; i < N; i++) ane_eval(kern); + t1 = mach_absolute_time(); + printf("\nBenchmark: %d evals in %.2f ms (%.3f ms/eval)\n", + N, tb_ms(t1-t0), tb_ms(t1-t0)/N); + + // Pass/fail + bool pass = max_err < 0.05f && mean_err < 0.01f; + printf("\n%s (threshold: max<0.05, mean<0.01)\n", pass ? 
"PASS ✅" : "FAIL ❌"); + + free_kern(kern); + free(x); free(dy); free(w); free(dx_cpu); free(dw_cpu); free(dx_ane); + return pass ? 0 : 1; + } +} diff --git a/training/test_weight_patch.m b/training/test_weight_patch.m new file mode 100644 index 0000000..13473b7 --- /dev/null +++ b/training/test_weight_patch.m @@ -0,0 +1,450 @@ +// test_weight_patch.m — Test whether ANE weights can be patched after compile +#import +#import +#import +#import +#import +#import +#import +#import +#include +#include + +#include "stories_io.h" + +// MIL: fp32 in → cast fp16 → conv → cast fp32 out (matches inmem_peak.m pattern) +static NSString *gen_conv_mil(int ic, int oc, int sp) { + NSMutableString *m = [NSMutableString string]; + [m appendString:@"program(1.3)\n" + "[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, " + "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, " + "{\"coremltools-version\", \"9.0\"}})]\n{\n"]; + [m appendFormat:@" func main(tensor x) {\n", ic, sp]; + [m appendString: + @" string pt = const()[name = string(\"pt\"), val = string(\"valid\")];\n" + " tensor st = const()[name = string(\"st\"), val = tensor([1, 1])];\n" + " tensor pd = const()[name = string(\"pd\"), val = tensor([0, 0, 0, 0])];\n" + " tensor dl = const()[name = string(\"dl\"), val = tensor([1, 1])];\n" + " int32 gr = const()[name = string(\"gr\"), val = int32(1)];\n" + " string to16 = const()[name = string(\"to16\"), val = string(\"fp16\")];\n"]; + [m appendFormat:@" tensor xh = cast(dtype = to16, x = x)[name = string(\"cast_in\")];\n", ic, sp]; + [m appendFormat:@" tensor W = const()[name = string(\"W\"), " + "val = tensor(BLOBFILE(path = string(\"@model_path/weights/w.bin\"), offset = uint64(64)))];\n", + oc, ic, oc, ic]; + [m appendFormat:@" tensor yh = conv(dilations = dl, groups = gr, pad = pd, pad_type = pt, strides = st, weight = W, x = xh)" + "[name = string(\"conv\")];\n", oc, sp]; + [m appendString:@" string to32 = const()[name = 
string(\"to32\"), val = string(\"fp32\")];\n"]; + [m appendFormat:@" tensor y = cast(dtype = to32, x = yh)[name = string(\"cast_out\")];\n", oc, sp]; + [m appendString:@" } -> (y);\n}\n"]; + return m; +} + +int main(int argc, char **argv) { + @autoreleasepool { + mach_timebase_info(&g_tb); + ane_init(); + + int IC = 256, OC = 256, SP = 64; + int io_bytes = IC * SP * 4; // fp32 + + // Identity weight + float *W_id = (float*)calloc(OC*IC, sizeof(float)); + for (int i = 0; i < IC; i++) W_id[i*IC+i] = 1.0f; + + NSString *mil = gen_conv_mil(IC, OC, SP); + NSDictionary *wd = @{@"@model_path/weights/w.bin": @{@"offset":@0, @"data":build_blob(W_id, OC, IC)}}; + + printf("=== Compiling conv %dx%d sp=%d ===\n", OC, IC, SP); + Kern *k = compile_kern_mil_w(mil, wd, io_bytes, io_bytes); + if (!k) { printf("COMPILE FAILED\n"); free(W_id); return 1; } + printf("Compile OK!\n"); + + // Write fp32 input + IOSurfaceLock(k->ioIn, 0, NULL); + float *inp = (float*)IOSurfaceGetBaseAddress(k->ioIn); + for (int i = 0; i < IC*SP; i++) inp[i] = (i % 100) * 0.01f; + IOSurfaceUnlock(k->ioIn, 0, NULL); + + // Eval with identity + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + float *out = (float*)IOSurfaceGetBaseAddress(k->ioOut); + printf("In: [%.3f, %.3f, %.3f, %.3f]\n", inp[0], inp[1], inp[2], inp[3]); + printf("Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float max_err = 0; + for (int i = 0; i < OC*SP; i++) { + float err = fabsf(out[i] - inp[i]); + if (err > max_err) max_err = err; + } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Identity max_err=%.6f %s\n\n", max_err, max_err < 0.1 ? 
"PASS" : "FAIL"); + + // === Approach 1: Patch weight on disk, unload+reload === + printf("=== Approach 1: Disk patch + unload/reload ===\n"); + float *W_2x = (float*)calloc(OC*IC, sizeof(float)); + for (int i = 0; i < IC; i++) W_2x[i*IC+i] = 2.0f; + [build_blob(W_2x, OC, IC) writeToFile: + [(__bridge NSString*)k->tmpDir stringByAppendingPathComponent:@"weights/w.bin"] atomically:YES]; + + id mdl = (__bridge id)k->model; + NSError *e = nil; + ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)(mdl, @selector(unloadWithQoS:error:), 21, &e); + e = nil; + BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e); + printf("Reload: %s\n", ok?"OK":"FAIL"); + if (ok) { + // Re-create request after reload + id wI = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), k->ioIn); + id wO = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), k->ioOut); + CFRelease(k->request); + k->request = (void*)CFBridgingRetain(((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR, + @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), + @[wI], @[@0], @[wO], @[@0], nil, nil, @0)); + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float sr = 0; int cnt = 0; + for (int i = 0; i < OC*SP; i++) + if (fabsf(inp[i]) > 0.01f) { sr += out[i]/inp[i]; cnt++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Ratio: %.3f (2.0=patched, 1.0=cached)\n\n", cnt>0?sr/cnt:0); + } + + // === Approach 2: Memory scan === + printf("=== Approach 2: Memory scan ===\n"); + uint16_t pat1[8] = {0x3C00, 0, 0, 0, 0, 0, 0, 0}; + uint16_t pat2[8] = {0x4000, 0, 0, 0, 0, 0, 0, 0}; + mach_port_t task = mach_task_self(); + vm_address_t addr = 0; vm_size_t sz; natural_t depth = 1; + int f1 = 0, f2 = 0; + while (1) { 
+ struct vm_region_submap_info_64 info; + mach_msg_type_number_t count = VM_REGION_SUBMAP_INFO_COUNT_64; + if (vm_region_recurse_64(task, &addr, &sz, &depth, (vm_region_recurse_info_t)&info, &count) != KERN_SUCCESS) break; + if (info.is_submap) { depth++; continue; } + if (!(info.protection & VM_PROT_READ) || sz < (size_t)(OC*IC*2)) { addr += sz; continue; } + uint8_t *base = (uint8_t*)addr; + for (size_t off = 0; off + OC*IC*2 <= sz; off += 2) { + int w = 0; + if (memcmp(base+off, pat1, 16) == 0) w = 1; + else if (memcmp(base+off, pat2, 16) == 0) w = 2; + if (!w) continue; + uint16_t *p = (uint16_t*)(base+off), diag = (w==1)?0x3C00:0x4000; + int ok2 = 1; + for (int r = 0; r < OC && ok2; r++) + for (int c = 0; c < IC && ok2; c++) + if (p[r*IC+c] != ((r==c)?diag:0)) ok2 = 0; + if (!ok2) continue; + if (w==1) f1++; else f2++; + printf(" FOUND %dx @%p prot=%d/%d %s\n", w, (void*)(addr+off), + info.protection, info.max_protection, (info.protection&VM_PROT_WRITE)?"WR":"RO"); + } + addr += sz; + } + printf("Found: 1x=%d 2x=%d\n", f1, f2); + + // Now patch ALL found weight patterns to 3× and re-eval + if (f1 > 0 || f2 > 0) { + printf("Patching all found patterns to 3x identity...\n"); + addr = 0; depth = 1; + while (1) { + struct vm_region_submap_info_64 info2; + mach_msg_type_number_t count2 = VM_REGION_SUBMAP_INFO_COUNT_64; + if (vm_region_recurse_64(task, &addr, &sz, &depth, (vm_region_recurse_info_t)&info2, &count2) != KERN_SUCCESS) break; + if (info2.is_submap) { depth++; continue; } + if (!(info2.protection & VM_PROT_READ) || sz < (size_t)(OC*IC*2)) { addr += sz; continue; } + uint8_t *base2 = (uint8_t*)addr; + for (size_t off = 0; off + OC*IC*2 <= sz; off += 2) { + int w2 = 0; + if (memcmp(base2+off, pat1, 16) == 0) w2 = 1; + else if (memcmp(base2+off, pat2, 16) == 0) w2 = 2; + if (!w2) continue; + uint16_t *p2 = (uint16_t*)(base2+off), diag2 = (w2==1)?0x3C00:0x4000; + int ok3 = 1; + for (int r = 0; r < OC && ok3; r++) + for (int c = 0; c < IC && ok3; c++) + if 
(p2[r*IC+c] != ((r==c)?diag2:0)) ok3 = 0; + if (!ok3) continue; + if (info2.protection & VM_PROT_WRITE) { + printf(" Patching %dx @%p to 3x\n", w2, (void*)(addr+off)); + for (int r = 0; r < OC; r++) + for (int c = 0; c < IC; c++) + p2[r*IC+c] = (r==c) ? 0x4200 : 0; // fp16(3.0) + } + } + addr += sz; + } + + printf("\n=== Eval after memory patch (expect 3x) ===\n"); + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float sr2 = 0; int cnt2 = 0; + for (int i = 0; i < OC*SP; i++) + if (fabsf(inp[i]) > 0.01f) { sr2 += out[i]/inp[i]; cnt2++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Ratio: %.3f (3.0=mem patch works!, 1.0=ANE uses SRAM copy)\n", cnt2>0?sr2/cnt2:0); + } + printf("\n"); + + // === Approach 3: Explore classes === + printf("=== ANE classes ===\n"); + const char *cn[] = {"_ANEWeight", "_ANEProgramForEvaluation", "_ANEChainingRequest", NULL}; + for (int i = 0; cn[i]; i++) { + Class cls = NSClassFromString([NSString stringWithUTF8String:cn[i]]); + if (!cls) { printf("%s: NOT FOUND\n", cn[i]); continue; } + printf("%s:\n", cn[i]); + unsigned int mc = 0; Method *ms = class_copyMethodList(cls, &mc); + for (unsigned j = 0; j < mc; j++) printf(" - %s\n", sel_getName(method_getName(ms[j]))); + free(ms); + mc = 0; ms = class_copyMethodList(object_getClass(cls), &mc); + for (unsigned j = 0; j < mc; j++) printf(" + %s\n", sel_getName(method_getName(ms[j]))); + free(ms); printf("\n"); + } + @try { printf("programHandle: %s\n", [[[mdl valueForKey:@"programHandle"] description] UTF8String]); } @catch(id x) {} + @try { printf("intermediateBufferHandle: %s\n", [[[mdl valueForKey:@"intermediateBufferHandle"] description] UTF8String]); } @catch(id x) {} + + // === Approach 4: _ANEWeight + updateWeightURL === + printf("\n=== Approach 4: _ANEWeight API ===\n"); + Class AW = NSClassFromString(@"_ANEWeight"); + if (AW) { + // Write 5× identity weights to 
a new file + float *W_5x = (float*)calloc(OC*IC, sizeof(float)); + for (int i = 0; i < IC; i++) W_5x[i*IC+i] = 5.0f; + NSString *wpath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"patched_w.bin"]; + [build_blob(W_5x, OC, IC) writeToFile:wpath atomically:YES]; + free(W_5x); + + NSURL *wurl = [NSURL fileURLWithPath:wpath]; + id wobj = ((id(*)(Class,SEL,id,id))objc_msgSend)(AW, + @selector(weightWithSymbolAndURL:weightURL:), @"W", wurl); + printf(" _ANEWeight: %s\n", wobj ? [[wobj description] UTF8String] : "nil"); + if (wobj) { + printf(" weightSymbol: %s\n", [((id(*)(id,SEL))objc_msgSend)(wobj, @selector(weightSymbol)) UTF8String]); + printf(" weightURL: %s\n", [[((id(*)(id,SEL))objc_msgSend)(wobj, @selector(weightURL)) description] UTF8String]); + } + + // Try to pass as weightsBuffer in request + printf("\n Trying weightsBuffer in request...\n"); + id wI2 = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), k->ioIn); + id wO2 = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), k->ioOut); + + // Try passing weight array as weightsBuffer + if (wobj) { + CFRelease(k->request); + k->request = (void*)CFBridgingRetain(((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR, + @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), + @[wI2], @[@0], @[wO2], @[@0], @[wobj], nil, @0)); + printf(" Request with weightsBuffer created\n"); + @try { + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf(" Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float sr3 = 0; int cnt3 = 0; + for (int i2 = 0; i2 < OC*SP; i2++) + if (fabsf(inp[i2]) > 0.01f) { sr3 += out[i2]/inp[i2]; cnt3++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf(" Ratio: %.3f (5.0=weightsBuffer works!)\n", cnt3>0?sr3/cnt3:0); + } @catch(NSException *ex) { + printf(" Eval exception: %s\n", [[ex description] 
UTF8String]); + } + } + + // Also try IOSurface as weightsBuffer + printf("\n Trying IOSurface as weightsBuffer...\n"); + IOSurfaceRef wSurf = make_surface(OC*IC*2); // fp16 weights + IOSurfaceLock(wSurf, 0, NULL); + _Float16 *wfp16 = (_Float16*)IOSurfaceGetBaseAddress(wSurf); + for (int r = 0; r < OC; r++) + for (int c2 = 0; c2 < IC; c2++) + wfp16[r*IC+c2] = (r==c2) ? (_Float16)7.0f : (_Float16)0.0f; // 7× identity + IOSurfaceUnlock(wSurf, 0, NULL); + id wSurfObj = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), wSurf); + CFRelease(k->request); + k->request = (void*)CFBridgingRetain(((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR, + @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), + @[wI2], @[@0], @[wO2], @[@0], wSurfObj, nil, @0)); + @try { + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf(" Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float sr4 = 0; int cnt4 = 0; + for (int i3 = 0; i3 < OC*SP; i3++) + if (fabsf(inp[i3]) > 0.01f) { sr4 += out[i3]/inp[i3]; cnt4++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf(" Ratio: %.3f (7.0=IOSurface weights work!)\n", cnt4>0?sr4/cnt4:0); + } @catch(NSException *ex) { + printf(" Eval exception: %s\n", [[ex description] UTF8String]); + } + CFRelease(wSurf); + } + + // === Approach 5: Weights packed into input IOSurface (fp16 with cast) === + printf("\n=== Approach 5: Dynamic weights via input IOSurface ===\n"); + // Element-wise mul: x * w where both come from input + // Input [1, IC*2, 1, SP] fp32 → cast fp16 → slice → mul → cast fp32 + { + int C5 = IC; + NSMutableString *m5 = [NSMutableString string]; + [m5 appendString:@"program(1.3)\n" + "[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, " + "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, " + "{\"coremltools-version\", \"9.0\"}})]\n{\n"]; + [m5 
appendFormat:@" func main(tensor x) {\n", C5*2, SP]; + [m5 appendString:@" string to16 = const()[name = string(\"to16\"), val = string(\"fp16\")];\n"]; + [m5 appendFormat:@" tensor xh = cast(dtype = to16, x = x)[name = string(\"cin\")];\n", C5*2, SP]; + [m5 appendFormat:@" tensor b0 = const()[name = string(\"b0\"), val = tensor([0,0,0,0])];\n"]; + [m5 appendFormat:@" tensor s0 = const()[name = string(\"s0\"), val = tensor([1,%d,1,%d])];\n", C5, SP]; + [m5 appendFormat:@" tensor data = slice_by_size(x=xh,begin=b0,size=s0)[name=string(\"data\")];\n", C5, SP]; + [m5 appendFormat:@" tensor b1 = const()[name = string(\"b1\"), val = tensor([0,%d,0,0])];\n", C5]; + [m5 appendFormat:@" tensor wt = slice_by_size(x=xh,begin=b1,size=s0)[name=string(\"wt\")];\n", C5, SP]; + [m5 appendFormat:@" tensor yh = mul(x=data,y=wt)[name=string(\"mul\")];\n", C5, SP]; + [m5 appendString:@" string to32 = const()[name = string(\"to32\"), val = string(\"fp32\")];\n"]; + [m5 appendFormat:@" tensor y = cast(dtype = to32, x = yh)[name = string(\"cout\")];\n", C5, SP]; + [m5 appendString:@" } -> (y);\n}\n"]; + + int io5_in = C5*2*SP*4; + int io5_out = C5*SP*4; + Kern *k5 = compile_kern_mil_w(m5, @{}, io5_in, io5_out); + if (k5) { + printf("Compile OK!\n"); + IOSurfaceLock(k5->ioIn, 0, NULL); + float *in5 = (float*)IOSurfaceGetBaseAddress(k5->ioIn); + for (int i = 0; i < C5*SP; i++) in5[i] = (i%100)*0.01f; + for (int i = 0; i < C5*SP; i++) in5[C5*SP+i] = 2.0f; + IOSurfaceUnlock(k5->ioIn, 0, NULL); + ane_eval(k5); + IOSurfaceLock(k5->ioOut, kIOSurfaceLockReadOnly, NULL); + float *out5 = (float*)IOSurfaceGetBaseAddress(k5->ioOut); + printf("data=[%.3f,%.3f,%.3f], w=2.0 → out=[%.3f,%.3f,%.3f]\n", + in5[0],in5[1],in5[2], out5[0],out5[1],out5[2]); + IOSurfaceUnlock(k5->ioOut, kIOSurfaceLockReadOnly, NULL); + + // Change weight dynamically — NO recompile! 
+ IOSurfaceLock(k5->ioIn, 0, NULL); + for (int i = 0; i < C5*SP; i++) in5[C5*SP+i] = 5.0f; + IOSurfaceUnlock(k5->ioIn, 0, NULL); + ane_eval(k5); + IOSurfaceLock(k5->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("w=5.0 → out=[%.3f,%.3f,%.3f] (expect 5×)\n", out5[0],out5[1],out5[2]); + IOSurfaceUnlock(k5->ioOut, kIOSurfaceLockReadOnly, NULL); + free_kern(k5); + } else printf("Compile FAILED\n"); + } + + // === Approach 6: matmul with dynamic weights from input === + printf("\n=== Approach 6: matmul with dynamic W from input ===\n"); + // Pack x[1,D,S,1] and W[1,D,1,D] into input, then reshape+matmul + // Input shape: [1, D+D*D, 1, S] — first D channels=activations, rest=weight matrix flattened + // Actually, matmul needs [1,H,S,D] shapes. Let's try: + // Input: [1, D*(S+D), 1, 1] reshaped as needed + // Simpler: just test matmul with two sliced inputs + { + int D6 = 64, S6 = 64; // small for test + // Input: [1, D6+D6, S6, D6] — but that's 4D... + // Actually ANE matmul works on [1,H,M,K] @ [1,H,K,N] → [1,H,M,N] + // Let's pack x[1,1,S6,D6] and W[1,1,D6,D6] into [1,2,S6,D6] + // Then slice → matmul + NSMutableString *m6 = [NSMutableString string]; + [m6 appendString:@"program(1.3)\n" + "[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, " + "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, " + "{\"coremltools-version\", \"9.0\"}})]\n{\n"]; + // Input: [1, D6+D6, 1, S6*D6] — flatten everything, then reshape + // Actually simplest: two separate regions in channel dim + // x_data: [1, D6, 1, S6] and W: [1, D6*D6, 1, 1] + // Total input channels: D6 + D6*D6 + int total_ch = D6 + D6*D6; + [m6 appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", total_ch, S6]; + [m6 appendString:@" string to16 = const()[name = string(\"to16\"), val = string(\"fp16\")];\n"]; + [m6 appendFormat:@" tensor<fp16, [1, %d, 1, %d]> xh = cast(dtype = to16, x = x)[name = string(\"cin\")];\n", total_ch, S6]; + // Slice activations: [1, D6, 1, S6] + [m6 appendFormat:@" tensor<int32, [4]> b0 = 
const()[name = string(\"b0\"), val = tensor([0,0,0,0])];\n"]; + [m6 appendFormat:@" tensor sa = const()[name = string(\"sa\"), val = tensor([1,%d,1,%d])];\n", D6, S6]; + [m6 appendFormat:@" tensor act = slice_by_size(x=xh,begin=b0,size=sa)[name=string(\"act\")];\n", D6, S6]; + // Slice weight: [1, D6*D6, 1, S6] but we only need [D6, D6] → reshape + [m6 appendFormat:@" tensor bw = const()[name = string(\"bw\"), val = tensor([0,%d,0,0])];\n", D6]; + [m6 appendFormat:@" tensor sw = const()[name = string(\"sw\"), val = tensor([1,%d,1,%d])];\n", D6*D6, S6]; + [m6 appendFormat:@" tensor wf = slice_by_size(x=xh,begin=bw,size=sw)[name=string(\"wf\")];\n", D6*D6, S6]; + // Reshape weight to [1, D6, D6, S6] for matmul-like operation + // Actually for conv: weight needs to be [OC, IC, 1, 1] const. Can't use dynamic weight with conv. + // For matmul: need [1, 1, D6, D6] or similar + // Let's try: reshape wf to [1, D6, D6, S6], take first slice [:,:,:,0] → no, that's hard + // Simpler: reshape to [D6, D6] and use matmul + // But matmul expects specific ranks... 
let me try: + [m6 appendFormat:@" tensor ws = const()[name = string(\"ws\"), val = tensor([1, 1, %d, %d])];\n", D6, D6]; + // Only take first column of wf to get [1, D6*D6, 1, 1] + [m6 appendFormat:@" tensor sw1 = const()[name = string(\"sw1\"), val = tensor([1,%d,1,1])];\n", D6*D6]; + [m6 appendFormat:@" tensor wf1 = slice_by_size(x=wf,begin=b0,size=sw1)[name=string(\"wf1\")];\n", D6*D6]; + [m6 appendFormat:@" tensor W = reshape(shape=ws,x=wf1)[name=string(\"W\")];\n", D6, D6]; + // Reshape act to [1, 1, S6, D6] for matmul + [m6 appendFormat:@" tensor as2 = const()[name = string(\"as2\"), val = tensor([1, 1, %d, %d])];\n", D6, S6]; + [m6 appendFormat:@" tensor pm = const()[name = string(\"pm\"), val = tensor([0, 1, 3, 2])];\n"]; + [m6 appendFormat:@" tensor a2 = reshape(shape=as2,x=act)[name=string(\"a2\")];\n", D6, S6]; + [m6 appendFormat:@" tensor a3 = transpose(perm=pm,x=a2)[name=string(\"a3\")];\n", S6, D6]; + // matmul: [1,1,S6,D6] @ [1,1,D6,D6] → [1,1,S6,D6] + [m6 appendString:@" bool bF = const()[name = string(\"bF\"), val = bool(false)];\n"]; + [m6 appendFormat:@" tensor yh = matmul(transpose_x = bF, transpose_y = bF, x = a3, y = W)[name = string(\"mm\")];\n", S6, D6]; + // Reshape back to [1, D6, 1, S6] + [m6 appendFormat:@" tensor yt = transpose(perm=pm,x=yh)[name=string(\"yt\")];\n", D6, S6]; + [m6 appendFormat:@" tensor os = const()[name = string(\"os\"), val = tensor([1,%d,1,%d])];\n", D6, S6]; + [m6 appendFormat:@" tensor yr = reshape(shape=os,x=yt)[name=string(\"yr\")];\n", D6, S6]; + [m6 appendString:@" string to32 = const()[name = string(\"to32\"), val = string(\"fp32\")];\n"]; + [m6 appendFormat:@" tensor y = cast(dtype = to32, x = yr)[name = string(\"cout\")];\n", D6, S6]; + [m6 appendString:@" } -> (y);\n}\n"]; + + int io6_in = total_ch * S6 * 4; + int io6_out = D6 * S6 * 4; + Kern *k6 = compile_kern_mil_w(m6, @{}, io6_in, io6_out); + if (k6) { + printf("Dynamic matmul compile OK!\n"); + // Set up: identity W, ramp input + 
IOSurfaceLock(k6->ioIn, 0, NULL); + float *in6 = (float*)IOSurfaceGetBaseAddress(k6->ioIn); + memset(in6, 0, io6_in); + // Activations: [D6, S6] in channel-first layout + for (int d = 0; d < D6; d++) + for (int s = 0; s < S6; s++) + in6[d*S6+s] = (d*S6+s) * 0.001f; + // Weight: identity matrix [D6, D6] packed in channels D6..D6+D6*D6, only col 0 + float *wbase = in6 + D6*S6; + for (int r = 0; r < D6; r++) + for (int c = 0; c < D6; c++) + wbase[(r*D6+c)*S6] = (r==c) ? 1.0f : 0.0f; // only sp=0 matters + IOSurfaceUnlock(k6->ioIn, 0, NULL); + + ane_eval(k6); + IOSurfaceLock(k6->ioOut, kIOSurfaceLockReadOnly, NULL); + float *out6 = (float*)IOSurfaceGetBaseAddress(k6->ioOut); + printf("Identity W: in=[%.4f,%.4f,%.4f] out=[%.4f,%.4f,%.4f]\n", + in6[0],in6[1],in6[2], out6[0],out6[1],out6[2]); + + // Check + float me6 = 0; + for (int i = 0; i < D6*S6; i++) { + float e6 = fabsf(out6[i] - in6[i]); + if (e6 > me6) me6 = e6; + } + IOSurfaceUnlock(k6->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("max_err=%.6f %s\n", me6, me6 < 0.1 ? "PASS" : "FAIL"); + + // Now: 2× identity — just change the IOSurface weight, no recompile! + IOSurfaceLock(k6->ioIn, 0, NULL); + for (int r = 0; r < D6; r++) + for (int c = 0; c < D6; c++) + wbase[(r*D6+c)*S6] = (r==c) ? 
2.0f : 0.0f; + IOSurfaceUnlock(k6->ioIn, 0, NULL); + ane_eval(k6); + IOSurfaceLock(k6->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("2× W: in=[%.4f,%.4f] out=[%.4f,%.4f] (expect 2×)\n", + in6[0],in6[1], out6[0],out6[1]); + IOSurfaceUnlock(k6->ioOut, kIOSurfaceLockReadOnly, NULL); + free_kern(k6); + } else printf("Dynamic matmul compile FAILED\n"); + } + + free_kern(k); free(W_id); free(W_2x); + printf("\nDone.\n"); + } + return 0; +} diff --git a/training/tiny_train.m b/training/tiny_train.m index e1e9d7d..0449dba 100644 --- a/training/tiny_train.m +++ b/training/tiny_train.m @@ -139,7 +139,7 @@ static void free_kern(Kern *k) { free(k); } -static void ane_eval_k(Kern *k, const float *in, float *out, int in_ch, int out_ch, int sp) { +static bool ane_eval_k(Kern *k, const float *in, float *out, int in_ch, int out_ch, int sp) { float *tmp = (float*)malloc(in_ch * sp * sizeof(float)); for (int t = 0; t < sp; t++) for (int c = 0; c < in_ch; c++) @@ -151,8 +151,13 @@ static void ane_eval_k(Kern *k, const float *in, float *out, int in_ch, int out_ NSError *e = nil; id mdl = (__bridge id)k->model; id req = (__bridge id)k->request; - ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); + if (!ok) { + fprintf(stderr, "ANE eval failed: %s\n", + e ? 
[[e description] UTF8String] : "unknown error"); + return false; + } float *tmp2 = (float*)malloc(out_ch * sp * sizeof(float)); IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); memcpy(tmp2, IOSurfaceGetBaseAddress(k->ioOut), out_ch * sp * sizeof(float)); @@ -161,6 +166,7 @@ static void ane_eval_k(Kern *k, const float *in, float *out, int in_ch, int out_ for (int c = 0; c < out_ch; c++) out[t*out_ch + c] = tmp2[c*sp + t]; free(tmp2); + return true; } // === Checkpoint: save/restore training state for exec() restart === @@ -179,21 +185,25 @@ static void save_checkpoint(const char *path, int step, float loss, int D, int H, int S, int total_steps, float lr, const float *W1, const float *W2, double cc, double ct, double cw, int cs, int cb) { - FILE *f = fopen(path, "wb"); + char tmp_path[512]; + snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", path); + FILE *f = fopen(tmp_path, "wb"); + if (!f) { fprintf(stderr, "Failed to open %s for checkpoint\n", tmp_path); return; } CkptHeader hdr = {step, loss, D, H, S, total_steps, lr, cc, ct, cw, cs, cb}; fwrite(&hdr, sizeof(hdr), 1, f); fwrite(W1, sizeof(float), H * D, f); fwrite(W2, sizeof(float), D * H, f); fclose(f); + rename(tmp_path, path); // atomic on POSIX } static bool load_checkpoint(const char *path, CkptHeader *hdr, float *W1, float *W2, int H, int D) { FILE *f = fopen(path, "rb"); if (!f) return false; - fread(hdr, sizeof(CkptHeader), 1, f); - fread(W1, sizeof(float), H * D, f); - fread(W2, sizeof(float), D * H, f); + if (fread(hdr, sizeof(CkptHeader), 1, f) != 1) { fclose(f); return false; } + if (fread(W1, sizeof(float), H * D, f) != (size_t)(H * D)) { fclose(f); return false; } + if (fread(W2, sizeof(float), D * H, f) != (size_t)(D * H)) { fclose(f); return false; } fclose(f); return true; } diff --git a/training/train_large.m b/training/train_large.m index e58ce08..96f8f7a 100644 --- a/training/train_large.m +++ b/training/train_large.m @@ -5,9 +5,9 @@ #include "stories_mil.h" #include "stories_cpu_ops.h" 
-#define CKPT_PATH "ane_stories110M_ckpt.bin"
-#define MODEL_PATH "../../assets/models/stories110M.bin"
-#define DATA_PATH "tinystories_data00.bin"
+#define CKPT_PATH_DEFAULT "ane_stories110M_ckpt.bin"
+#define MODEL_PATH_DEFAULT "stories110M.bin"
+#define DATA_PATH_DEFAULT "tinystories_data00.bin"
 
 // ===== Weight loading from llama2.c format =====
 static bool load_pretrained(LayerWeights *lw, float *rms_final, float *embed,
                             const char *path) {
@@ -192,12 +192,26 @@ int main(int argc, char *argv[]) {
     float adam_b1=0.9f, adam_b2=0.999f, adam_eps=1e-8f;
     int adam_t = 0, start_step = 0;
 
-    // Parse args
-    bool do_resume = false;
-    for (int i=1; i