From 752a3be81a4b152e6ca505f119f3a881a14a34ce Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 3 Mar 2026 00:54:46 +0000 Subject: [PATCH 01/21] Add Project Scope & Intent notice to README Weave in scope notice near the top covering project intent, what it is/isn't, hype clarification, maintenance expectations, and fork encouragement. Consolidate private API disclaimer with existing disclaimer section to avoid duplication. https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv --- README.md | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 55 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index d2c7bb2..ce3df1f 100644 --- a/README.md +++ b/README.md @@ -2,6 +2,56 @@ Training neural networks directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. No CoreML training APIs, no Metal, no GPU — pure ANE compute. +## Project Scope & Intent + +I'm genuinely grateful for all the attention this project has received — I never expected a weekend research hack to blow up like this. Thank you to everyone who starred, forked, ran benchmarks on their own hardware, and shared the work. It means a lot. + +That said, I want to set clear expectations about what this project is and isn't. + +This is a **research project**, not a production framework. + +The goal was to demonstrate that **training on the Apple Neural Engine — and potentially other NPUs — is possible**, and that the barrier has always been software support, not hardware capability. The ANE is a remarkably capable piece of silicon that Apple restricts to inference-only use through CoreML. This project bypasses that restriction using reverse-engineered private APIs to show what's possible when you give the hardware a chance. 
+ +### What this project is + +- A proof of concept for ANE training via `_ANEClient` and `_ANECompiler` private APIs +- A set of benchmarks documenting real ANE performance characteristics (throughput, power, SRAM behavior) +- A reference for anyone exploring direct ANE access outside CoreML +- Research code that I update when I find something interesting + +### What this project is not + +- A maintained framework or library +- A replacement for CoreML, MLX, llama.cpp, or any production inference stack +- A path to training large models on consumer hardware (yet) + +### On the hype + +Some coverage of this project has overstated its implications. To be clear: + +- Training works, but utilization is low (~2-3% of peak) with significant engineering challenges remaining +- Many element-wise operations still fall back to CPU +- This does **not** replace GPU training for anything beyond small research models today + +The honest results — including all limitations — are documented in the accompanying articles: +- [Part 1: Reverse Engineering](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine) +- [Part 2: Benchmarks](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615) + +### On maintenance + +I don't intend to grow this into a large community project. My focus is on original research (compiler infrastructure for edge AI optimization), and maintaining an open-source framework takes time away from that. + +That said: +- I'll keep pushing updates when I discover something interesting +- Bug fixes and benchmark contributions (especially on hardware I don't own) are welcome +- Feature requests will likely go unaddressed — but feel free to fork + +### Fork it, build on it + +This is MIT licensed for a reason. Everyone now has access to AI-assisted development tools that can adapt and extend code in hours. If this project is useful to you — take it, modify it, build something better. If you do something cool with it, I'd love to hear about it. 
+ +--- + ## What This Is A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware. @@ -104,8 +154,12 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve ## Disclaimer -This project is independent research into Apple Neural Engine architecture. It uses undocumented APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk. +This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk. 
## License MIT — see [LICENSE](LICENSE) + +--- + +*Built by a human + Claude, one weekend at a time.* From 2b3b7ae5ccf072774b9b8f5a2036b89fed75aa39 Mon Sep 17 00:00:00 2001 From: tastyheadphones Date: Tue, 3 Mar 2026 11:42:42 +0900 Subject: [PATCH 02/21] Fix token sampling underflow on short datasets --- training/train_large.m | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/training/train_large.m b/training/train_large.m index e58ce08..e33f2eb 100644 --- a/training/train_large.m +++ b/training/train_large.m @@ -274,11 +274,17 @@ int main(int argc, char *argv[]) { int data_fd = open(DATA_PATH, O_RDONLY); if (data_fd < 0) { printf("Cannot open %s\n", DATA_PATH); return 1; } struct stat st; fstat(data_fd, &st); - size_t data_len = st.st_size; - uint16_t *token_data = (uint16_t*)mmap(NULL, data_len, PROT_READ, MAP_PRIVATE, data_fd, 0); - if (token_data == MAP_FAILED) { printf("mmap failed\n"); return 1; } - size_t n_tokens = data_len / 2; - printf("Token data: %zu tokens (%.1f MB)\n", n_tokens, data_len/1e6); + size_t data_len = st.st_size; + uint16_t *token_data = (uint16_t*)mmap(NULL, data_len, PROT_READ, MAP_PRIVATE, data_fd, 0); + if (token_data == MAP_FAILED) { printf("mmap failed\n"); return 1; } + size_t n_tokens = data_len / 2; + if (n_tokens <= (size_t)(SEQ + 1)) { + printf("Token data too short: need at least %d tokens, got %zu\n", SEQ + 2, n_tokens); + munmap(token_data, data_len); + close(data_fd); + return 1; + } + printf("Token data: %zu tokens (%.1f MB)\n", n_tokens, data_len/1e6); // Gradient buffers shared across layers (reused each step) float *dy = (float*)malloc(SEQ*DIM*4); // gradient flowing backward From ebac5dd73f9f3f9f59df6dc3a735f3ba1171c095 Mon Sep 17 00:00:00 2001 From: Vipul Date: Tue, 3 Mar 2026 02:04:36 -0500 Subject: [PATCH 03/21] Python Bridge+Memory leak fix+More functions --- bridge/Makefile | 17 + bridge/ane_bridge.h | 87 +++++ bridge/ane_bridge.m | 328 +++++++++++++++++ bridge/libane_bridge.dylib 
| Bin 0 -> 54480 bytes training/Makefile | 14 +- training/README.md | 64 +++- training/ane_classifier.h | 102 ++++++ training/ane_rmsnorm_bwd.h | 78 ++++ training/download_data.sh | 91 +++++ training/test_classifier.m | 255 +++++++++++++ training/test_rmsnorm_bwd.m | 123 +++++++ training/train_large_ane.m | 695 ++++++++++++++++++++++++++++++++++++ 12 files changed, 1847 insertions(+), 7 deletions(-) create mode 100644 bridge/Makefile create mode 100644 bridge/ane_bridge.h create mode 100644 bridge/ane_bridge.m create mode 100755 bridge/libane_bridge.dylib create mode 100644 training/ane_classifier.h create mode 100644 training/ane_rmsnorm_bwd.h create mode 100755 training/download_data.sh create mode 100644 training/test_classifier.m create mode 100644 training/test_rmsnorm_bwd.m create mode 100644 training/train_large_ane.m diff --git a/bridge/Makefile b/bridge/Makefile new file mode 100644 index 0000000..753d749 --- /dev/null +++ b/bridge/Makefile @@ -0,0 +1,17 @@ +CC = xcrun clang +CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc -fPIC +FRAMEWORKS = -framework Foundation -framework IOSurface -ldl +TARGET = libane_bridge.dylib + +all: $(TARGET) + +$(TARGET): ane_bridge.m ane_bridge.h + $(CC) $(CFLAGS) -dynamiclib -o $@ ane_bridge.m $(FRAMEWORKS) + +test: test_bridge.m ane_bridge.h $(TARGET) + $(CC) $(CFLAGS) -o test_bridge test_bridge.m -L. 
-lane_bridge $(FRAMEWORKS) + +clean: + rm -f $(TARGET) test_bridge + +.PHONY: all clean test diff --git a/bridge/ane_bridge.h b/bridge/ane_bridge.h new file mode 100644 index 0000000..3e8ff47 --- /dev/null +++ b/bridge/ane_bridge.h @@ -0,0 +1,87 @@ +// ane_bridge.h — C-callable bridge to ANE private APIs for Python ctypes +// Wraps _ANEInMemoryModel via private AppleNeuralEngine.framework + +#ifndef ANE_BRIDGE_H +#define ANE_BRIDGE_H + +#include +#include +#include + +#ifdef __cplusplus +extern "C" { +#endif + +// Opaque kernel handle +typedef struct ANEKernelHandle ANEKernelHandle; + +// Initialize ANE runtime (load private framework, resolve classes) +// Returns 0 on success, -1 on failure +int ane_bridge_init(void); + +// Compile a MIL program with weight blobs into an ANE kernel +// mil_text: UTF-8 MIL program text +// mil_len: length of MIL text +// weight_data: raw weight blob (can be NULL) +// weight_len: length of weight blob +// n_inputs: number of input tensors +// input_sizes: array of byte sizes for each input +// n_outputs: number of output tensors +// output_sizes: array of byte sizes for each output +// Returns kernel handle or NULL on failure +ANEKernelHandle *ane_bridge_compile(const char *mil_text, size_t mil_len, + const uint8_t *weight_data, size_t weight_len, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes); + +// Compile with multiple named weight files (for transformer kernels) +// weight_names: array of weight file paths (e.g. 
"@model_path/weights/wq.bin") +// weight_datas: array of weight data pointers +// weight_lens: array of weight data lengths +// n_weights: number of weight files +ANEKernelHandle *ane_bridge_compile_multi_weights( + const char *mil_text, size_t mil_len, + const char **weight_names, const uint8_t **weight_datas, + const size_t *weight_lens, int n_weights, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes); + +// Evaluate (run) a compiled kernel on ANE +// Returns true on success +bool ane_bridge_eval(ANEKernelHandle *kernel); + +// Write data to kernel input tensor +void ane_bridge_write_input(ANEKernelHandle *kernel, int idx, + const void *data, size_t bytes); + +// Read data from kernel output tensor +void ane_bridge_read_output(ANEKernelHandle *kernel, int idx, + void *data, size_t bytes); + +// Free a compiled kernel and all associated resources +void ane_bridge_free(ANEKernelHandle *kernel); + +// Get compile count (for exec() restart budgeting) +int ane_bridge_get_compile_count(void); + +// Reset compile count +void ane_bridge_reset_compile_count(void); + +// Build a weight blob in ANE format (128-byte header + fp16 data) +// src: float32 weights [rows x cols] +// Returns allocated buffer and sets out_len. Caller must free(). 
+uint8_t *ane_bridge_build_weight_blob(const float *src, int rows, int cols, + size_t *out_len); + +// Build a transposed weight blob in ANE format +uint8_t *ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols, + size_t *out_len); + +// Free a blob allocated by ane_bridge_build_weight_blob* +void ane_bridge_free_blob(void *ptr); + +#ifdef __cplusplus +} +#endif + +#endif // ANE_BRIDGE_H diff --git a/bridge/ane_bridge.m b/bridge/ane_bridge.m new file mode 100644 index 0000000..2b27ddc --- /dev/null +++ b/bridge/ane_bridge.m @@ -0,0 +1,328 @@ +// ane_bridge.m — Objective-C implementation of ANE bridge for Python ctypes +// Wraps _ANEInMemoryModel private APIs into C-callable functions + +#import +#import +#import +#import +#import +#include "ane_bridge.h" + +// --- Private class references --- +static Class g_ANEDesc = nil; +static Class g_ANEInMem = nil; +static Class g_ANEReq = nil; +static Class g_ANEIO = nil; +static bool g_initialized = false; +static int g_compile_count = 0; + +// --- Kernel handle struct --- +struct ANEKernelHandle { + id model; // _ANEInMemoryModel + IOSurfaceRef *ioInputs; + IOSurfaceRef *ioOutputs; + id request; // _ANERequest + NSString *tmpDir; + int nInputs, nOutputs; + size_t *inputBytes; + size_t *outputBytes; +}; + +// --- Public API --- + +int ane_bridge_init(void) { + if (g_initialized) return 0; + + void *handle = dlopen( + "/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", + RTLD_NOW); + if (!handle) { + fprintf(stderr, "ane_bridge: Failed to load AppleNeuralEngine.framework\n"); + return -1; + } + + g_ANEDesc = NSClassFromString(@"_ANEInMemoryModelDescriptor"); + g_ANEInMem = NSClassFromString(@"_ANEInMemoryModel"); + g_ANEReq = NSClassFromString(@"_ANERequest"); + g_ANEIO = NSClassFromString(@"_ANEIOSurfaceObject"); + + if (!g_ANEDesc || !g_ANEInMem || !g_ANEReq || !g_ANEIO) { + fprintf(stderr, "ane_bridge: Failed to resolve ANE private classes\n"); + return -1; + } + + 
g_initialized = true; + g_compile_count = 0; + return 0; +} + +static IOSurfaceRef create_surface(size_t bytes) { + return IOSurfaceCreate((__bridge CFDictionaryRef)@{ + (id)kIOSurfaceWidth: @(bytes), + (id)kIOSurfaceHeight: @1, + (id)kIOSurfaceBytesPerElement: @1, + (id)kIOSurfaceBytesPerRow: @(bytes), + (id)kIOSurfaceAllocSize: @(bytes), + (id)kIOSurfacePixelFormat: @0 + }); +} + +ANEKernelHandle *ane_bridge_compile_multi_weights( + const char *mil_text, size_t mil_len, + const char **weight_names, const uint8_t **weight_datas, + const size_t *weight_lens, int n_weights, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes) +{ + @autoreleasepool { + if (!g_initialized) { + fprintf(stderr, "ane_bridge: Not initialized\n"); + return NULL; + } + + NSData *milData = [NSData dataWithBytes:mil_text length:mil_len]; + NSError *e = nil; + + // Build weight dictionary + NSMutableDictionary *wdict = [NSMutableDictionary dictionary]; + for (int i = 0; i < n_weights; i++) { + NSString *name = [NSString stringWithUTF8String:weight_names[i]]; + NSData *data = [NSData dataWithBytes:weight_datas[i] length:weight_lens[i]]; + wdict[name] = @{@"offset": @0, @"data": data}; + } + + id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)( + g_ANEDesc, @selector(modelWithMILText:weights:optionsPlist:), + milData, wdict.count > 0 ? 
wdict : nil, nil); + if (!desc) { + fprintf(stderr, "ane_bridge: modelWithMILText failed\n"); + return NULL; + } + + id mdl = ((id(*)(Class,SEL,id))objc_msgSend)( + g_ANEInMem, @selector(inMemoryModelWithDescriptor:), desc); + if (!mdl) { + fprintf(stderr, "ane_bridge: inMemoryModelWithDescriptor failed\n"); + return NULL; + } + + // Pre-populate temp dir + id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier)); + NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx]; + NSFileManager *fm = [NSFileManager defaultManager]; + [fm createDirectoryAtPath:[td stringByAppendingPathComponent:@"weights"] + withIntermediateDirectories:YES attributes:nil error:nil]; + [milData writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES]; + + for (int i = 0; i < n_weights; i++) { + NSString *name = [NSString stringWithUTF8String:weight_names[i]]; + // Extract filename from path like "@model_path/weights/wq.bin" -> "weights/wq.bin" + NSString *relPath = name; + if ([name hasPrefix:@"@model_path/"]) { + relPath = [name substringFromIndex:12]; + } + NSString *fullPath = [td stringByAppendingPathComponent:relPath]; + NSString *dir = [fullPath stringByDeletingLastPathComponent]; + [fm createDirectoryAtPath:dir withIntermediateDirectories:YES attributes:nil error:nil]; + NSData *data = [NSData dataWithBytes:weight_datas[i] length:weight_lens[i]]; + [data writeToFile:fullPath atomically:YES]; + } + + // Compile + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + mdl, @selector(compileWithQoS:options:error:), 21, @{}, &e)) { + fprintf(stderr, "ane_bridge: ANE compile failed: %s\n", + e ? 
[[e description] UTF8String] : "unknown"); + [fm removeItemAtPath:td error:nil]; + return NULL; + } + + // Load (with one retry after a brief pause for ANE slot reclamation) + BOOL loaded = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e); + if (!loaded) { + fprintf(stderr, "ane_bridge: ANE load failed (retrying in 100ms): %s\n", + e ? [[e description] UTF8String] : "unknown"); + usleep(100000); // 100ms + e = nil; + loaded = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e); + } + if (!loaded) { + fprintf(stderr, "ane_bridge: ANE load failed after retry: %s\n", + e ? [[e description] UTF8String] : "unknown"); + [fm removeItemAtPath:td error:nil]; + return NULL; + } + + g_compile_count++; + + // Create kernel handle + ANEKernelHandle *k = (ANEKernelHandle *)calloc(1, sizeof(ANEKernelHandle)); + k->model = mdl; + k->tmpDir = td; + k->nInputs = n_inputs; + k->nOutputs = n_outputs; + k->inputBytes = (size_t *)malloc(n_inputs * sizeof(size_t)); + k->outputBytes = (size_t *)malloc(n_outputs * sizeof(size_t)); + memcpy(k->inputBytes, input_sizes, n_inputs * sizeof(size_t)); + memcpy(k->outputBytes, output_sizes, n_outputs * sizeof(size_t)); + + // Create IOSurfaces + k->ioInputs = (IOSurfaceRef *)malloc(n_inputs * sizeof(IOSurfaceRef)); + k->ioOutputs = (IOSurfaceRef *)malloc(n_outputs * sizeof(IOSurfaceRef)); + for (int i = 0; i < n_inputs; i++) + k->ioInputs[i] = create_surface(input_sizes[i]); + for (int i = 0; i < n_outputs; i++) + k->ioOutputs[i] = create_surface(output_sizes[i]); + + // Build request + NSMutableArray *wIns = [NSMutableArray arrayWithCapacity:n_inputs]; + NSMutableArray *iIdx = [NSMutableArray arrayWithCapacity:n_inputs]; + for (int i = 0; i < n_inputs; i++) { + [wIns addObject:((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)( + g_ANEIO, @selector(objectWithIOSurface:), k->ioInputs[i])]; + [iIdx addObject:@(i)]; + } + 
NSMutableArray *wOuts = [NSMutableArray arrayWithCapacity:n_outputs]; + NSMutableArray *oIdx = [NSMutableArray arrayWithCapacity:n_outputs]; + for (int i = 0; i < n_outputs; i++) { + [wOuts addObject:((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)( + g_ANEIO, @selector(objectWithIOSurface:), k->ioOutputs[i])]; + [oIdx addObject:@(i)]; + } + k->request = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)( + g_ANEReq, + @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), + wIns, iIdx, wOuts, oIdx, nil, nil, @0); + + return k; + } +} + +ANEKernelHandle *ane_bridge_compile(const char *mil_text, size_t mil_len, + const uint8_t *weight_data, size_t weight_len, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes) { + if (weight_data && weight_len > 0) { + const char *name = "@model_path/weights/weight.bin"; + return ane_bridge_compile_multi_weights( + mil_text, mil_len, + &name, &weight_data, &weight_len, 1, + n_inputs, input_sizes, + n_outputs, output_sizes); + } else { + return ane_bridge_compile_multi_weights( + mil_text, mil_len, + NULL, NULL, NULL, 0, + n_inputs, input_sizes, + n_outputs, output_sizes); + } +} + +bool ane_bridge_eval(ANEKernelHandle *kernel) { + @autoreleasepool { + if (!kernel || !kernel->model) return false; + NSError *e = nil; + return ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + kernel->model, @selector(evaluateWithQoS:options:request:error:), + 21, @{}, kernel->request, &e); + } +} + +void ane_bridge_write_input(ANEKernelHandle *kernel, int idx, + const void *data, size_t bytes) { + if (!kernel || idx < 0 || idx >= kernel->nInputs) return; + IOSurfaceLock(kernel->ioInputs[idx], 0, NULL); + memcpy(IOSurfaceGetBaseAddress(kernel->ioInputs[idx]), data, bytes); + IOSurfaceUnlock(kernel->ioInputs[idx], 0, NULL); +} + +void ane_bridge_read_output(ANEKernelHandle *kernel, int idx, + void *data, size_t bytes) { + if (!kernel || idx < 0 || idx >= 
kernel->nOutputs) return; + IOSurfaceLock(kernel->ioOutputs[idx], kIOSurfaceLockReadOnly, NULL); + memcpy(data, IOSurfaceGetBaseAddress(kernel->ioOutputs[idx]), bytes); + IOSurfaceUnlock(kernel->ioOutputs[idx], kIOSurfaceLockReadOnly, NULL); +} + +void ane_bridge_free(ANEKernelHandle *kernel) { + @autoreleasepool { + if (!kernel) return; + NSError *e = nil; + if (kernel->model) { + ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)( + kernel->model, @selector(unloadWithQoS:error:), 21, &e); + } + for (int i = 0; i < kernel->nInputs; i++) + if (kernel->ioInputs[i]) CFRelease(kernel->ioInputs[i]); + for (int i = 0; i < kernel->nOutputs; i++) + if (kernel->ioOutputs[i]) CFRelease(kernel->ioOutputs[i]); + if (kernel->tmpDir) { + [[NSFileManager defaultManager] removeItemAtPath:kernel->tmpDir error:nil]; + } + free(kernel->ioInputs); + free(kernel->ioOutputs); + free(kernel->inputBytes); + free(kernel->outputBytes); + + // Explicitly nil Objective-C objects to trigger ARC release before freeing struct + kernel->model = nil; + kernel->request = nil; + kernel->tmpDir = nil; + + free(kernel); + } +} + +int ane_bridge_get_compile_count(void) { + return g_compile_count; +} + +void ane_bridge_reset_compile_count(void) { + g_compile_count = 0; +} + +uint8_t *ane_bridge_build_weight_blob(const float *src, int rows, int cols, + size_t *out_len) { + int wsize = rows * cols * 2; // fp16 + int total = 128 + wsize; + uint8_t *buf = (uint8_t *)calloc(total, 1); + + // ANE blob header + buf[0] = 0x01; buf[4] = 0x02; + buf[64] = 0xEF; buf[65] = 0xBE; buf[66] = 0xAD; buf[67] = 0xDE; + buf[68] = 0x01; + *(uint32_t*)(buf + 72) = wsize; + *(uint32_t*)(buf + 80) = 128; + + // Convert float32 -> float16 + _Float16 *fp16 = (_Float16 *)(buf + 128); + for (int i = 0; i < rows * cols; i++) { + fp16[i] = (_Float16)src[i]; + } + + *out_len = total; + return buf; +} + +uint8_t *ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols, + size_t *out_len) { + int wsize = rows 
* cols * 2; + int total = 128 + wsize; + uint8_t *buf = (uint8_t *)calloc(total, 1); + + buf[0] = 0x01; buf[4] = 0x02; + buf[64] = 0xEF; buf[65] = 0xBE; buf[66] = 0xAD; buf[67] = 0xDE; + buf[68] = 0x01; + *(uint32_t*)(buf + 72) = wsize; + *(uint32_t*)(buf + 80) = 128; + + _Float16 *fp16 = (_Float16 *)(buf + 128); + for (int i = 0; i < rows; i++) + for (int j = 0; j < cols; j++) + fp16[j * rows + i] = (_Float16)src[i * cols + j]; + + *out_len = total; + return buf; +} diff --git a/bridge/libane_bridge.dylib b/bridge/libane_bridge.dylib new file mode 100755 index 0000000000000000000000000000000000000000..72acc32e285cd688d74b74f5735e8b6cfcae02e6 GIT binary patch literal 54480 zcmeHw4RjR8m2S+rVKtc$F&0rt~OaTAH2E{QwlK2t7iX_1CIv!dxEve0DM)dRm z0t~~CW5+mN#`Y$Dc1|E*zeT;PEKW!^tnD?ioh%zCYvW{{-F>kUHft-vFToMUGyGWZ zyWKUTRtwO}%Xw#a&sHDq+`4sd-MaPN>guX#&EfTP|NGNnjJX)Dd{8MUUBK7_teC3U z7|;ecW5HnKyt^9Bs(~NUK#_4d57k6)WnRIcu58e`0Os_!&}Zq|ODs+@y6wNbEZHwm z`?#KHis!1F{(e|tsY=a}F-Ef8pO5>)*v%^#j5XaC>UZk&*FDa%OL8Y9)A(}Np!h1gqtzykS3l;^J+IC9N{`WnRgFap7LG79_3?2Aksj zcmH62J!ntYbc>mv_Hky@omJB~7lMMpCAG^I&#S4mW*cKixP`r$EKW1z>m0!bp zj79Y-l5FSt-Ha+f z$Qt7{2xzQXHnDX5lXn_)_QcS7x*9~jtlX3*40>aMsv1y=1N72qb5GH+NOuk!TzlbT)}_s@`i-gEQgd!xev_Dso~x7 zKO4M1{w0`tE;cg}*JdJVMAlGdRN2yK#x=0rNq8C9j-D_x0`^u`f2{-hBfP%Ot}veJ zn5JAAv%t)OHtEQIYKhcE4r=)k-Ym&#-!{`ON#%?>+Gr*JI^e4bza=$ z?!35zr5fO>bzk)yjO{i|>H6EPt;IpM_Y0{20)acXA7_WBI9Ls9S-$z1y>P`Kh0w4)^YEY+5o$H%ggSH#_vu7AZ3Y{J?f8 z)8k@?j({kRDWKMUo;hSceuV4r2*fo7W#2q0<8etD4EG@GzB6?qKUt64%&B{WrCBNf z-SzPE3i^f5kry{rVeEmu?w-QJI}QY?@R};L47gwBY@3FVB<|? 
zJO#!lo_yoHC(k(N@fy7zk8#H1Ha_;ajI$nQnDGRpj1Tb~@Qh7=AN{0JPci$x4NS+e z+j>tGbY_Cy&h$){sYJc>=ThcfPnq$aXOuzMUi4gR`~bZB7M3A@?0-m^i@g3I_yzbo z-@p!UEnsUG=d;G12I=s}9^bT2Jf%kZ_X|4BHY4EkD^mLQJj>rcw4?Z_FZY8|Is|%y z`+OQa!(%UW6{O5DY%550PMmG}@_oB7b-**n@|%(}`(AM!+wXB7s{xM3xsQjzdp%=; z2{XDqEf2IBdh$CB+-#rpr9S>V z*^V-v<36TyAKkzt+y1A#-8gWX8{_lX+Vo5*^I9QmynuOv=W1#9=Lwq%V-NQ;7Myao zpUY+7gjM(LWH)9&T$RvuKZ<*EF6Pn$Qo3g@%k)&^9;9+L<}~+_2Ts_~T%d71L+ubd z#tc3N_YLu}#ve>xVVpL3nejVt3^?`80``&L%Nk8gF`u`)rOXcQn`}M+5q=i7g(<@N zJn-^o`ferMw!4z`Te? z#D8uv%VgW0taf8AjKQ2PF{tf#U~kT&#Zu-ulz|Om9c%m*)x*ab;ClWBDT96C;B~Ft zo!aKL;{F=!DQ?R5XA%3LF?v0z6M2^HcG!-EZPxx;n|%rH!Ly^K%u`tBDUN@JEzS4c z;1t^~@Fzjt9+tsV{EZ2c7vm{0TFkmq<9=`)vr6&o*!{_T_(JX|ezb+ZQsYO^lfTX2 zH174_J3%>rr^#QL(O}k%GE{KvT&d23;l%(_3iUL; z5n?*)Mn3``KRYZph?g33%TETh1m9%vk_>%6InyDN}mI>IVBmSvta>&lH%)X_bVyv|w|G83f?t&=qVV(19N6Q0qK%Z*{C zTw)Yt+vHQ5a>E0e#&HJsak7YI&Uq@P(X-_$7i&C&+M3#}PWJAIa-K zMm;_2=Ad0Eh+=xp!x|}{nAfMvM@bpPm3evNrt}Gn<89cJzYo9(Cu>c_dIOo(oqqvO z4yQgU49XqZ{yJw%&n!HDx$V!vHx;l{&{J{h#D%`b-ig2p@%19Fgzayjr}jSsC;UlA z_3!cepMcYP^U3Qqom=y*xXXFmV|m=~Zrp&l#~Odb+Z+bB;x02|ISQHLHo&d8%Z(Bq z_jhu(6{hWr+;$ImIpW^xE;r*QoY*$(r@&ofurC3p=Q-U!y_hRz%y`zGM2`O*xa1?o z0?@oul7ak+3K)&TaG>4vd#;!rJdN*rD$*0i#3SWv`XwPX-(S; ze>BEF1E+g&C3pzL*Pw$FUd=Ptw42SkQez?bC)j^{@M+&@;CRrs4&`ib;5P5TrUK=@ z9qwaiR}~;GANG)iofon0Q(tBtFa~pFF#&n?+Nl%p%~@bEB*CH#aaV-=F1vv3JS({3`c34&0n`H*(v_ z+;()o?G|o30-WZ$+5hLbT?KE4In}*Cuak5vg?aTJOT9OdrC!B)N4WmPm0j=3jI$`y zygp;{65|ix2X=gUP+m_k%4;d|S_}LZBCiWNfv1_vSJ9@=<2rV6lc)3ID=hUJ?8kes zAK!n~+|KPdFN8qA{M2u;MtuNG-}kV??}k`v|3r4|nQC@yGGe21LRRO-yr*Y| z7li#+{ri~fA9xCl^B4NQadA_1C;Bt{?t6tDv!9nJC(k3E2F$Bg%rQE%;EZ;9!nIN+ zj2!l%Z=AVLhp=b(cVJ?z`#$rze-Q1cJ&n_iSW3|Ev#6uBD|Dsab(x*QbV@s zsGT#G&+}Ml4xC-(!u-J5mCpsja~0bEqL3Y;`MMf8S&f`bK@MCKtbEYiqq1eakoyh_RYKS3>*a+U+4w`?(rd(#^fzy0H zv%CuTTp%~^2|tFt49x*5(;1w~>EH{nbLPN`=I>#K}-Y?^Ou+pQ=n;t7X;!YWESLzh*-+*gNS9Rgmu6ZM*t_5QPT@zX-40p9o zShgBm0^bL|9sE7;SHLH1yR+;1ug&e6>T;zPx!kEW+t^1ZF7`E!ca@#;W1b%`+VTf{ 
zAJIs@k7`>U>nyrzSCRLsU8b$e_nJTNM|Uvt{Q$eLW(0eV_DR%MLYqRbZ`V}#G>Uw? zj=FuXN%N!^*pZ3X`6si@zC3GBr6XT}7wLFh!nhQ9)kr%9ao4lPDu^wf= zfA|FVIpIBJv%8|aV|;_p;~r54yyuvBmzj8P1m5_r^pbdw0p8^{y!(LnDV!Gx=R)RA z{nk@DZO?GFw)^40##5vBj9sE^dTeRomdBQ@-SXHq!22NZPHoxpixZyVuf4lv>o4B( zjGl&kmSS9~-xsn}v3ta>?}MhIJl^Fwg_lc>d(g*Kz~`^_I!+rNmQU0#UEPvN$^73~jy;cs7>|__7Ew*xa(lgvRj&{jRU*o*Z z-lz9&3|qNVCvV>AWvOSGdsooqJr%%sUcoxB&dg&Ca#f9-RUvl)>_>iV<`Uo4oi2VD z>z;f3TV}b;R>nP?tH*tuD_?Ibk8@bEmDm$W~rvE7#k~D{bYwZRLAy7(y=%p|2T2zipM`Ci=KghWMy+;uX2A0jMjV@K%Vp_-2SXhbFDDjY{w(Bu%u+FThSJoyJyva6O zW?!W&Z@Nzj>6eK@Q{u76I>nE!{O#5_{h^2)k1O$_0Zo?0biW!^bybe28pK2c%8TSt^XTM%iTUzxvGab%oQzJ|= zi@|nTZ=E@yPl`VjYikF(ym_U6GKP?du8GFhM+e!O=v#65r)Y|MwH2I#+O)~vSp zRI9N)anymmk=(2+n%`_`8QVkSfzPz~j3zb8tQXN$9mu#64WqL&`v4+Z6JH3sIS1Oto(@2|$>MW!w zn6NdfhQwp?1a#`Gay^>VMiF+K5>{<`Rf$*1I{I!(=qPeelve7bs)H&cO0)$oSXgP6 z6A^u>9K}m;4Wcz85dsUVAqoX=3~fd7U=1c-*&NdrD;@DlGsa4$tdk=Ngy9^CW=-O3 z9IbM^4)3AW4VA1_*dX(H!E6YyP`EW=rWqNYq5|LPG|}Y6<~yWl20BBZYA7+zr8V+ z4JgeCR@G6XL=+u!aETl@r$%lbatjKre?p=iT$oMgYq2)oQ6g$bmk({drs_&#Y(8#7 zOsQC#8j>RsOj?HTj2Zn|4skz-Ex`4@71A1v@s372=(JzFCbk zkn0}2NdMt5=ALym^Blkn{VB!FD;2Z6XGXAmyi39{j(O21aW~HXUc42jvbY2S0s(=5 zKtLcM5D*9m1Ox&C0fB%(Kp-Fx5C{ka1Ofs9fq+0jARrJB2nYlO0{=-6C{EHlH2eY( ze|~LodM`$@^GX5^dd!R08CE5Wh6cQewX-I&gu88BwIh* z@8|S8N2XHfj~<4lU*%J*RZ z9uvS>c&FNbNhnS+#p2{&_^DuiP{C1ey~{Nf;zJ-H5D*9m1Ox&C0fB%(Kp-Fx5C{ka z1Ofs9fq+0jARrJB2nYlO0s;YnfIvVXAP^7;2m}NI0s(=5KtLcM5D*9m1Ox&C0fB%( zKp-Fx5C{ka1Ofs9fq+0jARrJB2nYlO0s;YnfIvVXAP^7;{5L~jsrduS_z`t0Xqbz! 
zso-F_%gamU;IneIjFHSF{4euTePE*#g=JItqFR^gGa}pkfd6i~~&v-2qzSVcvT{kAhwXy#qQ2 zD)lmNjhE#;0@?@q8OWW-^6mg7LEp_|uCsZJ1(!8cN91^Xz7}h1&^0yMLRF1QTYF5C zwT>E9Q$l)7>tMmF)k^{m4Z%q;nXg8arE*kmQM5t2Wr?eiRzb9)yLM`s~0M2OX~o`I(35*nIF^I&+-=3PNvQuc-t`8#ilH+CuHXJ`s;7 z3Yy27?h6Iw1QMb|6aZBhi$&^Ums2F-h|p}(7H?@#q6o#5c;lR;%PK(K~ted*5$#KYDiaOQOvJFs-+2CZi*-YO_MLHtg$r@Vk-8_ zvnU^0B7{^05ma?7iu6QvYvP&nyk2QmG$k5B(2fb6HLi{73MMt5!*ol#f^t*|Hfd_O zMG3Np=$`km4{yQ8LpQmCx+X{C?XkELW~CnX%aN!~PICp@5)oYut~cjXoZaYQ>1ojH zoPm$iv+!}W7$52B__%;Y#w~|~v4q~9(Ak&biXIHb+S+l0fhRC;??sy%8S^zI)JT~7 z3^qk#O>7J|WV4iYa)iw@r_rL8K^~T?QB`M+n!-Zn6bq_R>T3s$E7ZZ6-?Wp7d&|m7vdq?f`uObSG#YXg+8GXd#I1u|=T8pe3NCpk<(}AKFje0=gCC>}xLSBv1fU z1*!(sfT-`;pgEu`#X$YoNB+pjC_>i^iSnPY!7#Gzp5{n{0_djDTG;r-``|DeN#^Azx(b@!h=orKKe zTrh4O?)b@pA3Y|mXq-PgdjPbbNhZnR@baenaLTN7w2K_(H1y#gc??wW$gEJ=ivbZ3@h=Frnb|G$JxwwkQvw{r=XrKc;c1KF>SHZ5wA4I z?F^w+j@M~Qv%0~lh?<9S@&g1euUrw0t1VF_yd)NFaaw6gTWpqZ)JDU~Al%HQzcEHDBIZb}O{KFt&A}ENo>X0xBlxxfN9vYfVA;GH zCBBy3ceCT zT?BL&DRB*F?9JVNHydV&Lw%OCXDuRXQ~yJBMz{kqOPqXgy8P6iYiDjb`;BK$jotI~ z8_)gImpvoj|E(fV|4jMf>2sGS-g>UDefYC)&oA2Y{6F5jxcKPXPp{kcy}56eY<{e+ z@yTlsfA7GZ$?uiEw(h3J*n{b{Cr|Fuo?P_<(tensor x) {\n", DIM, SEQ]; + [m appendString:@CONV_CONST]; + [m appendFormat:@" tensor We = const()[name=string(\"We\"), " + "val=tensor(BLOBFILE(path=string(\"@model_path/weights/embed.bin\"), offset=uint64(64)))];\n", + VOCAB, DIM, VOCAB, DIM]; + [m appendFormat:@" tensor out = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=We,x=x)[name=string(\"cls\")];\n", VOCAB, SEQ]; + [m appendString:@" } -> (out);\n}\n"]; + return m; +} + +// ============================================================ +// Classifier backward: dx = embed^T @ dlogits +// ANE rejects conv with 32000 input channels. +// Use matmul instead: reshape dlogits to [1, VOCAB, SEQ], +// bake embed^T as [1, DIM, VOCAB], matmul → [1, DIM, SEQ], +// reshape back to [1, DIM, 1, SEQ]. 
+// ============================================================ +static NSString *gen_classifier_bwd(void) { + NSMutableString *m = [NSMutableString string]; + [m appendString:MIL_HDR]; + [m appendFormat:@" func main(tensor dl) {\n", VOCAB, SEQ]; + // Reshape dlogits from [1, VOCAB, 1, SEQ] to [1, VOCAB, SEQ] + [m appendFormat:@" tensor sh3 = const()[name=string(\"sh3\"), val=tensor([1,%d,%d])];\n", VOCAB, SEQ]; + [m appendFormat:@" tensor dl3 = reshape(shape=sh3,x=dl)[name=string(\"rdl\")];\n", VOCAB, SEQ]; + // embed_t as baked constant [1, DIM, VOCAB] + [m appendFormat:@" tensor Wet = const()[name=string(\"Wet\"), " + "val=tensor(BLOBFILE(path=string(\"@model_path/weights/embed_t.bin\"), offset=uint64(64)))];\n", + DIM, VOCAB, DIM, VOCAB]; + // matmul: [1, DIM, VOCAB] @ [1, VOCAB, SEQ] -> [1, DIM, SEQ] + [m appendString:@" bool bF = const()[name=string(\"bF\"), val=bool(false)];\n"]; + [m appendFormat:@" tensor dx3 = matmul(transpose_x=bF,transpose_y=bF,x=Wet,y=dl3)[name=string(\"mm\")];\n", DIM, SEQ]; + // Reshape back to [1, DIM, 1, SEQ] + [m appendFormat:@" tensor sh4 = const()[name=string(\"sh4\"), val=tensor([1,%d,1,%d])];\n", DIM, SEQ]; + [m appendFormat:@" tensor out = reshape(shape=sh4,x=dx3)[name=string(\"out\")];\n", DIM, SEQ]; + [m appendString:@" } -> (out);\n}\n"]; + return m; +} + +// ============================================================ +// Softmax over VOCAB dimension (channel axis) for cross-entropy +// Input: logits [1, VOCAB, 1, SEQ] +// Output: probs [1, VOCAB, 1, SEQ] +// +// softmax(x, axis=1) = exp(x - max(x)) / sum(exp(x - max(x))) +// +// Note: After getting probs from ANE, the NLL loss + gradient +// (prob[target] -= 1.0) are done on CPU since they need target indexing. 
// ============================================================
// Emit MIL for a channel-axis (axis=1) softmax over VOCAB.
// NOTE(review): appendFormat varargs without matching %d below —
// stripped tensor<...> type templates, same as the other generators.
// ============================================================
static NSString *gen_softmax_vocab(void) {
    NSMutableString *m = [NSMutableString string];
    [m appendString:MIL_HDR];
    [m appendFormat:@" func main(tensor x) {\n", VOCAB, SEQ];
    // axis=1 selects the channel (VOCAB) dimension.
    [m appendString:@" int32 ax = const()[name=string(\"ax\"), val=int32(1)];\n"];
    [m appendFormat:@" tensor out = softmax(axis=ax,x=x)[name=string(\"sm\")];\n", VOCAB, SEQ];
    [m appendString:@" } -> (out);\n}\n"];
    return m;
}

// ============================================================
// Final RMSNorm on ANE (replaces CPU rmsnorm for final layer)
// Input: x [1, DIM, 1, SEQ]
// Baked: rms_final weights [DIM]
// Output: xn [1, DIM, 1, SEQ]
//
// Computes x * 1/sqrt(mean(x^2) + eps) * w entirely in MIL ops:
// square → reduce_sum over channels → scale by 1/DIM → +eps →
// pow(-0.5) → broadcast-multiply back onto x → multiply by weights.
// ============================================================
static NSString *gen_final_rmsnorm(void) {
    float invd = 1.0f/(float)DIM;  // 1/DIM, baked as an fp16 constant below
    NSMutableString *m = [NSMutableString string];
    [m appendString:MIL_HDR];
    [m appendFormat:@" func main(tensor x) {\n", DIM, SEQ];
    // sq = x * x
    [m appendFormat:@" tensor sq = mul(x=x,y=x)[name=string(\"sq\")];\n", DIM, SEQ];
    [m appendFormat:@" tensor rax = const()[name=string(\"rax\"), val=tensor([1])];\n"];
    [m appendFormat:@" bool kd = const()[name=string(\"kd\"), val=bool(true)];\n"];
    // ss = sum over channel axis, keepdims → [1,1,1,SEQ]
    [m appendFormat:@" tensor ss = reduce_sum(x=sq,axes=rax,keep_dims=kd)[name=string(\"ss\")];\n", SEQ];
    [m appendFormat:@" fp16 invd = const()[name=string(\"invd\"), val=fp16(%f)];\n", invd];
    // ss2 = mean(x^2) = ss / DIM
    [m appendFormat:@" tensor ss2 = mul(x=ss,y=invd)[name=string(\"ss2\")];\n", SEQ];
    [m appendFormat:@" fp16 eps = const()[name=string(\"eps\"), val=fp16(0.00001)];\n"];
    [m appendFormat:@" tensor ss3 = add(x=ss2,y=eps)[name=string(\"ss3\")];\n", SEQ];
    [m appendFormat:@" fp16 nhalf = const()[name=string(\"nhalf\"), val=fp16(-0.5)];\n"];
    // rrms = (mean(x^2)+eps)^(-1/2)
    [m appendFormat:@" tensor rrms = pow(x=ss3,y=nhalf)[name=string(\"rrms\")];\n", SEQ];
    // xr = x * rrms (broadcast over channels)
    [m appendFormat:@" tensor xr = mul(x=x,y=rrms)[name=string(\"xr\")];\n", DIM, SEQ];
    // rw = baked RMSNorm scale weights (offset 64 skips the blob header).
    [m appendFormat:@" tensor rw = const()[name=string(\"rw\"), val=tensor(BLOBFILE(path=string(\"@model_path/weights/rms_w.bin\"), offset=uint64(64)))];\n", DIM, DIM];
    [m appendFormat:@" tensor out = mul(x=xr,y=rw)[name=string(\"out\")];\n", DIM, SEQ];
    [m appendString:@" } -> (out);\n}\n"];
    return m;
}
diff --git a/training/ane_rmsnorm_bwd.h b/training/ane_rmsnorm_bwd.h
new file mode 100644
index 0000000..eb51896
--- /dev/null
+++ b/training/ane_rmsnorm_bwd.h
@@ -0,0 +1,78 @@
// ane_rmsnorm_bwd.h — MIL generator for RMSNorm backward on ANE
// Replaces CPU rmsnorm_bwd() from stories_cpu_ops.h
//
// RMSNorm forward: xn = x * rrms * w, where rrms = 1/sqrt(mean(x²) + eps)
// RMSNorm backward: dx = w * rrms * (dy - x * sum(dy*w*x) * invd * rrms²)
//
// Input: concat(dy, x) as [1, 2*DIM, 1, SEQ]
// Baked: RMSNorm weights w [1, DIM, 1, 1] as BLOBFILE
// Output: dx [1, DIM, 1, SEQ]
//
// Note: dw (weight gradient) stays on CPU — it requires reduce_sum over SEQ
// and accumulation across steps, which is cheap and better done on CPU.
#pragma once
#include "stories_mil.h"

// Generate MIL for RMSNorm backward
// Input: concat(dy, x) [1, 2*DIM, 1, SEQ]
// Baked weights: rms_w [DIM] — the RMSNorm scale weights
// Output: dx [1, DIM, 1, SEQ]
//
// Implements dx = (dy*w - x * (sum(dy*w*x)/DIM) * rrms²) * rrms, which is
// algebraically the formula in the file header with rrms factored out.
// NOTE(review): appendFormat varargs without matching %d placeholders —
// the MIL tensor<dtype,[shape]> type templates were stripped from this
// copy of the file; restore before regenerating kernels.
static NSString *gen_rmsnorm_bwd(void) {
    float invd = 1.0f / (float)DIM;  // 1/DIM for the mean
    NSMutableString *m = [NSMutableString string];
    [m appendString:MIL_HDR];

    // Input: concat of dy and x along channel dimension
    [m appendFormat:@" func main(tensor inp) {\n", 2*DIM, SEQ];

    // Slice out dy [1, DIM, 1, SEQ] and x [1, DIM, 1, SEQ]
    [m appendFormat:@" tensor sz = const()[name=string(\"sz\"), val=tensor([1,%d,1,%d])];\n", DIM, SEQ];
    [m appendString:@" tensor b0 = const()[name=string(\"b0\"), val=tensor([0,0,0,0])];\n"];
    [m appendFormat:@" tensor dy = slice_by_size(x=inp,begin=b0,size=sz)[name=string(\"sdy\")];\n", DIM, SEQ];
    [m appendFormat:@" tensor b1 = const()[name=string(\"b1\"), val=tensor([0,%d,0,0])];\n", DIM];
    [m appendFormat:@" tensor x = slice_by_size(x=inp,begin=b1,size=sz)[name=string(\"sx\")];\n", DIM, SEQ];

    // Step 1: Compute rrms = 1/sqrt(mean(x²) + eps)
    // sq = x * x
    [m appendFormat:@" tensor sq = mul(x=x,y=x)[name=string(\"sq\")];\n", DIM, SEQ];
    // ss = sum(sq, axis=1, keepdims=true) → [1,1,1,SEQ]
    [m appendFormat:@" tensor rax = const()[name=string(\"rax\"), val=tensor([1])];\n"];
    [m appendFormat:@" bool kd = const()[name=string(\"kd\"), val=bool(true)];\n"];
    [m appendFormat:@" tensor ss = reduce_sum(x=sq,axes=rax,keep_dims=kd)[name=string(\"ss\")];\n", SEQ];
    // ss2 = ss * invd + eps
    [m appendFormat:@" fp16 invd = const()[name=string(\"invd\"), val=fp16(%f)];\n", invd];
    [m appendFormat:@" tensor ss2 = mul(x=ss,y=invd)[name=string(\"ss2\")];\n", SEQ];
    [m appendFormat:@" fp16 eps = const()[name=string(\"eps\"), val=fp16(0.00001)];\n"];
    [m appendFormat:@" tensor ss3 = add(x=ss2,y=eps)[name=string(\"ss3\")];\n", SEQ];
    // rrms = pow(ss3, -0.5) → [1,1,1,SEQ]
    [m appendFormat:@" fp16 nhalf = const()[name=string(\"nhalf\"), val=fp16(-0.5)];\n"];
    [m appendFormat:@" tensor rrms = pow(x=ss3,y=nhalf)[name=string(\"rrms\")];\n", SEQ];

    // Step 2: Load RMSNorm weights w [1, DIM, 1, 1]
    [m appendFormat:@" tensor w = const()[name=string(\"w\"), val=tensor(BLOBFILE(path=string(\"@model_path/weights/rms_w.bin\"), offset=uint64(64)))];\n", DIM, DIM];

    // Step 3: Compute dot = sum(dy * w * x, axis=1) * invd * rrms²
    // dyw = dy * w → [1, DIM, 1, SEQ]
    [m appendFormat:@" tensor dyw = mul(x=dy,y=w)[name=string(\"dyw\")];\n", DIM, SEQ];
    // dywx = dyw * x → [1, DIM, 1, SEQ]
    [m appendFormat:@" tensor dywx = mul(x=dyw,y=x)[name=string(\"dywx\")];\n", DIM, SEQ];
    // dot_sum = sum(dywx, axis=1, keepdims=true) → [1,1,1,SEQ]
    [m appendFormat:@" tensor dot_sum = reduce_sum(x=dywx,axes=rax,keep_dims=kd)[name=string(\"ds\")];\n", SEQ];
    // dot_scaled = dot_sum * invd → [1,1,1,SEQ]
    [m appendFormat:@" tensor dot_sc = mul(x=dot_sum,y=invd)[name=string(\"dsc\")];\n", SEQ];
    // rrms_sq = rrms * rrms → [1,1,1,SEQ]
    [m appendFormat:@" tensor rrms2 = mul(x=rrms,y=rrms)[name=string(\"rr2\")];\n", SEQ];
    // coeff = dot_scaled * rrms_sq → [1,1,1,SEQ]
    [m appendFormat:@" tensor coeff = mul(x=dot_sc,y=rrms2)[name=string(\"cof\")];\n", SEQ];

    // Step 4: dx = (dy * w - x * coeff) * rrms
    // x_coeff = x * coeff → [1, DIM, 1, SEQ]
    [m appendFormat:@" tensor xc = mul(x=x,y=coeff)[name=string(\"xc\")];\n", DIM, SEQ];
    // diff = dyw - xc → [1, DIM, 1, SEQ]
    [m appendFormat:@" tensor diff = sub(x=dyw,y=xc)[name=string(\"dif\")];\n", DIM, SEQ];
    // dx = diff * rrms → [1, DIM, 1, SEQ]
    [m appendFormat:@" tensor out = mul(x=diff,y=rrms)[name=string(\"out\")];\n", DIM, SEQ];

    [m appendString:@" } -> (out);\n}\n"];
    return m;
}
diff --git a/training/download_data.sh b/training/download_data.sh
new file mode 100755
index 0000000..2d27d96
--- /dev/null
+++ b/training/download_data.sh
@@ -0,0 +1,91 @@
#!/bin/bash
# Download pretokenized TinyStories data for ANE 
training
# Format: flat uint16 token IDs (Llama2 BPE, 32K vocab)
# Source: enio/TinyStories on HuggingFace (pretokenized with karpathy/llama2.c)
#
# The tar.gz contains data00.bin..data49.bin (50 shards).
# We extract only data00.bin and rename it to tinystories_data00.bin.

set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
OUTPUT="$SCRIPT_DIR/tinystories_data00.bin"

# Idempotent: if the shard is already present, report its size and stop.
# (stat -f%z is macOS/BSD; stat -c%s is the GNU coreutils fallback.)
if [ -f "$OUTPUT" ]; then
    SIZE=$(stat -f%z "$OUTPUT" 2>/dev/null || stat -c%s "$OUTPUT" 2>/dev/null)
    TOKENS=$((SIZE / 2))
    echo "$OUTPUT already exists ($TOKENS tokens, $(echo "scale=1; $SIZE/1000000" | bc) MB)"
    exit 0
fi

TAR_URL="https://huggingface.co/datasets/enio/TinyStories/resolve/main/tok32000/TinyStories_tok32000.tar.gz?download=true"
TAR_FILE="$SCRIPT_DIR/TinyStories_tok32000.tar.gz"

echo "=== TinyStories Data Download ==="
echo "Downloading pretokenized TinyStories (32K vocab, ~993 MB)..."
echo " Source: enio/TinyStories on HuggingFace"
echo " This will take a few minutes depending on your connection."
echo ""

# Download the tar.gz (curl preferred, wget fallback)
if [ ! -f "$TAR_FILE" ]; then
    if command -v curl &>/dev/null; then
        curl -L --progress-bar -o "$TAR_FILE" "$TAR_URL"
    elif command -v wget &>/dev/null; then
        wget --show-progress -O "$TAR_FILE" "$TAR_URL"
    else
        echo "Error: need curl or wget"
        exit 1
    fi
else
    echo "Tar file already downloaded, skipping..."
fi

# Verify it's actually a gzip file (not an error page)
if ! file "$TAR_FILE" | grep -q "gzip"; then
    echo "Error: Downloaded file is not a valid gzip archive."
    echo "Content: $(head -c 100 "$TAR_FILE")"
    rm -f "$TAR_FILE"
    exit 1
fi

echo ""
echo "Extracting data00.bin from archive..."

# List what's in the archive to find the right path
DATA_FILE=$(tar tzf "$TAR_FILE" 2>/dev/null | grep 'data00\.bin' | head -1)
if [ -z "$DATA_FILE" ]; then
    echo "Error: data00.bin not found in archive. 
Contents:"
    tar tzf "$TAR_FILE" | head -20
    exit 1
fi
echo " Found: $DATA_FILE"

# Extract just data00.bin
tar xzf "$TAR_FILE" -C "$SCRIPT_DIR" "$DATA_FILE"

# Move to expected location (might be in a subdirectory)
EXTRACTED="$SCRIPT_DIR/$DATA_FILE"
if [ "$EXTRACTED" != "$OUTPUT" ]; then
    mv "$EXTRACTED" "$OUTPUT"
    # Clean up any extracted subdirectories
    rmdir "$(dirname "$EXTRACTED")" 2>/dev/null || true
fi

# Clean up tar.gz to save disk space
echo "Cleaning up archive..."
rm -f "$TAR_FILE"

SIZE=$(stat -f%z "$OUTPUT" 2>/dev/null || stat -c%s "$OUTPUT" 2>/dev/null)
TOKENS=$((SIZE / 2))
echo ""
echo "Done: $OUTPUT"
echo " $TOKENS tokens ($(echo "scale=1; $SIZE/1000000" | bc) MB)"

# Sanity check: decode the first 10 little-endian uint16 tokens.
python3 -c "
import struct
with open('$OUTPUT', 'rb') as f:
    tokens = struct.unpack('<10H', f.read(20))
    print(f'First 10 tokens: {tokens}')
" 2>/dev/null || true
diff --git a/training/test_classifier.m b/training/test_classifier.m
new file mode 100644
index 0000000..363e46e
--- /dev/null
+++ b/training/test_classifier.m
@@ -0,0 +1,255 @@
// test_classifier.m — Test classifier matmul (32000 channels) and softmax on ANE
// This tests the riskiest operations: VOCAB-sized conv and softmax
// Build: xcrun clang -O2 -framework Foundation -framework IOSurface \
//        -framework CoreML -framework Accelerate -ldl -lobjc \
//        -o test_classifier test_classifier.m
#include "ane_classifier.h"
#include "stories_cpu_ops.h"

int main(void) {
  @autoreleasepool {
    setbuf(stdout, NULL);
    ane_init();
    mach_timebase_info(&g_tb);

    printf("=== Test: Classifier + Softmax on ANE ===\n");
    printf("DIM=%d SEQ=%d VOCAB=%d\n\n", DIM, SEQ, VOCAB);

    // ======== Test 1: Final RMSNorm ========
    // Compare gen_final_rmsnorm() on ANE against the CPU rmsnorm() reference.
    printf("--- Test 1: Final RMSNorm on ANE ---\n");
    {
      float *x = (float*)malloc(DIM * SEQ * 4);
      float *w = (float*)malloc(DIM * 4);
      float *out_cpu = (float*)malloc(DIM * SEQ * 4);
      float *out_ane = (float*)malloc(DIM * SEQ * 4);
      srand48(42);
      for (int i 
= 0; i < DIM * SEQ; i++) x[i] = (float)(drand48() * 2 - 1);
      for (int i = 0; i < DIM; i++) w[i] = (float)(drand48() * 0.5 + 0.75);

      // CPU reference result for comparison.
      rmsnorm(out_cpu, x, w, DIM, SEQ);

      Kern *kern = compile_kern_mil_w(gen_final_rmsnorm(), (@{
        @"@model_path/weights/rms_w.bin": @{@"offset":@0, @"data":build_blob(w, 1, DIM)},
      }), DIM*SEQ*2, DIM*SEQ*2);

      if (!kern) { printf("FAIL: Final RMSNorm compile failed\n"); return 1; }
      printf("Compile OK\n");

      io_write_fp16(kern->ioIn, x, DIM, SEQ);
      ane_eval(kern);
      io_read_fp16(kern->ioOut, out_ane, 0, DIM, SEQ);

      // Element-wise max abs error vs CPU (fp16 round-trip sets the 0.05 bar).
      float max_err = 0;
      for (int i = 0; i < DIM*SEQ; i++) {
        float e = fabsf(out_cpu[i] - out_ane[i]);
        if (e > max_err) max_err = e;
      }
      printf("Max error: %.6f %s\n\n", max_err, max_err < 0.05 ? "PASS ✅" : "FAIL ❌");
      free_kern(kern);
      free(x); free(w); free(out_cpu); free(out_ane);
    }

    // ======== Test 2: Classifier forward (32000-channel conv) ========
    printf("--- Test 2: Classifier Forward (VOCAB=%d channel conv) ---\n", VOCAB);
    {
      float *x_final = (float*)malloc(DIM * SEQ * 4);
      float *embed = (float*)malloc((size_t)VOCAB * DIM * 4);
      float *logits_cpu = (float*)malloc((size_t)VOCAB * SEQ * 4);
      float *logits_ane = (float*)malloc((size_t)VOCAB * SEQ * 4);

      srand48(123);
      for (int i = 0; i < DIM * SEQ; i++) x_final[i] = (float)(drand48() * 2 - 1) * 0.1f;
      for (size_t i = 0; i < (size_t)VOCAB * DIM; i++) embed[i] = (float)(drand48() * 2 - 1) * 0.02f;

      // CPU reference: logits = embed @ x_final
      // logits[v, t] = sum_d embed[v,d] * x_final[d,t]
      // embed is [VOCAB, DIM] row-major, x_final is [DIM, SEQ] channel-first
      uint64_t t0 = mach_absolute_time();
      cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
          VOCAB, SEQ, DIM, 1.0f,
          embed, DIM, x_final, SEQ, 0.0f, logits_cpu, SEQ);
      uint64_t t1 = mach_absolute_time();
      printf("CPU cblas_sgemm: %.2f ms\n", tb_ms(t1-t0));

      // ANE: build weight blob for embed [VOCAB, DIM]
      printf("Building embed blob (%.1f MB fp16)...\n", (float)VOCAB*DIM*2/1e6);
      NSData *embed_blob = build_blob(embed, VOCAB, DIM);

      printf("Compiling classifier kernel...\n");
      t0 = mach_absolute_time();
      Kern *cls = compile_kern_mil_w(gen_classifier_fwd(), (@{
        @"@model_path/weights/embed.bin": @{@"offset":@0, @"data":embed_blob},
      }), DIM*SEQ*2, VOCAB*SEQ*2);
      t1 = mach_absolute_time();

      // Compile failure is itself an interesting result here: it would
      // show the 32000-channel layer exceeds what the ANE compiler takes.
      if (!cls) {
        printf("FAIL: Classifier compile failed (32000 channels too large for ANE)\n");
        printf("This confirms tiling is needed.\n\n");
      } else {
        printf("Compile OK in %.0f ms (compiles=%d)\n", tb_ms(t1-t0), g_compile_count);

        io_write_fp16(cls->ioIn, x_final, DIM, SEQ);
        t0 = mach_absolute_time();
        ane_eval(cls);
        t1 = mach_absolute_time();
        printf("ANE eval: %.2f ms\n", tb_ms(t1-t0));

        // Read back and compare (sample — full read would be 32000*256*4 = 32MB)
        io_read_fp16(cls->ioOut, logits_ane, 0, VOCAB, SEQ);

        float max_err = 0, sum_err = 0;
        int cnt = 0;
        for (int v = 0; v < VOCAB; v++) {
          for (int t = 0; t < SEQ; t++) {
            int idx = v*SEQ + t;
            float e = fabsf(logits_cpu[idx] - logits_ane[idx]);
            sum_err += e;
            cnt++;
            if (e > max_err) max_err = e;
          }
        }
        // NOTE(review): the 1.0 absolute-error threshold is loose; it
        // tolerates fp16 accumulation error over a DIM-long dot product.
        printf("Max error: %.6f Mean error: %.6f %s\n",
            max_err, sum_err/cnt, max_err < 1.0 ? "PASS ✅" : "FAIL ❌");

        // Benchmark
        int N = 10;
        t0 = mach_absolute_time();
        for (int i = 0; i < N; i++) ane_eval(cls);
        t1 = mach_absolute_time();
        printf("Benchmark: %d evals in %.2f ms (%.2f ms/eval)\n\n", N, tb_ms(t1-t0), tb_ms(t1-t0)/N);
        free_kern(cls);
      }
      free(x_final); free(embed); free(logits_cpu); free(logits_ane);
    }

    // ======== Test 3: Softmax over VOCAB dimension ========
    printf("--- Test 3: Softmax over VOCAB=%d ---\n", VOCAB);
    {
      float *logits = (float*)malloc((size_t)VOCAB * SEQ * 4);
      float *probs_cpu = (float*)malloc((size_t)VOCAB * SEQ * 4);
      float *probs_ane = (float*)malloc((size_t)VOCAB * SEQ * 4);

      srand48(999);
      for (size_t i = 0; i < (size_t)VOCAB * SEQ; i++)
        logits[i] = (float)(drand48() * 10 - 5);

      // CPU reference softmax (per position, over vocab)
      // logits is [VOCAB, SEQ] channel-first
      uint64_t t0 = mach_absolute_time();
      for (int t = 0; t < SEQ; t++) {
        // Max-subtraction for numerical stability.
        float maxv = -1e30f;
        for (int v = 0; v < VOCAB; v++) {
          float val = logits[v*SEQ+t];
          if (val > maxv) maxv = val;
        }
        float sum = 0;
        for (int v = 0; v < VOCAB; v++) {
          probs_cpu[v*SEQ+t] = expf(logits[v*SEQ+t] - maxv);
          sum += probs_cpu[v*SEQ+t];
        }
        for (int v = 0; v < VOCAB; v++) probs_cpu[v*SEQ+t] /= sum;
      }
      uint64_t t1 = mach_absolute_time();
      printf("CPU softmax: %.2f ms\n", tb_ms(t1-t0));

      printf("Compiling softmax kernel...\n");
      int sm_bytes = VOCAB * SEQ * 2;  // fp16 in and out are the same size
      Kern *sm = compile_kern_mil_w(gen_softmax_vocab(), @{}, sm_bytes, sm_bytes);

      if (!sm) {
        printf("FAIL: Softmax compile failed\n\n");
      } else {
        printf("Compile OK\n");

        io_write_fp16(sm->ioIn, logits, VOCAB, SEQ);
        t0 = mach_absolute_time();
        ane_eval(sm);
        t1 = mach_absolute_time();
        printf("ANE eval: %.2f ms\n", tb_ms(t1-t0));

        io_read_fp16(sm->ioOut, probs_ane, 0, VOCAB, SEQ);

        // Check: probs should sum to ~1.0 per position
        // (only the first 4 positions are checked, to keep the test fast)
        float max_err = 0;
        for (int t = 0; t < 4; t++) {
          float sum_cpu = 0, sum_ane = 0;
          for (int v = 0; v < VOCAB; v++) {
            sum_cpu += probs_cpu[v*SEQ+t];
            sum_ane += probs_ane[v*SEQ+t];
            float e = fabsf(probs_cpu[v*SEQ+t] - probs_ane[v*SEQ+t]);
            if (e > max_err) max_err = e;
          }
          printf(" pos %d: CPU sum=%.4f ANE sum=%.4f\n", t, sum_cpu, sum_ane);
        }
        printf("Max error (first 4 positions): %.6f %s\n",
            max_err, max_err < 0.01 ? "PASS ✅" : "FAIL ❌");

        int N = 10;
        t0 = mach_absolute_time();
        for (int i = 0; i < N; i++) ane_eval(sm);
        t1 = mach_absolute_time();
        printf("Benchmark: %d evals in %.2f ms (%.2f ms/eval)\n\n", N, tb_ms(t1-t0), tb_ms(t1-t0)/N);
        free_kern(sm);
      }
      free(logits); free(probs_cpu); free(probs_ane);
    }

    // ======== Test 4: Classifier backward ========
    printf("--- Test 4: Classifier Backward (DIM=%d from VOCAB=%d) ---\n", DIM, VOCAB);
    {
      float *dlogits = (float*)malloc((size_t)VOCAB * SEQ * 4);
      float *embed = (float*)malloc((size_t)VOCAB * DIM * 4);
      float *dx_cpu = (float*)malloc(DIM * SEQ * 4);
      float *dx_ane = (float*)malloc(DIM * SEQ * 4);

      srand48(456);
      for (size_t i = 0; i < (size_t)VOCAB * SEQ; i++) dlogits[i] = (float)(drand48() * 2 - 1) * 0.01f;
      for (size_t i = 0; i < (size_t)VOCAB * DIM; i++) embed[i] = (float)(drand48() * 2 - 1) * 0.02f;

      // CPU: dx = embed^T @ dlogits
      uint64_t t0 = mach_absolute_time();
      cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
          DIM, SEQ, VOCAB, 1.0f,
          embed, DIM, dlogits, SEQ, 0.0f, dx_cpu, SEQ);
      uint64_t t1 = mach_absolute_time();
      printf("CPU cblas_sgemm: %.2f ms\n", tb_ms(t1-t0));

      // Build transposed embed blob
      NSData *embed_t_blob = build_blob_t(embed, VOCAB, DIM);

      printf("Compiling classifier backward...\n");
      Kern *clsb = compile_kern_mil_w(gen_classifier_bwd(), (@{
        @"@model_path/weights/embed_t.bin": @{@"offset":@0, @"data":embed_t_blob},
      }), VOCAB*SEQ*2, DIM*SEQ*2);

      if (!clsb) {
        printf("FAIL: Classifier backward compile failed\n\n");
      } else {
        printf("Compile OK\n");

        io_write_fp16(clsb->ioIn, dlogits, VOCAB, SEQ);
        t0 = mach_absolute_time();
        ane_eval(clsb);
        t1 = mach_absolute_time();
        printf("ANE eval: %.2f ms\n", tb_ms(t1-t0));

        io_read_fp16(clsb->ioOut, dx_ane, 0, DIM, SEQ);

        float max_err = 0, sum_err = 0;
        for (int i = 0; i < DIM*SEQ; i++) {
          float e = fabsf(dx_cpu[i] - dx_ane[i]);
          sum_err += e;
          if (e > max_err) max_err = e;
        }
        printf("Max error: %.6f Mean error: %.6f %s\n\n",
            max_err, sum_err/(DIM*SEQ), max_err < 1.0 ? "PASS ✅" : "FAIL ❌");
        free_kern(clsb);
      }
      free(dlogits); free(embed); free(dx_cpu); free(dx_ane);
    }

    printf("=== All tests complete ===\n");
    printf("Total ANE compiles used: %d\n", g_compile_count);
    return 0;
  }
}
diff --git a/training/test_rmsnorm_bwd.m b/training/test_rmsnorm_bwd.m
new file mode 100644
index 0000000..9014e53
--- /dev/null
+++ b/training/test_rmsnorm_bwd.m
@@ -0,0 +1,123 @@
// test_rmsnorm_bwd.m — Test RMSNorm backward ANE kernel vs CPU reference
// Build: xcrun clang -O2 -framework Foundation -framework IOSurface \
//        -framework CoreML -framework Accelerate -ldl -lobjc \
//        -o test_rmsnorm_bwd test_rmsnorm_bwd.m
#include "ane_rmsnorm_bwd.h"
#include "stories_cpu_ops.h"

int main(void) {
  @autoreleasepool {
    setbuf(stdout, NULL);
    ane_init();
    mach_timebase_info(&g_tb);

    printf("=== Test: RMSNorm Backward on ANE ===\n");
    printf("DIM=%d SEQ=%d\n\n", DIM, SEQ);

    // Allocate test data
    float *x = (float*)malloc(DIM * SEQ * 4);
    float *dy = (float*)malloc(DIM * SEQ * 4);
    float *w = (float*)malloc(DIM * 4);
    float *dx_cpu = (float*)calloc(DIM * SEQ, 4);
    float *dw_cpu = (float*)calloc(DIM, 4);
    float *dx_ane = (float*)malloc(DIM * SEQ * 4);

    // Random init (channel-first [DIM, SEQ])
    srand48(42);
    for (int i = 0; i < DIM * SEQ; i++) {
      x[i] = (float)(drand48() * 2 - 1) * 0.5f;
      dy[i] = (float)(drand48() * 2 - 1) * 0.1f;
    }
    for (int i = 0; i < DIM; i++) {
      w[i] = (float)(drand48() * 0.5 + 0.75); // close to 1.0
    }

    // === CPU Reference ===
    uint64_t t0 = 
mach_absolute_time();
    rmsnorm_bwd(dx_cpu, dw_cpu, dy, x, w, DIM, SEQ);
    uint64_t t1 = mach_absolute_time();
    printf("CPU rmsnorm_bwd: %.2f ms\n", tb_ms(t1 - t0));

    // === ANE Kernel ===
    printf("Compiling ANE rmsnorm_bwd kernel...\n");
    NSString *mil = gen_rmsnorm_bwd();

    // Build weight blob for RMSNorm weights
    NSData *rms_blob = build_blob(w, 1, DIM);

    int in_bytes = 2 * DIM * SEQ * 2; // concat(dy, x) in fp16
    int out_bytes = DIM * SEQ * 2; // dx in fp16

    Kern *kern = compile_kern_mil_w(mil, (@{
      @"@model_path/weights/rms_w.bin": @{@"offset":@0, @"data":rms_blob},
    }), in_bytes, out_bytes);

    if (!kern) {
      printf("FAIL: ANE kernel compilation failed!\n");
      return 1;
    }
    printf("Compile OK (compiles=%d)\n", g_compile_count);

    // Write input: concat(dy, x) into ioIn
    // dy goes at channel offset 0, x goes at channel offset DIM
    io_write_fp16_at(kern->ioIn, 0, dy, DIM, SEQ);
    io_write_fp16_at(kern->ioIn, DIM, x, DIM, SEQ);

    // Evaluate
    t0 = mach_absolute_time();
    ane_eval(kern);
    t1 = mach_absolute_time();
    printf("ANE eval: %.3f ms\n", tb_ms(t1 - t0));

    // Read output
    io_read_fp16(kern->ioOut, dx_ane, 0, DIM, SEQ);

    // === Compare ===
    // Track max error and its location for easier debugging on failure.
    float max_err = 0, sum_err = 0;
    int max_i = 0, max_j = 0;
    for (int i = 0; i < DIM; i++) {
      for (int j = 0; j < SEQ; j++) {
        int idx = i * SEQ + j;
        float err = fabsf(dx_cpu[idx] - dx_ane[idx]);
        sum_err += err;
        if (err > max_err) {
          max_err = err;
          max_i = i; max_j = j;
        }
      }
    }
    float mean_err = sum_err / (DIM * SEQ);

    printf("\n=== Results ===\n");
    printf("Max absolute error: %.6f at [%d,%d] (CPU=%.6f ANE=%.6f)\n",
        max_err, max_i, max_j, dx_cpu[max_i*SEQ+max_j], dx_ane[max_i*SEQ+max_j]);
    printf("Mean absolute error: %.6f\n", mean_err);

    // Sample outputs
    printf("\nSample dx values (first 4 channels, first 4 positions):\n");
    printf("%-6s %-12s %-12s %-10s\n", "Idx", "CPU", "ANE", "Error");
    for (int i = 0; i < 4 && i < DIM; i++) {
      for (int j = 0; j < 4 && j < SEQ; j++) {
        int idx = i * SEQ + j;
        printf("[%d,%d] %-12.6f %-12.6f %-10.6f\n",
            i, j, dx_cpu[idx], dx_ane[idx], fabsf(dx_cpu[idx] - dx_ane[idx]));
      }
    }

    // Benchmark: multiple evals
    int N = 100;
    t0 = mach_absolute_time();
    for (int i = 0; i < N; i++) ane_eval(kern);
    t1 = mach_absolute_time();
    printf("\nBenchmark: %d evals in %.2f ms (%.3f ms/eval)\n",
        N, tb_ms(t1-t0), tb_ms(t1-t0)/N);

    // Pass/fail — exit status reflects the result for CI use.
    bool pass = max_err < 0.05f && mean_err < 0.01f;
    printf("\n%s (threshold: max<0.05, mean<0.01)\n", pass ? "PASS ✅" : "FAIL ❌");

    free_kern(kern);
    free(x); free(dy); free(w); free(dx_cpu); free(dw_cpu); free(dx_ane);
    return pass ? 0 : 1;
  }
}
diff --git a/training/train_large_ane.m b/training/train_large_ane.m
new file mode 100644
index 0000000..d7a99ef
--- /dev/null
+++ b/training/train_large_ane.m
@@ -0,0 +1,695 @@
// train_large_ane.m — Stories110M training with CPU ops offloaded to ANE
// Based on train_large.m but moves these operations from CPU to ANE:
// 1. Final RMSNorm (was CPU vDSP) → ANE kernel
// 2. Classifier forward embed@x (was CPU cblas) → ANE 32000-ch conv
// 3. Cross-entropy softmax (was CPU vDSP) → ANE softmax kernel
// 4. 
RMSNorm backward (was CPU vDSP) → ANE kernel
// Still on CPU: dW gradients (parallel via GCD), Adam optimizer (needs weight mutation),
//   classifier backward (ANE matmul slower than cblas for this shape),
//   NLL loss + gradient (needs target indexing)
//
// Build: make train_large_ane
// Run: ./train_large_ane [--resume] [--steps N] [--lr F]
#include "stories_io.h"
#include "stories_mil.h"
#include "stories_cpu_ops.h"
#include "ane_rmsnorm_bwd.h"
#include "ane_classifier.h"

#define CKPT_PATH "ane_stories110M_ckpt.bin"
#define MODEL_PATH "../../assets/models/stories110M.bin"
#define DATA_PATH "tinystories_data00.bin"

// ===== Weight loading from llama2.c format =====
// Reads the llama2.c binary layout: config header, then each weight
// class grouped across all layers. Returns false on open/config mismatch.
// NOTE(review): fread return values are unchecked throughout — a short
// or corrupt file silently yields partial weights. Worth hardening.
static bool load_pretrained(LayerWeights *lw, float *rms_final, float *embed, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) { printf("Cannot open %s\n", path); return false; }
    Llama2Config cfg;
    fread(&cfg, sizeof(cfg), 1, f);
    printf(" Model config: dim=%d hidden=%d layers=%d heads=%d vocab=%d seq=%d\n",
        cfg.dim, cfg.hidden_dim, cfg.n_layers, cfg.n_heads, abs(cfg.vocab_size), cfg.seq_len);
    if (cfg.dim != DIM || cfg.hidden_dim != HIDDEN || cfg.n_layers != NLAYERS) {
        printf(" ERROR: Config mismatch!\n"); fclose(f); return false;
    }
    // Negative vocab_size in llama2.c format signals a separate (unshared)
    // classifier matrix; positive means embeddings are shared with the classifier.
    int V = abs(cfg.vocab_size);
    bool shared = cfg.vocab_size > 0;
    fread(embed, 4, V * DIM, f);
    for (int L = 0; L < NLAYERS; L++) fread(lw[L].rms_att, 4, DIM, f);
    for (int L = 0; L < NLAYERS; L++) fread(lw[L].Wq, 4, WQ_SZ, f);
    for (int L = 0; L < NLAYERS; L++) fread(lw[L].Wk, 4, WQ_SZ, f);
    for (int L = 0; L < NLAYERS; L++) fread(lw[L].Wv, 4, WQ_SZ, f);
    for (int L = 0; L < NLAYERS; L++) fread(lw[L].Wo, 4, WO_SZ, f);
    for (int L = 0; L < NLAYERS; L++) fread(lw[L].rms_ffn, 4, DIM, f);
    for (int L = 0; L < NLAYERS; L++) fread(lw[L].W1, 4, W1_SZ, f);
    for (int L = 0; L < NLAYERS; L++) fread(lw[L].W2, 4, W2_SZ, f);
    for (int L = 0; L < NLAYERS; L++) fread(lw[L].W3, 4, W3_SZ, f);
    fread(rms_final, 4, DIM, f);
    fclose(f);
    printf(" Loaded pretrained weights (%s)\n", shared ? "shared embed/cls" : "separate cls");
    return true;
}

// ===== Compile one layer's kernels =====
// Bakes the layer's weights (and transposes, for backward) into five ANE
// kernels. Returns true only if every compile succeeded.
static bool compile_layer_kernels(LayerKernels *lk, LayerWeights *w) {
    lk->fwdAttn = compile_kern_mil_w(gen_sdpa_fwd_taps(), (@{
        @"@model_path/weights/rms1.bin": @{@"offset":@0, @"data":build_blob(w->rms_att,1,DIM)},
        @"@model_path/weights/wq.bin": @{@"offset":@0, @"data":build_blob(w->Wq,DIM,DIM)},
        @"@model_path/weights/wk.bin": @{@"offset":@0, @"data":build_blob(w->Wk,DIM,DIM)},
        @"@model_path/weights/wv.bin": @{@"offset":@0, @"data":build_blob(w->Wv,DIM,DIM)},
        @"@model_path/weights/wo.bin": @{@"offset":@0, @"data":build_blob(w->Wo,DIM,DIM)},
        @"@model_path/weights/mask.bin": @{@"offset":@0, @"data":get_mask_blob()},
    }), DIM*SEQ*2, 6*DIM*SEQ*2);
    lk->fwdFFN = compile_kern_mil_w(gen_ffn_fwd_taps(), (@{
        @"@model_path/weights/rms2.bin": @{@"offset":@0, @"data":build_blob(w->rms_ffn,1,DIM)},
        @"@model_path/weights/w1.bin": @{@"offset":@0, @"data":build_blob(w->W1,HIDDEN,DIM)},
        @"@model_path/weights/w3.bin": @{@"offset":@0, @"data":build_blob(w->W3,HIDDEN,DIM)},
        @"@model_path/weights/w2.bin": @{@"offset":@0, @"data":build_blob(w->W2,DIM,HIDDEN)},
    }), DIM*SEQ*2, (2*DIM+3*HIDDEN)*SEQ*2);
    lk->ffnBwd = compile_kern_mil_w(gen_ffn_bwd(), (@{
        @"@model_path/weights/w2t.bin": @{@"offset":@0, @"data":build_blob_t(w->W2,DIM,HIDDEN)},
        @"@model_path/weights/w1t.bin": @{@"offset":@0, @"data":build_blob_t(w->W1,HIDDEN,DIM)},
        @"@model_path/weights/w3t.bin": @{@"offset":@0, @"data":build_blob_t(w->W3,HIDDEN,DIM)},
    }), (DIM+2*HIDDEN)*SEQ*2, (DIM+2*HIDDEN)*SEQ*2);
    lk->sdpaBwd1 = compile_kern_mil_w(gen_sdpa_bwd1(), (@{
        @"@model_path/weights/mask.bin": @{@"offset":@0, @"data":get_mask_blob()},
        @"@model_path/weights/wot.bin": @{@"offset":@0, @"data":build_blob_t(w->Wo,DIM,DIM)},
    }), 4*DIM*SEQ*2, (DIM+2*SCORE_CH)*SEQ*2);
    lk->qkvBwd = compile_kern_mil_w(gen_qkvb(), (@{
        @"@model_path/weights/wqt.bin": @{@"offset":@0, @"data":build_blob_t(w->Wq,DIM,DIM)},
        @"@model_path/weights/wkt.bin": @{@"offset":@0, @"data":build_blob_t(w->Wk,DIM,DIM)},
        @"@model_path/weights/wvt.bin": @{@"offset":@0, @"data":build_blob_t(w->Wv,DIM,DIM)},
    }), 3*DIM*SEQ*2, DIM*SEQ*2);
    return lk->fwdAttn && lk->fwdFFN && lk->ffnBwd && lk->sdpaBwd1 && lk->qkvBwd;
}

// SDPA backward stage 2 — no baked weights.
static Kern *compile_sdpa_bwd2(void) {
    return compile_kern_mil_w(gen_sdpa_bwd2(), @{},
        (2*SCORE_CH+2*DIM)*SEQ*2, 2*DIM*SEQ*2);
}

// NEW: Compile RMSNorm backward kernels (one per layer pair: attn + ffn)
static Kern *compile_rmsnorm_bwd_kern(const float *rms_w) {
    return compile_kern_mil_w(gen_rmsnorm_bwd(), (@{
        @"@model_path/weights/rms_w.bin": @{@"offset":@0, @"data":build_blob(rms_w, 1, DIM)},
    }), 2*DIM*SEQ*2, DIM*SEQ*2);
}

// NEW: Compile classifier forward kernel
static Kern *compile_classifier_fwd(const float *embed) {
    return compile_kern_mil_w(gen_classifier_fwd(), (@{
        @"@model_path/weights/embed.bin": @{@"offset":@0, @"data":build_blob(embed, VOCAB, DIM)},
    }), DIM*SEQ*2, VOCAB*SEQ*2);
}

// NEW: Compile final RMSNorm kernel
static Kern *compile_final_rmsnorm_kern(const float *rms_w) {
    return compile_kern_mil_w(gen_final_rmsnorm(), (@{
        @"@model_path/weights/rms_w.bin": @{@"offset":@0, @"data":build_blob(rms_w, 1, DIM)},
    }), DIM*SEQ*2, DIM*SEQ*2);
}

// NEW: Compile softmax kernel (no weights)
static Kern *compile_softmax_kern(void) {
    return compile_kern_mil_w(gen_softmax_vocab(), @{}, VOCAB*SEQ*2, VOCAB*SEQ*2);
}

// Free and NULL a layer's five kernels (safe to call on partial compiles;
// free_kern is assumed NULL-tolerant — the NULLing guards double-free).
static void free_layer_kernels(LayerKernels *lk) {
    free_kern(lk->fwdAttn); free_kern(lk->fwdFFN); free_kern(lk->ffnBwd);
    free_kern(lk->sdpaBwd1); free_kern(lk->qkvBwd);
    lk->fwdAttn = lk->fwdFFN = lk->ffnBwd = lk->sdpaBwd1 = lk->qkvBwd = NULL;
}

// ===== Checkpoint save/load (same as train_large.m) =====
// Serializes weights + Adam moments + cumulative timing stats behind a
// versioned header so --resume can restore optimizer state exactly.
// NOTE(review): fopen result is not NULL-checked and fwrite returns are
// ignored — a full disk or bad path would crash or truncate silently.
static void save_checkpoint(const char *path, int step, int total_steps, float lr, float loss,
        double cc, double ct, double cw, int cs, int cb, int adam_t,
        LayerWeights *lw, LayerAdam *la, float *rms_final, AdamState *arms_final,
        float *embed, AdamState *aembed) {
    FILE *f = fopen(path, "wb");
    CkptHdr h = {0};
    h.magic = 0x424C5A54; h.version = 2;
    h.step = step; h.total_steps = total_steps;
    h.n_layers = NLAYERS; h.vocab_size = VOCAB; h.dim = DIM;
    h.hidden_dim = HIDDEN; h.n_heads = HEADS; h.seq_len = SEQ;
    h.lr = lr; h.loss = loss;
    h.cum_compile = cc; h.cum_train = ct; h.cum_wall = cw;
    h.cum_steps = cs; h.cum_batches = cb; h.adam_t = adam_t;
    fwrite(&h, sizeof(h), 1, f);
    for (int L = 0; L < NLAYERS; L++) {
        // Per-layer weights, then Adam first/second moments in the same order.
        fwrite(lw[L].Wq,4,WQ_SZ,f); fwrite(lw[L].Wk,4,WQ_SZ,f);
        fwrite(lw[L].Wv,4,WQ_SZ,f); fwrite(lw[L].Wo,4,WO_SZ,f);
        fwrite(lw[L].W1,4,W1_SZ,f); fwrite(lw[L].W2,4,W2_SZ,f); fwrite(lw[L].W3,4,W3_SZ,f);
        fwrite(lw[L].rms_att,4,DIM,f); fwrite(lw[L].rms_ffn,4,DIM,f);
        fwrite(la[L].Wq.m,4,WQ_SZ,f); fwrite(la[L].Wq.v,4,WQ_SZ,f);
        fwrite(la[L].Wk.m,4,WQ_SZ,f); fwrite(la[L].Wk.v,4,WQ_SZ,f);
        fwrite(la[L].Wv.m,4,WQ_SZ,f); fwrite(la[L].Wv.v,4,WQ_SZ,f);
        fwrite(la[L].Wo.m,4,WO_SZ,f); fwrite(la[L].Wo.v,4,WO_SZ,f);
        fwrite(la[L].W1.m,4,W1_SZ,f); fwrite(la[L].W1.v,4,W1_SZ,f);
        fwrite(la[L].W2.m,4,W2_SZ,f); fwrite(la[L].W2.v,4,W2_SZ,f);
        fwrite(la[L].W3.m,4,W3_SZ,f); fwrite(la[L].W3.v,4,W3_SZ,f);
        fwrite(la[L].rms_att.m,4,DIM,f); fwrite(la[L].rms_att.v,4,DIM,f);
        fwrite(la[L].rms_ffn.m,4,DIM,f); fwrite(la[L].rms_ffn.v,4,DIM,f);
    }
    fwrite(rms_final,4,DIM,f);
    fwrite(arms_final->m,4,DIM,f); fwrite(arms_final->v,4,DIM,f);
    fwrite(embed,4,VOCAB*DIM,f);
    fwrite(aembed->m,4,VOCAB*DIM,f); fwrite(aembed->v,4,VOCAB*DIM,f);
    fclose(f);
}

// Mirror of save_checkpoint: validates magic/version, then reads fields
// in the identical order. Returns false if the file is absent or not a
// v2 checkpoint. NOTE(review): fread returns unchecked, same caveat.
static bool load_checkpoint(const char *path, int *step, int *total_steps, float *lr, float *loss,
        double *cc, double *ct, double *cw, int *cs, int *cb, int *adam_t,
        LayerWeights *lw, LayerAdam *la, float *rms_final, AdamState *arms_final,
        float *embed, AdamState *aembed) {
    FILE *f = fopen(path, "rb");
    if (!f) return false;
    CkptHdr h;
    fread(&h, sizeof(h), 1, f);
    if (h.magic != 0x424C5A54 || h.version != 2) { fclose(f); return false; }
    *step = h.step; *total_steps = h.total_steps; *lr = h.lr; *loss = h.loss;
    *cc = h.cum_compile; *ct = h.cum_train; *cw = h.cum_wall;
    *cs = h.cum_steps; *cb = h.cum_batches; *adam_t = h.adam_t;
    for (int L = 0; L < NLAYERS; L++) {
        fread(lw[L].Wq,4,WQ_SZ,f); fread(lw[L].Wk,4,WQ_SZ,f);
        fread(lw[L].Wv,4,WQ_SZ,f); fread(lw[L].Wo,4,WO_SZ,f);
        fread(lw[L].W1,4,W1_SZ,f); fread(lw[L].W2,4,W2_SZ,f); fread(lw[L].W3,4,W3_SZ,f);
        fread(lw[L].rms_att,4,DIM,f); fread(lw[L].rms_ffn,4,DIM,f);
        fread(la[L].Wq.m,4,WQ_SZ,f); fread(la[L].Wq.v,4,WQ_SZ,f);
        fread(la[L].Wk.m,4,WQ_SZ,f); fread(la[L].Wk.v,4,WQ_SZ,f);
        fread(la[L].Wv.m,4,WQ_SZ,f); fread(la[L].Wv.v,4,WQ_SZ,f);
        fread(la[L].Wo.m,4,WO_SZ,f); fread(la[L].Wo.v,4,WO_SZ,f);
        fread(la[L].W1.m,4,W1_SZ,f); fread(la[L].W1.v,4,W1_SZ,f);
        fread(la[L].W2.m,4,W2_SZ,f); fread(la[L].W2.v,4,W2_SZ,f);
        fread(la[L].W3.m,4,W3_SZ,f); fread(la[L].W3.v,4,W3_SZ,f);
        fread(la[L].rms_att.m,4,DIM,f); fread(la[L].rms_att.v,4,DIM,f);
        fread(la[L].rms_ffn.m,4,DIM,f); fread(la[L].rms_ffn.v,4,DIM,f);
    }
    fread(rms_final,4,DIM,f);
    fread(arms_final->m,4,DIM,f); fread(arms_final->v,4,DIM,f);
    fread(embed,4,VOCAB*DIM,f);
    fread(aembed->m,4,VOCAB*DIM,f); fread(aembed->v,4,VOCAB*DIM,f);
    fclose(f);
    return true;
}

// ===== Main =====
// (Definition continues past the end of this view — left verbatim.)
int main(int argc, char *argv[]) {
    @autoreleasepool {
        setbuf(stdout, NULL);
        ane_init();
        mach_timebase_info(&g_tb);

        int total_steps = 10000;
        float lr = 3e-4f;
        float adam_b1=0.9f, adam_b2=0.999f, adam_eps=1e-8f;
        int adam_t = 0, start_step = 0;
        bool do_resume = false;
        for (int i=1; i