diff --git a/README.md b/README.md index d2c7bb2..ed2362d 100644 --- a/README.md +++ b/README.md @@ -2,15 +2,67 @@ Training neural networks directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. No CoreML training APIs, no Metal, no GPU — pure ANE compute. +## Project Scope & Intent + +I'm genuinely grateful for all the attention this project has received — I never expected a weekend research hack to blow up like this. Thank you to everyone who starred, forked, ran benchmarks on their own hardware, and shared the work. It means a lot. + +That said, I want to set clear expectations about what this project is and isn't. + +This is a **research project**, not a production framework. + +The goal was to demonstrate that **training on the Apple Neural Engine — and potentially other NPUs — is possible**, and that the barrier has always been software support, not hardware capability. The ANE is a remarkably capable piece of silicon that Apple restricts to inference-only use through CoreML. This project bypasses that restriction using reverse-engineered private APIs to show what's possible when you give the hardware a chance. + +### What This Project Is + +- A proof of concept for ANE training via `_ANEClient` and `_ANECompiler` private APIs +- A set of benchmarks documenting real ANE performance characteristics (throughput, power, SRAM behavior) +- A reference for anyone exploring direct ANE access outside CoreML +- Research code that I update when I find something interesting + +### What This Project Is Not + +- A maintained framework or library +- A replacement for CoreML, MLX, llama.cpp, or any production inference stack +- A path to training large models on consumer hardware (yet) + +### On The Hype + +Some coverage of this project has overstated its implications. 
To be clear:
+
+- Training works, but utilization is low (~5-9% of peak) with significant engineering challenges remaining
+- Many element-wise operations still fall back to CPU
+- This does **not** replace GPU training for anything beyond small research models today
+
+The honest results — including all limitations — are documented in the accompanying articles:
+- [Part 1: Reverse Engineering](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine)
+- [Part 2: Benchmarks](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615)
+
+### On Maintenance
+
+I don't intend to grow this into a large community project. My focus is on original research (compiler infrastructure for edge AI optimization), and maintaining an open-source framework takes time away from that.
+
+That said:
+- I'll keep pushing updates when I discover something interesting
+- Bug fixes and benchmark contributions (especially on hardware I don't own) are welcome
+- Feature requests will likely go unaddressed — but feel free to fork
+- PRs will be merged at a relatively slow pace; otherwise I become the bottleneck for community growth around this tech
+
+### Fork it, build on it
+
+This is MIT licensed for a reason. Everyone now has access to AI-assisted development tools that can adapt and extend code in hours. If this project is useful to you — take it, modify it, build something better. If you do something cool with it, I'd love to hear about it. If, in the future, the community decides to maintain a single source-of-truth repo, I fully support that.
+
+---
+
 ## What This Is
 
 A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. 
This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware. -**Current results (M4, single transformer layer, dim=768, seq=512):** -- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained) -- 6 ANE kernel dispatches per training step +**Current results — Stories110M (12-layer, dim=768, seq=256, 109M params):** +- Static pipeline: **91 ms/step** (M3 Ultra), **106 ms/step** (M4) +- Dynamic pipeline: **110 ms/step**, no recompilation +- 72 ANE kernels per step (static), 9 shared kernels (dynamic) - All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas) -- Adam optimizer, gradient accumulation, checkpoint/resume +- Adam optimizer, gradient accumulation, checkpoint/resume via exec() restart ## Architecture @@ -59,6 +111,14 @@ Key optimizations: └── Makefile ``` +## Training Data + +Training requires pretokenized TinyStories data. To download: +```bash +cd training && bash download_data.sh +``` +See [training/README.md](training/README.md) for detailed training instructions. + ## Building Requires macOS 15+ on Apple Silicon (tested on M4). @@ -87,8 +147,8 @@ No external dependencies. 
Uses only system frameworks + private ANE APIs resolve - **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE) - **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint -- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling -- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP +- **Compile overhead** — Static pipeline recompiles 60+ kernels every 10 steps (~3.7s); dynamic pipeline avoids this +- **Low utilization** — Training sustains ~1-2 TFLOPS out of 15.8+ peak due to CPU fallbacks and I/O overhead ## Performance History @@ -104,8 +164,13 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve ## Disclaimer -This project is independent research into Apple Neural Engine architecture. It uses undocumented APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk. +This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk. 
## License MIT — see [LICENSE](LICENSE) + +--- + +*Built by a human + Claude, one weekend at a time.* + diff --git a/benchmarks/ANE_BENCHMARK_REPORT.md b/benchmarks/ANE_BENCHMARK_REPORT.md new file mode 100644 index 0000000..b7095a0 --- /dev/null +++ b/benchmarks/ANE_BENCHMARK_REPORT.md @@ -0,0 +1,163 @@ +# Apple Neural Engine — Cross-Generation Benchmark Report + +Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3). + +## Model Configuration + +All training benchmarks use **Stories110M** — a Llama2-architecture transformer: + +``` +Parameter Value +──────────────────────── +Architecture Llama2 (RoPE, SwiGLU, RMSNorm, GQA-ready) +Layers 12 +Dimension 768 +Hidden (FFN) 2048 +Heads 12 +Vocab 32000 (Llama 2 BPE) +Sequence 256 +Total Params 109.53M (84.95M transformer + 24.58M embedding) +Training Data TinyStories (~20M tokens, pretokenized) +Optimizer Adam (lr=1e-4 to 3e-4, b1=0.9, b2=0.999) +Precision FP16 on ANE, FP32 on CPU +``` + +Kernels per step (static pipeline): 72 (60 weight-bearing + 12 static sdpaBwd2). +Forward: sdpaFwd + ffnW13 + ffnW2 per layer. Backward: ffnBwdW2t + ffnBwdW13t + wotBwd + sdpaBwd1 + sdpaBwd2 + qkvBwd per layer. Weight gradients (dW) via `cblas_sgemm` on CPU. + +## Training Performance (Static Pipeline) + +``` +Chip ms/step ANE ms Compile/10 ANE TFLOPS Util% Contributor +───────────────────────────────────────────────────────────────────────────────── +M1 Pro 148-163 32-35 7.9-8.5s 0.57-0.63 3.6-4.0 @moriwang +M1 Max 143-167 35-45 ~7.1s 0.54-0.65 3.4-4.1 @andyg5000 +M3 Ultra* 91 ~10 ~3.7s 0.88 5.6 (repo ref) +M4 Pro 69-73 8.9 ~3.5s 1.28 8.1 @srt54558 +M4 Max 64 10.2 ~3.5s 1.45 9.2 @SethBurkart123 +M5 101-120 9.1-9.8 3.2-3.4s 0.77-0.91 4.9-5.8 @GitBubble +``` + +*M3 Ultra = reference platform this project was developed on. 
+ +## Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64) + +``` +Chip NE Cores FP16 TFLOPS (measured) Rated TOPS (Apple spec*) +──────────────────────────────────────────────────────────────────────────── +M1 Pro 16 FAIL 11 (MIL compat issue) +M1 Max 16 FAIL 11 (MIL compat issue) +M3 Pro 16 9.98 15.8 +M3 Ultra 32 - 31.6 (ref platform) +M4 Pro 16 12.57 38 +M4 Max 16 10.93 38 +M5 16 12.17 not disclosed +M5 (other) 16 12.44 not disclosed +``` + +*Apple's "Rated TOPS" changed methodology across generations — M1/M3 report FP16, +M4 reports INT8/mixed-precision peak. The numbers are not directly comparable across +generations. Use the measured FP16 TFLOPS column for apples-to-apples comparison. +All chips have 16 NE cores except Ultra variants (32 cores, two dies via UltraFusion). +Max variants share the same 16-core NE as Pro — the M4 Max vs M4 Pro TFLOPS difference +is run-to-run variance, not hardware.* + +## Comparative Chart + +``` +ANE Training Speed (ms/step, lower is better) +══════════════════════════════════════════════════════════════ + +M1 Pro ████████████████████████████████████████░░░░ 148-163 ms +M1 Max ██████████████████████████████████████░░░░░░ 143-167 ms +M3 Ultra ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 91 ms +M4 Pro ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 69-73 ms +M4 Max ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 64 ms +M5 ████████████████████████░░░░░░░░░░░░░░░░░░░░ 101-120 ms + + 0 50 100 150 200 + + +Peak ANE Throughput (TFLOPS, higher is better) +══════════════════════════════════════════════════════════════ + +M1 Pro FAIL (MIL compat) +M1 Max FAIL (MIL compat) +M3 Pro ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░ 9.98 +M4 Pro ████████████████████████████████░░░░░░░░░░░░░ 12.57 +M4 Max ██████████████████████░░░░░░░░░░░░░░░░░░░░░░ 10.93 +M5 █████████████████████████░░░░░░░░░░░░░░░░░░░ 12.17 + + 0 3 6 9 12 15 18 + + +ANE Sustained Throughput (TFLOPS, 5s window) +══════════════════════════════════════════════════════════════ + +M3 Pro 
██████████████████████████████████████████████ 15.04 (95.2%) + + 0 3 6 9 12 15 18 + (Only M3 Pro submitted sustained benchmark) +``` + +## Key Findings + +### M1/M1 Pro/M1 Max +- **Standalone benchmarks fail** — `ane_mil_gen.h` single-blob weight format rejected +- **Training works** via `stories_mil.h` (separate per-matrix weight blobs) +- ANE compiler handles weight blobs differently from M4+ +- Training at 148-167 ms/step, ~0.6 TFLOPS + +### M3 Pro +- **Only ch=512 compiles** — 52 channel values tested (1-4096), only 512 accepted +- Fixed 512-wide lane structure in SRAM tiling +- **Peak: 16.77 TFLOPS** (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048 +- **Sustained: 15.04 TFLOPS** over 5 seconds (95.2% utilization) +- Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement) + +### M4 Pro / M4 Max +- Flexible channel support (256/384/512/768+) +- M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step +- M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall) +- `sram_probe` and `inmem_bench` fail on M4 Pro (same MIL compat issue) + +### M5 +- Training works out of the box with existing `program(1.3)` MIL +- Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra) +- Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro) +- ANE appears to be same H16 family as M4 +- **M5 Pro/Max not yet benchmarked** — Fusion Architecture may change ANE behavior + +### Cross-Generation MIL Compatibility + +``` +Feature M1 M3 M4 M5 +───────────────────────────────────────────────────────── +program(1.3) / ios18 PARTIAL YES YES YES +Single-blob weights FAIL YES YES YES +Per-matrix weight blobs YES YES YES YES +Channel flexibility ? 
ch=512 FLEX FLEX +BLOBFILE offset refs FAIL YES YES YES +``` + +## macOS Compatibility Issues + +- **macOS 26.x** — `[MLModel compileModelAtURL:]` broken for standalone benchmarks + (fixed in PR #27: switched to in-memory MIL compilation) +- **macOS 15.x** — Works for all M-series with correct MIL format +- M1 generation requires `stories_mil.h` path, not `ane_mil_gen.h` + +## How to Contribute + +Run on your hardware and post results to [Issue #3](https://github.com/maderix/ANE/issues/3): + +```bash +cd training && make train_large +./train_large ane_stories110M_ckpt.bin 256 20 1e-4 +``` + +Include: chip model, macOS version, full output with JSON lines. + +--- +*Report compiled 2026-03-04 from community submissions.* +*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton* diff --git a/benchmarks/community_results.json b/benchmarks/community_results.json new file mode 100644 index 0000000..e975925 --- /dev/null +++ b/benchmarks/community_results.json @@ -0,0 +1,113 @@ +{ + "report_date": "2026-03-04", + "source": "https://github.com/maderix/ANE/issues/3", + "model": "Stories110M (12-layer transformer, 109M params)", + "config": {"dim": 768, "hidden": 2048, "heads": 12, "seq": 256, "vocab": 32000, "layers": 12}, + "training_results": [ + { + "chip": "M1 Pro", + "cores": "10-core CPU", + "ram_gb": 32, + "macos": "15.0", + "ms_per_step": [148, 163], + "ane_ms": [32, 35], + "compile_ms": [7900, 8500], + "ane_tflops": [0.57, 0.63], + "ane_util_pct": [3.6, 4.0], + "benchmarks_pass": false, + "notes": "Standalone benchmarks fail (MIL compat). 
Training works via stories_mil.h.", + "contributor": "moriwang" + }, + { + "chip": "M1 Max", + "cores": "10-core CPU", + "ram_gb": 64, + "macos": "15.6.1", + "ms_per_step": [143, 167], + "ane_ms": [35, 45], + "compile_ms": [7100, 7100], + "ane_tflops": [0.54, 0.65], + "ane_util_pct": [3.4, 4.1], + "benchmarks_pass": false, + "notes": "Same MIL compat issue as M1 Pro.", + "contributor": "andyg5000" + }, + { + "chip": "M3 Pro", + "cores": "12-core CPU", + "ram_gb": 36, + "macos": "15.7.4", + "peak_tflops": 16.77, + "sustained_tflops": 15.04, + "sustained_util_pct": 95.2, + "channel_constraint": "ch=512 only", + "notes": "Only ch=512 compiles. 52 values tested. Peak at 128x conv 512ch sp2048.", + "contributor": "D-Ogi" + }, + { + "chip": "M4 Pro", + "cores": "unknown", + "ram_gb": null, + "macos": null, + "ms_per_step": [69, 73], + "ane_ms": [8.9, 8.9], + "compile_ms": [3465, 3465], + "ane_tflops": [1.28, 1.28], + "ane_util_pct": [8.1, 8.1], + "peak_tflops_inmem": 12.57, + "notes": "sram_probe and inmem_bench fail. inmem_peak and training work.", + "contributor": "srt54558" + }, + { + "chip": "M4 Max", + "cores": "unknown", + "ram_gb": null, + "macos": null, + "ms_per_step": [64, 64], + "ane_ms": [10.2, 10.2], + "compile_ms": [3531, 3531], + "ane_tflops": [1.45, 1.45], + "ane_util_pct": [9.2, 9.2], + "peak_tflops_inmem": 10.93, + "notes": "Fastest training ms/step overall.", + "contributor": "SethBurkart123" + }, + { + "chip": "M5", + "cores": "10-core (4P+6E)", + "ram_gb": 16, + "macos": "26.3", + "ms_per_step": [101, 120], + "ane_ms": [9.1, 9.8], + "compile_ms": [3200, 3400], + "ane_tflops": [0.77, 0.91], + "ane_util_pct": [4.9, 5.8], + "peak_tflops_inmem": 12.44, + "notes": "H16 ANE family (same as M4). 
Training works with existing program(1.3) MIL.", + "contributor": "GitBubble" + }, + { + "chip": "M5", + "cores": "unknown", + "ram_gb": 32, + "macos": "26.4", + "peak_tflops_inmem": 12.17, + "notes": "inmem_peak only, no training data submitted.", + "contributor": "elijah-pelton" + } + ], + "neural_engine_specs": { + "M1": {"ne_cores": 16, "rated_tops": 11}, + "M1_Max": {"ne_cores": 16, "rated_tops": 11}, + "M1_Ultra": {"ne_cores": 32, "rated_tops": 22}, + "M2": {"ne_cores": 16, "rated_tops": 15.8}, + "M2_Max": {"ne_cores": 16, "rated_tops": 15.8}, + "M2_Ultra": {"ne_cores": 32, "rated_tops": 31.6}, + "M3": {"ne_cores": 16, "rated_tops": 15.8}, + "M3_Max": {"ne_cores": 16, "rated_tops": 15.8}, + "M3_Ultra": {"ne_cores": 32, "rated_tops": 31.6}, + "M4": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"}, + "M4_Max": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"}, + "M5": {"ne_cores": 16, "rated_tops": null, "estimated_tops": 19} + } +} diff --git a/bridge/Makefile b/bridge/Makefile new file mode 100644 index 0000000..753d749 --- /dev/null +++ b/bridge/Makefile @@ -0,0 +1,17 @@ +CC = xcrun clang +CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc -fPIC +FRAMEWORKS = -framework Foundation -framework IOSurface -ldl +TARGET = libane_bridge.dylib + +all: $(TARGET) + +$(TARGET): ane_bridge.m ane_bridge.h + $(CC) $(CFLAGS) -dynamiclib -o $@ ane_bridge.m $(FRAMEWORKS) + +test: test_bridge.m ane_bridge.h $(TARGET) + $(CC) $(CFLAGS) -o test_bridge test_bridge.m -L. 
-lane_bridge $(FRAMEWORKS)
+
+clean:
+	rm -f $(TARGET) test_bridge
+
+.PHONY: all clean test
diff --git a/bridge/ane_bridge.h b/bridge/ane_bridge.h
new file mode 100644
index 0000000..3e8ff47
--- /dev/null
+++ b/bridge/ane_bridge.h
@@ -0,0 +1,87 @@
+// ane_bridge.h — C-callable bridge to ANE private APIs for Python ctypes
+// Wraps _ANEInMemoryModel via private AppleNeuralEngine.framework
+
+#ifndef ANE_BRIDGE_H
+#define ANE_BRIDGE_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// Opaque kernel handle
+typedef struct ANEKernelHandle ANEKernelHandle;
+
+// Initialize ANE runtime (load private framework, resolve classes)
+// Returns 0 on success, -1 on failure
+int ane_bridge_init(void);
+
+// Compile a MIL program with weight blobs into an ANE kernel
+// mil_text: UTF-8 MIL program text
+// mil_len: length of MIL text
+// weight_data: raw weight blob (can be NULL)
+// weight_len: length of weight blob
+// n_inputs: number of input tensors
+// input_sizes: array of byte sizes for each input
+// n_outputs: number of output tensors
+// output_sizes: array of byte sizes for each output
+// Returns kernel handle or NULL on failure
+ANEKernelHandle *ane_bridge_compile(const char *mil_text, size_t mil_len,
+                                    const uint8_t *weight_data, size_t weight_len,
+                                    int n_inputs, const size_t *input_sizes,
+                                    int n_outputs, const size_t *output_sizes);
+
+// Compile with multiple named weight files (for transformer kernels)
+// weight_names: array of weight file paths (e.g. 
"@model_path/weights/wq.bin") +// weight_datas: array of weight data pointers +// weight_lens: array of weight data lengths +// n_weights: number of weight files +ANEKernelHandle *ane_bridge_compile_multi_weights( + const char *mil_text, size_t mil_len, + const char **weight_names, const uint8_t **weight_datas, + const size_t *weight_lens, int n_weights, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes); + +// Evaluate (run) a compiled kernel on ANE +// Returns true on success +bool ane_bridge_eval(ANEKernelHandle *kernel); + +// Write data to kernel input tensor +void ane_bridge_write_input(ANEKernelHandle *kernel, int idx, + const void *data, size_t bytes); + +// Read data from kernel output tensor +void ane_bridge_read_output(ANEKernelHandle *kernel, int idx, + void *data, size_t bytes); + +// Free a compiled kernel and all associated resources +void ane_bridge_free(ANEKernelHandle *kernel); + +// Get compile count (for exec() restart budgeting) +int ane_bridge_get_compile_count(void); + +// Reset compile count +void ane_bridge_reset_compile_count(void); + +// Build a weight blob in ANE format (128-byte header + fp16 data) +// src: float32 weights [rows x cols] +// Returns allocated buffer and sets out_len. Caller must free(). 
+uint8_t *ane_bridge_build_weight_blob(const float *src, int rows, int cols,
+                                      size_t *out_len);
+
+// Build a transposed weight blob in ANE format
+uint8_t *ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols,
+                                                 size_t *out_len);
+
+// Free a blob allocated by ane_bridge_build_weight_blob*
+void ane_bridge_free_blob(void *ptr);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif // ANE_BRIDGE_H
diff --git a/bridge/ane_bridge.m b/bridge/ane_bridge.m
new file mode 100644
index 0000000..2b27ddc
--- /dev/null
+++ b/bridge/ane_bridge.m
@@ -0,0 +1,328 @@
+// ane_bridge.m — Objective-C implementation of ANE bridge for Python ctypes
+// Wraps _ANEInMemoryModel private APIs into C-callable functions
+
+#import <Foundation/Foundation.h>
+#import <IOSurface/IOSurface.h>
+#import <objc/message.h>
+#import <dlfcn.h>
+#import <unistd.h>
+#include "ane_bridge.h"
+
+// --- Private class references ---
+static Class g_ANEDesc = nil;
+static Class g_ANEInMem = nil;
+static Class g_ANEReq = nil;
+static Class g_ANEIO = nil;
+static bool g_initialized = false;
+static int g_compile_count = 0;
+
+// --- Kernel handle struct ---
+struct ANEKernelHandle {
+    id model;               // _ANEInMemoryModel
+    IOSurfaceRef *ioInputs;
+    IOSurfaceRef *ioOutputs;
+    id request;             // _ANERequest
+    NSString *tmpDir;
+    int nInputs, nOutputs;
+    size_t *inputBytes;
+    size_t *outputBytes;
+};
+
+// --- Public API ---
+
+int ane_bridge_init(void) {
+    if (g_initialized) return 0;
+
+    void *handle = dlopen(
+        "/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine",
+        RTLD_NOW);
+    if (!handle) {
+        fprintf(stderr, "ane_bridge: Failed to load AppleNeuralEngine.framework\n");
+        return -1;
+    }
+
+    g_ANEDesc = NSClassFromString(@"_ANEInMemoryModelDescriptor");
+    g_ANEInMem = NSClassFromString(@"_ANEInMemoryModel");
+    g_ANEReq = NSClassFromString(@"_ANERequest");
+    g_ANEIO = NSClassFromString(@"_ANEIOSurfaceObject");
+
+    if (!g_ANEDesc || !g_ANEInMem || !g_ANEReq || !g_ANEIO) {
+        fprintf(stderr, "ane_bridge: Failed to resolve ANE private classes\n");
+        return -1;
+    }
+
+    
g_initialized = true; + g_compile_count = 0; + return 0; +} + +static IOSurfaceRef create_surface(size_t bytes) { + return IOSurfaceCreate((__bridge CFDictionaryRef)@{ + (id)kIOSurfaceWidth: @(bytes), + (id)kIOSurfaceHeight: @1, + (id)kIOSurfaceBytesPerElement: @1, + (id)kIOSurfaceBytesPerRow: @(bytes), + (id)kIOSurfaceAllocSize: @(bytes), + (id)kIOSurfacePixelFormat: @0 + }); +} + +ANEKernelHandle *ane_bridge_compile_multi_weights( + const char *mil_text, size_t mil_len, + const char **weight_names, const uint8_t **weight_datas, + const size_t *weight_lens, int n_weights, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes) +{ + @autoreleasepool { + if (!g_initialized) { + fprintf(stderr, "ane_bridge: Not initialized\n"); + return NULL; + } + + NSData *milData = [NSData dataWithBytes:mil_text length:mil_len]; + NSError *e = nil; + + // Build weight dictionary + NSMutableDictionary *wdict = [NSMutableDictionary dictionary]; + for (int i = 0; i < n_weights; i++) { + NSString *name = [NSString stringWithUTF8String:weight_names[i]]; + NSData *data = [NSData dataWithBytes:weight_datas[i] length:weight_lens[i]]; + wdict[name] = @{@"offset": @0, @"data": data}; + } + + id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)( + g_ANEDesc, @selector(modelWithMILText:weights:optionsPlist:), + milData, wdict.count > 0 ? 
wdict : nil, nil); + if (!desc) { + fprintf(stderr, "ane_bridge: modelWithMILText failed\n"); + return NULL; + } + + id mdl = ((id(*)(Class,SEL,id))objc_msgSend)( + g_ANEInMem, @selector(inMemoryModelWithDescriptor:), desc); + if (!mdl) { + fprintf(stderr, "ane_bridge: inMemoryModelWithDescriptor failed\n"); + return NULL; + } + + // Pre-populate temp dir + id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier)); + NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx]; + NSFileManager *fm = [NSFileManager defaultManager]; + [fm createDirectoryAtPath:[td stringByAppendingPathComponent:@"weights"] + withIntermediateDirectories:YES attributes:nil error:nil]; + [milData writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES]; + + for (int i = 0; i < n_weights; i++) { + NSString *name = [NSString stringWithUTF8String:weight_names[i]]; + // Extract filename from path like "@model_path/weights/wq.bin" -> "weights/wq.bin" + NSString *relPath = name; + if ([name hasPrefix:@"@model_path/"]) { + relPath = [name substringFromIndex:12]; + } + NSString *fullPath = [td stringByAppendingPathComponent:relPath]; + NSString *dir = [fullPath stringByDeletingLastPathComponent]; + [fm createDirectoryAtPath:dir withIntermediateDirectories:YES attributes:nil error:nil]; + NSData *data = [NSData dataWithBytes:weight_datas[i] length:weight_lens[i]]; + [data writeToFile:fullPath atomically:YES]; + } + + // Compile + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + mdl, @selector(compileWithQoS:options:error:), 21, @{}, &e)) { + fprintf(stderr, "ane_bridge: ANE compile failed: %s\n", + e ? 
[[e description] UTF8String] : "unknown"); + [fm removeItemAtPath:td error:nil]; + return NULL; + } + + // Load (with one retry after a brief pause for ANE slot reclamation) + BOOL loaded = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e); + if (!loaded) { + fprintf(stderr, "ane_bridge: ANE load failed (retrying in 100ms): %s\n", + e ? [[e description] UTF8String] : "unknown"); + usleep(100000); // 100ms + e = nil; + loaded = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e); + } + if (!loaded) { + fprintf(stderr, "ane_bridge: ANE load failed after retry: %s\n", + e ? [[e description] UTF8String] : "unknown"); + [fm removeItemAtPath:td error:nil]; + return NULL; + } + + g_compile_count++; + + // Create kernel handle + ANEKernelHandle *k = (ANEKernelHandle *)calloc(1, sizeof(ANEKernelHandle)); + k->model = mdl; + k->tmpDir = td; + k->nInputs = n_inputs; + k->nOutputs = n_outputs; + k->inputBytes = (size_t *)malloc(n_inputs * sizeof(size_t)); + k->outputBytes = (size_t *)malloc(n_outputs * sizeof(size_t)); + memcpy(k->inputBytes, input_sizes, n_inputs * sizeof(size_t)); + memcpy(k->outputBytes, output_sizes, n_outputs * sizeof(size_t)); + + // Create IOSurfaces + k->ioInputs = (IOSurfaceRef *)malloc(n_inputs * sizeof(IOSurfaceRef)); + k->ioOutputs = (IOSurfaceRef *)malloc(n_outputs * sizeof(IOSurfaceRef)); + for (int i = 0; i < n_inputs; i++) + k->ioInputs[i] = create_surface(input_sizes[i]); + for (int i = 0; i < n_outputs; i++) + k->ioOutputs[i] = create_surface(output_sizes[i]); + + // Build request + NSMutableArray *wIns = [NSMutableArray arrayWithCapacity:n_inputs]; + NSMutableArray *iIdx = [NSMutableArray arrayWithCapacity:n_inputs]; + for (int i = 0; i < n_inputs; i++) { + [wIns addObject:((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)( + g_ANEIO, @selector(objectWithIOSurface:), k->ioInputs[i])]; + [iIdx addObject:@(i)]; + } + 
NSMutableArray *wOuts = [NSMutableArray arrayWithCapacity:n_outputs]; + NSMutableArray *oIdx = [NSMutableArray arrayWithCapacity:n_outputs]; + for (int i = 0; i < n_outputs; i++) { + [wOuts addObject:((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)( + g_ANEIO, @selector(objectWithIOSurface:), k->ioOutputs[i])]; + [oIdx addObject:@(i)]; + } + k->request = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)( + g_ANEReq, + @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), + wIns, iIdx, wOuts, oIdx, nil, nil, @0); + + return k; + } +} + +ANEKernelHandle *ane_bridge_compile(const char *mil_text, size_t mil_len, + const uint8_t *weight_data, size_t weight_len, + int n_inputs, const size_t *input_sizes, + int n_outputs, const size_t *output_sizes) { + if (weight_data && weight_len > 0) { + const char *name = "@model_path/weights/weight.bin"; + return ane_bridge_compile_multi_weights( + mil_text, mil_len, + &name, &weight_data, &weight_len, 1, + n_inputs, input_sizes, + n_outputs, output_sizes); + } else { + return ane_bridge_compile_multi_weights( + mil_text, mil_len, + NULL, NULL, NULL, 0, + n_inputs, input_sizes, + n_outputs, output_sizes); + } +} + +bool ane_bridge_eval(ANEKernelHandle *kernel) { + @autoreleasepool { + if (!kernel || !kernel->model) return false; + NSError *e = nil; + return ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + kernel->model, @selector(evaluateWithQoS:options:request:error:), + 21, @{}, kernel->request, &e); + } +} + +void ane_bridge_write_input(ANEKernelHandle *kernel, int idx, + const void *data, size_t bytes) { + if (!kernel || idx < 0 || idx >= kernel->nInputs) return; + IOSurfaceLock(kernel->ioInputs[idx], 0, NULL); + memcpy(IOSurfaceGetBaseAddress(kernel->ioInputs[idx]), data, bytes); + IOSurfaceUnlock(kernel->ioInputs[idx], 0, NULL); +} + +void ane_bridge_read_output(ANEKernelHandle *kernel, int idx, + void *data, size_t bytes) { + if (!kernel || idx < 0 || idx >= 
kernel->nOutputs) return; + IOSurfaceLock(kernel->ioOutputs[idx], kIOSurfaceLockReadOnly, NULL); + memcpy(data, IOSurfaceGetBaseAddress(kernel->ioOutputs[idx]), bytes); + IOSurfaceUnlock(kernel->ioOutputs[idx], kIOSurfaceLockReadOnly, NULL); +} + +void ane_bridge_free(ANEKernelHandle *kernel) { + @autoreleasepool { + if (!kernel) return; + NSError *e = nil; + if (kernel->model) { + ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)( + kernel->model, @selector(unloadWithQoS:error:), 21, &e); + } + for (int i = 0; i < kernel->nInputs; i++) + if (kernel->ioInputs[i]) CFRelease(kernel->ioInputs[i]); + for (int i = 0; i < kernel->nOutputs; i++) + if (kernel->ioOutputs[i]) CFRelease(kernel->ioOutputs[i]); + if (kernel->tmpDir) { + [[NSFileManager defaultManager] removeItemAtPath:kernel->tmpDir error:nil]; + } + free(kernel->ioInputs); + free(kernel->ioOutputs); + free(kernel->inputBytes); + free(kernel->outputBytes); + + // Explicitly nil Objective-C objects to trigger ARC release before freeing struct + kernel->model = nil; + kernel->request = nil; + kernel->tmpDir = nil; + + free(kernel); + } +} + +int ane_bridge_get_compile_count(void) { + return g_compile_count; +} + +void ane_bridge_reset_compile_count(void) { + g_compile_count = 0; +} + +uint8_t *ane_bridge_build_weight_blob(const float *src, int rows, int cols, + size_t *out_len) { + int wsize = rows * cols * 2; // fp16 + int total = 128 + wsize; + uint8_t *buf = (uint8_t *)calloc(total, 1); + + // ANE blob header + buf[0] = 0x01; buf[4] = 0x02; + buf[64] = 0xEF; buf[65] = 0xBE; buf[66] = 0xAD; buf[67] = 0xDE; + buf[68] = 0x01; + *(uint32_t*)(buf + 72) = wsize; + *(uint32_t*)(buf + 80) = 128; + + // Convert float32 -> float16 + _Float16 *fp16 = (_Float16 *)(buf + 128); + for (int i = 0; i < rows * cols; i++) { + fp16[i] = (_Float16)src[i]; + } + + *out_len = total; + return buf; +} + +uint8_t *ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols, + size_t *out_len) { + int wsize = rows 
* cols * 2;
+    int total = 128 + wsize;
+    uint8_t *buf = (uint8_t *)calloc(total, 1);
+
+    buf[0] = 0x01; buf[4] = 0x02;
+    buf[64] = 0xEF; buf[65] = 0xBE; buf[66] = 0xAD; buf[67] = 0xDE;
+    buf[68] = 0x01;
+    *(uint32_t*)(buf + 72) = wsize;
+    *(uint32_t*)(buf + 80) = 128;
+
+    _Float16 *fp16 = (_Float16 *)(buf + 128);
+    for (int i = 0; i < rows; i++)
+        for (int j = 0; j < cols; j++)
+            fp16[j * rows + i] = (_Float16)src[i * cols + j];
+
+    *out_len = total;
+    return buf;
+}
diff --git a/bridge/libane_bridge.dylib b/bridge/libane_bridge.dylib
new file mode 100755
index 0000000..72acc32
Binary files /dev/null and b/bridge/libane_bridge.dylib differ
diff --git a/inmem_bench.m b/inmem_bench.m
index 8a5af33..51bd0aa 100644
--- a/inmem_bench.m
+++ b/inmem_bench.m
@@ -1,5 +1,4 @@
 #import <Foundation/Foundation.h>
-#import <CoreML/CoreML.h>
 #import <IOSurface/IOSurface.h>
 #import <objc/message.h>
 #import <mach/mach_time.h>
@@ -9,18 +8,45 @@
 static mach_timebase_info_data_t g_tb;
 static double ticksToMs(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
 
+static NSData *buildWeightBlob(int ch) {
+    NSUInteger wsize = (NSUInteger)ch * ch * 2;
+    NSUInteger total = 64 + 64 + wsize;
+    uint8_t *buf = calloc(total, 1);
+    buf[0] = 0x01; buf[4] = 0x02;
+    uint8_t *chunk = buf + 64;
+    chunk[0]=0xEF; chunk[1]=0xBE; chunk[2]=0xAD; chunk[3]=0xDE;
+    chunk[4]=0x01; chunk[10]=0x08;
+    uint16_t *fp16 = (uint16_t*)(chunk + 64);
+    for (NSUInteger j = 0; j < (NSUInteger)ch * ch; j++)
+        fp16[j] = (arc4random() & 0x03FF) | 0x2000;
+    return [NSData dataWithBytesNoCopy:buf length:total freeWhenDone:YES];
+}
+
+static NSString *genMIL(int ch, int sp) {
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:@"program(1.3)\n[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, {\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, {\"coremltools-version\", \"9.0\"}})]\n{\n"];
+    [m appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", ch, sp];
+    [m appendString:
+        @"  string c_pad_type = const()[name = string(\"c_pad_type\"), val = string(\"valid\")];\n"
+        @"  tensor<int32, [2]> c_strides = const()[name = string(\"c_strides\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  tensor<int32, [4]> c_pad = const()[name = string(\"c_pad\"), val = tensor<int32, [4]>([0, 0, 0, 0])];\n"
+        @"  tensor<int32, [2]> c_dilations = const()[name = string(\"c_dilations\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  int32 c_groups = const()[name = string(\"c_groups\"), val = int32(1)];\n"
+        @"  string to_fp16 = const()[name = string(\"to_fp16\"), val = string(\"fp16\")];\n"];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = to_fp16, x = x)[name = string(\"cast_in\")];\n", ch, sp];
+    [m appendFormat:@"  tensor<fp16, [%d, %d, 1, 1]> W = const()[name = string(\"W\"), val = tensor<fp16, [%d, %d, 1, 1]>(BLOBFILE(path = string(\"@model_path/weights/weight.bin\"), offset = uint64(64)))];\n", ch, ch, ch, ch];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> y16 = conv(dilations = c_dilations, groups = c_groups, pad = c_pad, pad_type = c_pad_type, strides = c_strides, weight = W, x = x16)[name = string(\"conv\")];\n", ch, sp];
+    [m appendString:@"  string to_fp32 = const()[name = string(\"to_fp32\"), val = string(\"fp32\")];\n"];
+    [m appendFormat:@"  tensor<fp32, [1, %d, 1, %d]> y = cast(dtype = to_fp32, x = y16)[name = string(\"cast_out\")];\n", ch, sp];
+    [m appendString:@" } -> (y);\n}\n"];
+    return m;
+}
+
 double benchInMem(int ch, int sp) {
     @autoreleasepool {
         NSError *e = nil;
-        NSString *path = [NSString stringWithFormat:@"/tmp/ane_sram_%dch_%dsp.mlpackage", ch, sp];
-        NSURL *compiled = [MLModel compileModelAtURL:[NSURL fileURLWithPath:path] error:&e];
-        if (e) return -1;
-
-        NSData *milData = [[NSString stringWithContentsOfFile:
-            [[compiled path] stringByAppendingPathComponent:@"model.mil"]
-            encoding:NSUTF8StringEncoding error:nil] dataUsingEncoding:NSUTF8StringEncoding];
-        NSData *weightBlob = [NSData dataWithContentsOfFile:
-            [[compiled path] stringByAppendingPathComponent:@"weights/weight.bin"]];
+        NSData *milData = [[genMIL(ch, sp) dataUsingEncoding:NSUTF8StringEncoding] copy];
+        NSData *wb = buildWeightBlob(ch);
 
         Class Desc = NSClassFromString(@"_ANEInMemoryModelDescriptor");
         Class IMM =
NSClassFromString(@"_ANEInMemoryModel");
@@ -28,7 +54,7 @@
         Class AIO = NSClassFromString(@"_ANEIOSurfaceObject");
 
         NSDictionary *wdict = @{
-            @"@model_path/weights/weight.bin": @{@"offset": @64, @"data": weightBlob}
+            @"@model_path/weights/weight.bin": @{@"offset": @0, @"data": wb}
         };
         id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(
             Desc, @selector(modelWithMILText:weights:optionsPlist:),
@@ -43,7 +69,7 @@
         [fm createDirectoryAtPath:[tmpDir stringByAppendingPathComponent:@"weights"]
             withIntermediateDirectories:YES attributes:nil error:nil];
         [milData writeToFile:[tmpDir stringByAppendingPathComponent:@"model.mil"] atomically:YES];
-        [weightBlob writeToFile:[tmpDir stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES];
+        [wb writeToFile:[tmpDir stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES];
 
         BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(
             model, @selector(compileWithQoS:options:error:), 21, @{}, &e);
diff --git a/sram_bench.m b/sram_bench.m
index 9dc3a35..85b46d5 100644
--- a/sram_bench.m
+++ b/sram_bench.m
@@ -1,5 +1,4 @@
 #import <Foundation/Foundation.h>
-#import <CoreML/CoreML.h>
 #import <IOSurface/IOSurface.h>
 #import <objc/message.h>
 #import <mach/mach_time.h>
@@ -8,25 +7,79 @@
 static mach_timebase_info_data_t g_tb;
 static double ticksToMs(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
 
-static id g_client;
-static Class AM, AR, AIO;
-double bench(const char *path, int ch, int sp) {
+static NSData *buildWeightBlob(int ch) {
+    NSUInteger wsize = (NSUInteger)ch * ch * 2;
+    NSUInteger total = 64 + 64 + wsize;
+    uint8_t *buf = calloc(total, 1);
+    buf[0] = 0x01; buf[4] = 0x02;
+    uint8_t *chunk = buf + 64;
+    chunk[0]=0xEF; chunk[1]=0xBE; chunk[2]=0xAD; chunk[3]=0xDE;
+    chunk[4]=0x01; chunk[10]=0x08;
+    uint16_t *fp16 = (uint16_t*)(chunk + 64);
+    for (NSUInteger j = 0; j < (NSUInteger)ch * ch; j++)
+        fp16[j] = (arc4random() & 0x03FF) | 0x2000;
+    return [NSData dataWithBytesNoCopy:buf length:total freeWhenDone:YES];
+}
+
+static NSString *genMIL(int ch, int sp) {
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:@"program(1.3)\n[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, {\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, {\"coremltools-version\", \"9.0\"}})]\n{\n"];
+    [m appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", ch, sp];
+    [m appendString:
+        @"  string c_pad_type = const()[name = string(\"c_pad_type\"), val = string(\"valid\")];\n"
+        @"  tensor<int32, [2]> c_strides = const()[name = string(\"c_strides\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  tensor<int32, [4]> c_pad = const()[name = string(\"c_pad\"), val = tensor<int32, [4]>([0, 0, 0, 0])];\n"
+        @"  tensor<int32, [2]> c_dilations = const()[name = string(\"c_dilations\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  int32 c_groups = const()[name = string(\"c_groups\"), val = int32(1)];\n"
+        @"  string to_fp16 = const()[name = string(\"to_fp16\"), val = string(\"fp16\")];\n"];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = to_fp16, x = x)[name = string(\"cast_in\")];\n", ch, sp];
+    [m appendFormat:@"  tensor<fp16, [%d, %d, 1, 1]> W = const()[name = string(\"W\"), val = tensor<fp16, [%d, %d, 1, 1]>(BLOBFILE(path = string(\"@model_path/weights/weight.bin\"), offset = uint64(64)))];\n", ch, ch, ch, ch];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> y16 = conv(dilations = c_dilations, groups = c_groups, pad = c_pad, pad_type = c_pad_type, strides = c_strides, weight = W, x = x16)[name = string(\"conv\")];\n", ch, sp];
+    [m appendString:@"  string to_fp32 = const()[name = string(\"to_fp32\"), val = string(\"fp32\")];\n"];
+    [m appendFormat:@"  tensor<fp32, [1, %d, 1, %d]> y = cast(dtype = to_fp32, x = y16)[name = string(\"cast_out\")];\n", ch, sp];
+    [m appendString:@" } -> (y);\n}\n"];
+    return m;
+}
+
+double bench(int ch, int sp) {
     @autoreleasepool {
         NSError *e = nil;
-        NSURL *compiled = [MLModel compileModelAtURL:
-            [NSURL fileURLWithPath:[NSString stringWithUTF8String:path]] error:&e];
-        if (e) return -1;
-        id model = ((id(*)(Class,SEL,id,id))objc_msgSend)(AM, @selector(modelAtURL:key:), compiled, @"s");
-        BOOL ok = ((BOOL(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)(
-            g_client,
@selector(compileModel:options:qos:error:), model, - @{@"kANEFModelType":@"kANEFModelMIL",@"kANEFNetPlistFilenameKey":@"model.mil"}, 21, &e); - if (!ok) return -2; - ok = ((BOOL(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(loadModel:options:qos:error:), model, @{}, 21, &e); - if (!ok) return -3; - - NSUInteger bytes = ch * sp * 4; // FP32 input + NSData *milData = [[genMIL(ch, sp) dataUsingEncoding:NSUTF8StringEncoding] copy]; + NSData *wb = buildWeightBlob(ch); + + Class D = NSClassFromString(@"_ANEInMemoryModelDescriptor"); + Class I = NSClassFromString(@"_ANEInMemoryModel"); + Class AR = NSClassFromString(@"_ANERequest"); + Class AIO = NSClassFromString(@"_ANEIOSurfaceObject"); + + id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)( + D, @selector(modelWithMILText:weights:optionsPlist:), + milData, @{@"@model_path/weights/weight.bin": @{@"offset": @0, @"data": wb}}, nil); + if (!desc) return -2; + + id model = ((id(*)(Class,SEL,id))objc_msgSend)( + I, @selector(inMemoryModelWithDescriptor:), desc); + if (!model) return -3; + + id hexId = ((id(*)(id,SEL))objc_msgSend)(model, @selector(hexStringIdentifier)); + NSString *tmpDir = [NSTemporaryDirectory() stringByAppendingPathComponent:hexId]; + NSFileManager *fm = [NSFileManager defaultManager]; + [fm createDirectoryAtPath:[tmpDir stringByAppendingPathComponent:@"weights"] + withIntermediateDirectories:YES attributes:nil error:nil]; + [milData writeToFile:[tmpDir stringByAppendingPathComponent:@"model.mil"] atomically:YES]; + [wb writeToFile:[tmpDir stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES]; + + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + model, @selector(compileWithQoS:options:error:), 21, @{}, &e)) { + [fm removeItemAtPath:tmpDir error:nil]; return -4; + } + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + model, @selector(loadWithQoS:options:error:), 21, @{}, &e)) { + [fm removeItemAtPath:tmpDir error:nil]; return 
-5; + } + + NSUInteger bytes = ch * sp * 4; IOSurfaceRef ioIn = IOSurfaceCreate((__bridge CFDictionaryRef)@{ (id)kIOSurfaceWidth:@(bytes),(id)kIOSurfaceHeight:@1, (id)kIOSurfaceBytesPerElement:@1,(id)kIOSurfaceBytesPerRow:@(bytes), @@ -35,7 +88,6 @@ (id)kIOSurfaceWidth:@(bytes),(id)kIOSurfaceHeight:@1, (id)kIOSurfaceBytesPerElement:@1,(id)kIOSurfaceBytesPerRow:@(bytes), (id)kIOSurfaceAllocSize:@(bytes),(id)kIOSurfacePixelFormat:@0}); - id wIn = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(AIO, @selector(objectWithIOSurface:), ioIn); id wOut = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(AIO, @selector(objectWithIOSurface:), ioOut); id req = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(AR, @@ -43,19 +95,20 @@ @[wIn], @[@0], @[wOut], @[@0], nil, nil, @0); for (int i = 0; i < 5; i++) - ((BOOL(*)(id,SEL,id,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(evaluateWithModel:options:request:qos:error:), model, @{}, req, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); int iters = 30; uint64_t t0 = mach_absolute_time(); for (int i = 0; i < iters; i++) - ((BOOL(*)(id,SEL,id,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(evaluateWithModel:options:request:qos:error:), model, @{}, req, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); double ms = ticksToMs(mach_absolute_time() - t0) / iters; - ((void(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(unloadModel:options:qos:error:), model, @{}, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)( + model, @selector(unloadWithQoS:error:), 21, &e); CFRelease(ioIn); CFRelease(ioOut); + [fm removeItemAtPath:tmpDir error:nil]; return ms; } } @@ -63,10 +116,6 @@ int main() { mach_timebase_info(&g_tb); 
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
-    g_client = [NSClassFromString(@"_ANEClient") performSelector:@selector(sharedConnection)];
-    AM = NSClassFromString(@"_ANEModel");
-    AR = NSClassFromString(@"_ANERequest");
-    AIO = NSClassFromString(@"_ANEIOSurfaceObject");
 
     printf("=== ANE SRAM Probe: 1x1 Conv with Increasing Weight Size ===\n\n");
     printf("%-25s %8s %8s %8s %10s %8s\n", "Config", "W (MB)", "Act(MB)", "Tot(MB)", "ms/eval", "TFLOPS");
@@ -82,9 +131,7 @@ int main() {
         double tot = w_mb + 2 * a_mb;
         double gflop = 2.0 * ch * ch * sp / 1e9;
-        char path[256];
-        snprintf(path, sizeof(path), "/tmp/ane_sram_%dch_%dsp.mlpackage", ch, sp);
-        double ms = bench(path, ch, sp);
+        double ms = bench(ch, sp);
         double tflops = (ms > 0) ? gflop / ms : -1;
         char label[64];
diff --git a/sram_probe.m b/sram_probe.m
index 0766187..4ca4df6 100644
--- a/sram_probe.m
+++ b/sram_probe.m
@@ -1,5 +1,4 @@
 #import <Foundation/Foundation.h>
-#import <CoreML/CoreML.h>
 #import <IOSurface/IOSurface.h>
 #import <objc/message.h>
 #import <mach/mach_time.h>
@@ -8,20 +7,78 @@
 static mach_timebase_info_data_t g_tb;
 static double ticksToMs(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
 
-static id g_client;
 static Class AM, AR, AIO;
-double bench(const char *path, int ch, int sp) {
+static NSData *buildWeightBlob(int ch) {
+    NSUInteger wsize = (NSUInteger)ch * ch * 2;
+    NSUInteger total = 64 + 64 + wsize;
+    uint8_t *buf = calloc(total, 1);
+    buf[0] = 0x01; buf[4] = 0x02;
+    uint8_t *chunk = buf + 64;
+    chunk[0]=0xEF; chunk[1]=0xBE; chunk[2]=0xAD; chunk[3]=0xDE;
+    chunk[4]=0x01; chunk[10]=0x08;
+    uint16_t *fp16 = (uint16_t*)(chunk + 64);
+    for (NSUInteger j = 0; j < (NSUInteger)ch * ch; j++)
+        fp16[j] = (arc4random() & 0x03FF) | 0x2000;
+    return [NSData dataWithBytesNoCopy:buf length:total freeWhenDone:YES];
+}
+
+static NSString *genMIL(int ch, int sp) {
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:@"program(1.3)\n[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, {\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, {\"coremltools-version\", \"9.0\"}})]\n{\n"];
+    [m appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", ch, sp];
+    [m appendString:
+        @"  string c_pad_type = const()[name = string(\"c_pad_type\"), val = string(\"valid\")];\n"
+        @"  tensor<int32, [2]> c_strides = const()[name = string(\"c_strides\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  tensor<int32, [4]> c_pad = const()[name = string(\"c_pad\"), val = tensor<int32, [4]>([0, 0, 0, 0])];\n"
+        @"  tensor<int32, [2]> c_dilations = const()[name = string(\"c_dilations\"), val = tensor<int32, [2]>([1, 1])];\n"
+        @"  int32 c_groups = const()[name = string(\"c_groups\"), val = int32(1)];\n"
+        @"  string to_fp16 = const()[name = string(\"to_fp16\"), val = string(\"fp16\")];\n"];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = to_fp16, x = x)[name = string(\"cast_in\")];\n", ch, sp];
+    [m appendFormat:@"  tensor<fp16, [%d, %d, 1, 1]> W = const()[name = string(\"W\"), val = tensor<fp16, [%d, %d, 1, 1]>(BLOBFILE(path = string(\"@model_path/weights/weight.bin\"), offset = uint64(64)))];\n", ch, ch, ch, ch];
+    [m appendFormat:@"  tensor<fp16, [1, %d, 1, %d]> y16 = conv(dilations = c_dilations, groups = c_groups, pad = c_pad, pad_type = c_pad_type, strides = c_strides, weight = W, x = x16)[name = string(\"conv\")];\n", ch, sp];
+    [m appendString:@"  string to_fp32 = const()[name = string(\"to_fp32\"), val = string(\"fp32\")];\n"];
+    [m appendFormat:@"  tensor<fp32, [1, %d, 1, %d]> y = cast(dtype = to_fp32, x = y16)[name = string(\"cast_out\")];\n", ch, sp];
+    [m appendString:@" } -> (y);\n}\n"];
+    return m;
+}
+
+double bench(int ch, int sp) {
     @autoreleasepool {
         NSError *e = nil;
-        NSURL *compiled = [MLModel compileModelAtURL:
-            [NSURL fileURLWithPath:[NSString stringWithUTF8String:path]] error:&e];
-        if (e) return -1;
-        id model = ((id(*)(Class,SEL,id,id))objc_msgSend)(AM, @selector(modelAtURL:key:), compiled, @"s");
-        ((BOOL(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)(
-            g_client, @selector(compileModel:options:qos:error:), model,
-            @{@"kANEFModelType":@"kANEFModelMIL",@"kANEFNetPlistFilenameKey":@"model.mil"}, 21, &e);
-        
((BOOL(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(loadModel:options:qos:error:), model, @{}, 21, &e); + NSData *milData = [[genMIL(ch, sp) dataUsingEncoding:NSUTF8StringEncoding] copy]; + NSData *wb = buildWeightBlob(ch); + + Class D = NSClassFromString(@"_ANEInMemoryModelDescriptor"); + Class I = NSClassFromString(@"_ANEInMemoryModel"); + Class AR = NSClassFromString(@"_ANERequest"); + Class AIO = NSClassFromString(@"_ANEIOSurfaceObject"); + + id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)( + D, @selector(modelWithMILText:weights:optionsPlist:), + milData, @{@"@model_path/weights/weight.bin": @{@"offset": @0, @"data": wb}}, nil); + if (!desc) return -2; + + id model = ((id(*)(Class,SEL,id))objc_msgSend)( + I, @selector(inMemoryModelWithDescriptor:), desc); + if (!model) return -3; + + id hexId = ((id(*)(id,SEL))objc_msgSend)(model, @selector(hexStringIdentifier)); + NSString *tmpDir = [NSTemporaryDirectory() stringByAppendingPathComponent:hexId]; + NSFileManager *fm = [NSFileManager defaultManager]; + [fm createDirectoryAtPath:[tmpDir stringByAppendingPathComponent:@"weights"] + withIntermediateDirectories:YES attributes:nil error:nil]; + [milData writeToFile:[tmpDir stringByAppendingPathComponent:@"model.mil"] atomically:YES]; + [wb writeToFile:[tmpDir stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES]; + + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + model, @selector(compileWithQoS:options:error:), 21, @{}, &e)) { + [fm removeItemAtPath:tmpDir error:nil]; return -4; + } + if (!((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)( + model, @selector(loadWithQoS:options:error:), 21, @{}, &e)) { + [fm removeItemAtPath:tmpDir error:nil]; return -5; + } + NSUInteger bytes = ch * sp * 4; IOSurfaceRef ioIn = IOSurfaceCreate((__bridge CFDictionaryRef)@{ (id)kIOSurfaceWidth:@(bytes),(id)kIOSurfaceHeight:@1, @@ -36,18 +93,22 @@ id req = 
((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(AR, @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), @[wIn], @[@0], @[wOut], @[@0], nil, nil, @0); + for (int i = 0; i < 5; i++) - ((BOOL(*)(id,SEL,id,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(evaluateWithModel:options:request:qos:error:), model, @{}, req, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); + int iters = 50; uint64_t t0 = mach_absolute_time(); for (int i = 0; i < iters; i++) - ((BOOL(*)(id,SEL,id,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(evaluateWithModel:options:request:qos:error:), model, @{}, req, 21, &e); + ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); double ms = ticksToMs(mach_absolute_time() - t0) / iters; - ((void(*)(id,SEL,id,id,NSUInteger,NSError**))objc_msgSend)( - g_client, @selector(unloadModel:options:qos:error:), model, @{}, 21, &e); + + ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)( + model, @selector(unloadWithQoS:error:), 21, &e); CFRelease(ioIn); CFRelease(ioOut); + [fm removeItemAtPath:tmpDir error:nil]; return ms; } } @@ -55,9 +116,6 @@ int main() { mach_timebase_info(&g_tb); dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW); - g_client = [NSClassFromString(@"_ANEClient") performSelector:@selector(sharedConnection)]; - AM = NSClassFromString(@"_ANEModel"); AR = NSClassFromString(@"_ANERequest"); - AIO = NSClassFromString(@"_ANEIOSurfaceObject"); printf("=== ANE SRAM Fine Probe (weights only vary, spatial=64) ===\n\n"); printf("%-12s %8s %10s %8s %12s\n", "Channels", "W (MB)", "ms/eval", "TFLOPS", "GFLOPS/MB"); @@ -70,9 +128,7 @@ int main() { int ch = chs[i], sp = sps[i]; double w_mb = (double)ch * ch * 2 / 1024 / 1024; double 
gf = 2.0 * ch * ch * sp / 1e9; - char path[256]; - snprintf(path, sizeof(path), "/tmp/ane_sram_%dch_%dsp.mlpackage", ch, sp); - double ms = bench(path, ch, sp); + double ms = bench(ch, sp); double tf = (ms > 0) ? gf / ms : 0; double eff = (ms > 0) ? tf * 1000 / w_mb : 0; printf("%6d ch %7.1f %8.3f ms %7.2f %10.1f %s\n", diff --git a/training/Makefile b/training/Makefile index 9cc9e34..74b9211 100644 --- a/training/Makefile +++ b/training/Makefile @@ -1,36 +1,58 @@ -CC = xcrun clang -CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc -FRAMEWORKS = -framework Foundation -framework CoreML -framework IOSurface -LDFLAGS = $(FRAMEWORKS) -ldl - -HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h - -train: train.m ane_runtime.h ane_mil_gen.h model.h forward.h backward.h - $(CC) $(CFLAGS) -o $@ train.m $(LDFLAGS) - -train_large: train_large.m $(HEADERS_LARGE) - $(CC) $(CFLAGS) -o $@ train_large.m $(LDFLAGS) -framework Accelerate - -PROBES = test_weight_reload test_perf_stats test_qos_sweep test_ane_advanced - -test_weight_reload: test_weight_reload.m - $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) - -test_perf_stats: test_perf_stats.m - $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) - -test_qos_sweep: test_qos_sweep.m - $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) - -test_ane_advanced: test_ane_advanced.m - $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) - -probes: $(PROBES) - -tokenize: - python3 tokenize.py - -clean: - rm -f train train_large $(PROBES) - -.PHONY: clean tokenize probes +CC = xcrun clang +CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc +FRAMEWORKS = -framework Foundation -framework CoreML -framework IOSurface +LDFLAGS = $(FRAMEWORKS) -ldl + +HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h + +HEADERS_ANE = $(HEADERS_LARGE) ane_rmsnorm_bwd.h ane_classifier.h + +HEADERS_PIPELINE = model_config.h pipeline.h gradient_checkpoint.h + +train: train.m ane_runtime.h ane_mil_gen.h model.h forward.h backward.h + $(CC) $(CFLAGS) -o $@ 
train.m $(LDFLAGS) + +train_large: train_large.m $(HEADERS_LARGE) + $(CC) $(CFLAGS) -o $@ train_large.m $(LDFLAGS) -framework Accelerate + +train_large_ane: train_large_ane.m $(HEADERS_ANE) + $(CC) $(CFLAGS) -o $@ train_large_ane.m $(LDFLAGS) -framework Accelerate + +train_pipeline: train_pipeline.m $(HEADERS_PIPELINE) + $(CC) $(CFLAGS) -o $@ train_pipeline.m $(LDFLAGS) -framework Accelerate + +train_pipeline_live: train_pipeline.m $(HEADERS_PIPELINE) $(HEADERS_LARGE) + $(CC) $(CFLAGS) -DANE_LIVE -o train_pipeline train_pipeline.m $(LDFLAGS) -framework Accelerate + +PROBES = test_weight_reload test_perf_stats test_qos_sweep test_ane_advanced + +test_rmsnorm_bwd: test_rmsnorm_bwd.m $(HEADERS_ANE) + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Accelerate + +test_classifier: test_classifier.m $(HEADERS_ANE) + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Accelerate + +test_weight_reload: test_weight_reload.m + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) + +test_perf_stats: test_perf_stats.m + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) + +test_qos_sweep: test_qos_sweep.m + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) + +test_ane_advanced: test_ane_advanced.m + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) + +probes: $(PROBES) + +tokenize: + python3 tokenize.py + +test_pipeline_unit: test_pipeline_unit.c $(HEADERS_PIPELINE) + cc -O2 -Wall -o $@ $< -lm + +clean: + rm -f train train_large train_large_ane train_pipeline test_pipeline_unit $(PROBES) test_rmsnorm_bwd test_classifier + +.PHONY: clean tokenize probes diff --git a/training/README.md b/training/README.md index 53edbb9..a3f33eb 100644 --- a/training/README.md +++ b/training/README.md @@ -8,62 +8,136 @@ Training a 109M-parameter Llama2-architecture transformer (Stories110M) directly - **Model**: Stories110M — dim=768, hidden=2048, heads=12, layers=12, vocab=32000, seq=256 - **109.53M params** (84.95M transformer + 24.58M embedding) -- **72 ANE kernels** per compile (60 weight-bearing, 12 weight-free sdpaBwd2) -- **6 kernel types per layer**: 
fwdAttn, fwdFFN, ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd +- **SDPA causal mask workaround**: ANE hardware ignores attn_mask — decompose into Q@K^T (ANE conv) + mask+softmax (CPU) + scores@V (ANE conv) -## Performance +## Three Training Pipelines -| Component | Time (ms/step) | -|-----------|---------------| -| ANE eval | 9.6 | -| IO (fp16 conversion) | 4.1 | -| Classifier (cblas) | 9.1 | -| Cross-entropy + residuals | 14.4 | -| RMSNorm | 0.1 | -| **Total** | **107 ms/step** | +### 1. Static Baseline (`train_large`) +Original pipeline. Weights baked as constants in MIL kernels — recompile every 10 steps via `exec()` restart. + +- 60 weight-bearing + 12 weight-free kernels = 72 per compile batch +- Classifier + softmax + RMSNorm backward on CPU +- **106.7 ms/step**, 7.6s compile per restart + +### 2. Static + ANE Extras (`train_large_ane`) — PR#19 +Offloads classifier forward (32K conv), softmax, final RMSNorm, and RMSNorm backward to ANE. Bridge API for C-callable ANE access. + +- 86 kernels per compile batch (+24 rmsnorm_bwd, +1 classifier, +1 finalRms) +- **91.8 ms/step** (14% faster), 9.6s compile per restart +- Use `--no-ane-extras` to disable and fall back to CPU (for debugging) + +### 3. Dynamic Weight Pipeline (`training_dynamic/`) +Weights passed via IOSurface spatial dimension — compile 9 kernels once at startup, no recompilation needed. 
+ +- 9 shared kernels across all 12 layers +- **111 ms/step**, 0.4s one-time compile +- No exec() restart, no compile limit issues + +## Performance Comparison (20 Steps) + +| | Static Baseline | PR#19 + ANE extras | PR#19 no extras | Dynamic | +|---|---|---|---|---| +| **Wall time** | **10.1s** | **11.7s** | **10.7s** | **~2.6s** | +| Compile | 7.6s (75.7%) | 9.6s (81.6%) | 7.5s (69.7%) | 0.4s (15%) | +| Train | 2.1s (21.2%) | 1.8s (15.6%) | 2.9s (27.4%) | 2.2s (85%) | +| **ms/step** | **106.7** | **91.8** | **147.0** | **111** | +| Kernels/restart | 72 | 86 | 60 | 9 (once) | +| ANE TFLOPS | 0.87 | 1.15 | 0.72 | — | +| Total TFLOPS | 1.63 | 1.90 | 1.19 | — | + +**Key insights:** +- Dynamic wins on wall time for any practical run length (3.9x faster at 20 steps) +- PR#19 has the best per-step throughput (92ms) but compile overhead dominates short runs +- Static restarts every 10 steps, so dynamic's zero-recompile advantage compounds ## Files | File | Description | |------|-------------| -| `train_large.m` | Main training loop — 12-layer forward/backward, checkpoint, exec() restart | -| `stories_config.h` | Model config, structs, alloc helpers | +| `train_large.m` | Static baseline — 72 kernels, classifier/softmax on CPU | +| `train_large_ane.m` | PR#19 — 86 kernels, classifier/softmax/rmsnorm_bwd on ANE | +| `training_dynamic/train.m` | Dynamic pipeline — 9 kernels, weights via IOSurface | +| `training_dynamic/mil_dynamic.h` | MIL generators for dynamic weight kernels | +| `training_dynamic/config.h` | Model config (DIM=768, HIDDEN=2048, etc.) 
| +| `training_dynamic/io.h` | IOSurface I/O + MIL compilation helpers | +| `training_dynamic/cpu_ops.h` | CPU ops (SiLU backward, cross-entropy, Adam) | +| `stories_config.h` | Static pipeline config, structs, alloc helpers | | `stories_io.h` | IOSurface I/O, NEON fp16 conversion, kernel compile/eval | -| `stories_mil.h` | MIL program generators for all 6 ANE kernel types | -| `stories_cpu_ops.h` | vDSP-vectorized RMSNorm, cross-entropy, Adam, embedding ops | -| `dashboard.py` | TUI dashboard — loss curve, power/CPU/memory graphs, text generation | -| `tokenize.py` | Extract pretokenized TinyStories data | +| `stories_mil.h` | MIL generators for static pipeline (6 kernel types) | +| `stories_cpu_ops.h` | vDSP-vectorized RMSNorm, cross-entropy, Adam | +| `ane_classifier.h` | ANE classifier fwd (32K conv), softmax kernels | +| `ane_rmsnorm_bwd.h` | ANE rmsnorm backward kernel | +| `dashboard.py` | TUI dashboard — loss curve, power/CPU/memory graphs | | `Makefile` | Build targets | -## How it works - -1. **Forward pass**: Each layer runs fwdAttn (QKV + SDPA + Wo) and fwdFFN (W1 + SiLU(W3) + W2) on ANE via MIL-compiled kernels. Final RMSNorm + classifier matmul on CPU (cblas). +## Usage -2. **Backward pass**: Reverse layer order. ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd on ANE. Weight gradients (dW) via async cblas_sgemm on CPU. RMSNorm backward via vDSP. +### 1. Download Training Data -3. **Compile budget**: ANE has a ~119 compile limit per process. With 72 kernels per batch, we run 10 accumulation steps then `exec()` restart with checkpoint resume. +```bash +bash download_data.sh +``` -4. **Data**: Real TinyStories text (20M tokens), mmap'd uint16 token IDs, random position sampling per step. +Downloads pretokenized TinyStories (Llama 2 BPE, 32K vocab) from HuggingFace. Produces `tinystories_data00.bin` (~41 MB, ~20M tokens). -## Usage +### 2. 
Build & Train ```bash -# Extract tokenized data -python3 tokenize.py +# Static baseline (classifier + softmax on CPU) +make train_large +./train_large stories110M.bin 256 100 1e-4 +./train_large --model stories110M.bin --steps 100 --lr 1e-4 +./train_large --data ./tinystories_data00.bin --steps 100 --lr 1e-4 + +# PR#19: ANE-offloaded classifier + softmax + rmsnorm_bwd +make train_large_ane +./train_large_ane stories110M.bin 256 100 1e-4 +./train_large_ane --no-ane-extras --steps 100 # disable ANE extras +./train_large_ane --data ./tinystories_data00.bin --steps 100 --lr 1e-4 + +# Dynamic pipeline (no recompilation) +cd training_dynamic && make train +./train --scratch # train from random init +./train # resume from checkpoint +./train --steps 200 --lr 1e-4 # custom steps/lr +``` -# Build and train -make train_large -./train_large # fresh start -./train_large --resume # resume from checkpoint +**CLI flags (`train_large` / `train_large_ane`):** +- `--steps N` (default 10000) +- `--lr F` (default 3e-4) +- `--model PATH` — pretrained weights file +- `--data PATH` — tokenized TinyStories `.bin` file (default: `tinystories_data00.bin`) +- `--ckpt PATH` — checkpoint file (preserved across exec() restarts) +- `--resume` — resume from checkpoint +- `--no-ane-extras` — (train_large_ane only) disable ANE classifier/softmax/rmsnorm_bwd -# Monitor with dashboard +### 3. Monitor with Dashboard + +```bash pip install blessed psutil numpy -python3 dashboard.py --resume # needs sudo for powermetrics +sudo python3 dashboard.py # static pipeline +sudo python3 dashboard.py --dynamic # dynamic pipeline +``` + +### 4. 
Benchmarking + +All programs print an **Efficiency Report** at completion: + +``` +=== Efficiency Report === +Total steps: 20 +Wall time: 11738 ms (11.7 s) +Compile time: 9583 ms (81.6%) +Train time: 1835 ms (15.6%) +Avg train: 91.8 ms/step +ANE TFLOPS: 1.15 sustained ``` -## Key techniques +## Key Techniques -- **NEON vectorized fp16<->fp32**: ARM NEON intrinsics for fast IOSurface data transfer +- **NEON vectorized fp16↔fp32**: ARM NEON intrinsics for fast IOSurface data transfer - **vDSP cross-entropy**: `vDSP_mtrans` + `vvexpf` + `vDSP_sve` — 8x faster than scalar - **Async weight gradients**: cblas_sgemm dispatched to background queue, overlapped with ANE -- **SDPA causal mask workaround**: ANE hardware ignores attn_mask, so we decompose attention into Q@K^T (ANE conv) + mask+softmax (CPU) + scores@V (ANE conv) +- **Vocab compaction** (dynamic): 32K → 9.2K active tokens, 3.5x reduction in classifier work +- **Dynamic weight packing**: Activations + weights concatenated in IOSurface spatial dimension — one kernel serves all 12 layers +- **exec() restart**: Workaround for ANE ~119 compile limit per process diff --git a/training/ane_classifier.h b/training/ane_classifier.h new file mode 100644 index 0000000..1b1b0e8 --- /dev/null +++ b/training/ane_classifier.h @@ -0,0 +1,102 @@ +// ane_classifier.h — MIL generators for classifier matmul and softmax on ANE +// Replaces classifier cblas_sgemm and cross-entropy softmax from CPU +#pragma once +#include "stories_mil.h" + +// ============================================================ +// Classifier forward: logits = embed @ x_final +// embed: [VOCAB, DIM] baked as conv weight [VOCAB, DIM, 1, 1] +// x: [1, DIM, 1, SEQ] input +// out: [1, VOCAB, 1, SEQ] logits +// +// VOCAB=32000 output channels — this is the largest conv we've attempted. +// If it fails, we'll need to tile into smaller chunks. 
+// ============================================================ +static NSString *gen_classifier_fwd(void) { + NSMutableString *m = [NSMutableString string]; + [m appendString:MIL_HDR]; + [m appendFormat:@" func main(tensor x) {\n", DIM, SEQ]; + [m appendString:@CONV_CONST]; + [m appendFormat:@" tensor We = const()[name=string(\"We\"), " + "val=tensor(BLOBFILE(path=string(\"@model_path/weights/embed.bin\"), offset=uint64(64)))];\n", + VOCAB, DIM, VOCAB, DIM]; + [m appendFormat:@" tensor out = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=We,x=x)[name=string(\"cls\")];\n", VOCAB, SEQ]; + [m appendString:@" } -> (out);\n}\n"]; + return m; +} + +// ============================================================ +// Classifier backward: dx = embed^T @ dlogits +// ANE rejects conv with 32000 input channels. +// Use matmul instead: reshape dlogits to [1, VOCAB, SEQ], +// bake embed^T as [1, DIM, VOCAB], matmul → [1, DIM, SEQ], +// reshape back to [1, DIM, 1, SEQ]. +// ============================================================ +static NSString *gen_classifier_bwd(void) { + NSMutableString *m = [NSMutableString string]; + [m appendString:MIL_HDR]; + [m appendFormat:@" func main(tensor dl) {\n", VOCAB, SEQ]; + // Reshape dlogits from [1, VOCAB, 1, SEQ] to [1, VOCAB, SEQ] + [m appendFormat:@" tensor sh3 = const()[name=string(\"sh3\"), val=tensor([1,%d,%d])];\n", VOCAB, SEQ]; + [m appendFormat:@" tensor dl3 = reshape(shape=sh3,x=dl)[name=string(\"rdl\")];\n", VOCAB, SEQ]; + // embed_t as baked constant [1, DIM, VOCAB] + [m appendFormat:@" tensor Wet = const()[name=string(\"Wet\"), " + "val=tensor(BLOBFILE(path=string(\"@model_path/weights/embed_t.bin\"), offset=uint64(64)))];\n", + DIM, VOCAB, DIM, VOCAB]; + // matmul: [1, DIM, VOCAB] @ [1, VOCAB, SEQ] -> [1, DIM, SEQ] + [m appendString:@" bool bF = const()[name=string(\"bF\"), val=bool(false)];\n"]; + [m appendFormat:@" tensor dx3 = 
matmul(transpose_x=bF,transpose_y=bF,x=Wet,y=dl3)[name=string(\"mm\")];\n", DIM, SEQ];
+    // Reshape back to [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<int32, [4]> sh4 = const()[name=string(\"sh4\"), val=tensor<int32, [4]>([1,%d,1,%d])];\n", DIM, SEQ];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> out = reshape(shape=sh4,x=dx3)[name=string(\"out\")];\n", DIM, SEQ];
+    [m appendString:@" } -> (out);\n}\n"];
+    return m;
+}
+
+// ============================================================
+// Softmax over VOCAB dimension (channel axis) for cross-entropy
+// Input: logits [1, VOCAB, 1, SEQ]
+// Output: probs [1, VOCAB, 1, SEQ]
+//
+// softmax(x, axis=1) = exp(x - max(x)) / sum(exp(x - max(x)))
+//
+// Note: After getting probs from ANE, the NLL loss + gradient
+// (prob[target] -= 1.0) are done on CPU since they need target indexing.
+// ============================================================
+static NSString *gen_softmax_vocab(void) {
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:MIL_HDR];
+    [m appendFormat:@" func main(tensor<fp16, [1,%d,1,%d]> x) {\n", VOCAB, SEQ];
+    [m appendString:@" int32 ax = const()[name=string(\"ax\"), val=int32(1)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> out = softmax(axis=ax,x=x)[name=string(\"sm\")];\n", VOCAB, SEQ];
+    [m appendString:@" } -> (out);\n}\n"];
+    return m;
+}
+
+// ============================================================
+// Final RMSNorm on ANE (replaces CPU rmsnorm for final layer)
+// Input: x [1, DIM, 1, SEQ]
+// Baked: rms_final weights [DIM]
+// Output: xn [1, DIM, 1, SEQ]
+// ============================================================
+static NSString *gen_final_rmsnorm(void) {
+    float invd = 1.0f/(float)DIM;
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:MIL_HDR];
+    [m appendFormat:@" func main(tensor<fp16, [1,%d,1,%d]> x) {\n", DIM, SEQ];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> sq = mul(x=x,y=x)[name=string(\"sq\")];\n", DIM, SEQ];
+    [m appendFormat:@" tensor<int32, [1]> rax = const()[name=string(\"rax\"), val=tensor<int32, [1]>([1])];\n"];
+    [m appendFormat:@" bool kd = 
const()[name=string(\"kd\"), val=bool(true)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss = reduce_sum(x=sq,axes=rax,keep_dims=kd)[name=string(\"ss\")];\n", SEQ];
+    [m appendFormat:@" fp16 invd = const()[name=string(\"invd\"), val=fp16(%f)];\n", invd];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss2 = mul(x=ss,y=invd)[name=string(\"ss2\")];\n", SEQ];
+    [m appendFormat:@" fp16 eps = const()[name=string(\"eps\"), val=fp16(0.00001)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss3 = add(x=ss2,y=eps)[name=string(\"ss3\")];\n", SEQ];
+    [m appendFormat:@" fp16 nhalf = const()[name=string(\"nhalf\"), val=fp16(-0.5)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> rrms = pow(x=ss3,y=nhalf)[name=string(\"rrms\")];\n", SEQ];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> xr = mul(x=x,y=rrms)[name=string(\"xr\")];\n", DIM, SEQ];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,1]> rw = const()[name=string(\"rw\"), val=tensor<fp16, [1,%d,1,1]>(BLOBFILE(path=string(\"@model_path/weights/rms_w.bin\"), offset=uint64(64)))];\n", DIM, DIM];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> out = mul(x=xr,y=rw)[name=string(\"out\")];\n", DIM, SEQ];
+    [m appendString:@" } -> (out);\n}\n"];
+    return m;
+}
diff --git a/training/ane_rmsnorm_bwd.h b/training/ane_rmsnorm_bwd.h
new file mode 100644
index 0000000..eb51896
--- /dev/null
+++ b/training/ane_rmsnorm_bwd.h
@@ -0,0 +1,78 @@
+// ane_rmsnorm_bwd.h — MIL generator for RMSNorm backward on ANE
+// Replaces CPU rmsnorm_bwd() from stories_cpu_ops.h
+//
+// RMSNorm forward: xn = x * rrms * w, where rrms = 1/sqrt(mean(x²) + eps)
+// RMSNorm backward: dx = rrms * (dy*w - x * sum(dy*w*x) * invd * rrms²)
+//
+// Input: concat(dy, x) as [1, 2*DIM, 1, SEQ]
+// Baked: RMSNorm weights w [1, DIM, 1, 1] as BLOBFILE
+// Output: dx [1, DIM, 1, SEQ]
+//
+// Note: dw (weight gradient) stays on CPU — it requires reduce_sum over SEQ
+// and accumulation across steps, which is cheap and better done on CPU.
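The backward expression is easy to get wrong (the weight w multiplies dy, but not the x correction term). As a sanity check, here is a small standalone sketch in plain Python, with toy dimensions and not part of the build, that compares the dx formula this graph implements against central finite differences:

```python
import math

D = 4          # toy feature dim (the real kernel uses DIM)
EPS = 1e-5

def rmsnorm(x, w):
    """Forward: y_i = x_i * rrms * w_i, rrms = 1/sqrt(mean(x^2) + eps)."""
    rrms = 1.0 / math.sqrt(sum(v * v for v in x) / D + EPS)
    return [xi * rrms * wi for xi, wi in zip(x, w)]

def rmsnorm_bwd(dy, x, w):
    """Backward as built above: dx = rrms * (dy*w - x * sum(dy*w*x)/D * rrms^2)."""
    rrms = 1.0 / math.sqrt(sum(v * v for v in x) / D + EPS)
    dot = sum(d * wi * xi for d, wi, xi in zip(dy, w, x))
    return [rrms * (d * wi - xi * dot / D * rrms * rrms)
            for d, wi, xi in zip(dy, w, x)]

x  = [0.3, -1.2, 0.7, 2.0]
w  = [1.1, 0.9, 1.0, 0.8]
dy = [0.5, -0.25, 1.0, 0.1]
dx = rmsnorm_bwd(dy, x, w)

# Central finite differences on the scalar L = sum(dy * rmsnorm(x, w))
h = 1e-6
for i in range(D):
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    lp = sum(d * y for d, y in zip(dy, rmsnorm(xp, w)))
    lm = sum(d * y for d, y in zip(dy, rmsnorm(xm, w)))
    fd = (lp - lm) / (2 * h)
    assert abs(dx[i] - fd) < 1e-6, (i, dx[i], fd)
```

The dw gradient is deliberately omitted here, matching the header's decision to keep it on CPU.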
+#pragma once
+#include "stories_mil.h"
+
+// Generate MIL for RMSNorm backward
+// Input: concat(dy, x) [1, 2*DIM, 1, SEQ]
+// Baked weights: rms_w [DIM] — the RMSNorm scale weights
+// Output: dx [1, DIM, 1, SEQ]
+static NSString *gen_rmsnorm_bwd(void) {
+    float invd = 1.0f / (float)DIM;
+    NSMutableString *m = [NSMutableString string];
+    [m appendString:MIL_HDR];
+
+    // Input: concat of dy and x along channel dimension
+    [m appendFormat:@" func main(tensor<fp16, [1,%d,1,%d]> inp) {\n", 2*DIM, SEQ];
+
+    // Slice out dy [1, DIM, 1, SEQ] and x [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<int32, [4]> sz = const()[name=string(\"sz\"), val=tensor<int32, [4]>([1,%d,1,%d])];\n", DIM, SEQ];
+    [m appendString:@" tensor<int32, [4]> b0 = const()[name=string(\"b0\"), val=tensor<int32, [4]>([0,0,0,0])];\n"];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> dy = slice_by_size(x=inp,begin=b0,size=sz)[name=string(\"sdy\")];\n", DIM, SEQ];
+    [m appendFormat:@" tensor<int32, [4]> b1 = const()[name=string(\"b1\"), val=tensor<int32, [4]>([0,%d,0,0])];\n", DIM];
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> x = slice_by_size(x=inp,begin=b1,size=sz)[name=string(\"sx\")];\n", DIM, SEQ];
+
+    // Step 1: Compute rrms = 1/sqrt(mean(x²) + eps)
+    // sq = x * x
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> sq = mul(x=x,y=x)[name=string(\"sq\")];\n", DIM, SEQ];
+    // ss = sum(sq, axis=1, keepdims=true) → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<int32, [1]> rax = const()[name=string(\"rax\"), val=tensor<int32, [1]>([1])];\n"];
+    [m appendFormat:@" bool kd = const()[name=string(\"kd\"), val=bool(true)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss = reduce_sum(x=sq,axes=rax,keep_dims=kd)[name=string(\"ss\")];\n", SEQ];
+    // ss2 = ss * invd + eps
+    [m appendFormat:@" fp16 invd = const()[name=string(\"invd\"), val=fp16(%f)];\n", invd];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss2 = mul(x=ss,y=invd)[name=string(\"ss2\")];\n", SEQ];
+    [m appendFormat:@" fp16 eps = const()[name=string(\"eps\"), val=fp16(0.00001)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> ss3 = add(x=ss2,y=eps)[name=string(\"ss3\")];\n", SEQ];
+    // rrms = pow(ss3, -0.5) → [1,1,1,SEQ]
+    [m appendFormat:@" fp16 nhalf = 
const()[name=string(\"nhalf\"), val=fp16(-0.5)];\n"];
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> rrms = pow(x=ss3,y=nhalf)[name=string(\"rrms\")];\n", SEQ];
+
+    // Step 2: Load RMSNorm weights w [1, DIM, 1, 1]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,1]> w = const()[name=string(\"w\"), val=tensor<fp16, [1,%d,1,1]>(BLOBFILE(path=string(\"@model_path/weights/rms_w.bin\"), offset=uint64(64)))];\n", DIM, DIM];
+
+    // Step 3: Compute dot = sum(dy * w * x, axis=1) * invd * rrms²
+    // dyw = dy * w → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> dyw = mul(x=dy,y=w)[name=string(\"dyw\")];\n", DIM, SEQ];
+    // dywx = dyw * x → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> dywx = mul(x=dyw,y=x)[name=string(\"dywx\")];\n", DIM, SEQ];
+    // dot_sum = sum(dywx, axis=1, keepdims=true) → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> dot_sum = reduce_sum(x=dywx,axes=rax,keep_dims=kd)[name=string(\"ds\")];\n", SEQ];
+    // dot_scaled = dot_sum * invd → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> dot_sc = mul(x=dot_sum,y=invd)[name=string(\"dsc\")];\n", SEQ];
+    // rrms_sq = rrms * rrms → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> rrms2 = mul(x=rrms,y=rrms)[name=string(\"rr2\")];\n", SEQ];
+    // coeff = dot_scaled * rrms_sq → [1,1,1,SEQ]
+    [m appendFormat:@" tensor<fp16, [1,1,1,%d]> coeff = mul(x=dot_sc,y=rrms2)[name=string(\"cof\")];\n", SEQ];
+
+    // Step 4: dx = (dy * w - x * coeff) * rrms
+    // x_coeff = x * coeff → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> xc = mul(x=x,y=coeff)[name=string(\"xc\")];\n", DIM, SEQ];
+    // diff = dyw - xc → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> diff = sub(x=dyw,y=xc)[name=string(\"dif\")];\n", DIM, SEQ];
+    // dx = diff * rrms → [1, DIM, 1, SEQ]
+    [m appendFormat:@" tensor<fp16, [1,%d,1,%d]> out = mul(x=diff,y=rrms)[name=string(\"out\")];\n", DIM, SEQ];
+
+    [m appendString:@" } -> (out);\n}\n"];
+    return m;
+}
diff --git a/training/ane_runtime.h b/training/ane_runtime.h
index 585d0f0..58bcb79 100644
--- a/training/ane_runtime.h
+++ b/training/ane_runtime.h
@@ -141,9 +141,14 @@ static void ane_read_output(ANEKernel *k, int idx, void *data, size_t 
bytes) { static bool ane_eval(ANEKernel *k) { NSError *e = nil; - return ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( k->model, @selector(evaluateWithQoS:options:request:error:), 21, @{}, k->request, &e); + if (!ok) { + fprintf(stderr, "ANE eval failed: %s\n", + e ? [[e description] UTF8String] : "unknown error"); + } + return ok; } static void ane_free(ANEKernel *k) { diff --git a/training/dashboard.py b/training/dashboard.py index a3a1503..18203d7 100644 --- a/training/dashboard.py +++ b/training/dashboard.py @@ -1,6 +1,6 @@ """TUI dashboard for ANE training (train_large). Uses blessed for terminal UI.""" -import argparse, fcntl, math, os, re, select, signal, struct, subprocess, sys, time, threading +import argparse, fcntl, json, math, os, re, select, signal, struct, subprocess, sys, time, threading from collections import deque from pathlib import Path @@ -20,7 +20,9 @@ DIM, HIDDEN, HEADS, SEQ, VOCAB, NLAYERS = 768, 2048, 12, 256, 32000, 12 HD = DIM // HEADS -CKPT_PATH = 'ane_stories110M_ckpt.bin' +CKPT_PATH_STATIC = 'ane_stories110M_ckpt.bin' +CKPT_PATH_DYNAMIC = 'training_dynamic/ane_stories110M_dyn_ckpt.bin' +CKPT_PATH = CKPT_PATH_STATIC # set in main() based on --dynamic TOKENIZER_PATH = str(Path(__file__).resolve().parent.parent.parent / 'assets' / 'models' / 'tokenizer.bin') @@ -56,6 +58,9 @@ def __init__(self): self.mem_mb_history = deque(maxlen=300) self.proc_mem_mb_history = deque(maxlen=300) self.train_pid = None + self.step_timestamps = [] # (step, time.monotonic()) for running ms/step + self.train_start = None # wall clock when first step seen + self.compile_ms = 0.0 # total compile time S = State() @@ -142,7 +147,7 @@ def softmax(x): e = np.exp(x) return e / np.sum(e) -def generate_text(W, tok, max_tokens=64, temperature=0.8): +def generate_text(W, max_tokens=64, temperature=0.8): tokenizer = get_tokenizer() if tokenizer is None: return '[no tokenizer]' @@ 
-244,7 +249,7 @@ def generation_thread(): with S.gen_lock: S.gen_status = 'idle' continue - text = generate_text(W, get_tokenizer(), max_tokens=64, temperature=0.8) + text = generate_text(W, max_tokens=64, temperature=0.8) with S.gen_lock: S.gen_text = text S.gen_step = S.step @@ -278,23 +283,69 @@ def sysmetrics_thread(): RE_CONFIG = re.compile(r'dim=(\d+) hidden=(\d+) heads=(\d+) seq=(\d+) vocab=(\d+) layers=(\d+)') RE_PARAMS = re.compile(r'Params: ([\d.]+)M \(transformer ([\d.]+)M \+ embed ([\d.]+)M\)') RE_KERNELS = re.compile(r'Kernels: (\d+).*?(\d+) weight-bearing') +RE_KERNELS_DYN = re.compile(r'Kernels: (\d+) compiled, (\d+) weight-bearing') RE_ACCUM = re.compile(r'Accum (\d+).*LR=([\d.e+-]+)') -RE_STEP = re.compile(r'step\s+(\d+)\s+loss=([\d.]+)') +RE_STEP = re.compile(r'step\s+(\d+)\s+loss=([\d.]+)(?:\s+lr=([\d.e+-]+))?(?:\s+([\d.]+)ms/step)?') RE_BATCH = re.compile(r'\[batch (\d+): compile=([\d.]+)ms train=([\d.]+)ms \(([\d.]+)ms/step\) compiles=(\d+)\]') RE_TIMING = re.compile(r'ane=([\d.]+) io=([\d.]+) cls=([\d.]+) elem=([\d.]+) rms=([\d.]+) cblas_wait=([\d.]+)') +RE_TIMING_DYN = re.compile(r'ane_fwd=([\d.]+) io_fwd=([\d.]+) rms=([\d.]+) ane_bwd=([\d.]+) io_bwd=([\d.]+) silu=([\d.]+) rms_bwd=([\d.]+) cls=([\d.]+) cblas_wait=([\d.]+) dw_copy=([\d.]+)') RE_RESTART = re.compile(r'\[exec\(\) restart step (\d+)') RE_RESUME = re.compile(r'\[RESUMED step (\d+), loss=([\d.]+)\]') RE_FLOPS = re.compile(r'FLOPs/step: fwd=([\d.]+)M bwd_dx=([\d.]+)M bwd_dW=([\d.]+)M sdpa_bwd=([\d.]+)M total=([\d.]+)M') RE_ANE_FLOPS = re.compile(r'ANE FLOPs/step: ([\d.]+)M') RE_ANE_TFLOPS = re.compile(r'ANE TFLOPS:\s+([\d.]+)') RE_ANE_UTIL = re.compile(r'ANE utilization:\s+([\d.]+)%') -RE_EFFICIENCY = re.compile(r'(Total steps|Wall time|Compile time|Train time|Avg compile|Avg train|ANE TFLOPS|Total TFLOPS|ANE utilization):?\s+(.+)') +RE_EFFICIENCY = re.compile(r'(Total steps|Wall time|Compile time|Compile|Train time|Avg compile|Avg train|ANE TFLOPS|Total TFLOPS|ANE 
utilization):?\s+(.+)') +RE_COMPILED = re.compile(r'Compiled (\d+) kernels in (\d+)ms') RE_ANE_POWER = re.compile(r'ANE Power:\s+([\d.]+)\s*mW') RE_CPU_POWER = re.compile(r'CPU Power:\s+([\d.]+)\s*mW') RE_GPU_POWER = re.compile(r'GPU Power:\s+([\d.]+)\s*mW') def parse_line(line): S.logs.append(line) + # Parse JSON lines from static pipeline ({"type":"step",...} or {"type":"batch",...}) + stripped = line.strip() + if stripped.startswith('{'): + try: + j = json.loads(stripped) + jt = j.get('type') + if jt == 'step': + S.step, S.loss = j['step'], j['loss'] + S.loss_history.append((S.step, S.loss)) + S.best_loss = min(S.best_loss, S.loss) + S.compiles = j.get('compiles', S.compiles) + now = time.monotonic() + if S.train_start is None: + S.train_start = now + S.step_timestamps.append((S.step, now)) + if len(S.step_timestamps) >= 2: + dt = S.step_timestamps[-1][1] - S.step_timestamps[-2][1] + if dt > 0: + S.ms_per_step = dt * 1000 + # Extract component timing from JSON + ct = {} + for k in ('t_ane', 't_io', 't_cls', 't_elem', 't_rms', 't_cblas_wait'): + if k in j: + ct[k[2:]] = j[k] # strip 't_' prefix + if ct: + S.component_timing = ct + return + elif jt == 'batch': + S.batch_num = j.get('batch', S.batch_num) + compile_ms = j.get('compile_ms', 0) + train_ms = j.get('train_ms', 0) + S.ms_per_step = j.get('ms_per_step', S.ms_per_step) + S.compile_ms += compile_ms + S.compile_pct = 100 * S.compile_ms / (S.compile_ms + train_ms) if S.compile_ms + train_ms > 0 else 0 + return + elif jt == 'perf': + if 'ane_tflops' in j: + S.flops['ane_tflops'] = j['ane_tflops'] + if 'ane_util_pct' in j: + S.flops['ane_util'] = j['ane_util_pct'] + return + except (json.JSONDecodeError, KeyError): + pass m = RE_CONFIG.search(line) if m: S.model_config = dict(zip(['dim', 'hidden', 'heads', 'seq', 'vocab', 'layers'], map(int, m.groups()))) @@ -303,7 +354,7 @@ def parse_line(line): if m: S.params = {'total': float(m[1]), 'transformer': float(m[2]), 'embed': float(m[3])} return - m = 
RE_KERNELS.search(line) + m = RE_KERNELS_DYN.search(line) or RE_KERNELS.search(line) if m: S.kernels = {'total': int(m[1]), 'weight_bearing': int(m[2])} return @@ -323,6 +374,18 @@ def parse_line(line): m = RE_STEP.search(line) if m: S.step, S.loss = int(m[1]), float(m[2]) + if m[3]: + S.training['lr'] = m[3] + if m[4]: + S.ms_per_step = float(m[4]) + now = time.monotonic() + if S.train_start is None: + S.train_start = now + S.step_timestamps.append((S.step, now)) + if not m[4] and len(S.step_timestamps) >= 2: + dt = S.step_timestamps[-1][1] - S.step_timestamps[-2][1] + if dt > 0: + S.ms_per_step = dt * 1000 S.loss_history.append((S.step, S.loss)) S.best_loss = min(S.best_loss, S.loss) return @@ -334,6 +397,16 @@ def parse_line(line): S.compiles = int(m[5]) S.compile_pct = 100 * compile_ms / (compile_ms + train_ms) if compile_ms + train_ms > 0 else 0 return + m = RE_TIMING_DYN.search(line) + if m: + vals = list(map(float, m.groups())) + S.component_timing = { + 'ane_fwd': vals[0], 'io_fwd': vals[1], 'rms': vals[2], + 'ane_bwd': vals[3], 'io_bwd': vals[4], 'silu': vals[5], + 'rms_bwd': vals[6], 'cls': vals[7], 'cblas_wait': vals[8], 'dw_copy': vals[9], + '_dynamic': True + } + return m = RE_TIMING.search(line) if m: S.component_timing = dict(zip(['ane', 'io', 'cls', 'elem', 'rms', 'cblas_wait'], map(float, m.groups()))) @@ -346,6 +419,11 @@ def parse_line(line): if m: S.flops['ane_util'] = float(m[1]) return + m = RE_COMPILED.search(line) + if m: + S.compiles = int(m[1]) + S.compile_ms += float(m[2]) + return m = RE_EFFICIENCY.search(line) if m: S.efficiency[m[1].strip()] = m[2].strip() @@ -514,23 +592,49 @@ def put(y, x, text, style=''): # Training stats (right panel) sr = row step_str = f'{S.step}' + (f'/{S.total_steps}' if S.total_steps and S.total_steps < 999999 else '') - put(sr, mid_x + 1, f' Step: {step_str} Loss: {S.loss:.4f}' if S.loss else ' Step: --', term.yellow) + # Elapsed time + elapsed = 0.0 + if S.train_start: + elapsed = time.monotonic() - 
S.train_start + elapsed_str = f'{elapsed:.1f}s' if elapsed < 60 else f'{elapsed/60:.1f}m' + put(sr, mid_x + 1, f' Step: {step_str} Loss: {S.loss:.4f} [{elapsed_str}]' if S.loss else ' Step: --', term.yellow) sr += 1 - put(sr, mid_x + 1, f' Best: {S.best_loss:.4f} ms/step: {S.ms_per_step:.1f}' if S.best_loss < float('inf') else ' Best: --') + # ms/step + steps/sec + sps = 1000.0 / S.ms_per_step if S.ms_per_step > 0 else 0 + put(sr, mid_x + 1, f' Best: {S.best_loss:.4f} {S.ms_per_step:.1f}ms/step ({sps:.1f} steps/s)' if S.best_loss < float('inf') else ' Best: --') sr += 1 + # TFLOPS ane_tflops = S.flops.get('ane_tflops', 0) ane_util = S.flops.get('ane_util', 0) + total_tflops = 0 + if S.ms_per_step > 0 and S.flops.get('ane', 0) > 0: + if not ane_tflops: + ane_tflops = (S.flops['ane'] * 1e6) / (S.ms_per_step * 1e-3) / 1e12 + total_tflops = (S.flops.get('total', 0) * 1e6) / (S.ms_per_step * 1e-3) / 1e12 + if not ane_util and ane_tflops: + ane_util = 100.0 * ane_tflops / 15.8 + compile_str = f' Compile: {S.compile_ms/1000:.1f}s' if S.compile_ms > 0 else '' if ane_tflops: - put(sr, mid_x + 1, f' ANE: {ane_tflops:.2f}T Compile: {S.compile_pct:.0f}% Util: {ane_util:.1f}%') - else: - put(sr, mid_x + 1, f' Compile: {S.compile_pct:.0f}%') + tflops_str = f' ANE: {ane_tflops:.2f}T' + if total_tflops: + tflops_str += f' Total: {total_tflops:.2f}T' + tflops_str += f' Util: {ane_util:.1f}%{compile_str}' + put(sr, mid_x + 1, tflops_str) + elif compile_str: + put(sr, mid_x + 1, f'{compile_str}') sr += 1 ct = S.component_timing if ct: - put(sr, mid_x + 1, f' ane={ct.get("ane", 0):.1f} io={ct.get("io", 0):.1f} cls={ct.get("cls", 0):.1f} elem={ct.get("elem", 0):.1f}') - sr += 1 - put(sr, mid_x + 1, f' rms={ct.get("rms", 0):.1f} cblas_wait={ct.get("cblas_wait", 0):.1f} ms/step') - sr += 1 + if ct.get('_dynamic'): + put(sr, mid_x + 1, f' fwd={ct.get("ane_fwd",0):.1f} bwd={ct.get("ane_bwd",0):.1f} io={ct.get("io_fwd",0)+ct.get("io_bwd",0):.1f} silu={ct.get("silu",0):.1f}') + sr += 1 + 
put(sr, mid_x + 1, f' cls={ct.get("cls",0):.1f} rms={ct.get("rms",0)+ct.get("rms_bwd",0):.1f} dw={ct.get("dw_copy",0):.1f} ms/step') + sr += 1 + else: + put(sr, mid_x + 1, f' ane={ct.get("ane", 0):.1f} io={ct.get("io", 0):.1f} cls={ct.get("cls", 0):.1f} elem={ct.get("elem", 0):.1f}') + sr += 1 + put(sr, mid_x + 1, f' rms={ct.get("rms", 0):.1f} cblas_wait={ct.get("cblas_wait", 0):.1f} ms/step') + sr += 1 pw = S.power if any(pw.values()): put(sr, mid_x + 1, '\u2500 Power ' + '\u2500' * max(0, right_w - 9), term.cyan) @@ -659,10 +763,24 @@ def set_nonblock(fd): fl = fcntl.fcntl(fd, fcntl.F_GETFL) fcntl.fcntl(fd, fcntl.F_SETFL, fl | os.O_NONBLOCK) -def spawn_training(resume=False, steps=10000): - cmd = 'make train_large 2>&1 && ./train_large' +def spawn_training(resume=False, steps=10000, dynamic=False, ane=False, scratch=False, + lr=None, accum=None, no_ane_extras=False): + if dynamic: + cmd = 'cd training_dynamic && make 2>&1 && ./train' + elif ane: + cmd = 'make train_large_ane 2>&1 && ./train_large_ane' + else: + cmd = 'make train_large 2>&1 && ./train_large' if resume: cmd += ' --resume' + if scratch and dynamic: + cmd += ' --scratch' + if lr is not None: + cmd += f' --lr {lr}' + if accum is not None and dynamic: + cmd += f' --accum {accum}' + if no_ane_extras and ane: + cmd += ' --no-ane-extras' cmd += f' --steps {steps}' proc = subprocess.Popen( ['bash', '-c', cmd], @@ -672,6 +790,8 @@ def spawn_training(resume=False, steps=10000): return proc def spawn_powermetrics(): + if not sys.stdin.isatty(): + return None try: proc = subprocess.Popen( ['sudo', 'powermetrics', '--samplers', 'cpu_power,gpu_power,ane_power', '-i', '1000'], @@ -684,6 +804,12 @@ def spawn_powermetrics(): def main(): parser = argparse.ArgumentParser(description='ANE Training Dashboard (stories110M)') parser.add_argument('--resume', action='store_true', help='Resume from checkpoint') + parser.add_argument('--dynamic', action='store_true', help='Dynamic weight pipeline (training_dynamic/)') + 
parser.add_argument('--ane', action='store_true', help='PR#19: ANE-offloaded classifier/softmax/rmsnorm_bwd') + parser.add_argument('--no-ane-extras', action='store_true', help='Disable ANE extras (use with --ane)') + parser.add_argument('--scratch', action='store_true', help='Train from scratch (random init)') + parser.add_argument('--lr', type=float, default=None, help='Learning rate') + parser.add_argument('--accum', type=int, default=None, help='Gradient accumulation steps') parser.add_argument('--infinite', action='store_true', help='Train indefinitely') parser.add_argument('--no-powermetrics', action='store_true') parser.add_argument('--no-generate', action='store_true', help='Disable text generation') @@ -694,10 +820,15 @@ def main(): args.steps = 999999999 S.total_steps = args.steps + global CKPT_PATH + CKPT_PATH = CKPT_PATH_DYNAMIC if args.dynamic else CKPT_PATH_STATIC + term = Terminal() procs = [] - train_proc = spawn_training(resume=args.resume, steps=args.steps) + train_proc = spawn_training(resume=args.resume, steps=args.steps, dynamic=args.dynamic, + scratch=args.scratch, lr=args.lr, accum=args.accum, + ane=args.ane, no_ane_extras=args.no_ane_extras) S.train_pid = train_proc.pid procs.append(train_proc) @@ -837,7 +968,9 @@ def on_resize(*a): if train_proc: train_proc.terminate() train_proc.wait() - train_proc = spawn_training(resume=True, steps=args.steps) + train_proc = spawn_training(resume=True, steps=args.steps, dynamic=args.dynamic, + lr=args.lr, accum=args.accum, + ane=args.ane, no_ane_extras=args.no_ane_extras) S.train_pid = train_proc.pid procs = [p for p in procs if p.poll() is None] procs.append(train_proc) @@ -851,7 +984,7 @@ def force_gen(): try: W = load_weights_from_ckpt(CKPT_PATH) if W: - text = generate_text(W, get_tokenizer(), max_tokens=64, temperature=0.8) + text = generate_text(W, max_tokens=64, temperature=0.8) with S.gen_lock: S.gen_text = text S.gen_step = S.step diff --git a/training/download_data.sh 
b/training/download_data.sh new file mode 100755 index 0000000..2d27d96 --- /dev/null +++ b/training/download_data.sh @@ -0,0 +1,91 @@ +#!/bin/bash +# Download pretokenized TinyStories data for ANE training +# Format: flat uint16 token IDs (Llama2 BPE, 32K vocab) +# Source: enio/TinyStories on HuggingFace (pretokenized with karpathy/llama2.c) +# +# The tar.gz contains data00.bin..data49.bin (50 shards). +# We extract only data00.bin and rename it to tinystories_data00.bin. + +set -e + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +OUTPUT="$SCRIPT_DIR/tinystories_data00.bin" + +if [ -f "$OUTPUT" ]; then + SIZE=$(stat -f%z "$OUTPUT" 2>/dev/null || stat -c%s "$OUTPUT" 2>/dev/null) + TOKENS=$((SIZE / 2)) + echo "$OUTPUT already exists ($TOKENS tokens, $(echo "scale=1; $SIZE/1000000" | bc) MB)" + exit 0 +fi + +TAR_URL="https://huggingface.co/datasets/enio/TinyStories/resolve/main/tok32000/TinyStories_tok32000.tar.gz?download=true" +TAR_FILE="$SCRIPT_DIR/TinyStories_tok32000.tar.gz" + +echo "=== TinyStories Data Download ===" +echo "Downloading pretokenized TinyStories (32K vocab, ~993 MB)..." +echo " Source: enio/TinyStories on HuggingFace" +echo " This will take a few minutes depending on your connection." +echo "" + +# Download the tar.gz +if [ ! -f "$TAR_FILE" ]; then + if command -v curl &>/dev/null; then + curl -L --progress-bar -o "$TAR_FILE" "$TAR_URL" + elif command -v wget &>/dev/null; then + wget --show-progress -O "$TAR_FILE" "$TAR_URL" + else + echo "Error: need curl or wget" + exit 1 + fi +else + echo "Tar file already downloaded, skipping..." +fi + +# Verify it's actually a gzip file (not an error page) +if ! file "$TAR_FILE" | grep -q "gzip"; then + echo "Error: Downloaded file is not a valid gzip archive." + echo "Content: $(head -c 100 "$TAR_FILE")" + rm -f "$TAR_FILE" + exit 1 +fi + +echo "" +echo "Extracting data00.bin from archive..." 
+ +# List what's in the archive to find the right path +DATA_FILE=$(tar tzf "$TAR_FILE" 2>/dev/null | grep 'data00\.bin' | head -1) +if [ -z "$DATA_FILE" ]; then + echo "Error: data00.bin not found in archive. Contents:" + tar tzf "$TAR_FILE" | head -20 + exit 1 +fi +echo " Found: $DATA_FILE" + +# Extract just data00.bin +tar xzf "$TAR_FILE" -C "$SCRIPT_DIR" "$DATA_FILE" + +# Move to expected location (might be in a subdirectory) +EXTRACTED="$SCRIPT_DIR/$DATA_FILE" +if [ "$EXTRACTED" != "$OUTPUT" ]; then + mv "$EXTRACTED" "$OUTPUT" + # Clean up any extracted subdirectories + rmdir "$(dirname "$EXTRACTED")" 2>/dev/null || true +fi + +# Clean up tar.gz to save disk space +echo "Cleaning up archive..." +rm -f "$TAR_FILE" + +SIZE=$(stat -f%z "$OUTPUT" 2>/dev/null || stat -c%s "$OUTPUT" 2>/dev/null) +TOKENS=$((SIZE / 2)) +echo "" +echo "Done: $OUTPUT" +echo " $TOKENS tokens ($(echo "scale=1; $SIZE/1000000" | bc) MB)" + +# Sanity check +python3 -c " +import struct +with open('$OUTPUT', 'rb') as f: + tokens = struct.unpack('<10H', f.read(20)) + print(f'First 10 tokens: {tokens}') +" 2>/dev/null || true diff --git a/training/forward.h b/training/forward.h index adcf898..1a2a31f 100644 --- a/training/forward.h +++ b/training/forward.h @@ -7,7 +7,7 @@ // ANE conv eval: input [S, in_dim] row-major → transpose to [in_dim, S] channels-first // ANE computes conv(W, x) with baked W → output [out_dim, S] // Transpose back to [S, out_dim] row-major -static void ane_conv_eval(ANEKernel *kernel, const float *x, float *y, +static bool ane_conv_eval(ANEKernel *kernel, const float *x, float *y, int S, int in_dim, int out_dim) { float *x_t = (float*)malloc(S * in_dim * sizeof(float)); for (int t = 0; t < S; t++) @@ -15,7 +15,11 @@ static void ane_conv_eval(ANEKernel *kernel, const float *x, float *y, x_t[i*S + t] = x[t*in_dim + i]; ane_write_input(kernel, 0, x_t, S * in_dim * sizeof(float)); - ane_eval(kernel); + bool ok = ane_eval(kernel); + if (!ok) { + free(x_t); + return false; + } 
float *y_t = (float*)malloc(S * out_dim * sizeof(float)); ane_read_output(kernel, 0, y_t, S * out_dim * sizeof(float)); @@ -25,6 +29,7 @@ static void ane_conv_eval(ANEKernel *kernel, const float *x, float *y, y[t*out_dim + i] = y_t[i*S + t]; free(x_t); free(y_t); + return true; } // CPU matmul fallback: y = W @ x, W[out_dim, in_dim], x[S, in_dim] → y[S, out_dim] diff --git a/training/gradient_checkpoint.h b/training/gradient_checkpoint.h new file mode 100644 index 0000000..29a5aa0 --- /dev/null +++ b/training/gradient_checkpoint.h @@ -0,0 +1,165 @@ +// gradient_checkpoint.h — Activation checkpointing for deep models +// Trades compute for memory: recompute forward activations during backward +// instead of storing all layers' activations simultaneously +#pragma once +#include "model_config.h" + +// ===== Checkpoint policies ===== + +typedef enum { + CKPT_ALL, // save all layers' activations (current behavior) + CKPT_BOUNDARY, // save only group boundary activations, recompute within group + CKPT_SQRT, // save every √N layers (optimal memory/compute tradeoff) + CKPT_EVERY_N, // save every N-th layer (configurable) + CKPT_NONE // save nothing, recompute everything (max memory savings) +} CheckpointPolicy; + +typedef struct { + CheckpointPolicy policy; + int interval; // for CKPT_EVERY_N: save every N layers + int n_layers; // total layers in model + int n_groups; // layer groups in pipeline + int layers_per_group; // layers per group (from pipeline plan) + // Derived + int n_checkpointed; // how many layers have saved activations + bool *is_saved; // per-layer: true if activation is saved (not recomputed) +} CheckpointManager; + +// ===== Initialization ===== + +// custom_interval: used for CKPT_EVERY_N (pass 0 for default=4, ignored for other policies) +static CheckpointManager checkpoint_init(CheckpointPolicy policy, const ModelConfig *cfg, + const PipelinePlan *plan, int custom_interval) { + CheckpointManager cm = {0}; + cm.policy = policy; + cm.n_layers = 
cfg->dims.n_layers; + cm.n_groups = plan->n_groups; + cm.layers_per_group = (plan->n_groups > 0) ? plan->groups[0].n_layers : cfg->dims.n_layers; + cm.is_saved = (bool *)calloc(cfg->dims.n_layers, sizeof(bool)); + + switch (policy) { + case CKPT_ALL: + for (int i = 0; i < cm.n_layers; i++) cm.is_saved[i] = true; + break; + + case CKPT_BOUNDARY: + for (int g = 0; g < plan->n_groups; g++) { + cm.is_saved[plan->groups[g].start_layer] = true; + } + cm.is_saved[cm.n_layers - 1] = true; + break; + + case CKPT_SQRT: { + int interval = (int)sqrtf((float)cm.n_layers); + if (interval < 1) interval = 1; + cm.interval = interval; + for (int i = 0; i < cm.n_layers; i += interval) cm.is_saved[i] = true; + cm.is_saved[cm.n_layers - 1] = true; + break; + } + + case CKPT_EVERY_N: + cm.interval = (custom_interval > 0) ? custom_interval : 4; + for (int i = 0; i < cm.n_layers; i += cm.interval) cm.is_saved[i] = true; + cm.is_saved[cm.n_layers - 1] = true; + break; + + case CKPT_NONE: + cm.is_saved[0] = true; + break; + } + + // Count actual saved layers — single source of truth, no fragile arithmetic + cm.n_checkpointed = 0; + for (int i = 0; i < cm.n_layers; i++) { + if (cm.is_saved[i]) cm.n_checkpointed++; + } + + return cm; +} + +static void checkpoint_free(CheckpointManager *cm) { + free(cm->is_saved); + cm->is_saved = NULL; +} + +// ===== Query functions ===== + +// Should we save this layer's activations during forward pass? +static bool checkpoint_should_save(const CheckpointManager *cm, int layer_idx) { + if (layer_idx < 0 || layer_idx >= cm->n_layers) return false; + return cm->is_saved[layer_idx]; +} + +// Does this layer need forward recompute during backward pass? 
+static bool checkpoint_needs_recompute(const CheckpointManager *cm, int layer_idx) { + return !checkpoint_should_save(cm, layer_idx); +} + +// Find the nearest saved checkpoint before this layer (for recompute starting point) +static int checkpoint_nearest_saved_before(const CheckpointManager *cm, int layer_idx) { + for (int i = layer_idx; i >= 0; i--) { + if (cm->is_saved[i]) return i; + } + return 0; // fallback to first layer +} + +// How many layers need recompute between the nearest checkpoint and this layer? +static int checkpoint_recompute_depth(const CheckpointManager *cm, int layer_idx) { + int saved = checkpoint_nearest_saved_before(cm, layer_idx); + return layer_idx - saved; +} + +// ===== Memory estimation ===== + +// Memory for saved activations only (bytes) +static size_t checkpoint_saved_memory(const CheckpointManager *cm, const ModelDims *d) { + return (size_t)cm->n_checkpointed * layer_activation_bytes(d); +} + +// Memory savings vs. saving all layers (bytes) +static size_t checkpoint_memory_saved(const CheckpointManager *cm, const ModelDims *d) { + size_t all = (size_t)cm->n_layers * layer_activation_bytes(d); + size_t used = checkpoint_saved_memory(cm, d); + return all - used; +} + +// Extra forward FLOPs due to recompute (fraction of 1.0) +static double checkpoint_recompute_overhead(const CheckpointManager *cm) { + int recomputed = cm->n_layers - cm->n_checkpointed; + return (double)recomputed / (double)cm->n_layers; +} + +// ===== Pretty-print ===== + +static const char *checkpoint_policy_name(CheckpointPolicy p) { + switch (p) { + case CKPT_ALL: return "ALL"; + case CKPT_BOUNDARY: return "BOUNDARY"; + case CKPT_SQRT: return "SQRT"; + case CKPT_EVERY_N: return "EVERY_N"; + case CKPT_NONE: return "NONE"; + default: return "UNKNOWN"; + } +} + +static void checkpoint_print(const CheckpointManager *cm, const ModelDims *d) { + printf("=== Checkpoint Policy: %s ===\n", checkpoint_policy_name(cm->policy)); + printf(" %d/%d layers checkpointed", 
cm->n_checkpointed, cm->n_layers); + if (cm->policy == CKPT_SQRT || cm->policy == CKPT_EVERY_N) + printf(" (interval=%d)", cm->interval); + printf("\n"); + printf(" Activation memory: %.1fMB (saved) / %.1fMB (all)\n", + checkpoint_saved_memory(cm, d) / 1e6, + (double)cm->n_layers * layer_activation_bytes(d) / 1e6); + printf(" Memory savings: %.1fMB (%.0f%%)\n", + checkpoint_memory_saved(cm, d) / 1e6, + 100.0 * checkpoint_memory_saved(cm, d) / ((double)cm->n_layers * layer_activation_bytes(d))); + printf(" Recompute overhead: %.0f%% extra forward FLOPs\n", + 100.0 * checkpoint_recompute_overhead(cm)); + printf(" Saved layers: "); + for (int i = 0; i < cm->n_layers; i++) { + if (cm->is_saved[i]) printf("%d ", i); + } + printf("\n"); +} diff --git a/training/model.h b/training/model.h index 6cee52f..4e68ebc 100644 --- a/training/model.h +++ b/training/model.h @@ -78,7 +78,10 @@ typedef struct { static int model_load_weights(Model *m, const char *path) { FILE *f = fopen(path, "rb"); if (!f) { fprintf(stderr, "Cannot open %s\n", path); return -1; } - fread(&m->cfg, sizeof(Config), 1, f); + if (fread(&m->cfg, sizeof(Config), 1, f) != 1) { + fprintf(stderr, "ERROR: failed to read config from %s\n", path); + fclose(f); return -1; + } bool shared = m->cfg.vocab_size > 0; if (m->cfg.vocab_size < 0) m->cfg.vocab_size = -m->cfg.vocab_size; @@ -89,7 +92,10 @@ static int model_load_weights(Model *m, const char *path) { int d = m->cfg.dim, hd = m->cfg.hidden_dim, nl = m->cfg.n_layers, vs = m->cfg.vocab_size; m->token_embedding = (float*)malloc(vs * d * sizeof(float)); - fread(m->token_embedding, sizeof(float), vs * d, f); + if (fread(m->token_embedding, sizeof(float), vs * d, f) != (size_t)(vs * d)) { + fprintf(stderr, "ERROR: short read on token_embedding (file truncated?)\n"); + fclose(f); return -1; + } float *rms_att_all = (float*)malloc(nl * d * sizeof(float)); float *wq_all = (float*)malloc(nl * d * d * sizeof(float)); @@ -101,15 +107,24 @@ static int 
model_load_weights(Model *m, const char *path) { float *w2_all = (float*)malloc(nl * d * hd * sizeof(float)); float *w3_all = (float*)malloc(nl * hd * d * sizeof(float)); - fread(rms_att_all, sizeof(float), nl * d, f); - fread(wq_all, sizeof(float), nl * d * d, f); - fread(wk_all, sizeof(float), nl * d * d, f); - fread(wv_all, sizeof(float), nl * d * d, f); - fread(wo_all, sizeof(float), nl * d * d, f); - fread(rms_ffn_all, sizeof(float), nl * d, f); - fread(w1_all, sizeof(float), nl * hd * d, f); - fread(w2_all, sizeof(float), nl * d * hd, f); - fread(w3_all, sizeof(float), nl * hd * d, f); + #define FREAD_CHECK(buf, count, file, label) do { \ + size_t _n = fread(buf, sizeof(float), count, file); \ + if (_n != (size_t)(count)) { \ + fprintf(stderr, "ERROR: short read on %s: got %zu, expected %zu (file truncated?)\n", \ + label, _n, (size_t)(count)); \ + fclose(file); return -1; \ + } \ + } while(0) + + FREAD_CHECK(rms_att_all, nl * d, f, "rms_att"); + FREAD_CHECK(wq_all, nl * d * d, f, "wq"); + FREAD_CHECK(wk_all, nl * d * d, f, "wk"); + FREAD_CHECK(wv_all, nl * d * d, f, "wv"); + FREAD_CHECK(wo_all, nl * d * d, f, "wo"); + FREAD_CHECK(rms_ffn_all, nl * d, f, "rms_ffn"); + FREAD_CHECK(w1_all, nl * hd * d, f, "w1"); + FREAD_CHECK(w2_all, nl * d * hd, f, "w2"); + FREAD_CHECK(w3_all, nl * hd * d, f, "w3"); for (int l = 0; l < nl; l++) { m->rms_att_w[l] = (float*)malloc(d * sizeof(float)); @@ -135,14 +150,15 @@ static int model_load_weights(Model *m, const char *path) { free(rms_ffn_all); free(w1_all); free(w2_all); free(w3_all); m->rms_final_w = (float*)malloc(d * sizeof(float)); - fread(m->rms_final_w, sizeof(float), d, f); + FREAD_CHECK(m->rms_final_w, d, f, "rms_final"); if (shared) { m->wcls = m->token_embedding; } else { m->wcls = (float*)malloc(vs * d * sizeof(float)); - fread(m->wcls, sizeof(float), vs * d, f); + FREAD_CHECK(m->wcls, vs * d, f, "wcls"); } + #undef FREAD_CHECK fclose(f); return 0; } @@ -188,32 +204,45 @@ static int model_compile_kernels(Model 
*m, int seq_len) { return 0; } -// Recompile all kernels after weight update — unload all first to avoid ANE model limit +// Recompile all kernels after weight update — compile new first, then swap static int model_recompile_kernels(Model *m) { int d = m->cfg.dim, hd = m->cfg.hidden_dim, vs = m->cfg.vocab_size; int S = m->seq_len; - // Phase 1: unload+free all + + // Phase 1: compile new kernels into temporaries + ANEKernel *new_q[N_LAYERS], *new_k[N_LAYERS], *new_v[N_LAYERS], *new_o[N_LAYERS]; + ANEKernel *new_w1[N_LAYERS], *new_w2[N_LAYERS], *new_w3[N_LAYERS]; for (int l = 0; l < N_LAYERS; l++) { - ane_free(m->kern_q[l]); ane_free(m->kern_k[l]); ane_free(m->kern_v[l]); ane_free(m->kern_o[l]); - ane_free(m->kern_w1[l]); ane_free(m->kern_w2[l]); ane_free(m->kern_w3[l]); - m->kern_q[l]=m->kern_k[l]=m->kern_v[l]=m->kern_o[l]=NULL; - m->kern_w1[l]=m->kern_w2[l]=m->kern_w3[l]=NULL; + new_q[l] = compile_conv_kernel(m->wq[l], d, d, S); + new_k[l] = compile_conv_kernel(m->wk[l], d, d, S); + new_v[l] = compile_conv_kernel(m->wv[l], d, d, S); + new_o[l] = compile_conv_kernel(m->wo[l], d, d, S); + new_w1[l] = compile_conv_kernel(m->w1[l], d, hd, S); + new_w2[l] = compile_conv_kernel(m->w2[l], hd, d, S); + new_w3[l] = compile_conv_kernel(m->w3[l], d, hd, S); + if (!new_q[l] || !new_k[l] || !new_v[l] || !new_o[l] || + !new_w1[l] || !new_w2[l] || !new_w3[l]) { + // Cleanup partially compiled new kernels + for (int i = 0; i <= l; i++) { + ane_free(new_q[i]); ane_free(new_k[i]); ane_free(new_v[i]); ane_free(new_o[i]); + ane_free(new_w1[i]); ane_free(new_w2[i]); ane_free(new_w3[i]); + } + fprintf(stderr, "Recompile failed at layer %d, keeping old kernels\n", l); + return -1; + } } - if (m->kern_cls) { ane_free(m->kern_cls); m->kern_cls=NULL; } - // Phase 2: recompile all + ANEKernel *new_cls = compile_conv_kernel(m->wcls, d, vs, S); + + // Phase 2: all compiles succeeded — swap and free old for (int l = 0; l < N_LAYERS; l++) { - m->kern_q[l] = compile_conv_kernel(m->wq[l], d, d, 
S); - m->kern_k[l] = compile_conv_kernel(m->wk[l], d, d, S); - m->kern_v[l] = compile_conv_kernel(m->wv[l], d, d, S); - m->kern_o[l] = compile_conv_kernel(m->wo[l], d, d, S); - m->kern_w1[l] = compile_conv_kernel(m->w1[l], d, hd, S); - m->kern_w2[l] = compile_conv_kernel(m->w2[l], hd, d, S); - m->kern_w3[l] = compile_conv_kernel(m->w3[l], d, hd, S); - if (!m->kern_q[l] || !m->kern_k[l] || !m->kern_v[l] || !m->kern_o[l] || - !m->kern_w1[l] || !m->kern_w2[l] || !m->kern_w3[l]) return -1; + ane_free(m->kern_q[l]); ane_free(m->kern_k[l]); ane_free(m->kern_v[l]); ane_free(m->kern_o[l]); + ane_free(m->kern_w1[l]); ane_free(m->kern_w2[l]); ane_free(m->kern_w3[l]); + m->kern_q[l] = new_q[l]; m->kern_k[l] = new_k[l]; + m->kern_v[l] = new_v[l]; m->kern_o[l] = new_o[l]; + m->kern_w1[l] = new_w1[l]; m->kern_w2[l] = new_w2[l]; m->kern_w3[l] = new_w3[l]; } - m->kern_cls = compile_conv_kernel(m->wcls, d, vs, S); - // cls may fail for large vocab — that's OK, forward uses CPU fallback + if (m->kern_cls) ane_free(m->kern_cls); + m->kern_cls = new_cls; // may be NULL for large vocab — forward uses CPU fallback return 0; } diff --git a/training/model_config.h b/training/model_config.h new file mode 100644 index 0000000..fa8be1c --- /dev/null +++ b/training/model_config.h @@ -0,0 +1,314 @@ +// model_config.h — Parameterized model configuration for pipeline training +// Replaces hardcoded #defines with portable structs + preset configs +#pragma once +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <stddef.h> + +// ===== Model configuration ===== + +typedef struct { + int dim; // model dimension (embedding/residual width) + int hidden_dim; // FFN hidden dimension + int n_heads; // number of attention heads + int n_kv_heads; // number of KV heads (for GQA; == n_heads for MHA) + int n_layers; // total transformer layers + int vocab_size; // vocabulary size + int seq_len; // maximum sequence length + // Derived (computed by model_dims_init) + int head_dim; // dim / n_heads + int kv_dim; // head_dim * 
n_kv_heads + int score_ch; // n_heads * seq_len (attention score channels for SDPA bwd) +} ModelDims; + +typedef struct { + int compile_budget; // max ANE compilations per process (~119) + int kernels_per_layer; // weight-bearing kernels per layer (currently 5) + int static_per_layer; // weight-free kernels per layer (sdpaBwd2 = 1) + int accum_steps; // gradient accumulation steps per compile batch + float headroom_pct; // safety margin as fraction of budget (0.0-1.0, default 0.10) +} CompileConfig; + +typedef struct { + ModelDims dims; + CompileConfig compile; + const char *name; // human-readable model name +} ModelConfig; + +// ===== Layer group for pipeline scheduling ===== + +typedef struct { + int start_layer; // first layer index (inclusive) + int end_layer; // last layer index (exclusive) + int n_layers; // end_layer - start_layer + int weight_kernels; // weight-bearing kernels in this group + int static_kernels; // weight-free kernels in this group + int total_kernels; // weight_kernels + static_kernels +} LayerGroup; + +typedef struct { + LayerGroup *groups; + int n_groups; + int total_exec_restarts; // estimated exec() restarts per training step +} PipelinePlan; + +// ===== Derived dimension helpers ===== + +static void model_dims_init(ModelDims *d) { + d->head_dim = (d->n_heads > 0) ? 
d->dim / d->n_heads : 0; + d->kv_dim = d->head_dim * d->n_kv_heads; + d->score_ch = d->n_heads * d->seq_len; +} + +// ===== Per-layer memory sizes (bytes) ===== + +// Weight sizes in floats (fp32) +static inline size_t wq_size(const ModelDims *d) { return (size_t)d->dim * d->dim; } +static inline size_t wo_size(const ModelDims *d) { return (size_t)d->dim * d->dim; } +static inline size_t w1_size(const ModelDims *d) { return (size_t)d->hidden_dim * d->dim; } +static inline size_t w2_size(const ModelDims *d) { return (size_t)d->dim * d->hidden_dim; } +static inline size_t w3_size(const ModelDims *d) { return (size_t)d->hidden_dim * d->dim; } + +static inline size_t layer_weight_floats(const ModelDims *d) { + return 4 * wq_size(d) // Wq, Wk, Wv, Wo + + w1_size(d) + w2_size(d) + w3_size(d) // W1, W2, W3 + + 2 * (size_t)d->dim; // rms_att, rms_ffn +} + +static inline size_t layer_weight_bytes(const ModelDims *d) { + return layer_weight_floats(d) * sizeof(float); +} + +// Adam state: 2x weight size (m + v vectors) +static inline size_t layer_adam_bytes(const ModelDims *d) { + return 2 * layer_weight_bytes(d); +} + +// Activation buffers per layer (saved for backward) +static inline size_t layer_activation_floats(const ModelDims *d) { + int S = d->seq_len, D = d->dim, H = d->hidden_dim; + // layer_in, xnorm, Q, K, V, attn_out, o_out, x2, x2norm = 9 * D*S + // h1, h3, silu_out = 3 * H*S + // ffn_out = D*S + return (size_t)(10 * D * S + 3 * H * S); +} + +static inline size_t layer_activation_bytes(const ModelDims *d) { + return layer_activation_floats(d) * sizeof(float); +} + +// Gradient accumulators per layer +static inline size_t layer_gradient_bytes(const ModelDims *d) { + return layer_weight_bytes(d); // same layout as weights +} + +// Total model memory (weights + adam + activations + gradients for all layers) +static inline size_t total_model_bytes(const ModelConfig *cfg) { + const ModelDims *d = &cfg->dims; + size_t per_layer = layer_weight_bytes(d) + 
layer_adam_bytes(d) + + layer_activation_bytes(d) + layer_gradient_bytes(d); + size_t global = (size_t)d->dim * sizeof(float) // rms_final + + (size_t)d->vocab_size * d->dim * sizeof(float) // embed + + (size_t)d->dim * 2 * sizeof(float) // rms_final adam + + (size_t)d->vocab_size * d->dim * 2 * sizeof(float) // embed adam + + (size_t)d->dim * sizeof(float) // rms_final grad + + (size_t)d->vocab_size * d->dim * sizeof(float); // embed grad + return per_layer * d->n_layers + global; +} + +// ===== Pipeline planning ===== + +// Compute how many layers can fit in one compile batch +static int max_layers_per_compile(const CompileConfig *cc) { + float headroom = (cc->headroom_pct > 0.0f && cc->headroom_pct < 1.0f) + ? cc->headroom_pct : 0.10f; + int usable = (int)(cc->compile_budget * (1.0f - headroom)); + int per_layer = cc->kernels_per_layer + cc->static_per_layer; + if (per_layer <= 0) return 1; + return usable / per_layer; +} + +// Compute optimal layer groups for a model given compile budget +// Returns a PipelinePlan (caller must free plan.groups) +static PipelinePlan compute_pipeline_plan(const ModelConfig *cfg) { + PipelinePlan plan = {0}; + int max_per = max_layers_per_compile(&cfg->compile); + if (max_per <= 0) max_per = 1; + + // Clamp to actual layer count + int group_size = (max_per < cfg->dims.n_layers) ? 
max_per : cfg->dims.n_layers; + + plan.n_groups = (cfg->dims.n_layers + group_size - 1) / group_size; + plan.groups = (LayerGroup *)calloc(plan.n_groups, sizeof(LayerGroup)); + + int kpl = cfg->compile.kernels_per_layer; + int spl = cfg->compile.static_per_layer; + + for (int g = 0; g < plan.n_groups; g++) { + LayerGroup *lg = &plan.groups[g]; + lg->start_layer = g * group_size; + lg->end_layer = lg->start_layer + group_size; + if (lg->end_layer > cfg->dims.n_layers) + lg->end_layer = cfg->dims.n_layers; + lg->n_layers = lg->end_layer - lg->start_layer; + lg->weight_kernels = lg->n_layers * kpl; + lg->static_kernels = lg->n_layers * spl; + lg->total_kernels = lg->weight_kernels + lg->static_kernels; + } + + // Each compile batch needs one exec() restart (except possibly the last) + // Forward: n_groups compiles. Backward: n_groups compiles. + // Per training step: forward + backward = 2 * n_groups compile batches + // Each batch may need exec() restart. Worst case: + plan.total_exec_restarts = 2 * plan.n_groups; + + return plan; +} + +static void pipeline_plan_free(PipelinePlan *plan) { + free(plan->groups); + plan->groups = NULL; + plan->n_groups = 0; +} + +// ===== Pretty-print plan ===== + +static void pipeline_plan_print(const ModelConfig *cfg, const PipelinePlan *plan) { + printf("=== Pipeline Plan: %s ===\n", cfg->name); + printf(" %d layers | dim=%d hidden=%d heads=%d seq=%d vocab=%d\n", + cfg->dims.n_layers, cfg->dims.dim, cfg->dims.hidden_dim, + cfg->dims.n_heads, cfg->dims.seq_len, cfg->dims.vocab_size); + printf(" Compile budget: %d | %d weight-kernels/layer + %d static/layer\n", + cfg->compile.compile_budget, cfg->compile.kernels_per_layer, + cfg->compile.static_per_layer); + printf(" %d layer group(s):\n", plan->n_groups); + for (int g = 0; g < plan->n_groups; g++) { + const LayerGroup *lg = &plan->groups[g]; + printf(" Group %d: layers [%d..%d) — %d layers, %d kernels (%d weight + %d static)\n", + g, lg->start_layer, lg->end_layer, lg->n_layers, + 
lg->total_kernels, lg->weight_kernels, lg->static_kernels); + } + printf(" Est. exec() restarts per step: %d\n", plan->total_exec_restarts); + printf(" Memory per layer: weights=%.1fMB adam=%.1fMB acts=%.1fMB grads=%.1fMB\n", + layer_weight_bytes(&cfg->dims)/1e6, layer_adam_bytes(&cfg->dims)/1e6, + layer_activation_bytes(&cfg->dims)/1e6, layer_gradient_bytes(&cfg->dims)/1e6); + printf(" Total model state: %.1fMB\n", total_model_bytes(cfg)/1e6); +} + +// ===== FLOP estimation per step ===== + +static inline double flops_per_step(const ModelConfig *cfg) { + const ModelDims *d = &cfg->dims; + int N = d->n_layers, D = d->dim, H = d->hidden_dim, S = d->seq_len; + int HD = d->head_dim, NH = d->n_heads; + // Forward: 4 linear projections (QKV+O) + 3 FFN projections per layer + double fwd = N * (4.0*2*D*D*S + 2.0*2*D*H*S + 2.0*H*D*S); + // Backward dx same flops as forward + double bwd_dx = fwd; + // Backward dW same flops as forward + double bwd_dw = fwd; + // SDPA backward (attention score computation) + double sdpa = N * 2.0 * NH * 5 * S * S * HD; + // Classifier (forward + backward) + double cls = 3.0 * 2.0 * d->vocab_size * D * S; + return fwd + bwd_dx + bwd_dw + sdpa + cls; +} + +static inline double ane_flops_per_step(const ModelConfig *cfg) { + const ModelDims *d = &cfg->dims; + int N = d->n_layers, D = d->dim, H = d->hidden_dim, S = d->seq_len; + int HD = d->head_dim, NH = d->n_heads; + double fwd = N * (4.0*2*D*D*S + 2.0*2*D*H*S + 2.0*H*D*S); + double bwd_dx = fwd; + double sdpa = N * 2.0 * NH * 5 * S * S * HD; + return fwd + bwd_dx + sdpa; // dW is on CPU (cblas) +} + +// ===== Model presets ===== + +static ModelConfig model_config_stories110m(void) { + ModelConfig cfg = {0}; + cfg.name = "Stories110M"; + cfg.dims = (ModelDims){ + .dim = 768, .hidden_dim = 2048, .n_heads = 12, + .n_kv_heads = 12, .n_layers = 12, .vocab_size = 32000, .seq_len = 256 + }; + cfg.compile = (CompileConfig){ + .compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, 
.accum_steps = 10, .headroom_pct = 0.10f + }; + model_dims_init(&cfg.dims); + return cfg; +} + +static ModelConfig model_config_stories42m(void) { + ModelConfig cfg = {0}; + cfg.name = "Stories42M"; + cfg.dims = (ModelDims){ + .dim = 512, .hidden_dim = 1376, .n_heads = 8, + .n_kv_heads = 8, .n_layers = 8, .vocab_size = 32000, .seq_len = 256 + }; + cfg.compile = (CompileConfig){ + .compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .accum_steps = 10, .headroom_pct = 0.10f + }; + model_dims_init(&cfg.dims); + return cfg; +} + +static ModelConfig model_config_llama_1b(void) { + ModelConfig cfg = {0}; + cfg.name = "LLaMA-1.1B"; + cfg.dims = (ModelDims){ + .dim = 2048, .hidden_dim = 5504, .n_heads = 16, + .n_kv_heads = 16, .n_layers = 22, .vocab_size = 32000, .seq_len = 512 + }; + cfg.compile = (CompileConfig){ + .compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .accum_steps = 4, .headroom_pct = 0.10f + }; + model_dims_init(&cfg.dims); + return cfg; +} + +static ModelConfig model_config_llama_7b(void) { + ModelConfig cfg = {0}; + cfg.name = "LLaMA-7B"; + cfg.dims = (ModelDims){ + .dim = 4096, .hidden_dim = 11008, .n_heads = 32, + .n_kv_heads = 32, .n_layers = 32, .vocab_size = 32000, .seq_len = 512 + }; + cfg.compile = (CompileConfig){ + .compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .accum_steps = 2, .headroom_pct = 0.10f + }; + model_dims_init(&cfg.dims); + return cfg; +} + +// Parse config from command-line (returns preset, caller can override) +static ModelConfig model_config_from_args(int argc, char *argv[]) { + ModelConfig cfg = model_config_stories110m(); // default + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--model") == 0 && i+1 < argc) { + const char *name = argv[++i]; + if (strcmp(name, "stories42m") == 0) cfg = model_config_stories42m(); + else if (strcmp(name, "stories110m") == 0) cfg = model_config_stories110m(); + else if (strcmp(name, "llama1b") == 0) cfg = 
model_config_llama_1b(); + else if (strcmp(name, "llama7b") == 0) cfg = model_config_llama_7b(); + else fprintf(stderr, "Unknown model: %s (using stories110m)\n", name); + } + else if (strcmp(argv[i], "--dim") == 0 && i+1 < argc) cfg.dims.dim = atoi(argv[++i]); + else if (strcmp(argv[i], "--hidden") == 0 && i+1 < argc) cfg.dims.hidden_dim = atoi(argv[++i]); + else if (strcmp(argv[i], "--heads") == 0 && i+1 < argc) cfg.dims.n_heads = atoi(argv[++i]); + else if (strcmp(argv[i], "--layers") == 0 && i+1 < argc) cfg.dims.n_layers = atoi(argv[++i]); + else if (strcmp(argv[i], "--seq") == 0 && i+1 < argc) cfg.dims.seq_len = atoi(argv[++i]); + else if (strcmp(argv[i], "--vocab") == 0 && i+1 < argc) cfg.dims.vocab_size = atoi(argv[++i]); + else if (strcmp(argv[i], "--budget") == 0 && i+1 < argc) cfg.compile.compile_budget = atoi(argv[++i]); + else if (strcmp(argv[i], "--accum") == 0 && i+1 < argc) cfg.compile.accum_steps = atoi(argv[++i]); + else if (strcmp(argv[i], "--headroom") == 0 && i+1 < argc) cfg.compile.headroom_pct = atof(argv[++i]); + } + model_dims_init(&cfg.dims); + return cfg; +} diff --git a/training/pipeline.h b/training/pipeline.h new file mode 100644 index 0000000..d3ceb95 --- /dev/null +++ b/training/pipeline.h @@ -0,0 +1,515 @@ +// pipeline.h — Layer-group scheduling and mmap state for multi-group ANE training +// Manages compile budgets, exec() restarts, and cross-exec shared tensor state +#pragma once +#include "model_config.h" +#include <stdbool.h> +#include <fcntl.h> +#include <sys/mman.h> +#include <sys/stat.h> +#include <unistd.h> + +// ===== Compile budget tracker ===== + +typedef struct { + int budget; // max compilations allowed + int used; // compilations consumed so far + int headroom; // safety margin (budget * 0.1) +} CompileBudget; + +static CompileBudget budget_init(const CompileConfig *cc) { + CompileBudget b; + b.budget = cc->compile_budget; + b.used = 0; + float pct = (cc->headroom_pct > 0.0f && cc->headroom_pct < 1.0f) + ? 
cc->headroom_pct : 0.10f; + b.headroom = (int)(cc->compile_budget * pct); + return b; +} + +static bool budget_can_fit(const CompileBudget *b, int n_kernels) { + return (b->used + n_kernels) <= (b->budget - b->headroom); +} + +static void budget_consume(CompileBudget *b, int n_kernels) { + b->used += n_kernels; +} + +static bool budget_needs_restart(const CompileBudget *b) { + return b->used >= (b->budget - b->headroom); +} + +static int budget_remaining(const CompileBudget *b) { + int r = b->budget - b->headroom - b->used; + return (r > 0) ? r : 0; +} + +// ===== Pipeline execution phases ===== + +typedef enum { + PHASE_INIT = 0, + PHASE_FORWARD, // running forward pass through layer groups + PHASE_BACKWARD, // running backward pass through layer groups (reverse) + PHASE_WEIGHT_UPDATE, // Adam step on accumulated gradients + PHASE_DONE // training step complete +} PipelinePhase; + +typedef enum { + ACTION_COMPILE_GROUP, // compile kernels for current layer group + ACTION_RUN_FORWARD_GROUP, // execute forward pass for compiled group + ACTION_RUN_BACKWARD_GROUP, // execute backward pass for compiled group + ACTION_EXEC_RESTART, // save state and exec() to reset compile budget + ACTION_WEIGHT_UPDATE, // run optimizer on all layers + ACTION_STEP_DONE, // training step complete + ACTION_ERROR // something went wrong +} PipelineAction; + +// ===== Scheduler state ===== + +typedef struct { + ModelConfig config; + PipelinePlan plan; + CompileBudget budget; + + PipelinePhase phase; + int current_group; // index into plan.groups + int current_step; // training step number + int accum_step; // gradient accumulation step within batch + int total_steps; // total training steps requested + float learning_rate; + float last_loss; + + // Flags + bool group_compiled; // whether current group's kernels are compiled + bool needs_restart; // whether we need exec() before next group +} PipelineScheduler; + +static PipelineScheduler pipeline_scheduler_init(ModelConfig config, int 
total_steps, float lr) { + PipelineScheduler s = {0}; + s.config = config; + s.plan = compute_pipeline_plan(&config); + s.budget = budget_init(&config.compile); + s.phase = PHASE_FORWARD; + s.current_group = 0; + s.current_step = 0; + s.accum_step = 0; + s.total_steps = total_steps; + s.learning_rate = lr; + s.last_loss = 999.0f; + s.group_compiled = false; + s.needs_restart = false; + return s; +} + +// Get the next action the training loop should take +static PipelineAction pipeline_next_action(PipelineScheduler *s) { + if (s->current_step >= s->total_steps) + return ACTION_STEP_DONE; + + switch (s->phase) { + case PHASE_FORWARD: + if (s->current_group >= s->plan.n_groups) { + // Forward pass complete for all groups — start backward + s->phase = PHASE_BACKWARD; + s->current_group = s->plan.n_groups - 1; + s->group_compiled = false; + return pipeline_next_action(s); + } + if (!s->group_compiled) { + // Check if we have compile budget for this group + LayerGroup *lg = &s->plan.groups[s->current_group]; + if (!budget_can_fit(&s->budget, lg->total_kernels)) { + s->needs_restart = true; + return ACTION_EXEC_RESTART; + } + return ACTION_COMPILE_GROUP; + } + return ACTION_RUN_FORWARD_GROUP; + + case PHASE_BACKWARD: + if (s->current_group < 0) { + // Backward complete — weight update + s->phase = PHASE_WEIGHT_UPDATE; + return ACTION_WEIGHT_UPDATE; + } + if (!s->group_compiled) { + LayerGroup *lg = &s->plan.groups[s->current_group]; + if (!budget_can_fit(&s->budget, lg->total_kernels)) { + s->needs_restart = true; + return ACTION_EXEC_RESTART; + } + return ACTION_COMPILE_GROUP; + } + return ACTION_RUN_BACKWARD_GROUP; + + case PHASE_WEIGHT_UPDATE: + return ACTION_WEIGHT_UPDATE; + + case PHASE_DONE: + return ACTION_STEP_DONE; + + default: + return ACTION_ERROR; + } +} + +// Called after successfully compiling a layer group's kernels +static void pipeline_group_compiled(PipelineScheduler *s) { + LayerGroup *lg = &s->plan.groups[s->current_group]; + budget_consume(&s->budget, 
lg->total_kernels); + s->group_compiled = true; +} + +// Called after successfully running forward for current group +static void pipeline_forward_group_done(PipelineScheduler *s) { + s->current_group++; + s->group_compiled = false; +} + +// Called after successfully running backward for current group +static void pipeline_backward_group_done(PipelineScheduler *s) { + s->current_group--; + s->group_compiled = false; +} + +// Called after weight update completes +static void pipeline_weight_update_done(PipelineScheduler *s) { + s->accum_step++; + if (s->accum_step >= s->config.compile.accum_steps) { + s->accum_step = 0; + s->current_step++; + } + // Reset for next forward pass + s->phase = PHASE_FORWARD; + s->current_group = 0; + s->group_compiled = false; +} + +// ===== mmap-based cross-exec state ===== +// +// Layout: [Header][weights: all layers][adam: all layers][grads: all layers][acts: all layers][Global state] +// All tensors stored as fp32. The mmap file persists across exec() restarts. + +#define MMAP_SENTINEL 0x414E4550 // "ANEP" — file format identifier +#define MMAP_VERSION 1 + +typedef struct { + int sentinel; // MMAP_SENTINEL for file identification + int version; + int n_layers; + int dim; + int hidden_dim; + int n_heads; + int vocab_size; + int seq_len; + // Scheduler state (for exec restart) + int phase; + int current_group; + int current_step; + int accum_step; + int total_steps; + int compile_count; // compiles used in current process + int adam_t; // Adam timestep + float learning_rate; + float last_loss; + // Offsets into mmap (bytes from base) + size_t layer_weights_offset; // start of per-layer weight data + size_t layer_adam_offset; // start of per-layer adam state + size_t layer_grads_offset; // start of per-layer gradient accumulators + size_t layer_acts_offset; // start of per-layer activation checkpoints + size_t global_offset; // start of global state (rms_final, embed, etc.) 
+ size_t total_size; // total mmap size + int pad[4]; // alignment +} MmapHeader; + +typedef struct { + int fd; + void *base; + size_t size; + MmapHeader *header; + const char *path; +} MmapState; + +// Compute mmap layout for a given config +static size_t mmap_compute_size(const ModelConfig *cfg) { + const ModelDims *d = &cfg->dims; + size_t header = sizeof(MmapHeader); + // Round up to page boundary + header = (header + 4095) & ~(size_t)4095; + + size_t per_layer_weights = layer_weight_bytes(d); + size_t per_layer_adam = layer_adam_bytes(d); + size_t per_layer_grads = layer_gradient_bytes(d); + size_t per_layer_acts = layer_activation_bytes(d); + + size_t all_layers = (size_t)d->n_layers * (per_layer_weights + per_layer_adam + per_layer_grads + per_layer_acts); + + // Global: rms_final + embed + their adam states + embed gradients + size_t global = (size_t)d->dim * 4 // rms_final + + (size_t)d->vocab_size * d->dim * 4 // embed + + (size_t)d->dim * 2 * 4 // rms_final adam (m+v) + + (size_t)d->vocab_size * d->dim * 2 * 4 // embed adam + + (size_t)d->dim * 4 // rms_final grad + + (size_t)d->vocab_size * d->dim * 4; // embed grad + + return header + all_layers + global; +} + +// Create a new mmap state file +static MmapState *mmap_state_create(const char *path, const ModelConfig *cfg) { + size_t total = mmap_compute_size(cfg); + int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644); + if (fd < 0) { perror("mmap_state_create: open"); return NULL; } + if (ftruncate(fd, total) < 0) { perror("mmap_state_create: ftruncate"); close(fd); return NULL; } + + void *base = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (base == MAP_FAILED) { perror("mmap_state_create: mmap"); close(fd); return NULL; } + + MmapState *ms = (MmapState *)calloc(1, sizeof(MmapState)); + if (!ms) { perror("mmap_state_create: calloc"); munmap(base, total); close(fd); return NULL; } + ms->fd = fd; + ms->base = base; + ms->size = total; + ms->path = path; + ms->header = (MmapHeader 
*)base; + + // Initialize header + MmapHeader *h = ms->header; + h->sentinel = MMAP_SENTINEL; + h->version = MMAP_VERSION; + h->n_layers = cfg->dims.n_layers; + h->dim = cfg->dims.dim; + h->hidden_dim = cfg->dims.hidden_dim; + h->n_heads = cfg->dims.n_heads; + h->vocab_size = cfg->dims.vocab_size; + h->seq_len = cfg->dims.seq_len; + + // Compute offsets + size_t header_end = (sizeof(MmapHeader) + 4095) & ~(size_t)4095; + const ModelDims *d = &cfg->dims; + size_t pw = layer_weight_bytes(d); + size_t pa = layer_adam_bytes(d); + size_t pg = layer_gradient_bytes(d); + size_t pact = layer_activation_bytes(d); + + h->layer_weights_offset = header_end; + h->layer_adam_offset = h->layer_weights_offset + (size_t)d->n_layers * pw; + h->layer_grads_offset = h->layer_adam_offset + (size_t)d->n_layers * pa; + h->layer_acts_offset = h->layer_grads_offset + (size_t)d->n_layers * pg; + h->global_offset = h->layer_acts_offset + (size_t)d->n_layers * pact; + h->total_size = total; + + return ms; +} + +// Reopen existing mmap state (after exec() restart) +static MmapState *mmap_state_open(const char *path) { + int fd = open(path, O_RDWR); + if (fd < 0) { perror("mmap_state_open: open"); return NULL; } + struct stat st; + if (fstat(fd, &st) < 0) { perror("mmap_state_open: fstat"); close(fd); return NULL; } + + void *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (base == MAP_FAILED) { perror("mmap_state_open: mmap"); close(fd); return NULL; } + + if ((size_t)st.st_size < sizeof(MmapHeader)) { + fprintf(stderr, "mmap_state_open: file too small (%lld bytes)\n", (long long)st.st_size); + munmap(base, st.st_size); + close(fd); + return NULL; + } + + MmapHeader *h = (MmapHeader *)base; + if (h->sentinel != MMAP_SENTINEL || h->version != MMAP_VERSION) { + fprintf(stderr, "mmap_state_open: invalid header (sentinel=0x%08x version=%d)\n", + h->sentinel, h->version); + munmap(base, st.st_size); + close(fd); + return NULL; + } + + if (h->total_size != 0 && 
(size_t)st.st_size < h->total_size) { + fprintf(stderr, "mmap_state_open: file truncated (expected %zu, got %lld)\n", + h->total_size, (long long)st.st_size); + munmap(base, st.st_size); + close(fd); + return NULL; + } + + MmapState *ms = (MmapState *)calloc(1, sizeof(MmapState)); + if (!ms) { perror("mmap_state_open: calloc"); munmap(base, st.st_size); close(fd); return NULL; } + ms->fd = fd; + ms->base = base; + ms->size = st.st_size; + ms->path = path; + ms->header = h; + return ms; +} + +// Close and unmap (does NOT delete the file) +static void mmap_state_close(MmapState *ms) { + if (!ms) return; + if (ms->base && ms->base != MAP_FAILED) { + if (msync(ms->base, ms->size, MS_SYNC) < 0) perror("mmap_state_close: msync"); + if (munmap(ms->base, ms->size) < 0) perror("mmap_state_close: munmap"); + } + if (ms->fd >= 0) close(ms->fd); + free(ms); +} + +// Delete the mmap file (call after training completes) +static void mmap_state_destroy(MmapState *ms) { + if (!ms) return; + const char *p = ms->path; + mmap_state_close(ms); + unlink(p); +} + +// ===== Typed accessors into mmap regions ===== + +// Reconstruct ModelDims from mmap header (avoids repeating in each accessor) +static inline ModelDims mmap_dims(const MmapState *ms) { + return (ModelDims){ + .dim = ms->header->dim, .hidden_dim = ms->header->hidden_dim, + .n_heads = ms->header->n_heads, .vocab_size = ms->header->vocab_size, + .seq_len = ms->header->seq_len + }; +} + +// Get pointer to layer L's weights in mmap (NULL if out of bounds) +static float *mmap_layer_weights(MmapState *ms, int layer) { + if (!ms || layer < 0 || layer >= ms->header->n_layers) return NULL; + ModelDims d = mmap_dims(ms); + return (float *)((char *)ms->base + ms->header->layer_weights_offset + + (size_t)layer * layer_weight_bytes(&d)); +} + +// Get pointer to layer L's adam state in mmap (NULL if out of bounds) +static float *mmap_layer_adam(MmapState *ms, int layer) { + if (!ms || layer < 0 || layer >= ms->header->n_layers) return 
NULL; + ModelDims d = mmap_dims(ms); + return (float *)((char *)ms->base + ms->header->layer_adam_offset + + (size_t)layer * layer_adam_bytes(&d)); +} + +// Get pointer to layer L's gradient accumulators in mmap (NULL if out of bounds) +static float *mmap_layer_grads(MmapState *ms, int layer) { + if (!ms || layer < 0 || layer >= ms->header->n_layers) return NULL; + ModelDims d = mmap_dims(ms); + return (float *)((char *)ms->base + ms->header->layer_grads_offset + + (size_t)layer * layer_gradient_bytes(&d)); +} + +// Get pointer to layer L's activation checkpoint in mmap (NULL if out of bounds) +static float *mmap_layer_acts(MmapState *ms, int layer) { + if (!ms || layer < 0 || layer >= ms->header->n_layers) return NULL; + ModelDims d = mmap_dims(ms); + return (float *)((char *)ms->base + ms->header->layer_acts_offset + + (size_t)layer * layer_activation_bytes(&d)); +} + +// Get pointer to global state region (rms_final, embed, etc.) +static float *mmap_global(MmapState *ms) { + return (float *)((char *)ms->base + ms->header->global_offset); +} + +// ===== Save/restore scheduler state to/from mmap header ===== + +static void pipeline_save_to_mmap(const PipelineScheduler *s, MmapState *ms) { + MmapHeader *h = ms->header; + h->phase = (int)s->phase; + h->current_group = s->current_group; + h->current_step = s->current_step; + h->accum_step = s->accum_step; + h->total_steps = s->total_steps; + h->learning_rate = s->learning_rate; + h->last_loss = s->last_loss; + msync(ms->base, sizeof(MmapHeader), MS_SYNC); +} + +static void pipeline_restore_from_mmap(PipelineScheduler *s, const MmapState *ms) { + const MmapHeader *h = ms->header; + s->phase = (PipelinePhase)h->phase; + s->current_group = h->current_group; + s->current_step = h->current_step; + s->accum_step = h->accum_step; + s->total_steps = h->total_steps; + s->learning_rate = h->learning_rate; + s->last_loss = h->last_loss; + // Reset compile budget (new process after exec) + s->budget = 
budget_init(&s->config.compile); + s->group_compiled = false; + s->needs_restart = false; +} + +// ===== exec() restart with mmap persistence ===== + +// Call this when ACTION_EXEC_RESTART is returned. +// Saves scheduler state to mmap, syncs, and exec()s. +// Does not return on success. +static void pipeline_exec_restart(PipelineScheduler *s, MmapState *ms, char *argv[]) { + pipeline_save_to_mmap(s, ms); + printf("[pipeline] exec() restart: step=%d phase=%d group=%d compiles=%d\n", + s->current_step, s->phase, s->current_group, s->budget.used); + fflush(stdout); + + // Sync all mmap data before exec + msync(ms->base, ms->size, MS_SYNC); + + // exec with --pipeline-resume flag + execl(argv[0], argv[0], "--pipeline-resume", ms->path, NULL); + perror("pipeline_exec_restart: execl"); +} + +// Resume from exec() restart. Returns true if this is a resume. +static bool pipeline_check_resume(int argc, char *argv[], PipelineScheduler *s, MmapState **ms_out) { + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--pipeline-resume") == 0 && i+1 < argc) { + const char *mmap_path = argv[i+1]; + MmapState *ms = mmap_state_open(mmap_path); + if (!ms) { + fprintf(stderr, "[pipeline] Failed to reopen mmap at %s\n", mmap_path); + return false; + } + pipeline_restore_from_mmap(s, ms); + *ms_out = ms; + printf("[pipeline] Resumed: step=%d phase=%d group=%d\n", + s->current_step, s->phase, s->current_group); + return true; + } + } + return false; +} + +// ===== Pipeline pretty-print helpers ===== + +static const char *phase_name(PipelinePhase p) { + switch (p) { + case PHASE_INIT: return "INIT"; + case PHASE_FORWARD: return "FORWARD"; + case PHASE_BACKWARD: return "BACKWARD"; + case PHASE_WEIGHT_UPDATE: return "WEIGHT_UPDATE"; + case PHASE_DONE: return "DONE"; + default: return "UNKNOWN"; + } +} + +static const char *action_name(PipelineAction a) { + switch (a) { + case ACTION_COMPILE_GROUP: return "COMPILE_GROUP"; + case ACTION_RUN_FORWARD_GROUP: return "RUN_FORWARD_GROUP"; + 
case ACTION_RUN_BACKWARD_GROUP: return "RUN_BACKWARD_GROUP"; + case ACTION_EXEC_RESTART: return "EXEC_RESTART"; + case ACTION_WEIGHT_UPDATE: return "WEIGHT_UPDATE"; + case ACTION_STEP_DONE: return "STEP_DONE"; + case ACTION_ERROR: return "ERROR"; + default: return "UNKNOWN"; + } +} + +static void pipeline_print_status(const PipelineScheduler *s) { + printf("[pipeline] step=%d/%d accum=%d/%d phase=%s group=%d/%d budget=%d/%d\n", + s->current_step, s->total_steps, + s->accum_step, s->config.compile.accum_steps, + phase_name(s->phase), s->current_group, s->plan.n_groups, + s->budget.used, s->budget.budget); +} diff --git a/training/stories_cpu_ops.h b/training/stories_cpu_ops.h index c9f2cfa..cd103c5 100644 --- a/training/stories_cpu_ops.h +++ b/training/stories_cpu_ops.h @@ -2,14 +2,13 @@ #pragma once #include "stories_config.h" -static float *g_rms_tmp = NULL; static void rmsnorm(float *out, const float *x, const float *w, int d, int S) { - if (!g_rms_tmp) g_rms_tmp = (float*)malloc(S*4); + float *rms_tmp = (float*)malloc(S * sizeof(float)); float *ss = (float*)calloc(S, sizeof(float)); for (int i=0; i= V) { fprintf(stderr, "WARN: target token %d out of vocab range [0,%d), skipping\n", tgt, V); continue; } total_loss -= logf(row[tgt] + 1e-10f); // gradient: softmax - one_hot, then /S row[tgt] -= 1.0f; @@ -112,6 +112,7 @@ static float cross_entropy_loss(float *dlogits, const float *logits, const uint1 static void embed_lookup(float *x, const float *embed, const uint16_t *tokens, int dim, int seq) { for (int t = 0; t < seq; t++) { int tok = tokens[t]; + if (tok < 0 || tok >= VOCAB) { fprintf(stderr, "WARN: token %d out of range [0,%d)\n", tok, VOCAB); continue; } for (int d = 0; d < dim; d++) { x[d*seq + t] = embed[tok*dim + d]; } @@ -122,6 +123,7 @@ static void embed_lookup(float *x, const float *embed, const uint16_t *tokens, i static void embed_backward(float *d_embed, const float *dx, const uint16_t *tokens, int dim, int seq) { for (int t = 0; t < seq; t++) { int 
tok = tokens[t]; + if (tok < 0 || tok >= VOCAB) { continue; } for (int d = 0; d < dim; d++) { d_embed[tok*dim + d] += dx[d*seq + t]; } diff --git a/training/test_classifier.m b/training/test_classifier.m new file mode 100644 index 0000000..363e46e --- /dev/null +++ b/training/test_classifier.m @@ -0,0 +1,255 @@ +// test_classifier.m — Test classifier matmul (32000 channels) and softmax on ANE +// This tests the riskiest operations: VOCAB-sized conv and softmax +// Build: xcrun clang -O2 -framework Foundation -framework IOSurface \ +// -framework CoreML -framework Accelerate -ldl -lobjc \ +// -o test_classifier test_classifier.m +#include "ane_classifier.h" +#include "stories_cpu_ops.h" + +int main(void) { + @autoreleasepool { + setbuf(stdout, NULL); + ane_init(); + mach_timebase_info(&g_tb); + + printf("=== Test: Classifier + Softmax on ANE ===\n"); + printf("DIM=%d SEQ=%d VOCAB=%d\n\n", DIM, SEQ, VOCAB); + + // ======== Test 1: Final RMSNorm ======== + printf("--- Test 1: Final RMSNorm on ANE ---\n"); + { + float *x = (float*)malloc(DIM * SEQ * 4); + float *w = (float*)malloc(DIM * 4); + float *out_cpu = (float*)malloc(DIM * SEQ * 4); + float *out_ane = (float*)malloc(DIM * SEQ * 4); + srand48(42); + for (int i = 0; i < DIM * SEQ; i++) x[i] = (float)(drand48() * 2 - 1); + for (int i = 0; i < DIM; i++) w[i] = (float)(drand48() * 0.5 + 0.75); + + rmsnorm(out_cpu, x, w, DIM, SEQ); + + Kern *kern = compile_kern_mil_w(gen_final_rmsnorm(), (@{ + @"@model_path/weights/rms_w.bin": @{@"offset":@0, @"data":build_blob(w, 1, DIM)}, + }), DIM*SEQ*2, DIM*SEQ*2); + + if (!kern) { printf("FAIL: Final RMSNorm compile failed\n"); return 1; } + printf("Compile OK\n"); + + io_write_fp16(kern->ioIn, x, DIM, SEQ); + ane_eval(kern); + io_read_fp16(kern->ioOut, out_ane, 0, DIM, SEQ); + + float max_err = 0; + for (int i = 0; i < DIM*SEQ; i++) { + float e = fabsf(out_cpu[i] - out_ane[i]); + if (e > max_err) max_err = e; + } + printf("Max error: %.6f %s\n\n", max_err, max_err < 0.05 ? 
"PASS ✅" : "FAIL ❌"); + free_kern(kern); + free(x); free(w); free(out_cpu); free(out_ane); + } + + // ======== Test 2: Classifier forward (32000-channel conv) ======== + printf("--- Test 2: Classifier Forward (VOCAB=%d channel conv) ---\n", VOCAB); + { + float *x_final = (float*)malloc(DIM * SEQ * 4); + float *embed = (float*)malloc((size_t)VOCAB * DIM * 4); + float *logits_cpu = (float*)malloc((size_t)VOCAB * SEQ * 4); + float *logits_ane = (float*)malloc((size_t)VOCAB * SEQ * 4); + + srand48(123); + for (int i = 0; i < DIM * SEQ; i++) x_final[i] = (float)(drand48() * 2 - 1) * 0.1f; + for (size_t i = 0; i < (size_t)VOCAB * DIM; i++) embed[i] = (float)(drand48() * 2 - 1) * 0.02f; + + // CPU reference: logits = embed @ x_final + // logits[v, t] = sum_d embed[v,d] * x_final[d,t] + // embed is [VOCAB, DIM] row-major, x_final is [DIM, SEQ] channel-first + uint64_t t0 = mach_absolute_time(); + cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, + VOCAB, SEQ, DIM, 1.0f, + embed, DIM, x_final, SEQ, 0.0f, logits_cpu, SEQ); + uint64_t t1 = mach_absolute_time(); + printf("CPU cblas_sgemm: %.2f ms\n", tb_ms(t1-t0)); + + // ANE: build weight blob for embed [VOCAB, DIM] + printf("Building embed blob (%.1f MB fp16)...\n", (float)VOCAB*DIM*2/1e6); + NSData *embed_blob = build_blob(embed, VOCAB, DIM); + + printf("Compiling classifier kernel...\n"); + t0 = mach_absolute_time(); + Kern *cls = compile_kern_mil_w(gen_classifier_fwd(), (@{ + @"@model_path/weights/embed.bin": @{@"offset":@0, @"data":embed_blob}, + }), DIM*SEQ*2, VOCAB*SEQ*2); + t1 = mach_absolute_time(); + + if (!cls) { + printf("FAIL: Classifier compile failed (32000 channels too large for ANE)\n"); + printf("This confirms tiling is needed.\n\n"); + } else { + printf("Compile OK in %.0f ms (compiles=%d)\n", tb_ms(t1-t0), g_compile_count); + + io_write_fp16(cls->ioIn, x_final, DIM, SEQ); + t0 = mach_absolute_time(); + ane_eval(cls); + t1 = mach_absolute_time(); + printf("ANE eval: %.2f ms\n", tb_ms(t1-t0)); + + // 
Read back and compare (sample — full read would be 32000*256*4 = 32MB) + io_read_fp16(cls->ioOut, logits_ane, 0, VOCAB, SEQ); + + float max_err = 0, sum_err = 0; + int cnt = 0; + for (int v = 0; v < VOCAB; v++) { + for (int t = 0; t < SEQ; t++) { + int idx = v*SEQ + t; + float e = fabsf(logits_cpu[idx] - logits_ane[idx]); + sum_err += e; + cnt++; + if (e > max_err) max_err = e; + } + } + printf("Max error: %.6f Mean error: %.6f %s\n", + max_err, sum_err/cnt, max_err < 1.0 ? "PASS ✅" : "FAIL ❌"); + + // Benchmark + int N = 10; + t0 = mach_absolute_time(); + for (int i = 0; i < N; i++) ane_eval(cls); + t1 = mach_absolute_time(); + printf("Benchmark: %d evals in %.2f ms (%.2f ms/eval)\n\n", N, tb_ms(t1-t0), tb_ms(t1-t0)/N); + free_kern(cls); + } + free(x_final); free(embed); free(logits_cpu); free(logits_ane); + } + + // ======== Test 3: Softmax over VOCAB dimension ======== + printf("--- Test 3: Softmax over VOCAB=%d ---\n", VOCAB); + { + float *logits = (float*)malloc((size_t)VOCAB * SEQ * 4); + float *probs_cpu = (float*)malloc((size_t)VOCAB * SEQ * 4); + float *probs_ane = (float*)malloc((size_t)VOCAB * SEQ * 4); + + srand48(999); + for (size_t i = 0; i < (size_t)VOCAB * SEQ; i++) + logits[i] = (float)(drand48() * 10 - 5); + + // CPU reference softmax (per position, over vocab) + // logits is [VOCAB, SEQ] channel-first + uint64_t t0 = mach_absolute_time(); + for (int t = 0; t < SEQ; t++) { + float maxv = -1e30f; + for (int v = 0; v < VOCAB; v++) { + float val = logits[v*SEQ+t]; + if (val > maxv) maxv = val; + } + float sum = 0; + for (int v = 0; v < VOCAB; v++) { + probs_cpu[v*SEQ+t] = expf(logits[v*SEQ+t] - maxv); + sum += probs_cpu[v*SEQ+t]; + } + for (int v = 0; v < VOCAB; v++) probs_cpu[v*SEQ+t] /= sum; + } + uint64_t t1 = mach_absolute_time(); + printf("CPU softmax: %.2f ms\n", tb_ms(t1-t0)); + + printf("Compiling softmax kernel...\n"); + int sm_bytes = VOCAB * SEQ * 2; + Kern *sm = compile_kern_mil_w(gen_softmax_vocab(), @{}, sm_bytes, sm_bytes); + + if 
(!sm) { + printf("FAIL: Softmax compile failed\n\n"); + } else { + printf("Compile OK\n"); + + io_write_fp16(sm->ioIn, logits, VOCAB, SEQ); + t0 = mach_absolute_time(); + ane_eval(sm); + t1 = mach_absolute_time(); + printf("ANE eval: %.2f ms\n", tb_ms(t1-t0)); + + io_read_fp16(sm->ioOut, probs_ane, 0, VOCAB, SEQ); + + // Check: probs should sum to ~1.0 per position + float max_err = 0; + for (int t = 0; t < 4; t++) { + float sum_cpu = 0, sum_ane = 0; + for (int v = 0; v < VOCAB; v++) { + sum_cpu += probs_cpu[v*SEQ+t]; + sum_ane += probs_ane[v*SEQ+t]; + float e = fabsf(probs_cpu[v*SEQ+t] - probs_ane[v*SEQ+t]); + if (e > max_err) max_err = e; + } + printf(" pos %d: CPU sum=%.4f ANE sum=%.4f\n", t, sum_cpu, sum_ane); + } + printf("Max error (first 4 positions): %.6f %s\n", + max_err, max_err < 0.01 ? "PASS ✅" : "FAIL ❌"); + + int N = 10; + t0 = mach_absolute_time(); + for (int i = 0; i < N; i++) ane_eval(sm); + t1 = mach_absolute_time(); + printf("Benchmark: %d evals in %.2f ms (%.2f ms/eval)\n\n", N, tb_ms(t1-t0), tb_ms(t1-t0)/N); + free_kern(sm); + } + free(logits); free(probs_cpu); free(probs_ane); + } + + // ======== Test 4: Classifier backward ======== + printf("--- Test 4: Classifier Backward (DIM=%d from VOCAB=%d) ---\n", DIM, VOCAB); + { + float *dlogits = (float*)malloc((size_t)VOCAB * SEQ * 4); + float *embed = (float*)malloc((size_t)VOCAB * DIM * 4); + float *dx_cpu = (float*)malloc(DIM * SEQ * 4); + float *dx_ane = (float*)malloc(DIM * SEQ * 4); + + srand48(456); + for (size_t i = 0; i < (size_t)VOCAB * SEQ; i++) dlogits[i] = (float)(drand48() * 2 - 1) * 0.01f; + for (size_t i = 0; i < (size_t)VOCAB * DIM; i++) embed[i] = (float)(drand48() * 2 - 1) * 0.02f; + + // CPU: dx = embed^T @ dlogits + uint64_t t0 = mach_absolute_time(); + cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans, + DIM, SEQ, VOCAB, 1.0f, + embed, DIM, dlogits, SEQ, 0.0f, dx_cpu, SEQ); + uint64_t t1 = mach_absolute_time(); + printf("CPU cblas_sgemm: %.2f ms\n", tb_ms(t1-t0)); + + // 
Build transposed embed blob + NSData *embed_t_blob = build_blob_t(embed, VOCAB, DIM); + + printf("Compiling classifier backward...\n"); + Kern *clsb = compile_kern_mil_w(gen_classifier_bwd(), (@{ + @"@model_path/weights/embed_t.bin": @{@"offset":@0, @"data":embed_t_blob}, + }), VOCAB*SEQ*2, DIM*SEQ*2); + + if (!clsb) { + printf("FAIL: Classifier backward compile failed\n\n"); + } else { + printf("Compile OK\n"); + + io_write_fp16(clsb->ioIn, dlogits, VOCAB, SEQ); + t0 = mach_absolute_time(); + ane_eval(clsb); + t1 = mach_absolute_time(); + printf("ANE eval: %.2f ms\n", tb_ms(t1-t0)); + + io_read_fp16(clsb->ioOut, dx_ane, 0, DIM, SEQ); + + float max_err = 0, sum_err = 0; + for (int i = 0; i < DIM*SEQ; i++) { + float e = fabsf(dx_cpu[i] - dx_ane[i]); + sum_err += e; + if (e > max_err) max_err = e; + } + printf("Max error: %.6f Mean error: %.6f %s\n\n", + max_err, sum_err/(DIM*SEQ), max_err < 1.0 ? "PASS ✅" : "FAIL ❌"); + free_kern(clsb); + } + free(dlogits); free(embed); free(dx_cpu); free(dx_ane); + } + + printf("=== All tests complete ===\n"); + printf("Total ANE compiles used: %d\n", g_compile_count); + return 0; + } +} diff --git a/training/test_dynamic_matmul.m b/training/test_dynamic_matmul.m new file mode 100644 index 0000000..72addbd --- /dev/null +++ b/training/test_dynamic_matmul.m @@ -0,0 +1,333 @@ +// test_dynamic_matmul.m — Benchmark dynamic matmul on ANE (no recompile) +// Layout: input [1, D, 1, S+D] — activations in sp[0:S], weight rows in sp[S:S+D] +// MIL: slice → reshape → matmul → reshape → output +#import <Foundation/Foundation.h> +#import <CoreML/CoreML.h> +#import <IOSurface/IOSurface.h> +#import <Accelerate/Accelerate.h> +#import <objc/runtime.h> +#import <objc/message.h> +#include <math.h> +#include <mach/mach_time.h> + +#include "stories_io.h" + +// Generate MIL for y = x @ W where both come from input IOSurface +// Input: [1, IC, 1, SEQ+OC] fp32 +// sp[0:SEQ] = activations x[IC, SEQ] +// sp[SEQ:SEQ+OC] = weight W[IC, OC] (each channel d holds W[d, :]) +// Output: [1, OC, 1, SEQ] fp32 +static NSString *gen_dynamic_matmul_mil(int ic, int oc, int seq) { + NSMutableString *m = [NSMutableString 
string]; + [m appendString:@"program(1.3)\n" + "[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, " + "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, " + "{\"coremltools-version\", \"9.0\"}})]\n{\n"]; + int sp_total = seq + oc; + [m appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", ic, sp_total]; + // Cast to fp16 + [m appendString:@" string to16 = const()[name = string(\"to16\"), val = string(\"fp16\")];\n"]; + [m appendFormat:@" tensor<fp16, [1, %d, 1, %d]> xh = cast(dtype = to16, x = x)[name = string(\"cin\")];\n", ic, sp_total]; + // Slice activations [1, IC, 1, SEQ] + [m appendString:@" tensor<int32, [4]> ba = const()[name = string(\"ba\"), val = tensor<int32, [4]>([0,0,0,0])];\n"]; + [m appendFormat:@" tensor<int32, [4]> sa = const()[name = string(\"sa\"), val = tensor<int32, [4]>([1,%d,1,%d])];\n", ic, seq]; + [m appendFormat:@" tensor<fp16, [1, %d, 1, %d]> act = slice_by_size(x=xh,begin=ba,size=sa)[name=string(\"act\")];\n", ic, seq]; + // Slice weight [1, IC, 1, OC] + [m appendFormat:@" tensor<int32, [4]> bw = const()[name = string(\"bw\"), val = tensor<int32, [4]>([0,0,0,%d])];\n", seq]; + [m appendFormat:@" tensor<int32, [4]> sw = const()[name = string(\"sw\"), val = tensor<int32, [4]>([1,%d,1,%d])];\n", ic, oc]; + [m appendFormat:@" tensor<fp16, [1, %d, 1, %d]> wt = slice_by_size(x=xh,begin=bw,size=sw)[name=string(\"wt\")];\n", ic, oc]; + // Reshape act: [1,IC,1,SEQ] → [1,1,IC,SEQ] → transpose → [1,1,SEQ,IC] + [m appendFormat:@" tensor<int32, [4]> ra = const()[name = string(\"ra\"), val = tensor<int32, [4]>([1,1,%d,%d])];\n", ic, seq]; + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> a2 = reshape(shape=ra,x=act)[name=string(\"a2\")];\n", ic, seq]; + [m appendString:@" tensor<int32, [4]> pm = const()[name = string(\"pm\"), val = tensor<int32, [4]>([0,1,3,2])];\n"]; + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> a3 = transpose(perm=pm,x=a2)[name=string(\"a3\")];\n", seq, ic]; + // Reshape weight: [1,IC,1,OC] → [1,1,IC,OC] + [m appendFormat:@" tensor<int32, [4]> rw = const()[name = string(\"rw\"), val = tensor<int32, [4]>([1,1,%d,%d])];\n", ic, oc]; + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> W = reshape(shape=rw,x=wt)[name=string(\"W\")];\n", ic, oc]; + // matmul: [1,1,SEQ,IC] @ [1,1,IC,OC] → 
[1,1,SEQ,OC] + [m appendString:@" bool bF = const()[name = string(\"bF\"), val = bool(false)];\n"]; + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> yh = matmul(transpose_x=bF,transpose_y=bF,x=a3,y=W)[name=string(\"mm\")];\n", seq, oc]; + // Reshape+transpose back: [1,1,SEQ,OC] → transpose → [1,1,OC,SEQ] → reshape → [1,OC,1,SEQ] + [m appendFormat:@" tensor<fp16, [1, 1, %d, %d]> yt = transpose(perm=pm,x=yh)[name=string(\"yt\")];\n", oc, seq]; + [m appendFormat:@" tensor<int32, [4]> ro = const()[name = string(\"ro\"), val = tensor<int32, [4]>([1,%d,1,%d])];\n", oc, seq]; + [m appendFormat:@" tensor<fp16, [1, %d, 1, %d]> yr = reshape(shape=ro,x=yt)[name=string(\"yr\")];\n", oc, seq]; + // Cast back to fp32 + [m appendString:@" string to32 = const()[name = string(\"to32\"), val = string(\"fp32\")];\n"]; + [m appendFormat:@" tensor<fp32, [1, %d, 1, %d]> y = cast(dtype = to32, x = yr)[name = string(\"cout\")];\n", oc, seq]; + [m appendString:@" } -> (y);\n}\n"]; + return m; +} + +// Tiled version: splits OC into tiles, each tile is a separate kernel +// For W[IC, OC], tile along OC: each tile handles W[:, t*T:(t+1)*T] +// Input per tile: [1, IC, 1, SEQ+T] +// Output per tile: [1, T, 1, SEQ] +typedef struct { + Kern **tiles; + int n_tiles, tile_oc, ic, oc, seq; +} TiledMatmul; + +static TiledMatmul *compile_tiled_matmul(int ic, int oc, int tile_oc, int seq) { + TiledMatmul *tm = (TiledMatmul*)calloc(1, sizeof(TiledMatmul)); + tm->ic = ic; tm->oc = oc; tm->seq = seq; tm->tile_oc = tile_oc; + tm->n_tiles = (oc + tile_oc - 1) / tile_oc; + tm->tiles = (Kern**)calloc(tm->n_tiles, sizeof(Kern*)); + for (int t = 0; t < tm->n_tiles; t++) { + int this_oc = (t == tm->n_tiles-1 && oc % tile_oc) ? 
(oc % tile_oc) : tile_oc; + NSString *mil = gen_dynamic_matmul_mil(ic, this_oc, seq); + int in_bytes = ic * (seq + this_oc) * 4; + int out_bytes = this_oc * seq * 4; + tm->tiles[t] = compile_kern_mil_w(mil, @{}, in_bytes, out_bytes); + if (!tm->tiles[t]) { printf("Tile %d compile FAIL\n", t); return NULL; } + } + return tm; +} + +// Write activations + weight tile into IOSurface +// act: [IC, SEQ] column-major (channel-first) +// W: [IC, OC] — full weight matrix, we extract the tile +static void write_tile_input(TiledMatmul *tm, int tile_idx, const float *act, const float *W) { + Kern *k = tm->tiles[tile_idx]; + int ic = tm->ic, seq = tm->seq, toc = tm->tile_oc; + int oc_off = tile_idx * toc; + int this_oc = (tile_idx == tm->n_tiles-1 && tm->oc % toc) ? (tm->oc % toc) : toc; + + IOSurfaceLock(k->ioIn, 0, NULL); + float *buf = (float*)IOSurfaceGetBaseAddress(k->ioIn); + // Activations: buf[d * (seq+this_oc) + t] = act[d * seq + t] + for (int d = 0; d < ic; d++) { + memcpy(buf + d*(seq+this_oc), act + d*seq, seq*sizeof(float)); + // Weight: buf[d * (seq+this_oc) + seq + c] = W[d * oc + oc_off + c] + for (int c = 0; c < this_oc; c++) + buf[d*(seq+this_oc) + seq + c] = W[d*tm->oc + oc_off + c]; + } + IOSurfaceUnlock(k->ioIn, 0, NULL); +} + +// Read tile output into full output buffer +static void read_tile_output(TiledMatmul *tm, int tile_idx, float *out) { + Kern *k = tm->tiles[tile_idx]; + int seq = tm->seq, toc = tm->tile_oc; + int oc_off = tile_idx * toc; + int this_oc = (tile_idx == tm->n_tiles-1 && tm->oc % toc) ? 
(tm->oc % toc) : toc; + + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + float *obuf = (float*)IOSurfaceGetBaseAddress(k->ioOut); + for (int c = 0; c < this_oc; c++) + memcpy(out + (oc_off+c)*seq, obuf + c*seq, seq*sizeof(float)); + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); +} + +int main(int argc, char **argv) { + @autoreleasepool { + mach_timebase_info(&g_tb); + ane_init(); + + // === Test 1: Single 64×64 dynamic matmul (correctness) === + printf("=== Test 1: 64×64 dynamic matmul correctness ===\n"); + { + int D = 64, S = 64; + NSString *mil = gen_dynamic_matmul_mil(D, D, S); + int in_b = D * (S+D) * 4, out_b = D * S * 4; + Kern *k = compile_kern_mil_w(mil, @{}, in_b, out_b); + if (!k) { printf("FAIL\n"); return 1; } + + // Identity test + IOSurfaceLock(k->ioIn, 0, NULL); + float *inp = (float*)IOSurfaceGetBaseAddress(k->ioIn); + memset(inp, 0, in_b); + for (int d = 0; d < D; d++) + for (int s = 0; s < S; s++) + inp[d*(S+D) + s] = (float)(d*S + s) * 0.001f; + for (int d = 0; d < D; d++) + for (int c = 0; c < D; c++) + inp[d*(S+D) + S + c] = (d == c) ? 1.0f : 0.0f; + IOSurfaceUnlock(k->ioIn, 0, NULL); + + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + float *out = (float*)IOSurfaceGetBaseAddress(k->ioOut); + float me = 0; + for (int d = 0; d < D; d++) + for (int s = 0; s < S; s++) { + float e = fabsf(out[d*S+s] - inp[d*(S+D)+s]); + if (e > me) me = e; + } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Identity: max_err=%.6f %s\n", me, me < 0.01 ? "PASS" : "FAIL"); + + // 2× test + IOSurfaceLock(k->ioIn, 0, NULL); + for (int d = 0; d < D; d++) + for (int c = 0; c < D; c++) + inp[d*(S+D) + S + c] = (d == c) ? 
2.0f : 0.0f; + IOSurfaceUnlock(k->ioIn, 0, NULL); + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + float sr = 0; int cnt = 0; + for (int i = 0; i < D*S; i++) + if (fabsf(inp[i/(S)*((S)+D) + i%S]) > 0.001f) { sr += out[i]/inp[i/S*(S+D)+i%S]; cnt++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("2× W: ratio=%.3f %s\n\n", cnt?sr/cnt:0, fabsf(sr/cnt-2.0f)<0.1?"PASS":"FAIL"); + free_kern(k); + } + + // === Test 2: 768×768 single kernel (if it compiles) === + printf("=== Test 2: 768×768 single dynamic matmul ===\n"); + { + int D = 768, S = 256; + int sp_total = S + D; // 256 + 768 = 1024 + int in_b = D * sp_total * 4; // 768 * 1024 * 4 = 3.1MB + int out_b = D * S * 4; // 768 * 256 * 4 = 786KB + printf("IOSurface: in=%.1fMB out=%.1fKB\n", in_b/1e6, out_b/1e3); + + NSString *mil = gen_dynamic_matmul_mil(D, D, S); + uint64_t t0 = mach_absolute_time(); + Kern *k = compile_kern_mil_w(mil, @{}, in_b, out_b); + double compile_ms = tb_ms(mach_absolute_time() - t0); + if (!k) { printf("768×768 compile FAIL\n"); } + else { + printf("Compile: %.1fms\n", compile_ms); + // Random weights + float *act = (float*)calloc(D*S, sizeof(float)); + float *W = (float*)calloc(D*D, sizeof(float)); + for (int i = 0; i < D*S; i++) act[i] = ((float)arc4random() / UINT32_MAX - 0.5f) * 0.1f; + for (int i = 0; i < D*D; i++) W[i] = ((float)arc4random() / UINT32_MAX - 0.5f) * 0.01f; + + // Write to IOSurface + IOSurfaceLock(k->ioIn, 0, NULL); + float *inp = (float*)IOSurfaceGetBaseAddress(k->ioIn); + for (int d = 0; d < D; d++) { + memcpy(inp + d*(S+D), act + d*S, S*4); + memcpy(inp + d*(S+D) + S, W + d*D, D*4); + } + IOSurfaceUnlock(k->ioIn, 0, NULL); + + // Warmup + for (int i = 0; i < 3; i++) ane_eval(k); + + // Benchmark + int iters = 50; + t0 = mach_absolute_time(); + for (int i = 0; i < iters; i++) ane_eval(k); + double total_ms = tb_ms(mach_absolute_time() - t0); + double per_eval = total_ms / iters; + double flops = 2.0 * D * D * S; // matmul 
FLOPs + double gflops = flops / (per_eval * 1e6); + printf("768×768×256 matmul: %.3fms/eval %.1f GFLOP/s\n", per_eval, gflops); + + // Benchmark with IO write (simulating weight update) + t0 = mach_absolute_time(); + for (int i = 0; i < iters; i++) { + IOSurfaceLock(k->ioIn, 0, NULL); + float *p = (float*)IOSurfaceGetBaseAddress(k->ioIn); + for (int d = 0; d < D; d++) + memcpy(p + d*(S+D) + S, W + d*D, D*4); + IOSurfaceUnlock(k->ioIn, 0, NULL); + ane_eval(k); + } + total_ms = tb_ms(mach_absolute_time() - t0); + per_eval = total_ms / iters; + gflops = flops / (per_eval * 1e6); + printf("With weight IO: %.3fms/eval %.1f GFLOP/s\n", per_eval, gflops); + + free(act); free(W); free_kern(k); + } + } + + // === Test 3: Tiled matmul benchmark === + int tile_sizes[] = {64, 128, 256, 384, 768}; + int n_tiles_test = sizeof(tile_sizes)/sizeof(tile_sizes[0]); + printf("\n=== Test 3: Tiled 768×768 matmul (varying tile_oc) ===\n"); + printf("%-10s %-8s %-10s %-12s %-10s\n", "tile_oc", "tiles", "compile", "eval/ms", "GFLOP/s"); + { + int D = 768, S = 256; + float *act = (float*)calloc(D*S, sizeof(float)); + float *W = (float*)calloc(D*D, sizeof(float)); + float *out_full = (float*)calloc(D*S, sizeof(float)); + for (int i = 0; i < D*S; i++) act[i] = ((float)arc4random() / UINT32_MAX - 0.5f) * 0.1f; + for (int i = 0; i < D*D; i++) W[i] = ((float)arc4random() / UINT32_MAX - 0.5f) * 0.01f; + + for (int ti = 0; ti < n_tiles_test; ti++) { + int T = tile_sizes[ti]; + if (T > D) continue; + uint64_t t0 = mach_absolute_time(); + TiledMatmul *tm = compile_tiled_matmul(D, D, T, S); + double compile_ms = tb_ms(mach_absolute_time() - t0); + if (!tm) { printf("%-10d FAIL\n", T); continue; } + + // Warmup + for (int w = 0; w < 2; w++) { + for (int t = 0; t < tm->n_tiles; t++) { + write_tile_input(tm, t, act, W); + ane_eval(tm->tiles[t]); + } + } + + // Benchmark (with IO) + int iters = 20; + t0 = mach_absolute_time(); + for (int i = 0; i < iters; i++) { + for (int t = 0; t < tm->n_tiles; t++) { 
+ write_tile_input(tm, t, act, W); + ane_eval(tm->tiles[t]); + read_tile_output(tm, t, out_full); + } + } + double total_ms = tb_ms(mach_absolute_time() - t0); + double per_matmul = total_ms / iters; + double flops = 2.0 * D * D * S; + double gflops = flops / (per_matmul * 1e6); + printf("%-10d %-8d %-10.0fms %-12.3fms %-10.1f\n", + T, tm->n_tiles, compile_ms, per_matmul, gflops); + + for (int t = 0; t < tm->n_tiles; t++) free_kern(tm->tiles[t]); + free(tm->tiles); free(tm); + } + + // === Correctness check: compare with cblas === + printf("\n=== Correctness: dynamic matmul vs cblas_sgemm ===\n"); + { + int T = 768; // full, no tiling + TiledMatmul *tm = compile_tiled_matmul(D, D, T, S); + if (tm) { + write_tile_input(tm, 0, act, W); + ane_eval(tm->tiles[0]); + read_tile_output(tm, 0, out_full); + + // Reference: cblas y = act^T @ W → y[s,oc] = sum_d act[d,s]*W[d,oc] + // act is [D,S] col-major, W is [D,D] row-major + // We want out[oc,s] = sum_d act[d,s] * W[d,oc] + // = W^T @ act where W^T is [D,D] and act is [D,S] → out is [D,S] + float *ref = (float*)calloc(D*S, sizeof(float)); + // out[oc*S+s] = sum_d W[d*D+oc] * act[d*S+s] + // This is: (W^T) @ act in column-major: M=D,N=S,K=D + // cblas: C = alpha*A*B + beta*C + // A=W^T [D×D], B=act [D×S], C=ref [D×S] + cblas_sgemm(CblasColMajor, CblasTrans, CblasNoTrans, + D, S, D, 1.0f, W, D, act, D, 0.0f, ref, D); + float me = 0; + for (int i = 0; i < D*S; i++) { + float e = fabsf(out_full[i] - ref[i]); + if (e > me) me = e; + } + printf("vs cblas: max_err=%.6f %s\n", me, me < 1.0 ? 
"PASS" : "FAIL"); + free(ref); + for (int t = 0; t < tm->n_tiles; t++) free_kern(tm->tiles[t]); + free(tm->tiles); free(tm); + } + } + + free(act); free(W); free(out_full); + } + + // === Summary for training === + printf("\n=== Summary ===\n"); + printf("Stories110M: 12 layers × 10 matmuls/layer = 120 matmuls/step\n"); + printf("Sizes: Wq/Wk/Wv/Wo [768,768], W1/W3 [2048,768], W2 [768,2048]\n"); + printf("With dynamic weights: compile once, update IOSurface every step\n"); + + printf("\nDone.\n"); + } + return 0; +} diff --git a/training/test_pipeline_unit.c b/training/test_pipeline_unit.c new file mode 100644 index 0000000..dcdd9e5 --- /dev/null +++ b/training/test_pipeline_unit.c @@ -0,0 +1,448 @@ +// test_pipeline_unit.c — Unit tests for pipeline scheduler + checkpoint manager +// Pure C, no ANE dependency. Validates state machine transitions and checkpoint logic. +// Build: cc -O2 -o test_pipeline_unit test_pipeline_unit.c -lm +// Run: ./test_pipeline_unit +#include +#include +#include +#include +#include +#include + +// Stub out mmap/exec dependencies — we only test the pure logic +#define _PIPELINE_SKIP_MMAP 1 + +#include "model_config.h" +#include "gradient_checkpoint.h" + +// ===== Test helpers ===== + +static int tests_run = 0; +static int tests_passed = 0; + +#define TEST(name) do { \ + tests_run++; \ + printf(" %-50s", name); \ +} while(0) + +#define PASS() do { tests_passed++; printf("PASS\n"); } while(0) +#define FAIL(msg) do { printf("FAIL: %s\n", msg); } while(0) + +#define ASSERT_EQ(a, b, msg) do { \ + if ((a) != (b)) { FAIL(msg); printf(" got %d, expected %d\n", (int)(a), (int)(b)); return; } \ +} while(0) + +#define ASSERT_TRUE(cond, msg) do { \ + if (!(cond)) { FAIL(msg); return; } \ +} while(0) + +// ===== model_config.h tests ===== + +static void test_dims_init(void) { + TEST("model_dims_init computes derived fields"); + ModelDims d = {.dim = 768, .n_heads = 12, .n_kv_heads = 12, .seq_len = 256}; + model_dims_init(&d); + ASSERT_EQ(d.head_dim, 
64, "head_dim = dim / n_heads"); + ASSERT_EQ(d.kv_dim, 768, "kv_dim = head_dim * n_kv_heads"); + ASSERT_EQ(d.score_ch, 12 * 256, "score_ch = n_heads * seq_len"); + PASS(); +} + +static void test_stories110m_preset(void) { + TEST("Stories110M preset"); + ModelConfig cfg = model_config_stories110m(); + ASSERT_EQ(cfg.dims.dim, 768, "dim"); + ASSERT_EQ(cfg.dims.n_layers, 12, "n_layers"); + ASSERT_EQ(cfg.dims.n_heads, 12, "n_heads"); + ASSERT_EQ(cfg.compile.compile_budget, 119, "compile_budget"); + ASSERT_TRUE(cfg.compile.headroom_pct > 0.0f, "headroom > 0"); + PASS(); +} + +static void test_llama7b_preset(void) { + TEST("LLaMA-7B preset"); + ModelConfig cfg = model_config_llama_7b(); + ASSERT_EQ(cfg.dims.dim, 4096, "dim"); + ASSERT_EQ(cfg.dims.n_layers, 32, "n_layers"); + ASSERT_EQ(cfg.dims.hidden_dim, 11008, "hidden_dim"); + PASS(); +} + +static void test_layer_memory_nonzero(void) { + TEST("Per-layer memory sizes are nonzero"); + ModelConfig cfg = model_config_stories110m(); + ASSERT_TRUE(layer_weight_bytes(&cfg.dims) > 0, "weight bytes"); + ASSERT_TRUE(layer_adam_bytes(&cfg.dims) > 0, "adam bytes"); + ASSERT_TRUE(layer_activation_bytes(&cfg.dims) > 0, "activation bytes"); + ASSERT_TRUE(layer_gradient_bytes(&cfg.dims) > 0, "gradient bytes"); + ASSERT_TRUE(total_model_bytes(&cfg) > 0, "total model bytes"); + PASS(); +} + +static void test_adam_is_2x_weights(void) { + TEST("Adam state = 2x weight size"); + ModelConfig cfg = model_config_stories110m(); + ASSERT_EQ(layer_adam_bytes(&cfg.dims), 2 * layer_weight_bytes(&cfg.dims), "adam = 2 * weights"); + PASS(); +} + +// ===== Pipeline planning tests ===== + +static void test_max_layers_per_compile(void) { + TEST("max_layers_per_compile respects budget"); + CompileConfig cc = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 0.10f}; + int max = max_layers_per_compile(&cc); + // usable = floor(119 * 0.9) = 107, per_layer = 6, max = 107/6 = 17 + ASSERT_EQ(max, 17, "max layers = 17 for 
budget=119, 6 kernels/layer, 10% headroom"); + PASS(); +} + +static void test_configurable_headroom(void) { + TEST("Configurable headroom changes max layers"); + CompileConfig cc5 = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 0.05f}; + CompileConfig cc20 = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 0.20f}; + int max5 = max_layers_per_compile(&cc5); // floor(119*0.95/6) = 18 + int max20 = max_layers_per_compile(&cc20); // floor(119*0.80/6) = 15 + ASSERT_TRUE(max5 > max20, "5% headroom fits more layers than 20%"); + ASSERT_EQ(max5, 18, "5% headroom: 18 layers"); + ASSERT_EQ(max20, 15, "20% headroom: 15 layers"); + PASS(); +} + +static void test_invalid_headroom_defaults(void) { + TEST("Invalid headroom falls back to 10%"); + CompileConfig cc_neg = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = -0.5f}; + CompileConfig cc_over = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 1.5f}; + CompileConfig cc_def = {.compile_budget = 119, .kernels_per_layer = 5, + .static_per_layer = 1, .headroom_pct = 0.10f}; + ASSERT_EQ(max_layers_per_compile(&cc_neg), max_layers_per_compile(&cc_def), + "negative headroom -> default"); + ASSERT_EQ(max_layers_per_compile(&cc_over), max_layers_per_compile(&cc_def), + "headroom > 1.0 -> default"); + PASS(); +} + +static void test_plan_stories110m(void) { + TEST("Stories110M fits in 1 group"); + ModelConfig cfg = model_config_stories110m(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_EQ(plan.n_groups, 1, "1 group"); + ASSERT_EQ(plan.groups[0].start_layer, 0, "starts at 0"); + ASSERT_EQ(plan.groups[0].end_layer, 12, "ends at 12"); + ASSERT_EQ(plan.groups[0].n_layers, 12, "12 layers"); + ASSERT_EQ(plan.groups[0].total_kernels, 72, "72 total kernels"); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_plan_llama7b_multiple_groups(void) { + 
TEST("LLaMA-7B needs multiple groups"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_TRUE(plan.n_groups >= 2, "at least 2 groups for 32 layers"); + // Verify all layers covered + int total_layers = 0; + for (int g = 0; g < plan.n_groups; g++) { + total_layers += plan.groups[g].n_layers; + ASSERT_TRUE(plan.groups[g].n_layers > 0, "no empty groups"); + } + ASSERT_EQ(total_layers, 32, "all 32 layers covered"); + // Verify contiguous + for (int g = 1; g < plan.n_groups; g++) { + ASSERT_EQ(plan.groups[g].start_layer, plan.groups[g-1].end_layer, "contiguous groups"); + } + pipeline_plan_free(&plan); + PASS(); +} + +static void test_plan_kernel_budget(void) { + TEST("No group exceeds compile budget"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + int usable = (int)(cfg.compile.compile_budget * (1.0f - cfg.compile.headroom_pct)); + for (int g = 0; g < plan.n_groups; g++) { + ASSERT_TRUE(plan.groups[g].total_kernels <= usable, + "group kernel count <= usable budget"); + } + pipeline_plan_free(&plan); + PASS(); +} + +// ===== Gradient checkpoint tests ===== + +static void test_ckpt_all_saves_everything(void) { + TEST("CKPT_ALL saves all layers"); + ModelConfig cfg = model_config_stories110m(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_ALL, &cfg, &plan, 0); + ASSERT_EQ(cm.n_checkpointed, 12, "12 layers saved"); + for (int i = 0; i < 12; i++) { + ASSERT_TRUE(checkpoint_should_save(&cm, i), "every layer saved"); + ASSERT_TRUE(!checkpoint_needs_recompute(&cm, i), "no recompute needed"); + } + ASSERT_TRUE(checkpoint_recompute_overhead(&cm) < 0.001, "zero overhead"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_none_saves_minimum(void) { + TEST("CKPT_NONE saves only layer 0"); + ModelConfig cfg = model_config_stories110m(); + PipelinePlan plan = 
compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_NONE, &cfg, &plan, 0); + ASSERT_EQ(cm.n_checkpointed, 1, "only 1 layer saved"); + ASSERT_TRUE(checkpoint_should_save(&cm, 0), "layer 0 saved"); + ASSERT_TRUE(checkpoint_needs_recompute(&cm, 5), "layer 5 needs recompute"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_sqrt_interval(void) { + TEST("CKPT_SQRT uses sqrt(N) interval"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_SQRT, &cfg, &plan, 0); + int expected_interval = (int)sqrtf(32.0f); // 5 + ASSERT_EQ(cm.interval, expected_interval, "interval = sqrt(32) = 5"); + // Layer 0 always saved, then 5, 10, 15, 20, 25, 30, 31 + ASSERT_TRUE(checkpoint_should_save(&cm, 0), "layer 0 saved"); + ASSERT_TRUE(checkpoint_should_save(&cm, 5), "layer 5 saved"); + ASSERT_TRUE(!checkpoint_should_save(&cm, 3), "layer 3 not saved"); + ASSERT_TRUE(checkpoint_should_save(&cm, 31), "last layer saved"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_boundary(void) { + TEST("CKPT_BOUNDARY saves group edges"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_BOUNDARY, &cfg, &plan, 0); + // First layer of each group + last layer overall + for (int g = 0; g < plan.n_groups; g++) { + ASSERT_TRUE(checkpoint_should_save(&cm, plan.groups[g].start_layer), + "group start layer saved"); + } + ASSERT_TRUE(checkpoint_should_save(&cm, 31), "last layer saved"); + // Middle of first group should not be saved + if (plan.groups[0].n_layers > 2) { + int mid = plan.groups[0].start_layer + plan.groups[0].n_layers / 2; + ASSERT_TRUE(checkpoint_needs_recompute(&cm, mid), "mid-group needs recompute"); + } + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void 
test_ckpt_memory_savings(void) { + TEST("Checkpoint memory savings are positive for non-ALL policies"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + + CheckpointManager cm_all = checkpoint_init(CKPT_ALL, &cfg, &plan, 0); + CheckpointManager cm_sqrt = checkpoint_init(CKPT_SQRT, &cfg, &plan, 0); + CheckpointManager cm_none = checkpoint_init(CKPT_NONE, &cfg, &plan, 0); + + size_t saved_sqrt = checkpoint_memory_saved(&cm_sqrt, &cfg.dims); + size_t saved_none = checkpoint_memory_saved(&cm_none, &cfg.dims); + + ASSERT_TRUE(saved_sqrt > 0, "SQRT saves memory"); + ASSERT_TRUE(saved_none > saved_sqrt, "NONE saves more than SQRT"); + ASSERT_EQ(checkpoint_memory_saved(&cm_all, &cfg.dims), 0, "ALL saves nothing"); + + checkpoint_free(&cm_all); + checkpoint_free(&cm_sqrt); + checkpoint_free(&cm_none); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_recompute_depth(void) { + TEST("Recompute depth counts layers from nearest checkpoint"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_SQRT, &cfg, &plan, 0); + // With interval=5: checkpoints at 0, 5, 10, 15, 20, 25, 30, 31 + // Layer 3: nearest saved before = 0, depth = 3 + ASSERT_EQ(checkpoint_recompute_depth(&cm, 3), 3, "depth from layer 0 to 3"); + // Layer 7: nearest saved before = 5, depth = 2 + ASSERT_EQ(checkpoint_recompute_depth(&cm, 7), 2, "depth from layer 5 to 7"); + // Layer 5: nearest saved = 5, depth = 0 + ASSERT_EQ(checkpoint_recompute_depth(&cm, 5), 0, "checkpointed layer = 0 depth"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_out_of_bounds(void) { + TEST("Checkpoint queries handle out-of-bounds gracefully"); + ModelConfig cfg = model_config_stories110m(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm = checkpoint_init(CKPT_ALL, &cfg, &plan, 0); + 
ASSERT_TRUE(!checkpoint_should_save(&cm, -1), "negative index returns false"); + ASSERT_TRUE(!checkpoint_should_save(&cm, 100), "over-max index returns false"); + checkpoint_free(&cm); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_every_n_custom_interval(void) { + TEST("CKPT_EVERY_N respects custom_interval parameter"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointManager cm3 = checkpoint_init(CKPT_EVERY_N, &cfg, &plan, 3); + CheckpointManager cm8 = checkpoint_init(CKPT_EVERY_N, &cfg, &plan, 8); + ASSERT_EQ(cm3.interval, 3, "interval=3 when custom_interval=3"); + ASSERT_EQ(cm8.interval, 8, "interval=8 when custom_interval=8"); + ASSERT_TRUE(cm3.n_checkpointed > cm8.n_checkpointed, + "shorter interval = more checkpoints"); + // Verify layer 0 and last layer always saved + ASSERT_TRUE(checkpoint_should_save(&cm3, 0), "layer 0 saved (interval=3)"); + ASSERT_TRUE(checkpoint_should_save(&cm3, 31), "last layer saved (interval=3)"); + ASSERT_TRUE(checkpoint_should_save(&cm8, 0), "layer 0 saved (interval=8)"); + ASSERT_TRUE(checkpoint_should_save(&cm8, 31), "last layer saved (interval=8)"); + checkpoint_free(&cm3); + checkpoint_free(&cm8); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_ckpt_n_checkpointed_accuracy(void) { + TEST("n_checkpointed matches actual is_saved bit count"); + ModelConfig cfg = model_config_llama_7b(); + PipelinePlan plan = compute_pipeline_plan(&cfg); + CheckpointPolicy policies[] = {CKPT_ALL, CKPT_BOUNDARY, CKPT_SQRT, CKPT_EVERY_N, CKPT_NONE}; + for (int p = 0; p < 5; p++) { + CheckpointManager cm = checkpoint_init(policies[p], &cfg, &plan, 0); + int actual = 0; + for (int i = 0; i < cm.n_layers; i++) { + if (cm.is_saved[i]) actual++; + } + ASSERT_EQ(cm.n_checkpointed, actual, "n_checkpointed matches is_saved count"); + checkpoint_free(&cm); + } + pipeline_plan_free(&plan); + PASS(); +} + +static void test_dims_init_zero_heads(void) { + 
TEST("model_dims_init guards divide-by-zero on n_heads=0"); + ModelDims d = {.dim = 768, .n_heads = 0, .n_kv_heads = 0, .seq_len = 256}; + model_dims_init(&d); + ASSERT_EQ(d.head_dim, 0, "head_dim=0 when n_heads=0"); + ASSERT_EQ(d.kv_dim, 0, "kv_dim=0 when n_heads=0"); + ASSERT_EQ(d.score_ch, 0, "score_ch=0 when n_heads=0"); + PASS(); +} + +// ===== FLOP estimation tests ===== + +static void test_flops_nonzero(void) { + TEST("FLOP estimates are nonzero and ANE < total"); + ModelConfig cfg = model_config_stories110m(); + double total = flops_per_step(&cfg); + double ane = ane_flops_per_step(&cfg); + ASSERT_TRUE(total > 0, "total FLOPs > 0"); + ASSERT_TRUE(ane > 0, "ANE FLOPs > 0"); + ASSERT_TRUE(ane < total, "ANE FLOPs < total (dW is on CPU)"); + PASS(); +} + +static void test_flops_scale_with_layers(void) { + TEST("FLOPs scale roughly linearly with layer count"); + ModelConfig cfg12 = model_config_stories110m(); + ModelConfig cfg8 = model_config_stories42m(); + double f12 = flops_per_step(&cfg12); + double f8 = flops_per_step(&cfg8); + // Not exact linear due to different dims, but 12-layer should be >8-layer + ASSERT_TRUE(f12 > f8, "12 layers > 8 layers"); + PASS(); +} + +// ===== Pipeline plan edge cases ===== + +static void test_plan_single_layer(void) { + TEST("Single-layer model = 1 group"); + ModelConfig cfg = model_config_stories110m(); + cfg.dims.n_layers = 1; + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_EQ(plan.n_groups, 1, "1 group"); + ASSERT_EQ(plan.groups[0].n_layers, 1, "1 layer in group"); + pipeline_plan_free(&plan); + PASS(); +} + +static void test_plan_exact_budget_fit(void) { + TEST("Layers that exactly fill budget = 1 group"); + ModelConfig cfg = model_config_stories110m(); + // 17 layers * 6 kernels = 102 <= 107 usable (10% headroom on 119) + cfg.dims.n_layers = 17; + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_EQ(plan.n_groups, 1, "17 layers fit in 1 group"); + pipeline_plan_free(&plan); + PASS(); +} + +static void 
test_plan_one_over_budget(void) { + TEST("One layer over budget = 2 groups"); + ModelConfig cfg = model_config_stories110m(); + // 18 layers * 6 kernels = 108 > 107 usable -> 2 groups + cfg.dims.n_layers = 18; + PipelinePlan plan = compute_pipeline_plan(&cfg); + ASSERT_EQ(plan.n_groups, 2, "18 layers = 2 groups"); + int total = plan.groups[0].n_layers + plan.groups[1].n_layers; + ASSERT_EQ(total, 18, "all layers covered"); + pipeline_plan_free(&plan); + PASS(); +} + +// ===== Main ===== + +int main(void) { + printf("=== Pipeline Unit Tests ===\n\n"); + + printf("[model_config.h]\n"); + test_dims_init(); + test_stories110m_preset(); + test_llama7b_preset(); + test_layer_memory_nonzero(); + test_adam_is_2x_weights(); + + printf("\n[pipeline planning]\n"); + test_max_layers_per_compile(); + test_configurable_headroom(); + test_invalid_headroom_defaults(); + test_plan_stories110m(); + test_plan_llama7b_multiple_groups(); + test_plan_kernel_budget(); + test_plan_single_layer(); + test_plan_exact_budget_fit(); + test_plan_one_over_budget(); + + printf("\n[gradient_checkpoint.h]\n"); + test_ckpt_all_saves_everything(); + test_ckpt_none_saves_minimum(); + test_ckpt_sqrt_interval(); + test_ckpt_boundary(); + test_ckpt_memory_savings(); + test_ckpt_recompute_depth(); + test_ckpt_out_of_bounds(); + test_ckpt_every_n_custom_interval(); + test_ckpt_n_checkpointed_accuracy(); + test_dims_init_zero_heads(); + + printf("\n[FLOP estimation]\n"); + test_flops_nonzero(); + test_flops_scale_with_layers(); + + printf("\n=== Results: %d/%d passed ===\n", tests_passed, tests_run); + return (tests_passed == tests_run) ? 
0 : 1; +} diff --git a/training/test_rmsnorm_bwd.m b/training/test_rmsnorm_bwd.m new file mode 100644 index 0000000..9014e53 --- /dev/null +++ b/training/test_rmsnorm_bwd.m @@ -0,0 +1,123 @@ +// test_rmsnorm_bwd.m — Test RMSNorm backward ANE kernel vs CPU reference +// Build: xcrun clang -O2 -framework Foundation -framework IOSurface \ +// -framework CoreML -framework Accelerate -ldl -lobjc \ +// -o test_rmsnorm_bwd test_rmsnorm_bwd.m +#include "ane_rmsnorm_bwd.h" +#include "stories_cpu_ops.h" + +int main(void) { + @autoreleasepool { + setbuf(stdout, NULL); + ane_init(); + mach_timebase_info(&g_tb); + + printf("=== Test: RMSNorm Backward on ANE ===\n"); + printf("DIM=%d SEQ=%d\n\n", DIM, SEQ); + + // Allocate test data + float *x = (float*)malloc(DIM * SEQ * 4); + float *dy = (float*)malloc(DIM * SEQ * 4); + float *w = (float*)malloc(DIM * 4); + float *dx_cpu = (float*)calloc(DIM * SEQ, 4); + float *dw_cpu = (float*)calloc(DIM, 4); + float *dx_ane = (float*)malloc(DIM * SEQ * 4); + + // Random init (channel-first [DIM, SEQ]) + srand48(42); + for (int i = 0; i < DIM * SEQ; i++) { + x[i] = (float)(drand48() * 2 - 1) * 0.5f; + dy[i] = (float)(drand48() * 2 - 1) * 0.1f; + } + for (int i = 0; i < DIM; i++) { + w[i] = (float)(drand48() * 0.5 + 0.75); // close to 1.0 + } + + // === CPU Reference === + uint64_t t0 = mach_absolute_time(); + rmsnorm_bwd(dx_cpu, dw_cpu, dy, x, w, DIM, SEQ); + uint64_t t1 = mach_absolute_time(); + printf("CPU rmsnorm_bwd: %.2f ms\n", tb_ms(t1 - t0)); + + // === ANE Kernel === + printf("Compiling ANE rmsnorm_bwd kernel...\n"); + NSString *mil = gen_rmsnorm_bwd(); + + // Build weight blob for RMSNorm weights + NSData *rms_blob = build_blob(w, 1, DIM); + + int in_bytes = 2 * DIM * SEQ * 2; // concat(dy, x) in fp16 + int out_bytes = DIM * SEQ * 2; // dx in fp16 + + Kern *kern = compile_kern_mil_w(mil, (@{ + @"@model_path/weights/rms_w.bin": @{@"offset":@0, @"data":rms_blob}, + }), in_bytes, out_bytes); + + if (!kern) { + printf("FAIL: ANE kernel 
compilation failed!\n"); + return 1; + } + printf("Compile OK (compiles=%d)\n", g_compile_count); + + // Write input: concat(dy, x) into ioIn + // dy goes at channel offset 0, x goes at channel offset DIM + io_write_fp16_at(kern->ioIn, 0, dy, DIM, SEQ); + io_write_fp16_at(kern->ioIn, DIM, x, DIM, SEQ); + + // Evaluate + t0 = mach_absolute_time(); + ane_eval(kern); + t1 = mach_absolute_time(); + printf("ANE eval: %.3f ms\n", tb_ms(t1 - t0)); + + // Read output + io_read_fp16(kern->ioOut, dx_ane, 0, DIM, SEQ); + + // === Compare === + float max_err = 0, sum_err = 0; + int max_i = 0, max_j = 0; + for (int i = 0; i < DIM; i++) { + for (int j = 0; j < SEQ; j++) { + int idx = i * SEQ + j; + float err = fabsf(dx_cpu[idx] - dx_ane[idx]); + sum_err += err; + if (err > max_err) { + max_err = err; + max_i = i; max_j = j; + } + } + } + float mean_err = sum_err / (DIM * SEQ); + + printf("\n=== Results ===\n"); + printf("Max absolute error: %.6f at [%d,%d] (CPU=%.6f ANE=%.6f)\n", + max_err, max_i, max_j, dx_cpu[max_i*SEQ+max_j], dx_ane[max_i*SEQ+max_j]); + printf("Mean absolute error: %.6f\n", mean_err); + + // Sample outputs + printf("\nSample dx values (first 4 channels, first 4 positions):\n"); + printf("%-6s %-12s %-12s %-10s\n", "Idx", "CPU", "ANE", "Error"); + for (int i = 0; i < 4 && i < DIM; i++) { + for (int j = 0; j < 4 && j < SEQ; j++) { + int idx = i * SEQ + j; + printf("[%d,%d] %-12.6f %-12.6f %-10.6f\n", + i, j, dx_cpu[idx], dx_ane[idx], fabsf(dx_cpu[idx] - dx_ane[idx])); + } + } + + // Benchmark: multiple evals + int N = 100; + t0 = mach_absolute_time(); + for (int i = 0; i < N; i++) ane_eval(kern); + t1 = mach_absolute_time(); + printf("\nBenchmark: %d evals in %.2f ms (%.3f ms/eval)\n", + N, tb_ms(t1-t0), tb_ms(t1-t0)/N); + + // Pass/fail + bool pass = max_err < 0.05f && mean_err < 0.01f; + printf("\n%s (threshold: max<0.05, mean<0.01)\n", pass ? 
"PASS ✅" : "FAIL ❌"); + + free_kern(kern); + free(x); free(dy); free(w); free(dx_cpu); free(dw_cpu); free(dx_ane); + return pass ? 0 : 1; + } +} diff --git a/training/test_weight_patch.m b/training/test_weight_patch.m new file mode 100644 index 0000000..13473b7 --- /dev/null +++ b/training/test_weight_patch.m @@ -0,0 +1,450 @@ +// test_weight_patch.m — Test whether ANE weights can be patched after compile +#import +#import +#import +#import +#import +#import +#import +#import +#include +#include + +#include "stories_io.h" + +// MIL: fp32 in → cast fp16 → conv → cast fp32 out (matches inmem_peak.m pattern) +static NSString *gen_conv_mil(int ic, int oc, int sp) { + NSMutableString *m = [NSMutableString string]; + [m appendString:@"program(1.3)\n" + "[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, " + "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, " + "{\"coremltools-version\", \"9.0\"}})]\n{\n"]; + [m appendFormat:@" func main(tensor x) {\n", ic, sp]; + [m appendString: + @" string pt = const()[name = string(\"pt\"), val = string(\"valid\")];\n" + " tensor st = const()[name = string(\"st\"), val = tensor([1, 1])];\n" + " tensor pd = const()[name = string(\"pd\"), val = tensor([0, 0, 0, 0])];\n" + " tensor dl = const()[name = string(\"dl\"), val = tensor([1, 1])];\n" + " int32 gr = const()[name = string(\"gr\"), val = int32(1)];\n" + " string to16 = const()[name = string(\"to16\"), val = string(\"fp16\")];\n"]; + [m appendFormat:@" tensor xh = cast(dtype = to16, x = x)[name = string(\"cast_in\")];\n", ic, sp]; + [m appendFormat:@" tensor W = const()[name = string(\"W\"), " + "val = tensor(BLOBFILE(path = string(\"@model_path/weights/w.bin\"), offset = uint64(64)))];\n", + oc, ic, oc, ic]; + [m appendFormat:@" tensor yh = conv(dilations = dl, groups = gr, pad = pd, pad_type = pt, strides = st, weight = W, x = xh)" + "[name = string(\"conv\")];\n", oc, sp]; + [m appendString:@" string to32 = const()[name = 
string(\"to32\"), val = string(\"fp32\")];\n"]; + [m appendFormat:@" tensor y = cast(dtype = to32, x = yh)[name = string(\"cast_out\")];\n", oc, sp]; + [m appendString:@" } -> (y);\n}\n"]; + return m; +} + +int main(int argc, char **argv) { + @autoreleasepool { + mach_timebase_info(&g_tb); + ane_init(); + + int IC = 256, OC = 256, SP = 64; + int io_bytes = IC * SP * 4; // fp32 + + // Identity weight + float *W_id = (float*)calloc(OC*IC, sizeof(float)); + for (int i = 0; i < IC; i++) W_id[i*IC+i] = 1.0f; + + NSString *mil = gen_conv_mil(IC, OC, SP); + NSDictionary *wd = @{@"@model_path/weights/w.bin": @{@"offset":@0, @"data":build_blob(W_id, OC, IC)}}; + + printf("=== Compiling conv %dx%d sp=%d ===\n", OC, IC, SP); + Kern *k = compile_kern_mil_w(mil, wd, io_bytes, io_bytes); + if (!k) { printf("COMPILE FAILED\n"); free(W_id); return 1; } + printf("Compile OK!\n"); + + // Write fp32 input + IOSurfaceLock(k->ioIn, 0, NULL); + float *inp = (float*)IOSurfaceGetBaseAddress(k->ioIn); + for (int i = 0; i < IC*SP; i++) inp[i] = (i % 100) * 0.01f; + IOSurfaceUnlock(k->ioIn, 0, NULL); + + // Eval with identity + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + float *out = (float*)IOSurfaceGetBaseAddress(k->ioOut); + printf("In: [%.3f, %.3f, %.3f, %.3f]\n", inp[0], inp[1], inp[2], inp[3]); + printf("Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float max_err = 0; + for (int i = 0; i < OC*SP; i++) { + float err = fabsf(out[i] - inp[i]); + if (err > max_err) max_err = err; + } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Identity max_err=%.6f %s\n\n", max_err, max_err < 0.1 ? 
"PASS" : "FAIL"); + + // === Approach 1: Patch weight on disk, unload+reload === + printf("=== Approach 1: Disk patch + unload/reload ===\n"); + float *W_2x = (float*)calloc(OC*IC, sizeof(float)); + for (int i = 0; i < IC; i++) W_2x[i*IC+i] = 2.0f; + [build_blob(W_2x, OC, IC) writeToFile: + [(__bridge NSString*)k->tmpDir stringByAppendingPathComponent:@"weights/w.bin"] atomically:YES]; + + id mdl = (__bridge id)k->model; + NSError *e = nil; + ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)(mdl, @selector(unloadWithQoS:error:), 21, &e); + e = nil; + BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e); + printf("Reload: %s\n", ok?"OK":"FAIL"); + if (ok) { + // Re-create request after reload + id wI = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), k->ioIn); + id wO = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), k->ioOut); + CFRelease(k->request); + k->request = (void*)CFBridgingRetain(((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR, + @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), + @[wI], @[@0], @[wO], @[@0], nil, nil, @0)); + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float sr = 0; int cnt = 0; + for (int i = 0; i < OC*SP; i++) + if (fabsf(inp[i]) > 0.01f) { sr += out[i]/inp[i]; cnt++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Ratio: %.3f (2.0=patched, 1.0=cached)\n\n", cnt>0?sr/cnt:0); + } + + // === Approach 2: Memory scan === + printf("=== Approach 2: Memory scan ===\n"); + uint16_t pat1[8] = {0x3C00, 0, 0, 0, 0, 0, 0, 0}; + uint16_t pat2[8] = {0x4000, 0, 0, 0, 0, 0, 0, 0}; + mach_port_t task = mach_task_self(); + vm_address_t addr = 0; vm_size_t sz; natural_t depth = 1; + int f1 = 0, f2 = 0; + while (1) { 
+ struct vm_region_submap_info_64 info; + mach_msg_type_number_t count = VM_REGION_SUBMAP_INFO_COUNT_64; + if (vm_region_recurse_64(task, &addr, &sz, &depth, (vm_region_recurse_info_t)&info, &count) != KERN_SUCCESS) break; + if (info.is_submap) { depth++; continue; } + if (!(info.protection & VM_PROT_READ) || sz < (size_t)(OC*IC*2)) { addr += sz; continue; } + uint8_t *base = (uint8_t*)addr; + for (size_t off = 0; off + OC*IC*2 <= sz; off += 2) { + int w = 0; + if (memcmp(base+off, pat1, 16) == 0) w = 1; + else if (memcmp(base+off, pat2, 16) == 0) w = 2; + if (!w) continue; + uint16_t *p = (uint16_t*)(base+off), diag = (w==1)?0x3C00:0x4000; + int ok2 = 1; + for (int r = 0; r < OC && ok2; r++) + for (int c = 0; c < IC && ok2; c++) + if (p[r*IC+c] != ((r==c)?diag:0)) ok2 = 0; + if (!ok2) continue; + if (w==1) f1++; else f2++; + printf(" FOUND %dx @%p prot=%d/%d %s\n", w, (void*)(addr+off), + info.protection, info.max_protection, (info.protection&VM_PROT_WRITE)?"WR":"RO"); + } + addr += sz; + } + printf("Found: 1x=%d 2x=%d\n", f1, f2); + + // Now patch ALL found weight patterns to 3× and re-eval + if (f1 > 0 || f2 > 0) { + printf("Patching all found patterns to 3x identity...\n"); + addr = 0; depth = 1; + while (1) { + struct vm_region_submap_info_64 info2; + mach_msg_type_number_t count2 = VM_REGION_SUBMAP_INFO_COUNT_64; + if (vm_region_recurse_64(task, &addr, &sz, &depth, (vm_region_recurse_info_t)&info2, &count2) != KERN_SUCCESS) break; + if (info2.is_submap) { depth++; continue; } + if (!(info2.protection & VM_PROT_READ) || sz < (size_t)(OC*IC*2)) { addr += sz; continue; } + uint8_t *base2 = (uint8_t*)addr; + for (size_t off = 0; off + OC*IC*2 <= sz; off += 2) { + int w2 = 0; + if (memcmp(base2+off, pat1, 16) == 0) w2 = 1; + else if (memcmp(base2+off, pat2, 16) == 0) w2 = 2; + if (!w2) continue; + uint16_t *p2 = (uint16_t*)(base2+off), diag2 = (w2==1)?0x3C00:0x4000; + int ok3 = 1; + for (int r = 0; r < OC && ok3; r++) + for (int c = 0; c < IC && ok3; c++) + if 
(p2[r*IC+c] != ((r==c)?diag2:0)) ok3 = 0; + if (!ok3) continue; + if (info2.protection & VM_PROT_WRITE) { + printf(" Patching %dx @%p to 3x\n", w2, (void*)(addr+off)); + for (int r = 0; r < OC; r++) + for (int c = 0; c < IC; c++) + p2[r*IC+c] = (r==c) ? 0x4200 : 0; // fp16(3.0) + } + } + addr += sz; + } + + printf("\n=== Eval after memory patch (expect 3x) ===\n"); + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float sr2 = 0; int cnt2 = 0; + for (int i = 0; i < OC*SP; i++) + if (fabsf(inp[i]) > 0.01f) { sr2 += out[i]/inp[i]; cnt2++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("Ratio: %.3f (3.0=mem patch works!, 1.0=ANE uses SRAM copy)\n", cnt2>0?sr2/cnt2:0); + } + printf("\n"); + + // === Approach 3: Explore classes === + printf("=== ANE classes ===\n"); + const char *cn[] = {"_ANEWeight", "_ANEProgramForEvaluation", "_ANEChainingRequest", NULL}; + for (int i = 0; cn[i]; i++) { + Class cls = NSClassFromString([NSString stringWithUTF8String:cn[i]]); + if (!cls) { printf("%s: NOT FOUND\n", cn[i]); continue; } + printf("%s:\n", cn[i]); + unsigned int mc = 0; Method *ms = class_copyMethodList(cls, &mc); + for (unsigned j = 0; j < mc; j++) printf(" - %s\n", sel_getName(method_getName(ms[j]))); + free(ms); + mc = 0; ms = class_copyMethodList(object_getClass(cls), &mc); + for (unsigned j = 0; j < mc; j++) printf(" + %s\n", sel_getName(method_getName(ms[j]))); + free(ms); printf("\n"); + } + @try { printf("programHandle: %s\n", [[[mdl valueForKey:@"programHandle"] description] UTF8String]); } @catch(id x) {} + @try { printf("intermediateBufferHandle: %s\n", [[[mdl valueForKey:@"intermediateBufferHandle"] description] UTF8String]); } @catch(id x) {} + + // === Approach 4: _ANEWeight + updateWeightURL === + printf("\n=== Approach 4: _ANEWeight API ===\n"); + Class AW = NSClassFromString(@"_ANEWeight"); + if (AW) { + // Write 5× identity weights to 
a new file + float *W_5x = (float*)calloc(OC*IC, sizeof(float)); + for (int i = 0; i < IC; i++) W_5x[i*IC+i] = 5.0f; + NSString *wpath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"patched_w.bin"]; + [build_blob(W_5x, OC, IC) writeToFile:wpath atomically:YES]; + free(W_5x); + + NSURL *wurl = [NSURL fileURLWithPath:wpath]; + id wobj = ((id(*)(Class,SEL,id,id))objc_msgSend)(AW, + @selector(weightWithSymbolAndURL:weightURL:), @"W", wurl); + printf(" _ANEWeight: %s\n", wobj ? [[wobj description] UTF8String] : "nil"); + if (wobj) { + printf(" weightSymbol: %s\n", [((id(*)(id,SEL))objc_msgSend)(wobj, @selector(weightSymbol)) UTF8String]); + printf(" weightURL: %s\n", [[((id(*)(id,SEL))objc_msgSend)(wobj, @selector(weightURL)) description] UTF8String]); + } + + // Try to pass as weightsBuffer in request + printf("\n Trying weightsBuffer in request...\n"); + id wI2 = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), k->ioIn); + id wO2 = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), k->ioOut); + + // Try passing weight array as weightsBuffer + if (wobj) { + CFRelease(k->request); + k->request = (void*)CFBridgingRetain(((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR, + @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), + @[wI2], @[@0], @[wO2], @[@0], @[wobj], nil, @0)); + printf(" Request with weightsBuffer created\n"); + @try { + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf(" Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float sr3 = 0; int cnt3 = 0; + for (int i2 = 0; i2 < OC*SP; i2++) + if (fabsf(inp[i2]) > 0.01f) { sr3 += out[i2]/inp[i2]; cnt3++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf(" Ratio: %.3f (5.0=weightsBuffer works!)\n", cnt3>0?sr3/cnt3:0); + } @catch(NSException *ex) { + printf(" Eval exception: %s\n", [[ex description] 
UTF8String]); + } + } + + // Also try IOSurface as weightsBuffer + printf("\n Trying IOSurface as weightsBuffer...\n"); + IOSurfaceRef wSurf = make_surface(OC*IC*2); // fp16 weights + IOSurfaceLock(wSurf, 0, NULL); + _Float16 *wfp16 = (_Float16*)IOSurfaceGetBaseAddress(wSurf); + for (int r = 0; r < OC; r++) + for (int c2 = 0; c2 < IC; c2++) + wfp16[r*IC+c2] = (r==c2) ? (_Float16)7.0f : (_Float16)0.0f; // 7× identity + IOSurfaceUnlock(wSurf, 0, NULL); + id wSurfObj = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), wSurf); + CFRelease(k->request); + k->request = (void*)CFBridgingRetain(((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR, + @selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:), + @[wI2], @[@0], @[wO2], @[@0], wSurfObj, nil, @0)); + @try { + ane_eval(k); + IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf(" Out: [%.3f, %.3f, %.3f, %.3f]\n", out[0], out[1], out[2], out[3]); + float sr4 = 0; int cnt4 = 0; + for (int i3 = 0; i3 < OC*SP; i3++) + if (fabsf(inp[i3]) > 0.01f) { sr4 += out[i3]/inp[i3]; cnt4++; } + IOSurfaceUnlock(k->ioOut, kIOSurfaceLockReadOnly, NULL); + printf(" Ratio: %.3f (7.0=IOSurface weights work!)\n", cnt4>0?sr4/cnt4:0); + } @catch(NSException *ex) { + printf(" Eval exception: %s\n", [[ex description] UTF8String]); + } + CFRelease(wSurf); + } + + // === Approach 5: Weights packed into input IOSurface (fp16 with cast) === + printf("\n=== Approach 5: Dynamic weights via input IOSurface ===\n"); + // Element-wise mul: x * w where both come from input + // Input [1, IC*2, 1, SP] fp32 → cast fp16 → slice → mul → cast fp32 + { + int C5 = IC; + NSMutableString *m5 = [NSMutableString string]; + [m5 appendString:@"program(1.3)\n" + "[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, " + "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, " + "{\"coremltools-version\", \"9.0\"}})]\n{\n"]; + [m5 
appendFormat:@" func main(tensor x) {\n", C5*2, SP]; + [m5 appendString:@" string to16 = const()[name = string(\"to16\"), val = string(\"fp16\")];\n"]; + [m5 appendFormat:@" tensor xh = cast(dtype = to16, x = x)[name = string(\"cin\")];\n", C5*2, SP]; + [m5 appendFormat:@" tensor b0 = const()[name = string(\"b0\"), val = tensor([0,0,0,0])];\n"]; + [m5 appendFormat:@" tensor s0 = const()[name = string(\"s0\"), val = tensor([1,%d,1,%d])];\n", C5, SP]; + [m5 appendFormat:@" tensor data = slice_by_size(x=xh,begin=b0,size=s0)[name=string(\"data\")];\n", C5, SP]; + [m5 appendFormat:@" tensor b1 = const()[name = string(\"b1\"), val = tensor([0,%d,0,0])];\n", C5]; + [m5 appendFormat:@" tensor wt = slice_by_size(x=xh,begin=b1,size=s0)[name=string(\"wt\")];\n", C5, SP]; + [m5 appendFormat:@" tensor yh = mul(x=data,y=wt)[name=string(\"mul\")];\n", C5, SP]; + [m5 appendString:@" string to32 = const()[name = string(\"to32\"), val = string(\"fp32\")];\n"]; + [m5 appendFormat:@" tensor y = cast(dtype = to32, x = yh)[name = string(\"cout\")];\n", C5, SP]; + [m5 appendString:@" } -> (y);\n}\n"]; + + int io5_in = C5*2*SP*4; + int io5_out = C5*SP*4; + Kern *k5 = compile_kern_mil_w(m5, @{}, io5_in, io5_out); + if (k5) { + printf("Compile OK!\n"); + IOSurfaceLock(k5->ioIn, 0, NULL); + float *in5 = (float*)IOSurfaceGetBaseAddress(k5->ioIn); + for (int i = 0; i < C5*SP; i++) in5[i] = (i%100)*0.01f; + for (int i = 0; i < C5*SP; i++) in5[C5*SP+i] = 2.0f; + IOSurfaceUnlock(k5->ioIn, 0, NULL); + ane_eval(k5); + IOSurfaceLock(k5->ioOut, kIOSurfaceLockReadOnly, NULL); + float *out5 = (float*)IOSurfaceGetBaseAddress(k5->ioOut); + printf("data=[%.3f,%.3f,%.3f], w=2.0 → out=[%.3f,%.3f,%.3f]\n", + in5[0],in5[1],in5[2], out5[0],out5[1],out5[2]); + IOSurfaceUnlock(k5->ioOut, kIOSurfaceLockReadOnly, NULL); + + // Change weight dynamically — NO recompile! 
+ IOSurfaceLock(k5->ioIn, 0, NULL); + for (int i = 0; i < C5*SP; i++) in5[C5*SP+i] = 5.0f; + IOSurfaceUnlock(k5->ioIn, 0, NULL); + ane_eval(k5); + IOSurfaceLock(k5->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("w=5.0 → out=[%.3f,%.3f,%.3f] (expect 5×)\n", out5[0],out5[1],out5[2]); + IOSurfaceUnlock(k5->ioOut, kIOSurfaceLockReadOnly, NULL); + free_kern(k5); + } else printf("Compile FAILED\n"); + } + + // === Approach 6: matmul with dynamic weights from input === + printf("\n=== Approach 6: matmul with dynamic W from input ===\n"); + // Pack x[1,D,S,1] and W[1,D,1,D] into input, then reshape+matmul + // Input shape: [1, D+D*D, 1, S] — first D channels=activations, rest=weight matrix flattened + // Actually, matmul needs [1,H,S,D] shapes. Let's try: + // Input: [1, D*(S+D), 1, 1] reshaped as needed + // Simpler: just test matmul with two sliced inputs + { + int D6 = 64, S6 = 64; // small for test + // Input: [1, D6+D6, S6, D6] — but that's 4D... + // Actually ANE matmul works on [1,H,M,K] @ [1,H,K,N] → [1,H,M,N] + // Let's pack x[1,1,S6,D6] and W[1,1,D6,D6] into [1,2,S6,D6] + // Then slice → matmul + NSMutableString *m6 = [NSMutableString string]; + [m6 appendString:@"program(1.3)\n" + "[buildInfo = dict({{\"coremlc-component-MIL\", \"3510.2.1\"}, " + "{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, " + "{\"coremltools-version\", \"9.0\"}})]\n{\n"]; + // Input: [1, D6+D6, 1, S6*D6] — flatten everything, then reshape + // Actually simplest: two separate regions in channel dim + // x_data: [1, D6, 1, S6] and W: [1, D6*D6, 1, 1] + // Total input channels: D6 + D6*D6 + int total_ch = D6 + D6*D6; + [m6 appendFormat:@" func main(tensor<fp32, [1, %d, 1, %d]> x) {\n", total_ch, S6]; + [m6 appendString:@" string to16 = const()[name = string(\"to16\"), val = string(\"fp16\")];\n"]; + [m6 appendFormat:@" tensor<fp16, [1, %d, 1, %d]> xh = cast(dtype = to16, x = x)[name = string(\"cin\")];\n", total_ch, S6]; + // Slice activations: [1, D6, 1, S6] + [m6 appendFormat:@" tensor<int32, [4]> b0 = 
const()[name = string(\"b0\"), val = tensor([0,0,0,0])];\n"]; + [m6 appendFormat:@" tensor sa = const()[name = string(\"sa\"), val = tensor([1,%d,1,%d])];\n", D6, S6]; + [m6 appendFormat:@" tensor act = slice_by_size(x=xh,begin=b0,size=sa)[name=string(\"act\")];\n", D6, S6]; + // Slice weight: [1, D6*D6, 1, S6] but we only need [D6, D6] → reshape + [m6 appendFormat:@" tensor bw = const()[name = string(\"bw\"), val = tensor([0,%d,0,0])];\n", D6]; + [m6 appendFormat:@" tensor sw = const()[name = string(\"sw\"), val = tensor([1,%d,1,%d])];\n", D6*D6, S6]; + [m6 appendFormat:@" tensor wf = slice_by_size(x=xh,begin=bw,size=sw)[name=string(\"wf\")];\n", D6*D6, S6]; + // Reshape weight to [1, D6, D6, S6] for matmul-like operation + // Actually for conv: weight needs to be [OC, IC, 1, 1] const. Can't use dynamic weight with conv. + // For matmul: need [1, 1, D6, D6] or similar + // Let's try: reshape wf to [1, D6, D6, S6], take first slice [:,:,:,0] → no, that's hard + // Simpler: reshape to [D6, D6] and use matmul + // But matmul expects specific ranks... 
let me try: + [m6 appendFormat:@" tensor ws = const()[name = string(\"ws\"), val = tensor([1, 1, %d, %d])];\n", D6, D6]; + // Only take first column of wf to get [1, D6*D6, 1, 1] + [m6 appendFormat:@" tensor sw1 = const()[name = string(\"sw1\"), val = tensor([1,%d,1,1])];\n", D6*D6]; + [m6 appendFormat:@" tensor wf1 = slice_by_size(x=wf,begin=b0,size=sw1)[name=string(\"wf1\")];\n", D6*D6]; + [m6 appendFormat:@" tensor W = reshape(shape=ws,x=wf1)[name=string(\"W\")];\n", D6, D6]; + // Reshape act to [1, 1, S6, D6] for matmul + [m6 appendFormat:@" tensor as2 = const()[name = string(\"as2\"), val = tensor([1, 1, %d, %d])];\n", D6, S6]; + [m6 appendFormat:@" tensor pm = const()[name = string(\"pm\"), val = tensor([0, 1, 3, 2])];\n"]; + [m6 appendFormat:@" tensor a2 = reshape(shape=as2,x=act)[name=string(\"a2\")];\n", D6, S6]; + [m6 appendFormat:@" tensor a3 = transpose(perm=pm,x=a2)[name=string(\"a3\")];\n", S6, D6]; + // matmul: [1,1,S6,D6] @ [1,1,D6,D6] → [1,1,S6,D6] + [m6 appendString:@" bool bF = const()[name = string(\"bF\"), val = bool(false)];\n"]; + [m6 appendFormat:@" tensor yh = matmul(transpose_x = bF, transpose_y = bF, x = a3, y = W)[name = string(\"mm\")];\n", S6, D6]; + // Reshape back to [1, D6, 1, S6] + [m6 appendFormat:@" tensor yt = transpose(perm=pm,x=yh)[name=string(\"yt\")];\n", D6, S6]; + [m6 appendFormat:@" tensor os = const()[name = string(\"os\"), val = tensor([1,%d,1,%d])];\n", D6, S6]; + [m6 appendFormat:@" tensor yr = reshape(shape=os,x=yt)[name=string(\"yr\")];\n", D6, S6]; + [m6 appendString:@" string to32 = const()[name = string(\"to32\"), val = string(\"fp32\")];\n"]; + [m6 appendFormat:@" tensor y = cast(dtype = to32, x = yr)[name = string(\"cout\")];\n", D6, S6]; + [m6 appendString:@" } -> (y);\n}\n"]; + + int io6_in = total_ch * S6 * 4; + int io6_out = D6 * S6 * 4; + Kern *k6 = compile_kern_mil_w(m6, @{}, io6_in, io6_out); + if (k6) { + printf("Dynamic matmul compile OK!\n"); + // Set up: identity W, ramp input + 
IOSurfaceLock(k6->ioIn, 0, NULL); + float *in6 = (float*)IOSurfaceGetBaseAddress(k6->ioIn); + memset(in6, 0, io6_in); + // Activations: [D6, S6] in channel-first layout + for (int d = 0; d < D6; d++) + for (int s = 0; s < S6; s++) + in6[d*S6+s] = (d*S6+s) * 0.001f; + // Weight: identity matrix [D6, D6] packed in channels D6..D6+D6*D6, only col 0 + float *wbase = in6 + D6*S6; + for (int r = 0; r < D6; r++) + for (int c = 0; c < D6; c++) + wbase[(r*D6+c)*S6] = (r==c) ? 1.0f : 0.0f; // only sp=0 matters + IOSurfaceUnlock(k6->ioIn, 0, NULL); + + ane_eval(k6); + IOSurfaceLock(k6->ioOut, kIOSurfaceLockReadOnly, NULL); + float *out6 = (float*)IOSurfaceGetBaseAddress(k6->ioOut); + printf("Identity W: in=[%.4f,%.4f,%.4f] out=[%.4f,%.4f,%.4f]\n", + in6[0],in6[1],in6[2], out6[0],out6[1],out6[2]); + + // Check + float me6 = 0; + for (int i = 0; i < D6*S6; i++) { + float e6 = fabsf(out6[i] - in6[i]); + if (e6 > me6) me6 = e6; + } + IOSurfaceUnlock(k6->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("max_err=%.6f %s\n", me6, me6 < 0.1 ? "PASS" : "FAIL"); + + // Now: 2× identity — just change the IOSurface weight, no recompile! + IOSurfaceLock(k6->ioIn, 0, NULL); + for (int r = 0; r < D6; r++) + for (int c = 0; c < D6; c++) + wbase[(r*D6+c)*S6] = (r==c) ? 
2.0f : 0.0f; + IOSurfaceUnlock(k6->ioIn, 0, NULL); + ane_eval(k6); + IOSurfaceLock(k6->ioOut, kIOSurfaceLockReadOnly, NULL); + printf("2× W: in=[%.4f,%.4f] out=[%.4f,%.4f] (expect 2×)\n", + in6[0],in6[1], out6[0],out6[1]); + IOSurfaceUnlock(k6->ioOut, kIOSurfaceLockReadOnly, NULL); + free_kern(k6); + } else printf("Dynamic matmul compile FAILED\n"); + } + + free_kern(k); free(W_id); free(W_2x); + printf("\nDone.\n"); + } + return 0; +} diff --git a/training/tiny_train.m b/training/tiny_train.m index e1e9d7d..0449dba 100644 --- a/training/tiny_train.m +++ b/training/tiny_train.m @@ -139,7 +139,7 @@ static void free_kern(Kern *k) { free(k); } -static void ane_eval_k(Kern *k, const float *in, float *out, int in_ch, int out_ch, int sp) { +static bool ane_eval_k(Kern *k, const float *in, float *out, int in_ch, int out_ch, int sp) { float *tmp = (float*)malloc(in_ch * sp * sizeof(float)); for (int t = 0; t < sp; t++) for (int c = 0; c < in_ch; c++) @@ -151,8 +151,13 @@ static void ane_eval_k(Kern *k, const float *in, float *out, int in_ch, int out_ NSError *e = nil; id mdl = (__bridge id)k->model; id req = (__bridge id)k->request; - ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( + BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)( mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e); + if (!ok) { + fprintf(stderr, "ANE eval failed: %s\n", + e ? 
[[e description] UTF8String] : "unknown error"); + return false; + } float *tmp2 = (float*)malloc(out_ch * sp * sizeof(float)); IOSurfaceLock(k->ioOut, kIOSurfaceLockReadOnly, NULL); memcpy(tmp2, IOSurfaceGetBaseAddress(k->ioOut), out_ch * sp * sizeof(float)); @@ -161,6 +166,7 @@ static void ane_eval_k(Kern *k, const float *in, float *out, int in_ch, int out_ for (int c = 0; c < out_ch; c++) out[t*out_ch + c] = tmp2[c*sp + t]; free(tmp2); + return true; } // === Checkpoint: save/restore training state for exec() restart === @@ -179,21 +185,25 @@ static void save_checkpoint(const char *path, int step, float loss, int D, int H, int S, int total_steps, float lr, const float *W1, const float *W2, double cc, double ct, double cw, int cs, int cb) { - FILE *f = fopen(path, "wb"); + char tmp_path[512]; + snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", path); + FILE *f = fopen(tmp_path, "wb"); + if (!f) { fprintf(stderr, "Failed to open %s for checkpoint\n", tmp_path); return; } CkptHeader hdr = {step, loss, D, H, S, total_steps, lr, cc, ct, cw, cs, cb}; fwrite(&hdr, sizeof(hdr), 1, f); fwrite(W1, sizeof(float), H * D, f); fwrite(W2, sizeof(float), D * H, f); fclose(f); + rename(tmp_path, path); // atomic on POSIX } static bool load_checkpoint(const char *path, CkptHeader *hdr, float *W1, float *W2, int H, int D) { FILE *f = fopen(path, "rb"); if (!f) return false; - fread(hdr, sizeof(CkptHeader), 1, f); - fread(W1, sizeof(float), H * D, f); - fread(W2, sizeof(float), D * H, f); + if (fread(hdr, sizeof(CkptHeader), 1, f) != 1) { fclose(f); return false; } + if (fread(W1, sizeof(float), H * D, f) != (size_t)(H * D)) { fclose(f); return false; } + if (fread(W2, sizeof(float), D * H, f) != (size_t)(D * H)) { fclose(f); return false; } fclose(f); return true; } diff --git a/training/train_large.m b/training/train_large.m index e58ce08..96f8f7a 100644 --- a/training/train_large.m +++ b/training/train_large.m @@ -5,9 +5,9 @@ #include "stories_mil.h" #include "stories_cpu_ops.h" 
-#define CKPT_PATH "ane_stories110M_ckpt.bin"
-#define MODEL_PATH "../../assets/models/stories110M.bin"
-#define DATA_PATH "tinystories_data00.bin"
+#define CKPT_PATH_DEFAULT "ane_stories110M_ckpt.bin"
+#define MODEL_PATH_DEFAULT "stories110M.bin"
+#define DATA_PATH_DEFAULT "tinystories_data00.bin"
 
 // ===== Weight loading from llama2.c format =====
 static bool load_pretrained(LayerWeights *lw, float *rms_final, float *embed,
                             const char *path) {
@@ -192,12 +192,26 @@ int main(int argc, char *argv[]) {
     float adam_b1=0.9f, adam_b2=0.999f, adam_eps=1e-8f;
     int adam_t = 0, start_step = 0;
 
-    // Parse args
-    bool do_resume = false;
-    for (int i=1; i