Hey, wanted to share some findings that might be useful here. I've been digging into this and found a pattern that avoids the baked-weight limitation.
The key insight: ANE accepts runtime IOSurface inputs for both operands of matmul. So instead of baking weights as BLOBFILE constants, you pass them as a second input tensor — compile once at startup, update weights in-place via IOSurface writes, and never recompile.
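As a conceptual sketch of that pattern (numpy stands in for the IOSurface-backed buffers, and `Surface`/`matmul_kernel` are hypothetical stand-ins for the actual ANE plumbing, not real API):

```python
import numpy as np

# Hypothetical stand-in for an IOSurface-backed buffer: one fixed
# allocation whose contents are overwritten in place, never reallocated.
class Surface:
    def __init__(self, shape):
        self.buf = np.zeros(shape, dtype=np.float16)

    def write(self, arr):
        # the "memcpy to IOSurface": same backing memory every step
        np.copyto(self.buf, arr.astype(np.float16))

# Compile once at startup: the kernel is just W @ x with BOTH operands
# bound to runtime surfaces, so no weights are baked into the program.
W_surf = Surface((384, 96))
x_surf = Surface((96, 1))

def matmul_kernel(a: Surface, b: Surface) -> np.ndarray:
    return a.buf @ b.buf  # on device this would run on the ANE

# Training loop: update weights via in-place writes, never recompile.
for step in range(3):
    W_surf.write(np.random.randn(384, 96))
    x_surf.write(np.random.randn(96, 1))
    y = matmul_kernel(W_surf, x_surf)
```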
This enables a full Adam optimizer loop:
- Forward: W @ x with W and x both runtime IOSurfaces
- Backward: W^T @ dy and dy @ x^T as separate kernels
- Adam m/v/w update as 3 more kernels
- All compiled once, weights updated each step via memcpy to IOSurface
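The four steps above can be sketched in numpy as a reference for the math only; on device each commented line would be one of the separately compiled kernels, and the hyperparameters are the usual Adam defaults, not values from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)
Ci, Co, N = 96, 384, 64
W = rng.standard_normal((Co, Ci)) * 0.01
m, v = np.zeros_like(W), np.zeros_like(W)
lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8

x = rng.standard_normal((Ci, N))
target = rng.standard_normal((Co, N))

losses = []
for t in range(1, 101):
    y = W @ x                            # forward kernel: W @ x
    losses.append(float(np.mean((y - target) ** 2)))
    dy = 2.0 * (y - target) / N          # MSE loss gradient
    dx = W.T @ dy                        # backward kernel 1: W^T @ dy
    dW = dy @ x.T                        # backward kernel 2: dy @ x^T
    # Adam m/v/w update (three elementwise kernels on device)
    m = b1 * m + (1 - b1) * dW
    v = b2 * v + (1 - b2) * dW * dW
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    W -= lr * m_hat / (np.sqrt(v_hat) + eps)
```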
Running a 28-block ConvNeXt UNet (96→384ch, 256×256) with full forward+backward+Adam at ~3 it/s on M1 this way.
Some gotchas discovered along the way (all from direct probing, no docs):
- IOSurface slot sizes must be strictly ascending for inputs / descending for outputs; violations produce silent zeros with no error.
- The matmul inner dim (Ci) must be a multiple of 32; non-multiples also silently produce zeros.
- conv with runtime weights fails for grouped/depthwise (InvalidMILProgram), so depthwise stays on CPU NEON.
- reshape, transpose, concat, pad, and reduce_* all fail at runtime; the ANE path is purely feed-forward matmul/elementwise.
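Since the inner dim must be a multiple of 32, the practical workaround is zero-padding Ci up to the next multiple; zero rows/columns don't change the product. A numpy illustration (`pad_to_32` is my own helper name, not from the repo):

```python
import numpy as np

def pad_to_32(W, x):
    """Zero-pad the contraction dim of W (Co, Ci) and x (Ci, N) to a
    multiple of 32 so the matmul doesn't silently return zeros."""
    Ci = W.shape[1]
    Ci_pad = -(-Ci // 32) * 32           # round up to multiple of 32
    Wp = np.zeros((W.shape[0], Ci_pad), dtype=W.dtype)
    xp = np.zeros((Ci_pad, x.shape[1]), dtype=x.dtype)
    Wp[:, :Ci] = W
    xp[:Ci] = x
    return Wp, xp

W = np.random.randn(8, 40)   # Ci = 40, not a multiple of 32
x = np.random.randn(40, 5)
Wp, xp = pad_to_32(W, x)
```

The padded product `Wp @ xp` is bit-for-bit the same math as `W @ x`, just on ANE-legal shapes.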
A full cheatsheet and a working implementation (LN, GELU, Adam, attention, ConvNeXt blocks) are here if useful: https://github.com/imperatormk/ane-train