Hey, wanted to share some findings that might be useful here. I've been digging into this and found a pattern that avoids the baked-weight limitation.
The key insight: ANE accepts runtime IOSurface inputs for both operands of matmul. So instead of baking weights as BLOBFILE constants, you pass them as a second input tensor — compile once at startup, update weights in-place via IOSurface writes, and never recompile.
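As a conceptual sketch of that pattern (numpy stands in for the IOSurface-backed buffers, and `Surface`/`matmul_kernel` are hypothetical stand-ins for the actual ANE plumbing, not real API):

```python
import numpy as np

# Hypothetical stand-in for an IOSurface-backed buffer: one fixed
# allocation whose contents are overwritten in place, never reallocated.
class Surface:
    def __init__(self, shape):
        self.buf = np.zeros(shape, dtype=np.float16)

    def write(self, arr):
        # the "memcpy to IOSurface": same backing memory every step
        np.copyto(self.buf, arr.astype(np.float16))

# Compile once at startup: the kernel is just W @ x with BOTH operands
# bound to runtime surfaces, so no weights are baked into the program.
W_surf = Surface((384, 96))
x_surf = Surface((96, 1))

def matmul_kernel(a: Surface, b: Surface) -> np.ndarray:
    return a.buf @ b.buf  # on device this would run on the ANE

# Training loop: update weights via in-place writes, never recompile.
for step in range(3):
    W_surf.write(np.random.randn(384, 96))
    x_surf.write(np.random.randn(96, 1))
    y = matmul_kernel(W_surf, x_surf)
```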
This enables a full Adam optimizer loop:
- Forward: W @ x with W and x both runtime IOSurfaces
- Backward: W^T @ dy and dy @ x^T as separate kernels
- Adam m/v/w update as 3 more kernels
- All compiled once, weights updated each step via memcpy to IOSurface
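The four steps above can be sketched in numpy as a reference for the math only; on device each commented line would be one of the separately compiled kernels, and the hyperparameters are the usual Adam defaults, not values from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)
Ci, Co, N = 96, 384, 64
W = rng.standard_normal((Co, Ci)) * 0.01
m, v = np.zeros_like(W), np.zeros_like(W)
lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8

x = rng.standard_normal((Ci, N))
target = rng.standard_normal((Co, N))

losses = []
for t in range(1, 101):
    y = W @ x                            # forward kernel: W @ x
    losses.append(float(np.mean((y - target) ** 2)))
    dy = 2.0 * (y - target) / N          # MSE loss gradient
    dx = W.T @ dy                        # backward kernel 1: W^T @ dy
    dW = dy @ x.T                        # backward kernel 2: dy @ x^T
    # Adam m/v/w update (three elementwise kernels on device)
    m = b1 * m + (1 - b1) * dW
    v = b2 * v + (1 - b2) * dW * dW
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    W -= lr * m_hat / (np.sqrt(v_hat) + eps)
```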
Running a 28-block ConvNeXt UNet (96→384ch, 256×256) with full forward+backward+Adam at ~3 it/s on M1 this way.
Some gotchas discovered along the way (all from direct probing, no docs):
- IOSurface slot sizes must be strictly ascending for inputs / descending for outputs; violations produce silent zeros with no error.
- The matmul inner dim (Ci) must be a multiple of 32; non-multiples also silently produce zeros.
- conv with runtime weights fails for grouped/depthwise (InvalidMILProgram), so depthwise stays on CPU NEON.
- reshape, transpose, concat, pad, and reduce_* all fail at runtime; the ANE path is purely feed-forward matmul/elementwise.
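Since the inner dim must be a multiple of 32, the practical workaround is zero-padding Ci up to the next multiple; zero rows/columns don't change the product. A numpy illustration (`pad_to_32` is my own helper name, not from the repo):

```python
import numpy as np

def pad_to_32(W, x):
    """Zero-pad the contraction dim of W (Co, Ci) and x (Ci, N) to a
    multiple of 32 so the matmul doesn't silently return zeros."""
    Ci = W.shape[1]
    Ci_pad = -(-Ci // 32) * 32           # round up to multiple of 32
    Wp = np.zeros((W.shape[0], Ci_pad), dtype=W.dtype)
    xp = np.zeros((Ci_pad, x.shape[1]), dtype=x.dtype)
    Wp[:, :Ci] = W
    xp[:Ci] = x
    return Wp, xp

W = np.random.randn(8, 40)   # Ci = 40, not a multiple of 32
x = np.random.randn(40, 5)
Wp, xp = pad_to_32(W, x)
```

The padded product `Wp @ xp` is bit-for-bit the same math as `W @ x`, just on ANE-legal shapes.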
A full cheatsheet and a working implementation (LN, GELU, Adam, attention, ConvNeXt blocks) are here if useful: https://github.com/imperatormk/ane-train