Runtime training on ANE — no baked weights, no recompile #47

@imperatormk

Description

Hey, wanted to share some findings that might be useful here. I've been digging into this and found a pattern that avoids the baked-weight limitation.

The key insight: ANE accepts runtime IOSurface inputs for both operands of matmul. So instead of baking weights as BLOBFILE constants, you pass them as a second input tensor — compile once at startup, update weights in-place via IOSurface writes, and never recompile.

This enables a full Adam optimizer loop:

  • Forward: W @ x with W and x both runtime IOSurfaces
  • Backward: W^T @ dy and dy @ x^T as separate kernels
  • Adam m/v/w update as 3 more kernels
  • All compiled once, weights updated each step via memcpy to IOSurface
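For reference, the kernel decomposition above is just standard linear-layer backprop plus the textbook Adam update. Here's a minimal Python sketch of the math (my own illustration, not code from the repo; on device each matmul/elementwise line would be one precompiled ANE kernel fed from IOSurface-backed buffers):

```python
# Illustrative sketch of the kernel decomposition: naive Python matmuls
# stand in for ANE matmul kernels, plain lists stand in for
# IOSurface-backed buffers. Not the repo's code.

def matmul(A, B):
    # one "matmul kernel": A (m x k) @ B (k x n)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def ew(f, *Ms):
    # elementwise map over same-shape matrices (the Adam "kernels")
    return [[f(*vs) for vs in zip(*rows)] for rows in zip(*Ms)]

lr, b1, b2, eps, t = 1e-3, 0.9, 0.999, 1e-8, 1

W = [[0.5, -0.2], [0.1, 0.3]]     # weights (IOSurface-backed on device)
m = [[0.0, 0.0], [0.0, 0.0]]      # Adam first moment
v = [[0.0, 0.0], [0.0, 0.0]]      # Adam second moment
x = [[1.0], [2.0]]                # input column vector

y = matmul(W, x)                  # forward: W @ x
dy = y                            # grad of toy loss 0.5 * ||y||^2
dx = matmul(transpose(W), dy)     # backward: W^T @ dy
dW = matmul(dy, transpose(x))     # backward: dy @ x^T

# Adam m/v/w update as three elementwise kernels
m = ew(lambda mi, gi: b1 * mi + (1 - b1) * gi, m, dW)
v = ew(lambda vi, gi: b2 * vi + (1 - b2) * gi * gi, v, dW)
W = ew(lambda wi, mi, vi:
       wi - lr * (mi / (1 - b1 ** t)) / ((vi / (1 - b2 ** t)) ** 0.5 + eps),
       W, m, v)
```

In the real setup only the buffer contents change between steps (the memcpy into the IOSurface); the kernel graph itself is compiled once.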

Running a 28-block ConvNeXt UNet (96→384ch, 256×256) with full forward+backward+Adam at ~3 it/s on M1 this way.

Some gotchas discovered along the way (all from direct probing, no docs):

  • IOSurface slot sizes must be strictly ascending for inputs / descending for outputs — violations produce silent zeros with no error.
  • The matmul inner dim (Ci) must be a multiple of 32 — non-multiples also silently produce zeros.
  • conv with runtime weights fails for grouped/depthwise (InvalidMILProgram), so depthwise stays on CPU NEON.
  • reshape, transpose, concat, pad, and reduce_* all fail at runtime — ANE is purely feed-forward matmul/elementwise.
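If it helps anyone hitting the Ci-multiple-of-32 silent-zero case: appending zero columns to W and zero rows to x doesn't change W @ x, so you can pad the inner dimension up front at buffer-allocation time. A small sketch (helper name is mine, not from the repo):

```python
def pad_inner(W, x, multiple=32):
    # Zero-pad the matmul inner dimension (columns of W / rows of x)
    # up to the next multiple of 32. Mathematically a no-op for W @ x,
    # but it sidesteps the reported silent-zero failure on ANE.
    ci = len(x)                # inner dim: cols of W == rows of x
    pad = (-ci) % multiple     # zeros needed to reach the next multiple
    Wp = [row + [0.0] * pad for row in W]
    xp = x + [[0.0] * len(x[0]) for _ in range(pad)]
    return Wp, xp
```

The same "validate up front" idea applies to the slot-size ordering rule: cheaper to assert on buffer sizes at allocation than to debug silent zeros later.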

Full cheatsheet and working implementation (LN, GELU, Adam, attention, ConvNeXt blocks) here if useful: https://github.com/imperatormk/ane-train
