
fix: M1/M2 backward compatibility — MIL syntax + fp16 I/O fallback (rebased)#8

Draft
codegen-sh[bot] wants to merge 3 commits into main from codegen-bot/m2-compat-rebased-6d7a2a

Conversation


@codegen-sh codegen-sh bot commented Mar 5, 2026

Rebased version of PR #3, now current with all 27 upstream commits on main.

What this does

Makes the ANE training pipeline run on M1/M2 chips (which lack the cast MIL op on their ANE) by:

  1. MIL syntax fixes — Downgraded from program(1.3) / ios18 to program(1.0) / ios16; replaced bare string, bool, int32 scalars with tensor<T, []> rank-0 form; swapped string(...) for tensor<string, []>(...) everywhere

  2. Dual I/O paths — a g_fp16_io flag selects between:

    • fp16 I/O (M1/M2): inputs/outputs declared as fp16, no cast ops → ANE accepts the graph
    • fp32 I/O (M4+): inputs/outputs as fp32 with cast to/from fp16 internally
  3. Auto-retry — model_compile_kernels / bench() tries fp32 first; on compile failure it flips g_fp16_io=1 and retries with fp16, falling back to CPU if that also fails

  4. IOSurface byte sizing — g_fp16_io ? 2 : 4 bytes per element for buffer allocation

  5. Checkpoint persistence — CkptHeader saves/restores g_fp16_io so training resumes with the correct I/O mode

Rebase conflict resolution

Integrated upstream's bool return type, error handling, and atomic checkpoint writes with our M2 compatibility patches. Four conflict regions in tiny_train.m were resolved manually, keeping the best of both.

Supersedes #3.



codegen-sh bot added 2 commits March 5, 2026 15:23
Port upstream PR #6 (imperatormk) - fixes MIL scalar type syntax
from M4-only shorthand to canonical verbose format that compiles
on all Apple Silicon (M1/M2/M3/M4).

Changes:
- program(1.3) to program(1.0), ios18 to ios16 target
- Scalar type shorthand to canonical verbose format
- Simplified buildInfo dict (no M4-specific version strings)
- fp16 I/O fallback: g_fp16_io flag with auto-retry on compile
  failure for M1/M2 where cast op is unsupported
- Dynamic IOSurface byte calculation (bpe: 2 for fp16, 4 for fp32)

Tested on M1 Pro, macOS 26.3 (per upstream PR author).
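
To make the "shorthand to canonical verbose format" change concrete, here is a hedged sketch of what the MIL text rewrite looks like. The operand names (mode, flag) and values are illustrative, not copied from the diff; only the program version, the target, and the tensor<T, []> rank-0 forms come from the commit.

```
// Before -- M4-only shorthand (illustrative):
program(1.3)
{
    func main<ios18>(...)
    {
        mode = const(val = string("EXACT"))
        flag = const(val = bool(true))
    }
}

// After -- canonical rank-0 tensor form, compiles on M1/M2/M3/M4:
program(1.0)
{
    func main<ios16>(...)
    {
        mode = const(val = tensor<string, []>("EXACT"))
        flag = const(val = tensor<bool, []>(true))
    }
}
```
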
train.m includes ane_mil_gen.h (via backward.h -> model.h), which
declares extern int g_fp16_io, but train.m never defined the symbol,
producing an undefined-symbol linker error.

Changes:
- train.m: add g_fp16_io = 0 at file scope, wrap model_compile_kernels
  with auto-retry (try fp32, on fail set g_fp16_io=1, retry fp16)
- model.h: compile_conv_kernel IOSurface byte calculation now uses
  g_fp16_io ? 2 : 4 (was hardcoded to 4)
- .gitignore: add train binary + test/probe binaries
Backports from imperatormk/ane-train:

1. Disk compile cache (ane_set_cache_dir / ane_enable_cache)
   Persists compiled kernels to ~/.cache/ane_compile/ — saves
   100-500ms per kernel on subsequent runs.

2. ane_rewire() — zero-copy IOSurface pointer swap for kernel chaining.
   Enables activation chaining, gradient routing, weight ping-pong
   without CPU roundtrips.

3. Non-nil weights dict fix — passing nil wdict to modelWithMILText:
   silently returns nil. Now passes @{} for weight-free kernels.

4. ANE_TRAINING.md — comprehensive constraint cheatsheet covering
   tensor layout, IOSurface slot ordering, broadcast rules, the sqrt
   bug, variable naming pitfalls, and proven training patterns.
   All findings from direct M1/M1 Pro probing.
