Add M1/M2 backward-compatibility module for ANE training #6
Draft
codegen-sh[bot] wants to merge 1 commit into main from
Runtime detection (ane_hw_detect.h):
- Clean ANEVersionDetect() using sysctl hw.cpufamily/brand_string
- Chip profiles for M1/M2/M3/M4 with capability flags
- Thread-safe singleton detection, falls back to conservative M2 defaults

Conditional MIL paths (ane_compat.h):
- Conv-only MIL generators: conv, QKV, FFN, classifier, RMSNorm fwd/bwd
- program(1.0) ios16 target, verbose tensor<fp16, [1, C, 1, S]> syntax
- All matmul/SDPA replaced with 1×1 conv equivalents
- CPU fallback for attention core ops (Q*K^T, attn*V)
- Classifier backward uses transposed-weight conv
- 256-byte IOSurface alignment, max_compiles=80 (M2) / 60 (M1)

Memory planner (ane_mem_budget.h):
- M2MemoryBudget(24) caps batch=1, seq<=512, hidden<=4096
- Auto gradient checkpointing with interval selection

Test harness (test_m2_compatibility.m):
- Full Stories110M 12-layer training loop, 30-min stability target
- Reports ANE utilization, power draw, crash-free uptime
- NaN/Inf detection with auto-recovery

Build (Makefile):
- Universal -arch arm64 for M1 through M4
- New test_m2_compatibility target, make all / make compat-check

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
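A minimal sketch of the detect-once chip profile with a conservative M2 fallback described above. It assumes the generation can be read from the `machdep.cpu.brand_string` sysctl (e.g. "Apple M2 Pro"). `ChipProfile`, `profile_from_brand`, `chip_profile`, and `is_m1_or_m2` are hypothetical stand-ins, not the actual `ane_hw_detect.h` API. Only the `max_compiles` values (80/60), the 256-byte alignment, and the M2-default fallback come from this PR. The M3/M4 entries are placeholders.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#ifdef __APPLE__
#include <sys/sysctl.h>
#endif

/* Hypothetical profile; field names mirror the capability flags listed in this PR. */
typedef struct {
    int  generation;       /* 1..4 for M1..M4 */
    bool supports_matmul;  /* fused matmul usable in MIL */
    bool supports_sdpa;    /* SDPA op usable in MIL */
    int  max_compiles;     /* compile budget per process */
    int  surface_align;    /* IOSurface alignment in bytes */
} ChipProfile;

/* Conservative M2 defaults, also used when detection fails. */
static const ChipProfile kM2Defaults = {2, false, false, 80, 256};

/* Map a brand string such as "Apple M1 Pro" to a profile. */
static ChipProfile profile_from_brand(const char *brand) {
    ChipProfile p = kM2Defaults;
    const char *m = brand ? strstr(brand, "Apple M") : NULL;
    if (m == NULL) return p;              /* unknown chip: keep M2 defaults */
    switch (m[7]) {                       /* character right after "Apple M" */
    case '1': p.generation = 1; p.max_compiles = 60; break;
    case '2': break;                      /* already the M2 defaults */
    case '3':
    case '4':
        /* Placeholder assumption: M3/M4 take the matmul/SDPA path. */
        p.generation = m[7] - '0';
        p.supports_matmul = p.supports_sdpa = true;
        break;
    default: break;
    }
    return p;
}

/* Read the brand string on macOS; elsewhere it stays empty and M2 defaults apply. */
static ChipProfile detect_profile(void) {
    char brand[128] = {0};
#ifdef __APPLE__
    size_t len = sizeof(brand) - 1;       /* leave room for a terminating NUL */
    if (sysctlbyname("machdep.cpu.brand_string", brand, &len, NULL, 0) != 0)
        brand[0] = '\0';
#endif
    return profile_from_brand(brand);
}

static pthread_once_t g_profile_once = PTHREAD_ONCE_INIT;
static ChipProfile g_profile;

static void init_profile(void) { g_profile = detect_profile(); }

/* Thread-safe accessor: detection runs exactly once per process. */
static const ChipProfile *chip_profile(void) {
    pthread_once(&g_profile_once, init_profile);
    return &g_profile;
}

/* Dispatch predicate in the spirit of ane_is_m1_or_m2(). */
static bool is_m1_or_m2(const ChipProfile *p) { return p->generation <= 2; }
```

Calling code would branch on `is_m1_or_m2(chip_profile())` to pick the conv-only generators. `pthread_once` gives the thread-safe singleton without locking on every call.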
codegen-sh bot added a commit that referenced this pull request on Mar 5, 2026
Port upstream PR #6 (imperatormk): fixes MIL scalar type syntax from M4-only shorthand to the canonical verbose format that compiles on all Apple Silicon (M1/M2/M3/M4).

Changes:
- program(1.3) to program(1.0), ios18 to ios16 target
- Scalar type shorthand to canonical verbose format
- Simplified buildInfo dict (no M4-specific version strings)
- fp16 I/O fallback: g_fp16_io flag with auto-retry on compile failure for M1/M2, where the cast op is unsupported
- Dynamic IOSurface byte calculation (bpe: 2 for fp16, 4 for fp32)

Tested on M1 Pro, macOS 26.3 (per upstream PR author).
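The dynamic IOSurface sizing and the fp16 I/O retry can be sketched as follows. This assumes a `[1, C, 1, S]` tensor, `bpe` = 2 for fp16 and 4 for fp32 as listed above, and the 256-byte alignment used on M1/M2. `align_up`, `surface_bytes`, `compile_with_fallback`, and the stub compiler are hypothetical helpers. The real ANE compile call is not shown in this PR, so it is modeled as a callback.

```c
#include <stdbool.h>
#include <stddef.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

/* Bytes for a [1, C, 1, S] surface: bpe = 2 for fp16 I/O, 4 for fp32 I/O. */
static size_t surface_bytes(size_t channels, size_t seq, bool fp16_io, size_t align) {
    size_t bpe = fp16_io ? 2 : 4;
    return align_up(channels * seq * bpe, align);
}

/* Hypothetical compile hook: returns true if the MIL program compiled. */
typedef bool (*compile_fn)(bool fp16_io, void *ctx);

static bool g_fp16_io = false;

/* Try the current I/O mode first (fp32 I/O with in-graph casts by default).
   On failure, as on M1/M2 where the cast op is unsupported, switch to fp16 I/O once. */
static bool compile_with_fallback(compile_fn compile, void *ctx) {
    if (compile(g_fp16_io, ctx)) return true;
    if (g_fp16_io) return false;          /* already on the fallback path */
    g_fp16_io = true;
    return compile(true, ctx);
}

/* Illustrative stub mimicking an M1/M2 compiler that rejects the cast op. */
static bool stub_compile_no_cast(bool fp16_io, void *ctx) {
    int *attempts = ctx;
    ++*attempts;
    return fp16_io;                       /* only the no-cast (fp16 I/O) variant compiles */
}
```

Because `g_fp16_io` is sticky, later compiles skip the failing attempt. Surfaces must then be sized with `bpe` = 2 to match.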
Summary
Complete backward-compatibility module enabling ANE-accelerated training on M1/M2 hardware. Targets stability for 24/7 swarm use with zero regression on M4 paths.
Total: 1,180 lines of new code across 4 new files + Makefile update.
New Files
- `ane_hw_detect.h`: runtime chip detection (265 lines)
  - `ANEVersionDetect()` reads `sysctl` `hw.cpufamily`, `brand_string`, `hw.model` (no private APIs, won't crash)
  - `supports_matmul`, `supports_sdpa`, MIL version/target, `max_compiles`, IOSurface alignment, dimension limits
- `ane_compat.h`: M2-compatible MIL generators (407 lines)
  - `conv(weight=[out, in, 1, 1])` on `tensor<fp16, [1, C, 1, S]>` inputs
  - `program(1.0)` with `func main<ios16>` target (vs. M4's `program(1.3)`/`ios18`)
  - … (`mil_gen_conv_m2`)
  - `make_surface_m2()`: 256-byte aligned IOSurface allocation
  - `mil_gen_conv_compat()` etc. return the M2 variant or nil (caller then uses the M4 path)
- `ane_mem_budget.h`: conservative memory planner (211 lines)
  - `M2MemoryBudget(availableUnifiedGB=24)` entry point
  - `ANEAutoBudget()` auto-detects available memory via `hw.memsize`
- `test_m2_compatibility.m`: stability test harness (297 lines)

Modified Files
- `Makefile`
  - `UNIVERSAL_CFLAGS` with `-arch arm64` (universal M1–M4)
  - `HEADERS_COMPAT` dependency group
  - `test_m2_compatibility` target
  - `make all` and `make compat-check` convenience targets

Architecture
M4 code paths are completely untouched. Dispatch happens via `ane_is_m1_or_m2()` checks in calling code; the existing M4 generators remain the default.

Estimated Performance (24 GB M2 MacBook)
The 2.4× slowdown is expected: 1×1 conv on M2 ANE saturates fewer NE cores per op than M4's fused matmul, and attention core ops (Q·K^T, attn·V) fall back to CPU.
Risk Matrix
- `_ANEInMemoryModel` API change
- `program(1.0)` deprecated

Caveats
Recommended next step: run `make test_m2_compatibility && ./test_m2_compatibility model.bin --duration=30` on an M2 MacBook and share the verdict output.

Initiated by @dermitchell1993