26 changes: 26 additions & 0 deletions .github/ISSUE_TEMPLATE/benchmark_submission.md
@@ -0,0 +1,26 @@
---
name: Benchmark Submission
about: Submit your ANE benchmark results
title: "[Benchmark] <Chip Model> results"
labels: benchmark
assignees: ''
---

## System Info

- **Chip**: (e.g., Apple M4 Max)
- **Machine**: (e.g., Mac16,5)
- **macOS Version**:
- **Memory**: (e.g., 128 GB)

## Benchmark Results

Paste the contents of your JSON results file below:

```json

```

## Notes

Any observations, issues encountered, or interesting findings.
33 changes: 33 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,33 @@
---
name: Bug Report
about: Report a build failure, crash, or unexpected behavior
title: "[Bug] "
labels: bug
assignees: ''
---

## Environment

- **Chip**:
- **macOS Version**:
- **Xcode Version**: (run `xcodebuild -version`)

## Description

What happened?

## Steps to Reproduce

1.
2.
3.

## Expected Behavior

What did you expect to happen?

## Logs / Output

```
Paste relevant output here
```
19 changes: 19 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,19 @@
---
name: Feature Request
about: Suggest a new feature or research direction
title: "[Feature] "
labels: enhancement
assignees: ''
---

## Description

What would you like to see added?

## Motivation

Why would this be useful?

## Possible Approach

If you have ideas on how to implement this, share them here.
83 changes: 83 additions & 0 deletions .gitignore
@@ -0,0 +1,83 @@
# Build artifacts
*.o
*.dSYM/

# Root-level compiled binaries
ane_probe
api_explore
inmem_basic
inmem_bench
inmem_peak
sram_bench
sram_probe

# Training binaries
tiny_train
tiny_train_m1
train_large
training/train_large
training/train_large_ane
training/train_opt
training/train_double_buffer
training/test_*
!training/test_*.m

# Inference binaries and runtime data
inference/qwen_ane
inference/qwen05b.bin
inference/qwen05b_f32.bin
inference/qwen05b_f16.bin
inference/qwen05b_q8.bin
inference/.venv/
inference/benchmark_results.json

# Dynamic training binaries
training/training_dynamic/train

# Test/research binaries
test_chaining

# Generated mlpackage files
/tmp/ane_*.mlpackage

# Benchmark results (keep community_benchmarks/ submissions)
benchmark_results_*.txt
community_benchmarks/SUMMARY.json
community_benchmarks/SUMMARY.md
community_benchmarks/apple_m4_max_20260303_*.json

# Python
__pycache__/
*.pyc
*.egg-info/
/tmp/ane_venv/

# Training data (downloaded separately)
assets/

# Web dashboard (lives in separate private repo)
web/

# Training data binaries (downloaded via make setup)
training/tinystories_data00.bin
training/ane_stories110M_ckpt.bin
*.bin
*.metallib
!training/download_data.sh

# Secrets / env
.env
inference/.env

# Internal / private
.cursor/
docs/launch/
comm

# macOS
.DS_Store

# Editor
*.swp
*.swo
*~
60 changes: 60 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,60 @@
# Contributing to ANE Training

Thanks for your interest in contributing! This community fork welcomes benchmark submissions, bug fixes, and research contributions.

## Benchmark Submissions (Easiest Way to Contribute)

The single most valuable thing you can do is run the benchmark on your hardware and submit results.

### Quick Version

```bash
bash scripts/run_community_benchmark.sh
```

The script will guide you through everything, including optional auto-submission to the dashboard.

### What Gets Collected

- Your chip model (e.g., Apple M4 Max)
- macOS version, memory, core counts
- SRAM probe results (TFLOPS vs weight size)
- In-memory peak TFLOPS
- Training performance (optional, requires training data)
- Your GitHub username (optional)

No personal data is collected, and IP addresses are never stored (they are only hashed for rate limiting).

## Bug Reports

Open an issue with:
- Your hardware (chip, macOS version, memory)
- Steps to reproduce
- Expected vs actual behavior
- Relevant log output

## Code Contributions

1. Fork the repository
2. Create a feature branch (`git checkout -b my-feature`)
3. Make your changes
4. Test on your hardware
5. Submit a Pull Request

### Code Style

- Objective-C: follow the existing style in `training/` (no ARC annotations in headers, `_Float16` for fp16)
- Shell scripts: use `set -euo pipefail`, quote variables
- Python: minimal dependencies, Python 3.11+ compatible

### Areas Where Help is Needed

- **Benchmarks on hardware we don't have**: M1, M2, M3, M3 Pro/Max/Ultra, M4 Pro, M5
- **Reducing compilation overhead**: currently 80-85% of wall time
- **`_ANEChainingRequest` research**: pipelining multiple ANE operations without recompile
- **`_ANEPerformanceStats` investigation**: getting real hardware timing data
- **Larger model support**: scaling beyond Stories110M

## Questions?

Open a GitHub issue or discussion. We're happy to help.
88 changes: 88 additions & 0 deletions PROBE_RESULTS.md
@@ -0,0 +1,88 @@
# ANE Probe Results: M4 (macOS 26.3)

**Machine:** Apple M4 (10 cores), 32GB RAM, macOS 26.3
**Date:** 2026-03-03
**ANE Family:** H16 (same as M5 results in `training/m5result.md`)

## Key Discovery: Compile and Eval Run in Parallel

**This was not previously documented.** The M5 probes ran compile and eval sequentially.
Dispatching the compile onto a background queue with GCD `dispatch_async`, we found the two fully overlap.

### probe_v2.m Results

#### TEST 1: Pure Eval Throughput
```
Conv 128x128, spatial=64
1000 evals: 189.1ms total, 0.189ms/eval
11.09 GFLOPS sustained
```

#### TEST 2: Ping-pong (Two Pre-compiled Models)
```
500 ping-pong pairs: 207.4ms (0.415ms/pair, 0.207ms/eval)
```
Only ~10% per-eval overhead when alternating between two pre-loaded models (0.207 ms vs 0.189 ms single-model).

#### TEST 3: Sequential Compile (20 Models)
```
All 20 models compiled and verified ✓
Compile time: ~23-29ms each (consistent, no degradation)
All 20 models correct with different scale factors
```

#### TEST 4: Background Compile Overlap ⭐
```
Background compile: 26.8ms
Foreground evals during compile: 119 (26.8ms total)
Overlap: YES — compile and eval CAN run in parallel!
Background model verified correct ✓
```
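The pattern TEST 4 measures can be sketched outside Objective-C. Below is an illustrative Python stand-in — the `compile_fn`/`eval_fn` callables and their sleep timings are placeholders for the real ANE calls, not the probe's API:

```python
import threading
import time

def overlap_probe(compile_fn, eval_fn):
    """Count foreground evals completed while one background
    compile is in flight (the TEST 4 measurement pattern)."""
    done = threading.Event()

    def background():
        compile_fn()      # stand-in for the ~27 ms ANE compile
        done.set()

    worker = threading.Thread(target=background)
    start = time.perf_counter()
    worker.start()
    evals = 0
    while not done.is_set():
        eval_fn()         # stand-in for a ~0.19 ms ANE eval
        evals += 1
    worker.join()
    elapsed_ms = (time.perf_counter() - start) * 1e3
    return evals, elapsed_ms

# Placeholder workloads using the timings measured on this M4:
evals, elapsed_ms = overlap_probe(lambda: time.sleep(0.027),
                                  lambda: time.sleep(0.00019))
```

If the foreground loop completes many evals before the background task finishes, the two workloads overlap — the analogue of TEST 4's 119 evals inside one 26.8 ms compile.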

### Summary
| Metric | Value |
|--------|-------|
| Compile time | ~25ms per kernel set |
| Eval time | 0.189ms per eval |
| Compile:eval ratio | ~130:1 |
| Parallel compile+eval | **YES** |
| Max simultaneous models | 20+ |
| Ping-pong overhead | +10% vs single model |

## Peak ANE Throughput (inmem_peak)

```
Config W(MB) GFLOP ms/eval TFLOPS
96x conv 512ch sp64 48.0 3.22 0.429 ms 7.50
128x conv 512ch sp64 64.0 4.29 0.589 ms 7.30
256x conv 256ch sp64 32.0 2.15 0.380 ms 5.65
64x conv 512ch sp64 32.0 2.15 0.395 ms 5.43
```

Peak: **7.50 TFLOPS** (47% of 15.8 TFLOPS theoretical).
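The TFLOPS column is just GFLOP divided by ms/eval (1 GFLOP per 1 ms = 1 TFLOPS). Spot-checking the top row and the utilization figure (small differences come from rounding in the printed table):

```python
gflop, ms_per_eval = 3.22, 0.429   # "96x conv 512ch sp64" row
tflops = gflop / ms_per_eval       # GFLOP per ms == TFLOPS, ~7.5
utilization = 7.50 / 15.8          # measured peak vs theoretical, ~47%
```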

## Implications for Training

### Before (train_large.m)
- Synchronous compile: **88.6% of wall time is compilation**
- 55ms compile per batch, 0.54ms actual training
- Training throughput limited by compiler, not by ANE

### After (train_double_buffer.m)
- Async double-buffered compile: **0% compile stall**
- Background compile happens during forward/backward passes
- ~130 eval steps fit in one compile window
- Weight updates are "delayed" by one batch (standard technique in distributed training)
- Training throughput limited only by ANE eval speed
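The ~130-evals-per-compile-window figure follows directly from the measured times above:

```python
compile_ms = 25.0      # measured compile time per kernel set
eval_ms = 0.189        # measured time per eval
steps_per_window = compile_ms / eval_ms   # evals that fit in one compile window
```

Roughly 132 evals fit while the next kernel set compiles, matching the ~130:1 compile:eval ratio in the summary table.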

### Architecture
```
Time →
Active kernels: [=== eval batch N ===][== eval batch N+1 ==][== eval batch N+2 ==]
Background:     [compile N+1 weights ][compile N+2 weights ][compile N+3 weights ]
                                     ↑                     ↑                     ↑
                                swap ready            swap ready            swap ready
```

Two kernel sets (A and B) alternate between active evaluation and background compilation.
When the background compile finishes, pointers swap atomically at the batch boundary.
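A minimal Python sketch of that A/B schedule (the real implementation uses GCD in Objective-C; `compile_weights` and `run_batch` here are hypothetical stand-ins for the ANE compile and the ~130-step eval loop):

```python
import threading

def train_double_buffered(num_batches, compile_weights, run_batch):
    """Evaluate on one kernel set while the next one compiles in the
    background; swap pointers at the batch boundary."""
    active = compile_weights(0)              # kernel set A, compiled up front
    for n in range(num_batches):
        slot = {}
        bg = threading.Thread(
            target=lambda: slot.update(model=compile_weights(n + 1)))
        bg.start()                           # background: compile kernel set B
        run_batch(active, n)                 # foreground: eval batch N on set A
        bg.join()                            # batch boundary: compile finished
        active = slot["model"]               # swap — B becomes the active set
    return active
```

Because each batch runs on weights compiled one window earlier, updates lag by one batch — the delayed-update behavior noted above.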