26 changes: 26 additions & 0 deletions .github/ISSUE_TEMPLATE/benchmark_submission.md
@@ -0,0 +1,26 @@
---
name: Benchmark Submission
about: Submit your ANE benchmark results
title: "[Benchmark] <Chip Model> results"
labels: benchmark
assignees: ''
---

## System Info

- **Chip**: (e.g., Apple M4 Max)
- **Machine**: (e.g., Mac16,5)
- **macOS Version**:
- **Memory**: (e.g., 128 GB)

## Benchmark Results

Paste the contents of your JSON results file below:

```json

```

## Notes

Any observations, issues encountered, or interesting findings.
33 changes: 33 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,33 @@
---
name: Bug Report
about: Report a build failure, crash, or unexpected behavior
title: "[Bug] "
labels: bug
assignees: ''
---

## Environment

- **Chip**:
- **macOS Version**:
- **Xcode Version**: (run `xcodebuild -version`)

## Description

What happened?

## Steps to Reproduce

1.
2.
3.

## Expected Behavior

What did you expect to happen?

## Logs / Output

```
Paste relevant output here
```
19 changes: 19 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,19 @@
---
name: Feature Request
about: Suggest a new feature or research direction
title: "[Feature] "
labels: enhancement
assignees: ''
---

## Description

What would you like to see added?

## Motivation

Why would this be useful?

## Possible Approach

If you have ideas on how to implement this, share them here.
83 changes: 83 additions & 0 deletions .gitignore
@@ -0,0 +1,83 @@
# Build artifacts
*.o
*.dSYM/

# Root-level compiled binaries
ane_probe
api_explore
inmem_basic
inmem_bench
inmem_peak
sram_bench
sram_probe

# Training binaries
tiny_train
tiny_train_m1
train_large
training/train_large
training/train_large_ane
training/train_opt
training/train_double_buffer
training/test_*
!training/test_*.m

# Inference binaries and runtime data
inference/qwen_ane
inference/qwen05b.bin
inference/qwen05b_f32.bin
inference/qwen05b_f16.bin
inference/qwen05b_q8.bin
inference/.venv/
inference/benchmark_results.json

# Dynamic training binaries
training/training_dynamic/train

# Test/research binaries
test_chaining

# Generated mlpackage files
/tmp/ane_*.mlpackage

# Benchmark results (keep community_benchmarks/ submissions)
benchmark_results_*.txt
community_benchmarks/SUMMARY.json
community_benchmarks/SUMMARY.md
community_benchmarks/apple_m4_max_20260303_*.json

# Python
__pycache__/
*.pyc
*.egg-info/
/tmp/ane_venv/

# Training data (downloaded separately)
assets/

# Web dashboard (lives in separate private repo)
web/

# Training data binaries (downloaded via make setup)
training/tinystories_data00.bin
training/ane_stories110M_ckpt.bin
*.bin
*.metallib
!training/download_data.sh

# Secrets / env
.env
inference/.env

# Internal / private
.cursor/
docs/launch/
comm

# macOS
.DS_Store

# Editor
*.swp
*.swo
*~
60 changes: 60 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,60 @@
# Contributing to ANE Training

Thanks for your interest in contributing! This community fork welcomes benchmark submissions, bug fixes, and research contributions.

## Benchmark Submissions (Easiest Way to Contribute)

The single most valuable thing you can do is run the benchmark on your hardware and submit results.

### Quick Version

```bash
bash scripts/run_community_benchmark.sh
```

The script will guide you through everything, including optional auto-submission to the dashboard.

### What Gets Collected

- Your chip model (e.g., Apple M4 Max)
- macOS version, memory, core counts
- SRAM probe results (TFLOPS vs weight size)
- In-memory peak TFLOPS
- Training performance (optional, requires training data)
- Your GitHub username (optional)

No personal data is collected, and IP addresses are never stored (they are only hashed for rate limiting).

## Bug Reports

Open an issue with:
- Your hardware (chip, macOS version, memory)
- Steps to reproduce
- Expected vs actual behavior
- Relevant log output

## Code Contributions

1. Fork the repository
2. Create a feature branch (`git checkout -b my-feature`)
3. Make your changes
4. Test on your hardware
5. Submit a Pull Request

### Code Style

- Objective-C: follow the existing style in `training/` (no ARC annotations in headers, `_Float16` for fp16)
- Shell scripts: use `set -euo pipefail`, quote variables
- Python: minimal dependencies, Python 3.11+ compatible

### Areas Where Help is Needed

- **Benchmarks on hardware we don't have**: M1, M2, M3, M3 Pro/Max/Ultra, M4 Pro, M5
- **Reducing compilation overhead**: currently 80-85% of wall time
- **`_ANEChainingRequest` research**: pipelining multiple ANE operations without recompile
- **`_ANEPerformanceStats` investigation**: getting real hardware timing data
- **Larger model support**: scaling beyond Stories110M

## Questions?

Open a GitHub issue or discussion. We're happy to help.
88 changes: 88 additions & 0 deletions PROBE_RESULTS.md
@@ -0,0 +1,88 @@
# ANE Probe Results: M4 (macOS 26.3)

**Machine:** Apple M4 (10 cores), 32GB RAM, macOS 26.3
**Date:** 2026-03-03
**ANE Family:** H16 (same as M5 results in `training/m5result.md`)

## Key Discovery: Compile and Eval Run in Parallel

**This was not previously documented.** The M5 probes ran compile and eval sequentially.
Dispatching the compile onto a background queue with GCD `dispatch_async`, we found the two fully overlap.

### probe_v2.m Results

#### TEST 1: Pure Eval Throughput
```
Conv 128x128, spatial=64
1000 evals: 189.1ms total, 0.189ms/eval
11.09 GFLOPS sustained
```

#### TEST 2: Ping-pong (Two Pre-compiled Models)
```
500 ping-pong pairs: 207.4ms (0.415ms/pair, 0.207ms/eval)
```
Only ~10% per-eval overhead when alternating between two pre-loaded models (0.207 ms vs 0.189 ms single-model).

#### TEST 3: Sequential Compile (20 Models)
```
All 20 models compiled and verified ✓
Compile time: ~23-29ms each (consistent, no degradation)
All 20 models correct with different scale factors
```

#### TEST 4: Background Compile Overlap ⭐
```
Background compile: 26.8ms
Foreground evals during compile: 119 (26.8ms total)
Overlap: YES — compile and eval CAN run in parallel!
Background model verified correct ✓
```
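The pattern TEST 4 measures can be sketched outside Objective-C. Below is an illustrative Python stand-in — the `compile_fn`/`eval_fn` callables and their sleep timings are placeholders for the real ANE calls, not the probe's API:

```python
import threading
import time

def overlap_probe(compile_fn, eval_fn):
    """Count foreground evals completed while one background
    compile is in flight (the TEST 4 measurement pattern)."""
    done = threading.Event()

    def background():
        compile_fn()      # stand-in for the ~27 ms ANE compile
        done.set()

    worker = threading.Thread(target=background)
    start = time.perf_counter()
    worker.start()
    evals = 0
    while not done.is_set():
        eval_fn()         # stand-in for a ~0.19 ms ANE eval
        evals += 1
    worker.join()
    elapsed_ms = (time.perf_counter() - start) * 1e3
    return evals, elapsed_ms

# Placeholder workloads using the timings measured on this M4:
evals, elapsed_ms = overlap_probe(lambda: time.sleep(0.027),
                                  lambda: time.sleep(0.00019))
```

If the foreground loop completes many evals before the background task finishes, the two workloads overlap — the analogue of TEST 4's 119 evals inside one 26.8 ms compile.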

### Summary
| Metric | Value |
|--------|-------|
| Compile time | ~25ms per kernel set |
| Eval time | 0.189ms per eval |
| Compile:eval ratio | ~130:1 |
| Parallel compile+eval | **YES** |
| Max simultaneous models | 20+ |
| Ping-pong overhead | +10% vs single model |

## Peak ANE Throughput (inmem_peak)

```
Config W(MB) GFLOP ms/eval TFLOPS
96x conv 512ch sp64 48.0 3.22 0.429 ms 7.50
128x conv 512ch sp64 64.0 4.29 0.589 ms 7.30
256x conv 256ch sp64 32.0 2.15 0.380 ms 5.65
64x conv 512ch sp64 32.0 2.15 0.395 ms 5.43
```

Peak: **7.50 TFLOPS** (47% of 15.8 TFLOPS theoretical).
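The TFLOPS column is just GFLOP divided by ms/eval (1 GFLOP per 1 ms = 1 TFLOPS). Spot-checking the top row and the utilization figure (small differences come from rounding in the printed table):

```python
gflop, ms_per_eval = 3.22, 0.429   # "96x conv 512ch sp64" row
tflops = gflop / ms_per_eval       # GFLOP per ms == TFLOPS, ~7.5
utilization = 7.50 / 15.8          # measured peak vs theoretical, ~47%
```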

## Implications for Training

### Before (train_large.m)
- Synchronous compile: **88.6% of wall time is compilation**
- 55ms compile per batch, 0.54ms actual training
- Training throughput limited by compiler, not by ANE

### After (train_double_buffer.m)
- Async double-buffered compile: **0% compile stall**
- Background compile happens during forward/backward passes
- ~130 eval steps fit in one compile window
- Weight updates are "delayed" by one batch (standard technique in distributed training)
- Training throughput limited only by ANE eval speed
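The ~130-evals-per-compile-window figure follows directly from the measured times above:

```python
compile_ms = 25.0      # measured compile time per kernel set
eval_ms = 0.189        # measured time per eval
steps_per_window = compile_ms / eval_ms   # evals that fit in one compile window
```

Roughly 132 evals fit while the next kernel set compiles, matching the ~130:1 compile:eval ratio in the summary table.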

### Architecture
```
Time →
Active kernels: [=== eval batch N ===][== eval batch N+1 ==][== eval batch N+2 ==]
Background:     [compile N+1 weights ][compile N+2 weights ][compile N+3 weights ]
                                     ↑                     ↑                     ↑
                                swap ready            swap ready            swap ready
```

Two kernel sets (A and B) alternate between active evaluation and background compilation.
When the background compile finishes, pointers swap atomically at the batch boundary.
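A minimal Python sketch of that A/B schedule (the real implementation uses GCD in Objective-C; `compile_weights` and `run_batch` here are hypothetical stand-ins for the ANE compile and the ~130-step eval loop):

```python
import threading

def train_double_buffered(num_batches, compile_weights, run_batch):
    """Evaluate on one kernel set while the next one compiles in the
    background; swap pointers at the batch boundary."""
    active = compile_weights(0)              # kernel set A, compiled up front
    for n in range(num_batches):
        slot = {}
        bg = threading.Thread(
            target=lambda: slot.update(model=compile_weights(n + 1)))
        bg.start()                           # background: compile kernel set B
        run_batch(active, n)                 # foreground: eval batch N on set A
        bg.join()                            # batch boundary: compile finished
        active = slot["model"]               # swap — B becomes the active set
    return active
```

Because each batch runs on weights compiled one window earlier, updates lag by one batch — the delayed-update behavior noted above.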