16 changes: 11 additions & 5 deletions README.md
@@ -148,11 +148,17 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
| Channel-first layout | 20.3 | 5.2% |
| vDSP vectorized RMSNorm | 14.2 | 7.4% |
| GCD async cblas overlap | 11.4 | 9.2% |
| ANE RMSNorm fusion | 11.4 | 9.2% |
| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
| Deferred cblas wait | **9.3** | **11.2%** |

## Community Benchmarks

Community hardware benchmark submissions live in [`benchmarks/submissions/`](benchmarks/submissions/).

- [Mac Studio (Apple M3 Ultra, 256 GB) — 2026-03-03](benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/README.md)

## Disclaimer

This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.

34 changes: 34 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,34 @@
# Community Benchmark Submissions

This folder is for reproducible hardware benchmark submissions from the community.

## Goals

- Make cross-chip results easy to compare.
- Keep raw logs attached so numbers are auditable.
- Keep submissions lightweight and low-maintenance.

## Submission Layout

Use one directory per machine/date:

`benchmarks/submissions/<chip>-<machine>-<YYYY-MM-DD>/`

Required files:

- `README.md` — short summary of machine, commands, and key results
- `metrics.json` — machine-readable summary of key metrics
- `raw/` — raw command outputs (`*.log`, `system_info.txt`, `upstream_commit.txt`)
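
The required layout above can be checked mechanically before opening a pull request. The following is a minimal sketch, not a tool that exists in the repository: the function name `check_submission_layout` is hypothetical, and it assumes "`*.log`" means at least one log file under `raw/`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): verifies that a
# submission directory contains the required files listed above.
check_submission_layout() {
  local dir="$1" missing=0 f
  # Required top-level files and required raw/ captures.
  for f in README.md metrics.json raw/system_info.txt raw/upstream_commit.txt; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing file: $f" >&2
      missing=1
    fi
  done
  # Assumption: at least one raw/*.log file must be present.
  if ! ls "$dir"/raw/*.log >/dev/null 2>&1; then
    echo "missing file: raw/*.log" >&2
    missing=1
  fi
  return "$missing"
}
```

For example, `check_submission_layout benchmarks/submissions/m3-ultra-mac-studio-2026-03-03` returns 0 when nothing is missing and prints one line per missing file otherwise.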

## Privacy

Please redact machine serial numbers, UUIDs, and other unique identifiers before committing logs.
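
One way to do the redaction is a small `sed` filter over the captured text. This is a sketch under assumptions: `redact_identifiers` is a hypothetical name, and the field labels it targets (`Serial Number`, `Hardware UUID`, `Provisioning UDID`) are the ones `system_profiler SPHardwareDataType` typically prints. Other identifiers may still remain, so the output should be reviewed by hand.

```bash
#!/usr/bin/env bash
# Hypothetical filter (not part of the repository): reads text on stdin and
# masks common hardware identifier fields plus any UUID-shaped token.
redact_identifiers() {
  sed -E \
    -e 's/(Serial Number[^:]*:).*/\1 [REDACTED]/' \
    -e 's/(Hardware UUID:).*/\1 [REDACTED]/' \
    -e 's/(Provisioning UDID:).*/\1 [REDACTED]/' \
    -e 's/[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}/[REDACTED-UUID]/g'
}
```

Usage would look like `redact_identifiers < raw/system_info.txt > raw/system_info.redacted.txt`, followed by replacing the original file once the result has been checked.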

## Minimal Repro Guidance

Each submission should include:

- exact upstream commit hash tested
- exact commands run
- fixed step counts for training comparisons (for example, `--steps 20`)
- clear pass/fail status for each benchmark
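
To make the pass/fail status explicit rather than implied by the logs, a wrapper can record each benchmark's exit status next to its output. The sketch below is illustrative only: `run_and_record` and the `status.txt` file name are hypothetical and do not appear in the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): runs one benchmark,
# saves its combined stdout/stderr, and appends a PASS/FAIL line.
# Usage: run_and_record <name> <log_dir> <command> [args...]
run_and_record() {
  local name="$1" log_dir="$2" rc=0
  shift 2
  "$@" > "$log_dir/$name.log" 2>&1 || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "$name PASS" >> "$log_dir/status.txt"
  else
    echo "$name FAIL(exit=$rc)" >> "$log_dir/status.txt"
  fi
  # Always return 0 so one failing benchmark does not abort the whole run;
  # the status file carries the result.
  return 0
}
```

A call such as `run_and_record train_large "$ART" ./train_large --steps 20 --lr 1e-4` would write `train_large.log` and one status line. Exit status alone may not be sufficient: a benchmark can print failing rows and still exit 0, so the logs should still be inspected.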
81 changes: 81 additions & 0 deletions benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/README.md
@@ -0,0 +1,81 @@
# Mac Studio M3 Ultra Benchmark Submission (2026-03-03)

This submission targets upstream issue `#3` (collecting results across Apple Silicon variants).

## Environment

- Upstream commit: `443194bca4491fae4400bae9dad2a0470692bdbf`
- Machine: Mac Studio (`Mac15,14`)
- Chip: Apple M3 Ultra
- CPU cores: 28 total (20P + 8E)
- Memory: 256 GB (`274877906944` bytes)
- OS: macOS 26.3 (`25D125`)
- Toolchain: Apple clang 17.0.0 (`/Library/Developer/CommandLineTools`)

Raw system capture: [`raw/system_info.txt`](raw/system_info.txt)
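
The memory figure above is the `hw.memsize` byte count converted to binary gigabytes (GiB, reported as "GB"). As a quick sanity check, the conversion can be sketched as follows; `bytes_to_gib` is a hypothetical helper name, not something from the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: converts a byte count to whole GiB (2^30 bytes)
# using integer arithmetic.
bytes_to_gib() {
  echo $(( $1 / 1073741824 ))
}

# Example: bytes_to_gib 274877906944   # prints 256
```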

## Commands Run

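Exact commands used are included in [`commands.sh`](commands.sh).

Highlights (log redirection omitted; `commands.sh` writes checkpoints under its artifact directory rather than `/tmp`):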

```bash
# Root benchmark
xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
-ldl -lobjc -o inmem_peak inmem_peak.m
./inmem_peak

# Training benchmarks
cd training
bash download_data.sh
make train_large train_large_ane
./train_large --steps 20 --lr 1e-4 --ckpt /tmp/train_large.ckpt
./train_large_ane --steps 20 --lr 1e-4 --ckpt /tmp/train_large_ane.ckpt
./train_large_ane --no-ane-extras --steps 20 --lr 1e-4 --ckpt /tmp/train_large_ane_no_extras.ckpt
cd training_dynamic
make train
./train --scratch --steps 20 --lr 1e-4
```

## Training Results (20 steps)

| Pipeline | Wall time | Compile time | Train time | Avg train | ANE TFLOPS | Total TFLOPS |
|---|---:|---:|---:|---:|---:|---:|
| `train_large` | 9471 ms | 7545 ms (79.7%) | 1623 ms (17.1%) | 81.2 ms/step | 1.15 | 2.15 |
| `train_large_ane` | 10898 ms | 9090 ms (83.4%) | 1428 ms (13.1%) | 71.4 ms/step | 1.48 | 2.44 |
| `train_large_ane --no-ane-extras` | 10248 ms | 7455 ms (72.7%) | 2476 ms (24.2%) | 123.8 ms/step | 0.85 | 1.41 |
| `training_dynamic/train --scratch` | 2.9 s | 353 ms (one-time, 12.0%) | 2309 ms | 115.4 ms/step | n/a | n/a |

Raw logs:

- [`raw/train_large.log`](raw/train_large.log)
- [`raw/train_large_ane.log`](raw/train_large_ane.log)
- [`raw/train_large_ane_no_extras.log`](raw/train_large_ane_no_extras.log)
- [`raw/train_dynamic.log`](raw/train_dynamic.log)
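
The percentage and per-step columns in the table above follow directly from the millisecond figures: each percentage is the phase time divided by wall time, and the average is train time divided by the 20 steps. A minimal sketch for re-deriving them from the raw numbers is shown below; the helper names `pct` and `per_step` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for cross-checking the table against raw timings.

# pct <part_ms> <total_ms>: share of wall time, as a percentage with one decimal.
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# per_step <train_ms> <steps>: average milliseconds per training step.
per_step() {
  awk -v ms="$1" -v steps="$2" 'BEGIN { printf "%.1f\n", ms / steps }'
}

# Examples using the train_large_ane row:
#   pct 9090 10898     # prints 83.4
#   per_step 1428 20   # prints 71.4
```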

## In-Memory Peak Results

Best observed from `inmem_peak`:

- 8.08 TFLOPS at `128x conv 512ch sp64` (`4.29 GFLOP`, `0.531 ms/eval`)

Raw log:

- [`raw/inmem_peak.log`](raw/inmem_peak.log)
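
The TFLOPS figure is the work per evaluation divided by the time per evaluation. Because 1 GFLOP per millisecond equals 10^9 / 10^-3 = 10^12 FLOP/s, dividing GFLOP by ms/eval gives TFLOPS directly. The sketch below reproduces that arithmetic; `tflops` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: tflops <gflop_per_eval> <ms_per_eval>
# GFLOP / ms is numerically equal to TFLOP / s.
tflops() {
  awk -v gflop="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", gflop / ms }'
}

# Example using the best observed configuration:
#   tflops 4.29 0.531   # prints 8.08
```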

## Additional Root Benchmarks

- `inmem_bench`: all configs returned `FAIL(-1)` on this clean setup
- `sram_bench`: all configs returned `FAIL(-1)` on this clean setup

Raw logs:

- [`raw/inmem_bench.log`](raw/inmem_bench.log)
- [`raw/sram_bench.log`](raw/sram_bench.log)

## Notes

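- `train_large_ane` had the best per-step throughput in this run (71.4 ms/step, versus 81.2 ms/step for `train_large`).
- The dynamic pipeline had the lowest wall-clock time over this short run (2.9 s) because its compile cost is a one-time 353 ms.
- The static pipelines remained compile-dominated over 20 steps, with compilation taking 72.7% to 83.4% of wall time.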
62 changes: 62 additions & 0 deletions benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/commands.sh
@@ -0,0 +1,62 @@
#!/usr/bin/env bash
set -euo pipefail

# Repro commands used for this submission.
# Machine: Mac Studio (Apple M3 Ultra)
# Commit: 443194bca4491fae4400bae9dad2a0470692bdbf

REPO="${REPO:-$HOME/Dev/ANE-upstream}"
ART="${ART:-$REPO/bench_artifacts/m3-ultra-2026-03-03/raw}"

mkdir -p "$ART"
cd "$REPO"

# System capture
{
echo "timestamp_utc=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
sw_vers
uname -a
echo
echo "=== sysctl ==="
sysctl hw.model hw.memsize hw.ncpu hw.physicalcpu hw.logicalcpu \
hw.perflevel0.physicalcpu hw.perflevel1.physicalcpu \
machdep.cpu.brand_string 2>/dev/null || true
echo
echo "=== system_profiler SPHardwareDataType ==="
system_profiler SPHardwareDataType
echo
echo "=== toolchain ==="
xcode-select -p
xcrun clang --version
} > "$ART/system_info.txt"

# Root benchmark
xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
-ldl -lobjc -o inmem_peak inmem_peak.m
./inmem_peak > "$ART/inmem_peak.log" 2>&1

# Optional root benchmarks (may fail on clean setups)
xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
-ldl -lobjc -o inmem_bench inmem_bench.m
./inmem_bench > "$ART/inmem_bench.log" 2>&1 || true

xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
-ldl -lobjc -o sram_bench sram_bench.m
./sram_bench > "$ART/sram_bench.log" 2>&1 || true

# Training benchmarks
cd "$REPO/training"
bash download_data.sh > "$ART/download_data.log" 2>&1
make train_large train_large_ane > "$ART/training_make.log" 2>&1
./train_large --steps 20 --lr 1e-4 --ckpt "$ART/train_large.ckpt" > "$ART/train_large.log" 2>&1
./train_large_ane --steps 20 --lr 1e-4 --ckpt "$ART/train_large_ane.ckpt" > "$ART/train_large_ane.log" 2>&1
./train_large_ane --no-ane-extras --steps 20 --lr 1e-4 --ckpt "$ART/train_large_ane_no_extras.ckpt" > "$ART/train_large_ane_no_extras.log" 2>&1

cd "$REPO/training/training_dynamic"
make train > "$ART/training_dynamic_make.log" 2>&1
./train --scratch --steps 20 --lr 1e-4 > "$ART/train_dynamic.log" 2>&1

cd "$REPO"
git rev-parse HEAD > "$ART/upstream_commit.txt"

echo "Done. Raw logs are in: $ART"
101 changes: 101 additions & 0 deletions benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/metrics.json
@@ -0,0 +1,101 @@
{
"submission_id": "m3-ultra-mac-studio-2026-03-03",
"captured_at_utc": "2026-03-03T18:34:30Z",
"upstream_commit": "443194bca4491fae4400bae9dad2a0470692bdbf",
"system": {
"model_name": "Mac Studio",
"model_identifier": "Mac15,14",
"chip": "Apple M3 Ultra",
"memory_bytes": 274877906944,
"cpu_cores_total": 28,
"cpu_cores_performance": 20,
"cpu_cores_efficiency": 8,
"os_product_version": "26.3",
"os_build_version": "25D125"
},
"toolchain": {
"developer_dir": "/Library/Developer/CommandLineTools",
"clang": "Apple clang version 17.0.0 (clang-1700.3.19.1)"
},
"training": {
"steps": 20,
"train_large": {
"wall_time_ms": 9471,
"compile_time_ms": 7545,
"compile_pct": 79.7,
"train_time_ms": 1623,
"train_pct": 17.1,
"avg_train_ms_per_step": 81.2,
"ane_tflops": 1.15,
"total_tflops": 2.15,
"ane_utilization_pct_of_15_8_tflops": 7.3
},
"train_large_ane": {
"wall_time_ms": 10898,
"compile_time_ms": 9090,
"compile_pct": 83.4,
"train_time_ms": 1428,
"train_pct": 13.1,
"avg_train_ms_per_step": 71.4,
"ane_tflops": 1.48,
"total_tflops": 2.44,
"ane_utilization_pct_of_15_8_tflops": 9.4
},
"train_large_ane_no_extras": {
"wall_time_ms": 10248,
"compile_time_ms": 7455,
"compile_pct": 72.7,
"train_time_ms": 2476,
"train_pct": 24.2,
"avg_train_ms_per_step": 123.8,
"ane_tflops": 0.85,
"total_tflops": 1.41,
"ane_utilization_pct_of_15_8_tflops": 5.4
},
"train_dynamic_scratch": {
"compile_time_ms": 353,
"compile_pct_one_time": 12.0,
"train_time_ms": 2309,
"avg_train_ms_per_step": 115.4,
"wall_time_s": 2.9
}
},
"inmem_peak": {
"best_tflops": 8.08,
"best_config": "128x conv 512ch sp64",
"rows": [
{ "config": "32x conv 512ch sp64", "weight_mb": 16.0, "gflop": 1.07, "ms_per_eval": 0.497, "tflops": 2.16 },
{ "config": "48x conv 512ch sp64", "weight_mb": 24.0, "gflop": 1.61, "ms_per_eval": 0.535, "tflops": 3.01 },
{ "config": "64x conv 512ch sp64", "weight_mb": 32.0, "gflop": 2.15, "ms_per_eval": 0.355, "tflops": 6.06 },
{ "config": "96x conv 512ch sp64", "weight_mb": 48.0, "gflop": 3.22, "ms_per_eval": 0.423, "tflops": 7.61 },
{ "config": "128x conv 512ch sp64", "weight_mb": 64.0, "gflop": 4.29, "ms_per_eval": 0.531, "tflops": 8.08 },
{ "config": "64x conv 256ch sp64", "weight_mb": 8.0, "gflop": 0.54, "ms_per_eval": 0.287, "tflops": 1.87 },
{ "config": "128x conv 256ch sp64", "weight_mb": 16.0, "gflop": 1.07, "ms_per_eval": 0.272, "tflops": 3.94 },
{ "config": "256x conv 256ch sp64", "weight_mb": 32.0, "gflop": 2.15, "ms_per_eval": 0.439, "tflops": 4.89 },
{ "config": "64x conv 384ch sp64", "weight_mb": 18.0, "gflop": 1.21, "ms_per_eval": 0.319, "tflops": 3.78 },
{ "config": "128x conv 384ch sp64", "weight_mb": 36.0, "gflop": 2.42, "ms_per_eval": 0.369, "tflops": 6.55 }
]
},
"inmem_bench": {
"status": "failed",
"failure": "all rows returned FAIL(-1)"
},
"sram_bench": {
"status": "failed",
"failure": "all rows returned FAIL(-1)"
},
"raw_files": [
"raw/system_info.txt",
"raw/upstream_commit.txt",
"raw/download_data.log",
"raw/training_make.log",
"raw/training_dynamic_make.log",
"raw/inmem_peak.log",
"raw/inmem_bench.log",
"raw/sram_bench.log",
"raw/train_large.log",
"raw/train_large_ane.log",
"raw/train_large_ane_no_extras.log",
"raw/train_dynamic.log"
]
}