maderix · nabbilkhan · Mar 3, 2026
diff --git a/README.md b/README.md
@@ -148,11 +148,17 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
 | Channel-first layout | 20.3 | 5.2% |
 | vDSP vectorized RMSNorm | 14.2 | 7.4% |
 | GCD async cblas overlap | 11.4 | 9.2% |
-| ANE RMSNorm fusion | 11.4 | 9.2% |
-| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
-| Deferred cblas wait | **9.3** | **11.2%** |
-
-## Disclaimer
+| ANE RMSNorm fusion | 11.4 | 9.2% |
+| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
+| Deferred cblas wait | **9.3** | **11.2%** |
+
+## Community Benchmarks
+
+Community hardware benchmark submissions live in [`benchmarks/submissions/`](benchmarks/submissions/).
+
+- [Mac Studio (Apple M3 Ultra, 256 GB) — 2026-03-03](benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/README.md)
+
+## Disclaimer
 
 This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.
 

diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -0,0 +1,34 @@
+# Community Benchmark Submissions
+
+This folder is for reproducible hardware benchmark submissions from the community.
+
+## Goals
+
+- Make cross-chip results easy to compare.
+- Keep raw logs attached so numbers are auditable.
+- Keep submissions lightweight and low-maintenance.
+
+## Submission Layout
+
+Use one directory per machine/date:
+
+`benchmarks/submissions/<chip>-<machine>-<YYYY-MM-DD>/`
+
+Required files:
+
+- `README.md` — short summary of machine, commands, and key results
+- `metrics.json` — machine-readable summary of key metrics
+- `raw/` — raw command outputs (`*.log`, `system_info.txt`, `upstream_commit.txt`)
+
+## Privacy
+
+Please redact machine serial numbers, UUIDs, and other unique identifiers before committing logs.
+
+## Minimal Repro Guidance
+
+Each submission should include:
+
+- exact upstream commit hash tested
+- exact commands run
+- fixed step counts for training comparisons (for example, `--steps 20`)
+- clear pass/fail status for each benchmark
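
Review note on the Privacy section above: `commands.sh` in this PR writes `system_profiler SPHardwareDataType` output straight into `system_info.txt`, and that output normally includes a serial number and hardware UUID. A minimal sketch of a redaction filter follows. The function name `redact_identifiers` is hypothetical and not part of this PR, and the field labels it matches (`Serial Number`, `Hardware UUID`, `Provisioning UDID`) are assumptions about the `system_profiler` output format.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): mask unique identifiers in a
# captured log before committing it. Reads stdin, writes stdout.
redact_identifiers() {
  sed -E \
    -e 's/^([[:space:]]*Serial Number[^:]*:).*/\1 [REDACTED]/' \
    -e 's/^([[:space:]]*(Hardware UUID|Provisioning UDID):).*/\1 [REDACTED]/' \
    -e 's/[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}/[REDACTED-UUID]/g'
}
```

One way to use it would be `redact_identifiers < "$ART/system_info.txt" > system_info.redacted.txt`, followed by a manual read-through, since a pattern filter can miss identifiers that appear under other labels.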
diff --git a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/README.md b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/README.md
@@ -0,0 +1,81 @@
+# Mac Studio M3 Ultra Benchmark Submission (2026-03-03)
+
+This submission responds to upstream issue `#3`, which collects results across Apple Silicon variants.
+
+## Environment
+
+- Upstream commit: `443194bca4491fae4400bae9dad2a0470692bdbf`
+- Machine: Mac Studio (`Mac15,14`)
+- Chip: Apple M3 Ultra
+- CPU cores: 28 total (20P + 8E)
+- Memory: 256 GB (`274877906944` bytes)
+- OS: macOS 26.3 (`25D125`)
+- Toolchain: Apple clang 17.0.0 (`/Library/Developer/CommandLineTools`)
+
+Raw system capture: [`raw/system_info.txt`](raw/system_info.txt)
+
+## Commands Run
+
+Exact commands used are included in [`commands.sh`](commands.sh).
+
+Highlights:
+
+```bash
+# Root benchmark
+xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
+  -ldl -lobjc -o inmem_peak inmem_peak.m
+./inmem_peak
+
+# Training benchmarks
+cd training
+bash download_data.sh
+make train_large train_large_ane
+./train_large --steps 20 --lr 1e-4 --ckpt /tmp/train_large.ckpt
+./train_large_ane --steps 20 --lr 1e-4 --ckpt /tmp/train_large_ane.ckpt
+./train_large_ane --no-ane-extras --steps 20 --lr 1e-4 --ckpt /tmp/train_large_ane_no_extras.ckpt
+cd training_dynamic
+make train
+./train --scratch --steps 20 --lr 1e-4
+```
+
+## Training Results (20 steps)
+
+| Pipeline | Wall time | Compile time | Train time | Avg train | ANE TFLOPS | Total TFLOPS |
+|---|---:|---:|---:|---:|---:|---:|
+| `train_large` | 9471 ms | 7545 ms (79.7%) | 1623 ms (17.1%) | 81.2 ms/step | 1.15 | 2.15 |
+| `train_large_ane` | 10898 ms | 9090 ms (83.4%) | 1428 ms (13.1%) | 71.4 ms/step | 1.48 | 2.44 |
+| `train_large_ane --no-ane-extras` | 10248 ms | 7455 ms (72.7%) | 2476 ms (24.2%) | 123.8 ms/step | 0.85 | 1.41 |
+| `training_dynamic/train --scratch` | 2.9 s | 353 ms (one-time, 12.0%) | 2309 ms | 115.4 ms/step | n/a | n/a |
+
+Raw logs:
+
+- [`raw/train_large.log`](raw/train_large.log)
+- [`raw/train_large_ane.log`](raw/train_large_ane.log)
+- [`raw/train_large_ane_no_extras.log`](raw/train_large_ane_no_extras.log)
+- [`raw/train_dynamic.log`](raw/train_dynamic.log)
+
+## In-Memory Peak Results
+
+Best observed from `inmem_peak`:
+
+- 8.08 TFLOPS at `128x conv 512ch sp64` (`4.29 GFLOP`, `0.531 ms/eval`)
+
+Raw log:
+
+- [`raw/inmem_peak.log`](raw/inmem_peak.log)
+
+## Additional Root Benchmarks
+
+- `inmem_bench`: all configs returned `FAIL(-1)` on this clean setup
+- `sram_bench`: all configs returned `FAIL(-1)` on this clean setup
+
+Raw logs:
+
+- [`raw/inmem_bench.log`](raw/inmem_bench.log)
+- [`raw/sram_bench.log`](raw/sram_bench.log)
+
+## Notes
+
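+- `train_large_ane` had the best per-step throughput in this run (71.4 ms/step).
+- The dynamic pipeline had the lowest wall-clock time over 20 steps (2.9 s) because it pays a single one-time compile cost (353 ms).
+- The static pipelines remained compile-dominated over 20 steps (72.7% to 83.4% of wall time).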
diff --git a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/commands.sh b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/commands.sh
@@ -0,0 +1,62 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Repro commands used for this submission.
+# Machine: Mac Studio (Apple M3 Ultra)
+# Commit: 443194bca4491fae4400bae9dad2a0470692bdbf
+
+REPO="${REPO:-$HOME/Dev/ANE-upstream}"
+ART="${ART:-$REPO/bench_artifacts/m3-ultra-2026-03-03/raw}"
+
+mkdir -p "$ART"
+cd "$REPO"
+
+# System capture (redact serial numbers and UUIDs before committing; see benchmarks/README.md)
+{
+  echo "timestamp_utc=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+  sw_vers
+  uname -a
+  echo
+  echo "=== sysctl ==="
+  sysctl hw.model hw.memsize hw.ncpu hw.physicalcpu hw.logicalcpu \
+    hw.perflevel0.physicalcpu hw.perflevel1.physicalcpu \
+    machdep.cpu.brand_string 2>/dev/null || true
+  echo
+  echo "=== system_profiler SPHardwareDataType ==="
+  system_profiler SPHardwareDataType
+  echo
+  echo "=== toolchain ==="
+  xcode-select -p
+  xcrun clang --version
+} > "$ART/system_info.txt"
+
+# Root benchmark
+xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
+  -ldl -lobjc -o inmem_peak inmem_peak.m
+./inmem_peak > "$ART/inmem_peak.log" 2>&1
+
+# Optional root benchmarks (may fail on clean setups)
+xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
+  -ldl -lobjc -o inmem_bench inmem_bench.m
+./inmem_bench > "$ART/inmem_bench.log" 2>&1 || true
+
+xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
+  -ldl -lobjc -o sram_bench sram_bench.m
+./sram_bench > "$ART/sram_bench.log" 2>&1 || true
+
+# Training benchmarks
+cd "$REPO/training"
+bash download_data.sh > "$ART/download_data.log" 2>&1
+make train_large train_large_ane > "$ART/training_make.log" 2>&1
+./train_large --steps 20 --lr 1e-4 --ckpt "$ART/train_large.ckpt" > "$ART/train_large.log" 2>&1
+./train_large_ane --steps 20 --lr 1e-4 --ckpt "$ART/train_large_ane.ckpt" > "$ART/train_large_ane.log" 2>&1
+./train_large_ane --no-ane-extras --steps 20 --lr 1e-4 --ckpt "$ART/train_large_ane_no_extras.ckpt" > "$ART/train_large_ane_no_extras.log" 2>&1
+
+cd "$REPO/training/training_dynamic"
+make train > "$ART/training_dynamic_make.log" 2>&1
+./train --scratch --steps 20 --lr 1e-4 > "$ART/train_dynamic.log" 2>&1
+
+cd "$REPO"
+git rev-parse HEAD > "$ART/upstream_commit.txt"
+
+echo "Done. Raw logs are in: $ART"
diff --git a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/metrics.json b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/metrics.json
@@ -0,0 +1,101 @@
+{
+  "submission_id": "m3-ultra-mac-studio-2026-03-03",
+  "captured_at_utc": "2026-03-03T18:34:30Z",
+  "upstream_commit": "443194bca4491fae4400bae9dad2a0470692bdbf",
+  "system": {
+    "model_name": "Mac Studio",
+    "model_identifier": "Mac15,14",
+    "chip": "Apple M3 Ultra",
+    "memory_bytes": 274877906944,
+    "cpu_cores_total": 28,
+    "cpu_cores_performance": 20,
+    "cpu_cores_efficiency": 8,
+    "os_product_version": "26.3",
+    "os_build_version": "25D125"
+  },
+  "toolchain": {
+    "developer_dir": "/Library/Developer/CommandLineTools",
+    "clang": "Apple clang version 17.0.0 (clang-1700.3.19.1)"
+  },
+  "training": {
+    "steps": 20,
+    "train_large": {
+      "wall_time_ms": 9471,
+      "compile_time_ms": 7545,
+      "compile_pct": 79.7,
+      "train_time_ms": 1623,
+      "train_pct": 17.1,
+      "avg_train_ms_per_step": 81.2,
+      "ane_tflops": 1.15,
+      "total_tflops": 2.15,
+      "ane_utilization_pct_of_15_8_tflops": 7.3
+    },
+    "train_large_ane": {
+      "wall_time_ms": 10898,
+      "compile_time_ms": 9090,
+      "compile_pct": 83.4,
+      "train_time_ms": 1428,
+      "train_pct": 13.1,
+      "avg_train_ms_per_step": 71.4,
+      "ane_tflops": 1.48,
+      "total_tflops": 2.44,
+      "ane_utilization_pct_of_15_8_tflops": 9.4
+    },
+    "train_large_ane_no_extras": {
+      "wall_time_ms": 10248,
+      "compile_time_ms": 7455,
+      "compile_pct": 72.7,
+      "train_time_ms": 2476,
+      "train_pct": 24.2,
+      "avg_train_ms_per_step": 123.8,
+      "ane_tflops": 0.85,
+      "total_tflops": 1.41,
+      "ane_utilization_pct_of_15_8_tflops": 5.4
+    },
+    "train_dynamic_scratch": {
+      "compile_time_ms": 353,
+      "compile_pct_one_time": 12.0,
+      "train_time_ms": 2309,
+      "avg_train_ms_per_step": 115.4,
+      "wall_time_s": 2.9
+    }
+  },
+  "inmem_peak": {
+    "best_tflops": 8.08,
+    "best_config": "128x conv 512ch sp64",
+    "rows": [
+      { "config": "32x conv 512ch sp64", "weight_mb": 16.0, "gflop": 1.07, "ms_per_eval": 0.497, "tflops": 2.16 },
+      { "config": "48x conv 512ch sp64", "weight_mb": 24.0, "gflop": 1.61, "ms_per_eval": 0.535, "tflops": 3.01 },
+      { "config": "64x conv 512ch sp64", "weight_mb": 32.0, "gflop": 2.15, "ms_per_eval": 0.355, "tflops": 6.06 },
+      { "config": "96x conv 512ch sp64", "weight_mb": 48.0, "gflop": 3.22, "ms_per_eval": 0.423, "tflops": 7.61 },
+      { "config": "128x conv 512ch sp64", "weight_mb": 64.0, "gflop": 4.29, "ms_per_eval": 0.531, "tflops": 8.08 },
+      { "config": "64x conv 256ch sp64", "weight_mb": 8.0, "gflop": 0.54, "ms_per_eval": 0.287, "tflops": 1.87 },
+      { "config": "128x conv 256ch sp64", "weight_mb": 16.0, "gflop": 1.07, "ms_per_eval": 0.272, "tflops": 3.94 },
+      { "config": "256x conv 256ch sp64", "weight_mb": 32.0, "gflop": 2.15, "ms_per_eval": 0.439, "tflops": 4.89 },
+      { "config": "64x conv 384ch sp64", "weight_mb": 18.0, "gflop": 1.21, "ms_per_eval": 0.319, "tflops": 3.78 },
+      { "config": "128x conv 384ch sp64", "weight_mb": 36.0, "gflop": 2.42, "ms_per_eval": 0.369, "tflops": 6.55 }
+    ]
+  },
+  "inmem_bench": {
+    "status": "failed",
+    "failure": "all rows returned FAIL(-1)"
+  },
+  "sram_bench": {
+    "status": "failed",
+    "failure": "all rows returned FAIL(-1)"
+  },
+  "raw_files": [
+    "raw/system_info.txt",
+    "raw/upstream_commit.txt",
+    "raw/download_data.log",
+    "raw/training_make.log",
+    "raw/training_dynamic_make.log",
+    "raw/inmem_peak.log",
+    "raw/inmem_bench.log",
+    "raw/sram_bench.log",
+    "raw/train_large.log",
+    "raw/train_large_ane.log",
+    "raw/train_large_ane_no_extras.log",
+    "raw/train_dynamic.log"
+  ]
+}
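
Review note: the derived fields in `metrics.json` can be spot-checked from the raw fields. The sketch below assumes `tflops` is `gflop / ms_per_eval` (GFLOP per millisecond equals TFLOP per second) and that the utilization fields are a share of the 15.8 TFLOPS figure named in the key `ane_utilization_pct_of_15_8_tflops`. The function names `tflops_from` and `utilization_pct` are hypothetical and not part of this PR.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of this PR) for spot-checking derived metrics.

# GFLOP per millisecond equals TFLOP per second (1e9 / 1e-3 = 1e12).
tflops_from() {  # usage: tflops_from <gflop> <ms_per_eval>
  awk -v g="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", g / ms }'
}

# Share of an assumed peak, in percent, rounded to one decimal place.
utilization_pct() {  # usage: utilization_pct <tflops> <peak_tflops>
  awk -v t="$1" -v p="$2" 'BEGIN { printf "%.1f\n", 100 * t / p }'
}

tflops_from 4.29 0.531       # best inmem_peak row, prints 8.08
utilization_pct 1.48 15.8    # train_large_ane, prints 9.4
```

Because the JSON stores already-rounded inputs, some rows differ in the last digit when recomputed this way (for example, `1.07 / 0.497` gives 2.15 against the recorded 2.16), so a check like this is best read with a small tolerance.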