
feat: ANE 7B–13B Production Roadmap + Starter Code for OpenClaw Swarm #7

Draft
codegen-sh[bot] wants to merge 1 commit into main from codegen-bot/ane-7b-roadmap-openclaw-a3f7c2

Conversation


@codegen-sh (bot) commented Mar 5, 2026

3-Phase Technical Roadmap: Stories110M → 7B–13B on M2 ANE

Comprehensive roadmap + code-level starter files for evolving ANE from the current working 109M-param baseline to production-grade 7B–13B inference and fine-tuning on a 24 GB M2 MacBook, using only the ANE for compute.

Core Architectural Insight: Weight-Swap Architecture

Instead of compiling one kernel program per layer (~96 forward kernels for a 32-layer 7B model, which together with backward and auxiliary kernels blows past the ~119 compile limit), compile 11 parameterized kernel programs (6 forward + 5 backward) and iterate over layers by swapping weights via unload→rewrite→load. This pattern is already validated by the existing test_weight_reload.m; rerunning that gate test at 7B layer dimensions (the Day 2 gate) is the critical go/no-go.
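The swap loop can be sketched as below. The method names (`unload_weights`, `rewrite_weights`, `load_weights`, `run_forward`) are illustrative assumptions, not the actual bridge API; the shipped validation lives in test_weight_reload.m.

```python
# Kernel-count arithmetic from the roadmap.
N_LAYERS = 32          # transformer blocks in Llama-7B
KERNELS_PER_LAYER = 3  # illustrative per-layer split (attention/FFN/norm)

naive_programs = N_LAYERS * KERNELS_PER_LAYER  # 96 forward programs alone

# Weight-swap plan: 11 parameterized programs shared by every layer.
FORWARD_PROGRAMS, BACKWARD_PROGRAMS = 6, 5
shared_programs = FORWARD_PROGRAMS + BACKWARD_PROGRAMS  # 11

def forward_all_layers(bridge, hidden, layer_weights):
    """Run every layer through the same compiled programs by swapping
    each layer's weights in via unload -> rewrite -> load."""
    for weights in layer_weights:
        bridge.unload_weights()          # release the previous layer's blobs
        bridge.rewrite_weights(weights)  # patch the constant weight region
        bridge.load_weights()            # remap for the ANE
        hidden = bridge.run_forward(hidden)
    return hidden
```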


Memory Budget (M2 24 GB)

| Model | Weights | KV Cache | Total | Headroom |
| --- | --- | --- | --- | --- |
| 7B Q4 (group128) | 3.5 GB | 1.0 GB | 7.0 GB | 13.0 GB |
| 13B Q3 (group64) | 4.9 GB | 1.6 GB | 9.7 GB | 10.3 GB |
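The weight figures follow directly from bits-per-weight arithmetic; a quick sanity check (ignoring the small per-group fp16 scale overhead, which adds only ~0.125–0.25 extra bits per weight at group sizes 128/64):

```python
def weight_gb(n_params, bits_per_weight):
    """Quantized weight payload in decimal GB, scale overhead ignored."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_gb(7e9, 4))    # 3.5   -> "7B Q4" weights column
print(weight_gb(13e9, 3))   # 4.875 -> ~4.9, "13B Q3" weights column
print(20.0 - 7.0)           # 13.0  -> 7B headroom under the 20 GB cap
```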

Performance Targets

| Metric | 7B Q4 | 13B Q3 |
| --- | --- | --- |
| Decode tok/s | 12–18 | 5–9 |
| Prefill tok/s (seq=512) | 18–25 | 8–12 |
| LoRA training tok/s | 3–6 | 1–3 |

Deliverables

| File | Phase | Description |
| --- | --- | --- |
| roadmap/ROADMAP_7B_ANE.md | - | Full 526-line spec with memory tables, perf derivations, decision gates |
| roadmap/mil_gen_llama.h | 1 | Parameterized MIL generator: LlamaConfig struct, RoPE fusion, GQA stub, residual+SwiGLU FFN |
| bridge/ane_model.py | 1 | Python ctypes wrapper: ANEBridge, ANEModel, ModelConfig presets, CPU reference forward, llama2c loader |
| roadmap/quant_pack.h | 2 | Q4/Q8 packing + NEON-optimized dequant + .anepak format header |
| bridge/openclaw_manifest.yaml | 3 | OpenClaw skill manifest: endpoints, telemetry, memory guard, watchdog, swarm registration |

Decision Gates

| Gate | Day | Criteria |
| --- | --- | --- |
| G1 | 2 | Weight swap < 10 ms/layer at dim=4096 |
| G2 | 3 | sin/cos MIL ops compile on ANE |
| G3 | 8 | End-to-end 7B forward pass, correct logits |
| G4 | 10 | > 10 tok/s decode on M2 |
| G5 | 18 | 72 h continuous run, 0 crashes |
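Gate G1 can be checked with a small timing harness along these lines. The `swap_fn` callable stands in for the real bridge swap call and is an assumption here; the shipped test is test_weight_reload.m.

```python
import time

def mean_swap_ms(swap_fn, n_layers=32, dim=4096):
    """Time one weight-swap cycle per layer; return the mean in ms."""
    t0 = time.perf_counter()
    for layer in range(n_layers):
        swap_fn(layer, dim)  # unload -> rewrite -> load for this layer
    return (time.perf_counter() - t0) * 1000.0 / n_layers

def gate_g1_passes(ms_per_layer, budget_ms=10.0):
    """G1: weight swap must stay under 10 ms/layer at dim=4096."""
    return ms_per_layer < budget_ms
```

A dummy `swap_fn` can exercise the harness end to end before the real bridge is wired in.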


Phase 1 (Days 1-3): Stable Multi-Layer Stacking
- mil_gen_llama.h: parameterized MIL generator with RoPE, GQA, residual fusion
- bridge/ane_model.py: Python ctypes wrapper + CPU reference forward pass
- Weight-swap architecture: 11 compiled kernels, all layers share via reload

Phase 2 (Days 4-10): Production Op Coverage + Quantization
- quant_pack.h: Q4/Q8 packing + NEON-optimized dequant (<1ms/layer)
- .anepak format for serialized quantized models
- LoRA adapter integration as extra constant blobs in MIL
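As an illustration of the group-scale scheme Phase 2 targets, here is a minimal symmetric Q4 pack/unpack in pure Python. The layout (signed 4-bit values, two per byte, one scale per group) is an assumption for clarity; the authoritative format is whatever roadmap/quant_pack.h and .anepak define.

```python
def q4_pack_group(weights, qmax=7):
    """Quantize one even-length group (e.g. 128 weights) to signed 4-bit
    values with a single fp scale; two values are packed per byte."""
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    packed = bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4)
                   for i in range(0, len(q), 2))
    return scale, packed

def q4_unpack_group(scale, packed):
    """Dequantize back to floats (the NEON fast path in quant_pack.h
    targets the same expansion in under 1 ms per layer)."""
    out = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            out.append((nib - 16 if nib > 7 else nib) * scale)  # sign-extend
    return out
```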

Phase 3 (Days 11-21): Swarm-Ready Production
- openclaw_manifest.yaml: skill manifest for Lobster registration
- Telemetry, inference server, memory guard, watchdog specs
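The guard and watchdog specs can be pictured with a minimal sketch. The semantics here are assumed for illustration; the real contract is whatever bridge/openclaw_manifest.yaml specifies.

```python
import time

def memory_guard_ok(resident_gb, cap_gb=20.0, margin_gb=1.0):
    """Admit a new inference request only while resident memory stays a
    margin below the roadmap's 20 GB cap on the 24 GB M2."""
    return resident_gb < cap_gb - margin_gb

class Watchdog:
    """The serving loop pings after each step; a monitor declares the
    skill hung when no ping arrives within `timeout_s`, so the swarm can
    restart it (assumed semantics, not the manifest's exact fields)."""
    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_ping = clock()

    def ping(self):
        self.last_ping = self.clock()

    def alive(self):
        return self.clock() - self.last_ping < self.timeout_s
```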

Memory fits within the 20 GB cap on M2 24 GB: 7B Q4 = 7.0 GB total, 13B Q3 = 9.7 GB total
Target: 12-18 tok/s decode (7B Q4), 5-9 tok/s (13B Q3) on M2 ANE

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
