
Pipeline scaffolding for multi-group ANE training#1

Draft
codegen-sh[bot] wants to merge 3 commits into main from codegen-bot/pipeline-scaffolding-a7f3e2

Conversation

@codegen-sh codegen-sh bot commented Mar 2, 2026

Scaffolding infrastructure to scale ANE training beyond the single-compile-batch limit. All additive — train_large.m is untouched.

The problem

ANE has a ~119 compilation limit per process. With 5 weight-bearing kernels + 1 static kernel per layer, the current 12-layer Stories110M uses 72 of those. Larger models (22-layer LLaMA-1B, 32-layer LLaMA-7B) blow the budget immediately. The exec() restart pattern works but needs systematic scheduling.

What this adds

model_config.h — Parameterized model configuration

  • ModelConfig struct replacing hardcoded #defines
  • Presets: stories42m, stories110m, llama1b, llama7b
  • compute_pipeline_plan() — computes optimal layer groupings given compile budget
  • Memory/FLOP estimation helpers
  • CLI arg parsing for dimensions/budget overrides

pipeline.h — Layer-group scheduler + mmap state

  • PipelineScheduler — state machine driving forward→backward→update across exec() boundaries
  • CompileBudget — tracks compilations used/remaining with headroom
  • MmapState — memory-mapped file for all tensor state (weights, adam, gradients, activations) that survives exec() restarts
  • pipeline_exec_restart() / pipeline_check_resume() — save/restore scheduler state across process boundaries
  • Typed accessors: mmap_layer_weights(), mmap_layer_adam(), mmap_layer_grads(), mmap_layer_acts()

gradient_checkpoint.h — Activation checkpointing

  • Policies: ALL (current behavior), BOUNDARY (group edges only), SQRT (every √N), NONE
  • Recompute tracking: knows which layers need forward re-run during backward
  • Memory savings estimation

train_pipeline.m — Entry point with dry-run mode

  • Composes the three headers into a working pipeline trainer
  • Dry-run mode prints the full execution plan, simulates the scheduler state machine
  • Run with --model llama7b --checkpoint sqrt to see what a 32-layer pipeline looks like
  • ANE_LIVE compile flag for wiring up real kernel compilation (future work)

Makefile

  • make train_pipeline (dry-run/planning mode)
  • make train_pipeline_live (with ANE kernel compilation)

Initiated by @dermitchell1993

New files:
- model_config.h: Parameterized model config with presets (Stories42M/110M, LLaMA-1B/7B),
  pipeline planning (compute_pipeline_plan), memory/FLOP estimation
- pipeline.h: Layer-group scheduler (PipelineScheduler state machine),
  compile budget tracking, mmap-based cross-exec() shared tensor state,
  exec() restart with automatic resume
- gradient_checkpoint.h: Activation checkpointing policies (ALL/BOUNDARY/SQRT/NONE),
  recompute tracking, memory savings estimation
- train_pipeline.m: Entry point with dry-run simulation mode -- prints full execution
  plan for any model config, simulates scheduler state machine
- Makefile: train_pipeline and train_pipeline_live targets

All additive -- existing train_large.m untouched.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>

codegen-sh bot commented Mar 2, 2026

Overall, this PR looks solid! It introduces gradient checkpointing and pipeline training features, which should help with training larger models on ANE hardware. The code is well-structured with good comments and modular design.

Strengths:

  • Clear separation of concerns between config, checkpointing, and pipeline scheduling
  • Nice use of enums and structs for readability
  • Memory management looks proper with init/free functions
  • FLOP estimations and pretty-printing are helpful additions

Suggestions:

  • Consider adding unit tests for the checkpoint manager and pipeline scheduler
  • Might want to add error handling for mmap failures in pipeline.h
  • The compile budget headroom (10%) is a good safety measure, but could be configurable

Great work on parameterizing the model configs - this will make experimentation much easier! 🚀


train_pipeline_live: train_pipeline.m $(HEADERS_PIPELINE) $(HEADERS_LARGE)
$(CC) $(CFLAGS) -DANE_LIVE -o train_pipeline train_pipeline.m $(LDFLAGS) -framework Accelerate


Nice addition of the train_pipeline targets! This will make it easier to experiment with pipelining. Consider adding a brief comment in the Makefile explaining the difference between train_pipeline and train_pipeline_live.

// Save only the input to each layer group
for (int g = 0; g < plan->n_groups; g++) {
    cm.is_saved[plan->groups[g].start_layer] = true;
}

Solid implementation of checkpoint policies. The use of sqrt for interval in CKPT_SQRT is clever for balancing memory and compute. Consider adding a reference or comment explaining the optimality proof for this strategy.

…tests

- model_config.h: Added headroom_pct field to CompileConfig, used in
  max_layers_per_compile() with validation (falls back to 10% for invalid
  values). All presets include default. --headroom CLI flag added.
- pipeline.h: Tightened mmap error handling — calloc checks, size
  validation in mmap_state_open (file size vs header, truncation
  detection), sentinel/version in error message, msync/munmap return
  checks in close.
- test_pipeline_unit.c: 23 unit tests for model_config, pipeline
  planning, gradient checkpoint, and FLOP estimation. Pure C, no ANE
  dependency. All passing.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>

codegen-sh bot commented Mar 3, 2026

All three review items addressed cleanly. 🎯

Configurable headroom — The range guard with silent fallback to 10% is the right defensive call. Tests nail the math: 5% → 18 layers, 20% → 15, invalid inputs → default. CLI --headroom flag rounds it out.

Hardened mmap — Full lifecycle coverage: calloc failure in create, sentinel/version/truncation checks in open, msync/munmap return codes in close. Exactly the kind of error surfaces that matter when you're exec()-restarting across process boundaries.

Unit tests — 23 tests, pure C, no ANE dependency. Edge cases like single-layer, exact-fit, one-over-budget, out-of-bounds checkpoint queries — this is the kind of coverage that lets you refactor fearlessly later.

Two non-blocking observations for future reference:

  • headroom_pct = 0.0 is excluded by the > 0.0f guard, so zero-headroom is impossible (probably intentional — 0% headroom on ANE's ~119 budget seems risky). Just worth documenting that in a comment eventually so someone doesn't wonder why "no headroom" doesn't work.
  • The typed mmap accessors (mmap_layer_weights etc.) reconstruct a ModelDims from header fields on every call. Not a perf concern now, but if these get called in hot loops later, caching a ModelDims in MmapState would clean that up.

This is solid work. 👏


codegen-sh bot commented Mar 3, 2026

Deep Review — Commit e0b1b27 (Initial Pipeline Scaffolding)

Went through all 1,279 lines with a fine-toothed comb. Here's what I found.


🔴 Bugs

1. CKPT_BOUNDARY overcounts n_checkpointed for single-layer models

When n_layers=1, n_groups=1. The code saves groups[0].start_layer (layer 0) and then is_saved[n_layers-1] (also layer 0). n_checkpointed is set to n_groups + 1 = 2, but only 1 layer exists. This inflates memory estimates.

2. CKPT_SQRT and CKPT_EVERY_N undercount n_checkpointed by 1

Both policies iterate i += interval and then unconditionally save the last layer. But n_checkpointed is calculated as (n_layers + interval - 1) / interval — which doesn't account for the extra last-layer save when it doesn't land on an interval boundary.

Example: 32 layers, interval=5. Loop saves: 0, 5, 10, 15, 20, 25, 30 (7 layers). Then is_saved[31] = true adds an 8th. But (32+4)/5 = 7. Memory estimates are off by one layer's worth of activation bytes.

3. mmap_state_create — no NULL check on calloc (fixed in commit 2)

If calloc(1, sizeof(MmapState)) fails after a successful mmap, you'd dereference NULL on the next line.

4. mmap_state_open — missing validations (fixed in commit 2)

  • No check that file size ≥ sizeof(MmapHeader) before reading sentinel/version
  • No calloc return check
  • No truncation detection (st.st_size vs h->total_size)

5. mmap_state_close — no guard on ms->base (fixed in commit 2)

If base is somehow NULL or MAP_FAILED, msync/munmap would crash or return errors silently.


🟡 Inconsistencies

6. Headroom calculation mismatch between CompileBudget and max_layers_per_compile

budget_init computes headroom as max_compiles / 10 (integer: 119/10 = 11, so usable = 108).

max_layers_per_compile computes (int)(compile_budget * 0.9) = (int)(107.1) = 107 usable.

So the pipeline planner thinks 107 kernels fit, but the runtime budget tracker allows 108. Off-by-one. Not dangerous (planner is more conservative), but it means budget_can_fit would say "yes" to a group the planner wouldn't have created. In practice this can't trigger since groups are sized by the planner.

7. total_model_bytes vs mmap_compute_size — scope mismatch

total_model_bytes omits embed/rms_final gradient storage. mmap_compute_size includes it. So the "Total model state" printed by pipeline_plan_print is smaller than the actual mmap file. Not wrong per se (one is "model state", other is "training state"), but could confuse someone comparing the numbers.


🟡 Design Concerns

8. No input validation in model_dims_init

d->head_dim = d->dim / d->n_heads — divide by zero if someone passes --heads 0. The CLI uses atoi, which returns 0 for non-numeric input. A one-line guard (if (d->n_heads <= 0) d->n_heads = 1; or early exit) would prevent a crash.

9. No bounds checking in mmap typed accessors

mmap_layer_weights(ms, -1) or mmap_layer_weights(ms, 999) would silently return a garbage pointer into (or past) the mmap region. Should at minimum assert(layer >= 0 && layer < ms->header->n_layers).

10. CKPT_EVERY_N interval is not configurable via the API

The comment says "Caller should set cm.interval before using" but checkpoint_init hardcodes cm.interval = 4 and then immediately uses it in the loop. There's no way to pass a custom interval. Either accept interval as a parameter, or set it after init and re-run the loop.

11. PipelinePlan leak in pipeline_scheduler_init

compute_pipeline_plan callocs plan.groups and stores it in the scheduler. The scheduler is typically stack-allocated. When the program exits (or exec restarts), this leaks. Harmless in practice (OS reclaims), but if you ever want to re-init a scheduler in a loop, you'd accumulate leaks.


✅ What's Solid

  • Faithful translation of the hardcoded config: layer_weight_floats exactly matches stories_config.h's LAYER_PARAMS. The struct-based approach is a clean upgrade.
  • Scheduler state machine: The forward→backward→update flow with compile budget checks and exec restart points is well-designed. Recursive transition in pipeline_next_action is clean (bounded to depth 1).
  • Mmap layout: Page-aligned header, contiguous per-layer regions, typed accessors. The sentinel/version pattern is good for catching stale files across code changes.
  • FLOP estimation: The forward/backward/SDPA breakdown is reasonable. Separating ANE FLOPs (excluding CPU-side dW) is a nice touch for predicting actual wall-clock on the Neural Engine.
  • Dry-run simulation: Being able to ./train_pipeline --model llama7b --checkpoint sqrt and see the full execution trace without touching ANE is excellent for development.
  • Line ending cleanup: CR+LF → LF normalization in the Makefile. Silent but appreciated.

Summary

Items 1-2 (checkpoint counting bugs) are the only ones I'd call worth fixing that weren't addressed in commit 2. Everything else is either already fixed, minor, or a future-consideration. The architecture is sound — this is genuinely good scaffolding.

cm.n_checkpointed = plan->n_groups + 1;
break;

case CKPT_SQRT: {

Bug: n_checkpointed = plan->n_groups + 1 overcounts when the last layer IS a group start layer (e.g., n_layers=1). In that case is_saved[0] gets set twice but n_checkpointed = 2 when only 1 layer exists.

Fix: count the actual set bits, or guard with if (cm.n_layers - 1 != plan->groups[plan->n_groups - 1].start_layer) before adding 1.

cm.is_saved[cm.n_layers - 1] = true;
cm.n_checkpointed = (cm.n_layers + interval - 1) / interval;
break;
}

Bug: n_checkpointed is calculated as (n_layers + interval - 1) / interval which counts the loop iterations — but the is_saved[cm.n_layers - 1] = true on the next line may add one more if the last layer isn't on an interval boundary.

Example with 32 layers, interval=5: loop saves 0,5,10,15,20,25,30 → 7 layers. Then layer 31 is saved → 8 total. But (32+4)/5 = 7.

Same issue applies to CKPT_EVERY_N below.

Simplest fix: after the switch, count the actual saved layers:

cm.n_checkpointed = 0;
for (int i = 0; i < cm.n_layers; i++) 
    if (cm.is_saved[i]) cm.n_checkpointed++;

This would also fix the CKPT_BOUNDARY issue.


static void model_dims_init(ModelDims *d) {
    d->head_dim = d->dim / d->n_heads;
    d->kv_dim = d->head_dim * d->n_kv_heads;

Divide-by-zero risk: if someone passes --heads 0 (or a non-numeric string, since atoi returns 0), this line crashes.

Suggestion:

static void model_dims_init(ModelDims *d) {
    if (d->n_heads <= 0) { fprintf(stderr, "model_dims_init: n_heads must be > 0\n"); exit(1); }
    d->head_dim = d->dim / d->n_heads;
    ...
}


// Compute how many layers can fit in one compile batch
static int max_layers_per_compile(const CompileConfig *cc) {
    float headroom = (cc->headroom_pct > 0.0f && cc->headroom_pct < 1.0f)

Off-by-one inconsistency with budget_init in pipeline.h.

Here: usable = (int)(compile_budget * 0.9) → for 119: (int)(107.1) = 107.
In budget_init: headroom = max_compiles / 10 = 11, so usable = 119 - 11 = 108.

The planner sizes groups assuming 107 usable kernels, but the runtime budget tracker allows 108. Not dangerous (groups are pre-planned so this extra slot is never used), but the intent should be unified — either both use integer division or both use float.

case CKPT_EVERY_N:
    // Caller should set cm.interval before using
    cm.interval = 4; // default
    for (int i = 0; i < cm.n_layers; i += cm.interval) cm.is_saved[i] = true;

The comment says "Caller should set cm.interval before using" but the function immediately hardcodes cm.interval = 4 and uses it in the loop — so there's no way for the caller to set a custom interval. The API needs one of:

  1. Accept interval as a parameter to checkpoint_init
  2. Split into checkpoint_init (sets policy) + checkpoint_configure (sets interval and builds the is_saved array)
  3. Just document that 4 is the fixed interval for CKPT_EVERY_N and update the comment

…ency, safety guards

Bug fix: n_checkpointed counting wrong in CKPT_BOUNDARY/SQRT/EVERY_N
  - Replaced per-policy arithmetic with single post-switch loop that counts
    actual is_saved bits. Eliminates edge-case miscounts when last layer
    falls on an interval boundary.

Inconsistency: headroom mismatch between planner and runtime budget
  - budget_init() now takes CompileConfig* and uses the same headroom_pct
    validation as max_layers_per_compile(). Both paths yield identical
    usable-budget calculations.

Inconsistency: total_model_bytes() omitted global gradients
  - Added rms_final_grad and embed_grad terms to match mmap_compute_size().
    Diagnostic output now agrees with actual allocation.

Design: divide-by-zero in model_dims_init() if n_heads=0
  - Guarded head_dim = dim / n_heads with n_heads > 0 check.

Design: no bounds checking in mmap typed accessors
  - All four mmap_layer_* accessors now validate layer index and return NULL
    on out-of-bounds. Extracted shared mmap_dims() helper to deduplicate
    ModelDims reconstruction.

Design: CKPT_EVERY_N interval hardcoded despite "caller should set" comment
  - Added custom_interval parameter to checkpoint_init(). Pass 0 for
    default (4), or any positive int for custom spacing.

Tests: 26/26 passing (3 new: custom interval, n_checkpointed accuracy,
zero-heads guard).

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>