
Pipeline scaffolding for multi-group ANE training#1

Draft
codegen-sh[bot] wants to merge 3 commits into main from codegen-bot/pipeline-scaffolding-a7f3e2

Conversation

@codegen-sh codegen-sh bot commented Mar 2, 2026

Scaffolding infrastructure to scale ANE training beyond the single-compile-batch limit. All additive — train_large.m is untouched.

The problem

ANE has a ~119 compilation limit per process. With 5 weight-bearing kernels + 1 static kernel per layer, the current 12-layer Stories110M uses 72 of those. Larger models (22-layer LLaMA-1B, 32-layer LLaMA-7B) blow the budget immediately. The exec() restart pattern works but needs systematic scheduling.

What this adds

model_config.h — Parameterized model configuration

  • ModelConfig struct replacing hardcoded #defines
  • Presets: stories42m, stories110m, llama1b, llama7b
  • compute_pipeline_plan() — computes optimal layer groupings given compile budget
  • Memory/FLOP estimation helpers
  • CLI arg parsing for dimensions/budget overrides

pipeline.h — Layer-group scheduler + mmap state

  • PipelineScheduler — state machine driving forward→backward→update across exec() boundaries
  • CompileBudget — tracks compilations used/remaining with headroom
  • MmapState — memory-mapped file for all tensor state (weights, adam, gradients, activations) that survives exec() restarts
  • pipeline_exec_restart() / pipeline_check_resume() — save/restore scheduler state across process boundaries
  • Typed accessors: mmap_layer_weights(), mmap_layer_adam(), mmap_layer_grads(), mmap_layer_acts()

gradient_checkpoint.h — Activation checkpointing

  • Policies: ALL (current behavior), BOUNDARY (group edges only), SQRT (every √N), NONE
  • Recompute tracking: knows which layers need forward re-run during backward
  • Memory savings estimation

train_pipeline.m — Entry point with dry-run mode

  • Composes the three headers into a working pipeline trainer
  • Dry-run mode prints the full execution plan, simulates the scheduler state machine
  • Run with --model llama7b --checkpoint sqrt to see what a 32-layer pipeline looks like
  • ANE_LIVE compile flag for wiring up real kernel compilation (future work)

Makefile

  • make train_pipeline (dry-run/planning mode)
  • make train_pipeline_live (with ANE kernel compilation)

Initiated by @dermitchell1993

New files:
- model_config.h: Parameterized model config with presets (Stories42M/110M, LLaMA-1B/7B),
  pipeline planning (compute_pipeline_plan), memory/FLOP estimation
- pipeline.h: Layer-group scheduler (PipelineScheduler state machine),
  compile budget tracking, mmap-based cross-exec() shared tensor state,
  exec() restart with automatic resume
- gradient_checkpoint.h: Activation checkpointing policies (ALL/BOUNDARY/SQRT/NONE),
  recompute tracking, memory savings estimation
- train_pipeline.m: Entry point with dry-run simulation mode -- prints full execution
  plan for any model config, simulates scheduler state machine
- Makefile: train_pipeline and train_pipeline_live targets

All additive -- existing train_large.m untouched.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>

codegen-sh bot commented Mar 2, 2026

Overall, this PR looks solid! It introduces gradient checkpointing and pipeline training features, which should help with training larger models on ANE hardware. The code is well-structured with good comments and modular design.

Strengths:

  • Clear separation of concerns between config, checkpointing, and pipeline scheduling
  • Nice use of enums and structs for readability
  • Memory management looks proper with init/free functions
  • FLOP estimations and pretty-printing are helpful additions

Suggestions:

  • Consider adding unit tests for the checkpoint manager and pipeline scheduler
  • Might want to add error handling for mmap failures in pipeline.h
  • The compile budget headroom (10%) is a good safety measure, but could be configurable

Great work on parameterizing the model configs - this will make experimentation much easier! 🚀


train_pipeline_live: train_pipeline.m $(HEADERS_PIPELINE) $(HEADERS_LARGE)
$(CC) $(CFLAGS) -DANE_LIVE -o train_pipeline train_pipeline.m $(LDFLAGS) -framework Accelerate


Nice addition of the train_pipeline targets! This will make it easier to experiment with pipelining. Consider adding a brief comment in the Makefile explaining the difference between train_pipeline and train_pipeline_live.

// Save only the input to each layer group
for (int g = 0; g < plan->n_groups; g++) {
    cm.is_saved[plan->groups[g].start_layer] = true;
}

Solid implementation of checkpoint policies. The use of sqrt for interval in CKPT_SQRT is clever for balancing memory and compute. Consider adding a reference or comment explaining the optimality proof for this strategy.

…tests

- model_config.h: Added headroom_pct field to CompileConfig, used in
  max_layers_per_compile() with validation (falls back to 10% for invalid
  values). All presets include default. --headroom CLI flag added.
- pipeline.h: Tightened mmap error handling — calloc checks, size
  validation in mmap_state_open (file size vs header, truncation
  detection), sentinel/version in error message, msync/munmap return
  checks in close.
- test_pipeline_unit.c: 23 unit tests for model_config, pipeline
  planning, gradient checkpoint, and FLOP estimation. Pure C, no ANE
  dependency. All passing.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>

codegen-sh bot commented Mar 3, 2026

All three review items addressed cleanly. 🎯

Configurable headroom — The range guard with silent fallback to 10% is the right defensive call. Tests nail the math: 5% → 18 layers, 20% → 15, invalid inputs → default. CLI --headroom flag rounds it out.

Hardened mmap — Full lifecycle coverage: calloc failure in create, sentinel/version/truncation checks in open, msync/munmap return codes in close. Exactly the kind of error surfaces that matter when you're exec()-restarting across process boundaries.

Unit tests — 23 tests, pure C, no ANE dependency. Edge cases like single-layer, exact-fit, one-over-budget, out-of-bounds checkpoint queries — this is the kind of coverage that lets you refactor fearlessly later.

Two non-blocking observations for future reference:

  • headroom_pct = 0.0 is excluded by the > 0.0f guard, so zero-headroom is impossible (probably intentional — 0% headroom on ANE's ~119 budget seems risky). Just worth documenting that in a comment eventually so someone doesn't wonder why "no headroom" doesn't work.
  • The typed mmap accessors (mmap_layer_weights etc.) reconstruct a ModelDims from header fields on every call. Not a perf concern now, but if these get called in hot loops later, caching a ModelDims in MmapState would clean that up.

This is solid work. 👏


codegen-sh bot commented Mar 3, 2026

Deep Review — Commit e0b1b27 (Initial Pipeline Scaffolding)

Went through all 1,279 lines with a fine-toothed comb. Here's what I found.


🔴 Bugs

1. CKPT_BOUNDARY overcounts n_checkpointed for single-layer models

When n_layers=1, n_groups=1. The code saves groups[0].start_layer (layer 0) and then is_saved[n_layers-1] (also layer 0). n_checkpointed is set to n_groups + 1 = 2, but only 1 layer exists. This inflates memory estimates.

2. CKPT_SQRT and CKPT_EVERY_N undercount n_checkpointed by 1

Both policies iterate i += interval and then unconditionally save the last layer. But n_checkpointed is calculated as (n_layers + interval - 1) / interval — which doesn't account for the extra last-layer save when it doesn't land on an interval boundary.

Example: 32 layers, interval=5. Loop saves: 0, 5, 10, 15, 20, 25, 30 (7 layers). Then is_saved[31] = true adds an 8th. But (32+4)/5 = 7. Memory estimates are off by one layer's worth of activation bytes.

3. mmap_state_create — no NULL check on calloc (fixed in commit 2)

If calloc(1, sizeof(MmapState)) fails after a successful mmap, you'd dereference NULL on the next line.

4. mmap_state_open — missing validations (fixed in commit 2)

  • No check that file size ≥ sizeof(MmapHeader) before reading sentinel/version
  • No calloc return check
  • No truncation detection (st.st_size vs h->total_size)

5. mmap_state_close — no guard on ms->base (fixed in commit 2)

If base is somehow NULL or MAP_FAILED, msync/munmap would crash or return errors silently.


🟡 Inconsistencies

6. Headroom calculation mismatch between CompileBudget and max_layers_per_compile

budget_init computes headroom as max_compiles / 10 (integer: 119/10 = 11, so usable = 108).

max_layers_per_compile computes (int)(compile_budget * 0.9) = (int)(107.1) = 107 usable.

So the pipeline planner thinks 107 kernels fit, but the runtime budget tracker allows 108. Off-by-one. Not dangerous (planner is more conservative), but it means budget_can_fit would say "yes" to a group the planner wouldn't have created. In practice this can't trigger since groups are sized by the planner.

7. total_model_bytes vs mmap_compute_size — scope mismatch

total_model_bytes omits embed/rms_final gradient storage. mmap_compute_size includes it. So the "Total model state" printed by pipeline_plan_print is smaller than the actual mmap file. Not wrong per se (one is "model state", other is "training state"), but could confuse someone comparing the numbers.


🟡 Design Concerns

8. No input validation in model_dims_init

d->head_dim = d->dim / d->n_heads — divide by zero if someone passes --heads 0. The CLI uses atoi, which returns 0 for non-numeric input. A one-line guard (if (d->n_heads <= 0) d->n_heads = 1; or early exit) would prevent a crash.

9. No bounds checking in mmap typed accessors

mmap_layer_weights(ms, -1) or mmap_layer_weights(ms, 999) would silently return a garbage pointer into (or past) the mmap region. Should at minimum assert(layer >= 0 && layer < ms->header->n_layers).

10. CKPT_EVERY_N interval is not configurable via the API

The comment says "Caller should set cm.interval before using" but checkpoint_init hardcodes cm.interval = 4 and then immediately uses it in the loop. There's no way to pass a custom interval. Either accept interval as a parameter, or set it after init and re-run the loop.

11. PipelinePlan leak in pipeline_scheduler_init

compute_pipeline_plan callocs plan.groups and stores it in the scheduler. The scheduler is typically stack-allocated. When the program exits (or exec restarts), this leaks. Harmless in practice (OS reclaims), but if you ever want to re-init a scheduler in a loop, you'd accumulate leaks.


✅ What's Solid

  • Faithful translation of the hardcoded config: layer_weight_floats exactly matches stories_config.h's LAYER_PARAMS. The struct-based approach is a clean upgrade.
  • Scheduler state machine: The forward→backward→update flow with compile budget checks and exec restart points is well-designed. Recursive transition in pipeline_next_action is clean (bounded to depth 1).
  • Mmap layout: Page-aligned header, contiguous per-layer regions, typed accessors. The sentinel/version pattern is good for catching stale files across code changes.
  • FLOP estimation: The forward/backward/SDPA breakdown is reasonable. Separating ANE FLOPs (excluding CPU-side dW) is a nice touch for predicting actual wall-clock on the Neural Engine.
  • Dry-run simulation: Being able to ./train_pipeline --model llama7b --checkpoint sqrt and see the full execution trace without touching ANE is excellent for development.
  • Line ending cleanup: CR+LF → LF normalization in the Makefile. Silent but appreciated.

Summary

Items 1-2 (checkpoint counting bugs) are the only ones I'd call worth fixing that weren't addressed in commit 2. Everything else is either already fixed, minor, or a future-consideration. The architecture is sound — this is genuinely good scaffolding.

cm.n_checkpointed = plan->n_groups + 1;
break;

case CKPT_SQRT: {

Bug: n_checkpointed = plan->n_groups + 1 overcounts when the last layer IS a group start layer (e.g., n_layers=1). In that case is_saved[0] gets set twice but n_checkpointed = 2 when only 1 layer exists.

Fix: count the actual set bits, or guard with if (cm.n_layers - 1 != plan->groups[plan->n_groups - 1].start_layer) before adding 1.

cm.is_saved[cm.n_layers - 1] = true;
cm.n_checkpointed = (cm.n_layers + interval - 1) / interval;
break;
}

Bug: n_checkpointed is calculated as (n_layers + interval - 1) / interval which counts the loop iterations — but the is_saved[cm.n_layers - 1] = true on the next line may add one more if the last layer isn't on an interval boundary.

Example with 32 layers, interval=5: loop saves 0,5,10,15,20,25,30 → 7 layers. Then layer 31 is saved → 8 total. But (32+4)/5 = 7.

Same issue applies to CKPT_EVERY_N below.

Simplest fix: after the switch, count the actual saved layers:

cm.n_checkpointed = 0;
for (int i = 0; i < cm.n_layers; i++) 
    if (cm.is_saved[i]) cm.n_checkpointed++;

This would also fix the CKPT_BOUNDARY issue.


static void model_dims_init(ModelDims *d) {
    d->head_dim = d->dim / d->n_heads;
    d->kv_dim = d->head_dim * d->n_kv_heads;

Divide-by-zero risk: if someone passes --heads 0 (or a non-numeric string, since atoi returns 0), this line crashes.

Suggestion:

static void model_dims_init(ModelDims *d) {
    if (d->n_heads <= 0) { fprintf(stderr, "model_dims_init: n_heads must be > 0\n"); exit(1); }
    d->head_dim = d->dim / d->n_heads;
    ...
}


// Compute how many layers can fit in one compile batch
static int max_layers_per_compile(const CompileConfig *cc) {
    float headroom = (cc->headroom_pct > 0.0f && cc->headroom_pct < 1.0f)

Off-by-one inconsistency with budget_init in pipeline.h.

Here: usable = (int)(compile_budget * 0.9) → for 119: (int)(107.1) = 107.
In budget_init: headroom = max_compiles / 10 = 11, so usable = 119 - 11 = 108.

The planner sizes groups assuming 107 usable kernels, but the runtime budget tracker allows 108. Not dangerous (groups are pre-planned so this extra slot is never used), but the intent should be unified — either both use integer division or both use float.

case CKPT_EVERY_N:
    // Caller should set cm.interval before using
    cm.interval = 4; // default
    for (int i = 0; i < cm.n_layers; i += cm.interval) cm.is_saved[i] = true;

The comment says "Caller should set cm.interval before using" but the function immediately hardcodes cm.interval = 4 and uses it in the loop — so there's no way for the caller to set a custom interval. The API needs one of:

  1. Accept interval as a parameter to checkpoint_init
  2. Split into checkpoint_init (sets policy) + checkpoint_configure (sets interval and builds the is_saved array)
  3. Just document that 4 is the fixed interval for CKPT_EVERY_N and update the comment

…ency, safety guards

Bug fix: n_checkpointed counting wrong in CKPT_BOUNDARY/SQRT/EVERY_N
  - Replaced per-policy arithmetic with single post-switch loop that counts
    actual is_saved bits. Eliminates edge-case miscounts when last layer
    falls on an interval boundary.

Inconsistency: headroom mismatch between planner and runtime budget
  - budget_init() now takes CompileConfig* and uses the same headroom_pct
    validation as max_layers_per_compile(). Both paths yield identical
    usable-budget calculations.

Inconsistency: total_model_bytes() omitted global gradients
  - Added rms_final_grad and embed_grad terms to match mmap_compute_size().
    Diagnostic output now agrees with actual allocation.

Design: divide-by-zero in model_dims_init() if n_heads=0
  - Guarded head_dim = dim / n_heads with n_heads > 0 check.

Design: no bounds checking in mmap typed accessors
  - All four mmap_layer_* accessors now validate layer index and return NULL
    on out-of-bounds. Extracted shared mmap_dims() helper to deduplicate
    ModelDims reconstruction.

Design: CKPT_EVERY_N interval hardcoded despite "caller should set" comment
  - Added custom_interval parameter to checkpoint_init(). Pass 0 for
    default (4), or any positive int for custom spacing.

Tests: 26/26 passing (3 new: custom interval, n_checkpointed accuracy,
zero-heads guard).

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>