Pipeline scaffolding for multi-group ANE training #1

codegen-sh[bot] wants to merge 3 commits into `main`
Conversation
New files:

- model_config.h: Parameterized model config with presets (Stories42M/110M, LLaMA-1B/7B), pipeline planning (compute_pipeline_plan), memory/FLOP estimation
- pipeline.h: Layer-group scheduler (PipelineScheduler state machine), compile budget tracking, mmap-based cross-exec() shared tensor state, exec() restart with automatic resume
- gradient_checkpoint.h: Activation checkpointing policies (ALL/BOUNDARY/SQRT/NONE), recompute tracking, memory savings estimation
- train_pipeline.m: Entry point with dry-run simulation mode — prints full execution plan for any model config, simulates scheduler state machine
- Makefile: train_pipeline and train_pipeline_live targets

All additive — existing train_large.m untouched.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
Overall, this PR looks solid! It introduces gradient checkpointing and pipeline training features, which should help with training larger models on ANE hardware. The code is well-structured with good comments and modular design. Strengths:
Suggestions:
Great work on parameterizing the model configs - this will make experimentation much easier! 🚀
```makefile
train_pipeline_live: train_pipeline.m $(HEADERS_PIPELINE) $(HEADERS_LARGE)
	$(CC) $(CFLAGS) -DANE_LIVE -o train_pipeline train_pipeline.m $(LDFLAGS) -framework Accelerate
```
Nice addition of the train_pipeline targets! This will make it easier to experiment with pipelining. Consider adding a brief comment in the Makefile explaining the difference between train_pipeline and train_pipeline_live.
```c
// Save only the input to each layer group
for (int g = 0; g < plan->n_groups; g++) {
    cm.is_saved[plan->groups[g].start_layer] = true;
}
```
Solid implementation of checkpoint policies. The use of √N for the interval in `CKPT_SQRT` is clever for balancing memory and compute. Consider adding a reference or comment explaining the optimality proof for this strategy.
…tests

- model_config.h: Added headroom_pct field to CompileConfig, used in max_layers_per_compile() with validation (falls back to 10% for invalid values). All presets include default. --headroom CLI flag added.
- pipeline.h: Tightened mmap error handling — calloc checks, size validation in mmap_state_open (file size vs header, truncation detection), sentinel/version in error message, msync/munmap return checks in close.
- test_pipeline_unit.c: 23 unit tests for model_config, pipeline planning, gradient checkpoint, and FLOP estimation. Pure C, no ANE dependency. All passing.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
All three review items addressed cleanly. 🎯

- Configurable headroom — The range guard with silent fallback to 10% is the right defensive call. Tests nail the math: 5% → 18 layers, 20% → 15, invalid inputs → default. CLI
- Hardened mmap — Full lifecycle coverage: calloc failure in create, sentinel/version/truncation checks in open, msync/munmap return codes in close. Exactly the kind of error surface that matters when you're exec()-restarting across process boundaries.
- Unit tests — 23 tests, pure C, no ANE dependency. Edge cases like single-layer, exact-fit, one-over-budget, out-of-bounds checkpoint queries — this is the kind of coverage that lets you refactor fearlessly later.

Two non-blocking observations for future reference:
This is solid work. 👏
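The sentinel/version/truncation checks praised above can be sketched as an open-time validation of the mapped file's header. All names, field layout, and the magic constant below are illustrative assumptions, not the PR's actual `MmapState` layout:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of open-time validation for an mmap'd state file: sentinel,
 * version, and size checks before any tensor pointer is handed out.
 * STATE_SENTINEL/StateHeader are hypothetical, not the PR's layout. */
#define STATE_SENTINEL 0x414E4554u  /* arbitrary magic */
#define STATE_VERSION  1u

typedef struct {
    uint32_t sentinel;
    uint32_t version;
    uint64_t total_bytes;  /* expected file size, header included */
} StateHeader;

static bool state_header_valid(const StateHeader *h, size_t file_size) {
    if (file_size < sizeof *h) return false;                  /* truncated header */
    if (h->sentinel != STATE_SENTINEL) return false;          /* not our file */
    if (h->version != STATE_VERSION) return false;            /* layout changed */
    if ((uint64_t)file_size != h->total_bytes) return false;  /* truncated body */
    return true;
}
```

Rejecting a bad header up front is what keeps an exec()-restarted process from dereferencing into a half-written or stale mapping.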
Deep Review — Commit
```c
            cm.n_checkpointed = plan->n_groups + 1;
            break;
        /* ... */
        case CKPT_SQRT: {
```
Bug: `n_checkpointed = plan->n_groups + 1` overcounts when the last layer is itself a group start layer (e.g., `n_layers = 1`). In that case `is_saved[0]` gets set twice but `n_checkpointed = 2` when only 1 layer exists.
Fix: count the actual set bits, or guard with `if (cm.n_layers - 1 != plan->groups[plan->n_groups - 1].start_layer)` before adding 1.
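A minimal repro of the overcount, with the structure simplified from the PR (`group_starts[]` stands in for `plan->groups[g].start_layer`; function names are illustrative):

```c
#include <stdbool.h>

/* Marks group starts plus the last layer; returns the buggy
 * `n_groups + 1` figure for comparison against the true bit count. */
static int boundary_mark(bool *is_saved, int n_layers,
                         const int *group_starts, int n_groups) {
    for (int g = 0; g < n_groups; g++) is_saved[group_starts[g]] = true;
    is_saved[n_layers - 1] = true;   /* may re-mark a group start */
    return n_groups + 1;             /* the buggy arithmetic */
}

/* The suggested fix: count the bits actually set. */
static int count_saved_bits(const bool *is_saved, int n_layers) {
    int n = 0;
    for (int i = 0; i < n_layers; i++) if (is_saved[i]) n++;
    return n;
}
```

With `n_layers = 1` and a single group starting at layer 0, `boundary_mark` reports 2 while only one activation is actually saved.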
```c
            cm.is_saved[cm.n_layers - 1] = true;
            cm.n_checkpointed = (cm.n_layers + interval - 1) / interval;
            break;
        }
```
Bug: `n_checkpointed` is calculated as `(n_layers + interval - 1) / interval`, which counts the loop iterations — but the `is_saved[cm.n_layers - 1] = true` on the next line may add one more if the last layer isn't on an interval boundary.
Example with 32 layers, interval = 5: the loop saves layers 0, 5, 10, 15, 20, 25, 30 → 7 layers. Then layer 31 is saved → 8 total. But `(32 + 4) / 5 = 7`.
The same issue applies to `CKPT_EVERY_N` below.
Simplest fix: after the switch, count the actual saved layers:

```c
cm.n_checkpointed = 0;
for (int i = 0; i < cm.n_layers; i++)
    if (cm.is_saved[i]) cm.n_checkpointed++;
```

This would also fix the CKPT_BOUNDARY issue.
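The 32-layer, interval-5 example above can be checked directly. This is a self-contained sketch (the `CheckpointMap` struct is reduced to a local array; `interval_mark_count` is an illustrative name):

```c
#include <stdbool.h>
#include <string.h>

/* Marks every `interval`-th layer plus the last, then returns the true
 * bit count; *formula_out gets the buggy ceil-division figure. */
static int interval_mark_count(int n_layers, int interval, int *formula_out) {
    bool is_saved[64];
    memset(is_saved, 0, sizeof is_saved);
    for (int i = 0; i < n_layers; i += interval) is_saved[i] = true;
    is_saved[n_layers - 1] = true;  /* extra save off the interval grid */
    *formula_out = (n_layers + interval - 1) / interval;
    int n = 0;
    for (int i = 0; i < n_layers; i++) if (is_saved[i]) n++;
    return n;
}
```

For 32 layers and interval 5 this yields 8 saved layers against a formula value of 7, matching the review's arithmetic.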
```c
static void model_dims_init(ModelDims *d) {
    d->head_dim = d->dim / d->n_heads;
    d->kv_dim = d->head_dim * d->n_kv_heads;
```
Divide-by-zero risk: if someone passes `--heads 0` (or a non-numeric string, since `atoi` returns 0), this line crashes.
Suggestion:

```c
static void model_dims_init(ModelDims *d) {
    if (d->n_heads <= 0) {
        fprintf(stderr, "model_dims_init: n_heads must be > 0\n");
        exit(1);
    }
    d->head_dim = d->dim / d->n_heads;
    /* ... */
}
```
```c
// Compute how many layers can fit in one compile batch
static int max_layers_per_compile(const CompileConfig *cc) {
    float headroom = (cc->headroom_pct > 0.0f && cc->headroom_pct < 1.0f)
```
Off-by-one inconsistency with `budget_init` in pipeline.h.
Here: `usable = (int)(compile_budget * 0.9)` → for 119: `(int)107.1 = 107`.
In `budget_init`: `headroom = max_compiles / 10 = 11`, so `usable = 119 - 11 = 108`.
The planner sizes groups assuming 107 usable kernels, but the runtime budget tracker allows 108. Not dangerous (groups are pre-planned, so the extra slot is never used), but the intent should be unified — either both use integer division or both use float.
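The divergence is easy to see side by side. A simplified sketch (the real functions take config structs; these names are illustrative):

```c
/* Float path, as in max_layers_per_compile: 119 * 0.9 = 107.1,
 * truncated to 107. */
static int planner_usable(int budget) {
    return (int)(budget * 0.9f);
}

/* Integer path, as in budget_init: 119 - 119/10 = 119 - 11 = 108. */
static int runtime_usable(int budget) {
    return budget - budget / 10;
}
```

`planner_usable(119)` gives 107 while `runtime_usable(119)` gives 108; picking one form for both call sites removes the mismatch.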
```c
case CKPT_EVERY_N:
    // Caller should set cm.interval before using
    cm.interval = 4; // default
    for (int i = 0; i < cm.n_layers; i += cm.interval) cm.is_saved[i] = true;
```
The comment says "Caller should set cm.interval before using" but the function immediately hardcodes `cm.interval = 4` and uses it in the loop — so there's no way for the caller to set a custom interval. The API needs one of:

- Accept `interval` as a parameter to `checkpoint_init`
- Split into `checkpoint_init` (sets policy) + `checkpoint_configure` (sets interval and builds the `is_saved` array)
- Just document that 4 is the fixed interval for `CKPT_EVERY_N` and update the comment
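The first option can be sketched as below. The later commit in this PR adds a `custom_interval` parameter with 0 meaning "use the default of 4"; the struct and function name here are simplified stand-ins, not the PR's actual API:

```c
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for the PR's checkpoint map. */
typedef struct {
    int interval;
    bool is_saved[64];
} CheckpointSketch;

/* Threads the interval through the init call; 0 selects the default. */
static void checkpoint_init_every_n(CheckpointSketch *cm, int n_layers,
                                    int custom_interval) {
    memset(cm, 0, sizeof *cm);
    cm->interval = (custom_interval > 0) ? custom_interval : 4;
    for (int i = 0; i < n_layers; i += cm->interval)
        cm->is_saved[i] = true;
}
```

This keeps the one-call init shape while making the comment true: the caller really can set the interval.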
…ency, safety guards
Bug fix: n_checkpointed counting wrong in CKPT_BOUNDARY/SQRT/EVERY_N
- Replaced per-policy arithmetic with single post-switch loop that counts
actual is_saved bits. Eliminates edge-case miscounts when last layer
falls on an interval boundary.
Inconsistency: headroom mismatch between planner and runtime budget
- budget_init() now takes CompileConfig* and uses the same headroom_pct
validation as max_layers_per_compile(). Both paths yield identical
usable-budget calculations.
Inconsistency: total_model_bytes() omitted global gradients
- Added rms_final_grad and embed_grad terms to match mmap_compute_size().
Diagnostic output now agrees with actual allocation.
Design: divide-by-zero in model_dims_init() if n_heads=0
- Guarded head_dim = dim / n_heads with n_heads > 0 check.
Design: no bounds checking in mmap typed accessors
- All four mmap_layer_* accessors now validate layer index and return NULL
on out-of-bounds. Extracted shared mmap_dims() helper to deduplicate
ModelDims reconstruction.
Design: CKPT_EVERY_N interval hardcoded despite "caller should set" comment
- Added custom_interval parameter to checkpoint_init(). Pass 0 for
default (4), or any positive int for custom spacing.
Tests: 26/26 passing (3 new: custom interval, n_checkpointed accuracy,
zero-heads guard).
Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
Scaffolding infrastructure to scale ANE training beyond the single-compile-batch limit. All additive — `train_large.m` is untouched.

The problem

ANE has a ~119-compilation limit per process. With 5 weight-bearing kernels per layer + 1 static, the current 12-layer Stories110M uses 72 of those. Larger models (22-layer LLaMA-1B, 32-layer LLaMA-7B) blow the budget immediately. The `exec()` restart pattern works but needs systematic scheduling.

What this adds
`model_config.h` — Parameterized model configuration

- `ModelConfig` struct replacing hardcoded `#define`s
- Presets: `stories42m`, `stories110m`, `llama1b`, `llama7b`
- `compute_pipeline_plan()` — computes optimal layer groupings given compile budget

`pipeline.h` — Layer-group scheduler + mmap state

- `PipelineScheduler` — state machine driving forward→backward→update across `exec()` boundaries
- `CompileBudget` — tracks compilations used/remaining with headroom
- `MmapState` — memory-mapped file for all tensor state (weights, adam, gradients, activations) that survives `exec()` restarts
- `pipeline_exec_restart()` / `pipeline_check_resume()` — save/restore scheduler state across process boundaries
- Typed accessors: `mmap_layer_weights()`, `mmap_layer_adam()`, `mmap_layer_grads()`, `mmap_layer_acts()`

`gradient_checkpoint.h` — Activation checkpointing

- Policies: `ALL` (current behavior), `BOUNDARY` (group edges only), `SQRT` (every √N), `NONE`

`train_pipeline.m` — Entry point with dry-run mode

- `--model llama7b --checkpoint sqrt` to see what a 32-layer pipeline looks like
- `ANE_LIVE` compile flag for wiring up real kernel compilation (future work)

`Makefile`

- `make train_pipeline` (dry-run/planning mode)
- `make train_pipeline_live` (with ANE kernel compilation)
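The budget arithmetic in "The problem" above can be sanity-checked in a few lines. The six-compilations-per-layer figure is inferred here from 72 used by the 12-layer model; treat it as an assumption, not a measured count:

```c
#include <stdbool.h>

/* Back-of-envelope compile-budget check. COMPILES_PER_LAYER = 6 is
 * inferred from 72 / 12 layers (an assumption, not measured). */
enum { ANE_COMPILE_BUDGET = 119, COMPILES_PER_LAYER = 6 };

static int compiles_needed(int n_layers) {
    return n_layers * COMPILES_PER_LAYER;
}

static bool fits_in_one_process(int n_layers) {
    return compiles_needed(n_layers) <= ANE_COMPILE_BUDGET;
}
```

Under that assumption, 12 layers need 72 compilations and fit; 22-layer LLaMA-1B needs 132 and 32-layer LLaMA-7B needs 192, both over the 119 budget — hence the exec()-restart scheduling this PR scaffolds.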