Rebase pipeline scaffolding onto updated main#9
codegen-sh[bot] wants to merge 30 commits into codegen-bot/pipeline-scaffolding-a7f3e2
Conversation
Weave in a scope notice near the top covering project intent, what it is and isn't, hype clarification, maintenance expectations, and fork encouragement. Consolidate the private API disclaimer with the existing disclaimer section to avoid duplication. https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv
…tice-EL9sS Add Project Scope & Intent notice to README
…offload (16% faster) Bridge+Memory leak fix+More functions
Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps bottleneck. Weights are passed via an IOSurface spatial dimension instead of baked in as constants, so kernels compile once at startup (345ms) and run indefinitely without exec() restart.

Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
- 9 dynamic kernels shared across all 12 layers
- Vocab compaction 32K→9.2K for a faster classifier
- Vectorized cross-entropy with vDSP/NEON
- Adam optimizer with gradient clipping + cosine LR schedule
- Checkpoint save/resume
- test_dynamic_matmul.m — validates dynamic-weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface
- dashboard.py — updated with --dynamic flag for v2 pipeline support, improved step regex parsing, --scratch/--lr/--accum CLI args

Performance: 110ms/step steady-state (no recompile overhead); ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms
- Fix positional arg parsing (model_path, steps, lr were silently ignored)
- Add --model, --ckpt flags; forward ckpt_path across exec() restarts
- Add --no-ane-extras to disable ANE classifier/softmax/rmsnorm_bwd
- CPU fallback for softmax/classifier/rmsnorm_bwd when extras are disabled
- Update README with 4-way benchmark comparison table (20 steps)
- Parse static pipeline JSON step/batch/perf lines for real-time updates
- Running elapsed time, ms/step from wall-clock timestamps, steps/sec
- Compute ANE + Total TFLOPS from FLOPs/step when not reported directly
- Support --ane (train_large_ane) and --no-ane-extras flags
- Dynamic pipeline timing breakdown + CKPT_PATH per mode
… MIL pipeline

[MLModel compileModelAtURL:] fails on macOS 26, breaking inmem_bench, sram_bench, and sram_probe. This switches all three to generate MIL text and weight blobs programmatically in memory (matching the working inmem_peak.m approach), bypassing CoreML disk compilation entirely.
- inmem_bench.m: replace CoreML compile + file read with genMIL/buildWeightBlob
- sram_bench.m: switch from the _ANEClient/_ANEModel to the _ANEInMemoryModel API
- sram_probe.m: same _ANEClient → _ANEInMemoryModel conversion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Validate all fread() return values in model_load_weights (model.h)
- Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m)
- Log error details on ANE eval failure (ane_runtime.h)
- Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h)
- Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward
- Atomic checkpoint writes via tmp+rename pattern (tiny_train.m)
- Non-destructive recompile: compile new kernels first, swap only on success (model.h)
- Validate fread() in load_checkpoint (tiny_train.m)
Updated README to reflect project scope, architecture, and limitations.
…ort-dataset-underflow-fix Fix token sampling underflow for short token datasets
Fix docs: add training data download instructions
Optimize dashboard and prevent sudo hang when password needed
…hmarks Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory MIL
…ta-paths Fix hardcoded TinyStories data path in train_large/train_large_ane
…ctness fix: correctness and safety improvements for training
Follow-up to PR maderix#31 — assert() aborts on bad tokens, which is too harsh for training. Skip bad tokens with a warning instead.
Community-submitted results for M1 Pro/Max, M3 Pro, M4 Pro/Max, M5. Includes training performance, peak throughput, MIL compatibility matrix, and structured JSON data.
All chips have 16 NE cores except the Ultra (32 via UltraFusion). The M4's 38 TOPS figure is INT8/mixed-precision and not comparable to the M3's FP16 spec.
Benchmark report now includes full Stories110M model configuration (arch, layers, dims, kernels). README updated: 12-layer results replace stale single-layer numbers, limitations reflect current state.
New files:
- model_config.h: parameterized model config with presets (Stories42M/110M, LLaMA-1B/7B), pipeline planning (compute_pipeline_plan), memory/FLOP estimation
- pipeline.h: layer-group scheduler (PipelineScheduler state machine), compile budget tracking, mmap-based cross-exec() shared tensor state, exec() restart with automatic resume
- gradient_checkpoint.h: activation checkpointing policies (ALL/BOUNDARY/SQRT/NONE), recompute tracking, memory-savings estimation
- train_pipeline.m: entry point with dry-run simulation mode -- prints the full execution plan for any model config and simulates the scheduler state machine
- Makefile: train_pipeline and train_pipeline_live targets

All additive -- existing train_large.m untouched.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
…tests

- model_config.h: added headroom_pct field to CompileConfig, used in max_layers_per_compile() with validation (falls back to 10% for invalid values). All presets include a default. --headroom CLI flag added.
- pipeline.h: tightened mmap error handling — calloc checks, size validation in mmap_state_open (file size vs header, truncation detection), sentinel/version in the error message, msync/munmap return checks on close.
- test_pipeline_unit.c: 23 unit tests for model_config, pipeline planning, gradient checkpoint, and FLOP estimation. Pure C, no ANE dependency. All passing.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
…ency, safety guards
Bug fix: n_checkpointed counting wrong in CKPT_BOUNDARY/SQRT/EVERY_N
- Replaced per-policy arithmetic with single post-switch loop that counts
actual is_saved bits. Eliminates edge-case miscounts when last layer
falls on an interval boundary.
Inconsistency: headroom mismatch between planner and runtime budget
- budget_init() now takes CompileConfig* and uses the same headroom_pct
validation as max_layers_per_compile(). Both paths yield identical
usable-budget calculations.
Inconsistency: total_model_bytes() omitted global gradients
- Added rms_final_grad and embed_grad terms to match mmap_compute_size().
Diagnostic output now agrees with actual allocation.
Design: divide-by-zero in model_dims_init() if n_heads=0
- Guarded head_dim = dim / n_heads with n_heads > 0 check.
Design: no bounds checking in mmap typed accessors
- All four mmap_layer_* accessors now validate layer index and return NULL
on out-of-bounds. Extracted shared mmap_dims() helper to deduplicate
ModelDims reconstruction.
Design: CKPT_EVERY_N interval was hardcoded even though the caller should set it
- Added custom_interval parameter to checkpoint_init(). Pass 0 for
default (4), or any positive int for custom spacing.
Tests: 26/26 passing (3 new: custom interval, n_checkpointed accuracy,
zero-heads guard).
Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
Rebases the 3 pipeline scaffolding commits onto current main (which gained efcf193, 1a7d884, 050bc4f and more upstream).

Conflicts resolved:
- training/Makefile — merged both HEADERS_ANE (upstream) and HEADERS_PIPELINE (ours), plus a unified clean rule to include all binaries from both feature sets.

26/26 unit tests still pass post-rebase.

Merge this into codegen-bot/pipeline-scaffolding-a7f3e2 to update PR #1 with the rebased history.

Initiated by @dermitchell1993