
[Dev] Add Llama3 training example and fix cache save #14

Merged
jiahy0825 merged 4 commits into SandAI-org:main from wtr0504:dev/training
Apr 2, 2026
Conversation

@wtr0504 (Collaborator) commented Apr 1, 2026

🗂️ PR Category

  • ✨ New Feature
  • 🚀 Optimization (performance, memory, etc.)
  • 💥 Breaking Change
  • 🐛 Bug Fix
  • 🛠️ Development / Refactoring
  • 📚 Documentation
  • 🧹 Chore (Dependencies, CI/CD, Configuration, etc.)
  • 🧪 Testing

📝 Description

Summary

  • Add end-to-end Llama3 training example (example/training/) with FSDP support, a distributed training script, and an Nsys profiling launch script.
  • Fix a cache save bug where aot_autograd artifacts were empty, causing compiled graphs to fail to persist correctly.
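The class of bug fixed here can be illustrated with a minimal sketch: a cache writer that refuses to persist entries whose aot_autograd artifacts came back empty, instead of silently writing an unusable cache file. All names below (`save_cache_entry`, the JSON layout) are hypothetical illustrations, not the actual magi_compiler API.

```python
import json


def save_cache_entry(path, graph_key, aot_artifacts):
    """Persist compiled-graph artifacts, refusing empty payloads.

    Before a fix like this, an empty artifact dict would still be
    written to disk, so a later cache load would "succeed" but yield
    no usable compiled graph. Skipping the write forces a clean
    recompile on the next run instead.
    """
    if not aot_artifacts:
        return False  # nothing worth caching; leave no file behind
    with open(path, "w") as f:
        json.dump({"key": graph_key, "artifacts": aot_artifacts}, f)
    return True
```

A caller can then treat a `False` return as "cache miss on next load" rather than discovering the corruption later at graph-load time.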

Changes

  • example/training/llama3.py — Llama3 model definition adapted to use magi_compile
  • example/training/train.py — distributed training loop with FSDP and NVTX profiling hooks
  • example/training/train.sh — torchrun launcher with optional Nsys profiling
  • magi_compiler/magi_backend/piecewise_compiler.py — workaround for empty aot_autograd artifacts on cache save
  • magi_compiler/utils/nvtx.py — per-iteration NVTX profiling helper
  • requirements-test.txt — update test dependencies
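The NVTX iteration hooks can be sketched roughly as below. `nvtx_range` is a hypothetical name for illustration; the real helper in `magi_compiler/utils/nvtx.py` may be shaped differently. The guard lets the same training loop run (and be tested) on machines without CUDA, where NVTX ranges are simply no-ops.

```python
from contextlib import contextmanager

# Fall back to no-op ranges when torch or CUDA is unavailable, so the
# training loop does not depend on profiling being possible.
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:
    _HAS_CUDA = False


@contextmanager
def nvtx_range(name):
    """Wrap a code region in an NVTX range visible in Nsight Systems."""
    if _HAS_CUDA:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _HAS_CUDA:
            torch.cuda.nvtx.range_pop()
```

In a training script this would typically wrap each iteration, e.g. `with nvtx_range(f"iter_{step}"): ...`, so that per-step timelines line up in the Nsys trace produced by the launch script.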

@jiahy0825 (Collaborator) left a comment

LGTM

@jiahy0825 jiahy0825 merged commit 8f931af into SandAI-org:main Apr 2, 2026
3 checks passed