explicitly permit JIT caching for model init #577
matthew-frank wants to merge 1 commit into mlcommons:master from
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Background: Because resume-from-checkpoint time matters for large model training, it is now common industry practice for the JITs used in LLM libraries to cache their compiled kernels on shared persistent storage and reuse them across runs. Libraries like Megatron-Core, Triton, TorchTitan, and Hybrid EP all support this initialization-time optimization. The current rules neither prohibit nor explicitly permit using JIT caches.

Pros: JIT time has been a real issue when collecting results for very large benchmarks in the final weeks leading up to the deadline. The GPU cost is one issue (e.g., 5 minutes of extra init per run on 2k-8k GPUs is 170-680 GPU-hours), but developer time is another when there is a risk that runs will fail and have to be restarted. Waiting for big models to get through init and start training has been stressful. Allowing JIT caching during init would reduce both the developer stress and the hardware cost.

Potential Cons:
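As a concrete illustration of what this optimization looks like in practice, the sketch below redirects the kernel caches that Triton and torch.compile's Inductor backend consult to a single cache root. The `TRITON_CACHE_DIR` and `TORCHINDUCTOR_CACHE_DIR` environment variables are real knobs in those libraries; the cache-root path itself is hypothetical and would point at a shared persistent filesystem on a real cluster.

```shell
# Sketch: point JIT kernel caches at a persistent location so kernels
# compiled during an earlier run are reused at model-init time.
# JIT_CACHE_ROOT is a hypothetical path; on a cluster it would be a
# shared filesystem (e.g. /shared/fs/jit-cache) visible to every node.
export JIT_CACHE_ROOT="${TMPDIR:-/tmp}/jit-cache"
# Env vars recognized by Triton and torch.compile's Inductor backend.
export TRITON_CACHE_DIR="$JIT_CACHE_ROOT/triton"
export TORCHINDUCTOR_CACHE_DIR="$JIT_CACHE_ROOT/inductor"
mkdir -p "$TRITON_CACHE_DIR" "$TORCHINDUCTOR_CACHE_DIR"
```

On the first run the caches are cold and kernels are compiled as usual; subsequent runs (including restarts after a failure) find the compiled kernels already present and skip most of the JIT work during init.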
WG discussion: Generally OK with this; the only concern is that a submitter whose init time is close to 30 minutes would have had to exceed the 30-minute init limit without caching, so this rule gives them a bonus. Approved. Add a note to the rule change that if init time is very close to 30 minutes, the review committee can investigate further whether JIT caching gives an unfair advantage to a submitter.
Explicitly permit using JIT caching to reduce model init time.