explicitly permit JIT caching for model init #577
matthew-frank wants to merge 1 commit into mlcommons:master from
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Background: Because resume-from-checkpoint time matters for large model training, it is now common industry practice for the JITs used in LLM libraries to cache their compiled kernels on shared persistent storage and reuse them across runs. Libraries like Megatron-Core, Triton, TorchTitan, and Hybrid EP all support this initialization-time optimization. The current rules neither prohibit nor explicitly permit using JIT caches.

Pros: JIT time has been a real issue when collecting results for very large benchmarks in the final weeks leading up to the deadline. The GPU cost is one issue (e.g., 5 minutes of extra init per run on 2k-8k GPUs is 170-680 GPU-hours), but developer time is another when there is a risk that runs will fail and have to be restarted. Waiting for big models to get through init and start training has been stressful. Allowing JIT caching during init would reduce both the developer stress and the hardware cost.

Potential Cons:
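As a concrete illustration of what this optimization looks like in practice, the sketch below redirects the kernel caches that Triton and torch.compile's Inductor backend consult to a single cache root. The `TRITON_CACHE_DIR` and `TORCHINDUCTOR_CACHE_DIR` environment variables are real knobs in those libraries; the cache-root path itself is hypothetical and would point at a shared persistent filesystem on a real cluster.

```shell
# Sketch: point JIT kernel caches at a persistent location so kernels
# compiled during an earlier run are reused at model-init time.
# JIT_CACHE_ROOT is a hypothetical path; on a cluster it would be a
# shared filesystem (e.g. /shared/fs/jit-cache) visible to every node.
export JIT_CACHE_ROOT="${TMPDIR:-/tmp}/jit-cache"
# Env vars recognized by Triton and torch.compile's Inductor backend.
export TRITON_CACHE_DIR="$JIT_CACHE_ROOT/triton"
export TORCHINDUCTOR_CACHE_DIR="$JIT_CACHE_ROOT/inductor"
mkdir -p "$TRITON_CACHE_DIR" "$TORCHINDUCTOR_CACHE_DIR"
```

On the first run the caches are cold and kernels are compiled as usual; subsequent runs (including restarts after a failure) find the compiled kernels already present and skip most of the JIT work during init.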
WG discussion: Generally OK with this; the only concern is that a submitter whose init time is close to 30 minutes would have had to exceed the 30-minute init limit without caching, so this rule gives them a bonus. Approved. Add a note to the rule change that if init time is very close to 30 minutes, the review committee can investigate further whether JIT caching gives an unfair advantage to a submitter.
Explicitly permit using JIT caching to reduce model init time.