Fix misusing incompletely generated artifacts#605
Fix misusing incompletely generated artifacts#605fzyzcjy wants to merge 7 commits intofeat/abs_pathfrom
Conversation
Move the "release" tracker file write to be the very last operation, after the checkpoint directory rename. This ensures the tracker content reliably indicates that all conversion work (including the directory rename) is fully complete.
Replace naive directory-existence checks with proper completion sentinels: convert_checkpoint now verifies latest_checkpointed_iteration.txt contains "release", and fp8_cast_bf16 now checks for model.safetensors.index.json. Both are the last files written by their respective tools, preventing false-positive skips after mid-conversion crashes.
…eeded If another rank crashed during save_checkpoint, the barrier will fail before we write the "release" marker, preventing a false completion signal.
Summary of ChangesHello @fzyzcjy, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses potential race conditions and improves the robustness of checkpoint conversion and FP8 casting processes. It reorders the signaling of completion in the Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request improves the robustness of artifact generation by checking for specific completion markers instead of just directory existence, preventing the misuse of incompletely generated artifacts. However, the changes introduce potential security vulnerabilities related to insecure file handling and path traversal. Specifically, unvalidated CLI arguments used to construct file paths for moving directories and writing tracker files could be exploited to manipulate arbitrary files on the system. Additionally, I've provided suggestions to make the completion checks more robust by handling potential file I/O errors and validating file content.
The fused attention backend (cuDNN) consistently fails on node 4 (10.220.47.4) with CUDNN_STATUS_SUBLIBRARY_LOADING_FAILED during compute_log_prob in TransformerEngine's context parallel path. Switching to flash attention resolves this while maintaining correct training behavior with MLA + CP=4.
…g failure" This reverts commit 191a472.
a8fdcba to
1a76781
Compare
No description provided.