-
Notifications
You must be signed in to change notification settings - Fork 530
Fix: Prevent race condition in cubin loader when file is being consumed #1852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix: Prevent race condition in cubin loader when file is being consumed #1852
Conversation
Summary of ChangesHello @yzh119, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness of the cubin loading mechanism by eliminating a critical race condition. It ensures data integrity during file operations by employing atomic file replacement, thereby preventing scenarios where concurrent processes might access inconsistent or corrupted cubin files. The change focuses on making file writes and updates safe in a multi-threaded or multi-process environment. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively addresses a race condition in the cubin loader by using an atomic rename operation. This is a solid approach to ensure that consumers of the file do not read partially written or corrupted data. My review includes suggestions to enhance the robustness of this change by ensuring temporary files are cleaned up in case of errors, preventing potential disk space leaks.
temp_path = f"{local_path}.tmp" | ||
shutil.copy(source, temp_path) | ||
os.replace(temp_path, local_path) # Atomic rename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While using a temporary file and atomic rename is a good solution for the race condition, this implementation can leave behind .tmp
files if an error occurs during the copy or rename operation. This could lead to disk space exhaustion over time. To make this more robust, I suggest wrapping the file operations in a try...finally
block to ensure the temporary file is always cleaned up.
temp_path = f"{local_path}.tmp" | |
shutil.copy(source, temp_path) | |
os.replace(temp_path, local_path) # Atomic rename | |
temp_path = f"{local_path}.tmp" | |
try: | |
shutil.copy(source, temp_path) | |
os.replace(temp_path, local_path) # Atomic rename | |
finally: | |
if os.path.exists(temp_path): | |
os.remove(temp_path) |
temp_path = f"{local_path}.tmp" | ||
with open(temp_path, "wb") as file: | ||
file.write(response.content) | ||
|
||
# Atomic rename to prevent readers from seeing partial writes | ||
os.replace(temp_path, local_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the local file copy, the URL download logic can leave temporary .tmp
files if an error occurs during file writing or the atomic rename. This can happen, for example, if there are file permission issues or if the disk is full. Adding a try...finally
block will ensure these temporary files are cleaned up, making the download process more robust.
temp_path = f"{local_path}.tmp" | |
with open(temp_path, "wb") as file: | |
file.write(response.content) | |
# Atomic rename to prevent readers from seeing partial writes | |
os.replace(temp_path, local_path) | |
temp_path = f"{local_path}.tmp" | |
try: | |
with open(temp_path, "wb") as file: | |
file.write(response.content) | |
# Atomic rename to prevent readers from seeing partial writes | |
os.replace(temp_path, local_path) | |
finally: | |
if os.path.exists(temp_path): | |
os.remove(temp_path) |
How is this possible that there is sha256 mismatch? |
📌 Description
There's a race condition between cubin download and compilation:
Timeline of the race condition:
Root cause: The critical region (lock scope) only protects the download operation, but doesn't protect the subsequent compile/load operations. We cannot simply extend the lock scope because:
Solution
Use atomic file rename to leverage filesystem inode semantics:
Why this works:
Changes
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
pre-commit
by runningpip install pre-commit
(or used your preferred method).pre-commit install
.pre-commit run --all-files
and fixed any reported issues.🧪 Tests
unittest
, etc.).Reviewer Notes
cc @nvpohanh @nvjullin @joker-eph