Collaborator

@yzh119 yzh119 commented Oct 3, 2025

📌 Description

There's a race condition between cubin download and compilation:

Timeline of the race condition:

  1. Worker1: Acquires lock -> downloads cubin -> releases lock
  2. Worker1: Starts compile operation (reading the cubin file) without lock protection
  3. Worker2: Acquires lock -> detects file corruption (sha256 mismatch) -> overwrites the file Worker1 is reading!
  4. Worker1: Compilation fails or uses corrupted data

Root cause: The critical region (lock scope) only protects the download operation, but doesn't protect the subsequent compile/load operations. We cannot simply extend the lock scope because:

  • Loading the cubin, building the op, and loading the op happen in different code paths
  • A single lock cannot be held across the entire pipeline, which spans both C++ and Python

Solution

Use atomic file rename to leverage filesystem inode semantics:

  1. Download/copy to temporary file: {local_path}.tmp
  2. Atomically rename to final path: os.replace(tmp, local_path)
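The two steps above can be sketched as a small helper. This is a minimal illustration, not the code in flashinfer/jit/cubin_loader.py: the function name `atomic_write_bytes` and its signature are hypothetical, and it assumes the temporary file sits in the same directory as the destination so that the rename stays on one filesystem.

```python
import os


def atomic_write_bytes(local_path: str, data: bytes) -> None:
    """Write `data` to `local_path` via a temporary file and an atomic rename.

    Hypothetical helper for illustration only.
    """
    # Step 1: write the full payload to a sibling temporary file.
    temp_path = f"{local_path}.tmp"
    with open(temp_path, "wb") as file:
        file.write(data)

    # Step 2: swap the finished file into place in a single operation.
    # Readers that open local_path see either the old file or the new one,
    # never a partially written file.
    os.replace(temp_path, local_path)
```

Callers would still hold the existing download lock around this helper, since the fixed `.tmp` name is shared by every writer of the same path.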

Why this works:

  • os.replace() is atomic when the temporary file and the destination are on the same filesystem, which holds here because the .tmp file is created next to the destination
  • On Unix systems, when Worker2 replaces the file, Worker1's open file descriptor still points to the old inode
  • Worker1 can safely continue reading the old file
  • New readers (after the replace) see the new file
  • The old inode is freed only once every file descriptor that refers to it has been closed
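The inode behaviour described above can be observed with a short experiment. This is a sketch that assumes a Unix-like system (on Windows, replacing a file that is still open typically fails); the function name `read_across_replace` and the file name `kernel.cubin` are made up for the illustration.

```python
import os
from typing import Tuple


def read_across_replace(directory: str) -> Tuple[bytes, bytes]:
    """Return (what an already-open reader sees, what a new reader sees)."""
    path = os.path.join(directory, "kernel.cubin")  # hypothetical file name
    with open(path, "wb") as file:
        file.write(b"old contents")

    # "Worker1" opens the file before the replacement happens.
    reader = open(path, "rb")

    # "Worker2" writes a new version and atomically renames it into place.
    temp_path = f"{path}.tmp"
    with open(temp_path, "wb") as file:
        file.write(b"new contents")
    os.replace(temp_path, path)

    # The existing descriptor still refers to the old inode.
    with reader:
        old_view = reader.read()

    # A reader that opens the path after the replace gets the new inode.
    with open(path, "rb") as file:
        new_view = file.read()

    return old_view, new_view
```

The protection only covers a reader that keeps its descriptor open. A consumer that closes the file and later reopens it by path will see the replacement.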

Changes

  • flashinfer/jit/cubin_loader.py:85: Atomic rename for local file copy
  • flashinfer/jit/cubin_loader.py:103: Atomic rename for URL downloads

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

cc @nvpohanh @nvjullin @joker-eph

Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the cubin loading mechanism by eliminating a critical race condition. It ensures data integrity during file operations by employing atomic file replacement, thereby preventing scenarios where concurrent processes might access inconsistent or corrupted cubin files. The change focuses on making file writes and updates safe in a multi-threaded or multi-process environment.

Highlights

  • Race Condition Fix: Addresses a critical race condition in the cubin loader where concurrent operations could lead to a worker reading a corrupted or partially overwritten cubin file during compilation. The issue stemmed from the lock only protecting the download phase, not the subsequent file consumption.
  • Atomic File Operations: Implements atomic file renaming for both local file copies and URL downloads. Files are now first written to a temporary location (e.g., local_path.tmp) and then atomically moved to their final destination using os.replace(). This ensures that other processes never observe an incomplete or inconsistent file.
  • Leveraging Inode Semantics: The solution leverages Unix filesystem inode semantics. When an atomic rename occurs, processes that have already opened the original file continue to read from its inode, while new processes access the updated file. This guarantees data integrity for ongoing operations during file replacement.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a race condition in the cubin loader by using an atomic rename operation. This is a solid approach to ensure that consumers of the file do not read partially written or corrupted data. My review includes suggestions to enhance the robustness of this change by ensuring temporary files are cleaned up in case of errors, preventing potential disk space leaks.

Comment on lines +83 to +85
temp_path = f"{local_path}.tmp"
shutil.copy(source, temp_path)
os.replace(temp_path, local_path) # Atomic rename

medium

While using a temporary file and atomic rename is a good solution for the race condition, this implementation can leave behind .tmp files if an error occurs during the copy or rename operation. This could lead to disk space exhaustion over time. To make this more robust, I suggest wrapping the file operations in a try...finally block to ensure the temporary file is always cleaned up.

Suggested change
- temp_path = f"{local_path}.tmp"
- shutil.copy(source, temp_path)
- os.replace(temp_path, local_path) # Atomic rename
+ temp_path = f"{local_path}.tmp"
+ try:
+     shutil.copy(source, temp_path)
+     os.replace(temp_path, local_path)  # Atomic rename
+ finally:
+     if os.path.exists(temp_path):
+         os.remove(temp_path)

Comment on lines +98 to +103
temp_path = f"{local_path}.tmp"
with open(temp_path, "wb") as file:
    file.write(response.content)

# Atomic rename to prevent readers from seeing partial writes
os.replace(temp_path, local_path)

medium

Similar to the local file copy, the URL download logic can leave temporary .tmp files if an error occurs during file writing or the atomic rename. This can happen, for example, if there are file permission issues or if the disk is full. Adding a try...finally block will ensure these temporary files are cleaned up, making the download process more robust.

Suggested change
- temp_path = f"{local_path}.tmp"
- with open(temp_path, "wb") as file:
-     file.write(response.content)
- # Atomic rename to prevent readers from seeing partial writes
- os.replace(temp_path, local_path)
+ temp_path = f"{local_path}.tmp"
+ try:
+     with open(temp_path, "wb") as file:
+         file.write(response.content)
+     # Atomic rename to prevent readers from seeing partial writes
+     os.replace(temp_path, local_path)
+ finally:
+     if os.path.exists(temp_path):
+         os.remove(temp_path)
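A possible variant of the cleanup suggested above, shown only as a sketch: instead of a fixed `.tmp` name, it asks `tempfile.mkstemp` for a unique temporary file in the destination directory, so two writers that are not serialized by the same lock would not clobber each other's temporary file. The helper name `atomic_write_unique` is hypothetical and is not part of this pull request.

```python
import os
import tempfile


def atomic_write_unique(local_path: str, data: bytes) -> None:
    """Atomically write `data` to `local_path` using a unique temporary file."""
    # Create the temporary file next to the destination so the rename
    # stays on a single filesystem.
    directory = os.path.dirname(os.path.abspath(local_path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as file:
            file.write(data)
        os.replace(temp_path, local_path)  # Atomic rename
    except BaseException:
        # Remove the temporary file on any failure, then re-raise.
        try:
            os.remove(temp_path)
        except FileNotFoundError:
            pass
        raise
```

One trade-off to keep in mind: `mkstemp` creates the file readable and writable only by the owning user, and the destination keeps those permissions after the rename, so a shared cache directory may need an explicit `os.chmod`.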

@joker-eph
Collaborator

> Worker2: Acquires lock -> detects file corruption (sha256 mismatch) -> overwrites the file Worker1 is reading!

How is it possible that there is a sha256 mismatch?
Why would Worker1 be happy with the file?
