Fix: Prevent race condition in cubin loader when file is being consumed #1852

yzh119 · 2025-10-03T07:33:21Z

📌 Description

There's a race condition between cubin download and compilation:

Timeline of the race condition:

Worker1: Acquires lock -> downloads cubin -> releases lock
Worker1: Starts compile operation (reading the cubin file) without lock protection
Worker2: Acquires lock -> detects file corruption (sha256 mismatch) -> overwrites the file Worker1 is reading!
Worker1: Compilation fails or uses corrupted data

Root cause: The critical region (lock scope) only protects the download operation, but doesn't protect the subsequent compile/load operations. We cannot simply extend the lock scope because:

The load cubin, build op, and load op steps live in different code paths
The same lock cannot be held across the entire pipeline, which spans both C++ and Python
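
To make the timeline concrete, here is a minimal single-process sketch of the failure mode, assuming the second worker overwrites the file in place (opening the destination with "wb"). The file name demo.cubin and the byte contents are made up for illustration; this is not the FlashInfer code path.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.cubin")  # hypothetical file name

# Worker1's download: the file initially holds the "old" content.
with open(path, "wb") as f:
    f.write(b"A" * 8)

# Worker1 starts consuming the file. Unbuffered mode is used so that
# every read() goes to the file itself rather than a Python-side buffer.
reader = open(path, "rb", buffering=0)
first_half = reader.read(4)

# Worker2 overwrites the same path in place: "wb" truncates and rewrites
# the very inode that Worker1 still has open.
with open(path, "wb") as f:
    f.write(b"B" * 8)

# Worker1 continues from offset 4 and now sees Worker2's bytes.
second_half = reader.read(4)
reader.close()

seen_by_worker1 = first_half + second_half
print(seen_by_worker1)  # neither the old content nor the new one
```

The reader ends up with a mix of old and new bytes, which is the kind of input that would make a later compile or load step fail.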

Solution

Use atomic file rename to leverage filesystem inode semantics:

Download/copy to temporary file: {local_path}.tmp
Atomically rename to final path: os.replace(tmp, local_path)
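
The two steps can be sketched as a small helper. This is an illustrative sketch rather than the code in cubin_loader.py: the function name atomic_copy and the file names in the usage part are assumptions. The temporary file is placed next to the destination so that both paths are on the same filesystem, which the rename needs in order to be atomic.

```python
import os
import shutil
import tempfile


def atomic_copy(source: str, local_path: str) -> None:
    """Copy `source` to `local_path` so that readers never see a partial file."""
    temp_path = f"{local_path}.tmp"
    shutil.copy(source, temp_path)     # step 1: write the full content aside
    os.replace(temp_path, local_path)  # step 2: swap it in with one rename


# Usage with throwaway files (names are hypothetical).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "source.bin")
dst = os.path.join(workdir, "cache.bin")
with open(src, "wb") as f:
    f.write(b"cubin bytes")

atomic_copy(src, dst)
```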

Why this works:

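os.replace() is atomic on POSIX as long as the temporary file and the destination are on the same filesystem, which holds here because {local_path}.tmp sits next to local_path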
On Unix systems, when Worker2 replaces the file, Worker1's open file descriptor still points to the old inode
Worker1 can safely continue reading the old file
New readers (after the replace) see the new file
The old inode is only deleted when all file descriptors are closed
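
The inode behaviour described above can be checked with a short single-process sketch. It assumes a Unix-like system (on Windows, replacing a file that is still open typically fails), and the file name demo.cubin is made up for illustration.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
local_path = os.path.join(workdir, "demo.cubin")  # hypothetical file name
tmp_path = f"{local_path}.tmp"

# The "old" file that Worker1 is about to consume.
with open(local_path, "wb") as f:
    f.write(b"A" * 8)

# Worker1 opens the file and reads the first half (unbuffered reads).
reader = open(local_path, "rb", buffering=0)
first_half = reader.read(4)

# Worker2 writes the new content aside, then renames it over the old path.
with open(tmp_path, "wb") as f:
    f.write(b"B" * 8)
os.replace(tmp_path, local_path)

# Worker1's descriptor still refers to the old inode.
old_view = first_half + reader.read(4)
reader.close()

# A reader that opens the path after the rename gets the new inode.
with open(local_path, "rb") as f:
    new_view = f.read()

print(old_view, new_view)
```

The reader that opened the file before the rename sees only the old bytes, while a fresh open of the same path returns the new content.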

Changes

flashinfer/jit/cubin_loader.py:85: Atomic rename for local file copy
flashinfer/jit/cubin_loader.py:103: Atomic rename for URL downloads

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing (unittest, etc.).

Reviewer Notes

cc @nvpohanh @nvjullin @joker-eph

gemini-code-assist · 2025-10-03T07:33:34Z

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the cubin loading mechanism by eliminating a critical race condition. It ensures data integrity during file operations by employing atomic file replacement, thereby preventing scenarios where concurrent processes might access inconsistent or corrupted cubin files. The change focuses on making file writes and updates safe in a multi-threaded or multi-process environment.

Highlights

Race Condition Fix: Addresses a critical race condition in the cubin loader where concurrent operations could lead to a worker reading a corrupted or partially overwritten cubin file during compilation. The issue stemmed from the lock only protecting the download phase, not the subsequent file consumption.
Atomic File Operations: Implements atomic file renaming for both local file copies and URL downloads. Files are now first written to a temporary location (e.g., local_path.tmp) and then atomically moved to their final destination using os.replace(). This ensures that other processes never observe an incomplete or inconsistent file.
Leveraging Inode Semantics: The solution leverages Unix filesystem inode semantics. When an atomic rename occurs, processes that have already opened the original file continue to read from its inode, while new processes access the updated file. This guarantees data integrity for ongoing operations during file replacement.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request effectively addresses a race condition in the cubin loader by using an atomic rename operation. This is a solid approach to ensure that consumers of the file do not read partially written or corrupted data. My review includes suggestions to enhance the robustness of this change by ensuring temporary files are cleaned up in case of errors, preventing potential disk space leaks.

gemini-code-assist · 2025-10-03T07:35:04Z

flashinfer/jit/cubin_loader.py

+                    temp_path = f"{local_path}.tmp"
+                    shutil.copy(source, temp_path)
+                    os.replace(temp_path, local_path)  # Atomic rename


While using a temporary file and atomic rename is a good solution for the race condition, this implementation can leave behind .tmp files if an error occurs during the copy or rename operation. This could lead to disk space exhaustion over time. To make this more robust, I suggest wrapping the file operations in a try...finally block to ensure the temporary file is always cleaned up.

Suggested change

-                    temp_path = f"{local_path}.tmp"
-                    shutil.copy(source, temp_path)
-                    os.replace(temp_path, local_path)  # Atomic rename
+                    temp_path = f"{local_path}.tmp"
+                    try:
+                        shutil.copy(source, temp_path)
+                        os.replace(temp_path, local_path)  # Atomic rename
+                    finally:
+                        if os.path.exists(temp_path):
+                            os.remove(temp_path)

gemini-code-assist · 2025-10-03T07:35:04Z

flashinfer/jit/cubin_loader.py

+                    temp_path = f"{local_path}.tmp"
+                    with open(temp_path, "wb") as file:
                        file.write(response.content)

+                    # Atomic rename to prevent readers from seeing partial writes
+                    os.replace(temp_path, local_path)


Similar to the local file copy, the URL download logic can leave temporary .tmp files if an error occurs during file writing or the atomic rename. This can happen, for example, if there are file permission issues or if the disk is full. Adding a try...finally block will ensure these temporary files are cleaned up, making the download process more robust.

Suggested change

-                    temp_path = f"{local_path}.tmp"
-                    with open(temp_path, "wb") as file:
-                        file.write(response.content)
-                    # Atomic rename to prevent readers from seeing partial writes
-                    os.replace(temp_path, local_path)
+                    temp_path = f"{local_path}.tmp"
+                    try:
+                        with open(temp_path, "wb") as file:
+                            file.write(response.content)
+                        # Atomic rename to prevent readers from seeing partial writes
+                        os.replace(temp_path, local_path)
+                    finally:
+                        if os.path.exists(temp_path):
+                            os.remove(temp_path)
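
Since both suggestions repeat the same try...finally shape, one possible way to share it is a small context manager. The sketch below is an assumption about how that could look, not part of this PR: the name atomic_target and the file names in the usage part are hypothetical.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def atomic_target(local_path):
    """Yield a temp path; publish it to `local_path` on success, delete it on failure."""
    temp_path = f"{local_path}.tmp"
    try:
        yield temp_path
        os.replace(temp_path, local_path)  # Atomic rename
    finally:
        # After a successful replace the temp path no longer exists, so this
        # only removes leftovers from a failed write or a failed rename.
        if os.path.exists(temp_path):
            os.remove(temp_path)


workdir = tempfile.mkdtemp()
local_path = os.path.join(workdir, "demo.cubin")  # hypothetical file name

# Successful write: the content is published under local_path.
with atomic_target(local_path) as tmp:
    with open(tmp, "wb") as f:
        f.write(b"payload")

# Failed write: the partial temp file is removed and local_path is untouched.
try:
    with atomic_target(local_path) as tmp:
        with open(tmp, "wb") as f:
            f.write(b"partial")
        raise IOError("simulated failure mid-download")
except IOError:
    pass
```

Both call sites (local copy and URL download) would then only differ in how they fill the temporary file.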

joker-eph · 2025-10-03T09:32:21Z

"Worker2: Acquires lock -> detects file corruption (sha256 mismatch) -> overwrites the file Worker1 is reading!"

How is it possible for there to be a sha256 mismatch?
Why would Worker1 be happy with the file?

upd

817c35b
