@hagenw
Member

@hagenw hagenw commented Jul 18, 2025

Closes #197

Add a .complete file to indicate if a database was completely loaded, so we do not need to acquire a lock in those cases.

Summary by Sourcery

Add a .complete sentinel file to mark fully loaded databases and use it to bypass locking in the load routines. Update the metadata checks, refactor the loading logic for clarity, and add comprehensive tests for the .complete behavior.

New Features:

  • Add .complete file to database root to signal completed loading
  • Skip acquiring file lock in load functions when .complete file exists

Enhancements:

  • Prioritize .complete file over metadata in _database_is_complete
  • Refactor load and load_media to early-exit locking logic and consolidate loading flow
  • Define COMPLETE_FILE constant in project definitions

Tests:

  • Add tests for .complete file creation, detection, fallback, constant definition, and concurrent completion
  • Add tests ensuring load functions skip or use locking appropriately based on .complete file

@sourcery-ai
Contributor

sourcery-ai bot commented Jul 18, 2025

Reviewer's Guide

This PR introduces a persistent “.complete” marker file to avoid redundant locking once a database has been fully loaded. It adds a constant for the marker, ensures it’s created at the end of loading, updates the completeness check and load routines to look for the file (skipping locks when present, with a race-safe recheck), and provides a full test suite to validate these behaviors.

Sequence diagram for database loading with .complete file check

sequenceDiagram
    participant User
    participant load as load()
    participant OS as os.path
    participant Lock as FolderLock
    participant define as define.COMPLETE_FILE

    User->>load: load()
    load->>OS: os.path.exists(.complete)
    alt .complete exists
        load->>load: load_header_to(...)
        load->>User: return db (no lock)
    else .complete does not exist
        load->>Lock: acquire FolderLock
        Lock-->>load: lock acquired
        load->>load: load_header_to(...)
        load->>OS: os.path.exists(.complete)
        alt .complete now exists
            load->>User: return db
        else
            load->>load: _database_is_complete()
            alt database is complete
                load->>OS: audeer.touch(.complete)
            end
            load->>User: return db
        end
    end
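The flow in the sequence diagram is a check, lock, recheck pattern. Below is a minimal sketch of that pattern, not the code from this PR: `threading.Lock` stands in for audb's `FolderLock`, `load_with_marker` and `do_load` are hypothetical names, and the sketch always writes the marker after loading, whereas the PR only does so when the database is complete.

```python
import os
import tempfile
import threading

COMPLETE_FILE = ".complete"  # same value as define.COMPLETE_FILE in the PR
_lock = threading.Lock()  # stand-in for audb's FolderLock (assumption)


def load_with_marker(db_root, do_load):
    """Run do_load(db_root) under a lock unless the marker already exists.

    Returns True if the lock was taken, False if it was skipped.
    """
    marker = os.path.join(db_root, COMPLETE_FILE)
    # Fast path: marker present, so no lock is needed
    if os.path.exists(marker):
        return False
    with _lock:
        # Recheck inside the lock: another worker may have finished meanwhile
        if not os.path.exists(marker):
            do_load(db_root)
            # Create the marker only after loading has finished
            with open(marker, "w"):
                pass
    return True


root = tempfile.mkdtemp()
loaded = []
took_lock_first = load_with_marker(root, loaded.append)   # lock taken, marker created
took_lock_second = load_with_marker(root, loaded.append)  # marker found, lock skipped
```

The second call returns without touching the lock, which is the saving the PR is after for repeated loads of a complete database.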

Class diagram for .complete file integration in database loading

classDiagram
    class define {
        +LOCK_FILE: str
        +COMPLETE_FILE: str
    }
    class load {
        +load()
        +_database_is_complete()
        +_database_check_complete()
    }
    define <.. load : uses
    load : +load() checks for .complete file
    load : +_database_is_complete() checks for .complete file
    load : creates .complete file on completion

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Define COMPLETE_FILE constant | • Add COMPLETE_FILE = '.complete' constant to project definitions | audb/core/define.py |
| Create .complete file to signal full database load | • Touch .complete in check() after cleaning up temporary directory | audb/core/load.py |
| Detect .complete file and adjust locking logic | • Early exit in _database_is_complete upon marker presence<br>• Pre-check .complete in load() and load_media() to skip FolderLock<br>• Recheck marker inside lock to handle race conditions | audb/core/load.py |
| Add tests for .complete functionality | • Verify file creation, detection, and metadata update<br>• Test skip-lock behavior and concurrent race handling<br>• Cover fallback to metadata and missing-files scenarios | tests/test_complete_file.py |

Assessment against linked issues

| Issue | Objective | Addressed | Explanation |
| --- | --- | --- | --- |
| #197 | Implement a mechanism to signal that a database in the cache folder is complete, without requiring access to db.yaml. | | |
| #197 | Modify the locking logic so that the cache folder is only locked if the database is not complete (i.e., the .complete file does not exist). | | |
| #197 | Add or update tests to verify that the .complete file mechanism and locking behavior work as intended and avoid race conditions. | | |

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @hagenw - I've reviewed your changes - here's some feedback:

  • Factor out the repeated “.complete” check and lock-skip logic into a shared helper to reduce duplication between load() and load_media().
  • Ensure that any stale .complete file is removed or refreshed at the start of a load so you don’t accidentally skip necessary work from a previous incomplete run.
  • Wrap the creation of the .complete file in the same lock or a rollback so that it isn’t left behind if an error occurs during loading.
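
One way the first point could look is a small helper that hands back either a real lock or a no-op context. This is a sketch under assumptions: `lock_unless_complete` and `make_lock` are hypothetical names, and `threading.Lock` stands in for `FolderLock(db_root, timeout=timeout)`.

```python
import contextlib
import os
import tempfile
import threading

COMPLETE_FILE = ".complete"  # same value as define.COMPLETE_FILE in the PR


def lock_unless_complete(db_root, make_lock):
    """Return a no-op context if the marker exists, else a fresh lock.

    make_lock is a hypothetical factory; in audb it could be
    lambda: FolderLock(db_root, timeout=timeout).
    """
    if os.path.exists(os.path.join(db_root, COMPLETE_FILE)):
        return contextlib.nullcontext()
    return make_lock()


# Usage: threading.Lock stands in for FolderLock
root = tempfile.mkdtemp()
with lock_unless_complete(root, threading.Lock):
    pass  # load tables and media here
```

Both load() and load_media() could then share the same `with` statement. The recheck of the marker inside the lock would still be needed in the locked branch.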
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `audb/core/load.py:196` </location>
<code_context>
         )
         audeer.rmdir(db_root_tmp)

+        # Create .complete file to signal completion
+        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+        audeer.touch(complete_file)
+

</code_context>

<issue_to_address>
Consider atomic file creation for the .complete file.

audeer.touch may not guarantee atomicity, risking race conditions if multiple processes run concurrently. Use an atomic method, like writing to a temp file and renaming, to ensure .complete is reliably created after completion.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        # Create .complete file to signal completion
        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
        audeer.touch(complete_file)
=======
        # Atomically create .complete file to signal completion
        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
        tmp_complete_file = complete_file + ".tmp"
        with open(tmp_complete_file, "w") as f:
            pass  # create empty file
        os.replace(tmp_complete_file, complete_file)
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `audb/core/load.py:205` </location>
<code_context>
     db: audformat.Database,
 ) -> bool:
+    # First check for .complete file
+    if "audb" in db.meta and "root" in db.meta["audb"]:
+        complete_file = os.path.join(db.meta["audb"]["root"], define.COMPLETE_FILE)
+        if os.path.exists(complete_file):
+            return True
+
</code_context>

<issue_to_address>
Checking for the .complete file before metadata may mask incomplete states.

A stale or incorrectly created .complete file could cause a false positive. Consider adding validation to ensure the .complete file reliably indicates database completeness.
</issue_to_address>

### Comment 3
<location> `audb/core/load.py:207` </location>
<code_context>
+    # First check for .complete file
+    if "audb" in db.meta and "root" in db.meta["audb"]:
+        complete_file = os.path.join(db.meta["audb"]["root"], define.COMPLETE_FILE)
+        if os.path.exists(complete_file):
+            return True
+
</code_context>

<issue_to_address>
Bypassing the FolderLock when .complete exists could allow concurrent modifications.

This approach may introduce race conditions if .complete is created before all data is written. Evaluate whether the lock should still be used for reads or if other safeguards are necessary.
</issue_to_address>

### Comment 4
<location> `audb/core/load.py:1190` </location>
<code_context>
-                    cached_versions = _load_files(
-                        _tables,
-                        "table",
+                # Double-check completion status after acquiring lock
+                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+                if os.path.exists(complete_file):
+                    db_is_complete = True
+                else:
</code_context>

<issue_to_address>
Double-checking .complete after acquiring the lock may not be sufficient if file creation is not atomic.

Using non-atomic file creation for .complete can lead to race conditions where other processes detect the file before the database is fully ready. Use atomic file creation or a more reliable signaling method to prevent this issue.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
                # Double-check completion status after acquiring lock
                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
                if os.path.exists(complete_file):
                    db_is_complete = True
                else:
=======
                # Double-check completion status after acquiring lock using atomic file creation
                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
                try:
                    # Try to atomically create the .complete file
                    fd = os.open(complete_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                    with os.fdopen(fd, "w") as f:
                        f.write("complete\n")
                    db_is_complete = False  # We are the first to create it, so not yet complete
                except FileExistsError:
                    # If file exists, another process has completed the database
                    db_is_complete = True
>>>>>>> REPLACE

</suggested_fix>

### Comment 5
<location> `audb/core/load.py:1203` </location>
<code_context>
-            ]
-
-            # load missing tables
-            if not db_is_complete:
-                for _tables in [
-                    requested_misc_tables,
</code_context>

<issue_to_address>
Potential for incomplete state if .complete file is created before all data is written.

Ensure the .complete file is only created after all data and metadata are fully written and flushed to disk to prevent incorrect assumptions about database completeness.
</issue_to_address>

Comment on lines +196 to +198
# Create .complete file to signal completion
complete_file = os.path.join(db_root, define.COMPLETE_FILE)
audeer.touch(complete_file)

suggestion (bug_risk): Consider atomic file creation for the .complete file.

audeer.touch may not guarantee atomicity, risking race conditions if multiple processes run concurrently. Use an atomic method, like writing to a temp file and renaming, to ensure .complete is reliably created after completion.

Suggested change
-        # Create .complete file to signal completion
-        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
-        audeer.touch(complete_file)
+        # Atomically create .complete file to signal completion
+        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+        tmp_complete_file = complete_file + ".tmp"
+        with open(tmp_complete_file, "w") as f:
+            pass  # create empty file
+        os.replace(tmp_complete_file, complete_file)
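
A variation on the temp-file-and-rename idea, sketched here as an assumption rather than as the PR's code: `tempfile.mkstemp` picks a unique temporary name, so two processes finishing at the same time do not write to the same `.tmp` path. The helper name `create_marker_atomically` is hypothetical.

```python
import os
import tempfile


def create_marker_atomically(path):
    """Create an empty file at path via a uniquely named temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp picks a unique name in the target directory
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".complete-")
    os.close(fd)
    try:
        # os.replace overwrites the target; the rename is atomic on POSIX
        os.replace(tmp_path, path)
    except OSError:
        # Do not leave the temp file behind if the rename fails
        os.unlink(tmp_path)
        raise


# Usage: calling it twice is harmless, the marker is simply replaced
demo_dir = tempfile.mkdtemp()
marker = os.path.join(demo_dir, ".complete")
create_marker_atomically(marker)
create_marker_atomically(marker)
```

Note that `mkstemp` creates the file readable and writable only by the creating user, so a shared cache would likely need an additional `os.chmod` on the marker.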

Comment on lines +205 to +207
    if "audb" in db.meta and "root" in db.meta["audb"]:
        complete_file = os.path.join(db.meta["audb"]["root"], define.COMPLETE_FILE)
        if os.path.exists(complete_file):

issue (bug_risk): Checking for the .complete file before metadata may mask incomplete states.

A stale or incorrectly created .complete file could cause a false positive. Consider adding validation to ensure the .complete file reliably indicates database completeness.
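
One possible form of such validation, purely illustrative and not part of this PR: the marker stores how many files the finished load produced, and a reader trusts it only if that count still matches. The functions `write_marker` and `marker_is_valid`, the JSON layout, and the flat directory are all assumptions.

```python
import json
import os
import tempfile

COMPLETE_FILE = ".complete"  # same value as define.COMPLETE_FILE in the PR


def write_marker(db_root, expected_files):
    """Record the number of files the finished load produced (hypothetical)."""
    with open(os.path.join(db_root, COMPLETE_FILE), "w") as fp:
        json.dump({"files": expected_files}, fp)


def marker_is_valid(db_root):
    """Return True only if the marker parses and the file count matches."""
    try:
        with open(os.path.join(db_root, COMPLETE_FILE)) as fp:
            info = json.load(fp)
    except (OSError, ValueError):
        # Missing, unreadable, or empty marker counts as not complete
        return False
    if not isinstance(info, dict):
        return False
    # Count top-level entries only; the marker itself is excluded
    present = [f for f in os.listdir(db_root) if f != COMPLETE_FILE]
    return info.get("files") == len(present)


# Usage: one data file plus a marker that expects one file
root = tempfile.mkdtemp()
open(os.path.join(root, "db.yaml"), "w").close()
write_marker(root, expected_files=1)
valid = marker_is_valid(root)
```

Any check of this kind adds file system access to the fast path, so it trades some of the speed-up for protection against stale markers.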

Comment on lines 1163 to 1170
)

try:
with FolderLock(db_root, timeout=timeout):
# Start with database header without tables
# Check if database is already complete by looking for .complete file
complete_file = os.path.join(db_root, define.COMPLETE_FILE)
if os.path.exists(complete_file):
# Database is complete, no need to lock
db, backend_interface = load_header_to(

issue (bug_risk): Bypassing the FolderLock when .complete exists could allow concurrent modifications.

This approach may introduce race conditions if .complete is created before all data is written. Evaluate whether the lock should still be used for reads or if other safeguards are necessary.

Comment on lines +1190 to +1194
                # Double-check completion status after acquiring lock
                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
                if os.path.exists(complete_file):
                    db_is_complete = True
                else:

suggestion (bug_risk): Double-checking .complete after acquiring the lock may not be sufficient if file creation is not atomic.

Using non-atomic file creation for .complete can lead to race conditions where other processes detect the file before the database is fully ready. Use atomic file creation or a more reliable signaling method to prevent this issue.

Suggested change
-                # Double-check completion status after acquiring lock
-                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
-                if os.path.exists(complete_file):
-                    db_is_complete = True
-                else:
+                # Double-check completion status after acquiring lock using atomic file creation
+                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+                try:
+                    # Try to atomically create the .complete file
+                    fd = os.open(complete_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
+                    with os.fdopen(fd, "w") as f:
+                        f.write("complete\n")
+                    db_is_complete = False  # We are the first to create it, so not yet complete
+                except FileExistsError:
+                    # If file exists, another process has completed the database
+                    db_is_complete = True
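
The exclusive-create primitive used in this suggestion can be tried in isolation. A minimal sketch, with `try_create_exclusive` as a hypothetical helper name:

```python
import os
import tempfile


def try_create_exclusive(path):
    """Return True if this call created path, False if it already existed."""
    try:
        # O_CREAT | O_EXCL fails with FileExistsError if path is already there
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


marker = os.path.join(tempfile.mkdtemp(), ".complete")
first = try_create_exclusive(marker)   # this call creates the file
second = try_create_exclusive(marker)  # the file already exists
```

Exclusive creation only reports whether the caller was first to create the file. It says nothing about whether tables and media are already on disk, so a marker scheme would presumably call it after loading has finished.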

Comment on lines 1246 to +1255
deps,
flavor,
cache_root,
False,
pickle_tables,
num_workers,
verbose,
)
requested_tables = requested_misc_tables + requested_tables

# filter media
if media is not None or tables is not None:
db.pick_files(requested_media)
# filter tables

issue (bug_risk): Potential for incomplete state if .complete file is created before all data is written.

Ensure the .complete file is only created after all data and metadata are fully written and flushed to disk to prevent incorrect assumptions about database completeness.
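
A sketch of the ordering this comment asks for, under stated assumptions: `write_and_sync` and `finish_load` are hypothetical helpers and the file names are made up. Each data file is flushed and synced first, and the marker is written last.

```python
import os
import tempfile


def write_and_sync(path, data):
    """Write data to path and flush it to disk before returning."""
    with open(path, "wb") as fp:
        fp.write(data)
        fp.flush()             # push Python's buffer to the OS
        os.fsync(fp.fileno())  # ask the OS to write to the device


def finish_load(db_root, files):
    """Write all data files, then create the marker as the final step."""
    for name, data in files.items():
        write_and_sync(os.path.join(db_root, name), data)
    # Marker goes last, only after every data file has been synced
    write_and_sync(os.path.join(db_root, ".complete"), b"")


root = tempfile.mkdtemp()
finish_load(root, {"db.yaml": b"name: demo\n", "table.csv": b"a,b\n"})
```

If a process crashes midway through `finish_load`, the marker is absent and the next caller falls back to the locked path.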

# Mock other dependencies
with mock.patch("audb.core.load.dependencies") as mock_deps:
    # Create a proper mock dependencies object
    mock_dep_instance = mock.Mock()

issue (code-quality): Extract code out into function (extract-method)

# Mock other dependencies
with mock.patch("audb.core.load.dependencies") as mock_deps:
    # Create a proper mock dependencies object
    mock_dep_instance = mock.Mock()

issue (code-quality): Extract code out into function (extract-method)

Development

Successfully merging this pull request may close these issues.

Do not lock cache folder for complete database

2 participants