Add .complete file to db root #514

hagenw · 2025-07-18T15:02:58Z

Closes #197

Add a .complete file to indicate if a database was completely loaded, so we do not need to acquire a lock in those cases.

Summary by Sourcery

Add a .complete sentinel file to mark fully loaded databases, use it to bypass locking in load routines, update metadata checks, refactor loading logic for clarity, and add comprehensive tests for .complete behavior.

New Features:

- Add .complete file to database root to signal completed loading
- Skip acquiring file lock in load functions when .complete file exists

Enhancements:

- Prioritize .complete file over metadata in _database_is_complete
- Refactor load and load_media to early-exit locking logic and consolidate loading flow
- Define COMPLETE_FILE constant in project definitions

Tests:

- Add tests for .complete file creation, detection, fallback, constant definition, and concurrent completion
- Add tests ensuring load functions skip or use locking appropriately based on .complete file

sourcery-ai · 2025-07-18T15:03:03Z

Reviewer's Guide

This PR introduces a persistent “.complete” marker file to avoid redundant locking once a database has been fully loaded. It adds a constant for the marker, ensures it’s created at the end of loading, updates the completeness check and load routines to look for the file (skipping locks when present, with a race-safe recheck), and provides a full test suite to validate these behaviors.

Sequence diagram for database loading with .complete file check

sequenceDiagram
    participant User
    participant load as load()
    participant OS as os.path
    participant Lock as FolderLock
    participant define as define.COMPLETE_FILE

    User->>load: load()
    load->>OS: os.path.exists(.complete)
    alt .complete exists
        load->>load: load_header_to(...)
        load->>User: return db (no lock)
    else .complete does not exist
        load->>Lock: acquire FolderLock
        Lock-->>load: lock acquired
        load->>load: load_header_to(...)
        load->>OS: os.path.exists(.complete)
        alt .complete now exists
            load->>User: return db
        else
            load->>load: _database_is_complete()
            alt database is complete
                load->>OS: audeer.touch(.complete)
            end
            load->>User: return db
        end
    end
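
The flow in the sequence diagram can be sketched in a few lines. This is a minimal sketch and not the code from the PR: `load_with_marker`, `fetch`, and `is_complete` are hypothetical names, and a `threading.Lock` stands in for `FolderLock`. Only the `.complete` file name is taken from the PR. Header loading, which the diagram performs on both paths, is left out.

```python
import os
import threading

COMPLETE_FILE = ".complete"  # marker name as defined in the PR

# Stand-in for FolderLock (assumption: any mutual-exclusion lock illustrates the idea)
_folder_lock = threading.Lock()


def load_with_marker(db_root, fetch, is_complete):
    """Run ``fetch`` under a lock unless the marker file already exists.

    ``fetch`` and ``is_complete`` are hypothetical callables:
    ``fetch(db_root)`` loads missing files,
    ``is_complete(db_root)`` reports whether everything is cached.
    Returns True if the lock was acquired, False otherwise.
    """
    marker = os.path.join(db_root, COMPLETE_FILE)
    if os.path.exists(marker):
        return False  # fast path: marker present, no lock needed
    with _folder_lock:
        # Re-check inside the lock: another worker may have finished meanwhile
        if not os.path.exists(marker):
            fetch(db_root)
            if is_complete(db_root):
                # Create the marker only after all work is done
                open(marker, "a").close()
    return True
```

The first call on an empty folder takes the lock and creates the marker, and later calls return without locking.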

Class diagram for .complete file integration in database loading

classDiagram
    class define {
        +LOCK_FILE: str
        +COMPLETE_FILE: str
    }
    class load {
        +load()
        +_database_is_complete()
        +_database_check_complete()
    }
    define <.. load : uses
    load : +load() checks for .complete file
    load : +_database_is_complete() checks for .complete file
    load : creates .complete file on completion

File-Level Changes

Change	Details	Files
Define COMPLETE_FILE constant	Add COMPLETE_FILE = '.complete' constant to project definitions	`audb/core/define.py`
Create .complete file to signal full database load	Touch .complete in check() after cleaning up temporary directory	`audb/core/load.py`
Detect .complete file and adjust locking logic	Early exit in _database_is_complete upon marker presence; pre-check .complete in load() and load_media() to skip FolderLock; recheck marker inside lock to handle race conditions	`audb/core/load.py`
Add tests for .complete functionality	Verify file creation, detection, and metadata update; test skip-lock behavior and concurrent race handling; cover fallback to metadata and missing-files scenarios	`tests/test_complete_file.py`

Assessment against linked issues

Issue	Objective	Addressed
#197	Implement a mechanism to signal that a database in the cache folder is complete, without requiring access to db.yaml.	✅
#197	Modify the locking logic so that the cache folder is only locked if the database is not complete (i.e., the .complete file does not exist).	✅
#197	Add or update tests to verify that the .complete file mechanism and locking behavior work as intended and avoid race conditions.	✅

Possibly linked issues

Do not lock cache folder for complete database #197: The PR adds a .complete file to signal a complete database, which addresses the issue's concerns about incomplete status and scans.

sourcery-ai

Hey @hagenw - I've reviewed your changes - here's some feedback:

Factor out the repeated “.complete” check and lock-skip logic into a shared helper to reduce duplication between load() and load_media().
Ensure that any stale .complete file is removed or refreshed at the start of a load so you don’t accidentally skip necessary work from a previous incomplete run.
Wrap the creation of the .complete file in the same lock or a rollback so that it isn’t left behind if an error occurs during loading.
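
The first point, a shared helper for the marker check and lock skip, could look roughly like the following. This is a sketch under assumptions: `lock_unless_complete` and `make_lock` are hypothetical names that do not appear in the PR, and the caller passes in whatever lock it uses (for example a factory that builds a `FolderLock`).

```python
import contextlib
import os
import threading

COMPLETE_FILE = ".complete"  # marker name as defined in the PR


@contextlib.contextmanager
def lock_unless_complete(db_root, make_lock):
    """Yield True if a lock was taken, False if the marker allowed skipping it.

    ``make_lock`` is a hypothetical zero-argument factory
    that returns a context manager, for example ``threading.Lock``.
    """
    marker = os.path.join(db_root, COMPLETE_FILE)
    if os.path.exists(marker):
        yield False  # marker present: run the body without locking
        return
    with make_lock():
        yield True  # marker missing: run the body while holding the lock
```

Both `load()` and `load_media()` could then wrap their bodies in `with lock_unless_complete(db_root, ...) as locked:`. When `locked` is true, the body would still need to re-check the marker, since another process may have finished while this one waited for the lock.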

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `audb/core/load.py:196` </location>
<code_context>
         )
         audeer.rmdir(db_root_tmp)

+        # Create .complete file to signal completion
+        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+        audeer.touch(complete_file)
+

</code_context>

<issue_to_address>
Consider atomic file creation for the .complete file.

audeer.touch may not guarantee atomicity, risking race conditions if multiple processes run concurrently. Use an atomic method, like writing to a temp file and renaming, to ensure .complete is reliably created after completion.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        # Create .complete file to signal completion
        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
        audeer.touch(complete_file)
=======
        # Atomically create .complete file to signal completion
        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
        tmp_complete_file = complete_file + ".tmp"
        with open(tmp_complete_file, "w") as f:
            pass  # create empty file
        os.replace(tmp_complete_file, complete_file)
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `audb/core/load.py:205` </location>
<code_context>
     db: audformat.Database,
 ) -> bool:
+    # First check for .complete file
+    if "audb" in db.meta and "root" in db.meta["audb"]:
+        complete_file = os.path.join(db.meta["audb"]["root"], define.COMPLETE_FILE)
+        if os.path.exists(complete_file):
+            return True
+
</code_context>

<issue_to_address>
Checking for the .complete file before metadata may mask incomplete states.

A stale or incorrectly created .complete file could cause a false positive. Consider adding validation to ensure the .complete file reliably indicates database completeness.
</issue_to_address>

### Comment 3
<location> `audb/core/load.py:207` </location>
<code_context>
+    # First check for .complete file
+    if "audb" in db.meta and "root" in db.meta["audb"]:
+        complete_file = os.path.join(db.meta["audb"]["root"], define.COMPLETE_FILE)
+        if os.path.exists(complete_file):
+            return True
+
</code_context>

<issue_to_address>
Bypassing the FolderLock when .complete exists could allow concurrent modifications.

This approach may introduce race conditions if .complete is created before all data is written. Evaluate whether the lock should still be used for reads or if other safeguards are necessary.
</issue_to_address>

### Comment 4
<location> `audb/core/load.py:1190` </location>
<code_context>
-                    cached_versions = _load_files(
-                        _tables,
-                        "table",
+                # Double-check completion status after acquiring lock
+                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+                if os.path.exists(complete_file):
+                    db_is_complete = True
+                else:
</code_context>

<issue_to_address>
Double-checking .complete after acquiring the lock may not be sufficient if file creation is not atomic.

Using non-atomic file creation for .complete can lead to race conditions where other processes detect the file before the database is fully ready. Use atomic file creation or a more reliable signaling method to prevent this issue.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
                # Double-check completion status after acquiring lock
                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
                if os.path.exists(complete_file):
                    db_is_complete = True
                else:
=======
                # Double-check completion status after acquiring lock using atomic file creation
                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
                try:
                    # Try to atomically create the .complete file
                    fd = os.open(complete_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                    with os.fdopen(fd, "w") as f:
                        f.write("complete\n")
                    db_is_complete = False  # We are the first to create it, so not yet complete
                except FileExistsError:
                    # If file exists, another process has completed the database
                    db_is_complete = True
>>>>>>> REPLACE

</suggested_fix>

### Comment 5
<location> `audb/core/load.py:1203` </location>
<code_context>
-            ]
-
-            # load missing tables
-            if not db_is_complete:
-                for _tables in [
-                    requested_misc_tables,
</code_context>

<issue_to_address>
Potential for incomplete state if .complete file is created before all data is written.

Ensure the .complete file is only created after all data and metadata are fully written and flushed to disk to prevent incorrect assumptions about database completeness.
</issue_to_address>

sourcery-ai · 2025-07-18T15:04:04Z

audb/core/load.py

+        # Create .complete file to signal completion
+        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+        audeer.touch(complete_file)


suggestion (bug_risk): Consider atomic file creation for the .complete file.

audeer.touch may not guarantee atomicity, risking race conditions if multiple processes run concurrently. Use an atomic method, like writing to a temp file and renaming, to ensure .complete is reliably created after completion.

Suggested change

-        # Create .complete file to signal completion
-        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
-        audeer.touch(complete_file)
+        # Atomically create .complete file to signal completion
+        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+        tmp_complete_file = complete_file + ".tmp"
+        with open(tmp_complete_file, "w") as f:
+            pass  # create empty file
+        os.replace(tmp_complete_file, complete_file)
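
A self-contained variant of the temp-file-and-rename idea is sketched below. It is not part of the PR: `create_marker_atomically` is a hypothetical helper, and it uses `tempfile.mkstemp` for a unique temporary name, so that two processes do not write to the same fixed `.tmp` path. `os.replace` is documented as atomic on POSIX systems. Behavior on other platforms is an assumption to verify.

```python
import os
import tempfile


def create_marker_atomically(path):
    """Create an empty file at ``path`` via a temporary file and ``os.replace``."""
    directory = os.path.dirname(path) or "."
    # Temporary file in the same directory, so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".complete-", suffix=".tmp")
    try:
        os.close(fd)
        os.replace(tmp_path, path)  # rename; overwrites an existing marker
    except BaseException:
        # Do not leave the temporary file behind on failure
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```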

sourcery-ai · 2025-07-18T15:04:04Z

audb/core/load.py

+    if "audb" in db.meta and "root" in db.meta["audb"]:
+        complete_file = os.path.join(db.meta["audb"]["root"], define.COMPLETE_FILE)
+        if os.path.exists(complete_file):


issue (bug_risk): Checking for the .complete file before metadata may mask incomplete states.

A stale or incorrectly created .complete file could cause a false positive. Consider adding validation to ensure the .complete file reliably indicates database completeness.
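
To make the order of checks concrete, here is a minimal sketch of a marker-first check with a metadata fallback. It uses a plain dict in place of `db.meta`. The `"complete"` key is an assumption made for this sketch, and `database_is_complete` is a hypothetical stand-in for the function under review.

```python
import os

COMPLETE_FILE = ".complete"  # marker name as defined in the PR


def database_is_complete(meta):
    """Return True if the marker exists, else fall back to a metadata flag.

    ``meta`` is a plain dict standing in for ``db.meta``;
    the ``"complete"`` key is an assumption made for this sketch.
    """
    audb_meta = meta.get("audb", {})
    root = audb_meta.get("root")
    if root is not None and os.path.exists(os.path.join(root, COMPLETE_FILE)):
        return True  # the marker wins, whatever the metadata says
    return bool(audb_meta.get("complete", False))
```

The early return shows the concern raised here: once the marker exists, the metadata is no longer consulted, so a stale marker would be reported as a complete database.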

sourcery-ai · 2025-07-18T15:04:04Z

audb/core/load.py

    )

    try:
-        with FolderLock(db_root, timeout=timeout):
-            # Start with database header without tables
+        # Check if database is already complete by looking for .complete file
+        complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+        if os.path.exists(complete_file):
+            # Database is complete, no need to lock
            db, backend_interface = load_header_to(


issue (bug_risk): Bypassing the FolderLock when .complete exists could allow concurrent modifications.

This approach may introduce race conditions if .complete is created before all data is written. Evaluate whether the lock should still be used for reads or if other safeguards are necessary.

sourcery-ai · 2025-07-18T15:04:04Z

audb/core/load.py

+                # Double-check completion status after acquiring lock
+                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+                if os.path.exists(complete_file):
+                    db_is_complete = True
+                else:


suggestion (bug_risk): Double-checking .complete after acquiring the lock may not be sufficient if file creation is not atomic.

Using non-atomic file creation for .complete can lead to race conditions where other processes detect the file before the database is fully ready. Use atomic file creation or a more reliable signaling method to prevent this issue.

Suggested change

-                # Double-check completion status after acquiring lock
-                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
-                if os.path.exists(complete_file):
-                    db_is_complete = True
-                else:
+                # Double-check completion status after acquiring lock using atomic file creation
+                complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+                try:
+                    # Try to atomically create the .complete file
+                    fd = os.open(complete_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
+                    with os.fdopen(fd, "w") as f:
+                        f.write("complete\n")
+                    db_is_complete = False  # We are the first to create it, so not yet complete
+                except FileExistsError:
+                    # If file exists, another process has completed the database
+                    db_is_complete = True
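
The semantics of `os.O_CREAT | os.O_EXCL` can be isolated in a small helper. This is an illustrative sketch, and `try_create_exclusive` is a hypothetical name: the call succeeds for exactly one caller and raises `FileExistsError` for everyone else.

```python
import os


def try_create_exclusive(path):
    """Return True if this call created ``path``, False if it already existed."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another caller created the file first
    os.close(fd)
    return True
```

Exclusive creation tells a caller whether it was first, not whether the data is loaded. As written, the suggested change would create the marker before the missing files are fetched, so this pattern seems better suited to claiming work than to signaling completion.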

sourcery-ai · 2025-07-18T15:04:04Z

audb/core/load.py

                    deps,
                    flavor,
                    cache_root,
-                    False,
+                    pickle_tables,
                    num_workers,
                    verbose,
                )
+        requested_tables = requested_misc_tables + requested_tables

-            # filter media
-            if media is not None or tables is not None:
-                db.pick_files(requested_media)
+        # filter tables


issue (bug_risk): Potential for incomplete state if .complete file is created before all data is written.

Ensure the .complete file is only created after all data and metadata are fully written and flushed to disk to prevent incorrect assumptions about database completeness.
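
One way to read "written and flushed to disk" is to call `os.fsync` on every written file before the marker is created, and to create the marker as the very last step. The sketch below is not taken from the PR. It assumes a POSIX system where a directory can be opened and synced, and `fsync_path`, `finalize`, and `written_files` are hypothetical names.

```python
import os


def fsync_path(path):
    """Flush a file or directory to disk (directory sync assumes POSIX)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def finalize(db_root, written_files, marker_name=".complete"):
    """Flush all written files, then create and flush the marker last."""
    for file in written_files:
        fsync_path(file)
    marker = os.path.join(db_root, marker_name)
    with open(marker, "w") as fp:
        fp.flush()
        os.fsync(fp.fileno())
    fsync_path(db_root)  # persist the directory entry of the marker
```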

sourcery-ai · 2025-07-18T15:04:04Z

tests/test_complete_file.py

+        # Mock other dependencies
+        with mock.patch("audb.core.load.dependencies") as mock_deps:
+            # Create a proper mock dependencies object
+            mock_dep_instance = mock.Mock()


issue (code-quality): Extract code out into function (extract-method)
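
The extract-method suggestion could be addressed by moving the repeated mock setup into a small context manager. This is a sketch only: `mocked_dependencies` is a hypothetical helper, and wiring the instance through `return_value` is an assumption about how the tests configure the mock.

```python
from contextlib import contextmanager
from unittest import mock


@contextmanager
def mocked_dependencies(target="audb.core.load.dependencies"):
    """Patch ``target`` and yield the mock object it returns when called."""
    with mock.patch(target) as mock_deps:
        mock_dep_instance = mock.Mock()
        # Assumption: the code under test calls the patched callable
        # and then works with the returned object
        mock_deps.return_value = mock_dep_instance
        yield mock_dep_instance
```

Each test would then open `with mocked_dependencies() as deps:` and configure only the attributes it needs.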

sourcery-ai · 2025-07-18T15:04:04Z

tests/test_complete_file.py

+        # Mock other dependencies
+        with mock.patch("audb.core.load.dependencies") as mock_deps:
+            # Create a proper mock dependencies object
+            mock_dep_instance = mock.Mock()


issue (code-quality): Extract code out into function (extract-method)

Add .complete file to db root

4f0747c

sourcery-ai bot reviewed Jul 18, 2025
