-
Notifications
You must be signed in to change notification settings - Fork 2
Add .complete file to db root #514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideThis PR introduces a persistent “.complete” marker file to avoid redundant locking once a database has been fully loaded. It adds a constant for the marker, ensures it’s created at the end of loading, updates the completeness check and load routines to look for the file (skipping locks when present, with a race-safe recheck), and provides a full test suite to validate these behaviors. Sequence diagram for database loading with .complete file checksequenceDiagram
participant User
participant load as load()
participant OS as os.path
participant Lock as FolderLock
participant define as define.COMPLETE_FILE
User->>load: load()
load->>OS: os.path.exists(.complete)
alt .complete exists
load->>load: load_header_to(...)
load->>User: return db (no lock)
else .complete does not exist
load->>Lock: acquire FolderLock
Lock-->>load: lock acquired
load->>load: load_header_to(...)
load->>OS: os.path.exists(.complete)
alt .complete now exists
load->>User: return db
else
load->>load: _database_is_complete()
alt database is complete
load->>OS: audeer.touch(.complete)
end
load->>User: return db
end
end
Class diagram for .complete file integration in database loadingclassDiagram
class define {
+LOCK_FILE: str
+COMPLETE_FILE: str
}
class load {
+load()
+_database_is_complete()
+_database_check_complete()
}
define <.. load : uses
load : +load() checks for .complete file
load : +_database_is_complete() checks for .complete file
load : creates .complete file on completion
File-Level Changes
Assessment against linked issues
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @hagenw - I've reviewed your changes - here's some feedback:
- Factor out the repeated “.complete” check and lock-skip logic into a shared helper to reduce duplication between load() and load_media().
- Ensure that any stale .complete file is removed or refreshed at the start of a load so you don’t accidentally skip necessary work from a previous incomplete run.
- Wrap the creation of the .complete file in the same lock or a rollback so that it isn’t left behind if an error occurs during loading.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Factor out the repeated “.complete” check and lock-skip logic into a shared helper to reduce duplication between load() and load_media().
- Ensure that any stale .complete file is removed or refreshed at the start of a load so you don’t accidentally skip necessary work from a previous incomplete run.
- Wrap the creation of the .complete file in the same lock or a rollback so that it isn’t left behind if an error occurs during loading.
## Individual Comments
### Comment 1
<location> `audb/core/load.py:196` </location>
<code_context>
)
audeer.rmdir(db_root_tmp)
+ # Create .complete file to signal completion
+ complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+ audeer.touch(complete_file)
+
</code_context>
<issue_to_address>
Consider atomic file creation for the .complete file.
audeer.touch may not guarantee atomicity, risking race conditions if multiple processes run concurrently. Use an atomic method, like writing to a temp file and renaming, to ensure .complete is reliably created after completion.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# Create .complete file to signal completion
complete_file = os.path.join(db_root, define.COMPLETE_FILE)
audeer.touch(complete_file)
=======
# Atomically create .complete file to signal completion
complete_file = os.path.join(db_root, define.COMPLETE_FILE)
tmp_complete_file = complete_file + ".tmp"
with open(tmp_complete_file, "w") as f:
pass # create empty file
os.replace(tmp_complete_file, complete_file)
>>>>>>> REPLACE
</suggested_fix>
### Comment 2
<location> `audb/core/load.py:205` </location>
<code_context>
db: audformat.Database,
) -> bool:
+ # First check for .complete file
+ if "audb" in db.meta and "root" in db.meta["audb"]:
+ complete_file = os.path.join(db.meta["audb"]["root"], define.COMPLETE_FILE)
+ if os.path.exists(complete_file):
+ return True
+
</code_context>
<issue_to_address>
Checking for the .complete file before metadata may mask incomplete states.
A stale or incorrectly created .complete file could cause a false positive. Consider adding validation to ensure the .complete file reliably indicates database completeness.
</issue_to_address>
### Comment 3
<location> `audb/core/load.py:207` </location>
<code_context>
+ # First check for .complete file
+ if "audb" in db.meta and "root" in db.meta["audb"]:
+ complete_file = os.path.join(db.meta["audb"]["root"], define.COMPLETE_FILE)
+ if os.path.exists(complete_file):
+ return True
+
</code_context>
<issue_to_address>
Bypassing the FolderLock when .complete exists could allow concurrent modifications.
This approach may introduce race conditions if .complete is created before all data is written. Evaluate whether the lock should still be used for reads or if other safeguards are necessary.
</issue_to_address>
### Comment 4
<location> `audb/core/load.py:1190` </location>
<code_context>
- cached_versions = _load_files(
- _tables,
- "table",
+ # Double-check completion status after acquiring lock
+ complete_file = os.path.join(db_root, define.COMPLETE_FILE)
+ if os.path.exists(complete_file):
+ db_is_complete = True
+ else:
</code_context>
<issue_to_address>
Double-checking .complete after acquiring the lock may not be sufficient if file creation is not atomic.
Using non-atomic file creation for .complete can lead to race conditions where other processes detect the file before the database is fully ready. Use atomic file creation or a more reliable signaling method to prevent this issue.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# Double-check completion status after acquiring lock
complete_file = os.path.join(db_root, define.COMPLETE_FILE)
if os.path.exists(complete_file):
db_is_complete = True
else:
=======
# Double-check completion status after acquiring lock using atomic file creation
complete_file = os.path.join(db_root, define.COMPLETE_FILE)
try:
# Try to atomically create the .complete file
fd = os.open(complete_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
with os.fdopen(fd, "w") as f:
f.write("complete\n")
db_is_complete = False # We are the first to create it, so not yet complete
except FileExistsError:
# If file exists, another process has completed the database
db_is_complete = True
>>>>>>> REPLACE
</suggested_fix>
### Comment 5
<location> `audb/core/load.py:1203` </location>
<code_context>
- ]
-
- # load missing tables
- if not db_is_complete:
- for _tables in [
- requested_misc_tables,
</code_context>
<issue_to_address>
Potential for incomplete state if .complete file is created before all data is written.
Ensure the .complete file is only created after all data and metadata are fully written and flushed to disk to prevent incorrect assumptions about database completeness.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| # Create .complete file to signal completion | ||
| complete_file = os.path.join(db_root, define.COMPLETE_FILE) | ||
| audeer.touch(complete_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): Consider atomic file creation for the .complete file.
audeer.touch may not guarantee atomicity, risking race conditions if multiple processes run concurrently. Use an atomic method, like writing to a temp file and renaming, to ensure .complete is reliably created after completion.
| # Create .complete file to signal completion | |
| complete_file = os.path.join(db_root, define.COMPLETE_FILE) | |
| audeer.touch(complete_file) | |
| # Atomically create .complete file to signal completion | |
| complete_file = os.path.join(db_root, define.COMPLETE_FILE) | |
| tmp_complete_file = complete_file + ".tmp" | |
| with open(tmp_complete_file, "w") as f: | |
| pass # create empty file | |
| os.replace(tmp_complete_file, complete_file) |
| if "audb" in db.meta and "root" in db.meta["audb"]: | ||
| complete_file = os.path.join(db.meta["audb"]["root"], define.COMPLETE_FILE) | ||
| if os.path.exists(complete_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): Checking for the .complete file before metadata may mask incomplete states.
A stale or incorrectly created .complete file could cause a false positive. Consider adding validation to ensure the .complete file reliably indicates database completeness.
| ) | ||
|
|
||
| try: | ||
| with FolderLock(db_root, timeout=timeout): | ||
| # Start with database header without tables | ||
| # Check if database is already complete by looking for .complete file | ||
| complete_file = os.path.join(db_root, define.COMPLETE_FILE) | ||
| if os.path.exists(complete_file): | ||
| # Database is complete, no need to lock | ||
| db, backend_interface = load_header_to( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): Bypassing the FolderLock when .complete exists could allow concurrent modifications.
This approach may introduce race conditions if .complete is created before all data is written. Evaluate whether the lock should still be used for reads or if other safeguards are necessary.
| # Double-check completion status after acquiring lock | ||
| complete_file = os.path.join(db_root, define.COMPLETE_FILE) | ||
| if os.path.exists(complete_file): | ||
| db_is_complete = True | ||
| else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): Double-checking .complete after acquiring the lock may not be sufficient if file creation is not atomic.
Using non-atomic file creation for .complete can lead to race conditions where other processes detect the file before the database is fully ready. Use atomic file creation or a more reliable signaling method to prevent this issue.
| # Double-check completion status after acquiring lock | |
| complete_file = os.path.join(db_root, define.COMPLETE_FILE) | |
| if os.path.exists(complete_file): | |
| db_is_complete = True | |
| else: | |
| # Double-check completion status after acquiring lock using atomic file creation | |
| complete_file = os.path.join(db_root, define.COMPLETE_FILE) | |
| try: | |
| # Try to atomically create the .complete file | |
| fd = os.open(complete_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY) | |
| with os.fdopen(fd, "w") as f: | |
| f.write("complete\n") | |
| db_is_complete = False # We are the first to create it, so not yet complete | |
| except FileExistsError: | |
| # If file exists, another process has completed the database | |
| db_is_complete = True |
| deps, | ||
| flavor, | ||
| cache_root, | ||
| False, | ||
| pickle_tables, | ||
| num_workers, | ||
| verbose, | ||
| ) | ||
| requested_tables = requested_misc_tables + requested_tables | ||
|
|
||
| # filter media | ||
| if media is not None or tables is not None: | ||
| db.pick_files(requested_media) | ||
| # filter tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): Potential for incomplete state if .complete file is created before all data is written.
Ensure the .complete file is only created after all data and metadata are fully written and flushed to disk to prevent incorrect assumptions about database completeness.
| # Mock other dependencies | ||
| with mock.patch("audb.core.load.dependencies") as mock_deps: | ||
| # Create a proper mock dependencies object | ||
| mock_dep_instance = mock.Mock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Extract code out into function (extract-method)
| # Mock other dependencies | ||
| with mock.patch("audb.core.load.dependencies") as mock_deps: | ||
| # Create a proper mock dependencies object | ||
| mock_dep_instance = mock.Mock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Extract code out into function (extract-method)
Closes #197
Add a
.completefile to indicate if a database was completely loaded, so we do not need to acquire a lock in those cases.Summary by Sourcery
Add a .complete sentinel file to mark fully loaded databases, use it to bypass locking in load routines, update metadata checks, refactor loading logic for clarity, and add comprehensive tests for .complete behavior
New Features:
Enhancements:
Tests: