-
Notifications
You must be signed in to change notification settings - Fork 2
Be more explicit on creating tmp/cache folders #540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's guide (collapsed on small PRs)Reviewer's GuideRefactors cache path helpers so they only compute and return cache and temporary directories, and moves directory creation responsibilities to the higher-level loading and API functions that actually use those paths, plus a small test cleanup for cache removal. Sequence diagram for cache_root and tmp_root creation responsibilitysequenceDiagram
actor User
participant load as load
participant dependencies as dependencies
participant database_cache_root as database_cache_root
participant database_tmp_root as database_tmp_root
participant audeer_mkdir as audeer_mkdir
User->>load: call load(name, version, cache_root, flavor)
load->>database_cache_root: database_cache_root(name, version, cache_root, flavor)
database_cache_root-->>load: db_root
load->>audeer_mkdir: mkdir(db_root)
User->>load: call load_media/load_table/load_attachment/load_header
load->>database_cache_root: database_cache_root(...)
database_cache_root-->>load: db_root
load->>audeer_mkdir: mkdir(db_root)
User->>dependencies: call dependencies(name, version, cache_root)
dependencies->>database_cache_root: database_cache_root(name, version, cache_root, flavor=None)
database_cache_root-->>dependencies: db_root
dependencies->>audeer_mkdir: mkdir(db_root)
note over load,dependencies: database_cache_root only computes and returns path;
note over load,dependencies: callers are responsible for creating directories
Sequence diagram for temporary cache directory usagesequenceDiagram
actor User
participant load as load
participant load_to as load_to
participant backend as backend_interface
participant database_tmp_root as database_tmp_root
participant audeer_mkdir as audeer_mkdir
User->>load: call load(...)
load->>database_tmp_root: database_tmp_root(db_root)
database_tmp_root-->>load: db_root_tmp
load->>audeer_mkdir: mkdir(db_root_tmp)
load->>backend: download tables/attachments to db_root_tmp
User->>load_to: call load_to(root, ...)
load_to->>database_tmp_root: database_tmp_root(db_root)
database_tmp_root-->>load_to: db_root_tmp
load_to->>audeer_mkdir: mkdir(db_root_tmp)
load_to->>backend: get files into db_root_tmp
User->>load: call load_header(..., add_audb_meta=True)
load->>database_tmp_root: database_tmp_root(db_root)
database_tmp_root-->>load: db_root_tmp
load->>audeer_mkdir: mkdir(db_root_tmp)
load->>backend: get_file(remote_header, local_header)
note over load,load_to: database_tmp_root only returns db_root + "~";
note over load,load_to: callers now explicitly create tmp directory
Flow diagram for cache and tmp directory handling after refactorflowchart TD
A[Start load or dependencies call] --> B[Compute db_root via database_cache_root]
B --> C[Create db_root via audeer.mkdir]
C --> D{Need temporary cache?}
D -->|Yes| E[Compute db_root_tmp via database_tmp_root]
E --> F[Create db_root_tmp via audeer.mkdir]
F --> G[Use db_root_tmp for downloads and processing]
D -->|No| H[Use db_root directly]
G --> I[Move or finalize data into db_root]
H --> I[Read/write data in db_root]
I --> J[End]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've found 2 issues, and left some high level feedback:
- Now that
database_cache_root()anddatabase_tmp_root()return paths without creating directories, there is repeatedaudeer.mkdir(...)logic at many call sites; consider centralizing this behavior in a small helper or a dedicated wrapper to avoid future call sites accidentally omitting the directory creation. - The updated docstrings for
database_cache_root()anddatabase_tmp_root()no longer indicate that the returned paths may not exist; consider explicitly stating that these functions only compute the path and that callers are responsible for creating the directory to make the new behavior clearer and harder to misuse.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Now that `database_cache_root()` and `database_tmp_root()` return paths without creating directories, there is repeated `audeer.mkdir(...)` logic at many call sites; consider centralizing this behavior in a small helper or a dedicated wrapper to avoid future call sites accidentally omitting the directory creation.
- The updated docstrings for `database_cache_root()` and `database_tmp_root()` no longer indicate that the returned paths may not exist; consider explicitly stating that these functions only compute the path and that callers are responsible for creating the directory to make the new behavior clearer and harder to misuse.
## Individual Comments
### Comment 1
<location> `audb/core/load.py:277-278` </location>
<code_context>
)
missing_attachments = [deps.archive(path) for path in missing_paths]
db_root_tmp = database_tmp_root(db_root)
+ audeer.mkdir(db_root_tmp)
def job(cache_root: str, file: str):
</code_context>
<issue_to_address>
**suggestion:** The repeated pattern of `database_tmp_root()` + `audeer.mkdir()` could be encapsulated to reduce duplication and future errors.
This pattern (`db_root_tmp = database_tmp_root(db_root)` then `audeer.mkdir(db_root_tmp)`) appears in multiple places (`load.py`, `load_to.py`, `load_header_to()`). To avoid repetition and reduce the chance of callers forgetting `mkdir`, consider a helper like `ensure_database_tmp_root(db_root)` that both computes and creates the temp directory and returns the path.
Suggested implementation:
```python
missing_attachments = [deps.archive(path) for path in missing_paths]
def ensure_database_tmp_root(db_root: str) -> str:
db_root_tmp = database_tmp_root(db_root)
audeer.mkdir(db_root_tmp)
return db_root_tmp
db_root_tmp = ensure_database_tmp_root(db_root)
```
```python
verbose,
)
db_root_tmp = ensure_database_tmp_root(db_root)
```
To fully apply the refactoring across the codebase:
1. Search in `audb/core/load.py`, `audb/core/load_to.py`, and `audb/core/load_header_to.py` for the pattern:
- `db_root_tmp = database_tmp_root(db_root)` followed by `audeer.mkdir(db_root_tmp)`.
2. In `audb/core/load.py`, if you prefer a module-level helper instead of a nested function, move `ensure_database_tmp_root()` to the module level and adjust its indentation, then replace all occurrences of the two-line pattern with `db_root_tmp = ensure_database_tmp_root(db_root)`.
3. In `audb/core/load_to.py` and `audb/core/load_header_to.py`, either:
- Import `ensure_database_tmp_root` from `audb.core.load` (if appropriate for your dependency structure), or
- Introduce an equivalent helper there and replace the repeated pattern with a single call.
</issue_to_address>
### Comment 2
<location> `audb/core/load.py:1142-1143` </location>
<code_context>
sampling_rate=sampling_rate,
)
db_root = database_cache_root(name, version, cache_root, flavor)
+ audeer.mkdir(db_root)
if verbose: # pragma: no cover
</code_context>
<issue_to_address>
**suggestion:** Multiple places now explicitly call `audeer.mkdir(db_root)` after `database_cache_root()`, which could be centralized.
In `load()`, `load_attachment()`, `load_header()`, `load_media()`, `load_table()`, and `dependencies()`, the pattern `db_root = database_cache_root(...); audeer.mkdir(db_root)` is duplicated. Consider introducing a helper like `ensure_database_cache_root(...)` that returns an existing-or-created path to remove boilerplate and avoid future call sites omitting the `mkdir` step.
Suggested implementation:
```python
"""
def ensure_database_cache_root(*args, **kwargs):
db_root = database_cache_root(*args, **kwargs)
audeer.mkdir(db_root)
return db_root
db_root_tmp = database_tmp_root(db_root)
```
```python
csv_file = f"db.{table}.csv"
sampling_rate=sampling_rate,
)
db_root = ensure_database_cache_root(name, version, cache_root, flavor)
if verbose: # pragma: no cover
print(f"Get: {name} v{version}")
version = latest_version(name)
db_root = ensure_database_cache_root(name, version, cache_root)
if verbose: # pragma: no cover
print(f"Get: {name} v{version}")
version = latest_version(name)
```
To fully apply the refactoring across the module, update all other call sites that follow the pattern:
db_root = database_cache_root(...)
audeer.mkdir(db_root)
to instead use:
db_root = ensure_database_cache_root(...)
Specifically, in `load()`, `load_attachment()`, `load_header()`, `load_media()`, `load_table()`, and `dependencies()`, search for `database_cache_root(` followed by `audeer.mkdir(db_root)` and replace those two lines with a single call to `ensure_database_cache_root()` passing through the same arguments used for `database_cache_root()`.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| db_root_tmp = database_tmp_root(db_root) | ||
| audeer.mkdir(db_root_tmp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: The repeated pattern of database_tmp_root() + audeer.mkdir() could be encapsulated to reduce duplication and future errors.
This pattern (db_root_tmp = database_tmp_root(db_root) then audeer.mkdir(db_root_tmp)) appears in multiple places (load.py, load_to.py, load_header_to()). To avoid repetition and reduce the chance of callers forgetting mkdir, consider a helper like ensure_database_tmp_root(db_root) that both computes and creates the temp directory and returns the path.
Suggested implementation:
missing_attachments = [deps.archive(path) for path in missing_paths]
def ensure_database_tmp_root(db_root: str) -> str:
db_root_tmp = database_tmp_root(db_root)
audeer.mkdir(db_root_tmp)
return db_root_tmp
db_root_tmp = ensure_database_tmp_root(db_root) verbose,
)
db_root_tmp = ensure_database_tmp_root(db_root)To fully apply the refactoring across the codebase:
- Search in
audb/core/load.py,audb/core/load_to.py, andaudb/core/load_header_to.pyfor the pattern:db_root_tmp = database_tmp_root(db_root)followed byaudeer.mkdir(db_root_tmp).
- In
audb/core/load.py, if you prefer a module-level helper instead of a nested function, moveensure_database_tmp_root()to the module level and adjust its indentation, then replace all occurrences of the two-line pattern withdb_root_tmp = ensure_database_tmp_root(db_root). - In
audb/core/load_to.pyandaudb/core/load_header_to.py, either:- Import
ensure_database_tmp_rootfromaudb.core.load(if appropriate for your dependency structure), or - Introduce an equivalent helper there and replace the repeated pattern with a single call.
- Import
| db_root = database_cache_root(name, version, cache_root, flavor) | ||
| audeer.mkdir(db_root) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Multiple places now explicitly call audeer.mkdir(db_root) after database_cache_root(), which could be centralized.
In load(), load_attachment(), load_header(), load_media(), load_table(), and dependencies(), the pattern db_root = database_cache_root(...); audeer.mkdir(db_root) is duplicated. Consider introducing a helper like ensure_database_cache_root(...) that returns an existing-or-created path to remove boilerplate and avoid future call sites omitting the mkdir step.
Suggested implementation:
"""
def ensure_database_cache_root(*args, **kwargs):
db_root = database_cache_root(*args, **kwargs)
audeer.mkdir(db_root)
return db_root
db_root_tmp = database_tmp_root(db_root) csv_file = f"db.{table}.csv"
sampling_rate=sampling_rate,
)
db_root = ensure_database_cache_root(name, version, cache_root, flavor)
if verbose: # pragma: no cover
print(f"Get: {name} v{version}")
version = latest_version(name)
db_root = ensure_database_cache_root(name, version, cache_root)
if verbose: # pragma: no cover
print(f"Get: {name} v{version}")
version = latest_version(name)To fully apply the refactoring across the module, update all other call sites that follow the pattern:
db_root = database_cache_root(...)
audeer.mkdir(db_root)
to instead use:
db_root = ensure_database_cache_root(...)
Specifically, in load(), load_attachment(), load_header(), load_media(), load_table(), and dependencies(), search for database_cache_root( followed by audeer.mkdir(db_root) and replace those two lines with a single call to ensure_database_cache_root() passing through the same arguments used for database_cache_root().
This updates the private functions
audb.core.cache.database_tmp_root()andaudb.core.cache.database_cache_root()to not create the tmp/cache folder, but just return its path.This pull request is only relevant if we decide to use it in #538
Summary by Sourcery
Make database cache and temporary cache directory creation explicit at call sites instead of being handled implicitly by cache helper functions.
Enhancements:
Tests: