Skip to content

Conversation

@hagenw
Copy link
Member

@hagenw hagenw commented Jan 5, 2026

This updates the private functions audb.core.cache.database_tmp_root() and audb.core.cache.database_cache_root() to not create the tmp/cache folder, but just return its path.

This pull request is only relevant if we decide to use it in #538

Summary by Sourcery

Make database cache and temporary cache directory creation explicit at call sites instead of being handled implicitly by cache helper functions.

Enhancements:

  • Change database cache helper functions to only return cache and temporary cache paths without creating directories.
  • Ensure all loading, streaming, and dependency-resolution paths explicitly create required cache and temporary directories before use.

Tests:

  • Adjust media loading test teardown to safely remove cache directories only if they exist.

@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Jan 5, 2026

Reviewer's guide (collapsed on small PRs)

Reviewer's Guide

Refactors cache path helpers so they only compute and return cache and temporary directories, and moves directory creation responsibilities to the higher-level loading and API functions that actually use those paths, plus a small test cleanup for cache removal.

Sequence diagram for cache_root and tmp_root creation responsibility

sequenceDiagram
    actor User
    participant load as load
    participant dependencies as dependencies
    participant database_cache_root as database_cache_root
    participant database_tmp_root as database_tmp_root
    participant audeer_mkdir as audeer_mkdir

    User->>load: call load(name, version, cache_root, flavor)
    load->>database_cache_root: database_cache_root(name, version, cache_root, flavor)
    database_cache_root-->>load: db_root
    load->>audeer_mkdir: mkdir(db_root)

    User->>load: call load_media/load_table/load_attachment/load_header
    load->>database_cache_root: database_cache_root(...)
    database_cache_root-->>load: db_root
    load->>audeer_mkdir: mkdir(db_root)

    User->>dependencies: call dependencies(name, version, cache_root)
    dependencies->>database_cache_root: database_cache_root(name, version, cache_root, flavor=None)
    database_cache_root-->>dependencies: db_root
    dependencies->>audeer_mkdir: mkdir(db_root)

    note over load,dependencies: database_cache_root only computes and returns path;
    note over load,dependencies: callers are responsible for creating directories
Loading

Sequence diagram for temporary cache directory usage

sequenceDiagram
    actor User
    participant load as load
    participant load_to as load_to
    participant backend as backend_interface
    participant database_tmp_root as database_tmp_root
    participant audeer_mkdir as audeer_mkdir

    User->>load: call load(...)
    load->>database_tmp_root: database_tmp_root(db_root)
    database_tmp_root-->>load: db_root_tmp
    load->>audeer_mkdir: mkdir(db_root_tmp)
    load->>backend: download tables/attachments to db_root_tmp

    User->>load_to: call load_to(root, ...)
    load_to->>database_tmp_root: database_tmp_root(db_root)
    database_tmp_root-->>load_to: db_root_tmp
    load_to->>audeer_mkdir: mkdir(db_root_tmp)
    load_to->>backend: get files into db_root_tmp

    User->>load: call load_header(..., add_audb_meta=True)
    load->>database_tmp_root: database_tmp_root(db_root)
    database_tmp_root-->>load: db_root_tmp
    load->>audeer_mkdir: mkdir(db_root_tmp)
    load->>backend: get_file(remote_header, local_header)

    note over load,load_to: database_tmp_root only returns db_root + "~";
    note over load,load_to: callers now explicitly create tmp directory
Loading

Flow diagram for cache and tmp directory handling after refactor

flowchart TD
    A[Start load or dependencies call] --> B[Compute db_root via database_cache_root]
    B --> C[Create db_root via audeer.mkdir]
    C --> D{Need temporary cache?}
    D -->|Yes| E[Compute db_root_tmp via database_tmp_root]
    E --> F[Create db_root_tmp via audeer.mkdir]
    F --> G[Use db_root_tmp for downloads and processing]
    D -->|No| H[Use db_root directly]
    G --> I[Move or finalize data into db_root]
    H --> I[Read/write data in db_root]
    I --> J[End]
Loading

File-Level Changes

Change Details Files
Make database cache/tmp root helpers return paths without creating directories.
  • Updated database_cache_root() docstring to no longer claim it creates the folder.
  • Removed implicit directory creation in database_cache_root() and database_tmp_root().
  • Simplified database_tmp_root() to return the tmp path as db_root + '~' without side effects.
audb/core/cache.py
Ensure callers explicitly create cache and temporary directories before using them.
  • Added audeer.mkdir(db_root) after computing database_cache_root() in load(), load_attachment(), load_header(), load_media(), load_table(), dependencies(), and stream().
  • Added audeer.mkdir(db_root_tmp) after computing database_tmp_root() in helper functions that use temporary cache directories, including load-related and load_to() helpers.
  • Adjusted load_header_to() so temporary header location is created explicitly when used.
audb/core/load.py
audb/core/api.py
audb/core/load_to.py
audb/core/stream.py
Harden tests against missing cache directory during cleanup.
  • Guarded shutil.rmtree(cache_root) with an os.path.exists(cache_root) check in test_load_media to avoid errors when cache_root was never created.
tests/test_load.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • Now that database_cache_root() and database_tmp_root() return paths without creating directories, there is repeated audeer.mkdir(...) logic at many call sites; consider centralizing this behavior in a small helper or a dedicated wrapper to avoid future call sites accidentally omitting the directory creation.
  • The updated docstrings for database_cache_root() and database_tmp_root() no longer indicate that the returned paths may not exist; consider explicitly stating that these functions only compute the path and that callers are responsible for creating the directory to make the new behavior clearer and harder to misuse.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Now that `database_cache_root()` and `database_tmp_root()` return paths without creating directories, there is repeated `audeer.mkdir(...)` logic at many call sites; consider centralizing this behavior in a small helper or a dedicated wrapper to avoid future call sites accidentally omitting the directory creation.
- The updated docstrings for `database_cache_root()` and `database_tmp_root()` no longer indicate that the returned paths may not exist; consider explicitly stating that these functions only compute the path and that callers are responsible for creating the directory to make the new behavior clearer and harder to misuse.

## Individual Comments

### Comment 1
<location> `audb/core/load.py:277-278` </location>
<code_context>
         )
         missing_attachments = [deps.archive(path) for path in missing_paths]
         db_root_tmp = database_tmp_root(db_root)
+        audeer.mkdir(db_root_tmp)

         def job(cache_root: str, file: str):
</code_context>

<issue_to_address>
**suggestion:** The repeated pattern of `database_tmp_root()` + `audeer.mkdir()` could be encapsulated to reduce duplication and future errors.

This pattern (`db_root_tmp = database_tmp_root(db_root)` then `audeer.mkdir(db_root_tmp)`) appears in multiple places (`load.py`, `load_to.py`, `load_header_to()`). To avoid repetition and reduce the chance of callers forgetting `mkdir`, consider a helper like `ensure_database_tmp_root(db_root)` that both computes and creates the temp directory and returns the path.

Suggested implementation:

```python
        missing_attachments = [deps.archive(path) for path in missing_paths]

        def ensure_database_tmp_root(db_root: str) -> str:
            db_root_tmp = database_tmp_root(db_root)
            audeer.mkdir(db_root_tmp)
            return db_root_tmp

        db_root_tmp = ensure_database_tmp_root(db_root)

```

```python
                verbose,
            )
            db_root_tmp = ensure_database_tmp_root(db_root)

```

To fully apply the refactoring across the codebase:
1. Search in `audb/core/load.py`, `audb/core/load_to.py`, and `audb/core/load_header_to.py` for the pattern:
   - `db_root_tmp = database_tmp_root(db_root)` followed by `audeer.mkdir(db_root_tmp)`.
2. In `audb/core/load.py`, if you prefer a module-level helper instead of a nested function, move `ensure_database_tmp_root()` to the module level and adjust its indentation, then replace all occurrences of the two-line pattern with `db_root_tmp = ensure_database_tmp_root(db_root)`.
3. In `audb/core/load_to.py` and `audb/core/load_header_to.py`, either:
   - Import `ensure_database_tmp_root` from `audb.core.load` (if appropriate for your dependency structure), or
   - Introduce an equivalent helper there and replace the repeated pattern with a single call.
</issue_to_address>

### Comment 2
<location> `audb/core/load.py:1142-1143` </location>
<code_context>
         sampling_rate=sampling_rate,
     )
     db_root = database_cache_root(name, version, cache_root, flavor)
+    audeer.mkdir(db_root)

     if verbose:  # pragma: no cover
</code_context>

<issue_to_address>
**suggestion:** Multiple places now explicitly call `audeer.mkdir(db_root)` after `database_cache_root()`, which could be centralized.

In `load()`, `load_attachment()`, `load_header()`, `load_media()`, `load_table()`, and `dependencies()`, the pattern `db_root = database_cache_root(...); audeer.mkdir(db_root)` is duplicated. Consider introducing a helper like `ensure_database_cache_root(...)` that returns an existing-or-created path to remove boilerplate and avoid future call sites omitting the `mkdir` step.

Suggested implementation:

```python
    """
    def ensure_database_cache_root(*args, **kwargs):
        db_root = database_cache_root(*args, **kwargs)
        audeer.mkdir(db_root)
        return db_root

    db_root_tmp = database_tmp_root(db_root)

```

```python
        csv_file = f"db.{table}.csv"
        sampling_rate=sampling_rate,
    )
    db_root = ensure_database_cache_root(name, version, cache_root, flavor)

    if verbose:  # pragma: no cover
        print(f"Get:   {name} v{version}")
        version = latest_version(name)

    db_root = ensure_database_cache_root(name, version, cache_root)

    if verbose:  # pragma: no cover
        print(f"Get:   {name} v{version}")
        version = latest_version(name)

```

To fully apply the refactoring across the module, update all other call sites that follow the pattern:

    db_root = database_cache_root(...)
    audeer.mkdir(db_root)

to instead use:

    db_root = ensure_database_cache_root(...)

Specifically, in `load()`, `load_attachment()`, `load_header()`, `load_media()`, `load_table()`, and `dependencies()`, search for `database_cache_root(` followed by `audeer.mkdir(db_root)` and replace those two lines with a single call to `ensure_database_cache_root()` passing through the same arguments used for `database_cache_root()`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines 277 to +278
db_root_tmp = database_tmp_root(db_root)
audeer.mkdir(db_root_tmp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: The repeated pattern of database_tmp_root() + audeer.mkdir() could be encapsulated to reduce duplication and future errors.

This pattern (db_root_tmp = database_tmp_root(db_root) then audeer.mkdir(db_root_tmp)) appears in multiple places (load.py, load_to.py, load_header_to()). To avoid repetition and reduce the chance of callers forgetting mkdir, consider a helper like ensure_database_tmp_root(db_root) that both computes and creates the temp directory and returns the path.

Suggested implementation:

        missing_attachments = [deps.archive(path) for path in missing_paths]

        def ensure_database_tmp_root(db_root: str) -> str:
            db_root_tmp = database_tmp_root(db_root)
            audeer.mkdir(db_root_tmp)
            return db_root_tmp

        db_root_tmp = ensure_database_tmp_root(db_root)
                verbose,
            )
            db_root_tmp = ensure_database_tmp_root(db_root)

To fully apply the refactoring across the codebase:

  1. Search in audb/core/load.py, audb/core/load_to.py, and audb/core/load_header_to.py for the pattern:
    • db_root_tmp = database_tmp_root(db_root) followed by audeer.mkdir(db_root_tmp).
  2. In audb/core/load.py, if you prefer a module-level helper instead of a nested function, move ensure_database_tmp_root() to the module level and adjust its indentation, then replace all occurrences of the two-line pattern with db_root_tmp = ensure_database_tmp_root(db_root).
  3. In audb/core/load_to.py and audb/core/load_header_to.py, either:
    • Import ensure_database_tmp_root from audb.core.load (if appropriate for your dependency structure), or
    • Introduce an equivalent helper there and replace the repeated pattern with a single call.

Comment on lines 1142 to +1143
db_root = database_cache_root(name, version, cache_root, flavor)
audeer.mkdir(db_root)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Multiple places now explicitly call audeer.mkdir(db_root) after database_cache_root(), which could be centralized.

In load(), load_attachment(), load_header(), load_media(), load_table(), and dependencies(), the pattern db_root = database_cache_root(...); audeer.mkdir(db_root) is duplicated. Consider introducing a helper like ensure_database_cache_root(...) that returns an existing-or-created path to remove boilerplate and avoid future call sites omitting the mkdir step.

Suggested implementation:

    """
    def ensure_database_cache_root(*args, **kwargs):
        db_root = database_cache_root(*args, **kwargs)
        audeer.mkdir(db_root)
        return db_root

    db_root_tmp = database_tmp_root(db_root)
        csv_file = f"db.{table}.csv"
        sampling_rate=sampling_rate,
    )
    db_root = ensure_database_cache_root(name, version, cache_root, flavor)

    if verbose:  # pragma: no cover
        print(f"Get:   {name} v{version}")
        version = latest_version(name)

    db_root = ensure_database_cache_root(name, version, cache_root)

    if verbose:  # pragma: no cover
        print(f"Get:   {name} v{version}")
        version = latest_version(name)

To fully apply the refactoring across the module, update all other call sites that follow the pattern:

db_root = database_cache_root(...)
audeer.mkdir(db_root)

to instead use:

db_root = ensure_database_cache_root(...)

Specifically, in load(), load_attachment(), load_header(), load_media(), load_table(), and dependencies(), search for database_cache_root( followed by audeer.mkdir(db_root) and replace those two lines with a single call to ensure_database_cache_root() passing through the same arguments used for database_cache_root().

@hagenw hagenw self-assigned this Jan 5, 2026
@hagenw hagenw marked this pull request as draft January 5, 2026 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants