@hagenw
Member

@hagenw hagenw commented Jan 5, 2026

Closes #526

This checks whether the cache folder contains any files and runs the scan for missing files only if it does.

Instead of checking for an empty folder with is_empty(), as introduced in this pull request, we could also change audb.core.cache.database_cache_root() so that it no longer creates the folder. We could then simply check whether db_root exists here instead of calling is_empty(db_root). If you think this is the better approach, we should merge #540 first; otherwise we should close #540.
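To make the trade-off concrete, here is a minimal sketch of both checks against a temporary directory that stands in for the real cache. The `is_empty()` body shown here is an assumption for illustration and may differ from the helper added in this pull request; the paths are made up.

```python
import os
import tempfile


def is_empty(path: str) -> bool:
    """Return True if the directory at ``path`` has no entries (illustrative)."""
    with os.scandir(path) as entries:
        return next(entries, None) is None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical cache location for one database version
    db_root = os.path.join(tmp, "db", "1.0.0")

    # Alternative: if database_cache_root() did not create the folder,
    # its mere existence would signal a previously used cache.
    scan_if_exists = os.path.exists(db_root)

    # This pull request: the folder always exists after
    # database_cache_root(), so we have to look inside it.
    os.makedirs(db_root)
    scan_if_not_empty = not is_empty(db_root)
```

In both variants the scan is skipped on a first load; the difference is only whether the signal is the folder's existence or its contents.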

Summary by Sourcery

Optimize detection of missing media files on first database load by skipping per-file existence checks when the cache directory is still empty and wiring this optimization through all relevant load and streaming paths.

New Features:

  • Introduce an is_empty() utility to detect empty cache directories and use it to control whether missing-file scans are performed.

Enhancements:

  • Propagate a scan_for_missing_files flag through internal load helpers so missing media detection can be toggled based on cache state.

Tests:

  • Add regression test covering missing file detection optimization for different media file layouts and directory structures.

@sourcery-ai
Contributor

sourcery-ai bot commented Jan 5, 2026

Reviewer's Guide

Introduce an optimization to skip per-file missing-media checks on first load by detecting an empty cache directory, thread this behavior through load/stream paths via a new scan_for_missing_files flag, and add a targeted regression test plus a utility helper to detect empty directories.

Sequence diagram for skipping missing-file scans on first load

sequenceDiagram
    actor Client
    participant Load as load
    participant CacheRoot as database_cache_root
    participant Utils as is_empty
    participant Loader as _load_files
    participant Missing as _missing_files

    Client->>Load: load(name, version, cache_root, flavor, verbose)
    Load->>CacheRoot: database_cache_root(name, version, cache_root, flavor)
    CacheRoot-->>Load: db_root
    Load->>Utils: is_empty(db_root)
    Utils-->>Load: is_empty
    Load->>Load: scan_for_missing_files = not is_empty

    alt cache_not_empty
        Load->>Loader: _load_files(files, files_type, db_root, cached_versions, flavor, cache_root, pickle_tables, scan_for_missing_files=true, num_workers, verbose)
        Loader->>Missing: _missing_files(files, files_type, db_root, flavor, verbose)
        Missing-->>Loader: missing_files
    else cache_empty
        Load->>Loader: _load_files(files, files_type, db_root, cached_versions, flavor, cache_root, pickle_tables, scan_for_missing_files=false, num_workers, verbose)
        Loader->>Loader: missing_files = list(files)
    end

    Loader-->>Load: cached_versions
    Load-->>Client: database
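The two branches of the sequence diagram can be sketched in a few lines. This is not the actual `_load_files()` code: `find_missing()` and `files_to_fetch()` are hypothetical stand-ins, and only the `scan_for_missing_files` flag name is taken from the pull request.

```python
import os
import tempfile


def find_missing(files, db_root):
    """Hypothetical stand-in for the per-file existence check."""
    return [f for f in files if not os.path.exists(os.path.join(db_root, f))]


def files_to_fetch(files, db_root, *, scan_for_missing_files):
    """Return the files that still need to be downloaded."""
    if scan_for_missing_files:
        return find_missing(files, db_root)
    # Cache was empty: every requested file is treated as missing.
    return list(files)


with tempfile.TemporaryDirectory() as db_root:
    files = ["a.wav", "sub/b.wav"]
    # First load: the directory is empty, so the scan is skipped.
    first = files_to_fetch(files, db_root, scan_for_missing_files=False)
    # Later load: one file is cached, so the scan is required.
    open(os.path.join(db_root, "a.wav"), "w").close()
    later = files_to_fetch(files, db_root, scan_for_missing_files=True)
```

Skipping the scan saves one filesystem lookup per requested file, which is where the first-load cost comes from for databases with many media files.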

Updated class diagram for load and utility functions with scan_for_missing_files

classDiagram
    class LoadModule {
        +load(name, version, cache_root, flavor, verbose, ...) database
        +load_media(name, version, cache_root, flavor, verbose, ...) media
        +load_table(name, version, cache_root, verbose, ...) table
        +_load_files(files, files_type, db_root, cached_versions, flavor, cache_root, pickle_tables, scan_for_missing_files, num_workers, verbose) CachedVersions
    }

    class StreamModule {
        +stream(name, version, cache_root, flavor, verbose, ...) DatabaseIterator
    }

    class UtilsModule {
        +is_empty(path) bool
    }

    class CacheModule {
        +database_cache_root(name, version, cache_root, flavor) str
    }

    class MissingFilesModule {
        +_missing_files(files, files_type, db_root, flavor, verbose) list
        +_cached_versions(name, version, db_root, cache_root, flavor, verbose) CachedVersions
    }

    LoadModule --> UtilsModule : uses is_empty
    StreamModule --> UtilsModule : uses is_empty
    LoadModule --> CacheModule : uses database_cache_root
    StreamModule --> CacheModule : uses database_cache_root
    LoadModule --> MissingFilesModule : uses _missing_files
    LoadModule --> MissingFilesModule : uses _cached_versions
    StreamModule --> LoadModule : uses _load_files

File-Level Changes

Change: Add a flag to _load_files to optionally skip scanning the filesystem for missing media files and treat all requested files as missing.
Details:
  • Extend _load_files signature with a scan_for_missing_files boolean parameter and update its docstring.
  • Guard the _missing_files() call with scan_for_missing_files and otherwise set missing_files to all requested files.
  • Update all internal callers of _load_files (load, load_media, load_table, stream) to pass the new scan_for_missing_files argument.
Files: audb/core/load.py, audb/core/stream.py

Change: Determine whether to scan for missing files based on whether the database cache directory is empty, and propagate this decision through load and streaming APIs.
Details:
  • After computing db_root via database_cache_root() in load(), load_media(), load_table(), and stream(), compute scan_for_missing_files = not is_empty(db_root).
  • Pass scan_for_missing_files into subsequent _load_files invocations so that the first load against an empty cache skips per-file existence checks.
Files: audb/core/load.py, audb/core/stream.py

Change: Introduce a filesystem utility helper to check whether a directory is empty.
Details:
  • Add an is_empty(path: str) -> bool helper that uses os.scandir() to test if a directory has any entries.
  • Expose is_empty from audb.core.utils and import it where needed for load/stream logic.
Files: audb/core/utils.py, audb/core/stream.py, audb/core/load.py

Change: Add a regression test that validates the missing-files optimization across different directory layouts.
Details:
  • Add test_missing_files_optimization() to cover cases with non-existent media parent directories, existing parent directories, root-level files, and mixed layouts.
  • Use audb.core.load._missing_files() directly to assert behavior of the optimization in each scenario.
Files: tests/test_load.py

Assessment against linked issues

Issue Objective Addressed Explanation
#526 Modify database loading so that when downloading a database for the first time (empty cache_root), audb does not scan for missing media files.
#526 Add tests that cover and prevent regression of the optimization that skips unnecessary missing-media checks on first load.

@hagenw hagenw marked this pull request as ready for review January 5, 2026 13:40
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey - I've left some high level feedback:

  • The new is_empty() helper assumes the directory exists and is a directory; consider handling non-existent paths (and possibly non-directory paths) by returning True instead of raising to make it safer for reuse outside the current call sites.
  • You now compute scan_for_missing_files in several places (load, load_media, load_table, stream) with identical logic; consider centralizing this decision in a small helper or inside _load_files to avoid repetition and keep behavior consistent if it ever needs to change.
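
Both suggestions could be combined roughly as follows. This is a sketch under assumptions, not the code from the pull request: the tolerant `is_empty()` body and the helper name `should_scan_for_missing_files()` are hypothetical.

```python
import os
import tempfile


def is_empty(path: str) -> bool:
    """Return True if ``path`` is missing, not a directory, or has no entries."""
    try:
        with os.scandir(path) as entries:
            return next(entries, None) is None
    except (FileNotFoundError, NotADirectoryError):
        # Nothing can be cached under a path that is not a directory.
        return True


def should_scan_for_missing_files(db_root: str) -> bool:
    """Single place that decides whether the per-file scan is worth running."""
    return not is_empty(db_root)


with tempfile.TemporaryDirectory() as tmp:
    # Path that does not exist: treated as empty, so no scan.
    absent = should_scan_for_missing_files(os.path.join(tmp, "nope"))
    # Freshly created directory: empty, so no scan.
    fresh = should_scan_for_missing_files(tmp)
    # Directory with one cached file: scan is needed.
    open(os.path.join(tmp, "f.wav"), "w").close()
    populated = should_scan_for_missing_files(tmp)
```

With a helper like this, load, load_media, load_table, and stream would each make one call instead of repeating the `not is_empty(db_root)` expression.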

@hagenw hagenw self-assigned this Jan 5, 2026
@hagenw hagenw requested a review from frankenjoe January 5, 2026 15:36
Development

Successfully merging this pull request may close these issues.

audb searches for missing media when downloading a database without audio

2 participants