Conversation

@hagenw (Member) commented Jan 5, 2026

Speed up parts of audb.Dependencies by using a dictionary for file lookups (audb.Dependencies._file_index), caching properties (audb.Dependencies._cache), and using numpy to create masks. In the benchmark results, the caching is only visible for Dependencies.tables, as this property had already been called by Dependencies.table_ids.

method Before After
Dependencies.save() 0.291 0.289
Dependencies.load() 0.189 0.184
Dependencies.__call__() 0.000 0.000
Dependencies.__contains__(10000 files) 0.005 0.002
Dependencies.__getitem__(10000 files) 0.610 0.588
Dependencies.__len__() 0.000 0.000
Dependencies.__str__() 0.022 0.021
Dependencies.archives 0.145 0.137
Dependencies.attachments 0.026 0.005
Dependencies.attachment_ids 0.028 0.012
Dependencies.files 0.011 0.006
Dependencies.media 0.065 0.017
Dependencies.removed_media 0.057 0.018
Dependencies.table_ids 0.058 0.039
Dependencies.tables 0.025 0.000
Dependencies.archive(10000 files) 0.046 0.045
Dependencies.bit_depth(10000 files) 0.032 0.033
Dependencies.channels(10000 files) 0.032 0.033
Dependencies.checksum(10000 files) 0.033 0.034
Dependencies.duration(10000 files) 0.032 0.032
Dependencies.format(10000 files) 0.033 0.034
Dependencies.removed(10000 files) 0.032 0.033
Dependencies.sampling_rate(10000 files) 0.032 0.034
Dependencies.type(10000 files) 0.032 0.033
Dependencies.version(10000 files) 0.034 0.034
Dependencies._add_attachment() 0.178 0.221
Dependencies._add_media(10000 files) 0.072 0.066
Dependencies._add_meta() 0.122 0.120
Dependencies._drop() 0.091 0.091
Dependencies._remove() 0.071 0.071
Dependencies._update_media() 0.106 0.107
Dependencies._update_media_version(10000 files) 0.017 0.018
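
The dictionary lookup is the heart of the `__contains__` and `__getitem__` gains. Below is a minimal, self-contained sketch of the pattern, assuming a class shaped roughly like audb.Dependencies; only the _file_index and _get_file_index names are taken from this PR, the rest is illustrative:

```python
import pandas as pd


class Deps:
    """Sketch of the lazy file-index pattern (not the real audb class)."""

    def __init__(self, df: pd.DataFrame):
        self._df = df
        # Built lazily on first lookup: file name -> row position
        self._file_index: dict[str, int] | None = None

    def _get_file_index(self) -> dict[str, int]:
        # Build the dict once; afterwards membership checks are O(1)
        if self._file_index is None:
            self._file_index = {
                file: pos for pos, file in enumerate(self._df.index)
            }
        return self._file_index

    def __contains__(self, file: str) -> bool:
        # Dict lookup instead of searching the dataframe index
        return file in self._get_file_index()

    def __getitem__(self, file: str) -> list:
        # Resolve the row position once, then use positional indexing
        return self._df.iloc[self._get_file_index()[file]].tolist()
```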

Summary by Sourcery

Optimize audb.Dependencies lookups and property access using a lazy file index and cached derived properties that are invalidated on data changes.

Enhancements:

  • Introduce a lazily built dictionary index over dependency files to speed up membership checks and row retrieval.
  • Add a generic cache for frequently requested dependency properties such as archives, attachments, media, tables, and related ID/file lists, with automatic invalidation after mutations.

Tests:

  • Adjust the dependencies benchmark pre-warm step to better reflect steady-state performance of cached lookups.

sourcery-ai bot (Contributor) commented Jan 5, 2026

Reviewer's Guide

Optimizes audb.Dependencies by introducing a lazy dictionary-based file index and a centralized property cache that are invalidated on all dataframe mutations, and by using NumPy masks instead of pandas boolean indexing for several filtered views, plus a small adjustment to the benchmark pre-warm step.

Sequence diagram for Dependencies caching, index building, and invalidation

sequenceDiagram
    participant Client
    participant Dependencies

    %% File membership check using lazy index
    Client->>Dependencies: __contains__(file)
    alt _file_index is None
        Dependencies->>Dependencies: _get_file_index()
        Dependencies->>Dependencies: build dict from _df.index
    end
    Dependencies-->>Client: bool

    %% Cached property access (example: media)
    Client->>Dependencies: media()
    alt cache hit
        Dependencies->>Dependencies: read _cache[media]
        Dependencies-->>Client: media list
    else cache miss
        Dependencies->>Dependencies: compute NumPy mask from _df[type]
        Dependencies->>Dependencies: build media list from _df.index[mask]
        Dependencies->>Dependencies: store in _cache[media]
        Dependencies-->>Client: media list
    end

    %% Dataframe mutation and cache invalidation (example: _add_media)
    Client->>Dependencies: _add_media(files, archive, format, checksum, bit_depth, channels, sampling_rate, type, version, removed)
    Dependencies->>Dependencies: append to _df
    Dependencies->>Dependencies: _invalidate_cache()
    Dependencies->>Dependencies: clear _cache and reset _file_index to None

Class diagram for updated audb.Dependencies caching and indexing

classDiagram
    class Dependencies {
        - pd.DataFrame _df
        - dict[str,int] _file_index
        - dict[str,object] _cache
        - pa.Schema _schema
        + __call__() pd.DataFrame
        + __contains__(file: str) bool
        + __eq__(other: Dependencies) bool
        + __getitem__(file: str) list
        + __len__() int
        + __str__() str
        + archives() list[str]
        + attachments() list[str]
        + attachment_ids() list[str]
        + files() list[str]
        + media() list[str]
        + removed_media() list[str]
        + table_ids() list[str]
        + tables() list[str]
        + archive(file: str) str
        + removed(file: str) bool
        + load(path: str) void
        - _get_file_index() dict[str,int]
        - _invalidate_cache() void
        - _add_attachment(file: str, archive: str, format: str, checksum: str, version: str) void
        - _add_media(files: Sequence[str], archive: str, format: str, checksum: str, bit_depth: int, channels: int, sampling_rate: int, type: str, version: str, removed: int) void
        - _add_meta(file: str, archive: str, checksum: str, version: str) void
        - _drop(files: Sequence[str]) void
        - _remove(file: str) void
        - _update_media(files: Sequence[str], archive: str, format: str, checksum: str, bit_depth: int, channels: int, sampling_rate: int, type: str, version: str, removed: int) void
        - _update_media_version(files: Sequence[str], version: str) void
    }

File-Level Changes

Add a lazy dictionary index for O(1) file lookups and use it in membership and item access (audb/core/dependencies.py):
  • Initialize self._file_index in __init__ as a lazily populated dict mapping file name to row position
  • Implement _get_file_index() to build the index from the dataframe index on first use
  • Change __contains__ to check membership in the file index instead of the dataframe index
  • Change __getitem__ to resolve the row position via the file index and index with iloc
Introduce a centralized cache for derived properties and ensure it is invalidated on all mutations of the dependencies dataframe (audb/core/dependencies.py); a sketch of the invalidation contract follows this entry:
  • Initialize self._cache in __init__ for generic property caching
  • Add an _invalidate_cache() helper that clears the cache and resets the file index
  • Wrap the archives, attachments, attachment_ids, files, media, removed_media, tables, and table_ids properties with cache lookups to avoid recomputation
  • Call _invalidate_cache() after load(), _add_attachment(), _add_media(), _add_meta(), _drop(), _remove(), _update_media(), and _update_media_version() to keep cache and index in sync
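
A rough sketch of that invalidation contract, reusing the toy class from the earlier sketch; only _cache, _file_index, _invalidate_cache(), and _drop() are names from this PR, everything else is illustrative:

```python
import pandas as pd


class Deps:
    def __init__(self, df: pd.DataFrame):
        self._df = df
        self._cache: dict[str, object] = {}
        self._file_index: dict[str, int] | None = None

    def _invalidate_cache(self) -> None:
        # Drop all cached properties and force the file index
        # to be rebuilt on the next lookup
        self._cache.clear()
        self._file_index = None

    def _drop(self, files: list[str]) -> None:
        # Example mutator: every change to self._df ends
        # with an invalidation so stale values cannot leak
        self._df = self._df[~self._df.index.isin(files)]
        self._invalidate_cache()
```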
Use NumPy-backed boolean masks instead of pandas boolean indexing for filtered lists to improve performance (audb/core/dependencies.py); see the toy comparison after this entry:
  • For attachments, attachment_ids, media, removed_media, and tables, convert the type/removed columns to NumPy arrays and compute boolean masks
  • Apply the masks to the dataframe index or archive column to build the corresponding lists before caching them
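
A toy comparison of the two filtering styles; the column name type, the marker value 1, and the file names are made up, not the real audb schema:

```python
import pandas as pd

df = pd.DataFrame(
    {"type": [1, 2, 1, 2]},
    index=["a.wav", "db.csv", "b.wav", "meta.csv"],
)

# pandas boolean indexing: builds an intermediate boolean Series
media_pandas = df.index[df["type"] == 1].tolist()

# NumPy-backed mask: compare on the raw array instead
mask = df["type"].to_numpy() == 1
media_numpy = df.index[mask].tolist()

assert media_pandas == media_numpy == ["a.wav", "b.wav"]
```

Skipping the intermediate pandas Series avoids index alignment and Series-construction overhead, which plausibly accounts for the improvements seen for attachments, media, and removed_media in the benchmark table above.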
Adjust the benchmark setup to avoid pre-warming cached properties, which would hide the performance gain (benchmarks/benchmark-dependencies-methods.py):
  • Replace the pre-warm call deps.archives with deps._df in the benchmark script so that property caches remain cold for measurement

@hagenw force-pushed the speedup-deps-with-index-dict-and-cache branch from 555858d to f36e28e on January 5, 2026 10:34
@hagenw marked this pull request as ready for review on January 5, 2026 15:33
@hagenw self-assigned this on Jan 5, 2026
sourcery-ai bot (Contributor) left a comment

Hey, I've found 2 issues and left some high-level feedback:

  • The new dictionary-based file index assumes unique index labels for files; if non-unique indexes are possible, consider enforcing or asserting uniqueness (or falling back to the original .loc behavior) to avoid subtle behavior changes in __getitem__ compared to pandas (a sketch of one such guard follows this list).
  • In the benchmark, replacing the warm-up call from deps.archives to deps._df is slightly misleading and relies on a private attribute; consider either explicitly warming the relevant cached properties or updating the comment to reflect that you're intentionally avoiding cache warm-up.
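
One possible shape for such a guard, building on the lazy-index sketch earlier on this page (hypothetical code, not part of the PR):

```python
def _get_file_index(self) -> dict[str, int]:
    if self._file_index is None:
        index = {file: pos for pos, file in enumerate(self._df.index)}
        # A dict silently keeps only the last position for a duplicated
        # label, whereas .loc would have returned every matching row
        if len(index) != len(self._df):
            raise ValueError("dependency table contains duplicate file entries")
        self._file_index = index
    return self._file_index
```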
Individual Comments

Comment 1
Location: benchmarks/benchmark-dependencies-methods.py:97-101

Code context:

 # Further calls will be faster
 "file-10.wav" in deps
-deps.archives
+deps._df
 _ = deps.archive(_files[0])

**suggestion:** The benchmark pre-warm step now bypasses the property logic and no longer exercises the cache.

Accessing `deps._df` here only touches the underlying DataFrame and skips the new caching/indexing logic. If the intent is to benchmark steady-state performance of the public API with caches warmed, consider invoking a real property like `deps.archives` (or a dedicated warm-up helper) instead of the private `_df` attribute.

```suggestion
# Pre-warm accesses
# Further calls will be faster
"file-10.wav" in deps
_ = deps.archives
_ = deps.archive(_files[0])
```

Comment 2
Location: audb/core/dependencies.py:67

Code context:
         self._df = self._set_dtypes(self._df)
+        # Lazy dictionary index for O(1) file lookups
+        self._file_index: dict[str, int] | None = None
+        # Property cache (invalidated on modifications)
+        self._cache: dict[str, object] = {}
         # pyarrow schema

**issue (complexity):** Consider extracting shared caching and DataFrame-assignment logic into small helper methods to keep the new performance optimizations while reducing repetition and cache-management complexity.

You can keep the performance improvements while reducing the new complexity by:

### 1. Deduplicate the cache pattern

All properties now repeat the same “check key, compute, store, return” logic. A small helper removes boilerplate and makes the contract explicit:

```python
from collections.abc import Callable  # needed for the annotation below


def _get_cached(self, key: str, build: Callable[[], object]) -> object:
    try:
        return self._cache[key]
    except KeyError:
        value = build()
        self._cache[key] = value
        return value
```

Then properties become much simpler and harder to get wrong:

```python
@property
def archives(self) -> list[str]:
    return self._get_cached(
        "archives",
        lambda: sorted(self._df["archive"].unique().tolist()),
    )

@property
def attachments(self) -> list[str]:
    return self._get_cached(
        "attachments",
        lambda: self._df.index[
            self._df["type"].to_numpy() == define.DEPENDENCY_TYPE["attachment"]
        ].tolist(),
    )

@property
def media(self) -> list[str]:
    return self._get_cached(
        "media",
        lambda: self._df.index[
            self._df["type"].to_numpy() == define.DEPENDENCY_TYPE["media"]
        ].tolist(),
    )
```

This keeps your caching behavior but avoids repeated manual dict access and string-key bugs.

### 2. Centralize cache invalidation on `_df` mutation

Right now every mutator manually calls `_invalidate_cache()`. You can centralize this by routing `_df` assignments through a helper, reducing the chance of forgetting invalidation in future changes:

```python
def _set_df(self, df: pd.DataFrame) -> None:
    self._df = df
    self._invalidate_cache()
```

Then update mutation sites to use `_set_df` instead of bare `_df` assignment:

```python
def _add_media(self, values: Sequence[tuple[...]]):
    df = pd.DataFrame.from_records(
        values,
        columns=["file"] + list(define.DEPENDENCY_TABLE.keys()),
    ).set_index("file")
    df = self._set_dtypes(df)
    self._set_df(pd.concat([self._df, df]))

def _drop(self, files: Sequence[str]):
    self._set_df(self._df[~self._df.index.isin(files)])

def load(...):
    ...
    if extension == "csv":
        ...
        self._set_df(self._table_to_dataframe(table))
    elif extension == "parquet":
        ...
        self._set_df(self._table_to_dataframe(table))
```

For in-place updates that don’t replace `_df` (like `_remove`, `_update_media`, `_update_media_version`), you can keep the explicit `_invalidate_cache()` calls, but at least all “replace `_df` wholesale” paths are now guaranteed to invalidate caches.

This keeps your optimization (lazy dict index + property caching) but makes the abstraction more uniform and reduces the mental overhead around cache management.

