Add file lookup and cache to Dependencies #534
Conversation
Reviewer's Guide

Optimizes audb.Dependencies by introducing a lazy dictionary-based file index and a centralized property cache that are invalidated on all dataframe mutations, and by using NumPy masks instead of pandas boolean indexing for several filtered views, plus a small adjustment to the benchmark pre-warm step.

Sequence diagram for Dependencies caching, index building, and invalidation

```mermaid
sequenceDiagram
    participant Client
    participant Dependencies
    %% File membership check using lazy index
    Client->>Dependencies: __contains__(file)
    alt _file_index is None
        Dependencies->>Dependencies: _get_file_index()
        Dependencies->>Dependencies: build dict from _df.index
    end
    Dependencies-->>Client: bool
    %% Cached property access (example: media)
    Client->>Dependencies: media()
    alt cache hit
        Dependencies->>Dependencies: read _cache[media]
        Dependencies-->>Client: media list
    else cache miss
        Dependencies->>Dependencies: compute NumPy mask from _df[type]
        Dependencies->>Dependencies: build media list from _df.index[mask]
        Dependencies->>Dependencies: store in _cache[media]
        Dependencies-->>Client: media list
    end
    %% Dataframe mutation and cache invalidation (example: _add_media)
    Client->>Dependencies: _add_media(files, archive, format, checksum, bit_depth, channels, sampling_rate, type, version, removed)
    Dependencies->>Dependencies: append to _df
    Dependencies->>Dependencies: _invalidate_cache()
    Dependencies->>Dependencies: clear _cache and reset _file_index to None
```
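The flow in this diagram reduces to a small pattern. The following is a minimal sketch under the names used in the diagram, not the PR's actual implementation; the constructor and mutators are abbreviated:

```python
import pandas as pd


class Dependencies:
    """Abbreviated sketch; the real class holds a full dependency table."""

    def __init__(self, df: pd.DataFrame):
        self._df = df
        # Lazy dictionary index, built on first lookup,
        # reset to None by _invalidate_cache()
        self._file_index: dict[str, int] | None = None
        # Cache for derived properties (e.g. media, tables)
        self._cache: dict[str, object] = {}

    def _get_file_index(self) -> dict[str, int]:
        # Map file name -> positional row, built once per dataframe state
        if self._file_index is None:
            self._file_index = {
                file: pos for pos, file in enumerate(self._df.index)
            }
        return self._file_index

    def __contains__(self, file: str) -> bool:
        # O(1) dict membership instead of scanning the pandas index
        return file in self._get_file_index()

    def __getitem__(self, file: str) -> list:
        # Positional access via iloc instead of label-based loc
        return self._df.iloc[self._get_file_index()[file]].tolist()

    def _invalidate_cache(self) -> None:
        # Called after every mutation of self._df
        self._cache.clear()
        self._file_index = None
```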
Class diagram for updated audb.Dependencies caching and indexing

```mermaid
classDiagram
    class Dependencies {
        - pd.DataFrame _df
        - dict[str,int] _file_index
        - dict[str,object] _cache
        - pa.Schema _schema
        + __call__() pd.DataFrame
        + __contains__(file: str) bool
        + __eq__(other: Dependencies) bool
        + __getitem__(file: str) list
        + __len__() int
        + __str__() str
        + archives() list[str]
        + attachments() list[str]
        + attachment_ids() list[str]
        + files() list[str]
        + media() list[str]
        + removed_media() list[str]
        + table_ids() list[str]
        + tables() list[str]
        + archive(file: str) str
        + removed(file: str) bool
        + load(path: str) void
        - _get_file_index() dict[str,int]
        - _invalidate_cache() void
        - _add_attachment(file: str, archive: str, format: str, checksum: str, version: str) void
        - _add_media(files: Sequence[str], archive: str, format: str, checksum: str, bit_depth: int, channels: int, sampling_rate: int, type: str, version: str, removed: int) void
        - _add_meta(file: str, archive: str, checksum: str, version: str) void
        - _drop(files: Sequence[str]) void
        - _remove(file: str) void
        - _update_media(files: Sequence[str], archive: str, format: str, checksum: str, bit_depth: int, channels: int, sampling_rate: int, type: str, version: str, removed: int) void
        - _update_media_version(files: Sequence[str], version: str) void
    }
```
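On the NumPy-mask point from the guide: the idea is to compute the boolean filter on the raw array via `.to_numpy()` instead of through a pandas boolean Series. A self-contained toy example of the two equivalent approaches (the integer type codes here are made up, not audb's real `define.DEPENDENCY_TYPE` values):

```python
import pandas as pd

# Toy dependency table; in audb the index holds file names
# and "type" encodes attachment/media/meta
df = pd.DataFrame(
    {"type": [1, 1, 2, 0, 1]},
    index=["a.wav", "b.wav", "db.csv", "att.pdf", "c.wav"],
)

# pandas boolean indexing: builds an intermediate boolean Series
media_pandas = df.index[df["type"] == 1].tolist()

# NumPy mask: compares on the raw ndarray, skipping Series overhead
mask = df["type"].to_numpy() == 1
media_numpy = df.index[mask].tolist()

assert media_pandas == media_numpy == ["a.wav", "b.wav", "c.wav"]
```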
Hey - I've found 2 issues and left some high-level feedback:
- The new dictionary-based file index assumes unique index labels for files; if non-unique indexes are possible, consider enforcing or asserting uniqueness (or falling back to the original `.loc` behavior) to avoid subtle behavior changes in `__getitem__` compared to pandas (a minimal sketch of such a guard follows this list).
- In the benchmark, replacing the warm-up call from `deps.archives` to `deps._df` is slightly misleading and relies on a private attribute; consider either explicitly warming the relevant cached properties or updating the comment to reflect that you're intentionally avoiding cache warm-up.
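A minimal sketch of the uniqueness guard mentioned in the first point (an assertion-based variant, assumed rather than taken from the PR):

```python
def _get_file_index(self) -> dict[str, int]:
    if self._file_index is None:
        index = self._df.index
        # A dict keeps only one position per label, so a duplicated
        # file name would silently shadow earlier rows, unlike .loc,
        # which returns all matches. Fail loudly instead.
        assert index.is_unique, "Dependencies index contains duplicate files"
        self._file_index = {file: pos for pos, file in enumerate(index)}
    return self._file_index
```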
## Individual Comments
### Comment 1
<location> `benchmarks/benchmark-dependencies-methods.py:97-101` </location>
<code_context>

```diff
 # Further calls will be faster
 "file-10.wav" in deps
-deps.archives
+deps._df
 _ = deps.archive(_files[0])
```

</code_context>
<issue_to_address>
**suggestion:** The benchmark pre-warm step now bypasses the property logic and no longer exercises the cache.
Accessing `deps._df` here only touches the underlying DataFrame and skips the new caching/indexing logic. If the intent is to benchmark steady-state performance of the public API with caches warmed, consider invoking a real property like `deps.archives` (or a dedicated warm-up helper) instead of the private `_df` attribute.
```suggestion
# Pre-warm accesses
# Further calls will be faster
"file-10.wav" in deps
_ = deps.archives
_ = deps.archive(_files[0])
```
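Alternatively, if a dedicated warm-up helper is preferred, a hypothetical `warm_up` function (not part of the audb API) would make the benchmark's intent explicit:

```python
def warm_up(deps, probe_file: str) -> None:
    """Hypothetical helper: put ``deps`` into its steady, cached state."""
    _ = probe_file in deps  # builds the lazy file index
    _ = deps.archives       # fills the property cache
    _ = deps.media
    _ = deps.tables
```

Called once before the timed loop, e.g. `warm_up(deps, "file-10.wav")`, this keeps the benchmark on the public API while making the cache warm-up deliberate.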
</issue_to_address>
### Comment 2
<location> `audb/core/dependencies.py:67` </location>
<code_context>

```diff
 self._df = self._set_dtypes(self._df)
+# Lazy dictionary index for O(1) file lookups
+self._file_index: dict[str, int] | None = None
+# Property cache (invalidated on modifications)
+self._cache: dict[str, object] = {}
 # pyarrow schema
```

</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared caching and DataFrame-assignment logic into small helper methods to keep the new performance optimizations while reducing repetition and cache-management complexity.
You can keep the performance improvements while reducing the new complexity by:
### 1. Deduplicate the cache pattern
All properties now repeat the same “check key, compute, store, return” logic. A small helper removes boilerplate and makes the contract explicit:
```python
def _get_cached(self, key: str, build: Callable[[], object]) -> object:
    try:
        return self._cache[key]
    except KeyError:
        value = build()
        self._cache[key] = value
        return value
```
Then properties become much simpler and harder to get wrong:
```python
@property
def archives(self) -> list[str]:
    return self._get_cached(
        "archives",
        lambda: sorted(self._df["archive"].unique().tolist()),
    )

@property
def attachments(self) -> list[str]:
    return self._get_cached(
        "attachments",
        lambda: self._df.index[
            self._df["type"].to_numpy() == define.DEPENDENCY_TYPE["attachment"]
        ].tolist(),
    )

@property
def media(self) -> list[str]:
    return self._get_cached(
        "media",
        lambda: self._df.index[
            self._df["type"].to_numpy() == define.DEPENDENCY_TYPE["media"]
        ].tolist(),
    )
```
This keeps your caching behavior but avoids repeated manual dict access and string-key bugs.
### 2. Centralize cache invalidation on `_df` mutation
Right now every mutator manually calls `_invalidate_cache()`. You can centralize this by routing `_df` assignments through a helper, reducing the chance of forgetting invalidation in future changes:
```python
def _set_df(self, df: pd.DataFrame) -> None:
    self._df = df
    self._invalidate_cache()
```
Then update mutation sites to use `_set_df` instead of bare `_df` assignment:
```python
def _add_media(self, values: Sequence[tuple[...]]):
    df = pd.DataFrame.from_records(
        values,
        columns=["file"] + list(define.DEPENDENCY_TABLE.keys()),
    ).set_index("file")
    df = self._set_dtypes(df)
    self._set_df(pd.concat([self._df, df]))

def _drop(self, files: Sequence[str]):
    self._set_df(self._df[~self._df.index.isin(files)])

def load(...):
    ...
    if extension == "csv":
        ...
        self._set_df(self._table_to_dataframe(table))
    elif extension == "parquet":
        ...
        self._set_df(self._table_to_dataframe(table))
```
For in-place updates that don’t replace `_df` (like `_remove`, `_update_media`, `_update_media_version`), you can keep the explicit `_invalidate_cache()` calls, but at least all “replace `_df` wholesale” paths are now guaranteed to invalidate caches.
This keeps your optimization (lazy dict index + property caching) but makes the abstraction more uniform and reduces the mental overhead around cache management.
</issue_to_address>
Speed up parts of `audb.Dependencies` by using a dictionary for file lookup (`audb.Dependencies._file_index`), caching of properties (`audb.Dependencies._cache`), and using `numpy` for creating masks. In the benchmark results the caching is only visible for `Dependencies.tables`, as it was already called by `Dependencies.table_ids`.

Summary by Sourcery

Optimize audb.Dependencies lookups and property access using a lazy file index and cached derived properties that are invalidated on data changes.

Enhancements:

Tests: