Description
The dependency tables are the most critical part of audb, as they are needed to download anything from a database.
Currently, we use parquet files together with pyarrow-backed pandas.DataFrame objects. Selecting the right file format is not trivial, as we need both fast column access and fast random row access.
Here are some suggestions from Claude on how we could improve the current implementation:
Current State Analysis
Your current implementation uses:
- Parquet with PyArrow backend stored as db.parquet
- PyArrow-backed pandas DataFrames for 2-50x performance improvement
- 20 MB parquet vs. 102 MB CSV for 1M files
- File-based row lookups (deps.archive("wav/03a01Fa.wav")) via the pandas DataFrame index (sketched below)
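For reference, the lookup-relevant part of this setup boils down to a few lines; this is a simplified sketch, not the actual audb code, and the column names are illustrative:

```python
import pandas as pd

# Load the dependency table with the pyarrow dtype backend (pandas >= 2.0)
# and index it by file path, so single-row lookups go through the index.
deps = pd.read_parquet("db.parquet", dtype_backend="pyarrow").set_index("file")

archive = deps.at["wav/03a01Fa.wav", "archive"]  # row lookup by filename
```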
Key Performance Bottlenecks
From audb/core/dependencies.py:541, your primary access pattern is:
```python
def _column_loc(self, column: str, file: str, dtype: Callable = None):
    value = self._df.at[file, column]  # single row lookup by filename
```

This suggests that random access by filename is critical, which parquet isn't optimized for.
Top Recommendations
1. Lance Columnar Format (Best Overall; see the sketch below)
   - 2000x faster random access than parquet
   - Native indexing, perfect for filename-based lookups
   - Comparable file size to parquet
   - Built for ML workloads like yours
   - Python: `pip install lancedb`
2. SQLite with Proper Indexing (Pragmatic Choice; see the sketch below)
   - `CREATE INDEX idx_filename ON dependencies(file);`
   - O(log n) lookups vs O(n) DataFrame scans
   - Proven scalability to millions of rows
   - 20-30% smaller than parquet with compression
   - Zero learning curve: standard SQL
3. DuckDB Query Layer (Immediate Improvement; see Phase 1 below)
   - 3-10x faster than current parquet approach
   - Can use existing parquet files
   - Minimal code changes required
   - Bloom filters for fast filename lookups
4. Apache Arrow IPC/Feather v2 (Memory-Focused; see the sketch below)
   - Zero-copy operations for fast access
   - Memory mapping eliminates deserialization
   - 100x+ faster random access than parquet
   - Trade-off: 2-3x larger files
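A minimal sketch of what a Lance-based lookup could look like, assuming the low-level `lance` package (installed as `pylance`; the `lancedb` package mentioned above is the database layer built on the same format). The exact API should be verified against the Lance documentation, and paths and column names are illustrative:

```python
import lance
import pyarrow.parquet as pq

# One-off conversion of the existing dependency table (illustrative paths).
table = pq.read_table("db.parquet")
lance.write_dataset(table, "db.lance")

# Random row access by filename via a filter pushed down to the dataset.
dataset = lance.dataset("db.lance")
row = dataset.to_table(filter="file = 'wav/03a01Fa.wav'").to_pandas()
```

Lance also advertises scalar indices on columns, which is what the "native indexing" point above refers to.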
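The SQLite option needs nothing beyond the standard library. A minimal sketch with an illustrative schema (the real dependency table has more columns):

```python
import sqlite3

con = sqlite3.connect("deps.db")

# Hypothetical schema mirroring a few dependency-table columns.
con.execute(
    "CREATE TABLE IF NOT EXISTS dependencies "
    "(file TEXT, archive TEXT, checksum TEXT, duration REAL)"
)
con.execute("CREATE INDEX IF NOT EXISTS idx_filename ON dependencies(file)")

# O(log n) lookup by filename through the index.
row = con.execute(
    "SELECT archive FROM dependencies WHERE file = ?",
    ("wav/03a01Fa.wav",),
).fetchone()
```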
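For the Arrow IPC/Feather option, a memory-mapped read looks roughly like the sketch below; note that the file itself carries no index, so a filename-to-row mapping still has to be kept in memory (paths and column names are illustrative):

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# One-off conversion; uncompressed files can be memory-mapped zero-copy,
# which is where the 2-3x size trade-off comes from.
table = pq.read_table("db.parquet")
feather.write_feather(table, "db.arrow", compression="uncompressed")

# Memory mapping avoids deserialization; pages are loaded on demand.
source = pa.memory_map("db.arrow", "r")
deps = feather.read_table(source)

# Row lookup by filename through an in-memory index.
row_index = {f.as_py(): i for i, f in enumerate(deps["file"])}
row = deps.slice(row_index["wav/03a01Fa.wav"], 1)
```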
Implementation Strategy
Phase 1 (Immediate): Add a DuckDB query layer over the existing parquet files.

```python
import duckdb

# 3-10x performance boost with existing files
conn = duckdb.connect()
filename = "wav/03a01Fa.wav"  # example lookup
result = conn.execute(
    "SELECT * FROM 'db.parquet' WHERE file = ?", [filename]
).fetchone()
```

Phase 2 (Evaluation): Test the Lance format with a subset of the data.

Phase 3 (Migration): Switch to the best-performing format based on real benchmarks.
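For Phase 3, the choice should be driven by measurements on a real dependency table. A minimal harness could look like the following sketch, where the lookup callables are placeholders for the candidates above:

```python
import random
import time

def benchmark(name, lookup, files, repeats=10_000):
    """Time single-row lookups by filename for one candidate backend."""
    sample = random.choices(files, k=repeats)
    start = time.perf_counter()
    for file in sample:
        lookup(file)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed / repeats * 1e6:.1f} us per lookup")

# Hypothetical usage, comparing the current pandas access with candidates:
# benchmark("pandas.at", lambda f: deps.at[f, "archive"], files)
# benchmark("duckdb", lambda f: conn.execute(sql, [f]).fetchone(), files)
```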