Description
The dependency tables are the most critical part of audb, as they are needed to download anything from a database.
Currently, we use parquet files together with pyarrow-backed pandas.DataFrame objects. Selecting the right file format is not trivial, as we need both fast column access and fast random row access.
Here are some suggestions from Claude on how we could improve the current implementation:
Current State Analysis
Your current implementation uses:
- Parquet with PyArrow backend stored as db.parquet
- PyArrow-backed pandas DataFrames for 2-50x performance improvement
- 20 MB parquet vs. 102 MB CSV for 1M files
- File-based row lookups (deps.archive("wav/03a01Fa.wav")) via the pandas DataFrame index (sketched below)
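For reference, the lookup-relevant part of this setup boils down to a few lines; this is a simplified sketch, not the actual audb code, and the column names are illustrative:

```python
import pandas as pd

# Load the dependency table with the pyarrow dtype backend (pandas >= 2.0)
# and index it by file path, so single-row lookups go through the index.
deps = pd.read_parquet("db.parquet", dtype_backend="pyarrow").set_index("file")

archive = deps.at["wav/03a01Fa.wav", "archive"]  # row lookup by filename
```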
Key Performance Bottlenecks
From audb/core/dependencies.py:541, your primary access pattern is:
```python
def _column_loc(self, column: str, file: str, dtype: Callable = None):
    value = self._df.at[file, column]  # single row lookup by filename
```

This suggests that random access by filename is critical, which parquet isn't optimized for.
Top Recommendations
1. Lance Columnar Format (Best Overall; see the sketch below)
   - 2000x faster random access than parquet
   - Native indexing, perfect for filename-based lookups
   - Comparable file size to parquet
   - Built for ML workloads like yours
   - Python: `pip install lancedb`
2. SQLite with Proper Indexing (Pragmatic Choice; see the sketch below)
   - `CREATE INDEX idx_filename ON dependencies(file);`
   - O(log n) lookups vs O(n) DataFrame scans
   - Proven scalability to millions of rows
   - 20-30% smaller than parquet with compression
   - Zero learning curve: standard SQL
3. DuckDB Query Layer (Immediate Improvement; see Phase 1 below)
   - 3-10x faster than current parquet approach
   - Can use existing parquet files
   - Minimal code changes required
   - Bloom filters for fast filename lookups
4. Apache Arrow IPC/Feather v2 (Memory-Focused; see the sketch below)
   - Zero-copy operations for fast access
   - Memory mapping eliminates deserialization
   - 100x+ faster random access than parquet
   - Trade-off: 2-3x larger files
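A minimal sketch of what a Lance-based lookup could look like, assuming the low-level `lance` package (installed as `pylance`; the `lancedb` package mentioned above is the database layer built on the same format). The exact API should be verified against the Lance documentation, and paths and column names are illustrative:

```python
import lance
import pyarrow.parquet as pq

# One-off conversion of the existing dependency table (illustrative paths).
table = pq.read_table("db.parquet")
lance.write_dataset(table, "db.lance")

# Random row access by filename via a filter pushed down to the dataset.
dataset = lance.dataset("db.lance")
row = dataset.to_table(filter="file = 'wav/03a01Fa.wav'").to_pandas()
```

Lance also advertises scalar indices on columns, which is what the "native indexing" point above refers to.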
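The SQLite option needs nothing beyond the standard library. A minimal sketch with an illustrative schema (the real dependency table has more columns):

```python
import sqlite3

con = sqlite3.connect("deps.db")

# Hypothetical schema mirroring a few dependency-table columns.
con.execute(
    "CREATE TABLE IF NOT EXISTS dependencies "
    "(file TEXT, archive TEXT, checksum TEXT, duration REAL)"
)
con.execute("CREATE INDEX IF NOT EXISTS idx_filename ON dependencies(file)")

# O(log n) lookup by filename through the index.
row = con.execute(
    "SELECT archive FROM dependencies WHERE file = ?",
    ("wav/03a01Fa.wav",),
).fetchone()
```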
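For the Arrow IPC/Feather option, a memory-mapped read looks roughly like the sketch below; note that the file itself carries no index, so a filename-to-row mapping still has to be kept in memory (paths and column names are illustrative):

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# One-off conversion; uncompressed files can be memory-mapped zero-copy,
# which is where the 2-3x size trade-off comes from.
table = pq.read_table("db.parquet")
feather.write_feather(table, "db.arrow", compression="uncompressed")

# Memory mapping avoids deserialization; pages are loaded on demand.
source = pa.memory_map("db.arrow", "r")
deps = feather.read_table(source)

# Row lookup by filename through an in-memory index.
row_index = {f.as_py(): i for i, f in enumerate(deps["file"])}
row = deps.slice(row_index["wav/03a01Fa.wav"], 1)
```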
Implementation Strategy
Phase 1 (Immediate): Add a DuckDB query layer over the existing parquet files.

```python
import duckdb

# 3-10x performance boost with existing files
conn = duckdb.connect()
filename = "wav/03a01Fa.wav"  # example lookup
result = conn.execute(
    "SELECT * FROM 'db.parquet' WHERE file = ?", [filename]
).fetchone()
```

Phase 2 (Evaluation): Test the Lance format with a subset of the data.

Phase 3 (Migration): Switch to the best-performing format based on real benchmarks.
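For Phase 3, the choice should be driven by measurements on a real dependency table. A minimal harness could look like the following sketch, where the lookup callables are placeholders for the candidates above:

```python
import random
import time

def benchmark(name, lookup, files, repeats=10_000):
    """Time single-row lookups by filename for one candidate backend."""
    sample = random.choices(files, k=repeats)
    start = time.perf_counter()
    for file in sample:
        lookup(file)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed / repeats * 1e6:.1f} us per lookup")

# Hypothetical usage, comparing the current pandas access with candidates:
# benchmark("pandas.at", lambda f: deps.at[f, "archive"], files)
# benchmark("duckdb", lambda f: conn.execute(sql, [f]).fetchone(), files)
```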