
Increase speed of managing dependencies #517

@hagenw

Description

The dependency tables are the most critical part of audb as those are needed to download anything from a database.

Currently, we are using parquet files together with pyarrow-backed pandas.DataFrames. Selecting the right file format is not trivial, as we need both fast column access and fast random row access.

Here are some suggestions from Claude on how we could improve the current implementation:

Current State Analysis

Your current implementation uses:

  • Parquet with PyArrow backend stored as db.parquet
  • PyArrow-backed pandas DataFrames for 2-50x performance improvement
  • 20 MB parquet file vs 102 MB CSV for 1M files
  • File-based row lookups (deps.archive("wav/03a01Fa.wav")) via pandas DataFrame index
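
For reference, the current access pattern boils down to something like the following sketch (the "file" index and "archive" column are placeholders for illustration, not necessarily the exact schema):

import pandas as pd

# Sketch of the current approach: load the parquet dependency table
# into a pyarrow-backed DataFrame indexed by filename, then do a
# single-row lookup with .at. Column names are illustrative only.
df = pd.read_parquet("db.parquet", dtype_backend="pyarrow")
df = df.set_index("file")

archive = df.at["wav/03a01Fa.wav", "archive"]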

Key Performance Bottlenecks

From audb/core/dependencies.py:541, your primary access pattern is:

def _column_loc(self, column: str, file: str, dtype: Callable = None):
    value = self._df.at[file, column]  # Single row lookup by filename

This suggests random access by filename is critical, which parquet isn't optimized for.

Top Recommendations

  1. Lance Columnar Format (Best Overall)
  • 2000x faster random access than parquet
  • Native indexing perfect for filename-based lookups
  • Comparable file size to parquet
  • Built for ML workloads like yours
  • Python: pip install lancedb
  2. SQLite with Proper Indexing (Pragmatic Choice)

CREATE INDEX idx_filename ON dependencies(file);

  • O(log n) lookups vs O(n) DataFrame scans
  • Proven scalability to millions of rows
  • 20-30% smaller than parquet with compression
  • Zero learning curve - standard SQL (see the lookup sketch after this list)
  3. DuckDB Query Layer (Immediate Improvement)
  • 3-10x faster than current parquet approach
  • Can use existing parquet files
  • Minimal code changes required
  • Bloom filters for fast filename lookups
  4. Apache Arrow IPC/Feather v2 (Memory-Focused)
  • Zero-copy operations for fast access
  • Memory mapping eliminates deserialization
  • 100x+ faster random access than parquet
  • Trade-off: 2-3x larger files (see the memory-mapping sketch after this list)
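
To make option 2 concrete, here is a minimal sketch of the indexed SQLite lookup. The table name dependencies, the file/archive/checksum columns, and the deps.sqlite path are placeholders for illustration, not part of the current audb code:

import sqlite3

# Minimal sketch of option 2: keep the dependency table in SQLite
# and index the filename column for O(log n) single-row lookups.
# Table and column names are illustrative placeholders.
conn = sqlite3.connect("deps.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS dependencies "
    "(file TEXT, archive TEXT, checksum TEXT)"
)
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_filename ON dependencies(file)"
)

# Indexed single-row lookup by filename
row = conn.execute(
    "SELECT archive FROM dependencies WHERE file = ?",
    ("wav/03a01Fa.wav",),
).fetchone()

And a minimal sketch for option 4, memory-mapping a Feather v2 (Arrow IPC) file so reads avoid a full deserialization step. The file name and columns are again placeholders; compression is turned off here because compressed buffers cannot be used zero-copy:

import pyarrow as pa
import pyarrow.feather as feather

# Minimal sketch of option 4: write the dependency table once as
# Feather v2 (Arrow IPC), then memory-map it for zero-copy reads.
# "deps.feather" and the column names are illustrative placeholders.
table = pa.table({"file": ["wav/03a01Fa.wav"], "archive": ["archive1"]})
feather.write_feather(table, "deps.feather", compression="uncompressed")

# Memory-map the file instead of deserializing it into Python objects
with pa.memory_map("deps.feather") as source:
    mapped = pa.ipc.open_file(source).read_all()

# A filename -> row-number index is still needed on top for fast lookups
file_index = {name: i for i, name in enumerate(mapped.column("file").to_pylist())}
archive = mapped.column("archive")[file_index["wav/03a01Fa.wav"]].as_py()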

Implementation Strategy

Phase 1 (Immediate): Add DuckDB query layer over existing parquet

import duckdb

# 3-10x performance boost with existing files
conn = duckdb.connect()
filename = "wav/03a01Fa.wav"  # example lookup key
row = conn.execute(
    "SELECT * FROM 'db.parquet' WHERE file = ?", [filename]
).fetchone()

Phase 2 (Evaluation): Test Lance format with subset of data
Phase 3 (Migration): Switch to best-performing format based on real benchmarks
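
As a starting point for those benchmarks, here is a rough sketch of a micro-benchmark comparing single-row lookups in the current pandas approach against a DuckDB query over the same parquet file. The path "db.parquet" and the file/archive columns are assumptions for illustration; real benchmarks should run against a full-size dependency table with the actual schema:

import timeit

import duckdb
import pandas as pd

# Rough sketch: time 1000 single-row lookups by filename for two
# candidate backends on the same dependency table.
filename = "wav/03a01Fa.wav"

df = pd.read_parquet("db.parquet", dtype_backend="pyarrow").set_index("file")
conn = duckdb.connect()

t_pandas = timeit.timeit(lambda: df.at[filename, "archive"], number=1000)
t_duckdb = timeit.timeit(
    lambda: conn.execute(
        "SELECT archive FROM 'db.parquet' WHERE file = ?", [filename]
    ).fetchone(),
    number=1000,
)
print(f"pandas .at lookup: {t_pandas / 1000 * 1e6:.1f} us")
print(f"DuckDB lookup:     {t_duckdb / 1000 * 1e6:.1f} us")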
