Conversation

hagenw commented Jan 2, 2026

Try the Lance approach proposed in #517.

Benchmark results

| method | result |
| --- | --- |
| `Dependencies.save()` | 0.785 |
| `Dependencies.load()` | 0.666 |
| `Dependencies.__call__()` | 0.132 |
| `Dependencies.__contains__(10000 files)` | 0.002 |
| `Dependencies.__getitem__(10000 files)` | 0.162 |
| `Dependencies.__len__()` | 0.000 |
| `Dependencies.__str__()` | 0.157 |
| `Dependencies.archives` | 0.519 |
| `Dependencies.attachments` | 0.078 |
| `Dependencies.attachment_ids` | 0.073 |
| `Dependencies.files` | 0.443 |
| `Dependencies.media` | 0.417 |
| `Dependencies.removed_media` | 0.369 |
| `Dependencies.table_ids` | 0.104 |
| `Dependencies.tables` | 0.072 |
| `Dependencies.archive(10000 files)` | 0.049 |
| `Dependencies.bit_depth(10000 files)` | 0.054 |
| `Dependencies.channels(10000 files)` | 0.048 |
| `Dependencies.checksum(10000 files)` | 0.048 |
| `Dependencies.duration(10000 files)` | 0.039 |
| `Dependencies.format(10000 files)` | 0.038 |
| `Dependencies.removed(10000 files)` | 0.038 |
| `Dependencies.sampling_rate(10000 files)` | 0.038 |
| `Dependencies.type(10000 files)` | 0.038 |
| `Dependencies.version(10000 files)` | 0.038 |
| `Dependencies._add_attachment()` | 0.001 |
| `Dependencies._add_media(10000 files)` | 0.008 |
| `Dependencies._add_meta()` | 0.001 |
| `Dependencies._drop()` | 0.677 |
| `Dependencies._remove()` | 0.258 |
| `Dependencies._update_media()` | 3.953 |
| `Dependencies._update_media_version(10000 files)` | 0.458 |

Summary by Sourcery

Store and manage database dependencies with an in-memory PyArrow table persisted as Lance files, deprecating the old pickle/parquet-centric cache and updating loading, publishing, and tests accordingly.

New Features:

  • Support Lance as the primary on-disk format for dependency tables, with CSV and Parquet maintained as fallback formats.
  • Add serialization logic for Dependencies to work with the new Arrow/Lance-based representation and download workflow.
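For orientation, here is a minimal sketch of that round trip, using the `lance.file` single-file reader/writer that the reviewer's guide below refers to; the schema is simplified and not the actual audb dependency schema:

```python
import pyarrow as pa
from lance.file import LanceFileReader, LanceFileWriter

# Simplified schema; the real dependency table has more columns
schema = pa.schema([("file", pa.string()), ("checksum", pa.string())])
table = pa.table({"file": ["db.files.csv"], "checksum": ["abc123"]}, schema=schema)

# Write the in-memory PyArrow table to a single .lance file
with LanceFileWriter("db.lance", schema=schema) as writer:
    writer.write_batch(table)

# Read it back into a PyArrow table
table = LanceFileReader("db.lance").read_all().to_table()
```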

Enhancements:

  • Reimplement the Dependencies container around a PyArrow table with explicit indexing for faster lookups and updates.
  • Update dependency loading logic to prefer Lance, then Parquet, then legacy ZIP/CSV when downloading from backends.
  • Simplify cache handling by removing the separate pickled dependency cache file and using the main dependency file instead.
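A rough sketch of that fallback order; the function name, paths, and `tmp_root` handling are illustrative and not the actual `audb.core.api` code, while `exists`, `get_file`, and `get_archive` are the backend methods shown in the class diagram below:

```python
import os


def download_dependencies(backend_interface, name, version, tmp_root):
    """Fetch the dependency table, preferring Lance, then Parquet, then ZIP/CSV."""
    remote_lance = f"/{name}/db.lance"
    remote_parquet = f"/{name}/db.parquet"
    if backend_interface.exists(remote_lance, version):
        local_deps_file = os.path.join(tmp_root, "db.lance")
        backend_interface.get_file(remote_lance, local_deps_file, version)
    elif backend_interface.exists(remote_parquet, version):
        local_deps_file = os.path.join(tmp_root, "db.parquet")
        backend_interface.get_file(remote_parquet, local_deps_file, version)
    else:
        # Legacy databases only provide a zipped CSV dependency table
        backend_interface.get_archive(f"/{name}/db.zip", tmp_root, version)
        local_deps_file = os.path.join(tmp_root, "db.csv")
    return local_deps_file
```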

Build:

  • Add the pylance package as an installation dependency to support Lance-based storage.

Tests:

  • Refactor dependency tests to construct data via the new Arrow-backed APIs and cover Lance-based load/save behavior.
  • Remove tests related to pickle-based backward compatibility now that pickle caching is no longer used.
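A hedged sketch of what such a round-trip test could look like; the `deps` fixture and the file name are illustrative, not the actual test code:

```python
import pytest

import audb


@pytest.mark.parametrize("extension", ["csv", "parquet", "lance"])
def test_save_and_load(tmp_path, deps, extension):  # "deps" is a hypothetical fixture
    path = str(tmp_path / f"deps.{extension}")
    deps.save(path)
    deps_loaded = audb.Dependencies()
    deps_loaded.load(path)
    assert deps_loaded == deps
```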

sourcery-ai bot commented Jan 2, 2026

Reviewer's Guide

Refactors the Dependencies backend from a pandas DataFrame to an in‑memory PyArrow table with indexed lookups and adds Lance as the primary on-disk format (with CSV and Parquet compatibility), updating loading/saving, cache semantics, and tests accordingly.

Sequence diagram for resolving dependencies with cache and backend

sequenceDiagram
    actor Client
    participant API as api.dependencies
    participant Deps as Dependencies
    participant Backend as BackendInterface

    Client->>API: dependencies(name, version, cache_root)
    API->>API: resolve db_root
    API->>Deps: deps = Dependencies()
    API->>Deps: deps.load(db_root/DB.lance)
    Deps-->>API: raise or return
    alt Cached load succeeds
        API-->>Client: return deps
    else Cached load fails
        API->>Backend: backend_interface = lookup_backend(name, version)
        API->>Backend: exists(/name/DB.lance, version)
        alt Lance exists
            API->>Backend: get_file(/name/DB.lance, tmp_root/DB.lance, version)
            API->>Deps: deps.load(tmp_root/DB.lance)
        else Lance missing
            API->>Backend: exists(/name/DB.parquet, version)
            alt Parquet exists
                API->>Backend: get_file(/name/DB.parquet, tmp_root/DB.parquet, version)
                API->>Deps: deps.load(tmp_root/DB.parquet)
            else Parquet missing
                API->>Backend: get_archive(/name/DB.zip, tmp_root, version)
                API->>Deps: deps.load(tmp_root/DB.csv)
            end
        end
        API->>Deps: deps.save(db_root/DB.lance)
        API-->>Client: return deps
    end

Class diagram for updated Dependencies storage and operations

classDiagram
    class Dependencies {
        - pa.Schema _schema
        - pa.Table _table
        - dict~str,int~ _file_index
        + Dependencies()
        + __call__() pd.DataFrame
        + __contains__(file str) bool
        + __eq__(other Dependencies) bool
        + __getitem__(file str) list
        + __len__() int
        + __str__() str
        + __getstate__() dict
        + __setstate__(state dict) None
        + archives() list~str~
        + attachments() list~str~
        + attachment_ids() list~str~
        + files() list~str~
        + media() list~str~
        + removed_media() list~str~
        + table_ids() list~str~
        + tables() list~str~
        + archive(file str) str
        + bit_depth(file str) int
        + channels(file str) int
        + checksum(file str) str
        + duration(file str) float
        + format(file str) str
        + removed(file str) bool
        + sampling_rate(file str) int
        + type(file str) int
        + version(file str) str
        + load(path str) None
        + save(path str) None
        - _add_attachment(file str, archive str, checksum str, version str) None
        - _add_media(values list~tuple~) None
        - _add_meta(file str, checksum str, version str) None
        - _column_loc(column str, file str, dtype type) any
        - _rebuild_index() None
        - _dataframe_to_table(df pd.DataFrame, file_column bool) pa.Table
        - _table_to_dataframe(table pa.Table) pd.DataFrame
        - _drop(files Sequence~str~) None
        - _remove(file str) None
        - _set_dtypes(df pd.DataFrame) pd.DataFrame
        - _update_media(values list~tuple~) None
        - _update_media_version(files list~str~, version str) None
    }

    class LanceFileReader {
        + LanceFileReader(path str)
        + read_all() LanceReaderResult
    }

    class LanceFileWriter {
        + LanceFileWriter(path str, schema pa.Schema)
        + write_batch(table pa.Table) None
        + __enter__() LanceFileWriter
        + __exit__(exc_type type, exc_val BaseException, exc_tb object) None
    }

    class BackendInterface {
        + join(root str, name str, file str) str
        + exists(path str, version str) bool
        + get_file(remote str, local str, version str, verbose bool) None
        + get_archive(remote str, local_root str, version str, verbose bool) None
    }

    Dependencies ..> pa.Schema
    Dependencies ..> pa.Table
    Dependencies ..> LanceFileReader
    Dependencies ..> LanceFileWriter
    Dependencies ..> BackendInterface
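To make the indexing idea in the class diagram concrete, here is a condensed, illustrative sketch under a simplified schema; it is not the real `Dependencies` class:

```python
import pyarrow as pa


class DependenciesSketch:
    """Illustrative core of the Arrow-backed container, not the real class."""

    def __init__(self):
        # Simplified schema; audb defines the full dependency schema
        self._schema = pa.schema(
            [
                ("file", pa.string()),
                ("checksum", pa.string()),
                ("version", pa.string()),
            ]
        )
        self._table = self._schema.empty_table()
        self._file_index = {}

    def _rebuild_index(self):
        # Map each file to its row number so lookups avoid scanning the table
        files = self._table.column("file").to_pylist()
        self._file_index = {file: row for row, file in enumerate(files)}

    def __contains__(self, file):
        return file in self._file_index

    def __len__(self):
        return self._table.num_rows

    def _column_loc(self, column, file):
        # Single-cell lookup through the index instead of filtering the table
        row = self._file_index[file]
        return self._table.column(column)[row].as_py()
```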

Flow diagram for dependency file format resolution and loading

flowchart TD
    A_start["api.dependencies(name, version)"] --> B_has_cached
    B_has_cached["Cached file path = db_root/DB.lance"] --> C_try_load_cached
    C_try_load_cached["Dependencies.load(DB.lance)"] --> D_cached_ok{Loaded successfully?}
    D_cached_ok -- Yes --> Z_return_cached["return deps"]
    D_cached_ok -- No --> E_lookup_backend

    E_lookup_backend["backend_interface = lookup_backend(name, version)"] --> F_download_deps

    subgraph DownloadDependencies
        F_download_deps["download_dependencies(backend_interface, name, version)"] --> G_try_lance
        G_try_lance["remote = /name/DB.lance\nbackend_interface.exists(remote, version)"] --> H_lance_exists{Exists?}
        H_lance_exists -- Yes --> I_get_lance["get_file(remote, local DB.lance)"] --> J_local_path_lance["local_deps_file = DB.lance"]
        H_lance_exists -- No --> K_try_parquet

        K_try_parquet["remote = /name/DB.parquet\nbackend_interface.exists(remote, version)"] --> L_parquet_exists{Exists?}
        L_parquet_exists -- Yes --> M_get_parquet["get_file(remote, local DB.parquet)"] --> N_local_path_parquet["local_deps_file = DB.parquet"]
        L_parquet_exists -- No --> O_fallback_legacy

        O_fallback_legacy["remote = /name/DB.zip"] --> P_get_legacy["get_archive(remote, tmp_root)"] --> Q_local_path_legacy["local_deps_file = DB.csv (legacy)"]
    end

    J_local_path_lance --> R_load_downloaded
    N_local_path_parquet --> R_load_downloaded
    Q_local_path_legacy --> R_load_downloaded

    R_load_downloaded["deps = Dependencies(); deps.load(local_deps_file)"] --> S_save_cache
    S_save_cache["deps.save(db_root/DB.lance)"] --> T_return_downloaded["return deps"]

    subgraph Dependencies.load
        U_start_load["load(path)"] --> V_ext
        V_ext["extension = file_extension(path)"] --> W_check_ext
        W_check_ext{"ext in [csv, parquet, lance]?"} -- No --> X_error["raise ValueError"]
        W_check_ext -- Yes --> Y_branch
        Y_branch{Extension} -->|lance| Y1_lance["reader = LanceFileReader(path)\nresults = reader.read_all()\ntable = results.to_table()"]
        Y_branch -->|csv| Y2_csv["table = csv.read_csv(path, schema=_schema)"]
        Y_branch -->|parquet| Y3_parquet["table = parquet.read_table(path)"]
        Y1_lance --> Z_set_table
        Y2_csv --> Z_set_table
        Y3_parquet --> Z_set_table
        Z_set_table["self._table = table\nself._rebuild_index()"] --> AA_end_load["return None"]
    end

    subgraph Dependencies.save
        AB_start_save["save(path)"] --> AC_choose_ext
        AC_choose_ext{path suffix} -->|.csv| AD_save_csv
        AC_choose_ext -->|.parquet| AE_save_parquet
        AC_choose_ext -->|.lance| AF_save_lance

        AD_save_csv["df = self()\ntable = _dataframe_to_table(df)\ncsv.write_csv(table, path)"] --> AG_end_save["return None"]
        AE_save_parquet["df = self()\ntable = _dataframe_to_table(df, file_column=True)\nparquet.write_table(table, path)"] --> AG_end_save
        AF_save_lance["if exists(path): os.remove(path)\nwith LanceFileWriter(path, schema=_schema) as writer:\n    writer.write_batch(self._table)"] --> AG_end_save
    end

File-Level Changes

Each change below lists its details and the affected files.
Replace Dependencies internal storage from pandas DataFrame to PyArrow table with explicit schema and index for faster operations.
  • Initialize Dependencies with a fixed PyArrow schema and empty table instead of an empty pandas DataFrame
  • Add an internal file-to-row index dictionary and helper to rebuild it after mutations
  • Reimplement `__call__`, `__contains__`, `__getitem__`, `__len__`, `__str__`, and equality to operate on the PyArrow table and convert to DataFrame on demand
  • Add `__getstate__` and `__setstate__` to make Dependencies picklable by serializing via DataFrame records and rebuilding the table and index
audb/core/dependencies.py
tests/test_dependencies.py
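The `__getstate__`/`__setstate__` round trip described in this change might look roughly like the following sketch; the schema and the `_rebuild_index` helper reuse the simplified versions from the class-diagram sketch above and are not the actual implementation:

```python
import pandas as pd
import pyarrow as pa

SCHEMA = pa.schema([("file", pa.string()), ("checksum", pa.string())])  # simplified


class PickleSketch:
    def __getstate__(self):
        # Plain Python records pickle without any Arrow-specific machinery
        return {"records": self._table.to_pylist()}

    def __setstate__(self, state):
        df = pd.DataFrame(state["records"], columns=SCHEMA.names)
        self._table = pa.Table.from_pandas(df, schema=SCHEMA, preserve_index=False)
        self._rebuild_index()  # see the indexing sketch above
```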
Implement PyArrow-based property accessors and column-specific helpers for dependencies instead of DataFrame queries.
  • Rewrite archives, attachments, attachment_ids, files, media, removed_media, table_ids, and tables properties using pyarrow.compute filters and unique operations
  • Implement _column_loc to read a single column value for a given file using the table and index with optional type casting
audb/core/dependencies.py
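A loose sketch of the `pyarrow.compute`-based accessors; the column names and the `MEDIA_TYPE` constant are assumptions about the dependency schema, not the actual code:

```python
import pyarrow.compute as pc

MEDIA_TYPE = 1  # hypothetical type code for media rows


class AccessorSketch:
    """Methods as they might sit on Dependencies; self._table is the Arrow table."""

    @property
    def archives(self):
        # Unique archive names, sorted
        return sorted(pc.unique(self._table.column("archive")).to_pylist())

    @property
    def removed_media(self):
        # Media rows that carry the removed flag
        mask = pc.and_(
            pc.equal(self._table.column("type"), MEDIA_TYPE),
            pc.equal(self._table.column("removed"), 1),
        )
        return self._table.filter(mask).column("file").to_pylist()
```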
Add Lance support as the primary persistence format for dependencies and adjust load/save paths and backend download logic.
  • Update Dependencies.load to accept csv, parquet, and lance, reading lance via LanceFileReader and others via PyArrow CSV/Parquet readers, then setting the internal table and rebuilding the index
  • Update Dependencies.save to write csv and parquet via DataFrame→PyArrow conversion, and lance via LanceFileWriter using the in-memory table and schema
  • Change DEPENDENCY_FILE to db.lance, introduce PARQUET_DEPENDENCY_FILE for backward compatibility, and update download_dependencies to prefer Lance, then Parquet, then legacy zip/CSV
  • Adjust API.cached/dependencies cache handling to read/write only the Lance dependency file instead of a separate pickle cache file
audb/core/dependencies.py
audb/core/define.py
audb/core/api.py
audb/core/dependencies.py
Reimplement mutating helpers (_add_attachment, _add_media, _add_meta, _drop, _remove, _update_media, _update_media_version) to operate on the PyArrow table and maintain the index.
  • Construct new rows/batches as small PyArrow tables/arrays and append via pa.concat_tables for add operations, dropping any existing entries for the same file first where needed
  • Implement _drop via a computed keep-mask over the file column, followed by index rebuild
  • Implement _remove and _update_media_version via per-column list updates and set_column replacement
  • Implement _update_media by materializing columns to Python lists, updating in-place using the file index, and reconstructing a new table from updated lists
audb/core/dependencies.py
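The mutation pattern described here, sketched under the same simplified-schema assumption (`_rebuild_index` as above); `_drop` mirrors the real method's role, while `_add_rows` is a stand-in for the various `_add_*` helpers and does not match their real signatures:

```python
import pyarrow as pa
import pyarrow.compute as pc


class MutatorSketch:
    def _drop(self, files):
        # Keep every row whose file is not in the given sequence
        keep = pc.invert(
            pc.is_in(self._table.column("file"), value_set=pa.array(list(files)))
        )
        self._table = self._table.filter(keep)
        self._rebuild_index()

    def _add_rows(self, rows):
        # rows: list of dicts matching the table schema
        new = pa.Table.from_pylist(rows, schema=self._table.schema)
        self._table = pa.concat_tables([self._table, new])
        self._rebuild_index()
```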
Adapt higher-level logic and tests to the new Dependencies representation and Lance behavior.
  • Update tests to construct Dependencies using _add_media instead of directly setting _df and to use the public `__call__` interface when asserting dtypes and string representations
  • Change load/save tests to parametrize over csv, parquet, and lance only, and drop backward-compatibility tests for pickle-based cache
  • Update load.job to derive flavor_files from deps() DataFrame instead of deps._df
  • Adjust _find_attachments in publish to compute removed attachments via attachments/attachment_ids properties instead of accessing the internal DataFrame index
  • Update publish tests and docs to expect Lance (or other updated) dependency filenames where necessary
  • Add pylance to project dependencies
tests/test_dependencies.py
audb/core/load.py
audb/core/publish.py
tests/test_publish.py
tests/test_publish_table.py
pyproject.toml
docs/publish.rst


hagenw commented Jan 2, 2026

The current implementation only uses pylance, not lancedb. It is slower than main, so we should also try lancedb. That will require a few more changes, as lancedb does not store the dependency table in a single file but in a folder (which we could zip for upload).
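A possible direction for the folder-based variant, sketched with the pylance dataset API (lancedb itself would go through its own connect/table API); this is not part of the PR, and the paths and columns are illustrative:

```python
import shutil

import lance
import pyarrow as pa

table = pa.table({"file": ["db.files.csv"], "checksum": ["abc123"]})

# lance.write_dataset creates a directory (data files + manifest), not a single file
lance.write_dataset(table, "deps.lance", mode="overwrite")

# Zip the directory so it can be uploaded as one artifact
shutil.make_archive("deps", "zip", root_dir="deps.lance")

# Reading it back
table = lance.dataset("deps.lance").to_table()
```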
