Conversation

hagenw commented Jan 2, 2026

Try the Lance approach proposed in #517.

Benchmark results

| method | result |
| --- | --- |
| `Dependencies.save()` | 0.785 |
| `Dependencies.load()` | 0.666 |
| `Dependencies.__call__()` | 0.132 |
| `Dependencies.__contains__(10000 files)` | 0.002 |
| `Dependencies.__getitem__(10000 files)` | 0.162 |
| `Dependencies.__len__()` | 0.000 |
| `Dependencies.__str__()` | 0.157 |
| `Dependencies.archives` | 0.519 |
| `Dependencies.attachments` | 0.078 |
| `Dependencies.attachment_ids` | 0.073 |
| `Dependencies.files` | 0.443 |
| `Dependencies.media` | 0.417 |
| `Dependencies.removed_media` | 0.369 |
| `Dependencies.table_ids` | 0.104 |
| `Dependencies.tables` | 0.072 |
| `Dependencies.archive(10000 files)` | 0.049 |
| `Dependencies.bit_depth(10000 files)` | 0.054 |
| `Dependencies.channels(10000 files)` | 0.048 |
| `Dependencies.checksum(10000 files)` | 0.048 |
| `Dependencies.duration(10000 files)` | 0.039 |
| `Dependencies.format(10000 files)` | 0.038 |
| `Dependencies.removed(10000 files)` | 0.038 |
| `Dependencies.sampling_rate(10000 files)` | 0.038 |
| `Dependencies.type(10000 files)` | 0.038 |
| `Dependencies.version(10000 files)` | 0.038 |
| `Dependencies._add_attachment()` | 0.001 |
| `Dependencies._add_media(10000 files)` | 0.008 |
| `Dependencies._add_meta()` | 0.001 |
| `Dependencies._drop()` | 0.677 |
| `Dependencies._remove()` | 0.258 |
| `Dependencies._update_media()` | 3.953 |
| `Dependencies._update_media_version(10000 files)` | 0.458 |

Summary by Sourcery

Store and manage database dependencies with an in-memory PyArrow table persisted as Lance files, deprecating the old pickle/parquet-centric cache and updating loading, publishing, and tests accordingly.

New Features:

  • Support Lance as the primary on-disk format for dependency tables, with CSV and Parquet maintained as fallback formats.
  • Add serialization logic for Dependencies to work with the new Arrow/Lance-based representation and download workflow.
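For orientation, here is a minimal sketch of that round trip, using the `lance.file` single-file reader/writer that the reviewer's guide below refers to; the schema is simplified and not the actual audb dependency schema:

```python
import pyarrow as pa
from lance.file import LanceFileReader, LanceFileWriter

# Simplified schema; the real dependency table has more columns
schema = pa.schema([("file", pa.string()), ("checksum", pa.string())])
table = pa.table({"file": ["db.files.csv"], "checksum": ["abc123"]}, schema=schema)

# Write the in-memory PyArrow table to a single .lance file
with LanceFileWriter("db.lance", schema=schema) as writer:
    writer.write_batch(table)

# Read it back into a PyArrow table
table = LanceFileReader("db.lance").read_all().to_table()
```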

Enhancements:

  • Reimplement the Dependencies container around a PyArrow table with explicit indexing for faster lookups and updates.
  • Update dependency loading logic to prefer Lance, then Parquet, then legacy ZIP/CSV when downloading from backends.
  • Simplify cache handling by removing the separate pickled dependency cache file and using the main dependency file instead.
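A rough sketch of that fallback order; the function name, paths, and `tmp_root` handling are illustrative and not the actual `audb.core.api` code, while `exists`, `get_file`, and `get_archive` are the backend methods shown in the class diagram below:

```python
import os


def download_dependencies(backend_interface, name, version, tmp_root):
    """Fetch the dependency table, preferring Lance, then Parquet, then ZIP/CSV."""
    remote_lance = f"/{name}/db.lance"
    remote_parquet = f"/{name}/db.parquet"
    if backend_interface.exists(remote_lance, version):
        local_deps_file = os.path.join(tmp_root, "db.lance")
        backend_interface.get_file(remote_lance, local_deps_file, version)
    elif backend_interface.exists(remote_parquet, version):
        local_deps_file = os.path.join(tmp_root, "db.parquet")
        backend_interface.get_file(remote_parquet, local_deps_file, version)
    else:
        # Legacy databases only provide a zipped CSV dependency table
        backend_interface.get_archive(f"/{name}/db.zip", tmp_root, version)
        local_deps_file = os.path.join(tmp_root, "db.csv")
    return local_deps_file
```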

Build:

  • Add the pylance package as an installation dependency to support Lance-based storage.

Tests:

  • Refactor dependency tests to construct data via the new Arrow-backed APIs and cover Lance-based load/save behavior.
  • Remove tests related to pickle-based backward compatibility now that pickle caching is no longer used.
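A hedged sketch of what such a round-trip test could look like; the `deps` fixture and the file name are illustrative, not the actual test code:

```python
import pytest

import audb


@pytest.mark.parametrize("extension", ["csv", "parquet", "lance"])
def test_save_and_load(tmp_path, deps, extension):  # "deps" is a hypothetical fixture
    path = str(tmp_path / f"deps.{extension}")
    deps.save(path)
    deps_loaded = audb.Dependencies()
    deps_loaded.load(path)
    assert deps_loaded == deps
```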

sourcery-ai bot commented Jan 2, 2026

Reviewer's Guide

Refactors the Dependencies backend from a pandas DataFrame to an in‑memory PyArrow table with indexed lookups and adds Lance as the primary on-disk format (with CSV and Parquet compatibility), updating loading/saving, cache semantics, and tests accordingly.

Sequence diagram for resolving dependencies with cache and backend

sequenceDiagram
    actor Client
    participant API as api.dependencies
    participant Deps as Dependencies
    participant Backend as BackendInterface

    Client->>API: dependencies(name, version, cache_root)
    API->>API: resolve db_root
    API->>Deps: deps = Dependencies()
    API->>Deps: deps.load(db_root/DB.lance)
    Deps-->>API: raise or return
    alt Cached load succeeds
        API-->>Client: return deps
    else Cached load fails
        API->>Backend: backend_interface = lookup_backend(name, version)
        API->>Backend: exists(/name/DB.lance, version)
        alt Lance exists
            API->>Backend: get_file(/name/DB.lance, tmp_root/DB.lance, version)
            API->>Deps: deps.load(tmp_root/DB.lance)
        else Lance missing
            API->>Backend: exists(/name/DB.parquet, version)
            alt Parquet exists
                API->>Backend: get_file(/name/DB.parquet, tmp_root/DB.parquet, version)
                API->>Deps: deps.load(tmp_root/DB.parquet)
            else Parquet missing
                API->>Backend: get_archive(/name/DB.zip, tmp_root, version)
                API->>Deps: deps.load(tmp_root/DB.csv)
            end
        end
        API->>Deps: deps.save(db_root/DB.lance)
        API-->>Client: return deps
    end

Class diagram for updated Dependencies storage and operations

classDiagram
    class Dependencies {
        - pa.Schema _schema
        - pa.Table _table
        - dict~str,int~ _file_index
        + Dependencies()
        + __call__() pd.DataFrame
        + __contains__(file str) bool
        + __eq__(other Dependencies) bool
        + __getitem__(file str) list
        + __len__() int
        + __str__() str
        + __getstate__() dict
        + __setstate__(state dict) None
        + archives() list~str~
        + attachments() list~str~
        + attachment_ids() list~str~
        + files() list~str~
        + media() list~str~
        + removed_media() list~str~
        + table_ids() list~str~
        + tables() list~str~
        + archive(file str) str
        + bit_depth(file str) int
        + channels(file str) int
        + checksum(file str) str
        + duration(file str) float
        + format(file str) str
        + removed(file str) bool
        + sampling_rate(file str) int
        + type(file str) int
        + version(file str) str
        + load(path str) None
        + save(path str) None
        - _add_attachment(file str, archive str, checksum str, version str) None
        - _add_media(values list~tuple~) None
        - _add_meta(file str, checksum str, version str) None
        - _column_loc(column str, file str, dtype type) any
        - _rebuild_index() None
        - _dataframe_to_table(df pd.DataFrame, file_column bool) pa.Table
        - _table_to_dataframe(table pa.Table) pd.DataFrame
        - _drop(files Sequence~str~) None
        - _remove(file str) None
        - _set_dtypes(df pd.DataFrame) pd.DataFrame
        - _update_media(values list~tuple~) None
        - _update_media_version(files list~str~, version str) None
    }

    class LanceFileReader {
        + LanceFileReader(path str)
        + read_all() LanceReaderResult
    }

    class LanceFileWriter {
        + LanceFileWriter(path str, schema pa.Schema)
        + write_batch(table pa.Table) None
        + __enter__() LanceFileWriter
        + __exit__(exc_type type, exc_val BaseException, exc_tb object) None
    }

    class BackendInterface {
        + join(root str, name str, file str) str
        + exists(path str, version str) bool
        + get_file(remote str, local str, version str, verbose bool) None
        + get_archive(remote str, local_root str, version str, verbose bool) None
    }

    Dependencies ..> pa.Schema
    Dependencies ..> pa.Table
    Dependencies ..> LanceFileReader
    Dependencies ..> LanceFileWriter
    Dependencies ..> BackendInterface
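To make the indexing idea in the class diagram concrete, here is a condensed, illustrative sketch under a simplified schema; it is not the real `Dependencies` class:

```python
import pyarrow as pa


class DependenciesSketch:
    """Illustrative core of the Arrow-backed container, not the real class."""

    def __init__(self):
        # Simplified schema; audb defines the full dependency schema
        self._schema = pa.schema(
            [
                ("file", pa.string()),
                ("checksum", pa.string()),
                ("version", pa.string()),
            ]
        )
        self._table = self._schema.empty_table()
        self._file_index = {}

    def _rebuild_index(self):
        # Map each file to its row number so lookups avoid scanning the table
        files = self._table.column("file").to_pylist()
        self._file_index = {file: row for row, file in enumerate(files)}

    def __contains__(self, file):
        return file in self._file_index

    def __len__(self):
        return self._table.num_rows

    def _column_loc(self, column, file):
        # Single-cell lookup through the index instead of filtering the table
        row = self._file_index[file]
        return self._table.column(column)[row].as_py()
```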

Flow diagram for dependency file format resolution and loading

flowchart TD
    A_start["api.dependencies(name, version)"] --> B_has_cached
    B_has_cached["Cached file path = db_root/DB.lance"] --> C_try_load_cached
    C_try_load_cached["Dependencies.load(DB.lance)"] --> D_cached_ok{Loaded successfully?}
    D_cached_ok -- Yes --> Z_return_cached["return deps"]
    D_cached_ok -- No --> E_lookup_backend

    E_lookup_backend["backend_interface = lookup_backend(name, version)"] --> F_download_deps

    subgraph DownloadDependencies
        F_download_deps["download_dependencies(backend_interface, name, version)"] --> G_try_lance
        G_try_lance["remote = /name/DB.lance\nbackend_interface.exists(remote, version)"] --> H_lance_exists{Exists?}
        H_lance_exists -- Yes --> I_get_lance["get_file(remote, local DB.lance)"] --> J_local_path_lance["local_deps_file = DB.lance"]
        H_lance_exists -- No --> K_try_parquet

        K_try_parquet["remote = /name/DB.parquet\nbackend_interface.exists(remote, version)"] --> L_parquet_exists{Exists?}
        L_parquet_exists -- Yes --> M_get_parquet["get_file(remote, local DB.parquet)"] --> N_local_path_parquet["local_deps_file = DB.parquet"]
        L_parquet_exists -- No --> O_fallback_legacy

        O_fallback_legacy["remote = /name/DB.zip"] --> P_get_legacy["get_archive(remote, tmp_root)"] --> Q_local_path_legacy["local_deps_file = DB.csv (legacy)"]
    end

    J_local_path_lance --> R_load_downloaded
    N_local_path_parquet --> R_load_downloaded
    Q_local_path_legacy --> R_load_downloaded

    R_load_downloaded["deps = Dependencies(); deps.load(local_deps_file)"] --> S_save_cache
    S_save_cache["deps.save(db_root/DB.lance)"] --> T_return_downloaded["return deps"]

    subgraph Dependencies.load
        U_start_load["load(path)"] --> V_ext
        V_ext["extension = file_extension(path)"] --> W_check_ext
        W_check_ext{"ext in [csv, parquet, lance]?"} -- No --> X_error["raise ValueError"]
        W_check_ext -- Yes --> Y_branch
        Y_branch{Extension} -->|lance| Y1_lance["reader = LanceFileReader(path)\nresults = reader.read_all()\ntable = results.to_table()"]
        Y_branch -->|csv| Y2_csv["table = csv.read_csv(path, schema=_schema)"]
        Y_branch -->|parquet| Y3_parquet["table = parquet.read_table(path)"]
        Y1_lance --> Z_set_table
        Y2_csv --> Z_set_table
        Y3_parquet --> Z_set_table
        Z_set_table["self._table = table\nself._rebuild_index()"] --> AA_end_load["return None"]
    end

    subgraph Dependencies.save
        AB_start_save["save(path)"] --> AC_choose_ext
        AC_choose_ext{path suffix} -->|.csv| AD_save_csv
        AC_choose_ext -->|.parquet| AE_save_parquet
        AC_choose_ext -->|.lance| AF_save_lance

        AD_save_csv["df = self()\ntable = _dataframe_to_table(df)\ncsv.write_csv(table, path)"] --> AG_end_save["return None"]
        AE_save_parquet["df = self()\ntable = _dataframe_to_table(df, file_column=True)\nparquet.write_table(table, path)"] --> AG_end_save
        AF_save_lance["if exists(path): os.remove(path)\nwith LanceFileWriter(path, schema=_schema) as writer:\n    writer.write_batch(self._table)"] --> AG_end_save
    end

File-Level Changes

Each change below lists its details and the affected files.
Replace Dependencies internal storage from pandas DataFrame to PyArrow table with explicit schema and index for faster operations.
  • Initialize Dependencies with a fixed PyArrow schema and empty table instead of an empty pandas DataFrame
  • Add an internal file-to-row index dictionary and helper to rebuild it after mutations
  • Reimplement `__call__`, `__contains__`, `__getitem__`, `__len__`, `__str__`, and equality to operate on the PyArrow table and convert to DataFrame on demand
  • Add `__getstate__` and `__setstate__` to make Dependencies picklable by serializing via DataFrame records and rebuilding the table and index
audb/core/dependencies.py
tests/test_dependencies.py
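The `__getstate__`/`__setstate__` round trip described in this change might look roughly like the following sketch; the schema and the `_rebuild_index` helper reuse the simplified versions from the class-diagram sketch above and are not the actual implementation:

```python
import pandas as pd
import pyarrow as pa

SCHEMA = pa.schema([("file", pa.string()), ("checksum", pa.string())])  # simplified


class PickleSketch:
    def __getstate__(self):
        # Plain Python records pickle without any Arrow-specific machinery
        return {"records": self._table.to_pylist()}

    def __setstate__(self, state):
        df = pd.DataFrame(state["records"], columns=SCHEMA.names)
        self._table = pa.Table.from_pandas(df, schema=SCHEMA, preserve_index=False)
        self._rebuild_index()  # see the indexing sketch above
```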
Implement PyArrow-based property accessors and column-specific helpers for dependencies instead of DataFrame queries.
  • Rewrite archives, attachments, attachment_ids, files, media, removed_media, table_ids, and tables properties using pyarrow.compute filters and unique operations
  • Implement _column_loc to read a single column value for a given file using the table and index with optional type casting
audb/core/dependencies.py
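A loose sketch of the `pyarrow.compute`-based accessors; the column names and the `MEDIA_TYPE` constant are assumptions about the dependency schema, not the actual code:

```python
import pyarrow.compute as pc

MEDIA_TYPE = 1  # hypothetical type code for media rows


class AccessorSketch:
    """Methods as they might sit on Dependencies; self._table is the Arrow table."""

    @property
    def archives(self):
        # Unique archive names, sorted
        return sorted(pc.unique(self._table.column("archive")).to_pylist())

    @property
    def removed_media(self):
        # Media rows that carry the removed flag
        mask = pc.and_(
            pc.equal(self._table.column("type"), MEDIA_TYPE),
            pc.equal(self._table.column("removed"), 1),
        )
        return self._table.filter(mask).column("file").to_pylist()
```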
Add Lance support as the primary persistence format for dependencies and adjust load/save paths and backend download logic.
  • Update Dependencies.load to accept csv, parquet, and lance, reading lance via LanceFileReader and others via PyArrow CSV/Parquet readers, then setting the internal table and rebuilding the index
  • Update Dependencies.save to write csv and parquet via DataFrame→PyArrow conversion, and lance via LanceFileWriter using the in-memory table and schema
  • Change DEPENDENCY_FILE to db.lance, introduce PARQUET_DEPENDENCY_FILE for backward compatibility, and update download_dependencies to prefer Lance, then Parquet, then legacy zip/CSV
  • Adjust API.cached/dependencies cache handling to read/write only the Lance dependency file instead of a separate pickle cache file
audb/core/dependencies.py
audb/core/define.py
audb/core/api.py
audb/core/dependencies.py
Reimplement mutating helpers (_add_attachment, _add_media, _add_meta, _drop, _remove, _update_media, _update_media_version) to operate on the PyArrow table and maintain the index.
  • Construct new rows/batches as small PyArrow tables/arrays and append via pa.concat_tables for add operations, dropping any existing entries for the same file first where needed
  • Implement _drop via a computed keep-mask over the file column, followed by index rebuild
  • Implement _remove and _update_media_version via per-column list updates and set_column replacement
  • Implement _update_media by materializing columns to Python lists, updating in-place using the file index, and reconstructing a new table from updated lists
audb/core/dependencies.py
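The mutation pattern described here, sketched under the same simplified-schema assumption (`_rebuild_index` as above); `_drop` mirrors the real method's role, while `_add_rows` is a stand-in for the various `_add_*` helpers and does not match their real signatures:

```python
import pyarrow as pa
import pyarrow.compute as pc


class MutatorSketch:
    def _drop(self, files):
        # Keep every row whose file is not in the given sequence
        keep = pc.invert(
            pc.is_in(self._table.column("file"), value_set=pa.array(list(files)))
        )
        self._table = self._table.filter(keep)
        self._rebuild_index()

    def _add_rows(self, rows):
        # rows: list of dicts matching the table schema
        new = pa.Table.from_pylist(rows, schema=self._table.schema)
        self._table = pa.concat_tables([self._table, new])
        self._rebuild_index()
```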
Adapt higher-level logic and tests to the new Dependencies representation and Lance behavior.
  • Update tests to construct Dependencies using _add_media instead of directly setting _df and to use the public `__call__` interface when asserting dtypes and string representations
  • Change load/save tests to parametrize over csv, parquet, and lance only, and drop backward-compatibility tests for pickle-based cache
  • Update load.job to derive flavor_files from deps() DataFrame instead of deps._df
  • Adjust _find_attachments in publish to compute removed attachments via attachments/attachment_ids properties instead of accessing the internal DataFrame index
  • Update publish tests and docs to expect Lance (or other updated) dependency filenames where necessary
  • Add pylance to project dependencies
tests/test_dependencies.py
audb/core/load.py
audb/core/publish.py
tests/test_publish.py
tests/test_publish_table.py
pyproject.toml
docs/publish.rst


hagenw commented Jan 2, 2026

The current implementation only uses pylance, not lancedb. It is slower than main, so we should also try lancedb. That will require a few more changes, as lancedb does not store the dependency table in a single file but in a folder (which we could zip for upload).
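A possible direction for the folder-based variant, sketched with the pylance dataset API (lancedb itself would go through its own connect/table API); this is not part of the PR, and the paths and columns are illustrative:

```python
import shutil

import lance
import pyarrow as pa

table = pa.table({"file": ["db.files.csv"], "checksum": ["abc123"]})

# lance.write_dataset creates a directory (data files + manifest), not a single file
lance.write_dataset(table, "deps.lance", mode="overwrite")

# Zip the directory so it can be uploaded as one artifact
shutil.make_archive("deps", "zip", root_dir="deps.lance")

# Reading it back
table = lance.dataset("deps.lance").to_table()
```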
