Conversation

@hagenw
Member

@hagenw hagenw commented Jan 2, 2026

Implements the 4th option discussed in #517

Benchmark results when only switching the storage format from Parquet to Arrow (saving and loading are faster, everything else stays the same):

| method | result |
| --- | --- |
| `Dependencies.save()` | 0.176 |
| `Dependencies.load()` | 0.127 |
| `Dependencies.__call__()` | 0.000 |
| `Dependencies.__contains__(10000 files)` | 0.005 |
| `Dependencies.__get_item__(10000 files)` | 0.836 |
| `Dependencies.__len__()` | 0.000 |
| `Dependencies.__str__()` | 0.025 |
| `Dependencies.archives` | 0.137 |
| `Dependencies.attachments` | 0.036 |
| `Dependencies.attachment_ids` | 0.030 |
| `Dependencies.files` | 0.011 |
| `Dependencies.media` | 0.079 |
| `Dependencies.removed_media` | 0.062 |
| `Dependencies.table_ids` | 0.082 |
| `Dependencies.tables` | 0.027 |
| `Dependencies.archive(10000 files)` | 0.063 |
| `Dependencies.bit_depth(10000 files)` | 0.046 |
| `Dependencies.channels(10000 files)` | 0.046 |
| `Dependencies.checksum(10000 files)` | 0.046 |
| `Dependencies.duration(10000 files)` | 0.044 |
| `Dependencies.format(10000 files)` | 0.046 |
| `Dependencies.removed(10000 files)` | 0.044 |
| `Dependencies.sampling_rate(10000 files)` | 0.051 |
| `Dependencies.type(10000 files)` | 0.045 |
| `Dependencies.version(10000 files)` | 0.046 |
| `Dependencies._add_attachment()` | 0.191 |
| `Dependencies._add_media(10000 files)` | 0.078 |
| `Dependencies._add_meta()` | 0.127 |
| `Dependencies._drop()` | 0.100 |
| `Dependencies._remove()` | 0.070 |
| `Dependencies._update_media()` | 0.119 |
| `Dependencies._update_media_version(10000 files)` | 0.021 |
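
For context, a minimal sketch of how a single row of this table could be measured; the actual benchmark script is not shown in this PR, the placeholders and the assumption that the values are wall-clock seconds are mine:

```python
# Minimal timing sketch (assumption: values above are wall-clock seconds).
# "<name>"/"<version>" are placeholders for any published database with
# at least 10000 media files.
import time

import audb

deps = audb.dependencies("<name>", version="<version>")
files = deps.files[:10000]

t0 = time.perf_counter()
for file in files:
    deps.checksum(file)
print(f"Dependencies.checksum(10000 files): {time.perf_counter() - t0:.3f}")
```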

Summary by Sourcery

Switch dependency table storage to Arrow IPC format with automatic format detection and maintain backward compatibility with Parquet and legacy CSV-based dependencies.

New Features:

  • Support loading and saving dependency tables in Apache Arrow IPC format with LZ4 compression.
  • Add automatic dependency file format detection and precedence when loading dependencies without an extension.
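
A minimal pyarrow sketch of what saving and loading a table as Arrow IPC with LZ4 compression looks like; the columns and file name are illustrative, not audb's actual implementation:

```python
# Hedged sketch: write and read an Arrow IPC (Feather V2) file with LZ4 compression.
import pyarrow as pa

table = pa.table(
    {
        "file": ["audio/001.wav"],  # illustrative columns only
        "archive": ["archive-0"],
        "bit_depth": [16],
    }
)

# Write with LZ4 frame compression
options = pa.ipc.IpcWriteOptions(compression="lz4")
with pa.OSFile("db.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema, options=options) as writer:
        writer.write_table(table)

# Read the whole file back into a single pyarrow Table
with pa.ipc.open_file("db.arrow") as reader:
    restored = reader.read_all()
```

The same file can also be written with `pyarrow.feather.write_feather(table, "db.arrow", compression="lz4")`, since Feather V2 is the Arrow IPC file format.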

Bug Fixes:

  • Fix attachment removal in publishing by correctly mapping attachment names to attachment IDs.

Enhancements:

  • Update dependency download, caching, and publish logic to prefer Arrow, then Parquet, then legacy CSV formats for compatibility across audb versions.
  • Simplify cache handling by removing pickle-based cached dependency files and unifying on Arrow/Parquet/CSV paths.

Documentation:

  • Document the new Arrow-based dependency storage format and clarify historical Parquet and CSV usage in the code documentation.

Tests:

  • Extend and adjust tests to cover Arrow IPC save/load, auto-detection and format precedence, error conditions, and publishing behavior with Arrow-based dependency files.

@sourcery-ai
Contributor

sourcery-ai bot commented Jan 2, 2026

Reviewer's Guide

Switches the Dependencies storage format to Apache Arrow IPC by default, adds auto-detection and precedence logic across Arrow/Parquet/CSV, updates cache and download/publish flows accordingly, and refreshes tests/documentation to match while preserving backward compatibility with legacy formats.

Sequence diagram for dependency file download and format fallback

```mermaid
sequenceDiagram
    participant Caller
    participant BackendInterface
    participant TempDir
    participant Dependencies

    Caller->>TempDir: create TemporaryDirectory
    Caller->>BackendInterface: join(/, name, DEPENDENCY_FILE) (db.arrow)
    Caller->>BackendInterface: exists(remote_deps_file, version)
    alt Arrow_file_exists
        Caller->>BackendInterface: get_file(remote_deps_file, local_deps_file, version, verbose)
    else Arrow_missing
        Caller->>BackendInterface: join(/, name, PARQUET_DEPENDENCY_FILE) (db.parquet)
        Caller->>BackendInterface: exists(remote_deps_file, version)
        alt Parquet_file_exists
            Caller->>BackendInterface: get_file(remote_deps_file, local_deps_file, version, verbose)
        else Parquet_missing
            Caller->>BackendInterface: join(/, name, DB + .zip)
            Caller->>BackendInterface: get_archive(remote_deps_file, tmp_root, version, verbose)
            note over Caller,BackendInterface: Legacy CSV in ZIP
        end
    end

    Caller->>Dependencies: __init__()
    Caller->>Dependencies: load(local_deps_file)
    Dependencies-->>Caller: deps instance
```
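
The same fallback expressed as a hedged Python sketch; it mirrors the calls shown in the diagram (db.arrow, then db.parquet, then legacy CSV in db.zip) and is not a copy of the audb implementation:

```python
# Illustrative paraphrase of the sequence diagram above; helper and variable
# names are assumptions, only the call order follows the diagram.
import os

from audb.core.dependencies import Dependencies


def download_dependencies(backend_interface, name, version, tmp_root, verbose=False):
    for remote_name in ("db.arrow", "db.parquet"):
        remote_deps_file = backend_interface.join("/", name, remote_name)
        if backend_interface.exists(remote_deps_file, version):
            local_deps_file = os.path.join(tmp_root, remote_name)
            backend_interface.get_file(remote_deps_file, local_deps_file, version, verbose=verbose)
            break
    else:
        # Legacy: CSV dependency table stored inside db.zip (audb < 1.7)
        remote_deps_file = backend_interface.join("/", name, "db.zip")
        backend_interface.get_archive(remote_deps_file, tmp_root, version, verbose=verbose)
        local_deps_file = os.path.join(tmp_root, "db.csv")

    deps = Dependencies()
    deps.load(local_deps_file)  # relies on the format auto-detection shown below
    return deps
```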

Flow diagram for Dependencies.load format auto-detection

```mermaid
flowchart TD
    A_start["Start Dependencies.load(path)"] --> B_init_df["Init empty DataFrame with DEPENDENCY_TABLE columns"]
    B_init_df --> C_norm_path["Normalize path with audeer.path"]
    C_norm_path --> D_get_ext["Get extension = audeer.file_extension(path)"]
    D_get_ext --> E_check_invalid_ext{"extension provided AND not in {arrow, parquet, csv}"}
    E_check_invalid_ext -- Yes --> F_raise_value_error["Raise ValueError: unsupported extension"]
    E_check_invalid_ext -- No --> G_exists_or_ext_empty{"file does not exist OR extension empty"}

    G_exists_or_ext_empty -- Yes --> H_base_path["base_path = os.path.splitext(path)[0]"]
    H_base_path --> I_try_arrow["Try base_path + .arrow"]
    I_try_arrow --> J_arrow_exists{".arrow exists?"}
    J_arrow_exists -- Yes --> K_set_arrow["Set path, extension = arrow"]
    J_arrow_exists -- No --> L_try_parquet["Try base_path + .parquet"]
    L_try_parquet --> M_parquet_exists{".parquet exists?"}
    M_parquet_exists -- Yes --> N_set_parquet["Set path, extension = parquet"]
    M_parquet_exists -- No --> O_try_csv["Try base_path + .csv"]
    O_try_csv --> P_csv_exists{".csv exists?"}
    P_csv_exists -- Yes --> Q_set_csv["Set path, extension = csv"]
    P_csv_exists -- No --> R_raise_not_found["Raise FileNotFoundError"]

    G_exists_or_ext_empty -- No --> S_keep_path["Keep original path and extension"]

    K_set_arrow --> T_select_loader
    N_set_parquet --> T_select_loader
    Q_set_csv --> T_select_loader
    S_keep_path --> T_select_loader["Select loader based on extension"]

    T_select_loader --> U_ext_arrow{"extension == arrow"}
    U_ext_arrow -- Yes --> V_load_arrow["Open IPC file, read_all to table, _table_to_dataframe"]
    U_ext_arrow -- No --> W_ext_parquet{"extension == parquet"}
    W_ext_parquet -- Yes --> X_load_parquet["parquet.read_table, _table_to_dataframe"]
    W_ext_parquet -- No --> Y_ext_csv{"extension == csv"}
    Y_ext_csv -- Yes --> Z_load_csv["csv.read_csv with options, _table_to_dataframe"]
    Y_ext_csv -- No --> AA_unreachable["Unreachable: extension validated earlier"]

    V_load_arrow --> AB_end["Return with populated _df"]
    X_load_parquet --> AB_end
    Z_load_csv --> AB_end
```
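
A hedged Python sketch of the path/extension auto-detection above; the helper name and error messages are illustrative, not audb's code:

```python
# Illustrative sketch of the Arrow > Parquet > CSV detection shown above.
import os

import audeer


def resolve_deps_path(path: str) -> tuple[str, str]:
    """Return the dependency file path and format to load."""
    path = audeer.path(path)
    extension = audeer.file_extension(path)
    if extension and extension not in ("arrow", "parquet", "csv"):
        raise ValueError(f"Unsupported file extension: '{extension}'")
    if not os.path.exists(path) or not extension:
        base_path = os.path.splitext(path)[0]
        for extension in ("arrow", "parquet", "csv"):  # precedence order
            candidate = f"{base_path}.{extension}"
            if os.path.exists(candidate):
                return candidate, extension
        raise FileNotFoundError(path)
    return path, extension
```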

File-Level Changes

Change Details Files
Change default dependency table format to Arrow IPC with LZ4 compression and introduce unified load/save logic with format auto-detection.
  • Extend Dependencies.load() to support Arrow IPC, keep Parquet/CSV via a shared _table_to_dataframe path, and auto-detect file format when the extension is missing or the file path does not exist.
  • Validate extensions strictly to only allow 'arrow', 'parquet', or 'csv', raising ValueError otherwise and FileNotFoundError when auto-detection fails.
  • Update Dependencies.save() to write Arrow IPC with LZ4 compression by default and to Parquet/CSV as requested, removing pickle support and enforcing the same extension validation.
audb/core/dependencies.py
tests/test_dependencies.py
Adjust dependency file naming, caching, and download precedence to align with Arrow as the primary format while remaining backward compatible with Parquet and legacy CSV-in-ZIP.
  • Redefine DEPENDENCY_FILE as db.arrow and introduce PARQUET_DEPENDENCY_FILE, documenting historical format usage and keeping LEGACY_DEPENDENCY_FILE for pre-1.7 CSV.
  • Update cached() and dependencies() to look for Arrow and Parquet (plus legacy CSV) in cache and to store/load only Arrow instead of pickle cache files.
  • Enhance download_dependencies() to try db.arrow, then db.parquet, then db.zip (containing CSV), and pass the chosen local file to Dependencies.load() relying on its auto-detection.
  • Update publish() to load dependencies from Arrow, then Parquet, then legacy CSV depending on which exists in the database root.
audb/core/define.py
audb/core/api.py
audb/core/dependencies.py
audb/core/publish.py
Fix attachment cleanup logic during publish to use attachment names rather than relying on DataFrame index access by archive.
  • Change computation of removed_attachments to iterate over deps.attachments and deps.attachment_ids together and select attachments whose IDs are no longer in db.attachments, instead of indexing _df by archive column.
audb/core/publish.py
Update tests and publishing expectations to reflect the new Arrow-based dependency format and new loading behavior.
  • Switch load/save parametrized tests from CSV/PKL/Parquet to CSV/Parquet/Arrow and remove the pickle backward-compatibility dtype test.
  • Add tests for error handling on invalid extension/missing file, Arrow compression behavior, auto-detection/precedence of Arrow vs Parquet vs CSV, and format precedence when multiple files exist.
  • Adjust publish-related tests to expect db.arrow instead of db.parquet as the dependency file in repositories.
tests/test_dependencies.py
tests/test_publish.py
tests/test_publish_table.py
Refresh documentation to describe the Arrow-based dependency storage format and updated legacy behavior.
  • Align docs text to describe CSV as "CSV" (capitalized) and to mention the Arrow IPC default and historical Parquet/CSV usage where relevant.
docs/publish.rst
audb/core/define.py
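
A short Python sketch of the attachment-cleanup fix listed under File-Level Changes above; variable and helper names are illustrative:

```python
# Illustrative version of the fixed removed-attachments computation:
# pair each attachment file with its attachment ID and keep the files
# whose ID is no longer part of the database header.
def removed_attachment_files(deps, db) -> list[str]:
    return [
        file
        for file, attachment_id in zip(deps.attachments, deps.attachment_ids)
        if attachment_id not in db.attachments
    ]
```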

Possibly linked issues

  • Increase speed of managing dependencies #517: The PR implements the issue’s proposed Arrow IPC format to speed up dependency loading while keeping backward compatibility.
  • #: The PR implements an Arrow-based storage and loading strategy, fulfilling the issue's request for a better dependency table format.


@hagenw
Member Author

hagenw commented Jan 2, 2026

When switching the internal representation from a pandas DataFrame to a pyarrow Table, we get:

| method | pr | main |
| --- | --- | --- |
| `Dependencies.save()` | 0.275 | 0.333 |
| `Dependencies.load()` | 0.568 | 0.231 |
| `Dependencies.__call__()` | 0.099 | 0.000 |
| `Dependencies.__contains__(10000 files)` | 0.002 | 0.005 |
| `Dependencies.__get_item__(10000 files)` | 0.121 | 0.825 |
| `Dependencies.__len__()` | 0.000 | 0.000 |
| `Dependencies.__str__()` | 0.022 | 0.022 |
| `Dependencies.archives` | 0.495 | 0.154 |
| `Dependencies.attachments` | 0.052 | 0.030 |
| `Dependencies.attachment_ids` | 0.052 | 0.029 |
| `Dependencies.files` | 0.414 | 0.011 |
| `Dependencies.media` | 0.357 | 0.083 |
| `Dependencies.removed_media` | 0.319 | 0.063 |
| `Dependencies.table_ids` | 0.103 | 0.080 |
| `Dependencies.tables` | 0.051 | 0.026 |
| `Dependencies.archive(10000 files)` | 0.016 | 0.063 |
| `Dependencies.bit_depth(10000 files)` | 0.016 | 0.046 |
| `Dependencies.channels(10000 files)` | 0.015 | 0.046 |
| `Dependencies.checksum(10000 files)` | 0.015 | 0.046 |
| `Dependencies.duration(10000 files)` | 0.014 | 0.046 |
| `Dependencies.format(10000 files)` | 0.015 | 0.046 |
| `Dependencies.removed(10000 files)` | 0.015 | 0.045 |
| `Dependencies.sampling_rate(10000 files)` | 0.015 | 0.045 |
| `Dependencies.type(10000 files)` | 0.015 | 0.045 |
| `Dependencies.version(10000 files)` | 0.015 | 0.047 |
| `Dependencies._add_attachment()` | 0.559 | 0.215 |
| `Dependencies._add_media(10000 files)` | 0.567 | 0.075 |
| `Dependencies._add_meta()` | 0.556 | 0.129 |
| `Dependencies._drop()` | 0.594 | 0.118 |
| `Dependencies._remove()` | 0.228 | 0.067 |
| `Dependencies._update_media()` | 0.608 | 0.130 |
| `Dependencies._update_media_version(10000 files)` | 0.452 | 0.021 |
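
For reference, a minimal sketch of the two internal representations compared above; the columns are examples only, not the real dependency table schema:

```python
# Hedged illustration: the same data held as a pandas DataFrame (current main)
# versus a pyarrow Table (this comparison).
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {"archive": ["a0"], "bit_depth": [16], "duration": [1.5]},
    index=pd.Index(["audio/001.wav"], name="file"),
)

# pandas -> pyarrow (index kept as a regular column)
table = pa.Table.from_pandas(df.reset_index(), preserve_index=False)

# pyarrow -> pandas
df_roundtrip = table.to_pandas().set_index("file")
```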

@hagenw
Copy link
Member Author

hagenw commented Jan 2, 2026

This approach is faster for certain methods like bit_depth(), but most of the others stay the same or are considerably slower. Another disadvantage is that we would switch from a well-supported format (Parquet) to Arrow IPC, which might introduce breaking changes more often.
