Conversation

@hagenw
Member

@hagenw hagenw commented Jan 2, 2026

Implements the 4th option discussed in #517

Benchmark results when only switching the storage format from Parquet to Arrow (saving and loading are faster, everything else stays the same):

| method | result |
| --- | --- |
| `Dependencies.save()` | 0.176 |
| `Dependencies.load()` | 0.127 |
| `Dependencies.__call__()` | 0.000 |
| `Dependencies.__contains__(10000 files)` | 0.005 |
| `Dependencies.__get_item__(10000 files)` | 0.836 |
| `Dependencies.__len__()` | 0.000 |
| `Dependencies.__str__()` | 0.025 |
| `Dependencies.archives` | 0.137 |
| `Dependencies.attachments` | 0.036 |
| `Dependencies.attachment_ids` | 0.030 |
| `Dependencies.files` | 0.011 |
| `Dependencies.media` | 0.079 |
| `Dependencies.removed_media` | 0.062 |
| `Dependencies.table_ids` | 0.082 |
| `Dependencies.tables` | 0.027 |
| `Dependencies.archive(10000 files)` | 0.063 |
| `Dependencies.bit_depth(10000 files)` | 0.046 |
| `Dependencies.channels(10000 files)` | 0.046 |
| `Dependencies.checksum(10000 files)` | 0.046 |
| `Dependencies.duration(10000 files)` | 0.044 |
| `Dependencies.format(10000 files)` | 0.046 |
| `Dependencies.removed(10000 files)` | 0.044 |
| `Dependencies.sampling_rate(10000 files)` | 0.051 |
| `Dependencies.type(10000 files)` | 0.045 |
| `Dependencies.version(10000 files)` | 0.046 |
| `Dependencies._add_attachment()` | 0.191 |
| `Dependencies._add_media(10000 files)` | 0.078 |
| `Dependencies._add_meta()` | 0.127 |
| `Dependencies._drop()` | 0.100 |
| `Dependencies._remove()` | 0.070 |
| `Dependencies._update_media()` | 0.119 |
| `Dependencies._update_media_version(10000 files)` | 0.021 |
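
For context, a minimal sketch of how a single row of this table could be measured; the actual benchmark script is not shown in this PR, the placeholders and the assumption that the values are wall-clock seconds are mine:

```python
# Minimal timing sketch (assumption: values above are wall-clock seconds).
# "<name>"/"<version>" are placeholders for any published database with
# at least 10000 media files.
import time

import audb

deps = audb.dependencies("<name>", version="<version>")
files = deps.files[:10000]

t0 = time.perf_counter()
for file in files:
    deps.checksum(file)
print(f"Dependencies.checksum(10000 files): {time.perf_counter() - t0:.3f}")
```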

Summary by Sourcery

Switch dependency table storage to Arrow IPC format with automatic format detection and maintain backward compatibility with Parquet and legacy CSV-based dependencies.

New Features:

  • Support loading and saving dependency tables in Apache Arrow IPC format with LZ4 compression.
  • Add automatic dependency file format detection and precedence when loading dependencies without an extension.
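
A minimal pyarrow sketch of what saving and loading a table as Arrow IPC with LZ4 compression looks like; the columns and file name are illustrative, not audb's actual implementation:

```python
# Hedged sketch: write and read an Arrow IPC (Feather V2) file with LZ4 compression.
import pyarrow as pa

table = pa.table(
    {
        "file": ["audio/001.wav"],  # illustrative columns only
        "archive": ["archive-0"],
        "bit_depth": [16],
    }
)

# Write with LZ4 frame compression
options = pa.ipc.IpcWriteOptions(compression="lz4")
with pa.OSFile("db.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema, options=options) as writer:
        writer.write_table(table)

# Read the whole file back into a single pyarrow Table
with pa.ipc.open_file("db.arrow") as reader:
    restored = reader.read_all()
```

The same file can also be written with `pyarrow.feather.write_feather(table, "db.arrow", compression="lz4")`, since Feather V2 is the Arrow IPC file format.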

Bug Fixes:

  • Fix attachment removal in publishing by correctly mapping attachment names to attachment IDs.

Enhancements:

  • Update dependency download, caching, and publish logic to prefer Arrow, then Parquet, then legacy CSV formats for compatibility across audb versions.
  • Simplify cache handling by removing pickle-based cached dependency files and unifying on Arrow/Parquet/CSV paths.

Documentation:

  • Document the new Arrow-based dependency storage format and clarify historical Parquet and CSV usage in the code documentation.

Tests:

  • Extend and adjust tests to cover Arrow IPC save/load, auto-detection and format precedence, error conditions, and publishing behavior with Arrow-based dependency files.

@sourcery-ai
Contributor

sourcery-ai bot commented Jan 2, 2026

Reviewer's Guide

Switches the Dependencies storage format to Apache Arrow IPC by default, adds auto-detection and precedence logic across Arrow/Parquet/CSV, updates cache and download/publish flows accordingly, and refreshes tests/documentation to match while preserving backward compatibility with legacy formats.

Sequence diagram for dependency file download and format fallback

```mermaid
sequenceDiagram
    participant Caller
    participant BackendInterface
    participant TempDir
    participant Dependencies

    Caller->>TempDir: create TemporaryDirectory
    Caller->>BackendInterface: join(/, name, DEPENDENCY_FILE) (db.arrow)
    Caller->>BackendInterface: exists(remote_deps_file, version)
    alt Arrow_file_exists
        Caller->>BackendInterface: get_file(remote_deps_file, local_deps_file, version, verbose)
    else Arrow_missing
        Caller->>BackendInterface: join(/, name, PARQUET_DEPENDENCY_FILE) (db.parquet)
        Caller->>BackendInterface: exists(remote_deps_file, version)
        alt Parquet_file_exists
            Caller->>BackendInterface: get_file(remote_deps_file, local_deps_file, version, verbose)
        else Parquet_missing
            Caller->>BackendInterface: join(/, name, DB + .zip)
            Caller->>BackendInterface: get_archive(remote_deps_file, tmp_root, version, verbose)
            note over Caller,BackendInterface: Legacy CSV in ZIP
        end
    end

    Caller->>Dependencies: __init__()
    Caller->>Dependencies: load(local_deps_file)
    Dependencies-->>Caller: deps instance
```
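
The same fallback expressed as a hedged Python sketch; it mirrors the calls shown in the diagram (db.arrow, then db.parquet, then legacy CSV in db.zip) and is not a copy of the audb implementation:

```python
# Illustrative paraphrase of the sequence diagram above; helper and variable
# names are assumptions, only the call order follows the diagram.
import os

from audb.core.dependencies import Dependencies


def download_dependencies(backend_interface, name, version, tmp_root, verbose=False):
    for remote_name in ("db.arrow", "db.parquet"):
        remote_deps_file = backend_interface.join("/", name, remote_name)
        if backend_interface.exists(remote_deps_file, version):
            local_deps_file = os.path.join(tmp_root, remote_name)
            backend_interface.get_file(remote_deps_file, local_deps_file, version, verbose=verbose)
            break
    else:
        # Legacy: CSV dependency table stored inside db.zip (audb < 1.7)
        remote_deps_file = backend_interface.join("/", name, "db.zip")
        backend_interface.get_archive(remote_deps_file, tmp_root, version, verbose=verbose)
        local_deps_file = os.path.join(tmp_root, "db.csv")

    deps = Dependencies()
    deps.load(local_deps_file)  # relies on the format auto-detection shown below
    return deps
```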

Flow diagram for Dependencies.load format auto-detection

```mermaid
flowchart TD
    A_start["Start Dependencies.load(path)"] --> B_init_df["Init empty DataFrame with DEPENDENCY_TABLE columns"]
    B_init_df --> C_norm_path["Normalize path with audeer.path"]
    C_norm_path --> D_get_ext["Get extension = audeer.file_extension(path)"]
    D_get_ext --> E_check_invalid_ext{"extension provided AND not in {arrow, parquet, csv}"}
    E_check_invalid_ext -- Yes --> F_raise_value_error["Raise ValueError: unsupported extension"]
    E_check_invalid_ext -- No --> G_exists_or_ext_empty{"file does not exist OR extension empty"}

    G_exists_or_ext_empty -- Yes --> H_base_path["base_path = os.path.splitext(path)[0]"]
    H_base_path --> I_try_arrow["Try base_path + .arrow"]
    I_try_arrow --> J_arrow_exists{".arrow exists?"}
    J_arrow_exists -- Yes --> K_set_arrow["Set path, extension = arrow"]
    J_arrow_exists -- No --> L_try_parquet["Try base_path + .parquet"]
    L_try_parquet --> M_parquet_exists{".parquet exists?"}
    M_parquet_exists -- Yes --> N_set_parquet["Set path, extension = parquet"]
    M_parquet_exists -- No --> O_try_csv["Try base_path + .csv"]
    O_try_csv --> P_csv_exists{".csv exists?"}
    P_csv_exists -- Yes --> Q_set_csv["Set path, extension = csv"]
    P_csv_exists -- No --> R_raise_not_found["Raise FileNotFoundError"]

    G_exists_or_ext_empty -- No --> S_keep_path["Keep original path and extension"]

    K_set_arrow --> T_select_loader
    N_set_parquet --> T_select_loader
    Q_set_csv --> T_select_loader
    S_keep_path --> T_select_loader["Select loader based on extension"]

    T_select_loader --> U_ext_arrow{"extension == arrow"}
    U_ext_arrow -- Yes --> V_load_arrow["Open IPC file, read_all to table, _table_to_dataframe"]
    U_ext_arrow -- No --> W_ext_parquet{"extension == parquet"}
    W_ext_parquet -- Yes --> X_load_parquet["parquet.read_table, _table_to_dataframe"]
    W_ext_parquet -- No --> Y_ext_csv{"extension == csv"}
    Y_ext_csv -- Yes --> Z_load_csv["csv.read_csv with options, _table_to_dataframe"]
    Y_ext_csv -- No --> AA_unreachable["Unreachable: extension validated earlier"]

    V_load_arrow --> AB_end["Return with populated _df"]
    X_load_parquet --> AB_end
    Z_load_csv --> AB_end
```
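
A hedged Python sketch of the path/extension auto-detection above; the helper name and error messages are illustrative, not audb's code:

```python
# Illustrative sketch of the Arrow > Parquet > CSV detection shown above.
import os

import audeer


def resolve_deps_path(path: str) -> tuple[str, str]:
    """Return the dependency file path and format to load."""
    path = audeer.path(path)
    extension = audeer.file_extension(path)
    if extension and extension not in ("arrow", "parquet", "csv"):
        raise ValueError(f"Unsupported file extension: '{extension}'")
    if not os.path.exists(path) or not extension:
        base_path = os.path.splitext(path)[0]
        for extension in ("arrow", "parquet", "csv"):  # precedence order
            candidate = f"{base_path}.{extension}"
            if os.path.exists(candidate):
                return candidate, extension
        raise FileNotFoundError(path)
    return path, extension
```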

File-Level Changes

Change Details Files
Change default dependency table format to Arrow IPC with LZ4 compression and introduce unified load/save logic with format auto-detection.
  • Extend Dependencies.load() to support Arrow IPC, keep Parquet/CSV via a shared _table_to_dataframe path, and auto-detect file format when the extension is missing or the file path does not exist.
  • Validate extensions strictly to only allow 'arrow', 'parquet', or 'csv', raising ValueError otherwise and FileNotFoundError when auto-detection fails.
  • Update Dependencies.save() to write Arrow IPC with LZ4 compression by default and to Parquet/CSV as requested, removing pickle support and enforcing the same extension validation.
audb/core/dependencies.py
tests/test_dependencies.py
Adjust dependency file naming, caching, and download precedence to align with Arrow as the primary format while remaining backward compatible with Parquet and legacy CSV-in-ZIP.
  • Redefine DEPENDENCY_FILE as db.arrow and introduce PARQUET_DEPENDENCY_FILE, documenting historical format usage and keeping LEGACY_DEPENDENCY_FILE for pre-1.7 CSV.
  • Update cached() and dependencies() to look for Arrow and Parquet (plus legacy CSV) in cache and to store/load only Arrow instead of pickle cache files.
  • Enhance download_dependencies() to try db.arrow, then db.parquet, then db.zip (containing CSV), and pass the chosen local file to Dependencies.load() relying on its auto-detection.
  • Update publish() to load dependencies from Arrow, then Parquet, then legacy CSV depending on which exists in the database root.
audb/core/define.py
audb/core/api.py
audb/core/dependencies.py
audb/core/publish.py
Fix attachment cleanup logic during publish to use attachment names rather than relying on DataFrame index access by archive.
  • Change computation of removed_attachments to iterate over deps.attachments and deps.attachment_ids together and select attachments whose IDs are no longer in db.attachments, instead of indexing _df by archive column.
audb/core/publish.py
Update tests and publishing expectations to reflect the new Arrow-based dependency format and new loading behavior.
  • Switch load/save parametrized tests from CSV/PKL/Parquet to CSV/Parquet/Arrow and remove the pickle backward-compatibility dtype test.
  • Add tests for error handling on invalid extension/missing file, Arrow compression behavior, auto-detection/precedence of Arrow vs Parquet vs CSV, and format precedence when multiple files exist.
  • Adjust publish-related tests to expect db.arrow instead of db.parquet as the dependency file in repositories.
tests/test_dependencies.py
tests/test_publish.py
tests/test_publish_table.py
Refresh documentation to describe the Arrow-based dependency storage format and updated legacy behavior.
  • Align docs text to describe CSV as "CSV" (capitalized) and to mention the Arrow IPC default and historical Parquet/CSV usage where relevant.
docs/publish.rst
audb/core/define.py
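
A short Python sketch of the attachment-cleanup fix listed under File-Level Changes above; variable and helper names are illustrative:

```python
# Illustrative version of the fixed removed-attachments computation:
# pair each attachment file with its attachment ID and keep the files
# whose ID is no longer part of the database header.
def removed_attachment_files(deps, db) -> list[str]:
    return [
        file
        for file, attachment_id in zip(deps.attachments, deps.attachment_ids)
        if attachment_id not in db.attachments
    ]
```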

Possibly linked issues

  • Increase speed of managing dependencies #517: The PR implements the issue’s proposed Arrow IPC format to speed up dependency loading while keeping backward compatibility.
  • #: The PR implements an Arrow-based storage and loading strategy, fulfilling the issue's request for a better dependency table format.


@hagenw
Member Author

hagenw commented Jan 2, 2026

When switching the internal representation from a pandas DataFrame to a pyarrow Table, we get:

| method | pr | main |
| --- | --- | --- |
| `Dependencies.save()` | 0.275 | 0.333 |
| `Dependencies.load()` | 0.568 | 0.231 |
| `Dependencies.__call__()` | 0.099 | 0.000 |
| `Dependencies.__contains__(10000 files)` | 0.002 | 0.005 |
| `Dependencies.__get_item__(10000 files)` | 0.121 | 0.825 |
| `Dependencies.__len__()` | 0.000 | 0.000 |
| `Dependencies.__str__()` | 0.022 | 0.022 |
| `Dependencies.archives` | 0.495 | 0.154 |
| `Dependencies.attachments` | 0.052 | 0.030 |
| `Dependencies.attachment_ids` | 0.052 | 0.029 |
| `Dependencies.files` | 0.414 | 0.011 |
| `Dependencies.media` | 0.357 | 0.083 |
| `Dependencies.removed_media` | 0.319 | 0.063 |
| `Dependencies.table_ids` | 0.103 | 0.080 |
| `Dependencies.tables` | 0.051 | 0.026 |
| `Dependencies.archive(10000 files)` | 0.016 | 0.063 |
| `Dependencies.bit_depth(10000 files)` | 0.016 | 0.046 |
| `Dependencies.channels(10000 files)` | 0.015 | 0.046 |
| `Dependencies.checksum(10000 files)` | 0.015 | 0.046 |
| `Dependencies.duration(10000 files)` | 0.014 | 0.046 |
| `Dependencies.format(10000 files)` | 0.015 | 0.046 |
| `Dependencies.removed(10000 files)` | 0.015 | 0.045 |
| `Dependencies.sampling_rate(10000 files)` | 0.015 | 0.045 |
| `Dependencies.type(10000 files)` | 0.015 | 0.045 |
| `Dependencies.version(10000 files)` | 0.015 | 0.047 |
| `Dependencies._add_attachment()` | 0.559 | 0.215 |
| `Dependencies._add_media(10000 files)` | 0.567 | 0.075 |
| `Dependencies._add_meta()` | 0.556 | 0.129 |
| `Dependencies._drop()` | 0.594 | 0.118 |
| `Dependencies._remove()` | 0.228 | 0.067 |
| `Dependencies._update_media()` | 0.608 | 0.130 |
| `Dependencies._update_media_version(10000 files)` | 0.452 | 0.021 |
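
For reference, a minimal sketch of the two internal representations compared above; the columns are examples only, not the real dependency table schema:

```python
# Hedged illustration: the same data held as a pandas DataFrame (current main)
# versus a pyarrow Table (this comparison).
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {"archive": ["a0"], "bit_depth": [16], "duration": [1.5]},
    index=pd.Index(["audio/001.wav"], name="file"),
)

# pandas -> pyarrow (index kept as a regular column)
table = pa.Table.from_pandas(df.reset_index(), preserve_index=False)

# pyarrow -> pandas
df_roundtrip = table.to_pandas().set_index("file")
```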

@hagenw
Copy link
Member Author

hagenw commented Jan 2, 2026

This approach is faster for certain methods like bit_depth(), but most of the others stay the same or are considerably slower. Another disadvantage is that we would switch from a well-supported format (Parquet) to Arrow IPC, which might introduce breaking changes more often.
