Store dependency table as arrow #532
Conversation
Reviewer's Guide

Switches the Dependencies storage format to Apache Arrow IPC by default, adds auto-detection and precedence logic across Arrow/Parquet/CSV, updates cache and download/publish flows accordingly, and refreshes tests/documentation to match, while preserving backward compatibility with legacy formats.

Sequence diagram for dependency file download and format fallback

```mermaid
sequenceDiagram
    participant Caller
    participant BackendInterface
    participant TempDir
    participant Dependencies
    Caller->>TempDir: create TemporaryDirectory
    Caller->>BackendInterface: join(/, name, DEPENDENCY_FILE) (db.arrow)
    Caller->>BackendInterface: exists(remote_deps_file, version)
    alt Arrow file exists
        Caller->>BackendInterface: get_file(remote_deps_file, local_deps_file, version, verbose)
    else Arrow missing
        Caller->>BackendInterface: join(/, name, PARQUET_DEPENDENCY_FILE) (db.parquet)
        Caller->>BackendInterface: exists(remote_deps_file, version)
        alt Parquet file exists
            Caller->>BackendInterface: get_file(remote_deps_file, local_deps_file, version, verbose)
        else Parquet missing
            Caller->>BackendInterface: join(/, name, DB + .zip)
            Caller->>BackendInterface: get_archive(remote_deps_file, tmp_root, version, verbose)
            note over Caller,BackendInterface: Legacy CSV in ZIP
        end
    end
    Caller->>Dependencies: __init__()
    Caller->>Dependencies: load(local_deps_file)
    Dependencies-->>Caller: deps instance
```
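The diagram condenses to roughly the following fallback logic. This is a minimal sketch, assuming a hypothetical `backend` object exposing the `join`/`exists`/`get_file`/`get_archive` methods named in the diagram; `download_dependencies` and the literal file names are illustrative, not audb's exact API:

```python
import os
import tempfile

from audb import Dependencies


def download_dependencies(backend, name, version, verbose=False):
    with tempfile.TemporaryDirectory() as tmp_root:
        # Prefer the new Arrow IPC file (DEPENDENCY_FILE = "db.arrow") ...
        remote = backend.join("/", name, "db.arrow")
        local = os.path.join(tmp_root, "db.arrow")
        if backend.exists(remote, version):
            backend.get_file(remote, local, version, verbose=verbose)
        else:
            # ... fall back to Parquet (PARQUET_DEPENDENCY_FILE = "db.parquet") ...
            remote = backend.join("/", name, "db.parquet")
            local = os.path.join(tmp_root, "db.parquet")
            if backend.exists(remote, version):
                backend.get_file(remote, local, version, verbose=verbose)
            else:
                # ... and finally to the legacy CSV shipped inside a ZIP archive.
                remote = backend.join("/", name, "db.zip")
                backend.get_archive(remote, tmp_root, version, verbose=verbose)
                local = os.path.join(tmp_root, "db.csv")
        deps = Dependencies()
        deps.load(local)
        return deps
```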
Flow diagram for Dependencies.load format auto-detection

```mermaid
flowchart TD
    A_start["Start Dependencies.load(path)"] --> B_init_df["Init empty DataFrame with DEPENDENCY_TABLE columns"]
    B_init_df --> C_norm_path["Normalize path with audeer.path"]
    C_norm_path --> D_get_ext["Get extension = audeer.file_extension(path)"]
    D_get_ext --> E_check_invalid_ext{"extension provided AND not in {arrow, parquet, csv}"}
    E_check_invalid_ext -- Yes --> F_raise_value_error["Raise ValueError: unsupported extension"]
    E_check_invalid_ext -- No --> G_exists_or_ext_empty{"file does not exist OR extension empty"}
    G_exists_or_ext_empty -- Yes --> H_base_path["base_path = os.path.splitext(path)[0]"]
    H_base_path --> I_try_arrow["Try base_path + .arrow"]
    I_try_arrow --> J_arrow_exists{".arrow exists?"}
    J_arrow_exists -- Yes --> K_set_arrow["Set path, extension = arrow"]
    J_arrow_exists -- No --> L_try_parquet["Try base_path + .parquet"]
    L_try_parquet --> M_parquet_exists{".parquet exists?"}
    M_parquet_exists -- Yes --> N_set_parquet["Set path, extension = parquet"]
    M_parquet_exists -- No --> O_try_csv["Try base_path + .csv"]
    O_try_csv --> P_csv_exists{".csv exists?"}
    P_csv_exists -- Yes --> Q_set_csv["Set path, extension = csv"]
    P_csv_exists -- No --> R_raise_not_found["Raise FileNotFoundError"]
    G_exists_or_ext_empty -- No --> S_keep_path["Keep original path and extension"]
    K_set_arrow --> T_select_loader
    N_set_parquet --> T_select_loader
    Q_set_csv --> T_select_loader
    S_keep_path --> T_select_loader["Select loader based on extension"]
    T_select_loader --> U_ext_arrow{"extension == arrow"}
    U_ext_arrow -- Yes --> V_load_arrow["Open IPC file, read_all to table, _table_to_dataframe"]
    U_ext_arrow -- No --> W_ext_parquet{"extension == parquet"}
    W_ext_parquet -- Yes --> X_load_parquet["parquet.read_table, _table_to_dataframe"]
    W_ext_parquet -- No --> Y_ext_csv{"extension == csv"}
    Y_ext_csv -- Yes --> Z_load_csv["csv.read_csv with options, _table_to_dataframe"]
    Y_ext_csv -- No --> AA_unreachable["Unreachable: extension validated earlier"]
    V_load_arrow --> AB_end["Return with populated _df"]
    X_load_parquet --> AB_end
    Z_load_csv --> AB_end
```
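In code, the detection and loader-selection steps could look like the sketch below. `resolve_dependency_file` and `load_table` are illustrative helper names, not the actual implementation; only `audeer.path`, `audeer.file_extension`, and the pyarrow readers from the diagram are assumed:

```python
import os

import audeer
import pyarrow as pa
import pyarrow.csv
import pyarrow.parquet


def resolve_dependency_file(path: str) -> tuple[str, str]:
    path = audeer.path(path)
    extension = audeer.file_extension(path)
    if extension and extension not in ("arrow", "parquet", "csv"):
        raise ValueError(f"Unsupported file extension: '{extension}'")
    if not os.path.exists(path) or not extension:
        # Probe formats in order of precedence: Arrow > Parquet > CSV.
        base_path = os.path.splitext(path)[0]
        for extension in ("arrow", "parquet", "csv"):
            candidate = f"{base_path}.{extension}"
            if os.path.exists(candidate):
                return candidate, extension
        raise FileNotFoundError(path)
    return path, extension


def load_table(path: str, extension: str) -> pa.Table:
    if extension == "arrow":
        # Arrow IPC file; memory mapping avoids copying the buffers on read.
        with pa.memory_map(path) as source:
            return pa.ipc.open_file(source).read_all()
    elif extension == "parquet":
        return pa.parquet.read_table(path)
    else:  # "csv"
        return pa.csv.read_csv(path)
```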
When switching the internal representation from a pandas DataFrame to a pyarrow table, we are getting …

This approach is faster for certain methods like …
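Since all three loaders in the flow diagram funnel through `_table_to_dataframe`, the pyarrow-to-pandas conversion could be sketched as below. This is an assumption, not the exact implementation from this PR; in particular the dtype mapping and the index column are guesses:

```python
import pandas as pd
import pyarrow as pa


def _table_to_dataframe(table: pa.Table) -> pd.DataFrame:
    # types_mapper=pd.ArrowDtype keeps pyarrow-backed dtypes in pandas,
    # avoiding an eager copy into NumPy arrays.
    df = table.to_pandas(types_mapper=pd.ArrowDtype)
    # Assumption: the first column holds the file names and serves as index.
    df.set_index(df.columns[0], inplace=True)
    return df
```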
Implements the 4th option discussed in #517
Benchmark results when only switching the storage format from Parquet to Arrow (saving and loading are faster; everything else stays the same):
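For readers who want to reproduce the comparison, an illustrative micro-benchmark of the save/load round-trip could look like the following. Synthetic data and hypothetical file names; these are not the numbers measured in this PR:

```python
import time

import pyarrow as pa
import pyarrow.parquet


# Synthetic stand-in for a dependency table with one row per media file.
table = pa.table(
    {
        "file": [f"wav/{i:06d}.wav" for i in range(1_000_000)],
        "duration": [1.0] * 1_000_000,
    }
)

start = time.perf_counter()
pa.parquet.write_table(table, "deps.parquet")
pa.parquet.read_table("deps.parquet")
print(f"parquet round-trip: {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
# Arrow IPC file format (a.k.a. Feather V2).
with pa.ipc.new_file("deps.arrow", table.schema) as writer:
    writer.write_table(table)
with pa.memory_map("deps.arrow") as source:
    pa.ipc.open_file(source).read_all()
print(f"arrow round-trip:   {time.perf_counter() - start:.3f} s")
```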
Summary by Sourcery
Switch dependency table storage to Arrow IPC format with automatic format detection and maintain backward compatibility with Parquet and legacy CSV-based dependencies.