Use duckdb to speed up dependency management #518

hagenw · 2025-08-04T14:42:22Z

Experiments with #517

This adds DuckDB to see how much faster it makes dependency management.

Summary by Sourcery

Integrate DuckDB into the Dependencies class to accelerate queries on Parquet files, while preserving pandas-based fallbacks and ensuring connection setup and teardown around data loads, saves, and modifications.

Enhancements:

Integrate DuckDB for fast querying of dependency data stored in Parquet with fallback to pandas
Extend dependency access methods to use DuckDB where available and invalidate the cache on data changes
Manage DuckDB connection lifecycle by setting up on load/save, closing on modifications and destruction

Build:

Add duckdb>=1.3.2 to project dependencies

sourcery-ai · 2025-08-04T14:42:28Z

Reviewer's Guide

This PR accelerates dependency operations by integrating DuckDB for querying parquet-backed data. It adds connection management utilities, hooks up DuckDB setup in load/save paths, wraps existing accessors to prefer DuckDB queries with pandas fallback, invalidates the cache on modifications, and updates project dependencies.

Sequence diagram for querying dependencies with DuckDB and pandas fallback

sequenceDiagram
    participant D as Dependencies
    participant DuckDB
    participant Pandas
    D->>DuckDB: Query for data (e.g., files, archives)
    alt DuckDB query succeeds
        DuckDB-->>D: Return result
    else DuckDB query fails or not available
        D->>Pandas: Query for data
        Pandas-->>D: Return result
    end
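
A minimal sketch of the "prefer the fast path, fall back on failure" flow in the sequence diagram above. It uses only the standard library, and the names `query_with_fallback`, `broken_duckdb_query`, and `pandas_query` are hypothetical stand-ins, not taken from the PR.

```python
from typing import Callable, Optional, Sequence


def query_with_fallback(
    fast_query: Optional[Callable[[], Sequence[str]]],
    slow_query: Callable[[], Sequence[str]],
) -> list[str]:
    """Return the fast query's result, or the slow query's if it is missing or fails."""
    if fast_query is not None:
        try:
            return list(fast_query())
        except Exception:
            # Any failure in the fast path falls through to the slow path.
            pass
    return list(slow_query())


# Hypothetical stand-in for a DuckDB query whose connection is unusable.
def broken_duckdb_query() -> Sequence[str]:
    raise RuntimeError("connection closed")


# Hypothetical stand-in for the pandas-based lookup.
def pandas_query() -> Sequence[str]:
    return ["a.wav", "b.wav"]


files = query_with_fallback(broken_duckdb_query, pandas_query)
no_backend = query_with_fallback(None, pandas_query)
```

Catching every exception keeps callers working but can hide real bugs in the fast path. A later commit in this PR is titled "Don't catch errors", which suggests the broad fallback may not have survived in this form.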

Class diagram for updated Dependencies class with DuckDB integration

classDiagram
    class Dependencies {
        - _df: pd.DataFrame
        - _duckdb_conn
        - _parquet_file
        + __call__() pd.DataFrame
        + __contains__(file: str) bool
        + __eq__(other: Dependencies) bool
        + archives() list[str]
        + attachments() list[str]
        + attachment_ids() list[str]
        + files() list[str]
        + media() list[str]
        + removed_media() list[str]
        + tables() list[str]
        + archive(file: str) str
        + bit_depth(file: str) int
        + load(path: str)
        + removed(file: str) bool
        + save(path: str)
        + type(file: str) int
        + _add_attachment(...)
        + _add_media(...)
        + _add_meta(...)
        + _column_loc(file: str, column: str, dtype=None)
        + _drop(files: Sequence[str])
        + _remove(file: str)
        + _update_media(...)
        + _update_media_version(...)
        + _setup_duckdb_connection(parquet_path: str)
        + _duckdb_query_files(condition: str=None) list[str]
        + _close_duckdb_connection()
        + __del__()
    }

File-Level Changes

Change	Details	Files
Introduce DuckDB connection management	Added _duckdb_conn and _parquet_file fields in the constructor; implemented _setup_duckdb_connection, _close_duckdb_connection, _duckdb_query_files, and __del__ methods; hooked setup into load and save to initialize DuckDB on Parquet files	`audb/core/dependencies.py`
Prefer DuckDB for dependency accessors with pandas fallback	Wrapped __contains__ and archive/file/media methods to execute SQL on DuckDB when available; extended _column_loc to attempt a DuckDB lookup before pandas; ensured fallback to pandas on any DuckDB error	`audb/core/dependencies.py`
Invalidate DuckDB cache on data modifications	Closed the DuckDB connection at the start of _add_attachment, _add_media, and _add_meta; added cache invalidation in _drop, _remove, _update_media, and _update_media_version	`audb/core/dependencies.py`
Add duckdb to project dependencies	Inserted "duckdb>=1.3.2" under dependencies in pyproject.toml	`pyproject.toml`
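
The invalidate-on-modification idea from the table can be sketched with a toy class. This is an illustration under assumptions, not the PR's code: `CachedStore` and its methods are hypothetical, the standard-library `sqlite3` module stands in for DuckDB, and the cache is rebuilt lazily on the next read, whereas the PR describes setting up the connection in load and save.

```python
import sqlite3
from typing import Optional


class CachedStore:
    """Toy store: a dict is the source of truth, a SQL connection is a read cache."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}  # maps file -> archive
        self._conn: Optional[sqlite3.Connection] = None

    def _setup_connection(self) -> None:
        # Rebuild the query cache from the current rows.
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE deps (file TEXT PRIMARY KEY, archive TEXT)")
        self._conn.executemany("INSERT INTO deps VALUES (?, ?)", self._rows.items())

    def _close_connection(self) -> None:
        # Invalidate the cache; the next read rebuilds it.
        if self._conn is not None:
            self._conn.close()
            self._conn = None

    def add(self, file: str, archive: str) -> None:
        self._close_connection()  # invalidate before modifying the data
        self._rows[file] = archive

    def archive(self, file: str) -> str:
        if self._conn is None:
            self._setup_connection()
        row = self._conn.execute(
            "SELECT archive FROM deps WHERE file = ?", [file]
        ).fetchone()
        if row is None:
            raise KeyError(file)
        return row[0]


store = CachedStore()
store.add("a.wav", "archive-1")
first = store.archive("a.wav")   # builds the cache
store.add("b.wav", "archive-2")  # closes the now stale connection
second = store.archive("b.wav")  # served from a rebuilt cache
```

Closing the connection before every write is simple and keeps reads consistent, at the cost of rebuilding the cache after each modification.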

Possibly linked issues

Increase speed of managing dependencies #517: The PR implements using DuckDB to improve data access and dependency management speed, a direct solution proposed in the issue.

hagenw · 2025-08-04T14:43:06Z

Locally, the tests get stuck at test_convert.py for me.

sourcery-ai

New security issues found

sourcery-ai · 2025-12-30T08:34:51Z

audb/core/dependencies.py

+                result = self._duckdb_conn.execute(
+                    f"SELECT COUNT(*) FROM '{self._parquet_file}' WHERE file = ?",
+                    [file],
+                ).fetchone()


security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoid SQL string concatenation: untrusted input concatenated with a raw SQL query can result in SQL injection. To execute a raw query safely, use a prepared statement. SQLAlchemy provides TextualSQL to easily use prepared statements with named parameters. For complex SQL composition, use the SQL Expression Language or Schema Definition Language. In most cases, the SQLAlchemy ORM will be a better option.

Source: opengrep
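
To illustrate what this rule guards against, here is a small sketch contrasting string interpolation with parameter binding. It uses the standard-library `sqlite3` module as a stand-in for DuckDB, and the `deps` table and its rows are made up for the example. Note that the flagged snippet already binds `file` through `?`; what it interpolates is the Parquet path. The rule ID also names SQLAlchemy, while the flagged call is made on a DuckDB connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deps (file TEXT, archive TEXT)")
conn.executemany(
    "INSERT INTO deps VALUES (?, ?)",
    [("a.wav", "archive-1"), ("b.wav", "archive-2")],
)

# A file name crafted to change the meaning of an interpolated query.
malicious = "x' OR '1'='1"

# Unsafe: the value becomes part of the SQL text, so the WHERE clause
# turns into "file = 'x' OR '1'='1'" and matches every row.
unsafe_count = conn.execute(
    f"SELECT COUNT(*) FROM deps WHERE file = '{malicious}'"
).fetchone()[0]

# Safe: the value is bound to the placeholder and only compared as data.
safe_count = conn.execute(
    "SELECT COUNT(*) FROM deps WHERE file = ?", [malicious]
).fetchone()[0]
```

Here `unsafe_count` is 2 (both rows match) and `safe_count` is 0 (no file has that literal name).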

sourcery-ai · 2025-12-30T08:34:51Z

audb/core/dependencies.py

+                result = self._duckdb_conn.execute(
+                    f"SELECT file FROM '{self._parquet_file}'"
+                ).fetchall()


security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoid SQL string concatenation: untrusted input concatenated with a raw SQL query can result in SQL injection. To execute a raw query safely, use a prepared statement. SQLAlchemy provides TextualSQL to easily use prepared statements with named parameters. For complex SQL composition, use the SQL Expression Language or Schema Definition Language. In most cases, the SQLAlchemy ORM will be a better option.

Source: opengrep

sourcery-ai · 2025-12-30T08:34:51Z

audb/core/dependencies.py

+                result = self._duckdb_conn.execute(
+                    f"SELECT archive FROM '{self._parquet_file}' WHERE file = ?", [file]
+                ).fetchone()


security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoid SQL string concatenation: untrusted input concatenated with a raw SQL query can result in SQL injection. To execute a raw query safely, use a prepared statement. SQLAlchemy provides TextualSQL to easily use prepared statements with named parameters. For complex SQL composition, use the SQL Expression Language or Schema Definition Language. In most cases, the SQLAlchemy ORM will be a better option.

Source: opengrep

sourcery-ai · 2025-12-30T08:34:51Z

audb/core/dependencies.py

+                result = self._duckdb_conn.execute(
+                    f"SELECT {column} FROM '{self._parquet_file}' WHERE file = ?",
+                    [file],
+                ).fetchone()


security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoid SQL string concatenation: untrusted input concatenated with a raw SQL query can result in SQL injection. To execute a raw query safely, use a prepared statement. SQLAlchemy provides TextualSQL to easily use prepared statements with named parameters. For complex SQL composition, use the SQL Expression Language or Schema Definition Language. In most cases, the SQLAlchemy ORM will be a better option.

Source: opengrep
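
Column names cannot be bound with `?` placeholders, so an interpolated `{column}` is usually guarded by checking it against a fixed set of known names. The sketch below shows one way to do that. The allow-list is an assumption derived from the accessor names in the class diagram, not from the actual Parquet schema, and `select_column_query` and the `deps` table name are hypothetical.

```python
# Assumed column names, inferred from the accessor names in the class diagram.
ALLOWED_COLUMNS = frozenset({"archive", "bit_depth", "removed", "type"})


def select_column_query(column: str) -> str:
    """Build a single-column lookup query, rejecting unknown column names."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"Unknown column: {column!r}")
    # The column is now one of a few fixed identifiers; the file value
    # is still passed separately as a bound parameter.
    return f'SELECT "{column}" FROM deps WHERE file = ?'


query = select_column_query("archive")
```

If `column` only ever comes from internal callers, such a check mainly documents the invariant and lets a static analyzer or reviewer verify it quickly.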

sourcery-ai · 2025-12-30T08:34:51Z

audb/core/dependencies.py

+            self._duckdb_conn.execute(
+                f"SELECT COUNT(*) FROM '{parquet_path}'"
+            ).fetchone()


security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoid SQL string concatenation: untrusted input concatenated with a raw SQL query can result in SQL injection. To execute a raw query safely, use a prepared statement. SQLAlchemy provides TextualSQL to easily use prepared statements with named parameters. For complex SQL composition, use the SQL Expression Language or Schema Definition Language. In most cases, the SQLAlchemy ORM will be a better option.

Source: opengrep
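
When a path has to appear inside the SQL text as a quoted string, a path containing a single quote would end the literal early. One common mitigation is to double embedded single quotes, which is the standard SQL escape. The helper name `sql_string_literal` and the example path are hypothetical, and `sqlite3` is used only to check that the literal round-trips; DuckDB is assumed to follow the same quoting rule.

```python
import sqlite3


def sql_string_literal(value: str) -> str:
    """Quote value as a SQL string literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


# Hypothetical path with a single quote in a directory name.
path = "/data/it's/deps.parquet"
query = f"SELECT COUNT(*) FROM {sql_string_literal(path)}"

# Round-trip check: the database parses the literal back to the original string.
round_trip = sqlite3.connect(":memory:").execute(
    f"SELECT {sql_string_literal(path)}"
).fetchone()[0]
```

After this runs, `query` contains `'/data/it''s/deps.parquet'` and `round_trip` equals `path`.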

sourcery-ai · 2025-12-30T08:34:51Z

audb/core/dependencies.py

+        query = f"SELECT file FROM '{self._parquet_file}'"
+        if condition:
+            query += f" WHERE {condition}"
+        result = self._duckdb_conn.execute(query).fetchall()


security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoid SQL string concatenation: untrusted input concatenated with a raw SQL query can result in SQL injection. To execute a raw query safely, use a prepared statement. SQLAlchemy provides TextualSQL to easily use prepared statements with named parameters. For complex SQL composition, use the SQL Expression Language or Schema Definition Language. In most cases, the SQLAlchemy ORM will be a better option.

Source: opengrep
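
A free-form `condition` string is the hardest of these cases to make safe, because the caller supplies arbitrary SQL. One alternative is to accept structured filters and build the WHERE clause from checked column names and bound values. The following is a sketch under assumptions: `build_files_query`, the `FILTER_COLUMNS` allow-list, and the `deps` table name are all hypothetical.

```python
from typing import Any, Mapping, Optional

# Assumed filterable columns; not taken from the actual schema.
FILTER_COLUMNS = frozenset({"archive", "removed", "type"})


def build_files_query(
    filters: Optional[Mapping[str, Any]] = None,
) -> tuple[str, list[Any]]:
    """Return a query and its parameters for selecting files by equality filters."""
    query = "SELECT file FROM deps"
    clauses: list[str] = []
    params: list[Any] = []
    for column, value in (filters or {}).items():
        if column not in FILTER_COLUMNS:
            raise ValueError(f"Unknown column: {column!r}")
        clauses.append(f'"{column}" = ?')  # the value stays out of the SQL text
        params.append(value)
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    return query, params


query, params = build_files_query({"removed": 0, "type": 1})
```

The resulting pair can be passed to an `execute(query, params)` call, so values never become part of the SQL text.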

sourcery-ai

New security issues found


hagenw added 3 commits December 30, 2025 09:34

Use duckdb to speed up dependency management

7c25782

Try to fix potential errors

a495329

Simplify code

4cd0093

hagenw force-pushed the duckdb branch from 5dc1995 to 4cd0093 · December 30, 2025 08:34

sourcery-ai bot requested changes Dec 30, 2025


hagenw added 2 commits December 30, 2025 10:25

Don't catch errors

e0a0801

Store deps file in db_root

c039124

sourcery-ai bot requested changes Dec 30, 2025


hagenw added 4 commits December 30, 2025 11:46

Update api.py

5719cd7

Update dependencies.py

5be8bca

Update load_to.py

acf3f70

Update dependency benchmark script (#530)

6f4e9b7

* Update dependency benchmark script * Better formatting of __ * Don't create lists

hagenw force-pushed the duckdb branch from a09ac35 to 6f4e9b7 · December 30, 2025 14:55
