
Conversation


@hagenw hagenw commented Dec 30, 2025

Implements the sqlite suggestion from #517

It stores the dependency table in a sqlite file/database.

It shows improvements for certain lookups, but overall performance is not better:

| method | result |
| ------ | ------ |
| `Dependencies.save()` | 11.124 |
| `Dependencies.load()` | 1.450 |
| `Dependencies.__call__()` | 2.774 |
| `Dependencies.__contains__(10000 files)` | 0.014 |
| `Dependencies.__get_item__(10000 files)` | 0.036 |
| `Dependencies.__len__()` | 0.000 |
| `Dependencies.__str__()` | 2.767 |
| `Dependencies.archives` | 0.442 |
| `Dependencies.attachments` | 0.052 |
| `Dependencies.attachment_ids` | 0.051 |
| `Dependencies.files` | 0.301 |
| `Dependencies.media` | 0.315 |
| `Dependencies.removed_media` | 0.274 |
| `Dependencies.table_ids` | 0.101 |
| `Dependencies.tables` | 0.051 |
| `Dependencies.archive(10000 files)` | 0.019 |
| `Dependencies.bit_depth(10000 files)` | 0.019 |
| `Dependencies.channels(10000 files)` | 0.019 |
| `Dependencies.checksum(10000 files)` | 0.019 |
| `Dependencies.duration(10000 files)` | 0.019 |
| `Dependencies.format(10000 files)` | 0.020 |
| `Dependencies.removed(10000 files)` | 0.019 |
| `Dependencies.sampling_rate(10000 files)` | 0.019 |
| `Dependencies.type(10000 files)` | 0.019 |
| `Dependencies.version(10000 files)` | 0.019 |
| `Dependencies._add_attachment()` | 0.000 |
| `Dependencies._add_media(10000 files)` | 0.039 |
| `Dependencies._add_meta()` | 0.000 |
| `Dependencies._drop()` | 0.000 |
| `Dependencies._remove()` | 0.000 |
| `Dependencies._update_media()` | 0.073 |
| `Dependencies._update_media_version(10000 files)` | 0.016 |

Summary by Sourcery

Store and manage dependency tables using a SQLite-backed Dependencies implementation and update loading, caching, and publishing paths to use a .sqlite dependency file as the primary format.

New Features:

  • Add an in-memory SQLite database as the backing store for Dependencies, including schema, indexes, and serialization support.
  • Support reading and writing dependency tables in .sqlite format alongside existing CSV and Parquet files, and prefer SQLite when downloading from backends and using the cache.
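As a rough sketch of the in-memory backing store described above (column names follow the ER diagram in the reviewer's guide; the index shown here is an illustrative assumption, not necessarily the PR's actual index set):

```python
import sqlite3

# In-memory SQLite database as the backing store for the dependency table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dependencies (
        file TEXT PRIMARY KEY,
        archive TEXT,
        bit_depth INTEGER,
        channels INTEGER,
        checksum TEXT,
        duration REAL,
        format TEXT,
        removed INTEGER,
        sampling_rate INTEGER,
        type INTEGER,
        version TEXT
    )
    """
)
# Secondary index to speed up lookups by entry type (media/table/attachment).
conn.execute("CREATE INDEX idx_dependencies_type ON dependencies(type)")
conn.commit()
```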

Enhancements:

  • Refactor Dependencies operations (lookup, update, deletion, and metadata accessors) to use SQL queries instead of direct pandas DataFrame manipulation, while preserving the public API.
  • Adjust publish and load logic to avoid direct access to internal DataFrame state and rely on the Dependencies interface or full DataFrame view instead.
  • Remove pickle-based dependency caching and legacy backward-compatibility code tied to pickled dependency tables.

Documentation:

  • Update publishing-related documentation and tests to expect db.sqlite as the dependency file format instead of db.parquet where applicable.

Tests:

  • Adapt dependency and publish tests to populate and validate data through the SQLite-backed Dependencies implementation and cover the new .sqlite load/save behavior.

sourcery-ai bot commented Dec 30, 2025

Reviewer's Guide

Replaces the in-memory pandas DataFrame implementation of Dependencies with an in-memory SQLite database. Persistence, caching, and backend formats now use a db.sqlite file, while callers keep a DataFrame-based interface; tests and publish/load logic are updated accordingly.

Sequence diagram for dependencies cache loading with SQLite

```mermaid
sequenceDiagram
    participant Client
    participant API as api_dependencies
    participant FS as CacheFileSystem
    participant Deps as Dependencies
    participant Backend as BackendInterface

    Client->>API: dependencies(name, version, cache_root)
    API->>FS: resolve db_root
    API->>FS: compose deps_file = DB.sqlite
    API->>Deps: new Dependencies()
    API->>Deps: load(deps_file)
    alt load succeeds
        Deps-->>API: deps
        API-->>Client: deps
    else load raises Exception
        API->>Backend: lookup_backend(name, version)
        API->>Backend: download_dependencies(backend_interface, name, version)
        Backend-->>API: deps
        API->>Deps: deps.save(deps_file)
        API-->>Client: deps
    end
```

Sequence diagram for download_dependencies format fallback

```mermaid
sequenceDiagram
    participant Client
    participant API as api_download_dependencies
    participant Backend as BackendInterface
    participant FS as TempFileSystem
    participant Deps as Dependencies

    API->>Backend: join(/, name, DB.sqlite)
    API->>Backend: exists(DB.sqlite, version)
    alt DB.sqlite exists
        API->>Backend: get_file(DB.sqlite, local_db_sqlite, version)
        Backend-->>API: db.sqlite
        API->>Deps: new Dependencies()
        API->>Deps: load(local_db_sqlite)
    else DB.sqlite missing
        API->>Backend: join(/, name, DB.parquet)
        API->>Backend: exists(DB.parquet, version)
        alt DB.parquet exists
            API->>Backend: get_file(DB.parquet, local_db_parquet, version)
            Backend-->>API: db.parquet
            API->>Deps: new Dependencies()
            API->>Deps: load(local_db_parquet)
        else DB.parquet missing
            API->>Backend: join(/, name, DB.zip)
            API->>Backend: get_archive(DB.zip, tmp_root, version)
            Backend-->>API: db.csv in zip
            API->>Deps: new Dependencies()
            API->>Deps: load(legacy_csv_file)
        end
    end
    Deps-->>API: deps
    API-->>Client: deps
```
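The fallback order in the diagram above can be sketched as follows. The backend method names (`exists`, `get_file`, `get_archive`) mirror the diagram and are assumptions about the actual backend interface; the real implementation lives in audb/core/api.py.

```python
import os


def download_dependencies(backend, deps, name, version, tmp_root):
    """Sketch: prefer db.sqlite, fall back to db.parquet,
    then to the legacy db.csv shipped inside db.zip."""
    remote_sqlite = f"/{name}/db.sqlite"
    remote_parquet = f"/{name}/db.parquet"
    if backend.exists(remote_sqlite, version):
        local = os.path.join(tmp_root, "db.sqlite")
        backend.get_file(remote_sqlite, local, version)
        deps.load(local)
    elif backend.exists(remote_parquet, version):
        local = os.path.join(tmp_root, "db.parquet")
        backend.get_file(remote_parquet, local, version)
        deps.load(local)
    else:
        # Legacy path: db.zip contains db.csv
        backend.get_archive(f"/{name}/db.zip", tmp_root, version)
        deps.load(os.path.join(tmp_root, "db.csv"))
    return deps
```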

ER diagram for dependencies SQLite table

```mermaid
erDiagram
    DEPENDENCIES {
        string file PK
        string archive
        int bit_depth
        int channels
        string checksum
        float duration
        string format
        int removed
        int sampling_rate
        int type
        string version
    }
```

Class diagram for SQLite-backed Dependencies implementation

```mermaid
classDiagram
    class Dependencies {
        - sqlite3_Connection _conn
        - str _db_path
        - pa_Schema _schema
        + Dependencies()
        + DataFrame __call__()
        + bool __contains__(file)
        + bool __eq__(other)
        + list __getitem__(file)
        + int __len__()
        + str __str__()
        + __del__()
        + dict __getstate__()
        + __setstate__(state)
        + list~str~ archives
        + list~str~ attachments
        + list~str~ attachment_ids
        + list~str~ files
        + list~str~ media
        + list~str~ removed_media
        + list~str~ table_ids
        + list~str~ tables
        + str archive(file)
        + int bit_depth(file)
        + int channels(file)
        + str checksum(file)
        + float duration(file)
        + str format(file)
        + None load(path)
        + bool removed(file)
        + None save(path)
        + int sampling_rate(file)
        + int type(file)
        + str version(file)
        + None _add_attachment(file, archive, checksum, version)
        + None _add_media(values)
        + None _add_meta(file, checksum, version)
        + scalar _column_loc(column, file, dtype)
        + None _drop(files)
        + None _remove(file)
        + DataFrame _set_dtypes(df)
        + None _update_media(values)
        + None _update_media_version(files, version)
    }

    class sqlite3_Connection {
    }

    class pa_Schema {
    }

    class DataFrame {
    }

    Dependencies --> sqlite3_Connection : uses
    Dependencies --> pa_Schema : uses
    Dependencies --> DataFrame : returns
```
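The `__getstate__`/`__setstate__` pair in the class diagram is needed because `sqlite3` connections cannot be pickled; the state has to be the table rows themselves. A minimal sketch under that assumption (class name and schema simplified, not the PR's actual code):

```python
import sqlite3

SCHEMA = """CREATE TABLE dependencies (
    file TEXT PRIMARY KEY, archive TEXT, bit_depth INTEGER, channels INTEGER,
    checksum TEXT, duration REAL, format TEXT, removed INTEGER,
    sampling_rate INTEGER, type INTEGER, version TEXT)"""


class Deps:
    """Minimal picklable wrapper around an in-memory SQLite store."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(SCHEMA)

    def __getstate__(self):
        # The connection itself is not picklable; dump the rows instead.
        rows = self._conn.execute("SELECT * FROM dependencies").fetchall()
        return {"rows": rows}

    def __setstate__(self, state):
        # Rebuild the database and re-insert the serialized rows.
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(SCHEMA)
        self._conn.executemany(
            "INSERT INTO dependencies VALUES (?,?,?,?,?,?,?,?,?,?,?)",
            state["rows"],
        )
        self._conn.commit()
```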

File-Level Changes

Change Details Files
Replace Dependencies internal storage from pandas DataFrame to an in-memory SQLite database, while keeping a DataFrame-based public interface.
  • Initialize an in-memory SQLite connection and create the dependencies table plus indexes in `Dependencies.__init__()` instead of a pandas DataFrame.
  • Implement `__call__()` to materialize the SQLite table into a pandas DataFrame with proper dtypes and index configuration.
  • Update magic methods (`__contains__`, `__getitem__`, `__len__`, `__str__`, `__eq__`, `__del__`, `__getstate__`, `__setstate__`) to operate on SQLite and reconstruct state via SQL inserts.
  • Translate property accessors (`archives`, `attachments`, `attachment_ids`, `files`, `media`, `removed_media`, `table_ids`, `tables`) into SQL SELECT queries.
  • Refactor helper methods (`_column_loc`, `_drop`, `_remove`, `_update_media`, `_update_media_version`, `_add_attachment`, `_add_media`, `_add_meta`) to use SQL INSERT/UPDATE/DELETE and `executemany` with appropriate commits.
audb/core/dependencies.py
Change persistence format of dependencies from parquet/pickle to SQLite, with backwards compatibility for CSV/parquet and new download/cache logic.
  • Extend Dependencies.load() to support sqlite files, loading via ATTACH/COPY for sqlite and via pandas for csv/parquet then inserting into SQLite.
  • Extend Dependencies.save() to write csv/parquet via DataFrame materialization and add support for writing a standalone db.sqlite (schema + indexes + data copy via iterdump).
  • Update download_dependencies() to prefer db.sqlite from backend, then fall back to db.parquet, then legacy db.zip CSV.
  • Change define.DEPENDENCY_FILE to db.sqlite and introduce PARQUET_DEPENDENCY_FILE for older parquet-based dependencies.
  • Simplify cached() and dependencies() cache handling to use the main dependency file (db.sqlite) instead of a separate pickle cache file, and adjust error-path regeneration/caching accordingly.
audb/core/dependencies.py
audb/core/api.py
audb/core/define.py
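The standalone db.sqlite write via `iterdump` mentioned above can be sketched as follows; `save_sqlite` is a hypothetical helper name, and replaying the dump recreates schema, indexes, and data in one pass.

```python
import os
import sqlite3
import tempfile


def save_sqlite(conn, path):
    """Sketch: persist an in-memory SQLite database to a standalone file
    by replaying its SQL dump into a fresh on-disk database."""
    if os.path.exists(path):
        os.remove(path)
    dst = sqlite3.connect(path)
    # iterdump() yields CREATE TABLE/INDEX and INSERT statements.
    dst.executescript("\n".join(conn.iterdump()))
    dst.commit()
    dst.close()
```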
Adapt higher-level logic and tests to the new SQLite-backed Dependencies implementation and file naming.
  • Adjust the load job in audb.core.load to obtain flavor_files from the deps() DataFrame instead of direct _df access, consistent with the SQLite backend.
  • Update publish logic for attachments to avoid using internal _df and instead leverage attachments/attachment_ids plus _drop.
  • Change tests to construct Dependencies by inserting ROWS into the SQLite connection, to assert on deps() rather than _df, to drop pickle backward-compatibility tests, and to parameterize load/save over csv, parquet, and sqlite.
  • Update tests and docs referring to db.parquet or cached pickle to refer to db.sqlite where appropriate (e.g., publish and publish_table tests).
audb/core/load.py
audb/core/publish.py
tests/test_dependencies.py
tests/test_publish.py
tests/test_publish_table.py
docs/publish.rst

Possibly linked issues

  • Increase speed of managing dependencies #517: The PR implements the issue’s request for SQLite-based dependency storage, changing db.parquet to db.sqlite to improve lookup performance.
  • #0: The PR replaces db.csv/parquet with db.sqlite for dependencies, effectively changing the dependency filename and hiding its internals.


@hagenw changed the title from **Sqlite** to **Use sqlite to manage dependency table** on Dec 30, 2025
@sourcery-ai sourcery-ai bot left a comment


New security issues found

Comment on lines +268 to +283
```python
self._conn.execute(
    f"INSERT INTO dependencies {DEPENDENCIES} VALUES {VALUES}",
    (
        file,
        row["archive"],
        row["bit_depth"],
        row["channels"],
        row["checksum"],
        row["duration"],
        row["format"],
        row["removed"],
        row["sampling_rate"],
        row["type"],
        row["version"],
    ),
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.

Source: opengrep
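For context on this class of finding: the interpolated `DEPENDENCIES`/`VALUES` fragments appear to be module-level constants rather than user input, so the practical risk is limited as long as nothing user-controlled reaches the format string; all row values are already bound with `?` placeholders, which SQL identifiers cannot be. A minimal sketch of the safe pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependencies (file TEXT PRIMARY KEY, version TEXT)")

# Values are bound with ? placeholders, so user-supplied strings (even ones
# containing quotes) are never interpolated into the SQL text. Only
# identifiers may be formatted in, and only from trusted constants.
file, version = "data/f'.wav", "1.0.0"
conn.execute("INSERT INTO dependencies VALUES (?, ?)", (file, version))
conn.commit()
```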


```python
if extension == "sqlite":
    # For SQLite files, we can attach and copy the data
    self._conn.execute(f"ATTACH DATABASE '{path}' AS source_db")
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep
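For this particular case the interpolation is avoidable: the filename in `ATTACH DATABASE` is an SQL expression, so it can be bound as a parameter instead of being formatted into the string, which also avoids breakage when the path contains quotes. A sketch:

```python
import os
import sqlite3
import tempfile

# Prepare a source database file to attach.
path = os.path.join(tempfile.mkdtemp(), "source.sqlite")
src = sqlite3.connect(path)
src.execute("CREATE TABLE t (x INTEGER)")
src.execute("INSERT INTO t VALUES (1)")
src.commit()
src.close()

# The path is bound with ?, not interpolated into the SQL text.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ? AS source_db", (path,))
```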

Comment on lines +668 to +683
```python
self._conn.execute(
    f"INSERT OR REPLACE INTO dependencies {DEPENDENCIES} VALUES {VALUES}",
    (
        file,
        archive,
        0,
        0,
        checksum,
        0.0,
        format,
        0,
        0,
        define.DEPENDENCY_TYPE["attachment"],
        version,
    ),
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep

Comment on lines +737 to +752
```python
self._conn.execute(
    f"INSERT OR REPLACE INTO dependencies {DEPENDENCIES} VALUES {VALUES}",
    (
        file,
        archive,
        0,
        0,
        checksum,
        0.0,
        format,
        0,
        0,
        define.DEPENDENCY_TYPE["meta"],
        version,
    ),
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep

Comment on lines +772 to +774
```python
cursor = self._conn.execute(
    f"SELECT {column} FROM dependencies WHERE file = ?", (file,)
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — the column name is interpolated; ensure it only ever comes from a trusted constant, and keep values bound with ? placeholders.

Source: opengrep

Comment on lines +824 to +826
```python
self._conn.execute(
    f"DELETE FROM dependencies WHERE file IN ({placeholders})", files
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep
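This pattern is generally considered safe as long as only the `?` placeholders are generated dynamically: the IN(...) list length varies per call, but every actual value is still bound, never interpolated. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependencies (file TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO dependencies VALUES (?)", [("a",), ("b",), ("c",)])
conn.commit()

# Only the placeholder string is built dynamically; the file names
# themselves are passed as bound parameters.
files = ["a", "c"]
placeholders = ",".join("?" * len(files))
conn.execute(f"DELETE FROM dependencies WHERE file IN ({placeholders})", files)
conn.commit()
```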

Comment on lines +954 to +957
self._conn.execute(
f"UPDATE dependencies SET version = ? WHERE file IN ({placeholders})",
[version] + list(files),
)

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep
