
Conversation


@hagenw hagenw commented Dec 30, 2025

Implements the sqlite suggestion from #517

It stores the dependency table in a sqlite file/database.

It shows improvements for certain lookups, but overall performance is not better:

| method | result |
| ------ | ------ |
| `Dependencies.save()` | 11.124 |
| `Dependencies.load()` | 1.450 |
| `Dependencies.__call__()` | 2.774 |
| `Dependencies.__contains__(10000 files)` | 0.014 |
| `Dependencies.__get_item__(10000 files)` | 0.036 |
| `Dependencies.__len__()` | 0.000 |
| `Dependencies.__str__()` | 2.767 |
| `Dependencies.archives` | 0.442 |
| `Dependencies.attachments` | 0.052 |
| `Dependencies.attachment_ids` | 0.051 |
| `Dependencies.files` | 0.301 |
| `Dependencies.media` | 0.315 |
| `Dependencies.removed_media` | 0.274 |
| `Dependencies.table_ids` | 0.101 |
| `Dependencies.tables` | 0.051 |
| `Dependencies.archive(10000 files)` | 0.019 |
| `Dependencies.bit_depth(10000 files)` | 0.019 |
| `Dependencies.channels(10000 files)` | 0.019 |
| `Dependencies.checksum(10000 files)` | 0.019 |
| `Dependencies.duration(10000 files)` | 0.019 |
| `Dependencies.format(10000 files)` | 0.020 |
| `Dependencies.removed(10000 files)` | 0.019 |
| `Dependencies.sampling_rate(10000 files)` | 0.019 |
| `Dependencies.type(10000 files)` | 0.019 |
| `Dependencies.version(10000 files)` | 0.019 |
| `Dependencies._add_attachment()` | 0.000 |
| `Dependencies._add_media(10000 files)` | 0.039 |
| `Dependencies._add_meta()` | 0.000 |
| `Dependencies._drop()` | 0.000 |
| `Dependencies._remove()` | 0.000 |
| `Dependencies._update_media()` | 0.073 |
| `Dependencies._update_media_version(10000 files)` | 0.016 |

Summary by Sourcery

Store and manage dependency tables using a SQLite-backed Dependencies implementation and update loading, caching, and publishing paths to use a .sqlite dependency file as the primary format.

New Features:

  • Add an in-memory SQLite database as the backing store for Dependencies, including schema, indexes, and serialization support.
  • Support reading and writing dependency tables in .sqlite format alongside existing CSV and Parquet files, and prefer SQLite when downloading from backends and using the cache.
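As a rough sketch of the in-memory backing store described above (column names follow the ER diagram in the reviewer's guide; the index shown here is an illustrative assumption, not necessarily the PR's actual index set):

```python
import sqlite3

# In-memory SQLite database as the backing store for the dependency table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dependencies (
        file TEXT PRIMARY KEY,
        archive TEXT,
        bit_depth INTEGER,
        channels INTEGER,
        checksum TEXT,
        duration REAL,
        format TEXT,
        removed INTEGER,
        sampling_rate INTEGER,
        type INTEGER,
        version TEXT
    )
    """
)
# Secondary index to speed up lookups by entry type (media/table/attachment).
conn.execute("CREATE INDEX idx_dependencies_type ON dependencies(type)")
conn.commit()
```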

Enhancements:

  • Refactor Dependencies operations (lookup, update, deletion, and metadata accessors) to use SQL queries instead of direct pandas DataFrame manipulation, while preserving the public API.
  • Adjust publish and load logic to avoid direct access to internal DataFrame state and rely on the Dependencies interface or full DataFrame view instead.
  • Remove pickle-based dependency caching and legacy backward-compatibility code tied to pickled dependency tables.

Documentation:

  • Update publishing-related documentation and tests to expect db.sqlite as the dependency file format instead of db.parquet where applicable.

Tests:

  • Adapt dependency and publish tests to populate and validate data through the SQLite-backed Dependencies implementation and cover the new .sqlite load/save behavior.

sourcery-ai bot commented Dec 30, 2025

Reviewer's Guide

Replaces the in-memory pandas DataFrame implementation of Dependencies with an in-memory SQLite database. Persistence, caching, and backend formats now use a db.sqlite file, while callers keep a DataFrame-based interface; tests and publish/load logic are updated accordingly.

Sequence diagram for dependencies cache loading with SQLite

```mermaid
sequenceDiagram
    participant Client
    participant API as api_dependencies
    participant FS as CacheFileSystem
    participant Deps as Dependencies
    participant Backend as BackendInterface

    Client->>API: dependencies(name, version, cache_root)
    API->>FS: resolve db_root
    API->>FS: compose deps_file = DB.sqlite
    API->>Deps: new Dependencies()
    API->>Deps: load(deps_file)
    alt load succeeds
        Deps-->>API: deps
        API-->>Client: deps
    else load raises Exception
        API->>Backend: lookup_backend(name, version)
        API->>Backend: download_dependencies(backend_interface, name, version)
        Backend-->>API: deps
        API->>Deps: deps.save(deps_file)
        API-->>Client: deps
    end
```

Sequence diagram for download_dependencies format fallback

```mermaid
sequenceDiagram
    participant Client
    participant API as api_download_dependencies
    participant Backend as BackendInterface
    participant FS as TempFileSystem
    participant Deps as Dependencies

    API->>Backend: join(/, name, DB.sqlite)
    API->>Backend: exists(DB.sqlite, version)
    alt DB.sqlite exists
        API->>Backend: get_file(DB.sqlite, local_db_sqlite, version)
        Backend-->>API: db.sqlite
        API->>Deps: new Dependencies()
        API->>Deps: load(local_db_sqlite)
    else DB.sqlite missing
        API->>Backend: join(/, name, DB.parquet)
        API->>Backend: exists(DB.parquet, version)
        alt DB.parquet exists
            API->>Backend: get_file(DB.parquet, local_db_parquet, version)
            Backend-->>API: db.parquet
            API->>Deps: new Dependencies()
            API->>Deps: load(local_db_parquet)
        else DB.parquet missing
            API->>Backend: join(/, name, DB.zip)
            API->>Backend: get_archive(DB.zip, tmp_root, version)
            Backend-->>API: db.csv in zip
            API->>Deps: new Dependencies()
            API->>Deps: load(legacy_csv_file)
        end
    end
    Deps-->>API: deps
    API-->>Client: deps
```
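The fallback order in the diagram above can be sketched as follows. The backend method names (`exists`, `get_file`, `get_archive`) mirror the diagram and are assumptions about the actual backend interface; the real implementation lives in audb/core/api.py.

```python
import os


def download_dependencies(backend, deps, name, version, tmp_root):
    """Sketch: prefer db.sqlite, fall back to db.parquet,
    then to the legacy db.csv shipped inside db.zip."""
    remote_sqlite = f"/{name}/db.sqlite"
    remote_parquet = f"/{name}/db.parquet"
    if backend.exists(remote_sqlite, version):
        local = os.path.join(tmp_root, "db.sqlite")
        backend.get_file(remote_sqlite, local, version)
        deps.load(local)
    elif backend.exists(remote_parquet, version):
        local = os.path.join(tmp_root, "db.parquet")
        backend.get_file(remote_parquet, local, version)
        deps.load(local)
    else:
        # Legacy path: db.zip contains db.csv
        backend.get_archive(f"/{name}/db.zip", tmp_root, version)
        deps.load(os.path.join(tmp_root, "db.csv"))
    return deps
```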

ER diagram for dependencies SQLite table

```mermaid
erDiagram
    DEPENDENCIES {
        string file PK
        string archive
        int bit_depth
        int channels
        string checksum
        float duration
        string format
        int removed
        int sampling_rate
        int type
        string version
    }
```

Class diagram for SQLite-backed Dependencies implementation

```mermaid
classDiagram
    class Dependencies {
        - sqlite3_Connection _conn
        - str _db_path
        - pa_Schema _schema
        + Dependencies()
        + DataFrame __call__()
        + bool __contains__(file)
        + bool __eq__(other)
        + list __getitem__(file)
        + int __len__()
        + str __str__()
        + __del__()
        + dict __getstate__()
        + __setstate__(state)
        + list~str~ archives
        + list~str~ attachments
        + list~str~ attachment_ids
        + list~str~ files
        + list~str~ media
        + list~str~ removed_media
        + list~str~ table_ids
        + list~str~ tables
        + str archive(file)
        + int bit_depth(file)
        + int channels(file)
        + str checksum(file)
        + float duration(file)
        + str format(file)
        + None load(path)
        + bool removed(file)
        + None save(path)
        + int sampling_rate(file)
        + int type(file)
        + str version(file)
        + None _add_attachment(file, archive, checksum, version)
        + None _add_media(values)
        + None _add_meta(file, checksum, version)
        + scalar _column_loc(column, file, dtype)
        + None _drop(files)
        + None _remove(file)
        + DataFrame _set_dtypes(df)
        + None _update_media(values)
        + None _update_media_version(files, version)
    }

    class sqlite3_Connection {
    }

    class pa_Schema {
    }

    class DataFrame {
    }

    Dependencies --> sqlite3_Connection : uses
    Dependencies --> pa_Schema : uses
    Dependencies --> DataFrame : returns
```
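The `__getstate__`/`__setstate__` pair in the class diagram is needed because `sqlite3` connections cannot be pickled; the state has to be the table rows themselves. A minimal sketch under that assumption (class name and schema simplified, not the PR's actual code):

```python
import sqlite3

SCHEMA = """CREATE TABLE dependencies (
    file TEXT PRIMARY KEY, archive TEXT, bit_depth INTEGER, channels INTEGER,
    checksum TEXT, duration REAL, format TEXT, removed INTEGER,
    sampling_rate INTEGER, type INTEGER, version TEXT)"""


class Deps:
    """Minimal picklable wrapper around an in-memory SQLite store."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(SCHEMA)

    def __getstate__(self):
        # The connection itself is not picklable; dump the rows instead.
        rows = self._conn.execute("SELECT * FROM dependencies").fetchall()
        return {"rows": rows}

    def __setstate__(self, state):
        # Rebuild the database and re-insert the serialized rows.
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(SCHEMA)
        self._conn.executemany(
            "INSERT INTO dependencies VALUES (?,?,?,?,?,?,?,?,?,?,?)",
            state["rows"],
        )
        self._conn.commit()
```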

File-Level Changes

Change Details Files
Replace Dependencies internal storage from pandas DataFrame to an in-memory SQLite database, while keeping a DataFrame-based public interface.
  • Initialize an in-memory SQLite connection and create the dependencies table plus indexes in `Dependencies.__init__()` instead of a pandas DataFrame.
  • Implement `__call__()` to materialize the SQLite table into a pandas DataFrame with proper dtypes and index configuration.
  • Update magic methods (`__contains__`, `__getitem__`, `__len__`, `__str__`, `__eq__`, `__del__`, `__getstate__`, `__setstate__`) to operate on SQLite and reconstruct state via SQL inserts.
  • Translate property accessors (`archives`, `attachments`, `attachment_ids`, `files`, `media`, `removed_media`, `table_ids`, `tables`) into SQL SELECT queries.
  • Refactor helper methods (`_column_loc`, `_drop`, `_remove`, `_update_media`, `_update_media_version`, `_add_attachment`, `_add_media`, `_add_meta`) to use SQL INSERT/UPDATE/DELETE and `executemany` with appropriate commits.
audb/core/dependencies.py
Change persistence format of dependencies from parquet/pickle to SQLite, with backwards compatibility for CSV/parquet and new download/cache logic.
  • Extend Dependencies.load() to support sqlite files, loading via ATTACH/COPY for sqlite and via pandas for csv/parquet then inserting into SQLite.
  • Extend Dependencies.save() to write csv/parquet via DataFrame materialization and add support for writing a standalone db.sqlite (schema + indexes + data copy via iterdump).
  • Update download_dependencies() to prefer db.sqlite from backend, then fall back to db.parquet, then legacy db.zip CSV.
  • Change define.DEPENDENCY_FILE to db.sqlite and introduce PARQUET_DEPENDENCY_FILE for older parquet-based dependencies.
  • Simplify cached() and dependencies() cache handling to use the main dependency file (db.sqlite) instead of a separate pickle cache file, and adjust error-path regeneration/caching accordingly.
audb/core/dependencies.py
audb/core/api.py
audb/core/define.py
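The standalone db.sqlite write via `iterdump` mentioned above can be sketched as follows; `save_sqlite` is a hypothetical helper name, and replaying the dump recreates schema, indexes, and data in one pass.

```python
import os
import sqlite3
import tempfile


def save_sqlite(conn, path):
    """Sketch: persist an in-memory SQLite database to a standalone file
    by replaying its SQL dump into a fresh on-disk database."""
    if os.path.exists(path):
        os.remove(path)
    dst = sqlite3.connect(path)
    # iterdump() yields CREATE TABLE/INDEX and INSERT statements.
    dst.executescript("\n".join(conn.iterdump()))
    dst.commit()
    dst.close()
```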
Adapt higher-level logic and tests to the new SQLite-backed Dependencies implementation and file naming.
  • Adjust the load job in audb.core.load to obtain flavor_files from the deps() DataFrame instead of direct _df access, consistent with the SQLite backend.
  • Update publish logic for attachments to avoid using internal _df and instead leverage attachments/attachment_ids plus _drop.
  • Change tests to construct Dependencies by inserting ROWS into the SQLite connection, to assert on deps() rather than _df, to drop pickle backward-compatibility tests, and to parameterize load/save over csv, parquet, and sqlite.
  • Update tests and docs referring to db.parquet or cached pickle to refer to db.sqlite where appropriate (e.g., publish and publish_table tests).
audb/core/load.py
audb/core/publish.py
tests/test_dependencies.py
tests/test_publish.py
tests/test_publish_table.py
docs/publish.rst

Possibly linked issues

  • Increase speed of managing dependencies #517: The PR implements the issue’s request for SQLite-based dependency storage, changing db.parquet to db.sqlite to improve lookup performance.
  • #0: The PR replaces db.csv/parquet with db.sqlite for dependencies, effectively changing the dependency filename and hiding its internals.


@hagenw changed the title from **Sqlite** to **Use sqlite to manage dependency table** on Dec 30, 2025
@sourcery-ai sourcery-ai bot left a comment


New security issues found

Comment on lines +268 to +283
```python
self._conn.execute(
    f"INSERT INTO dependencies {DEPENDENCIES} VALUES {VALUES}",
    (
        file,
        row["archive"],
        row["bit_depth"],
        row["channels"],
        row["checksum"],
        row["duration"],
        row["format"],
        row["removed"],
        row["sampling_rate"],
        row["type"],
        row["version"],
    ),
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.

Source: opengrep
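For context on this class of finding: the interpolated `DEPENDENCIES`/`VALUES` fragments appear to be module-level constants rather than user input, so the practical risk is limited as long as nothing user-controlled reaches the format string; all row values are already bound with `?` placeholders, which SQL identifiers cannot be. A minimal sketch of the safe pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependencies (file TEXT PRIMARY KEY, version TEXT)")

# Values are bound with ? placeholders, so user-supplied strings (even ones
# containing quotes) are never interpolated into the SQL text. Only
# identifiers may be formatted in, and only from trusted constants.
file, version = "data/f'.wav", "1.0.0"
conn.execute("INSERT INTO dependencies VALUES (?, ?)", (file, version))
conn.commit()
```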


```python
if extension == "sqlite":
    # For SQLite files, we can attach and copy the data
    self._conn.execute(f"ATTACH DATABASE '{path}' AS source_db")
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep
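For this particular case the interpolation is avoidable: the filename in `ATTACH DATABASE` is an SQL expression, so it can be bound as a parameter instead of being formatted into the string, which also avoids breakage when the path contains quotes. A sketch:

```python
import os
import sqlite3
import tempfile

# Prepare a source database file to attach.
path = os.path.join(tempfile.mkdtemp(), "source.sqlite")
src = sqlite3.connect(path)
src.execute("CREATE TABLE t (x INTEGER)")
src.execute("INSERT INTO t VALUES (1)")
src.commit()
src.close()

# The path is bound with ?, not interpolated into the SQL text.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ? AS source_db", (path,))
```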

Comment on lines +668 to +683
```python
self._conn.execute(
    f"INSERT OR REPLACE INTO dependencies {DEPENDENCIES} VALUES {VALUES}",
    (
        file,
        archive,
        0,
        0,
        checksum,
        0.0,
        format,
        0,
        0,
        define.DEPENDENCY_TYPE["attachment"],
        version,
    ),
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep

Comment on lines +737 to +752
```python
self._conn.execute(
    f"INSERT OR REPLACE INTO dependencies {DEPENDENCIES} VALUES {VALUES}",
    (
        file,
        archive,
        0,
        0,
        checksum,
        0.0,
        format,
        0,
        0,
        define.DEPENDENCY_TYPE["meta"],
        version,
    ),
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep

Comment on lines +772 to +774
```python
cursor = self._conn.execute(
    f"SELECT {column} FROM dependencies WHERE file = ?", (file,)
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — the column name is interpolated; ensure it only ever comes from a trusted constant, and keep values bound with ? placeholders.

Source: opengrep

Comment on lines +824 to +826
```python
self._conn.execute(
    f"DELETE FROM dependencies WHERE file IN ({placeholders})", files
)
```

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep
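This pattern is generally considered safe as long as only the `?` placeholders are generated dynamically: the IN(...) list length varies per call, but every actual value is still bound, never interpolated. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependencies (file TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO dependencies VALUES (?)", [("a",), ("b",), ("c",)])
conn.commit()

# Only the placeholder string is built dynamically; the file names
# themselves are passed as bound parameters.
files = ["a", "c"]
placeholders = ",".join("?" * len(files))
conn.execute(f"DELETE FROM dependencies WHERE file IN ({placeholders})", files)
conn.commit()
```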

Comment on lines +954 to +957
self._conn.execute(
f"UPDATE dependencies SET version = ? WHERE file IN ({placeholders})",
[version] + list(files),
)

security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): same finding as above — raw SQL assembled via string interpolation; use bound parameters / prepared statements.

Source: opengrep
