Use duckdb to speed up dependency management #518
Reviewer's Guide
This PR accelerates dependency operations by integrating DuckDB for querying parquet-backed data. It adds connection management utilities, hooks up DuckDB setup in load/save paths, wraps existing accessors to prefer DuckDB queries with pandas fallback, invalidates the cache on modifications, and updates project dependencies.
Sequence diagram for querying dependencies with DuckDB and pandas fallback
sequenceDiagram
participant D as Dependencies
participant DuckDB
participant Pandas
D->>DuckDB: Query for data (e.g., files, archives)
alt DuckDB query succeeds
DuckDB-->>D: Return result
else DuckDB query fails or not available
D->>Pandas: Query for data
Pandas-->>D: Return result
end
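The "prefer the SQL engine, fall back to the in-memory data" flow in the diagram above can be sketched roughly as follows. This is only an illustration under loud assumptions: `sqlite3` from the standard library stands in for DuckDB, a list of dicts stands in for the pandas DataFrame, and the names `make_connection`, `media_files`, the table `deps`, and the type code `1` for media are all hypothetical rather than taken from the PR.

```python
import sqlite3

# Hypothetical stand-in for the dependency table (not real audb data).
ROWS = [
    {"file": "db.files.csv", "type": 0},
    {"file": "audio/001.wav", "type": 1},
    {"file": "audio/002.wav", "type": 1},
]


def make_connection(rows):
    """Build an in-memory SQL table (stand-in for the parquet-backed engine)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE deps (file TEXT PRIMARY KEY, type INTEGER)")
    conn.executemany("INSERT INTO deps VALUES (:file, :type)", rows)
    return conn


def media_files(conn, rows):
    """Return media files, preferring SQL and falling back to plain Python."""
    if conn is not None:
        try:
            result = conn.execute(
                "SELECT file FROM deps WHERE type = ? ORDER BY file", (1,)
            ).fetchall()
            return [row[0] for row in result]
        except sqlite3.Error:
            # Engine unavailable or query failed: use the in-memory path.
            pass
    return sorted(row["file"] for row in rows if row["type"] == 1)
```

Both paths have to return identical results for the fallback to be safe, which is why the sketch sorts in both branches.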
Class diagram for updated Dependencies class with DuckDB integration
classDiagram
class Dependencies {
- _df: pd.DataFrame
- _duckdb_conn
- _parquet_file
+ __call__() pd.DataFrame
+ __contains__(file: str) bool
+ __eq__(other: Dependencies) bool
+ archives() list[str]
+ attachments() list[str]
+ attachment_ids() list[str]
+ files() list[str]
+ media() list[str]
+ removed_media() list[str]
+ tables() list[str]
+ archive(file: str) str
+ bit_depth(file: str) int
+ load(path: str)
+ removed(file: str) bool
+ save(path: str)
+ type(file: str) int
+ _add_attachment(...)
+ _add_media(...)
+ _add_meta(...)
+ _column_loc(file: str, column: str, dtype=None)
+ _drop(files: Sequence[str])
+ _remove(file: str)
+ _update_media(...)
+ _update_media_version(...)
+ _setup_duckdb_connection(parquet_path: str)
+ _duckdb_query_files(condition: str=None) list[str]
+ _close_duckdb_connection()
+ __del__()
}
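The connection lifecycle implied by the class diagram (set up a connection, close it, invalidate it when the data changes, clean up in `__del__`) might look something like the sketch below. It is not the PR's implementation: `sqlite3` is used as a stand-in for DuckDB, a dict stands in for the DataFrame, and the class `CachedLookup` with its methods is hypothetical.

```python
import sqlite3


class CachedLookup:
    """Hypothetical miniature of the connection lifecycle in the diagram."""

    def __init__(self, rows):
        self._rows = dict(rows)  # stand-in for the pandas DataFrame
        self._conn = None  # stand-in for _duckdb_conn

    def _setup_connection(self):
        """(Re)build the SQL snapshot from the current rows."""
        self._close_connection()
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE deps (file TEXT PRIMARY KEY, archive TEXT)"
        )
        self._conn.executemany(
            "INSERT INTO deps VALUES (?, ?)", self._rows.items()
        )

    def _close_connection(self):
        """Close the connection if one is open; safe to call repeatedly."""
        conn = getattr(self, "_conn", None)
        if conn is not None:
            conn.close()
            self._conn = None

    def archive(self, file):
        """Look up the archive, preferring SQL when a snapshot exists."""
        if self._conn is not None:
            row = self._conn.execute(
                "SELECT archive FROM deps WHERE file = ?", (file,)
            ).fetchone()
            if row is not None:
                return row[0]
        return self._rows[file]

    def drop(self, files):
        """Remove files and invalidate the now stale SQL snapshot."""
        for file in files:
            self._rows.pop(file, None)
        self._close_connection()

    def __del__(self):
        self._close_connection()
```

Closing the connection on every modification is the simplest way to keep the two views consistent; the cost is that queries fall back to the slower path until the snapshot is rebuilt.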
File-Level Changes
Possibly linked issues
For me locally the tests get stuck at |
New security issues found
audb/core/dependencies.py
Outdated
result = self._duckdb_conn.execute(
    f"SELECT COUNT(*) FROM '{self._parquet_file}' WHERE file = ?",
    [file],
).fetchone()
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
audb/core/dependencies.py
Outdated
result = self._duckdb_conn.execute(
    f"SELECT file FROM '{self._parquet_file}'"
).fetchall()
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
result = self._duckdb_conn.execute(
    f"SELECT archive FROM '{self._parquet_file}' WHERE file = ?", [file]
).fetchone()
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
result = self._duckdb_conn.execute(
    f"SELECT {column} FROM '{self._parquet_file}' WHERE file = ?",
    [file],
).fetchone()
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
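For the snippet that interpolates `{column}` into the query, placeholders cannot help, because SQL parameters bind values and not identifiers. One common mitigation is to check the column name against a fixed allow-list before formatting it into the statement. The sketch below is an assumption-laden illustration only: it uses `sqlite3` as a stand-in for DuckDB, a table named `deps` instead of the parquet path, and a hypothetical helper `column_value` with a hypothetical `ALLOWED_COLUMNS` set.

```python
import sqlite3

# Hypothetical allow-list; the real set of columns is defined by the project.
ALLOWED_COLUMNS = frozenset({"archive", "bit_depth", "type"})


def column_value(conn, file, column):
    """Return one column value for a file, or None if the file is unknown."""
    if column not in ALLOWED_COLUMNS:
        # Reject anything that is not a known identifier before it can
        # reach the SQL string.
        raise ValueError(f"unknown column: {column!r}")
    row = conn.execute(
        f'SELECT "{column}" FROM deps WHERE file = ?',  # value stays a parameter
        (file,),
    ).fetchone()
    return None if row is None else row[0]
```

The file name still travels as a bound parameter, so only the vetted identifier is ever formatted into the statement.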
audb/core/dependencies.py
Outdated
self._duckdb_conn.execute(
    f"SELECT COUNT(*) FROM '{parquet_path}'"
).fetchone()
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
query = f"SELECT file FROM '{self._parquet_file}'"
if condition:
    query += f" WHERE {condition}"
result = self._duckdb_conn.execute(query).fetchall()
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
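One possible way to avoid appending a raw `condition` string is to accept a structured filter and build the `WHERE` clause from allow-listed column names plus bound values. The following is a minimal sketch under stated assumptions, not the PR's code: `sqlite3` stands in for DuckDB, `deps` stands in for the parquet file, and `query_files`, `FILTERABLE`, and the `filters` argument are hypothetical names.

```python
import sqlite3

# Hypothetical set of columns that callers may filter on.
FILTERABLE = frozenset({"type", "removed"})


def query_files(conn, filters=None):
    """Return file names matching all column == value pairs in filters."""
    query = "SELECT file FROM deps"
    params = []
    if filters:
        clauses = []
        for column, value in filters.items():
            if column not in FILTERABLE:
                raise ValueError(f"cannot filter on column: {column!r}")
            clauses.append(f'"{column}" = ?')  # identifier vetted, value bound
            params.append(value)
        query += " WHERE " + " AND ".join(clauses)
    query += " ORDER BY file"
    return [row[0] for row in conn.execute(query, params).fetchall()]
```

A call such as `query_files(conn, {"type": 1, "removed": 0})` would then replace a hand-written condition string. This only covers equality filters; richer conditions would need a correspondingly richer, still structured, interface.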
New security issues found
* Update dependency benchmark script
* Better formatting of __
* Don't create lists
Experiments with #517
This adds duckdb to see how much faster it is.
Summary by Sourcery
Integrate DuckDB into the Dependencies class to accelerate queries on Parquet files, while preserving pandas-based fallbacks and ensuring connection setup and teardown around data loads, saves, and modifications.
Enhancements:
Build: