Use sqlite to manage dependency table #531
Reviewer's Guide
Replaces the in-memory pandas DataFrame implementation of Dependencies with an in-memory SQLite database. Persistence, caching, and backend formats now use a db.sqlite file, while callers keep a DataFrame interface; tests and publish/load logic are updated accordingly.
Sequence diagram for dependencies cache loading with SQLite
sequenceDiagram
participant Client
participant API as api_dependencies
participant FS as CacheFileSystem
participant Deps as Dependencies
participant Backend as BackendInterface
Client->>API: dependencies(name, version, cache_root)
API->>FS: resolve db_root
API->>FS: compose deps_file = DB.sqlite
API->>Deps: new Dependencies()
API->>Deps: load(deps_file)
alt load succeeds
Deps-->>API: deps
API-->>Client: deps
else load raises Exception
API->>Backend: lookup_backend(name, version)
API->>Backend: download_dependencies(backend_interface, name, version)
Backend-->>API: deps
API->>Deps: deps.save(deps_file)
API-->>Client: deps
end
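The cache path in the diagram above (try the local file first, fall back to the backend, then write the result back to the cache) can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `load_table`, `save_table`, `cached_dependencies`, and the `download` callable are hypothetical names, and the table is reduced to two columns.

```python
import os
import sqlite3
import tempfile


def load_table(path):
    """Read (file, version) rows from a SQLite dependency file (hypothetical helper)."""
    if not os.path.exists(path):
        # sqlite3.connect() would silently create an empty file, so check first
        raise FileNotFoundError(path)
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT file, version FROM dependencies").fetchall()
    finally:
        conn.close()


def save_table(rows, path):
    """Write (file, version) rows to a SQLite dependency file (hypothetical helper)."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dependencies "
            "(file TEXT PRIMARY KEY, version TEXT)"
        )
        conn.executemany("INSERT OR REPLACE INTO dependencies VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()


def cached_dependencies(deps_file, download):
    """Return rows from the cache, calling `download()` only if loading fails."""
    try:
        return load_table(deps_file)
    except Exception:
        # Mirrors the "load raises Exception" branch of the diagram
        rows = download()
        save_table(rows, deps_file)
        return rows
```

A second call with the same `deps_file` is then served from the cached file without invoking `download` again.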
Sequence diagram for download_dependencies format fallback
sequenceDiagram
participant API as api_download_dependencies
participant Backend as BackendInterface
participant FS as TempFileSystem
participant Deps as Dependencies
API->>Backend: join(/, name, DB.sqlite)
API->>Backend: exists(DB.sqlite, version)
alt DB.sqlite exists
API->>Backend: get_file(DB.sqlite, local_db_sqlite, version)
Backend-->>API: db.sqlite
API->>Deps: new Dependencies()
API->>Deps: load(local_db_sqlite)
else DB.sqlite missing
API->>Backend: join(/, name, DB.parquet)
API->>Backend: exists(DB.parquet, version)
alt DB.parquet exists
API->>Backend: get_file(DB.parquet, local_db_parquet, version)
Backend-->>API: db.parquet
API->>Deps: new Dependencies()
API->>Deps: load(local_db_parquet)
else DB.parquet missing
API->>Backend: join(/, name, DB.zip)
API->>Backend: get_archive(DB.zip, tmp_root, version)
Backend-->>API: db.csv in zip
API->>Deps: new Dependencies()
API->>Deps: load(legacy_csv_file)
end
end
Deps-->>API: deps
API-->>Client: deps
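The format fallback shown above boils down to a fixed preference order. A minimal sketch, assuming a hypothetical `exists` callable that stands in for the backend's existence check (the function name `pick_dependency_file` is also made up for illustration):

```python
def pick_dependency_file(exists):
    """Return (remote_name, local_format) for the preferred dependency file.

    `exists` is a hypothetical callable that reports whether a remote file
    with the given name is available on the backend.
    """
    for name, fmt in (("db.sqlite", "sqlite"), ("db.parquet", "parquet")):
        if exists(name):
            return name, fmt
    # Legacy fallback: a zip archive containing db.csv.
    # As in the diagram, no existence check is made for this last option.
    return "db.zip", "csv"
```

Keeping the order in one tuple makes it easy to see that SQLite is preferred, Parquet comes second, and the zipped CSV is only used when neither newer file is present.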
ER diagram for dependencies SQLite table
erDiagram
DEPENDENCIES {
string file PK
string archive
int bit_depth
int channels
string checksum
float duration
string format
int removed
int sampling_rate
int type
string version
}
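For orientation, the ER diagram above could translate into a table definition along these lines. The mapping of `string`, `int`, and `float` to `TEXT`, `INTEGER`, and `REAL` is an assumption on my part; the PR's actual DDL (and any indexes it creates) may differ.

```python
import sqlite3

# Assumed type mapping: string -> TEXT, int -> INTEGER, float -> REAL
SCHEMA = """
CREATE TABLE dependencies (
    file TEXT PRIMARY KEY,
    archive TEXT,
    bit_depth INTEGER,
    channels INTEGER,
    checksum TEXT,
    duration REAL,
    format TEXT,
    removed INTEGER,
    sampling_rate INTEGER,
    type INTEGER,
    version TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# PRAGMA table_info yields one row per column; index 1 holds the column name
columns = [row[1] for row in conn.execute("PRAGMA table_info(dependencies)")]
```

With `file` as the primary key, inserting the same file twice with a plain `INSERT` raises `sqlite3.IntegrityError`, which is presumably why the snippets further down use `INSERT OR REPLACE` for attachments and meta files.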
Class diagram for SQLite-backed Dependencies implementation
classDiagram
class Dependencies {
- sqlite3_Connection _conn
- str _db_path
- pa_Schema _schema
+ Dependencies()
+ DataFrame __call__()
+ bool __contains__(file)
+ bool __eq__(other)
+ list __getitem__(file)
+ int __len__()
+ str __str__()
+ __del__()
+ dict __getstate__()
+ __setstate__(state)
+ list~str~ archives
+ list~str~ attachments
+ list~str~ attachment_ids
+ list~str~ files
+ list~str~ media
+ list~str~ removed_media
+ list~str~ table_ids
+ list~str~ tables
+ str archive(file)
+ int bit_depth(file)
+ int channels(file)
+ str checksum(file)
+ float duration(file)
+ str format(file)
+ None load(path)
+ bool removed(file)
+ None save(path)
+ int sampling_rate(file)
+ int type(file)
+ str version(file)
+ None _add_attachment(file, archive, checksum, version)
+ None _add_media(values)
+ None _add_meta(file, checksum, version)
+ scalar _column_loc(column, file, dtype)
+ None _drop(files)
+ None _remove(file)
+ DataFrame _set_dtypes(df)
+ None _update_media(values)
+ None _update_media_version(files, version)
}
class sqlite3_Connection {
}
class pa_Schema {
}
class DataFrame {
}
Dependencies --> sqlite3_Connection : uses
Dependencies --> pa_Schema : uses
Dependencies --> DataFrame : returns
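The `__getstate__` and `__setstate__` entries in the class diagram are needed because a `sqlite3.Connection` cannot be pickled. One way to handle this, shown here only as a sketch with a hypothetical `PicklableTable` class and a two-column table, is to dump the database to SQL text and replay it on restore. The PR may serialize its state differently.

```python
import sqlite3


class PicklableTable:
    """Hypothetical stand-in for a class that owns an in-memory SQLite connection."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE dependencies (file TEXT PRIMARY KEY, version TEXT)"
        )

    def add(self, file, version):
        self._conn.execute(
            "INSERT OR REPLACE INTO dependencies VALUES (?, ?)", (file, version)
        )

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM dependencies").fetchone()[0]

    def __getstate__(self):
        # Connections are not picklable, so export the database as SQL statements
        return {"sql": "\n".join(self._conn.iterdump())}

    def __setstate__(self, state):
        # Rebuild a fresh in-memory database from the exported statements
        self._conn = sqlite3.connect(":memory:")
        self._conn.executescript(state["sql"])
```

With these two methods in place, `pickle.loads(pickle.dumps(obj))` yields an independent copy with its own connection.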
New security issues found
self._conn.execute(
    f"INSERT INTO dependencies {DEPENDENCIES} VALUES {VALUES}",
    (
        file,
        row["archive"],
        row["bit_depth"],
        row["channels"],
        row["checksum"],
        row["duration"],
        row["format"],
        row["removed"],
        row["sampling_rate"],
        row["type"],
        row["version"],
    ),
)
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
if extension == "sqlite":
    # For SQLite files, we can attach and copy the data
    self._conn.execute(f"ATTACH DATABASE '{path}' AS source_db")
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
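Of the flagged statements, this one interpolates a caller-supplied value (`path`) directly into the SQL text, so a path containing a single quote would break it. SQLite accepts a bound parameter for the file name in `ATTACH`, so a possible alternative looks like the sketch below. The function name `copy_from_file` and the commit/rollback handling are assumptions for illustration, not the PR's code.

```python
import os
import sqlite3
import tempfile


def copy_from_file(conn, path):
    """Copy the dependencies table from the SQLite file at `path` into `conn`."""
    # ATTACH cannot run inside an open transaction, so finish any pending one
    conn.commit()
    # The file name is bound as a parameter instead of being formatted into the SQL
    conn.execute("ATTACH DATABASE ? AS source_db", (path,))
    try:
        conn.execute("INSERT INTO dependencies SELECT * FROM source_db.dependencies")
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        # DETACH also requires that no transaction is open
        conn.execute("DETACH DATABASE source_db")
```

Paths containing quotes or other special characters then need no escaping.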
self._conn.execute(
    f"INSERT OR REPLACE INTO dependencies {DEPENDENCIES} VALUES {VALUES}",
    (
        file,
        archive,
        0,
        0,
        checksum,
        0.0,
        format,
        0,
        0,
        define.DEPENDENCY_TYPE["attachment"],
        version,
    ),
)
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
self._conn.execute(
    f"INSERT OR REPLACE INTO dependencies {DEPENDENCIES} VALUES {VALUES}",
    (
        file,
        archive,
        0,
        0,
        checksum,
        0.0,
        format,
        0,
        0,
        define.DEPENDENCY_TYPE["meta"],
        version,
    ),
)
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
cursor = self._conn.execute(
    f"SELECT {column} FROM dependencies WHERE file = ?", (file,)
)
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
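Column names cannot be passed as bound parameters, so the `{column}` interpolation here is hard to avoid. A common mitigation is to check the name against a fixed allowlist before formatting it into the statement. The sketch below assumes the column set from the ER diagram; `COLUMNS` and `column_value` are hypothetical names.

```python
import sqlite3

# Allowlist of selectable columns (taken from the ER diagram; assumed complete)
COLUMNS = (
    "archive", "bit_depth", "channels", "checksum", "duration",
    "format", "removed", "sampling_rate", "type", "version",
)


def column_value(conn, column, file):
    """Return the value of `column` for `file`, or None if the file is unknown."""
    if column not in COLUMNS:
        raise ValueError(f"unknown column: {column!r}")
    # Safe to interpolate: `column` is one of the fixed names above.
    # The file name itself is still passed as a bound parameter.
    cursor = conn.execute(
        f"SELECT {column} FROM dependencies WHERE file = ?", (file,)
    )
    row = cursor.fetchone()
    return None if row is None else row[0]
```

If the column argument only ever comes from internal accessor methods, the check is mostly a safeguard against future misuse, and it also gives scanners a visible justification for the f-string.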
self._conn.execute(
    f"DELETE FROM dependencies WHERE file IN ({placeholders})", files
)
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
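In this statement the f-string only inserts `?` markers, while the file names themselves are bound, so the finding looks like a false positive as long as `placeholders` is built from the length of `files`. A separate practical concern is SQLite's limit on bound parameters per statement (999 by default before SQLite 3.32.0), which a long file list can exceed. The sketch below shows one way to build the placeholders and delete in chunks; `drop_files` and `chunk_size` are hypothetical, and the chunking is my addition rather than something the PR does.

```python
import sqlite3


def drop_files(conn, files, chunk_size=500):
    """Delete the given files from the dependencies table in chunks."""
    files = list(files)
    for start in range(0, len(files), chunk_size):
        chunk = files[start:start + chunk_size]
        # Only "?" markers are interpolated; the file names are bound values
        placeholders = ", ".join("?" for _ in chunk)
        conn.execute(
            f"DELETE FROM dependencies WHERE file IN ({placeholders})", chunk
        )
```

The same pattern would apply to the `UPDATE ... WHERE file IN (...)` statement, with `version` prepended to each chunk's parameter list.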
self._conn.execute(
    f"UPDATE dependencies SET version = ? WHERE file IN ({placeholders})",
    [version] + list(files),
)
security (python.sqlalchemy.security.sqlalchemy-execute-raw-query): Avoiding SQL string concatenation: untrusted input concatenated with raw SQL query can result in SQL Injection. In order to execute raw query safely, prepared statement should be used. SQLAlchemy provides TextualSQL to easily used prepared statement with named parameters. For complex SQL composition, use SQL Expression Language or Schema Definition Language. In most cases, SQLAlchemy ORM will be a better option.
Source: opengrep
Implements the sqlite suggestion from #517
It stores the dependency table in a SQLite file/database.
It shows some improvements for certain lookups, but overall performance is not better.
Summary by Sourcery
Store and manage dependency tables using a SQLite-backed Dependencies implementation and update loading, caching, and publishing paths to use a .sqlite dependency file as the primary format.
New Features:
- [...] Dependencies, including schema, indexes, and serialization support.
- [...] .sqlite format alongside existing CSV and Parquet files, and prefer SQLite when downloading from backends and using the cache.
Enhancements:
- [...] Dependencies operations (lookup, update, deletion, and metadata accessors) to use SQL queries instead of direct pandas DataFrame manipulation, while preserving the public API.
- [...] Dependencies interface or full DataFrame view instead.
Documentation:
- [...] db.sqlite as the dependency file format instead of db.parquet where applicable.
Tests:
- [...] Dependencies implementation and cover the new .sqlite load/save behavior.