Cloud storage support (Azure, S3, GCS) #1087
Open
SamirMoustafa wants to merge 14 commits into scverse:main from …
Patch da.to_zarr so ome_zarr's **kwargs are forwarded as zarr_array_kwargs, avoiding FutureWarning and keeping behavior correct.
- _FsspecStoreRoot, _get_store_root for path-like store roots (local + fsspec)
- _storage_options_from_fs for parquet writes to Azure/S3/GCS
- _remote_zarr_store_exists, _ensure_async_fs for UPath/FsspecStore
- Extend _resolve_zarr_store for UPath and _FsspecStoreRoot with async fs
- _backed_elements_contained_in_path, _is_element_self_contained accept UPath
- path and _path accept Path | UPath; setter allows UPath
- write() accepts file_path: str | Path | UPath | None (None uses path)
- _validate_can_safely_write_to_path handles UPath and remote store existence
- _write_element accepts Path | UPath; skip local subfolder checks for UPath
- __repr__ and _get_groups_for_element use path without forcing Path()
…table, zarr
- Resolve store via _resolve_zarr_store in read paths (points, shapes, raster, table)
- Use _get_store_root for parquet paths; read/write parquet with storage_options for fsspec
- io_shapes: upload parquet to Azure/S3/GCS via temp file when path is _FsspecStoreRoot
- io_zarr: _get_store_root, UPath in _get_groups_for_element and _write_consolidated_metadata; set sdata.path to UPath when store is remote
- pyproject.toml: adlfs, gcsfs, moto[server], pytest-timeout in test extras
- Dockerfile.emulators: moto, Azurite, fake-gcs-server for tests/io/remote_storage/
… emulator config
- full_sdata fixture: two regions for table categorical (avoids 404 on remote read)
- tests/io/remote_storage/conftest.py: bucket/container creation, resilient async shutdown
- tests/io/remote_storage/test_remote_storage.py: parametrized Azure/S3/GCS roundtrip and write tests
- Added "dimension_separator" to the frozenset of internal keys that should not be passed to zarr.Group.create_array(), ensuring compatibility with various zarr versions.
- Updated test to set region labels for full_sdata table, allowing the test_set_table_annotates_spatialelement to succeed without errors.
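The key filtering described in this commit can be pictured with a small sketch. It is not the PR's code: `_INTERNAL_KEYS` and `drop_internal_keys` are hypothetical names, and only `"dimension_separator"` comes from the commit message; the real frozenset contains other keys that are not listed here.

```python
# Hypothetical names; only "dimension_separator" is taken from the commit message.
_INTERNAL_KEYS = frozenset({"dimension_separator"})


def drop_internal_keys(kwargs: dict) -> dict:
    """Return a copy of kwargs without keys that must not reach create_array()."""
    return {key: value for key, value in kwargs.items() if key not in _INTERNAL_KEYS}


# Example: options collected upstream, one of which is internal-only.
options = {"dimension_separator": "/", "chunks": (64, 64), "dtype": "uint8"}
forwarded = drop_internal_keys(options)
```

Filtering into a new dict leaves the caller's `options` untouched, so the internal key stays available to whichever code still needs it.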
- Updated the `test_subset` function to exclude labels and poly from the default table, ensuring accurate subset validation.
- Enhanced `test_validate_table_in_spatialdata` to assert that both regions (labels2d and poly) are correctly annotated in the table.
- Adjusted `test_labels_table_joins` to restrict the table to labels2d, ensuring the join returns the expected results.
…inux
- Added steps to build and run storage emulators (S3, Azure, GCS) using Docker, specifically for the Ubuntu environment.
- Implemented a wait mechanism to ensure emulators are ready before running tests.
- Adjusted test execution to skip remote storage tests on non-Linux platforms.
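A wait mechanism of this kind can be sketched as a TCP readiness poll. This is an illustration only; the workflow itself may use shell commands instead. `wait_for_port` and `wait_for_emulators` are hypothetical helpers, and the mapping of ports to emulators is an assumption based on the usual defaults (moto 5000, Azurite blob 10000, fake-gcs-server 4443); the PR description only lists the three port numbers.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Assumed port-to-emulator mapping (see the note above).
EMULATOR_PORTS = {"s3": 5000, "azure": 10000, "gcs": 4443}


def wait_for_emulators(host: str = "127.0.0.1", timeout: float = 60.0) -> dict[str, bool]:
    """Return, per emulator, whether its port became reachable in time."""
    return {name: wait_for_port(host, port, timeout) for name, port in EMULATOR_PORTS.items()}
```

An open port shows only that the process is listening, not that buckets or containers exist; those are created separately in the test fixtures.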
- Wrapped the fsspec async sync function to prevent RuntimeError "Loop is not running" during process exit when using remote storage (Azure, S3, GCS).
- Ensured compatibility with async session management in the _utils module.
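The wrapping idea can be sketched as a decorator that swallows only this one error. This is a minimal sketch under assumptions, not the PR's implementation: `tolerate_closed_loop` and `close_session` are hypothetical names, and `close_session` is a stand-in for the real fsspec sync helper so that the example runs without fsspec installed.

```python
import functools


def tolerate_closed_loop(func):
    """Wrap func so that a 'Loop is not running' RuntimeError is swallowed."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RuntimeError as exc:
            if "Loop is not running" in str(exc):
                # The event loop is already gone (e.g. at interpreter exit).
                return None
            raise  # any other RuntimeError is a real problem

    return wrapper


@tolerate_closed_loop
def close_session(loop_running: bool) -> str:
    """Stand-in for a sync helper that fails once the loop has stopped."""
    if not loop_running:
        raise RuntimeError("Loop is not running")
    return "closed"
```

Matching on the message keeps unrelated `RuntimeError`s visible, at the cost of depending on the exact wording of the upstream error.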
for more information, see https://pre-commit.ci
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1087 +/- ##
==========================================
- Coverage 91.96% 91.85% -0.12%
==========================================
Files 51 52 +1
Lines 7729 7907 +178
==========================================
+ Hits 7108 7263 +155
- Misses 621 644 +23
…nd io_points modules
Cloud storage support (Azure, S3, GCS)
Summary
Add read/write support for SpatialData on remote object storage via `UPath`, fixing the issue reported in #999 where `sd.read_zarr(UPath("s3://..."))` failed because `SpatialData.path` did not accept `UPath`. PR #971 ("add remote support") pursued the same goal but remains a draft, blocked on zarr v3/ome-zarr and async fsspec after dask unpinning. This PR delivers working remote support by fixing the path setter, wrapping fsspec in an async filesystem where required for current zarr, and testing Azure, S3, and GCS via Docker emulators.
Supported features
- `SpatialData.path` accepts `None`, `str`, `Path`, or `UPath` (enables remote-backed objects).
- `SpatialData.read(upath)` and `read_zarr(upath)` for Azure Blob (`az://`), S3 (`s3://`), and GCS (`gs://`) using universal-pathlib (`UPath`).
- `sdata.write(upath)` and element-level writes to the same backends; parquet (points/shapes) and zarr (raster/tables) written via fsspec with async filesystem support where required.
- …`zmetadata`) supported.
Testing
Remote storage is tested with Docker-based emulators (Azurite for Azure, moto for S3, fake-gcs-server for GCS). In CI we build `tests/io/remote_storage/Dockerfile.emulators`, start the emulators on Ubuntu, then run the full test suite including `tests/io/remote_storage/`. These remote-storage tests run only on Ubuntu (Linux), because they depend on Docker; on Windows and macOS we skip `tests/io/remote_storage/` and run the rest of the suite. To run the remote tests locally you need Docker and can start the emulators with the same image and ports (5000, 10000, 4443) as in the workflow.
Example (three providers)
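A minimal sketch of such usage, assuming `spatialdata`, `universal-pathlib`, and the relevant fsspec backends (`adlfs`, `s3fs`, `gcsfs`) are installed. Every bucket name, container name, endpoint, and option value below is a placeholder, and `example_targets` and `write_and_read_back` are hypothetical helpers rather than part of the library; only `sd.read_zarr(upath)` and `sdata.write(upath)` come from the description above.

```python
def example_targets() -> dict[str, tuple[str, dict]]:
    """Return a URL and storage options per provider (placeholder values)."""
    return {
        "azure": (
            "az://my-container/sdata.zarr",
            {"connection_string": "<your-connection-string>"},
        ),
        "s3": (
            "s3://my-bucket/sdata.zarr",
            # endpoint_url is only needed for an emulator or S3-compatible service.
            {"endpoint_url": "http://127.0.0.1:5000"},
        ),
        "gcs": (
            "gs://my-bucket/sdata.zarr",
            {"endpoint_url": "http://127.0.0.1:4443", "token": "anon", "project": "my-project"},
        ),
    }


def write_and_read_back(sdata, url: str, storage_options: dict):
    """Write a SpatialData object to a remote store, then read it back."""
    # Imported lazily so the helpers above load without the optional dependencies.
    import spatialdata as sd
    from upath import UPath

    remote = UPath(url, **storage_options)
    sdata.write(remote)
    return sd.read_zarr(remote)
```

With an existing `sdata` object, a call could look like `write_and_read_back(sdata, *example_targets()["s3"])`.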
Credentials and options are passed through `UPath` (e.g. `connection_string`, `endpoint_url`, `anon`, `token`, `project`) as supported by the underlying fsspec backend.
Release notes
…`UPath`. `SpatialData.path` now accepts `UPath` in addition to `str` and `Path`. Fixes initialization from remote stores (e.g. S3) as in #999.