feat: add remote file support to ND2File via fsspec#290
feat: add remote file support to ND2File via fsspec#290derekthirstrup wants to merge 20 commits intotlambert03:mainfrom
Conversation
Add ND2FsspecReader for streaming access to ND2 files via fsspec: Features: - Streaming access: Only downloads frames you request - Parallel I/O: 2x+ speedup with ThreadPoolExecutor - Cloud support: S3, GCS, Azure, HTTP via fsspec - 3D bounding box crop: Read specific regions efficiently - File list optimization: Pre-compute chunk offsets for repeated reads - Full metadata extraction: Voxel sizes, time intervals, scene positions - Dask integration: Lazy loading for huge files New exports: - ND2FsspecReader: Main reader class - ND2FileList: Pre-computed chunk offsets for optimized reading - ImageMetadata: Full metadata dataclass - read_fsspec: Convenience function Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added comprehensive documentation for the ND2FsspecReader functionality: - Streaming access for remote files (HTTP, S3, GCS, Azure) - Parallel I/O with ThreadPoolExecutor - 3D bounding box crop support - File list optimization for repeated reads - Full metadata extraction (voxel sizes, time intervals, scenes) - Dask integration - API reference and benchmark results Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Parse experiment loops (ImageMetadataLV!) FIRST as authoritative source - Use position arrays as fallback only when loops don't provide values - Fix scene count: now correctly reports 4 scenes instead of 5 - Fix Z step: calculate from dZHigh/dZLow when dZStep is 0 This fixes metadata reporting where: - n_scenes was incorrectly 5 (should be 4) - voxel_size_z was 0.1625 (should be 1.0) - T (timepoints) was 1 (should be 120) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add _parse_metadata_via_nd2file() that uses ND2File for local files - ND2File properly parses experiment loops including pItemValid filtering - Correctly reports: T=120, S=4, Z=153, voxel_sizes=(1.0, 0.1625, 0.1625) - Remote files still use manual parsing with recursive loop extraction - Fixed XYPosLoop to filter by pItemValid for correct scene count Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change default max_workers from 8 to 64 for maximum throughput - Add Performance Optimizations section to README: - Metadata caching (chunk map read once, not per Z-slice) - Parallel I/O with 64 workers and HTTP connection pooling - Efficient Z cropping (only fetches slices in range) - Document XY cropping limitation (full frame + memory crop due to compression) - Update benchmark table with 64-worker results - Update API reference with new defaults Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
thanks @derekthirstrup. I am definitely open to such a feature, but it's going to take me a while to dig through an LLM-generated 2000 line PR. I would want to make sure it's not re-inventing wheels and not using established patterns in this repo correctly. I think it would be better if we didn't need an entirely new class for this case: it would be cleaner if we could re-use existing patterns and just have ND2File be smarter about local vs remote resources, and put this logic at the level of file-chunk reading (rather than a completely new class with a new API). so, I'm afraid this will have to sit for a while until I have time to look through it carefully. If you want to speed things up, then experiment with:
|
Clarify that pip install installs the default nd2 reader from PyPI, and provide uv pip install command for this fork with fsspec support. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix E501 line too long errors by shortening log messages - Fix B904 errors by using 'raise from err' pattern - Fix SIM105 by using contextlib.suppress for zlib error - Fix mypy type errors: - Add proper type casts for BinaryIO compatibility - Fix chunk_offsets dict key type in ND2FileList.from_dict - Add type annotations for crop parameters - Use getattr pattern for loop.parameters access - Add type: ignore for requests import Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add native support for remote URLs (http://, s3://, gs://, etc.) directly
in ND2File, eliminating the need for a separate ND2FsspecReader class.
Changes:
- protocol.py: Add _is_remote_path() and _open_file() helpers, make mmap
conditional (only for local files), use fsspec for remote URLs
- modern_reader.py: Add read_frame fallback via seek/read when mmap
unavailable, add read_frames_parallel() for high-throughput parallel
reads that can saturate 10Gbit+ connections
- _nd2file.py: Add read_frames() convenience method and is_remote property
Usage:
# Now works seamlessly with remote URLs
with nd2.ND2File("https://example.com/file.nd2") as f:
data = f.asarray()
# For maximum throughput on fast connections:
frames = f.read_frames(max_workers=64)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add contextlib import and use contextlib.suppress for mmap errors - Fix line length issues by splitting long lines - Fix type annotations to accept PathLike[Any] for broader compatibility Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add tests for _is_remote_path() URL detection - Add tests for _open_file() with mocked fsspec - Add tests for read_frames_parallel() method - Add integration tests for ND2File with remote URLs - Fix merge conflicts in protocol.py - Fix __repr__ to handle both Path and str paths Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CodSpeed Performance ReportMerging this PR will not alter performanceComparing Summary
|
…oach Remove the large _fsspec.py module (~1700 lines) and use the minimal integration directly in ND2File instead. Remote file support is now handled by: - protocol.py: URL detection and fsspec file opening - modern_reader.py: parallel reading for remote files - _nd2file.py: read_frames() method and is_remote property This simplifies the codebase while maintaining full functionality. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove invasive test that manipulated internal reader state, causing unclosed file handle warnings. Replace with simpler test that just verifies the method exists. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace extensive ND2FsspecReader documentation with concise documentation for the integrated remote file support in ND2File. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
@tlambert03 Thanks for the feedback! I've completely refactored the PR based on your suggestions: Changes Made
MotivationWe're building ML pipelines to train models on microscopy data stored in S3. Our workflow:
The parallel I/O (64 workers default) is designed to saturate 10Gbit+ network connections that are common on cloud GPU instances. Implementation ApproachRather than reinventing the wheel, the changes hook into existing patterns:
Local files continue to use mmap as before - the parallel path only activates when mmap isn't available (remote files). Happy to make any further adjustments! |
|
thanks. this is much easier to review now. I'll try to have a close look soon. Do you, by chance, happen to have an nd2 file on a public s3 or http path that I could test this with? |
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (52.38%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #290 +/- ##
==========================================
- Coverage 93.65% 92.14% -1.51%
==========================================
Files 22 22
Lines 2630 2723 +93
==========================================
+ Hits 2463 2509 +46
- Misses 167 214 +47 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
we are setting up a nd2 file on a public s3 bucket that you can use for testing. Can I send you the path via email? |
|
Note: 3D Bounding Box Cropping Feature During the refactor to integrate fsspec support directly into ND2File, the 3D bounding box cropping functionality was not carried over. This feature could be added back as an enhancement if desired. How it worked: The cropping feature allowed specifying a 6-tuple bounding box (z_start, z_end, y_start, y_end, x_start, x_end) to read only a subset of the volume. For remote files, this was particularly
Example use case: When extracting cell regions from a large multi-position timelapse, you could specify a tight bounding box around each cell or colony of interest and only transfer the relevant data - useful for workflows where you have pre-computed cell coordinates and need to extract thousands of small ROIs from terabyte-scale datasets. Let me know if this feature would be valuable to add back - it would involve calculating byte offsets for specific Z-planes and using HTTP range requests for efficient partial reads. |
Thread-local file handles created in _read_frames_parallel_remote() were never being closed, causing ResourceWarning in tests. Track all opened handles and close them in a finally block after execution. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
thanks @derekthirstrup I used the url you sent me: import nd2
f = nd2.ND2File('https://s3.us-west-2.amazonaws.com/.../filename.nd2')but got an error as soon as I checked the metadata: f.metadata---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 f.metadata
File ~/.local/share/uv/python/cpython-3.13.9-macos-aarch64-none/lib/python3.13/functools.py:1026, in cached_property.__get__(self, instance, owner)
1024 val = cache.get(self.attrname, _NOT_FOUND)
1025 if val is _NOT_FOUND:
-> 1026 val = self.func(instance)
1027 try:
1028 cache[self.attrname] = val
File ~/dev/self/nd2/src/nd2/_nd2file.py:553, in ND2File.metadata(self)
452 @cached_property
453 def metadata(self) -> Metadata:
454 """Various metadata (will be `dict` only if legacy format).
455
456 ??? example "Example output"
(...) 551 dict if legacy format, else `Metadata`
552 """
--> 553 return self._rdr.metadata()
File ~/dev/self/nd2/src/nd2/_readers/_modern/modern_reader.py:194, in ModernReader.metadata(self)
191 def metadata(self) -> structures.Metadata:
192 if not self._metadata:
193 self._metadata = load_metadata(
--> 194 raw_meta=self._cached_raw_metadata(),
195 global_meta=self._cached_global_metadata(),
196 )
197 return self._metadata
File ~/dev/self/nd2/src/nd2/_readers/_modern/modern_reader.py:169, in ModernReader._cached_raw_metadata(self)
165 def _cached_raw_metadata(self) -> RawMetaDict:
166 if self._raw_image_metadata is None:
167 k = (
168 b"ImageMetadataSeqLV|0!"
--> 169 if self.version() >= (3, 0)
170 else b"ImageMetadataSeq|0!"
171 )
172 meta = self._decode_chunk(k, strip_prefix=False)
173 meta = meta.get("SLxPictureMetadata", meta) # for v3 only
File ~/dev/self/nd2/src/nd2/_readers/protocol.py:196, in ND2Reader.version(self)
194 def version(self) -> tuple[int, int]:
195 """Return the file format version as a tuple of ints."""
--> 196 return get_version(self._fh or self._path)
File ~/dev/self/nd2/src/nd2/_parse/_chunk_decode.py:96, in get_version(fh)
94 with ctx as fh:
95 fh.seek(0)
---> 96 fname = str(fh.name)
97 chunk = START_FILE_CHUNK.unpack(fh.read(START_FILE_CHUNK.size))
99 magic, name_length, data_length, name, data = cast("StartFileChunk", chunk)
AttributeError: 'HTTPFile' object has no attribute 'name'so I don't think this is being rigorously tested enough yet. The tests are all very unit-testy and don't really exercise this path much. We either need to:
the best way to test this would be to get the string name of an s3 resource into the conftest.NEW list... so that it is tested against the full battery of tests that ask for " |
that is very much like #242 ... which I believe is the proper way to support that feature. I don't want to add a new public api to express some cropped read: that API is already very nicely met by putting dask in the loop, where the user can then use standard numpy indexing to request some sub-slice, and the lazy reading/loading is entirely handled behind the scenes. it already works just fine for everything but in-plane cropping, and that PR is the one that would add that for in-plane cropping. (it's just about doing the bookkeeping to update the striding when we read the bytes from disk... not really hard to do, but testing it is finicky and it needs someone who cares about it to get it over the finish line) |
Summary
Adds remote file support (HTTP, S3, GCS, Azure) to
ND2Filevia fsspec integration.This is a minimal refactor that integrates directly into the existing
ND2Fileclass rather than creating a separate reader, following the patterns already established in the codebase.Changes (~250 lines across 3 source files)
src/nd2/_readers/protocol.py_is_remote_path()to detect remote URLs (http, https, s3, gs, az, abfs, smb)_open_file()that uses fsspec for remote URLs, regularopen()for localfileno())_is_remoteflag on readersrc/nd2/_readers/_modern/modern_reader.pyread_frames_parallel()method usingThreadPoolExecutorsrc/nd2/_nd2file.pyread_frames()method for batch frame reading with parallel I/Ois_remoteproperty__repr__to handle both Path and str pathsUsage
Design Decisions
Test Coverage
read_frames()andis_remote