Skip to content

feat: add remote file support to ND2File via fsspec#290

Open
derekthirstrup wants to merge 20 commits intotlambert03:mainfrom
derekthirstrup:main
Open

feat: add remote file support to ND2File via fsspec#290
derekthirstrup wants to merge 20 commits intotlambert03:mainfrom
derekthirstrup:main

Conversation

@derekthirstrup
Copy link

@derekthirstrup derekthirstrup commented Jan 21, 2026

Summary

Adds remote file support (HTTP, S3, GCS, Azure) to ND2File via fsspec integration.

This is a minimal refactor that integrates directly into the existing ND2File class rather than creating a separate reader, following the patterns already established in the codebase.

Changes (~250 lines across 3 source files)

src/nd2/_readers/protocol.py

  • Add _is_remote_path() to detect remote URLs (http, https, s3, gs, az, abfs, smb)
  • Add _open_file() that uses fsspec for remote URLs, regular open() for local
  • Make mmap conditional (only created for local files with fileno())
  • Track _is_remote flag on reader

src/nd2/_readers/_modern/modern_reader.py

  • Add read_frames_parallel() method using ThreadPoolExecutor
  • Uses thread-local file handles for connection reuse
  • Falls back to sequential mmap reads for local files

src/nd2/_nd2file.py

  • Add read_frames() method for batch frame reading with parallel I/O
  • Add is_remote property
  • Fix __repr__ to handle both Path and str paths

Usage

import nd2

# Works with any fsspec-supported URL
with nd2.ND2File("https://example.com/file.nd2") as f:
    print(f.shape, f.dtype)
    data = f.asarray()

# Parallel I/O for high-bandwidth networks
with nd2.ND2File("s3://bucket/file.nd2") as f:
    frames = f.read_frames([0, 1, 2], max_workers=64)

Design Decisions

  1. Integrated into ND2File - No new classes, uses existing patterns
  2. Conditional mmap - Only for local files; remote uses seek/read
  3. Parallel I/O - 64 workers default for remote to saturate 10Gbit+ networks
  4. Lazy fsspec import - Only imported when needed for remote URLs

Test Coverage

  • URL detection tests for all supported protocols
  • File opening tests with mocked fsspec
  • Integration tests for read_frames() and is_remote

Derek Thirstrup and others added 8 commits January 20, 2026 14:04
Add ND2FsspecReader for streaming access to ND2 files via fsspec:

Features:
- Streaming access: Only downloads frames you request
- Parallel I/O: 2x+ speedup with ThreadPoolExecutor
- Cloud support: S3, GCS, Azure, HTTP via fsspec
- 3D bounding box crop: Read specific regions efficiently
- File list optimization: Pre-compute chunk offsets for repeated reads
- Full metadata extraction: Voxel sizes, time intervals, scene positions
- Dask integration: Lazy loading for huge files

New exports:
- ND2FsspecReader: Main reader class
- ND2FileList: Pre-computed chunk offsets for optimized reading
- ImageMetadata: Full metadata dataclass
- read_fsspec: Convenience function

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added comprehensive documentation for the ND2FsspecReader functionality:
- Streaming access for remote files (HTTP, S3, GCS, Azure)
- Parallel I/O with ThreadPoolExecutor
- 3D bounding box crop support
- File list optimization for repeated reads
- Full metadata extraction (voxel sizes, time intervals, scenes)
- Dask integration
- API reference and benchmark results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Parse experiment loops (ImageMetadataLV!) FIRST as authoritative source
- Use position arrays as fallback only when loops don't provide values
- Fix scene count: now correctly reports 4 scenes instead of 5
- Fix Z step: calculate from dZHigh/dZLow when dZStep is 0

This fixes metadata reporting where:
- n_scenes was incorrectly 5 (should be 4)
- voxel_size_z was 0.1625 (should be 1.0)
- T (timepoints) was 1 (should be 120)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add _parse_metadata_via_nd2file() that uses ND2File for local files
- ND2File properly parses experiment loops including pItemValid filtering
- Correctly reports: T=120, S=4, Z=153, voxel_sizes=(1.0, 0.1625, 0.1625)
- Remote files still use manual parsing with recursive loop extraction
- Fixed XYPosLoop to filter by pItemValid for correct scene count

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change default max_workers from 8 to 64 for maximum throughput
- Add Performance Optimizations section to README:
  - Metadata caching (chunk map read once, not per Z-slice)
  - Parallel I/O with 64 workers and HTTP connection pooling
  - Efficient Z cropping (only fetches slices in range)
- Document XY cropping limitation (full frame + memory crop due to compression)
- Update benchmark table with 64-worker results
- Update API reference with new defaults

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@tlambert03
Copy link
Owner

thanks @derekthirstrup.

I am definitely open to such a feature, but it's going to take me a while to dig through an LLM-generated 2000 line PR. I would want to make sure it's not re-inventing wheels and not using established patterns in this repo correctly. I think it would be better if we didn't need an entirely new class for this case: it would be cleaner if we could re-use existing patterns and just have ND2File be smarter about local vs remote resources, and put this logic at the level of file-chunk reading (rather than a completely new class with a new API).

so, I'm afraid this will have to sit for a while until I have time to look through it carefully. If you want to speed things up, then experiment with:

  1. making it possible to do nd2.ND2File('https://...') instead of ND2FsspecReader
  2. drastically reduce the amount of code needed (my gut says this should be possible with much less code)
  3. add tests

Derek Thirstrup and others added 7 commits January 21, 2026 09:01
Clarify that pip install installs the default nd2 reader from PyPI,
and provide uv pip install command for this fork with fsspec support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix E501 line too long errors by shortening log messages
- Fix B904 errors by using 'raise from err' pattern
- Fix SIM105 by using contextlib.suppress for zlib error
- Fix mypy type errors:
  - Add proper type casts for BinaryIO compatibility
  - Fix chunk_offsets dict key type in ND2FileList.from_dict
  - Add type annotations for crop parameters
  - Use getattr pattern for loop.parameters access
  - Add type: ignore for requests import

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add native support for remote URLs (http://, s3://, gs://, etc.) directly
in ND2File, eliminating the need for a separate ND2FsspecReader class.

Changes:
- protocol.py: Add _is_remote_path() and _open_file() helpers, make mmap
  conditional (only for local files), use fsspec for remote URLs
- modern_reader.py: Add read_frame fallback via seek/read when mmap
  unavailable, add read_frames_parallel() for high-throughput parallel
  reads that can saturate 10Gbit+ connections
- _nd2file.py: Add read_frames() convenience method and is_remote property

Usage:
  # Now works seamlessly with remote URLs
  with nd2.ND2File("https://example.com/file.nd2") as f:
      data = f.asarray()

  # For maximum throughput on fast connections:
  frames = f.read_frames(max_workers=64)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add contextlib import and use contextlib.suppress for mmap errors
- Fix line length issues by splitting long lines
- Fix type annotations to accept PathLike[Any] for broader compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add tests for _is_remote_path() URL detection
- Add tests for _open_file() with mocked fsspec
- Add tests for read_frames_parallel() method
- Add integration tests for ND2File with remote URLs
- Fix merge conflicts in protocol.py
- Fix __repr__ to handle both Path and str paths

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@codspeed-hq
Copy link

codspeed-hq bot commented Jan 21, 2026

CodSpeed Performance Report

Merging this PR will not alter performance

Comparing derekthirstrup:main (7c3599b) with main (7ba40db)

Summary

✅ 13 untouched benchmarks

Derek Thirstrup and others added 3 commits January 21, 2026 11:43
…oach

Remove the large _fsspec.py module (~1700 lines) and use the minimal
integration directly in ND2File instead. Remote file support is now
handled by:
- protocol.py: URL detection and fsspec file opening
- modern_reader.py: parallel reading for remote files
- _nd2file.py: read_frames() method and is_remote property

This simplifies the codebase while maintaining full functionality.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove invasive test that manipulated internal reader state,
causing unclosed file handle warnings. Replace with simpler
test that just verifies the method exists.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace extensive ND2FsspecReader documentation with concise
documentation for the integrated remote file support in ND2File.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@derekthirstrup
Copy link
Author

@tlambert03 Thanks for the feedback! I've completely refactored the PR based on your suggestions:

Changes Made

  1. Integrated into ND2File - Remote files now work with nd2.ND2File('https://...') directly, no separate class needed
  2. Drastically reduced code - From ~2000 lines down to ~250 lines across 3 files
  3. Added tests - Unit tests for URL detection, file opening, and read_frames()
  4. Removed _fsspec.py - All that code is gone, using existing patterns instead

Motivation

We're building ML pipelines to train models on microscopy data stored in S3. Our workflow:

  • ND2 files (10-100GB) stored on S3/cloud storage
  • Training on cloud H100 GPU clusters
  • Need to stream data directly without downloading entire files first

The parallel I/O (64 workers default) is designed to saturate 10Gbit+ network connections that are common on cloud GPU instances.

Implementation Approach

Rather than reinventing the wheel, the changes hook into existing patterns:

  • protocol.py: URL detection and fsspec file opening at the reader level
  • modern_reader.py: Parallel reading with ThreadPoolExecutor for remote files
  • _nd2file.py: read_frames() convenience method and is_remote property

Local files continue to use mmap as before - the parallel path only activates when mmap isn't available (remote files).

Happy to make any further adjustments!

@derekthirstrup derekthirstrup changed the title feat: add fsspec-based remote/streaming ND2 reader for cloud storage feat: add remote file support to ND2File via fsspec Jan 21, 2026
@tlambert03
Copy link
Owner

thanks. this is much easier to review now. I'll try to have a close look soon. Do you, by chance, happen to have an nd2 file on a public s3 or http path that I could test this with?

@codecov
Copy link

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 52.38095% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.14%. Comparing base (7ba40db) to head (7c3599b).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/nd2/_readers/_modern/modern_reader.py 20.00% 48 Missing ⚠️
src/nd2/_readers/protocol.py 92.59% 2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (52.38%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #290      +/-   ##
==========================================
- Coverage   93.65%   92.14%   -1.51%     
==========================================
  Files          22       22              
  Lines        2630     2723      +93     
==========================================
+ Hits         2463     2509      +46     
- Misses        167      214      +47     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@derekthirstrup
Copy link
Author

derekthirstrup commented Jan 21, 2026

we are setting up a nd2 file on a public s3 bucket that you can use for testing. Can I send you the path via email?

@tlambert03
Copy link
Owner

talley.lambert@gmail.com

@derekthirstrup
Copy link
Author

Note: 3D Bounding Box Cropping Feature

During the refactor to integrate fsspec support directly into ND2File, the 3D bounding box cropping functionality was not carried over. This feature could be added back as an enhancement if desired.

How it worked:

The cropping feature allowed specifying a 6-tuple bounding box (z_start, z_end, y_start, y_end, x_start, x_end) to read only a subset of the volume. For remote files, this was particularly
efficient because:

  1. Selective plane transfer: Only the Z-planes intersecting the bounding box were fetched from the remote server, rather than downloading the entire Z-stack
  2. Reduced bandwidth: For a typical use case like extracting a single cell from a large volume, this meant transferring ~5-10 planes instead of 150+
  3. In-memory cropping: The XY crop was applied after each plane was read, keeping memory usage minimal

Example use case:

When extracting cell regions from a large multi-position timelapse, you could specify a tight bounding box around each cell or colony of interest and only transfer the relevant data - useful for workflows where you have pre-computed cell coordinates and need to extract thousands of small ROIs from terabyte-scale datasets.

Let me know if this feature would be valuable to add back - it would involve calculating byte offsets for specific Z-planes and using HTTP range requests for efficient partial reads.

Derek Thirstrup and others added 2 commits January 21, 2026 16:23
Thread-local file handles created in _read_frames_parallel_remote()
were never being closed, causing ResourceWarning in tests. Track all
opened handles and close them in a finally block after execution.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@tlambert03
Copy link
Owner

thanks @derekthirstrup I used the url you sent me:

import nd2
f = nd2.ND2File('https://s3.us-west-2.amazonaws.com/.../filename.nd2')

but got an error as soon as I checked the metadata:

f.metadata
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 1
----> 1 f.metadata

File ~/.local/share/uv/python/cpython-3.13.9-macos-aarch64-none/lib/python3.13/functools.py:1026, in cached_property.__get__(self, instance, owner)
   1024 val = cache.get(self.attrname, _NOT_FOUND)
   1025 if val is _NOT_FOUND:
-> 1026     val = self.func(instance)
   1027     try:
   1028         cache[self.attrname] = val

File ~/dev/self/nd2/src/nd2/_nd2file.py:553, in ND2File.metadata(self)
    452 @cached_property
    453 def metadata(self) -> Metadata:
    454     """Various metadata (will be `dict` only if legacy format).
    455
    456     ??? example "Example output"
   (...)    551         dict if legacy format, else `Metadata`
    552     """
--> 553     return self._rdr.metadata()

File ~/dev/self/nd2/src/nd2/_readers/_modern/modern_reader.py:194, in ModernReader.metadata(self)
    191 def metadata(self) -> structures.Metadata:
    192     if not self._metadata:
    193         self._metadata = load_metadata(
--> 194             raw_meta=self._cached_raw_metadata(),
    195             global_meta=self._cached_global_metadata(),
    196         )
    197     return self._metadata

File ~/dev/self/nd2/src/nd2/_readers/_modern/modern_reader.py:169, in ModernReader._cached_raw_metadata(self)
    165 def _cached_raw_metadata(self) -> RawMetaDict:
    166     if self._raw_image_metadata is None:
    167         k = (
    168             b"ImageMetadataSeqLV|0!"
--> 169             if self.version() >= (3, 0)
    170             else b"ImageMetadataSeq|0!"
    171         )
    172         meta = self._decode_chunk(k, strip_prefix=False)
    173         meta = meta.get("SLxPictureMetadata", meta)  # for v3 only

File ~/dev/self/nd2/src/nd2/_readers/protocol.py:196, in ND2Reader.version(self)
    194 def version(self) -> tuple[int, int]:
    195     """Return the file format version as a tuple of ints."""
--> 196     return get_version(self._fh or self._path)

File ~/dev/self/nd2/src/nd2/_parse/_chunk_decode.py:96, in get_version(fh)
     94 with ctx as fh:
     95     fh.seek(0)
---> 96     fname = str(fh.name)
     97     chunk = START_FILE_CHUNK.unpack(fh.read(START_FILE_CHUNK.size))
     99 magic, name_length, data_length, name, data = cast("StartFileChunk", chunk)

AttributeError: 'HTTPFile' object has no attribute 'name'

so I don't think this is being rigorously tested enough yet. The tests are all very unit-testy and don't really exercise this path much. We either need to:

  1. fully set-up a minio container on CI, and "host" an s3 file locally for the tests to hit (big scope creep)
  2. actually put an nd2 file in a publicly accessible s3 bucket that requires no credentials to access, that we can use for testing (that would be up to you guys/AICS to do)

the best way to test this would be to get the string name of an s3 resource into the conftest.NEW list... so that it is tested against the full battery of tests that ask for "any_nd2".

@tlambert03
Copy link
Owner

the 3D bounding box cropping functionality was not carried over. This feature could be added back as an enhancement if desired.

that is very much like #242 ... which I believe is the proper way to support that feature. I don't want to add a new public api to express some cropped read: that API is already very nicely met by putting dask in the loop, where the user can then use standard numpy indexing to request some sub-slice, and the lazy reading/loading is entirely handled behind the scenes. it already works just fine for everything but in-plane cropping, and that PR is the one that would add that for in-plane cropping. (it's just about doing the bookkeeping to update the striding when we read the bytes from disk... not really hard to do, but testing it is finicky and it needs someone who cares about it to get it over the finish line)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants