New chunking approach that never splits encoded chunks #11060

Open
jsignell wants to merge 18 commits into pydata:main from jsignell:non-splitting-auto
@jsignell jsignell commented Dec 30, 2025

Proposal

A new chunks option that is only allowed to use encoded chunks or multiples of them. No chunk splitting allowed.
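To make "no chunk splitting" concrete, here is a minimal sketch of a check along a single dimension. The function name `respects_encoded_chunks` and its arguments are hypothetical and are not part of this PR or of xarray; it only assumes dask-style chunk sizes (a tuple of lengths along one dimension).

```python
def respects_encoded_chunks(chunk_sizes, encoded_chunk, dim_size):
    """Return True if no chunk boundary falls inside an encoded chunk.

    chunk_sizes: dask-style tuple of chunk lengths along one dimension.
    encoded_chunk: on-disk (encoded) chunk length along that dimension.
    dim_size: total length of the dimension.
    """
    if sum(chunk_sizes) != dim_size:
        raise ValueError("chunk sizes must add up to the dimension size")
    boundary = 0
    # The end of the last chunk is the end of the array, so skip it.
    for size in chunk_sizes[:-1]:
        boundary += size
        # An interior boundary must land on a multiple of the encoded chunk.
        if boundary % encoded_chunk != 0:
            return False
    return True
```

With an encoded chunk length of 80 on a dimension of size 500, chunks of `(160, 160, 180)` pass the check, while `(100, 100)` on a dimension of size 200 would cut the second encoded chunk in half.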

Demo

Current behavior when chunks="auto"

import xarray as xr

ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks="auto")
image

This PR introduces a new option: chunks="preserve"

ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks="preserve")
image

Context

I originally set out to update the auto_chunks function in dask, but my goals turned out to be quite different. Dask's auto_chunks guarantees that the chunk size stays under a configurable limit while preserving the aspect ratio of previous_chunks (here, previous_chunks is the encoded chunking). This PR instead guarantees that encoded chunks are never split; it multiplies them by integer factors to get the chunk size close to a target size. It doesn't try to preserve the aspect ratio of the chunks. Instead it picks the dimension with the greatest number of chunks and takes those in bigger bites.
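The strategy described above can be sketched roughly as follows. This is not the code in this PR: the function name `preserve_chunks_sketch`, the `target_bytes` parameter, and the choice to grow by doubling are all assumptions made for illustration. It only shows the idea of growing the dimension with the most chunks while never cutting an encoded chunk.

```python
from math import prod


def preserve_chunks_sketch(shape, encoded_chunks, itemsize, target_bytes):
    """Grow encoded chunks by whole multiples, never splitting them.

    Hypothetical illustration only; doubling is an assumed growth rule.
    """
    # An encoded chunk can be larger than the array; cap it at the shape.
    chunks = [min(c, s) for c, s in zip(encoded_chunks, shape)]
    while True:
        # Number of chunks along each dimension (ceiling division).
        nchunks = [-(-s // c) for s, c in zip(shape, chunks)]
        # Go after the dimension with the greatest number of chunks.
        dim = max(range(len(shape)), key=lambda i: nchunks[i])
        if nchunks[dim] <= 1:
            break  # every dimension is already a single chunk
        candidate = list(chunks)
        # Doubling keeps the size a multiple of the encoded chunk. Capping at
        # the full dimension is also safe: one chunk then spans everything.
        candidate[dim] = min(chunks[dim] * 2, shape[dim])
        if prod(candidate) * itemsize > target_bytes:
            break  # the next bite would overshoot the target size
        chunks = candidate
    return tuple(chunks)


print(preserve_chunks_sketch((1000, 100, 100), (1, 50, 50), 4, 1_000_000))
# -> (64, 50, 50)
```

In the example, only the first dimension grows, so the aspect ratio of the encoded chunks is not preserved. If a single encoded chunk is already larger than the target, the sketch returns it unchanged rather than splitting it.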

Also:

  • I'm not quite sure how the interface should work and I am definitely not attached to the word "preserve".
  • I originally put this in the DaskManager, but moved it to the base ChunkManagerEntrypoint class once I realized there was nothing dasky about it. I'm not sure if there is really supposed to be logic in methods on that class though.

@github-actions github-actions bot added topic-backends io topic-NamedArray Lightweight version of Variable labels Dec 30, 2025
@github-actions github-actions bot added topic-testing topic-hypothesis Strategies or tests using the hypothesis library labels Dec 31, 2025
({"x": "preserve", "y": -1}, (160, 500)),
],
)
def test_open_dataset_chunking_zarr_with_preserve(

These tests are kind of slow.

@jsignell

@dcherian is this something you would be able to review? I'd love to get more people trying it.

@github-actions github-actions bot added the topic-zarr Related to zarr storage library label Mar 11, 2026
@jsignell jsignell requested a review from dcherian March 24, 2026 17:04
@github-actions github-actions bot added topic-indexing topic-groupby topic-plotting topic-performance topic-cftime CI Continuous Integration tools Automation Github bots, testing workflows, release automation topic-DataTree Related to the implementation of a DataTree class labels Apr 15, 2026
jsignell and others added 12 commits April 15, 2026 16:05
* Move ``preserve_chunks`` to base ChunkManager class
* Get target size from dask config options for DaskManager
* Add test for open_zarr
…11230)

* Add inherit='all' option to DataTree.to_dataset()

* Add whats-new entry for inherit='all' (pydata#11230)

* Fix prune() signature accidentally modified by ruff-format

* Fix mypy errors: remove unused type-ignore, add typing.cast in test

* Rename inherit='all' to inherit='all_coords' per review feedback
@jsignell jsignell force-pushed the non-splitting-auto branch 2 times, most recently from 6512217 to ca22bd6 Compare April 15, 2026 20:09
@jsignell jsignell force-pushed the non-splitting-auto branch from ca22bd6 to 4bea2b3 Compare April 15, 2026 20:12
jsignell commented Apr 15, 2026

OK @dcherian, this is now a PR that changes the behavior of chunks="auto". The test changes needed were minimal, but if we think it's important to give people a way to get the prior behavior, we can add an option so that people can set xr.set_options(use_dask_auto=True).

This is what the commit to add that option would look like: jsignell@84cca20

Development

Successfully merging this pull request may close these issues.

opening a zarr dataset taking so much time with dask

3 participants