Skip to content

Fix multi-coordinate indexes dropped in _replace_maybe_drop_dims #11286

Merged
benbovy merged 5 commits intopydata:mainfrom
OpenScienceComputing:fix/multi-coord-index-drop
Apr 13, 2026
Merged

Fix multi-coordinate indexes dropped in _replace_maybe_drop_dims #11286
benbovy merged 5 commits intopydata:mainfrom
OpenScienceComputing:fix/multi-coord-index-drop

Conversation

@rsignell
Copy link
Copy Markdown
Contributor

@rsignell rsignell commented Apr 8, 2026

@benbovy, after some discussion with @Huite, he thought Claude Code might be capable of fixing this issue, so I gave it a shot. I'm only an appreciative user of the xarray codebase, so I don't know really how to test/evaluate this beyond the passing tests that Claude itself came up with. So I hope this isn't a waste of your time. With that said, here's what Claude came up with:

When a custom Index spans multiple coordinates across different dimensions (e.g. a UGRID topology index covering both nodes and faces), it was incorrectly dropped during:

  • DataArray dimension reduction (e.g. .mean("extra_dim")) via _replace_maybe_drop_dims, which filtered coords by strict dim subset without consulting Index.should_add_coord_to_array()
  • Dataset subsetting by variable names (e.g. ds[["node_data"]]) via _copy_listed, with the same issue

Both paths now use should_add_coord_to_array() for index-backed coordinates, consistent with how _construct_dataarray already works.

Description

Checklist

  • Closes 11215
  • Tests added

AI Disclosure

  • This PR contains AI-generated content.
  • I have tested the AI-generated content in my PR (but with Claude-generated tests)
  • I take responsibility for any AI-generated content in my PR.

Tools: Claude Code

…_copy_listed

When a custom Index spans multiple coordinates across different dimensions
(e.g. a UGRID topology index covering both `nodes` and `faces`), it was
incorrectly dropped during:

- DataArray dimension reduction (e.g. `.mean("extra_dim")`) via
  `_replace_maybe_drop_dims`, which filtered coords by strict dim subset
  without consulting `Index.should_add_coord_to_array()`
- Dataset subsetting by variable names (e.g. `ds[["node_data"]]`) via
  `_copy_listed`, with the same issue

Both paths now use `should_add_coord_to_array()` for index-backed
coordinates, consistent with how `_construct_dataarray` already works.

Fixes pydata#11215

Co-authored-by: Claude <noreply@anthropic.com>
@rsignell
Copy link
Copy Markdown
Contributor Author

rsignell commented Apr 8, 2026

@benboy, one more thing. I did ask Claude how if it used your suggestion in #11215 (comment) and it said:

Screenshot 2026-04-08 083049

- Cast OrderedSet to set when calling should_add_coord_to_array to satisfy
  the expected set[Hashable] type annotation
- Replace dict comprehensions with dict.fromkeys (ruff C416)
- Reformat assert messages to ruff-preferred style

Co-authored-by: Claude <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rsignell! That looks good to me.

For the case of Dataset.to_dataarray(), I think it should work similarly than in DataArray._replace_maybe_drop_dims():

coords = {}
for k, v in self.coords.items():
    if k in self._indexes:
        if self._indexes[k].should_add_coord_to_array(k, v, dims):
            coords[k] = v
    elif set(v.dims) <= dims:
        coords[k] = v

Comment thread xarray/tests/test_dataarray.py Outdated
Comment on lines +586 to +591
assert isinstance(next(iter(reduced_node.xindexes.values())), MultiDimIndex), (
"Index dropped from node DataArray after reduction"
)
assert isinstance(next(iter(reduced_face.xindexes.values())), MultiDimIndex), (
"Index dropped from face DataArray after reduction"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert isinstance(next(iter(reduced_node.xindexes.values())), MultiDimIndex), (
"Index dropped from node DataArray after reduction"
)
assert isinstance(next(iter(reduced_face.xindexes.values())), MultiDimIndex), (
"Index dropped from face DataArray after reduction"
)
for da in [reduced_node, reduced_face]:
for name in ["node_x", "node_y", "face_x", "face_y"]:
assert name in da.coords
assert isinstance(ds.xindexes[name], MultiDimIndex)

The instance(next(iter(...), ...) checks should be enough but this suggestion is to make the expected behavior clearer to the reader (or LLM).

Comment thread xarray/tests/test_dataset.py Outdated
Comment on lines +4585 to +4590
assert isinstance(next(iter(node_subset.xindexes.values())), MultiDimIndex), (
"Index dropped from Dataset when subsetting to node variable"
)
assert isinstance(next(iter(face_subset.xindexes.values())), MultiDimIndex), (
"Index dropped from Dataset when subsetting to face variable"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rsignell
Copy link
Copy Markdown
Contributor Author

rsignell commented Apr 10, 2026

Thanks for the review @benbovy!

Here is the Claude Code plan to address your comments -- let me know if you would like changes!

Planned changes

1. Fix Dataset.to_dataarray() (xarray/core/dataset.py ~line 7163)

The current implementation includes all coords unconditionally without calling should_add_coord_to_array, so multi-coordinate indexes can be dropped via filter_indexes_from_coords. Will fix to mirror the pattern in _copy_listed / _construct_dataarray:

needed_dims = set(broadcast_vars[0].dims)
coords: dict[Hashable, Variable] = {}
for k, v in self.coords.items():
    if k in self._indexes:
        if self._indexes[k].should_add_coord_to_array(k, v.variable, needed_dims):
            coords[k] = v.variable
    elif set(v.dims) <= needed_dims:
        coords[k] = v.variable
indexes = filter_indexes_from_coords(self._indexes, set(coords))

2. Improve test assertions in test_dataarray.py (lines 586–591)

Replace next(iter(...)) with the explicit loop you suggested (using da.xindexes[name]):

for da in [reduced_node, reduced_face]:
    for name in ["node_x", "node_y", "face_x", "face_y"]:
        assert name in da.coords
        assert isinstance(da.xindexes[name], MultiDimIndex)

3. Improve test assertions in test_dataset.py (lines 4585–4590)

Same pattern applied to the dataset test:

for subset in [node_subset, face_subset]:
    for name in ["node_x", "node_y", "face_x", "face_y"]:
        assert name in subset.coords
        assert isinstance(subset.xindexes[name], MultiDimIndex)

4. Add a new test for to_dataarray() with multi-coordinate indexes

Add test_to_dataarray_preserves_multi_coord_index in test_dataset.py using the same MultiDimIndex fixture pattern.


[This is Claude Code on behalf of Rich Signell]

@benbovy
Copy link
Copy Markdown
Member

benbovy commented Apr 10, 2026

At a first glance the the proposed changes look good!

- Apply should_add_coord_to_array() in Dataset.to_dataarray() for
  consistent handling of multi-coordinate indexes (per benbovy review)
- Add test_to_dataarray_preserves_multi_coord_index to cover this path
- Simplify isinstance(next(iter(...))) checks to cleaner for-loops
  in both test_dataarray.py and test_dataset.py (per benbovy suggestion)

Co-authored-by: Claude <noreply@anthropic.com>
@rsignell
Copy link
Copy Markdown
Contributor Author

@benbovy, okay, Claude thinks it addressed the comments. Fingers crossed! 😄

Comment thread xarray/core/dataset.py Outdated
variable = Variable(dims, data, self.attrs, fastpath=True)

coords = {k: v.variable for k, v in self.coords.items()}
broadcast_dims = set(broadcast_vars[0].dims)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, a fix is not even necessary here since Dataset variables are broadcasted against each other and thus all coordinates are kept in the returned DataArray anyway, like mentioned in the docstrings. I added to_dataarray() to the list as I naively looked at all places where filter_indexes_from_coords is called, sorry.

(filter_indexes_from_coords is not even needed, copying self._indexes should be enough).

rsignell and others added 2 commits April 13, 2026 12:36
Per reviewer feedback, no fix was needed in to_dataarray() since Dataset
variables are broadcasted and all coordinates are kept anyway. Replace the
should_add_coord_to_array loop and filter_indexes_from_coords call with
a simple dict copy of all coords and indexes.

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
@rsignell
Copy link
Copy Markdown
Contributor Author

@benbovy, I asked Claude to double-check if the issue was addressed:
image

@benbovy
Copy link
Copy Markdown
Member

benbovy commented Apr 13, 2026

LGTM too, thanks!

@benbovy benbovy merged commit a635d86 into pydata:main Apr 13, 2026
43 checks passed
@welcome
Copy link
Copy Markdown

welcome bot commented Apr 13, 2026

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants