Description
Hi,
Calling repr(sdata) fails during the self-contained check for points elements backed by a Parquet file, with the error:
AttributeError: 'list' object has no attribute 'values'
The failure originates in
_search_for_backing_files_recursively()
where the code assumes that each parquet-read task in the Dask graph has a dict in task.args[0] and therefore calls:
v.args[0].values()
However, when dask.dataframe.read_parquet() performs automatic partition aggregation, controlled by split_row_groups='infer' (default), task.args[0] can become a list of row-group dicts instead of a single dict.
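To illustrate the two shapes described above, here is a minimal sketch of a tolerant traversal that accepts either a single dict or a list of dicts. The helper name iter_task_arg_values is hypothetical, and the dict contents below are invented stand-ins rather than the real Dask payload; this is not the SpatialData implementation, only one way the assumption could be relaxed.

```python
from collections.abc import Iterator, Mapping
from typing import Any


def iter_task_arg_values(arg: Any) -> Iterator[Any]:
    """Yield the values of a dict, or of every dict nested in a list/tuple.

    Hypothetical helper: it only mirrors the two shapes reported in this
    issue (a dict, or a list of row-group dicts).
    """
    if isinstance(arg, Mapping):
        yield from arg.values()
    elif isinstance(arg, (list, tuple)):
        for item in arg:
            yield from iter_task_arg_values(item)
    # Any other type yields nothing instead of raising AttributeError.


# Invented stand-ins for task.args[0]; the real payload may look different.
single = {"piece": ("example_points.parquet/part.0.parquet", [0])}
aggregated = [
    {"piece": ("example_points.parquet/part.0.parquet", [0])},
    {"piece": ("example_points.parquet/part.0.parquet", [1])},
]

single_values = list(iter_task_arg_values(single))
aggregated_values = list(iter_task_arg_values(aggregated))
```

With a helper like this, the call site would iterate over iter_task_arg_values(v.args[0]) rather than v.args[0].values(), so both graph layouts take the same code path.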
Reproduce
To reproduce the error, first trigger Dask’s autopartitioning by forcing a DataFrame into a single partition, writing it to a Parquet file and reading it back with default read_parquet settings. Inspecting the graph reveals the list-of-dicts structure that breaks SpatialData. To trigger the original error, parse the DataFrame with PointsModel, build a SpatialData object, write it to a Zarr store (the error does not occur if unbacked), and finally call repr(sdata). The pseudocode is the following:
import dask.dataframe as dd
import pandas as pd
from spatialdata import SpatialData
from spatialdata.models import PointsModel
# 1. Create or load a Dask DataFrame (placeholder data; replace with your own table)
some_pandas_df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [0.0, 1.0, 2.0, 3.0]})
df = dd.from_pandas(some_pandas_df, npartitions=4)
# 2. Force a single partition
df_one_part = df.repartition(npartitions=1)
# 3. Write single-partition Parquet
df_one_part.to_parquet("example_points.parquet")
# 4. Read Parquet back with default settings (split_row_groups='infer')
df_read = dd.read_parquet("example_points.parquet")
# 5. Inspect graph (optional)
print(df_read.dask)  # shows list-of-dicts if autopartitioning changed structure
# 6. Parse with PointsModel
points = PointsModel.parse(df_read)
# 7. Build SpatialData object (elements are passed as a name -> element dict)
sdata = SpatialData(points={"points": points})
# 8. Write to Zarr store (error triggers only when sdata.is_backed() == True)
sdata.write("example.zarr")
# 9. Trigger error with repr
print(repr(sdata))