
implement coordinate hash #651

Open
d-chambers wants to merge 2 commits into dev from coord_hash

Conversation

@d-chambers
Contributor

@d-chambers d-chambers commented Apr 4, 2026

Summary

This PR does two things:

  1. Implements hash for coords, normalizing unit-bearing values to base SI units so equivalent coords hash the same.
  2. Fixes an intermittent hang in remote HTTP HDF5 reads caused by the no-range fallback path.

Coord hashing

  • Implement BaseCoord.hash
  • Normalize unit-bearing coords to base SI before hashing
  • Use range metadata directly for CoordRange
  • Hash full arrays for non-range coords
  • Rename hash_numpy_array to canonical hash_array
  • Keep a compatibility alias for hash_numpy_array

Remote HTTP HDF5 fix
Root cause was in the remote HDF5 fallback path:

  • when an HTTP server did not support range requests, fallback could re-enter fsspec HTTP while already inside an active fsspec HTTP read
  • that could block in the async/sync bridge and hang the test run
  • after metadata fallback had already materialized a local cached file, later HDF5 reads still reopened the remote HTTP path instead of reusing the cached artifact

Fixes:

  • HTTP cache downloads now use direct blocking urlopen(...) instead of resource.open(...) in the fallback materialization path
  • H5Reader now prefers an already-cached local file for remote HDF5 resources before attempting the remote-first fallback path
  • H5Reader also explicitly owns and closes wrapped file objects
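The first fix, a direct blocking download, might look roughly like this. The function name and defaults are hypothetical; only the urlopen-instead-of-resource.open approach comes from the PR:

```python
import shutil
from urllib.request import Request, urlopen


def download_to_file(url, dest, headers=None, timeout=60.0, block_size=2**20):
    """Materialize a remote file with a plain blocking urlopen instead of
    resource.open(), so the fallback path never re-enters an async
    filesystem layer mid-read (hypothetical sketch, not the PR's code)."""
    request = Request(str(url), headers=dict(headers or {}))
    with urlopen(request, timeout=timeout) as remote, open(dest, "wb") as local:
        # Stream in chunks so large files never sit fully in memory.
        shutil.copyfileobj(remote, local, length=block_size)
```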

Summary by CodeRabbit

  • New Features

    • Coordinates gain stable, deterministic hashing for use as dict/set keys.
    • Array hashing improved (including datetime handling) with a legacy alias preserved.
    • HTTP(S) downloads now honor custom headers and a configurable timeout; remote caching helper added.
  • Bug Fixes

    • Improved HDF5 reader handle ownership to ensure proper resource cleanup.
  • Tests

    • Added extensive tests for coordinate hashing, array hashing, HTTP downloads, HDF5 handles, and remote caching.

@coderabbitai
Contributor

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

Walkthrough

Adds deterministic hashing to coordinate types via BaseCoord.__hash__() with unit- and dtype-aware hashing; renames hash_numpy_array to hash_array (keeps an alias) and adjusts the fast path for datelike dtypes; introduces a managed HDF5 handle proxy and HTTP streaming download with cached-local-file resolution; adds tests covering these behaviors.

Changes

Cohort / File(s) Summary
Coordinate hashing & tests
dascore/core/coords.py, tests/test_core/test_coords.py
Implements BaseCoord._get_hashable_coord, _hash_scalar, _hash_array_token, and overrides __hash__() to produce deterministic hashes for CoordRange, CoordPartial, CoordArray/others with unit normalization and dtype-aware rules. Changes CoordPartial.convert_units from alias-to-update to an explicit method. Adds tests validating hash equality/inequality, unit-normalization, null-scalar handling, and round-trip/hash contract.
Array hashing util & tests
dascore/utils/array.py, tests/test_utils/test_array_utils.py
Renames public function to hash_array(arr) and retains hash_numpy_array as a compatibility alias. Updates fast-path logic to avoid zero-copy handling for datetime/timedelta dtypes and adds tests covering datetime64[ns] behavior and alias parity.
HDF5 handle lifecycle & remote I/O
dascore/utils/hdf5.py, dascore/utils/remote_io.py, dascore/config.py, tests/test_utils/test_io_utils.py, tests/test_io/test_remote_http.py
Adds _ManagedH5pyFile proxy to ensure paired closure of h5py.File and underlying file-like objects; centralizes opening h5py with the fileobj driver. Refactors _download_remote_file to use urllib.request/urlopen with headers and timeout from config; adds get_cached_local_file() helper. Adds remote_download_timeout config field and tests for handle lifecycle, caching behavior, and HTTP download semantics.
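The paired-closure idea behind the `_ManagedH5pyFile` proxy named in the walkthrough can be sketched without h5py as a generic wrapper; all names here are illustrative, not dascore's implementation:

```python
class ManagedFilePair:
    """Sketch of paired handle closure: closing the proxy closes both the
    outer handle (e.g. an h5py.File opened with the fileobj driver) and
    the wrapped file-like object it reads from."""

    def __init__(self, outer, inner):
        self._outer = outer
        self._inner = inner

    def __getattr__(self, name):
        # Delegate everything except lifecycle methods to the outer handle.
        return getattr(self._outer, name)

    def close(self):
        try:
            self._outer.close()
        finally:
            # Always release the wrapped object, even if the outer close fails.
            self._inner.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

Without such pairing, closing only the h5py handle can leave the underlying HTTP or cached-file object open, which is the resource-cleanup gap the PR addresses.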

Possibly related PRs

  • add array hash function #629: Introduced the original hash_numpy_array(...) utility; directly related because this change renames it to hash_array(...) and uses it in coordinate hashing.

Suggested labels

proc, IO, patch

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 70.13%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title 'implement coordinate hash' is concise and clearly summarizes the primary feature being added.
  • Description check — ✅ Passed: the description is well-structured and comprehensive, covering both major features (coordinate hashing and the remote HTTP HDF5 fixes) with clear explanations of the problems, solutions, and implementation details.


@coderabbitai coderabbitai bot added bug Something isn't working ready_for_review PR is ready for review labels Apr 4, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1cfaa34167

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +494 to +495
payload = (coord.__class__, unit_str, hash_array(coord.values))
return hash(payload)


P1: Align CoordArray hashing with tolerant equality

BaseCoord.__hash__ hashes raw coord.values bytes, but BaseCoord.__eq__ (via sensible_model_equals/all_close) treats float arrays as equal within np.isclose tolerances. Two CoordArray instances that compare equal (e.g., tiny floating-point differences) can therefore produce different hashes, which breaks Python's hash contract and causes incorrect behavior in sets/dicts keyed by coordinates.

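The reviewer's point can be demonstrated directly: two float arrays within `np.isclose` tolerance compare equal under tolerant equality, yet their raw bytes differ, so any byte-based hash of the values differs too:

```python
import numpy as np

# Two float arrays the tolerant equality path would call equal:
a = np.array([1.0, 2.0, 3.0])
b = a + 1e-12  # tiny floating-point perturbation

close = np.allclose(a, b)                # True: equal within isclose tolerances
same_bytes = a.tobytes() == b.tobytes()  # False: the raw bytes differ
# A hash built on raw bytes therefore differs for "equal" coords,
# violating Python's rule that equal objects must hash equal.
```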

Comment on lines +139 to +140
headers = dict(getattr(resource, "storage_options", {}) or {})
request = Request(str(resource), headers=headers)


P1: Preserve HTTP options instead of mapping all to headers

The HTTP download path now converts resource.storage_options directly into request headers, which misinterprets common UPath/fsspec options (for example headers={...} becomes a literal headers: header, and non-header options are also injected). In authenticated or customized HTTP setups this drops the intended auth/config and can make fallback materialization fail (e.g., 401/403) when ensure_local_file is triggered.


@codecov

codecov bot commented Apr 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (e656799) to head (df55ff8).
⚠️ Report is 2 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##               dev      #651    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files          134       137     +3     
  Lines        12042     12736   +694     
==========================================
+ Hits         12042     12736   +694     
Flag Coverage Δ
unittests 100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dascore/core/coords.py`:
- Around line 457-461: The hashing path currently leaves scalar magnitudes in
their original units because CoordPartial.convert_units just updates unit
metadata; modify the normalization so start/stop/step magnitudes are converted
to the target units before hashing. Concretely, update
CoordPartial.convert_units (or inside _get_hashable_coord) to actually convert
numeric magnitudes (start, stop, step) using the units conversion (e.g., via the
units library's to/convert method) rather than only updating the units
attribute, or ensure _get_hashable_coord calls simplify_units on a copy whose
start/stop/step have been converted to the simplified units; reference the
methods _get_hashable_coord, simplify_units, and CoordPartial.convert_units and
convert those scalar fields to the target units so equivalent values in cm vs m
hash identically.
- Around line 470-495: The __hash__ implementation currently returns Python's
built-in hash(payload) which is non-deterministic across interpreter runs;
change __hash__ to produce a deterministic digest (e.g. SHA-256) by serializing
the same payload components used now (coord.__class__ identity string, unit_str,
coord.shape, dtype string for CoordPartial, and the outputs of _hash_scalar and
hash_array) into a stable byte sequence and returning a stable integer derived
from that digest; update the branches for CoordRange, CoordPartial, and the
default (array values) in the __hash__ method to build the same payload items
but pass them through a deterministic hasher (hashlib) rather than hash(),
keeping references to _get_hashable_coord, _hash_scalar, CoordRange,
CoordPartial, and hash_array so the payload composition remains unchanged.

In `@dascore/utils/remote_io.py`:
- Around line 138-147: The urlopen call in the HTTP branch uses no timeout and
can hang; update the logic that handles HTTP downloads (check _HTTP_PROTOCOLS,
Request and urlopen usage in this function) to pass a timeout value (either a
new DascoreConfig field like remote_download_timeout on get_config() or a
sensible default such as 60 seconds) when calling urlopen(Request(...),
timeout=...). Keep the existing chunked read using
get_config().remote_download_block_size and ensure the new config field is read
from get_config() so tests can override it.

In `@tests/test_core/test_coords.py`:
- Around line 535-539: The test test_equal_coords_share_hash currently passes
the same instance because get_coord(data=coord) returns coord unchanged; change
it to construct a distinct but equal instance (e.g., call
get_coord(data=coord.model_dump()) or build from copied values/range metadata)
so other != coord (identity) but other == coord (equality) and then assert
hash(other) == hash(coord); ensure you reference the get_coord call and
coord.model_dump() (or an explicit deepcopy of coord's fields) to rebuild the
second instance.
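The test fix described above — rebuild a distinct-but-equal instance instead of reusing the same object — can be sketched with a stand-in frozen dataclass; the real test would use `get_coord` and `coord.model_dump()` as the prompt says:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coord:  # hypothetical stand-in for a coordinate model
    start: float
    stop: float
    step: float


coord = Coord(0.0, 10.0, 1.0)
# Rebuild from dumped fields so the second instance is distinct but equal,
# rather than the very same object (the flaw the comment points out):
other = Coord(**vars(coord))
```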

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bcf61b71-3ead-4137-b7b1-e1cec3b3594f

📥 Commits

Reviewing files that changed from the base of the PR and between 35d6b82 and 1cfaa34.

📒 Files selected for processing (8)
  • dascore/core/coords.py
  • dascore/utils/array.py
  • dascore/utils/hdf5.py
  • dascore/utils/remote_io.py
  • tests/test_core/test_coords.py
  • tests/test_io/test_remote_http.py
  • tests/test_utils/test_array_utils.py
  • tests/test_utils/test_io_utils.py

Comment on lines +457 to +461
def _get_hashable_coord(self) -> Self:
    """Return a coordinate normalized for stable hashing."""
    if self.units is None or dtype_time_like(self.dtype):
        return self
    return self.simplify_units()
Contributor


⚠️ Potential issue | 🟡 Minor

CoordPartial unit normalization is a no-op for the hashed scalars.

This path assumes simplify_units() also normalizes start/stop/step, but CoordPartial.convert_units is just update, so those magnitudes stay in their original units. A partial coord in centimeters and the equivalent one in meters will still hash differently.
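The fix this comment asks for — convert magnitudes to base units before hashing, not merely relabel them — can be illustrated with a toy conversion table standing in for the real units library (a real implementation would use something like pint's `Quantity.to_base_units()`):

```python
# Toy stand-in for a units library's conversion to base SI units.
_UNITS_PER_METER = {"m": 1.0, "cm": 100.0, "mm": 1000.0}


def normalize_scalar(value: float, unit: str) -> tuple[float, str]:
    """Convert a scalar magnitude (e.g. a start/stop/step field) to base
    units before hashing, so equivalent values expressed in different
    units produce identical hash tokens."""
    return value / _UNITS_PER_METER[unit], "m"
```

Under this scheme a partial coord starting at 100 cm and one starting at 1 m normalize to the same scalar token, which is exactly what the no-op unit update fails to achieve.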

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dascore/core/coords.py` around lines 457 - 461, The hashing path currently
leaves scalar magnitudes in their original units because
CoordPartial.convert_units just updates unit metadata; modify the normalization
so start/stop/step magnitudes are converted to the target units before hashing.
Concretely, update CoordPartial.convert_units (or inside _get_hashable_coord) to
actually convert numeric magnitudes (start, stop, step) using the units
conversion (e.g., via the units library's to/convert method) rather than only
updating the units attribute, or ensure _get_hashable_coord calls simplify_units
on a copy whose start/stop/step have been converted to the simplified units;
reference the methods _get_hashable_coord, simplify_units, and
CoordPartial.convert_units and convert those scalar fields to the target units
so equivalent values in cm vs m hash identically.

@coderabbitai coderabbitai bot added IO Work for reading/writing different formats proc Related to processing module patch related to Patch class labels Apr 4, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
dascore/core/coords.py (1)

493-506: ⚠️ Potential issue | 🟠 Major

hash(payload) still makes the result process-specific.

The helpers build stable tokens, but the final hash(payload) reintroduces PYTHONHASHSEED and object-identity variability, so the new coordinate hash still changes across interpreter runs. If this is intended for stable cache keys or cross-process comparisons, finish with a deterministic digest instead of Python’s built-in hash().

#!/bin/bash
set -euo pipefail

for seed in 1 2; do
  printf 'PYTHONHASHSEED=%s -> ' "$seed"
  PYTHONHASHSEED=$seed python - <<'PY'
payload = ("CoordRange", "m", (10,), ("scalar", "0123456789abcdef"))
print(hash(payload))
PY
done
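One way to finish with a deterministic digest, as this comment suggests: feed the same payload tokens through `hashlib` and fold the digest to a fixed-width integer. This is a sketch, not the PR's implementation:

```python
import hashlib


def stable_hash(*tokens: str) -> int:
    """Deterministic stand-in for hash(payload): digest the tokens with
    SHA-256 in a fixed order and fold to a 64-bit integer, so the result
    is identical across interpreter runs regardless of PYTHONHASHSEED."""
    digest = hashlib.sha256()
    for token in tokens:
        digest.update(token.encode("utf-8"))
        digest.update(b"\x00")  # separator: ("ab", "c") must differ from ("a", "bc")
    return int.from_bytes(digest.digest()[:8], "big")
```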
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dascore/core/coords.py` around lines 493 - 506, The code currently returns
Python's built-in hash(payload) (in the CoordPartial branch and the fallback),
which is process-dependent; instead compute and return a deterministic digest
(e.g., SHA-256 hex) over the stable tokens you already build (class name,
unit_str, coord.shape, str(np.dtype(...)), and the outputs of
_hash_scalar/_hash_array_token). Replace the final hash(payload) calls with a
deterministic digest computed via hashlib (serialize the payload parts in a
stable order/encoding—e.g., join or JSON with sort_keys disabled since you
control ordering—and encode to bytes before hashing) and return that fixed hex
string (or a stable int derived from the hex) so both branches (CoordPartial and
non-partial) produce cross-process stable keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dascore/utils/remote_io.py`:
- Around line 138-143: The code treats storage_options as flat HTTP headers and
trusts the resource protocol; update the HTTP branch handling around
_HTTP_PROTOCOLS, Request, and urlopen to (1) extract headers compatibly from
getattr(resource, "storage_options", {}) by checking for a nested "headers" key
first and falling back to the top-level mapping (ensuring a dict of header
strings), and (2) explicitly parse and validate the URL scheme of str(resource)
(e.g., via urllib.parse.urlparse) to allow only "http" or "https" before calling
urlopen (use get_config().remote_download_timeout as the timeout). Ensure
Request receives the validated URL and the constructed headers.

---

Duplicate comments:
In `@dascore/core/coords.py`:
- Around line 493-506: The code currently returns Python's built-in
hash(payload) (in the CoordPartial branch and the fallback), which is
process-dependent; instead compute and return a deterministic digest (e.g.,
SHA-256 hex) over the stable tokens you already build (class name, unit_str,
coord.shape, str(np.dtype(...)), and the outputs of
_hash_scalar/_hash_array_token). Replace the final hash(payload) calls with a
deterministic digest computed via hashlib (serialize the payload parts in a
stable order/encoding—e.g., join or JSON with sort_keys disabled since you
control ordering—and encode to bytes before hashing) and return that fixed hex
string (or a stable int derived from the hex) so both branches (CoordPartial and
non-partial) produce cross-process stable keys.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 262318e8-9658-434d-b8cd-e486140ca018

📥 Commits

Reviewing files that changed from the base of the PR and between 1cfaa34 and df55ff8.

📒 Files selected for processing (5)
  • dascore/config.py
  • dascore/core/coords.py
  • dascore/utils/remote_io.py
  • tests/test_core/test_coords.py
  • tests/test_utils/test_io_utils.py
✅ Files skipped from review due to trivial changes (1)
  • tests/test_core/test_coords.py

Comment on lines +481 to +506
def __hash__(self):
    coord = self._get_hashable_coord()
    unit_str = coord.unit_str
    if isinstance(coord, CoordRange):
        payload = (
            coord.__class__,
            unit_str,
            coord.shape,
            self._hash_scalar(coord.start),
            self._hash_scalar(coord.stop),
            self._hash_scalar(coord.step),
        )
        return hash(payload)
    if isinstance(coord, CoordPartial):
        payload = (
            coord.__class__,
            unit_str,
            coord.shape,
            str(np.dtype(coord.dtype)),
            self._hash_scalar(coord.start),
            self._hash_scalar(coord.stop),
            self._hash_scalar(coord.step),
        )
        return hash(payload)
    payload = (coord.__class__, unit_str, self._hash_array_token(coord.values))
    return hash(payload)
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Documented cross-family equality example:"
sed -n '1888,1895p' dascore/core/coords.py

echo
echo "Current __hash__ implementation:"
sed -n '481,506p' dascore/core/coords.py

echo
echo "Shared model equality implementation:"
fd 'models.py' dascore | xargs -r rg -n -C3 'class DascoreBaseModel|def __eq__'

Repository: DASDAE/dascore

Length of output: 1624


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Finding _get_hashable_coord implementation ==="
rg -n -A10 '_get_hashable_coord' dascore/core/coords.py | head -30

echo
echo "=== Finding __eq__ implementation in BaseCoord ==="
rg -n -B2 -A15 'def __eq__' dascore/core/coords.py | head -50

echo
echo "=== Checking _hash_scalar and _hash_array_token implementations ==="
rg -n -A8 'def _hash_scalar|def _hash_array_token' dascore/core/coords.py

echo
echo "=== Checking if CoordRange is converted to CoordArray anywhere ==="
rg -n 'to_array|convert.*array' dascore/core/coords.py | head -20

Repository: DASDAE/dascore

Length of output: 1057


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Class definitions and inheritance ==="
rg -n 'class BaseCoord|class CoordRange|class CoordArray' dascore/core/coords.py | head -10

echo
echo "=== Search for __eq__ in entire coords.py file ==="
rg -n '__eq__' dascore/core/coords.py

echo
echo "=== Check if BaseModel or parent provides __eq__ ==="
rg -n 'class DascoreBaseModel' dascore/models.py -A15 | head -30

echo
echo "=== Check simplify_units implementation ==="
rg -n -A20 'def simplify_units' dascore/core/coords.py

echo
echo "=== Look at entire BaseCoord class definition start ==="
sed -n '80,130p' dascore/core/coords.py

Repository: DASDAE/dascore

Length of output: 267


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== DascoreBaseModel full implementation ==="
rg -n -A50 'class DascoreBaseModel' dascore/models.py | head -70

echo
echo "=== Check if Pydantic BaseModel is used (implicit __eq__) ==="
rg -n 'from pydantic import BaseModel' dascore/models.py

echo
echo "=== Check simplify_units to understand what it does ==="
rg -n -A30 'def simplify_units' dascore/core/coords.py | head -50

echo
echo "=== Look for any coord value/data comparison methods ==="
rg -n 'values.*==|data.*==|array.*==' dascore/core/coords.py | head -20

Repository: DASDAE/dascore

Length of output: 177


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Find where DascoreBaseModel is defined ==="
rg -n 'class DascoreBaseModel' dascore/

echo
echo "=== Check imports in coords.py ==="
head -50 dascore/core/coords.py | rg -n 'DascoreBaseModel|import'

echo
echo "=== Try to find models.py in different locations ==="
fd -t f 'models.py' dascore/

echo
echo "=== Check entire dascore structure ==="
ls -la dascore/ | head -20

Repository: DASDAE/dascore

Length of output: 2247


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full DascoreBaseModel implementation ==="
rg -n -A60 'class DascoreBaseModel' dascore/utils/models.py

echo
echo "=== Check if __eq__ is explicitly defined in DascoreBaseModel ==="
rg -n '__eq__' dascore/utils/models.py

echo
echo "=== Check full import list in coords.py ==="
head -60 dascore/core/coords.py | grep -n 'from\|import'

Repository: DASDAE/dascore

Length of output: 2255


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Find sensible_model_equals implementation ==="
rg -n -B2 -A20 'def sensible_model_equals' dascore/utils/models.py

echo
echo "=== Check what is imported from dascore.utils.models in coords.py ==="
rg -n -A10 'from dascore.utils.models import' dascore/core/coords.py

echo
echo "=== Look at CoordRange and CoordArray definitions to understand their structure ==="
rg -n -A5 'class CoordRange|class CoordArray' dascore/core/coords.py | head -30

Repository: DASDAE/dascore

Length of output: 1782


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full CoordRange class definition ==="
sed -n '1164,1250p' dascore/core/coords.py

echo
echo "=== Full CoordArray class definition ==="
sed -n '1422,1500p' dascore/core/coords.py

echo
echo "=== Check model_fields for CoordRange ==="
rg -n 'model_fields|__init__' dascore/core/coords.py | grep -A5 -B5 'CoordRange'

Repository: DASDAE/dascore

Length of output: 6601


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Search for any override of __eq__ in CoordRange or CoordArray ==="
rg -n 'class CoordRange|class CoordArray' dascore/core/coords.py -A100 | grep -E 'def __eq__|class CoordRange|class CoordArray'

echo
echo "=== Check if there's any conversion from CoordRange to CoordArray before comparison ==="
rg -n 'to_array|values' dascore/core/coords.py | grep -i 'coordrange\|coordarray' | head -20

echo
echo "=== Look at the exact test case from the docstring to understand the claim ==="
sed -n '1888,1905p' dascore/core/coords.py

echo
echo "=== Check if there's a conversion method on CoordRange ==="
rg -n 'def values|def to_array' dascore/core/coords.py | grep -B5 -A5 'CoordRange\|1[2-3][0-9][0-9]'

Repository: DASDAE/dascore

Length of output: 1392


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check BaseCoord.values property (line 1097) ==="
sed -n '1097,1110p' dascore/core/coords.py

echo
echo "=== Check CoordArray.values property (line 1370) ==="
sed -n '1370,1380p' dascore/core/coords.py

echo
echo "=== Check all fields defined in BaseCoord and subclasses ==="
sed -n '211,280p' dascore/core/coords.py

echo
echo "=== Now verify the actual test - run a minimal Python test to see if they compare equal ==="
python3 << 'EOF'
import sys
sys.path.insert(0, '/root/code')

import numpy as np
from dascore.core.coords import get_coord

# Create test coordinates
range_coord = get_coord(start=1, stop=12, step=1)
array_coord = get_coord(data=np.arange(1, 12, 1))

print(f"Range coord type: {type(range_coord)}")
print(f"Array coord type: {type(array_coord)}")
print(f"Range coord fields: {range_coord.model_dump().keys()}")
print(f"Array coord fields: {array_coord.model_dump().keys()}")
print(f"Are they equal? {range_coord == array_coord}")
print(f"Range coord dump: {range_coord.model_dump()}")
print(f"Array coord dump: {array_coord.model_dump()}")
EOF

Repository: DASDAE/dascore

Length of output: 5129


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Find get_coord function to see if it normalizes types ==="
rg -n -A50 'def get_coord' dascore/core/coords.py | head -80

echo
echo "=== Check what line 1370 actually is - full CoordRange.values property ==="
sed -n '1370,1400p' dascore/core/coords.py

echo
echo "=== Look for any __eq__ override in CoordRange or subclasses ==="
sed -n '1164,1420p' dascore/core/coords.py | grep -n '__eq__\|def values'

Repository: DASDAE/dascore

Length of output: 3370


Fix hash contract violation and document cross-family equality mismatch.

The documentation at line 1894 claims CoordRange and CoordArray instances should compare equal when they represent the same data, but sensible_model_equals will return False because they have different field sets—CoordRange has (start, stop, step) while CoordArray has (values). Additionally, the __hash__ implementation includes coord.__class__ in each payload and uses separate field combinations for CoordRange vs CoordArray, which would violate Python's hash contract if the equality claim were true (equal objects must have equal hash values). Either implement the cross-family equality properly by normalizing coords to a canonical representation before comparison and hashing, or update the documentation to reflect that different coord families are not equal.
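One way to satisfy the contract this comment describes is to canonicalize before hashing and comparing: materialize both range-backed and array-backed coords to their values and token those, instead of including the class and per-family fields in the payload. A minimal illustration (not dascore code):

```python
import numpy as np


def canonical_token(coord_like) -> bytes:
    """Token a coord by its materialized values rather than by class or
    field layout, so a range-style and an array-style coord describing
    the same samples get the same token."""
    values = np.ascontiguousarray(coord_like)
    return str(values.dtype).encode() + values.tobytes()


# A "range" coord and an explicit-array coord with the same samples:
range_values = np.arange(1.0, 12.0, 1.0)
array_values = np.array([float(i) for i in range(1, 12)])
```

The alternative resolution the comment offers — documenting that different coord families never compare equal — avoids this cost but requires fixing the docstring's cross-family equality claim instead.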

Comment on lines +138 to +143
if protocol in _HTTP_PROTOCOLS:
headers = dict(getattr(resource, "storage_options", {}) or {})
request = Request(str(resource), headers=headers)
timeout = get_config().remote_download_timeout
with (
urlopen(request, timeout=timeout) as remote_fi,
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify how HTTP headers are represented in UPath usage across this repo.
rg -nP 'UPath\([^)]*headers\s*=' --type=py -C2
rg -nP 'storage_options.*headers|["'\'']headers["'\'']' --type=py -C2
rg -nP '\bRequest\(|\burlopen\(' dascore/utils/remote_io.py -C2

Repository: DASDAE/dascore

Length of output: 1334


🏁 Script executed:

# 1. Examine the test case to understand how storage_options are passed
sed -n '670,700p' tests/test_utils/test_io_utils.py

# 2. Find _HTTP_PROTOCOLS definition
rg -n '_HTTP_PROTOCOLS' dascore/utils/remote_io.py -B2 -A2

# 3. Look at the full context of _download_remote_file function
sed -n '120,160p' dascore/utils/remote_io.py

# 4. Check for any other storage_options usage in HTTP context
rg -n 'storage_options' dascore/utils/remote_io.py -B2 -A2

Repository: DASDAE/dascore

Length of output: 3962


🏁 Script executed:

# Examine the _HTTPResource test class definition
sed -n '660,680p' tests/test_utils/test_io_utils.py

# Check for any other HTTP storage_options examples in tests
rg -n 'storage_options.*=' tests/test_utils/test_io_utils.py -B1 -A1 | grep -A5 -B5 'http'

# Check if there's any fsspec/upath dependency that defines headers structure
rg -n 'fsspec|upath' dascore/utils/remote_io.py

# Look for any imports or references to understand the HTTP backend
head -30 dascore/utils/remote_io.py

Repository: DASDAE/dascore

Length of output: 2178


🏁 Script executed:

# Check test setup for HTTP resource to see the actual storage_options structure
sed -n '645,650p' tests/test_utils/test_io_utils.py

# Look for all storage_options structures in tests
rg -n 'storage_options\s*=' tests/test_utils/test_io_utils.py -A1

# Check if there's any documentation about storage_options format in comments/docstrings
rg -n 'storage_options' dascore/utils/remote_io.py -B3 -A3 | head -40

# Check what UPath (coerce_to_upath) is and if it has documentation about storage_options
rg -n 'class UPath|def coerce_to_upath' dascore/compat.py -A5

Repository: DASDAE/dascore

Length of output: 1266


🏁 Script executed:

# Check fsspec and upath imports to understand storage_options structure
rg -n 'from dascore.compat import' dascore/utils/remote_io.py

# Look at the compat module to see what UPath is
cat dascore/compat.py | head -50

# Search for any storage_options nesting patterns in actual usage (not just tests)
rg -n 'storage_options' dascore/ --type=py -B2 -A2 | grep -i "headers" | head -20

# Check if there are any real-world (non-test) HTTP examples with storage_options
rg -rn '"headers".*storage_options|storage_options.*"headers"' dascore/ --type=py

Repository: DASDAE/dascore

Length of output: 1488


🏁 Script executed:

# Search for any usage of UPath with HTTP in the actual codebase (not just tests)
rg -n 'UPath.*http|http.*UPath' dascore/ --type=py

# Check if there are any real-world examples of storage_options being passed
rg -n '_download_remote_file' dascore/ --type=py -B2 -A2

# Look for any integration tests or examples that use HTTP downloading
find dascore/ -name '*.py' -type f -exec grep -l 'http.*download\|download.*http' {} \;

# Check if upath/fsspec HTTP backend documentation is referenced anywhere
rg -n 'fsspec|fsspec.implementations.http' dascore/ --type=py

Repository: DASDAE/dascore

Length of output: 695


🏁 Script executed:

# Check fsspec documentation or any comments about storage_options structure
# (delegated to a web search: fsspec HTTP storage_options structure headers)

# Also check if there's any version pinning or requirements file
cat dascore/pyproject.toml 2>/dev/null | grep -i 'upath\|fsspec\|urllib' || cat setup.py 2>/dev/null | grep -i 'upath\|fsspec'

# Check the actual test to see what storage_options are expected to work
sed -n '640,695p' tests/test_utils/test_io_utils.py

Repository: DASDAE/dascore

Length of output: 2115


🌐 Web query:

fsspec HTTP backend storage_options structure headers

💡 Result:

The fsspec HTTP backend (`HTTPFileSystem`) accepts `storage_options` as keyword arguments to its constructor; these are stored in `self.kwargs` and passed to aiohttp session requests (e.g., `session.get(url, **self.kwargs)`). Key options:

- `headers`: dict of HTTP headers applied to all requests (e.g., `{"Authorization": "Bearer token"}`). Headers can also be overridden per operation via kwargs in methods like `open`.
- `client_kwargs`: dict passed to the `aiohttp.ClientSession` constructor (e.g., for auth: `{"auth": aiohttp.BasicAuth("user", "pass")}`).
- Other common options: `block_size` (read block size, default 5 MB), `simple_links` (for `ls`), `cache_type`, `cache_options`.

Example usage: `fs = fsspec.filesystem("http", headers={"User-Agent": "my-app"}, client_kwargs={"timeout": aiohttp.ClientTimeout(total=30)})`, then `with fs.open("https://example.com/file.txt") as f: data = f.read()`. For high-level functions, use `fsspec.open("https://...", storage_options={"headers": {...}})`. The source code confirms that `storage_options` become `self.kwargs` for requests, explicit params like `client_kwargs` are handled separately, and headers in `storage_options` go to per-request kwargs.
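Given the behavior summarized above, both accepted `storage_options` shapes can be normalized to a plain header dict with a small helper. This is an illustrative sketch only; `normalize_headers` is a hypothetical name, not a dascore or fsspec API:

```python
def normalize_headers(storage_options):
    """Return an HTTP header dict from either storage_options shape.

    Accepts both the flat form ({"User-Agent": "..."}) and the nested
    form ({"headers": {"User-Agent": "..."}}). If a flat mapping happens
    to contain a "headers" key, the nested interpretation wins.
    """
    opts = dict(storage_options or {})
    raw = opts.get("headers", opts)
    return {str(k): str(v) for k, v in (raw or {}).items()}


flat = {"User-Agent": "dascore"}
nested = {"headers": {"User-Agent": "dascore"}}
print(normalize_headers(flat) == normalize_headers(nested))  # -> True
```

Both shapes collapse to the same header mapping, so downstream code can build a `Request` without caring which convention the caller used.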


🏁 Script executed:

```shell
# Verify if the protocol field from resource always matches the actual URL scheme
rg -n 'protocol.*=' tests/test_utils/test_io_utils.py | grep -i http

# Check if there are any cases where protocol and URL scheme could mismatch
rg -n 'coerce_to_upath' tests/test_utils/test_io_utils.py -B2 -A2 | head -30

# Check if there are any edge cases in URL construction
sed -n '125,150p' dascore/utils/remote_io.py

# Verify RemoteCacheError exists
rg -n 'class RemoteCacheError' dascore/
```

Repository: DASDAE/dascore

Length of output: 2146


🏁 Script executed:

```shell
# Final verification: check if there are any real-world calls to _download_remote_file with nested headers
rg -n '_download_remote_file\|materialize_remote_file' dascore/ --type=py -B3 -A3 | head -50
```

Repository: DASDAE/dascore

Length of output: 40


Add explicit URL scheme validation and support nested header extraction for fsspec compatibility.

Line 139 treats `storage_options` directly as HTTP headers. While the current test passes with flat headers (e.g., `{"User-Agent": "..."}`), fsspec's HTTP backend also supports headers nested under a `"headers"` key for consistency with other storage backends. Additionally, line 143 should validate the actual URL scheme before passing it to `urlopen` (S310 defense in depth), rather than relying solely on the resource's declared protocol.

Proposed fix

```diff
+from urllib.parse import urlsplit
 from urllib.request import Request, urlopen
@@
         if protocol in _HTTP_PROTOCOLS:
-            headers = dict(getattr(resource, "storage_options", {}) or {})
-            request = Request(str(resource), headers=headers)
+            storage_options = dict(getattr(resource, "storage_options", {}) or {})
+            raw_headers = storage_options.get("headers", storage_options)
+            headers = {str(k): str(v) for k, v in (raw_headers or {}).items()}
+            url = str(resource)
+            scheme = urlsplit(url).scheme.lower()
+            if scheme not in _HTTP_PROTOCOLS:
+                msg = f"Refusing non-HTTP URL for remote download: {url!r}"
+                raise RemoteCacheError(msg)
+            request = Request(url, headers=headers)
             timeout = get_config().remote_download_timeout
             with (
-                urlopen(request, timeout=timeout) as remote_fi,
+                urlopen(request, timeout=timeout) as remote_fi,  # noqa: S310
                 tmp_path.open("wb") as local_fi,
             ):
```
🧰 Tools
🪛 Ruff (0.15.9)

[error] 140-140: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


[error] 143-143: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dascore/utils/remote_io.py` around lines 138 - 143, The code treats
storage_options as flat HTTP headers and trusts the resource protocol; update
the HTTP branch handling around _HTTP_PROTOCOLS, Request, and urlopen to (1)
extract headers compatibly from getattr(resource, "storage_options", {}) by
checking for a nested "headers" key first and falling back to the top-level
mapping (ensuring a dict of header strings), and (2) explicitly parse and
validate the URL scheme of str(resource) (e.g., via urllib.parse.urlparse) to
allow only "http" or "https" before calling urlopen (use
get_config().remote_download_timeout as the timeout). Ensure Request receives
the validated URL and the constructed headers.

@d-chambers d-chambers removed the ready_for_review PR is ready for review label Apr 4, 2026

Labels

bug Something isn't working IO Work for reading/writing different formats patch related to Patch class proc Related to processing module
