Make XML metadata optional, extract from HDF if XML not found#97
Conversation
|
@karthick-rn don't worry about the CI errors yet, this package has bit-rotted pretty hard, I'll clean it up. |
gadomski
left a comment
There was a problem hiding this comment.
Can you add a simple test function that exercises this functionality against https://github.com/stactools-packages/modis/blob/main/tests/data-files/MOD10A2.A2022033.h09v05.061.2022042050729.hdf? You can copy it to a temp directory to ensure there's no alongside XML asset.
Also needs a changelog entry.
@gadomski, Added the test function. Note: The MOD10A2 (snow cover) product doesn't have the QAPERCENTNOTPRODUCEDCLOUD tag in its HDF metadata, so I also made that field optional in from_cog_tags() to handle this case gracefully.
done |
This is a bit of a flag — do we know if there are other tags that will be missing for other products? We should probably test against every product type, even if they aren't automated (e.g. the "external data" tests). |
@gadomski, I manually verified all 53 XML test files (which contain the same metadata as HDF files):
The only missing tag is QAPERCENTNOTPRODUCEDCLOUD, which is already made optional. No other tags are missing. |
gadomski
left a comment
There was a problem hiding this comment.
Getting closer, a couple of small tweaks to the test setup and one question about a possibly-optional attribute.
tests/test_stac.py
Outdated
| item.validate() | ||
|
|
||
|
|
||
| def test_create_item_from_hdf_without_xml() -> None: |
There was a problem hiding this comment.
pytest provides a fixture for a temporary directory, let's use that instead.
| def test_create_item_from_hdf_without_xml() -> None: | |
| def test_create_item_from_hdf_without_xml(tmp_path: Path) -> None: |
tests/test_stac.py
Outdated
| """Test that an item can be created from an HDF file when XML is not available. | ||
|
|
||
| This tests the fallback to extracting metadata directly from the HDF file | ||
| when the accompanying XML metadata file is not present. | ||
| """ |
There was a problem hiding this comment.
I'm generally 👎🏼 on docstrings for tests, as I prefer to let the test function name and content speak for themselves.
| """Test that an item can be created from an HDF file when XML is not available. | |
| This tests the fallback to extracting metadata directly from the HDF file | |
| when the accompanying XML metadata file is not present. | |
| """ |
There was a problem hiding this comment.
Yes, they're auto-generated comments and I have removed them.
tests/test_stac.py
Outdated
| # Copy only the HDF file (not the XML) to ensure XML is not available | ||
| temp_hdf_path = os.path.join(temporary_directory, hdf_file) | ||
| shutil.copyfile(source_hdf_path, temp_hdf_path) | ||
|
|
||
| # Verify XML does not exist in temp directory | ||
| temp_xml_path = f"{temp_hdf_path}.xml" | ||
| assert not os.path.exists(temp_xml_path), "XML file should not exist" | ||
|
|
||
| # Create item from HDF only - should extract metadata from HDF | ||
| item = stactools.modis.stac.create_item(temp_hdf_path) | ||
|
|
||
| # Verify item was created with correct metadata |
There was a problem hiding this comment.
These look like auto-gen comments, again I generally prefer to let the code speak for itself.
| # Copy only the HDF file (not the XML) to ensure XML is not available | |
| temp_hdf_path = os.path.join(temporary_directory, hdf_file) | |
| shutil.copyfile(source_hdf_path, temp_hdf_path) | |
| # Verify XML does not exist in temp directory | |
| temp_xml_path = f"{temp_hdf_path}.xml" | |
| assert not os.path.exists(temp_xml_path), "XML file should not exist" | |
| # Create item from HDF only - should extract metadata from HDF | |
| item = stactools.modis.stac.create_item(temp_hdf_path) | |
| # Verify item was created with correct metadata | |
| temp_hdf_path = os.path.join(temporary_directory, hdf_file) | |
| shutil.copyfile(source_hdf_path, temp_hdf_path) | |
| temp_xml_path = f"{temp_hdf_path}.xml" | |
| assert not os.path.exists(temp_xml_path), "XML file should not exist" | |
| item = stactools.modis.stac.create_item(temp_hdf_path) |
tests/test_stac.py
Outdated
| assert item is not None | ||
| assert item.id.startswith("MOD10A2.A2022033.h09v05") | ||
| assert "hdf" in item.assets | ||
| assert "metadata" not in item.assets # XML asset should not be present |
There was a problem hiding this comment.
| assert "metadata" not in item.assets # XML asset should not be present | |
| assert "metadata" not in item.assets |
src/stactools/modis/metadata.py
Outdated
| with warnings.catch_warnings(): | ||
| warnings.simplefilter("ignore", category=NotGeoreferencedWarning) | ||
| with rasterio.open(read_href) as dataset: | ||
| hdf_tags = dataset.tags() |
There was a problem hiding this comment.
Is there a reason to catch this warning? Feels like we do want to propagate this up to the user.
There was a problem hiding this comment.
Removed the suppression. It is now visible in the output.
- Revert TileID to required (present in all products) - Remove NotGeoreferencedWarning suppression - Use pytest tmp_path fixture in test - Clean up test comments
@gadomski, Thanks for reviewing. I have updated the code based on your comments and also tested it, all looks good. When you get a moment, could you review and approve it? |
Description: There is no xml file along with .hdf file for modis dataset anymore. Updated the code accordingly, also for STAC item creation it expects the metadata which is now extracted from the .hdf file.
These changes have been tested by rebuilding the image with the new commit hash from the fork and it ingested the STAC item successfully. This is the link to the ingested item into Open PC - https://planetarycomputer.microsoft.com/explore?c=-134.4470%2C55.3130&z=4.32&v=2&d=modis-43A4-061&m=cql%3A5f2d6e100cd82009329bc4df3e51de56&r=Natural+color&s=false%3A%3A100%3A%3Atrue&sr=desc&ae=0
PR checklist:
scripts/format).scripts/lint).scripts/test).