This repository contains code to convert NetCDF files from the MSGCPP dataset to Zarr format. Processing proceeds in the following steps:
- Read the source NetCDF files and create a single-file Zarr JSON descriptor for each file. This uses kerchunk.hdf.SingleHdf5ToZarr.
- Combine the single-file Zarr JSON descriptors into a multi-file Zarr JSON descriptor. This uses kerchunk.combine.MultiZarrToZarr.
- Convert the multi-file Zarr JSON descriptor to a multi-file Zarr dataset and write this to disk. This uses rechunker.rechunk.
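The three steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation: the function names `per_variable_chunks` and `convert`, the `max_mem` value, and the choice of `time` as the concatenation dimension with `lat` and `lon` as identical dimensions are all assumptions. The third-party imports are deferred into `convert` so that the pure helper can be used without kerchunk or rechunker installed.

```python
def per_variable_chunks(var_dims, dim_sizes, dim_chunks):
    """Expand a per-dimension chunk spec (e.g. {"time": 48}) into a
    per-variable spec of the kind rechunker accepts for a dataset.

    Hypothetical helper: the repository may build this differently.
    """
    target = {}
    for name, dims in var_dims.items():
        if name in dim_sizes:
            # Coordinate variables (named after a dimension) are left as-is.
            target[name] = None
        else:
            # Dimensions without a requested chunk size use their full length.
            target[name] = {
                d: min(dim_chunks.get(d, dim_sizes[d]), dim_sizes[d]) for d in dims
            }
    return target


def convert(paths, dest_store, temp_store, dim_chunks, max_mem="1GB"):
    """Sketch of the NetCDF -> kerchunk JSON -> rechunked Zarr pipeline."""
    # Deferred imports: only needed when the conversion actually runs.
    import fsspec
    import xarray as xr
    from kerchunk.combine import MultiZarrToZarr
    from kerchunk.hdf import SingleHdf5ToZarr
    from rechunker import rechunk

    # Step 1: one reference dict (the "single-file JSON descriptor") per file.
    refs = []
    for path in paths:
        with fsspec.open(path, "rb") as f:
            refs.append(SingleHdf5ToZarr(f, path).translate())

    # Step 2: combine along time (assumed concatenation dimension).
    combined = MultiZarrToZarr(
        refs, concat_dims=["time"], identical_dims=["lat", "lon"]
    ).translate()

    # Open the combined references lazily as an xarray dataset.
    ds = xr.open_dataset(
        "reference://",
        engine="zarr",
        chunks={},
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": combined},
        },
    )

    # Step 3: rechunk and write the Zarr store to disk.
    var_dims = {name: var.dims for name, var in ds.variables.items()}
    target_chunks = per_variable_chunks(var_dims, dict(ds.sizes), dim_chunks)
    rechunk(ds, target_chunks, max_mem, dest_store, temp_store=temp_store).execute()
```

Calling `convert(paths, "out.zarr", "tmp.zarr", {"time": 48})` would, under these assumptions, write a store in which each data variable holds 48 time steps per chunk.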
Install dependencies with pdm:
$ pdm install
and run with:
$ pdm run python -m msgcpp_to_zarr.cli --help
usage: cli.py [-h] [--fp_source_data FP_SOURCE_DATA] [--data_product DATA_PRODUCT] [--fp_dest_root FP_DEST_ROOT] [--t_min T_MIN] [--t_max T_MAX] [--skip-rechunking]
options:
-h, --help show this help message and exit
--fp_source_data FP_SOURCE_DATA
Path to source data (default: /dmidata/projects/weather2x/SolarNowcasting/data/MSGCPP)
--data_product DATA_PRODUCT
Data product to convert (default: SIS)
--fp_dest_root FP_DEST_ROOT
Path to root of destination data (default: /dmidata/scratch/10day/maf/msgcpp_zarr)
--t_min T_MIN Minimum time to include (default: 2021-01-01 00:00:00+00:00)
--t_max T_MAX Maximum time to include (default: 2021-01-02 00:00:00+00:00)
--skip-rechunking Whether to skip rechunk and write to zarr (default: False)
For example:
$ pdm run python -m msgcpp_to_zarr.cli
2024-07-22 09:22:24.307 | INFO | sarah3_to_zarr.netcdf_reader:_create_singlefile_zarr_jsons:105 - Creating JSON file for each individual source NetCDF file
[########################################] | 100% Completed | 1.91 s
2024-07-22 09:22:26.429 | INFO | sarah3_to_zarr.netcdf_reader:_multizarr_to_zarr:147 - Writing single-file json zarr descriptor to `/dmidata/scratch/10day/lcd/sarah3_zarr/SIS/20210101T000000+0000-20210102T000000+0000.json`
2024-07-22 09:22:26.454 | INFO | __main__:<module>:66 - <xarray.Dataset> Size: 358MB
Dimensions: (time: 48, lat: 856, lon: 2171, bnds: 2)
Coordinates:
* lat (lat) float32 3kB 22.23 22.27 22.33 ... 64.88 64.93 64.97
* lon (lon) float32 9kB -44.12 -44.08 -44.03 ... 64.28 64.32 64.38
* time (time) datetime64[ns] 384B 2021-01-01 ... 2021-01-01T23:30:00
Dimensions without coordinates: bnds
Data variables:
SIS (time, lat, lon) float32 357MB dask.array<chunksize=(1, 856, 2171), meta=np.ndarray>
lat_bnds (time, lat, bnds) float32 329kB dask.array<chunksize=(1, 856, 2), meta=np.ndarray>
lon_bnds (time, lon, bnds) float32 834kB dask.array<chunksize=(1, 2171, 2), meta=np.ndarray>
record_status (time) int8 48B dask.array<chunksize=(48,), meta=np.ndarray>
Attributes: (12/41)
CDI: Climate Data Interface version 2.4.0 (https:/...
CDO: Climate Data Operators version 2.4.0 (https:/...
Conventions: CF-1.7,ACDD-1.3
creator_email: contact.cmsaf@dwd.de
creator_name: DE/DWD
creator_type: institution
... ...
time_coverage_duration: P1D
time_coverage_end: 2021-01-02T00:00:00
time_coverage_resolution: PT30M
time_coverage_start: 2021-01-01T00:00:00
title: CM SAF Surface Solar Radiation Climate Data R...
variable_id: SIS
2024-07-22 09:22:26.457 | INFO | __main__:<module>:73 - Rechunking to {'time': 48} and writing to /dmidata/scratch/10day/lcd/sarah3_zarr/SIS_20210101T000000+0000_20210102T000000+0000.zarr
2024-07-22 09:22:26.457 | INFO | sarah3_to_zarr.zarr_store:rechunk_and_write_ds_to_zarr:18 - output dir already exists, removing /dmidata/scratch/10day/lcd/sarah3_zarr/SIS_20210101T000000+0000_20210102T000000+0000.zarr
2024-07-22 09:22:26.642 | INFO | sarah3_to_zarr.zarr_store:rechunk_and_write_ds_to_zarr:52 - writing to /dmidata/scratch/10day/lcd/sarah3_zarr/SIS_20210101T000000+0000_20210102T000000+0000.zarr
[########################################] | 100% Completed | 1.90 s
2024-07-22 09:22:28.584 | INFO | sarah3_to_zarr.zarr_store:rechunk_and_write_ds_to_zarr:63 - done!
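The store name in the log output above appears to combine the data product with the start and end of the time range. A minimal sketch of how such a name could be built is shown below; the function name `zarr_store_path` and the format string are assumptions inferred from the logged path, not taken from the repository's code.

```python
from datetime import datetime
from pathlib import Path

# Assumed timestamp format, inferred from names such as
# SIS_20210101T000000+0000_20210102T000000+0000.zarr in the log output.
TIME_FORMAT = "%Y%m%dT%H%M%S%z"


def zarr_store_path(dest_root, data_product, t_min, t_max):
    """Build a destination path for the Zarr store (hypothetical helper).

    t_min and t_max must be timezone-aware so that %z renders an offset.
    """
    name = (
        f"{data_product}_"
        f"{t_min.strftime(TIME_FORMAT)}_{t_max.strftime(TIME_FORMAT)}.zarr"
    )
    return Path(dest_root) / name


# The CLI defaults shown in the help text parse directly with fromisoformat.
t_min = datetime.fromisoformat("2021-01-01 00:00:00+00:00")
t_max = datetime.fromisoformat("2021-01-02 00:00:00+00:00")
print(zarr_store_path("/tmp/msgcpp_zarr", "SIS", t_min, t_max))
```

The destination root `/tmp/msgcpp_zarr` is a placeholder for illustration only.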