arjj8/mlcast-dataset-msgcpp

msgcpp-to-zarr

This repository contains code to convert NetCDF files for the MSGCPP dataset to Zarr format. The processing proceeds in the following steps:

  1. Read the source NetCDF files and create a single-file Zarr JSON descriptor for each file. This uses kerchunk.hdf.SingleHdf5ToZarr.
  2. Combine the single-file Zarr JSON descriptors into a multi-file Zarr JSON descriptor. This uses kerchunk.combine.MultiZarrToZarr.
  3. Convert the multi-file Zarr JSON descriptor to a multi-file Zarr dataset and write this to disk. This uses rechunker.rechunk.
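
The three steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the function names (`single_file_refs`, `combine_refs`, `per_variable_chunks`, `rechunk_to_zarr`) are hypothetical, `time` is assumed to be the concatenation dimension with `lat` and `lon` identical across files, and the `reference://` way of opening the combined descriptor with xarray may need adjusting for the installed fsspec, zarr and xarray versions.

```python
def single_file_refs(nc_path):
    """Step 1: build a kerchunk reference dict for one NetCDF4/HDF5 file."""
    # Third-party imports are kept local so the pure helper below can be
    # used without kerchunk installed.
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    with fsspec.open(nc_path, "rb") as f:
        return SingleHdf5ToZarr(f, nc_path).translate()


def combine_refs(refs, concat_dim="time", identical_dims=("lat", "lon")):
    """Step 2: merge per-file references into one multi-file reference dict."""
    from kerchunk.combine import MultiZarrToZarr

    # Assumption: files are stacked along `time`; lat/lon do not vary per file.
    mzz = MultiZarrToZarr(
        list(refs),
        concat_dims=[concat_dim],
        identical_dims=list(identical_dims),
    )
    return mzz.translate()


def per_variable_chunks(var_dims, dim_sizes, dim_chunks):
    """Expand a per-dimension spec such as {'time': 48} into the per-variable
    mapping that rechunker accepts for xarray datasets. Dimensions that are
    not mentioned keep their full length (a single chunk along that axis)."""
    return {
        name: {d: min(dim_chunks.get(d, dim_sizes[d]), dim_sizes[d]) for d in dims}
        for name, dims in var_dims.items()
    }


def rechunk_to_zarr(combined_refs, target_store, temp_store, dim_chunks, max_mem="1GB"):
    """Step 3: open the combined references lazily and write a rechunked store."""
    import xarray as xr
    from rechunker import rechunk

    ds = xr.open_dataset(
        "reference://",
        engine="zarr",
        chunks={},
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": combined_refs},
        },
    )
    var_dims = {name: var.dims for name, var in ds.variables.items()}
    target_chunks = per_variable_chunks(var_dims, dict(ds.sizes), dim_chunks)
    plan = rechunk(ds, target_chunks, max_mem, target_store, temp_store=temp_store)
    plan.execute()
```

With a list of NetCDF paths `paths`, the sketch would be driven as `rechunk_to_zarr(combine_refs(single_file_refs(p) for p in paths), "out.zarr", "tmp.zarr", {"time": 48})`, where the store names are placeholders.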

Usage

Install dependencies with pdm:

pdm install

and run with:

$ pdm run python -m msgcpp_to_zarr.cli --help
usage: cli.py [-h] [--fp_source_data FP_SOURCE_DATA] [--data_product DATA_PRODUCT] [--fp_dest_root FP_DEST_ROOT] [--t_min T_MIN] [--t_max T_MAX] [--skip-rechunking]

options:
  -h, --help            show this help message and exit
  --fp_source_data FP_SOURCE_DATA
                        Path to source data (default: /dmidata/projects/weather2x/SolarNowcasting/data/MSGCPP)
  --data_product DATA_PRODUCT
                        Data product to convert (default: SIS)
  --fp_dest_root FP_DEST_ROOT
                        Path to root of destination data (default: /dmidata/scratch/10day/maf/msgcpp_zarr)
  --t_min T_MIN         Minimum time to include (default: 2021-01-01 00:00:00+00:00)
  --t_max T_MAX         Maximum time to include (default: 2021-01-02 00:00:00+00:00)
  --skip-rechunking     Whether to skip rechunk and write to zarr (default: False)
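
For readers who want to see how options like these map onto Python values, the following `argparse` sketch reproduces the flags and defaults listed in the help text above. It is an illustration only and may differ from the actual `msgcpp_to_zarr.cli` module: the function name `build_parser` is hypothetical, and parsing the times with `datetime.fromisoformat` is an assumption.

```python
import argparse
from datetime import datetime


def build_parser():
    """Hypothetical parser mirroring the options shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="cli.py",
        # Appends "(default: ...)" to each help string, as in the output above.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--fp_source_data",
        default="/dmidata/projects/weather2x/SolarNowcasting/data/MSGCPP",
        help="Path to source data",
    )
    parser.add_argument(
        "--data_product", default="SIS", help="Data product to convert"
    )
    parser.add_argument(
        "--fp_dest_root",
        default="/dmidata/scratch/10day/maf/msgcpp_zarr",
        help="Path to root of destination data",
    )
    # Assumption: times are given as ISO 8601 strings with a UTC offset.
    parser.add_argument(
        "--t_min",
        type=datetime.fromisoformat,
        default=datetime.fromisoformat("2021-01-01T00:00:00+00:00"),
        help="Minimum time to include",
    )
    parser.add_argument(
        "--t_max",
        type=datetime.fromisoformat,
        default=datetime.fromisoformat("2021-01-02T00:00:00+00:00"),
        help="Maximum time to include",
    )
    parser.add_argument(
        "--skip-rechunking",
        action="store_true",
        help="Whether to skip rechunk and write to zarr",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Passing `--skip-rechunking` sets `args.skip_rechunking` to `True`, which corresponds to stopping after the JSON descriptors have been written.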

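For example, running with the default arguments produces output like the following. Note that the module names and destination paths in this log refer to `sarah3_to_zarr` rather than `msgcpp_to_zarr`: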

$ pdm run python -m msgcpp_to_zarr.cli
2024-07-22 09:22:24.307 | INFO     | sarah3_to_zarr.netcdf_reader:_create_singlefile_zarr_jsons:105 - Creating JSON file for each individual source NetCDF file
[########################################] | 100% Completed | 1.91 s
2024-07-22 09:22:26.429 | INFO     | sarah3_to_zarr.netcdf_reader:_multizarr_to_zarr:147 - Writing single-file json zarr descriptor to `/dmidata/scratch/10day/lcd/sarah3_zarr/SIS/20210101T000000+0000-20210102T000000+0000.json`
2024-07-22 09:22:26.454 | INFO     | __main__:<module>:66 - <xarray.Dataset> Size: 358MB
Dimensions:        (time: 48, lat: 856, lon: 2171, bnds: 2)
Coordinates:
  * lat            (lat) float32 3kB 22.23 22.27 22.33 ... 64.88 64.93 64.97
  * lon            (lon) float32 9kB -44.12 -44.08 -44.03 ... 64.28 64.32 64.38
  * time           (time) datetime64[ns] 384B 2021-01-01 ... 2021-01-01T23:30:00
Dimensions without coordinates: bnds
Data variables:
    SIS            (time, lat, lon) float32 357MB dask.array<chunksize=(1, 856, 2171), meta=np.ndarray>
    lat_bnds       (time, lat, bnds) float32 329kB dask.array<chunksize=(1, 856, 2), meta=np.ndarray>
    lon_bnds       (time, lon, bnds) float32 834kB dask.array<chunksize=(1, 2171, 2), meta=np.ndarray>
    record_status  (time) int8 48B dask.array<chunksize=(48,), meta=np.ndarray>
Attributes: (12/41)
    CDI:                        Climate Data Interface version 2.4.0 (https:/...
    CDO:                        Climate Data Operators version 2.4.0 (https:/...
    Conventions:                CF-1.7,ACDD-1.3
    creator_email:              contact.cmsaf@dwd.de
    creator_name:               DE/DWD
    creator_type:               institution
    ...                         ...
    time_coverage_duration:     P1D
    time_coverage_end:          2021-01-02T00:00:00
    time_coverage_resolution:   PT30M
    time_coverage_start:        2021-01-01T00:00:00
    title:                      CM SAF Surface Solar Radiation Climate Data R...
    variable_id:                SIS
2024-07-22 09:22:26.457 | INFO     | __main__:<module>:73 - Rechunking to {'time': 48} and writing to /dmidata/scratch/10day/lcd/sarah3_zarr/SIS_20210101T000000+0000_20210102T000000+0000.zarr
2024-07-22 09:22:26.457 | INFO     | sarah3_to_zarr.zarr_store:rechunk_and_write_ds_to_zarr:18 - output dir already exists, removing /dmidata/scratch/10day/lcd/sarah3_zarr/SIS_20210101T000000+0000_20210102T000000+0000.zarr
2024-07-22 09:22:26.642 | INFO     | sarah3_to_zarr.zarr_store:rechunk_and_write_ds_to_zarr:52 - writing to /dmidata/scratch/10day/lcd/sarah3_zarr/SIS_20210101T000000+0000_20210102T000000+0000.zarr
[########################################] | 100% Completed | 1.90 s
2024-07-22 09:22:28.584 | INFO     | sarah3_to_zarr.zarr_store:rechunk_and_write_ds_to_zarr:63 - done!
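
The effect of rechunking to `{'time': 48}` on chunk size can be checked with a little arithmetic on the shapes printed in the log above. The sketch below assumes that `lat` and `lon` stay unchunked (the log only mentions the `time` dimension) and that sizes are uncompressed; the helper name `chunk_nbytes` is hypothetical.

```python
from math import prod


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size in bytes of one chunk with the given shape."""
    return prod(chunk_shape) * itemsize


FLOAT32_BYTES = 4

# Source layout from the log: one time step per chunk, full lat/lon extent.
source_chunk = chunk_nbytes((1, 856, 2171), FLOAT32_BYTES)

# After rechunking to {'time': 48}, assuming lat/lon remain unchunked.
target_chunk = chunk_nbytes((48, 856, 2171), FLOAT32_BYTES)

print(f"source chunk: {source_chunk / 1e6:.1f} MB")  # about 7.4 MB
print(f"target chunk: {target_chunk / 1e6:.1f} MB")  # about 357 MB
```

Under these assumptions the whole day of `SIS` data ends up in a single chunk whose size matches the 357MB reported for that variable in the dataset summary.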
