Analysis stack
Pipeline diagram (large size): http://dods.ipsl.jussieu.fr/glipsl/SPPP.png
This pipeline is based on an SQLite database that allows parallel, orderly and asynchronous post-processing of CMIP5 files stored in /prodigfs/esg/CMIP5/. The workflow is divided into two related pipelines: the variable pipeline and the dataset pipeline. A CMIP5 dataset version groups several variables, so a dataset cannot be processed until all of its variables have reached the end of the variable pipeline without errors.
Each entry in the SQLite database describes and tracks the post-processing progress of a variable or a dataset using several fields: an ID key, the transition (step) name, its state (S0XXX for the variable pipeline, S1XXX for the dataset pipeline), the status (waiting, running, error or done), the corresponding pipeline (CMIP5_001 for the variable pipeline, CMIP5_002 for the dataset pipeline) and its creation date.
Then, a Python daemon called a worker, supported by an API, asks the database for waiting jobs regardless of the step name. Each post-processing step is carried out by a Python or shell script. Depending on the job, the worker runs the corresponding script. Finally, the worker updates the entry status in the database to move on to the next transition.
Consequently, a dataset is processed when all of its variables reach the done status. No more jobs appear when all variables and datasets reach the done status.
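The worker loop can be pictured with the following minimal sketch; the table and column names (jobs, id, status, script) are hypothetical, since the actual schema is internal to the pipeline:

```python
import sqlite3
import subprocess
import time

# Minimal sketch of the worker polling loop. The schema (table 'jobs',
# columns 'id', 'status', 'script') is a hypothetical illustration.
def worker_loop(db_path):
    db = sqlite3.connect(db_path)
    while True:
        row = db.execute(
            "SELECT id, script FROM jobs WHERE status = 'waiting' LIMIT 1"
        ).fetchone()
        if row is None:
            time.sleep(60)  # no waiting job: poll again later
            continue
        job_id, script = row
        db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
        db.commit()
        # Each post-processing step is carried out by a Python or shell script.
        result = subprocess.run([script], capture_output=True)
        new_status = 'done' if result.returncode == 0 else 'error'
        db.execute("UPDATE jobs SET status = ? WHERE id = ?", (new_status, job_id))
        db.commit()
```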
synda is a Python program that manages the discovery, authentication, certificate and download processes from the ESGF archives (here, the CMIP5 archive) in an easy way. Files are downloaded by exploring the ESGF data repositories according to configuration files in which the user defines search criteria. These criteria pertain to the metadata attributes used for the discovery of climate data, which are defined by the Data Reference Syntax (DRS). Thus, a user can enter lists of values for variables, frequencies, experiments and ensemble numbers into a configuration file, as directed by the templates already provided. Using these templates, synda explores the ESGF nodes and downloads all the corresponding available files. The program may be run regularly to download possible new files.
This step of the pipeline refers entirely to the download or transfer of CMIP5 files through synda into /prodigfs/esg/CMIP5/output[12]/. This step is called "synda transfert". Each downloaded variable leads to a new event committed to the pipeline database. This event initializes the variable pipeline with a waiting status for the first transition. If a dataset is discovered as a latest version, a dataset entry is created in the database. Consequently, only the latest datasets are processed by the pipeline.
More about how to submit a synda template: http://forge.ipsl.jussieu.fr/prodiguer/wiki/docs/synda.
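As an illustration, a selection template lists DRS facets and the values to match. The facet names below follow standard CMIP5 DRS vocabulary, but the exact values are only an example, not one of the provided templates:

```
project=CMIP5
model=IPSL-CM5A-LR
experiment=historical rcp85
ensemble=r1i1p1
time_frequency=mon
variable=tas pr
```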
This step only cleans the corresponding directory of the CMIP5 variable in /prodigfs/esg/CMIP5/process/.
The CMIP5 product attribute is divided into two directories: output1 and output2. The partitioning depends on version, period, variable and frequency. For a clearer experience and to facilitate archive management, we decided to merge both outputs using hard links. Because of a significant number of exceptions, we decided to take output1 as the reference when both outputs contain the same filename.
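A minimal sketch of such a merge follows; the walk over output1/output2 and the destination layout are simplified for illustration:

```python
import os

# Merge output1 and output2 into a single tree using hard links.
# output1 takes precedence when both products hold the same filename.
def merge_products(cmip5_root, merge_root):
    for product in ('output1', 'output2'):  # output1 first: it wins on conflicts
        src_root = os.path.join(cmip5_root, product)
        for dirpath, _, filenames in os.walk(src_root):
            rel = os.path.relpath(dirpath, src_root)
            dst_dir = os.path.join(merge_root, rel)
            os.makedirs(dst_dir, exist_ok=True)
            for name in filenames:
                dst = os.path.join(dst_dir, name)
                if not os.path.exists(dst):  # keep the output1 link if already there
                    os.link(os.path.join(dirpath, name), dst)
```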
This step highlights chunked files producing overlaps in a time series. The period covered by a CMIP5 variable is split into several files depending on the model, the frequency, the period, etc. Sometimes an overlap occurs, leading to deprecated file(s).
We decided to parse the period part of each filename to build a graph of nodes and find the shortest path between the start and end dates. If a shortest path is found and overlapping files exist, we remove the corresponding hard links in /prodigfs/esg/CMIP5/process/. When no shortest path can be found (i.e., a gap appears in the time series), nothing happens.
WARNING: we always refer to filenames and assume that they are correct.
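The idea can be sketched as follows: each file becomes an edge from its start date to the day after its end date, so contiguous chunks chain up, and a breadth-first search keeps the fewest files covering the full period. The YYYYMMDD period format and a standard calendar are simplifying assumptions (real CMIP5 calendars such as 360_day need calendar-aware arithmetic):

```python
import collections
import re
from datetime import datetime, timedelta

# Assumed filename period format: ..._YYYYMMDD-YYYYMMDD.nc
PERIOD = re.compile(r'_(\d{8})-(\d{8})\.nc$')

def day_after(yyyymmdd):
    day = datetime.strptime(yyyymmdd, '%Y%m%d') + timedelta(days=1)
    return day.strftime('%Y%m%d')

def deprecated_files(filenames):
    # Nodes are dates; each file is an edge from its start date to
    # the day after its end date, so contiguous chunks connect.
    edges = collections.defaultdict(list)
    starts, ends = [], []
    for name in filenames:
        start, end = PERIOD.search(name).groups()
        edges[start].append((day_after(end), name))
        starts.append(start)
        ends.append(end)
    target = day_after(max(ends))
    # BFS finds the covering path with the fewest files; the files
    # left off that path are the overlapping, deprecated ones.
    queue = collections.deque([(min(starts), [])])
    seen = {min(starts)}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return [f for f in filenames if f not in set(path)]
        for nxt, name in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return []  # gap in the time series: no shortest path, nothing happens
```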
The time axis is often wrong in CMIP5 files, which leads to flawed studies or unused data: such files cannot be used or, even worse, produce erroneous results because of problems in the time axis description. We developed a Python program to check and, if necessary, rebuild a CMIP5-compliant time axis.
This step of the pipeline:
- always checks time axis squareness depending on calendar, frequency, realm and time units,
- checks consistency between last theoretical date and end date in filename,
- corrects all mistaken timesteps,
- deletes time boundaries if necessary,
- changes time units according to CMIP5 requirements if necessary,
- saves a diagnostic.
WARNING: this script relies on (i) uncorrupted period dates in filenames and (ii) properly defined time units, time calendar and frequency NetCDF attributes.
More details, and how to use our script as a stand-alone command line: https://github.com/Prodiguer/cmip5-time-axis
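The core check can be sketched as below for constant-step frequencies. The step table is a deliberate simplification (monthly and yearly axes need calendar-aware steps, as in the real script), and the assumption of 'days since ...' time units is hypothetical:

```python
import netCDF4
import numpy

# Simplified steps in days; 'mon' and 'yr' would need calendar-aware
# arithmetic, which the real script handles.
STEP_DAYS = {'3hr': 0.125, '6hr': 0.25, 'day': 1.0}

def rebuild_time_axis(path, frequency):
    nc = netCDF4.Dataset(path, 'r+')
    time = nc.variables['time']
    # Rebuild the theoretical axis from the first timestep onwards,
    # assuming time units of 'days since ...'.
    step = STEP_DAYS[frequency]
    theory = time[0] + step * numpy.arange(len(time))
    if not numpy.allclose(time[:], theory):
        time[:] = theory  # correct all mistaken timesteps
    nc.close()
```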
A CMIP5 dataset is partitioned as a collection of files. Using the cdscan command line from the CDAT Python library, the CMIP5 files are virtually concatenated along the time dimension through XML files. These aggregations avoid dealing with the file splitting, which depends on the model and frequency over a time period. The XML aggregations contain all metadata, together with information describing how the dataset is partitioned into files.
Each variable in the pipeline is scanned with the cdscan -x command line. The produced XML follows the CMIP5 Data Reference Syntax and is stored in:
/prodigfs/esg/xml/CMIP5/<experiment>/<realm>/<frequency>/<variable>/.
as:
cmip5.<model>.<experiment>.<ensemble>.<frequency>.<realm>.<MIP_table>.<variable>.<version>.xml
Consequently, each variable directory groups the XML aggregations of all ensembles of all models.
These aggregations can be used with the cdms Python module from CDAT using a simple cdms.open('aggregation.xml').
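For instance, reading a decade of a variable across file boundaries looks like the following; the filename is hypothetical, and recent CDAT releases ship the module as cdms2 (older ones as cdms):

```python
# Open a CMIP5 XML aggregation and read across the underlying file
# boundaries transparently. The filename below is hypothetical.
import cdms2

f = cdms2.open('cmip5.IPSL-CM5A-LR.historical.r1i1p1.mon.atmos.Amon.tas.v20110406.xml')
tas = f('tas', time=('1960-1-1', '1969-12-31'))  # spans several underlying files
print(tas.shape)
f.close()
```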
This step only cleans the corresponding directory of the CMIP5 variable in /prodigfs/esg/CMIP5/merge/.
This step only creates a hard link to the processed variable in /prodigfs/esg/CMIP5/merge/. The status of the corresponding entry in the pipeline database is then set to done.
Here starts the dataset pipeline, only when all CMIP5 variables of the corresponding dataset have been processed through the variable pipeline (i.e., reached a done status in the database).
This step creates a symlink pointing to the latest version of the dataset. It unlinks the previous latest symlink if it exists.
We decided to build the pipeline with a "full-slave" behavior with respect to "synda transfert": whatever the dataset, a latest symlink is created regardless of its version. Consequently, only datasets flagged as "latest" are processed by the pipeline, following the creation date in the case of version updates.
For a clearer experience, we decided to create latest symlinks for the cdscan aggregations as well.
In /prodigfs/esg/xml/CMIP5/<experiment>/<realm>/<frequency>/<variable>/., this step creates cmip5.<model>.<experiment>.<ensemble>.<frequency>.<realm>.<MIP_table>.<variable>.latest.xml pointing to the latest version of the corresponding XML files.
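Both latest symlinks (dataset directory and XML aggregation) boil down to the same pattern, sketched below with hypothetical paths:

```python
import os

# Point a 'latest' symlink at the newest version, unlinking the
# previous one first. Paths and names are hypothetical illustrations.
def update_latest(parent_dir, version):
    latest = os.path.join(parent_dir, 'latest')
    if os.path.islink(latest):
        os.unlink(latest)  # drop the previous 'latest' if it exists
    os.symlink(version, latest)  # relative link to e.g. 'v20110406'

# e.g. update_latest('/prodigfs/esg/CMIP5/merge/.../tas', 'v20110406')
```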
This step is executed on another VM hosting the IPSL private node in order to publish all files from synda requests.
Once a dataset is completely processed through the pipeline, we start the publication. The publication process of ESGF nodes requires mapfiles. Mapfiles are text files in which each line describes a file to publish. A line is composed of the full file path, the file size, the last modification time in Unix units, the checksum and the checksum type, all pipe-separated. esg_mapfiles.py is a flexible alternative Python command-line tool allowing you to easily generate mapfiles independently of ESGF.
Using the esg-user login, we put all generated mapfiles in /home/esg-user/mapfiles/pending/, awaiting publication. We generate one mapfile per dataset for clear publication management.
More details, and how to use our script as a stand-alone command line: https://github.com/Prodiguer/esgf-mapfiles
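Following the line format described above, generating one entry could look like this; the exact field layout produced by esg_mapfiles.py may differ, so treat this as an illustration:

```python
import hashlib
import os

# Build one mapfile line as described above: full path, size,
# modification time (Unix units), checksum and checksum type,
# pipe-separated. The real esg_mapfiles.py layout may differ.
def mapfile_line(path):
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    stat = os.stat(path)
    return '|'.join([path, str(stat.st_size), str(int(stat.st_mtime)),
                     md5.hexdigest(), 'MD5'])
```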
This step publishes the datasets from synda on our IPSL private node using the previously generated mapfiles: http://esgf-local.ipsl.fr/esgf-web-fe/.
The publication occurs each day at midnight using the esg-user crontab. The following steps are started in order:
- Mapfiles are compared between /home/esg-user/mapfiles/pending/ and /home/esg-user/mapfiles/published/ (see the sketch after this list). If a mapfile does not exist in the published path or has a different checksum, it is selected for publication. All selected mapfiles are concatenated, within a limit of 30,000 files to publish.
- Vocabulary is added to esg-auto.ini and esgcet_models_table.txt if necessary.
- Controlled vocabulary and tables are initialized with esginitialize.
- Datanode publication with esgpublish --thredds --new-version 1 --replace --no-thredds-reinit: the version is always set to 1, replacing the dataset by its latest version (synda only processes the latest version).
- Indexnode publication with esgpublish --thredds-reinit --publish --noscan: if an error occurs at this level, we unpublish the dataset to avoid conflicts.
- All selected mapfiles are copied from /home/esg-user/mapfiles/pending/ to /home/esg-user/mapfiles/published/.
- If a mapfile does not exist in the pending path, it is selected for the unpublication process.
- Unpublication with esgunpublish --database-delete --no-republish.
- All selected mapfiles are removed from /home/esg-user/mapfiles/published/.
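The pending/published comparison of the first step could be sketched as follows; selection for unpublication mirrors it with the two directories swapped:

```python
import hashlib
import os

def checksum(path):
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Select pending mapfiles that are new or whose content changed
# compared to the published copy.
def select_for_publication(pending_dir, published_dir):
    selected = []
    for name in os.listdir(pending_dir):
        pending = os.path.join(pending_dir, name)
        published = os.path.join(published_dir, name)
        if not os.path.exists(published) or checksum(pending) != checksum(published):
            selected.append(pending)
    return selected
```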
