This repository builds containers that each hold a file-based dataset.
Use the makefile `image` or `push` targets and provide the necessary arguments:

- `SOURCE_DATA_DIR` - path to a directory with a dataset; the directory name is assumed to be the dataset name
- `DATA_ARCHIVE` - path to an archived dataset, used to calculate the md5sum
- `DESCRIPTION` - short description of the dataset

Example:

    make \
    SOURCE_DATA_DIR=clinical-trials-data-800k \
    DATA_ARCHIVE=clinical-trials-data-800k.tar.xz \
    DESCRIPTION="Clinical trials studies and data objects" \
    all

The images are pushed to the data-container repository on hub.docker.com.
Each data container has metadata attached to it in the form of image labels and container environment variables. The attached metadata:

- `DATA_DIR` - location of the dataset in the container
- `NUMBER_OF_DIRECTORIES` - number of directories inside `DATA_DIR` (excluding `DATA_DIR` itself)
- `NUMBER_OF_FILES` - number of files inside `DATA_DIR`
- `DESCRIPTION` - short description of the dataset

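
The two counts in the list above could be derived from a dataset directory roughly as follows. This is a minimal sketch, not the repository's actual build logic: `dataset_counts` is a hypothetical helper, and it assumes a `find` that supports `-mindepth` (GNU and BSD both do) and file names without embedded newlines.

```shell
# Hypothetical helper (not part of this repository): print the two
# count values for a dataset directory in NAME=value form.
dataset_counts() {
    data_dir=$1
    # -mindepth 1 skips the top-level directory, so DATA_DIR itself is not counted.
    number_of_directories=$(find "$data_dir" -mindepth 1 -type d | wc -l | tr -d ' ')
    # Count regular files at any depth below the dataset directory.
    number_of_files=$(find "$data_dir" -type f | wc -l | tr -d ' ')
    echo "NUMBER_OF_DIRECTORIES=$number_of_directories"
    echo "NUMBER_OF_FILES=$number_of_files"
}
```

For example, `dataset_counts clinical-trials-data-800k` would print one line per value for that directory.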
Each dataset is published with two image tags, one marked latest and one identified by the md5sum of its archive, according to these templates:

- `onedata/data-container:<dataset name>-latest`
- `onedata/data-container:<dataset name>-<md5hash of dataset archive>`

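
To make the tag templates concrete, the sketch below prints both tags for a given dataset name and archive. `image_tags` is a hypothetical helper introduced here for illustration, and it assumes the coreutils `md5sum` command is available; the makefile may compute the hash differently.

```shell
# Hypothetical helper (not part of this repository): print the two
# image tags for a dataset name and its archive file.
image_tags() {
    dataset_name=$1
    archive=$2
    # Reading from stdin makes md5sum print "<hash>  -"; keep only the hash.
    hash=$(md5sum < "$archive" | cut -d ' ' -f 1)
    echo "onedata/data-container:${dataset_name}-latest"
    echo "onedata/data-container:${dataset_name}-${hash}"
}
```

A call such as `image_tags clinical-trials-data-800k clinical-trials-data-800k.tar.xz` prints the latest tag first, followed by the hash-based tag.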
The command used to prepare archives:

    tar -cf - clinical-trials-data-800k | xz -T 9 -9 -c - > clinical-trials-data-800k.tar.xz

| Description | Dataset Link | Docker Image |
|---|---|---|
| Full set of HDF5 files with telescope metadata | cta-hdf5-data-125k.tar.xz | onedata/data-container:cta-hdf5-data-125k-latest |
| A subset of HDF5 files with telescope metadata | cta-hdf5-data-30k.tar.xz | onedata/data-container:cta-hdf5-data-30k-latest |
| Clinical trials studies and data objects | clinical-trials-data-800k.tar.xz | onedata/data-container:clinical-trials-data-800k-latest |
| Covid-related studies and data objects | covid-data-10k.tar.xz | onedata/data-container:covid-data-10k-latest |
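
The metadata of a published image can be read back with the Docker CLI. The helpers below are a hedged sketch: `image_label` and `image_env` are hypothetical names, and the sketch assumes that Docker is installed, that the image has already been pulled, and that the label keys match the metadata names listed earlier (for example `NUMBER_OF_FILES`).

```shell
# Hypothetical helpers (not part of this repository).

# Print the value of a single image label.
image_label() {
    image=$1
    key=$2
    # "index" looks up one key in the image's label map.
    docker image inspect --format "{{ index .Config.Labels \"$key\" }}" "$image"
}

# Print the environment variables baked into the image, one per line.
image_env() {
    image=$1
    docker image inspect --format '{{ range .Config.Env }}{{ println . }}{{ end }}' "$image"
}

# Usage (requires the image locally):
#   docker pull onedata/data-container:clinical-trials-data-800k-latest
#   image_label onedata/data-container:clinical-trials-data-800k-latest NUMBER_OF_FILES
#   image_env onedata/data-container:clinical-trials-data-800k-latest
```

Keeping the calls inside functions means nothing contacts Docker until a helper is invoked explicitly.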