Requires an installation of Python 3.9+.
To authenticate with Zenodo, configure the following environment variables in the `set_env.sh` file:

```sh
export INVENIO_RDM_ACCESS_TOKEN=<access_token>
export INVENIO_RDM_BASE_URL=<base url>
```

Create a Python virtual environment and install the project dependencies:
```sh
python3 -m venv prefect-env
source prefect-env/bin/activate
pip install -r requirements.txt
```

Set the environment variables:

```sh
source set_env.sh
```

To configure concurrency for the API calls, create a global concurrency limit named `rate-limit:invenio-rdm-api`:
```sh
prefect gcl create rate-limit:invenio-rdm-api --limit 5 --slot-decay-per-second 1.0
```

For debugging, set the Prefect logging level to DEBUG:
```sh
prefect config set PREFECT_LOGGING_LEVEL="DEBUG"
```

Start the Prefect server:
```sh
prefect server start
```

Open the Prefect dashboard in your browser at http://localhost:4200.

* Note: keep the terminal open while the Prefect server is running. When needed, the server can be stopped with Ctrl+C.
The project consists of two main scripts:
* `uploads.py`: Handles uploading datasets to Zenodo.
* `records.py`: Downloads metadata for already uploaded Zenodo records.
Both scripts are configured through a `config.json` file, which controls file paths and behavior. Here's an example structure of the configuration file:
```json
{
    "uploads": {
        "total": {
            "dataset_dir": "/home/joel/Desktop/zenodo/test/total",
            "collectors_csv": "/home/joel/Desktop/zenodo/2024_total_info_updated.csv"
        },
        "annular": {
            "dataset_dir": "/home/joel/Desktop/zenodo/test/annular",
            "collectors_csv": "/home/joel/Desktop/zenodo/2023_annular_info.csv"
        },
        "successful_results_file": "/home/joel/Desktop/zenodo/results/successul_results.csv",
        "failure_results_file": "/home/joel/Desktop/zenodo/results/failed_results.csv",
        "delete_failures": false,
        "auto_publish": false
    },
    "downloads": {
        "results_dir": "/home/joel/Desktop/zenodo/results/records/"
    }
}
```

The `uploads` section controls the upload process handled by `uploads.py`.
* `total.dataset_dir`: Path to the directory containing the total eclipse dataset.
* `total.collectors_csv`: CSV file containing metadata for the total eclipse collectors.
* `annular.dataset_dir`: Path to the directory containing the annular eclipse dataset.
* `annular.collectors_csv`: CSV file containing metadata for the annular eclipse collectors.
* `successful_results_file`: File path where successfully uploaded records will be logged.
* `failure_results_file`: File path where failed uploads will be logged.
* `delete_failures`: If `true`, files that failed to upload will be deleted.
* `auto_publish`: If `true`, records will be published automatically after upload.
The `downloads` section controls the download process handled by `records.py`.

* `results_dir`: Directory where the downloaded metadata records will be saved.
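A sketch of how a script might load and read this configuration; the `load_config` helper is illustrative, not necessarily how the project's scripts are written:

```python
import json
from pathlib import Path


def load_config(path: str = "config.json") -> dict:
    """Read the JSON configuration used by uploads.py and records.py."""
    return json.loads(Path(path).read_text())


# Example usage (assumes config.json exists in the working directory):
# config = load_config()
# total_dir = config["uploads"]["total"]["dataset_dir"]
# auto_publish = config["uploads"].get("auto_publish", False)
```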
In a separate terminal, with `prefect-env` activated, create a deployment:

```sh
python uploads.py
```

This starts a long-running process that monitors for work from the Prefect server.
To run the deployment, open the Prefect dashboard, go to Deployments in the left side panel, select `upload-datasets-deployment` from the list, click Run, and choose Quick run from the dropdown.
Once the run has started, each dataset will be uploaded sequentially and can be tracked in the 'Runs' section on the left side panel.
Note that once a dataset has been uploaded, it is tracked internally and skipped in subsequent runs. Each dataset is tracked by its file path.
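One way such path-based tracking can work is to check each dataset against the successful-results log before uploading. This sketch assumes that mechanism and a hypothetical CSV layout with the dataset path in the first column:

```python
import csv
from pathlib import Path


def already_uploaded(dataset_path: str, results_csv: str) -> bool:
    """Return True if dataset_path appears in the successful-results log."""
    log = Path(results_csv)
    if not log.exists():
        return False
    with log.open(newline="") as f:
        return any(row and row[0] == dataset_path for row in csv.reader(f))
```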
Once the records have been published, the results can be retrieved and saved locally as a CSV file named in the format `records_{timestamp}`.
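The timestamped file name can be produced along these lines; the exact timestamp format and the `.csv` extension are assumptions, not confirmed by the project:

```python
from datetime import datetime
from typing import Optional


def records_filename(now: Optional[datetime] = None) -> str:
    """Build a records_{timestamp} name, e.g. records_20240408T120000.csv."""
    ts = (now or datetime.now()).strftime("%Y%m%dT%H%M%S")
    return f"records_{ts}.csv"
```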
In a separate terminal, with `prefect-env` activated, create a deployment:

```sh
python records.py
```

This starts a long-running process that monitors for work from the Prefect server.
To run the deployment, open the Prefect dashboard, go to Deployments in the left side panel, select `get-published-records-deployment` from the list, click Run, and choose Quick run from the dropdown.
A `README.md` file based on the dataset description can be added to each dataset ZIP file prior to upload. If a `README.md` already exists, it is replaced; however, given the size of the datasets, this can take a while, since a new version of the dataset must first be created without the previous `README.md`.
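Replacing a file inside an existing ZIP requires rewriting the whole archive, which is why this step is slow for large datasets. A minimal sketch of such a rewrite — an illustration of the technique, not the project's actual implementation:

```python
import os
import zipfile


def replace_readme(zip_path: str, readme_text: str) -> None:
    """Rewrite zip_path so it contains exactly one README.md with readme_text."""
    tmp_path = zip_path + ".tmp"
    with zipfile.ZipFile(zip_path) as src, zipfile.ZipFile(
        tmp_path, "w", compression=zipfile.ZIP_DEFLATED
    ) as dst:
        for item in src.infolist():
            # Copy every member except any previous README.md.
            if item.filename != "README.md":
                dst.writestr(item, src.read(item.filename))
        dst.writestr("README.md", readme_text)
    # Atomically swap the rewritten archive into place.
    os.replace(tmp_path, zip_path)
```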
In a separate terminal, with `prefect-env` activated, create a deployment:

```sh
python add_readme.py
```

This starts a long-running process that monitors for work from the Prefect server.
To run the deployment, open the Prefect dashboard, go to Deployments in the left side panel, select `create-dataset-readme-deployment` from the list, click Run, and choose Quick run from the dropdown.