## Problem
Currently, datasets are downloaded on demand when `Field(metadata)`, `FieldTimeSeries(metadata)`, or `DatasetRestoring(metadata)` is called and the data is not already present. This behavior is useful for writing scripts that "work anywhere" on the first call, without extra steps (e.g. downloading necessary data from somewhere, which might require diving into documentation...).
However, for expensive applications it can be useful to enforce the opposite: that a script will not run unless the data has already been downloaded. Similarly, it'd be convenient to be able to take a script written by someone else and automatically download all of the data it needs (e.g. on a login node) before submitting a job to a cluster. In other words, we want:
- **Pre-downloading for cluster jobs:** When running simulations on a cluster, data should be downloaded on a login node before submitting a Slurm job, to avoid wasting compute resources (and because compute nodes may lack internet access).
- **Guarding against accidental downloads:** For expensive or production simulations, there should be a way to guarantee that all data is already on disk, erroring immediately if it isn't, rather than silently downloading mid-simulation.
## Proposed solution
Three layers, from low-level to high-level:
### 1. `NUMERICALEARTH_DATA` environment variable
Control download behavior at runtime:
| Value | Behavior |
|---|---|
| unset / `"auto"` | Download on demand (current default) |
| `"existing"` | Error if data isn't already on disk; never download |
Analogous to Julia's `--pkgimages=existing`. Usage:

```bash
NUMERICALEARTH_DATA=existing srun julia --project run_simulation.jl
```
Implementation: a check at the top of each `download_dataset` method that verifies files exist when in `existing` mode, and errors with a clear message listing the missing files.
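A minimal sketch of what that check could look like, assuming a helper `metadata_path(meta)` that returns the expected on-disk path for a piece of metadata (both that helper and the function name below are hypothetical, not the package's confirmed API):

```julia
# Hypothetical guard, called at the top of each download_dataset method.
function assert_data_exists(metadata)
    get(ENV, "NUMERICALEARTH_DATA", "auto") == "existing" || return nothing
    missing_paths = filter(!isfile, metadata_path.(metadata))
    isempty(missing_paths) && return nothing
    error("NUMERICALEARTH_DATA=existing, but the following files are missing:\n",
          join(missing_paths, '\n'))
end
```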
### 2. Metadata extraction from high-level constructors
Refactor constructors like `ERA5PrescribedAtmosphere` so their metadata-creation logic is accessible separately:
```julia
# Returns Vector{Metadata} for all 8 ERA5 variables
metadata = ERA5PrescribedAtmosphere_metadata(;
    dataset = ERA5Hourly(),
    start_date = DateTime(2020, 1, 1),
    end_date = DateTime(2020, 12, 31),
    region = BoundingBox(longitude=(200, 220), latitude=(35, 55)))
```
This enables constructing metadata (cheap, no I/O) without building fields.
A convenience function for batch downloading:
```julia
download_datasets(metadata)                  # Vector{Metadata}
download_datasets(metadata1, metadata2...)   # varargs
```
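This could be a thin layer over the existing per-dataset download path; a sketch, assuming `download_dataset(meta)` is the existing single-metadata entry point:

```julia
# Sketch: fan out over the existing single-metadata download_dataset.
download_datasets(metadata::AbstractVector) = foreach(download_dataset, metadata)
download_datasets(metadata...) = foreach(download_dataset, metadata)
```

Since `ERA5PrescribedAtmosphere_metadata` returns a `Vector{Metadata}`, both the vector form and splatting into the varargs form work.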
### 3. `DataManifest.toml` for declarative data requirements
A TOML manifest declaring all data a simulation needs, living alongside `Project.toml`:
```toml
# DataManifest.toml
[[atmosphere]]
type = "ERA5PrescribedAtmosphere"
dataset = "ERA5Hourly"
start_date = "2020-01-01"
end_date = "2020-12-31"
longitude = [200, 220]
latitude = [35, 55]

[[datasets]]
variable = "temperature"
dataset = "GLORYSDaily"
start_date = "2020-01-01"
end_date = "2020-12-31"
longitude = [200, 220]
latitude = [35, 55]
```
Two directions:

- **Code → Manifest:** `NumericalEarth.write_manifest("DataManifest.toml", metadata_collection)` exports data requirements from simulation setup code.
- **Manifest → Downloads:** `NumericalEarth.download_datasets("DataManifest.toml")` downloads without importing simulation code.
This is especially useful when simulation setups are implemented in source code (packages) rather than scripts, and for automated systems (CI/CD, deployment tooling) that need to act on data requirements without running Julia simulation code.
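A sketch of both directions using the `TOML` standard library, covering only `[[datasets]]` entries; the `Metadata` field names (`name`, `dataset`, `start_date`, `end_date`) and the `dataset_from_name` lookup are assumptions, not the package's confirmed API:

```julia
using TOML, Dates

# Code → Manifest: serialize a collection of Metadata to TOML.
function write_manifest(path::AbstractString, metadata_collection)
    entries = [Dict("variable"   => string(m.name),
                    "dataset"    => string(nameof(typeof(m.dataset))),
                    "start_date" => string(Date(m.start_date)),
                    "end_date"   => string(Date(m.end_date)))
               for m in metadata_collection]
    open(io -> TOML.print(io, Dict("datasets" => entries)), path, "w")
end

# Manifest → Downloads: rebuild Metadata and download each dataset.
function download_datasets(path::AbstractString)
    for entry in get(TOML.parsefile(path), "datasets", [])
        meta = Metadata(Symbol(entry["variable"]);
                        dataset = dataset_from_name(entry["dataset"]),  # hypothetical lookup
                        start_date = DateTime(entry["start_date"]),
                        end_date = DateTime(entry["end_date"]))
        download_dataset(meta)
    end
end
```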
## Example cluster workflow
```julia
# download_data.jl: run on the login node
using NumericalEarth
using Dates

atm = ERA5PrescribedAtmosphere_metadata(;
    start_date = DateTime(2020, 1, 1),
    end_date = DateTime(2020, 12, 31),
    region = BoundingBox(longitude=(200, 220), latitude=(35, 55)))

ocean_T = Metadata(:temperature; dataset = GLORYSDaily(),
                   start_date = DateTime(2020, 1, 1),
                   end_date = DateTime(2020, 12, 31))

download_datasets(atm..., ocean_T)
```

```bash
# On the login node:
julia --project download_data.jl

# Submit the job with the offline guard:
NUMERICALEARTH_DATA=existing sbatch run_simulation.sh
```
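For completeness, the batch script itself could set the guard, so the job fails fast even if the submission environment isn't propagated; a sketch with illustrative `#SBATCH` values:

```bash
#!/bin/bash
# run_simulation.sh (sketch; resource values are illustrative)
#SBATCH --job-name=numericalearth-run
#SBATCH --nodes=1
#SBATCH --time=12:00:00

# Error out immediately if any required dataset is missing,
# instead of attempting a download from a compute node.
export NUMERICALEARTH_DATA=existing
srun julia --project run_simulation.jl
```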
## Implementation order

1. `NUMERICALEARTH_DATA=existing` guard (small, high value)
2. Metadata extraction functions + `download_datasets` convenience (enables pre-download scripts)
3. `DataManifest.toml` read/write (enables automation and infrastructure integration)