## Problem
Currently, datasets are downloaded on demand when `Field(metadata)`, `FieldTimeSeries(metadata)`, or `DatasetRestoring(metadata)` is called and the data is not already present. This behavior is useful for writing scripts that "work anywhere" on the first call, without extra steps (e.g. downloading necessary data from somewhere, which might require diving into documentation...).
However, for expensive applications it can be useful to enforce the opposite: that a script will not run unless the data has already been downloaded. Similarly, it'd be convenient to be able to take a script written by someone else and automatically download all of the data it needs (e.g. on a login node) before submitting a job to a cluster. In other words, we want:
- **Pre-downloading for cluster jobs:** When running simulations on a cluster, data should be downloaded on a login node before submitting a Slurm job, to avoid wasting compute resources (and because compute nodes may lack internet access).
- **Guarding against accidental downloads:** For expensive or production simulations, there should be a way to guarantee that all data is already on disk, erroring immediately if it isn't, rather than silently downloading mid-simulation.
## Proposed solution
Three layers, from low-level to high-level:
### 1. `NUMERICALEARTH_DATA` environment variable
Control download behavior at runtime:
| Value | Behavior |
|---|---|
| unset / `"auto"` | Download on demand (current default) |
| `"existing"` | Error if data isn't already on disk; never download |
Analogous to Julia's `--pkgimages=existing`. Usage:

```bash
NUMERICALEARTH_DATA=existing srun julia --project run_simulation.jl
```
Implementation: a check at the top of each `download_dataset` method that verifies files exist when in `existing` mode, and errors with a clear message listing the missing files.
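A minimal sketch of what that check could look like, assuming a helper `metadata_path(meta)` that returns the expected on-disk path for a piece of metadata (both that helper and the function name below are hypothetical, not the package's confirmed API):

```julia
# Hypothetical guard, called at the top of each download_dataset method.
function assert_data_exists(metadata)
    get(ENV, "NUMERICALEARTH_DATA", "auto") == "existing" || return nothing
    missing_paths = filter(!isfile, metadata_path.(metadata))
    isempty(missing_paths) && return nothing
    error("NUMERICALEARTH_DATA=existing, but the following files are missing:\n",
          join(missing_paths, '\n'))
end
```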
### 2. Metadata extraction from high-level constructors
Refactor constructors like `ERA5PrescribedAtmosphere` so their metadata-creation logic is accessible separately:
```julia
# Returns Vector{Metadata} for all 8 ERA5 variables
metadata = ERA5PrescribedAtmosphere_metadata(;
    dataset = ERA5Hourly(),
    start_date = DateTime(2020, 1, 1),
    end_date = DateTime(2020, 12, 31),
    region = BoundingBox(longitude=(200, 220), latitude=(35, 55)))
```
This enables constructing metadata (cheap, no I/O) without building fields.
A convenience function for batch downloading:
```julia
download_datasets(metadata)                  # Vector{Metadata}
download_datasets(metadata1, metadata2...)   # varargs
```
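This could be a thin layer over the existing per-dataset download path; a sketch, assuming `download_dataset(meta)` is the existing single-metadata entry point:

```julia
# Sketch: fan out over the existing single-metadata download_dataset.
download_datasets(metadata::AbstractVector) = foreach(download_dataset, metadata)
download_datasets(metadata...) = foreach(download_dataset, metadata)
```

Since `ERA5PrescribedAtmosphere_metadata` returns a `Vector{Metadata}`, both the vector form and splatting into the varargs form work.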
### 3. `DataManifest.toml` for declarative data requirements
A TOML manifest declaring all data a simulation needs, living alongside `Project.toml`:
```toml
# DataManifest.toml
[[atmosphere]]
type = "ERA5PrescribedAtmosphere"
dataset = "ERA5Hourly"
start_date = "2020-01-01"
end_date = "2020-12-31"
longitude = [200, 220]
latitude = [35, 55]

[[datasets]]
variable = "temperature"
dataset = "GLORYSDaily"
start_date = "2020-01-01"
end_date = "2020-12-31"
longitude = [200, 220]
latitude = [35, 55]
```
Two directions:

- **Code → Manifest:** `NumericalEarth.write_manifest("DataManifest.toml", metadata_collection)` exports data requirements from simulation setup code.
- **Manifest → Downloads:** `NumericalEarth.download_datasets("DataManifest.toml")` downloads without importing simulation code.
This is especially useful when simulation setups are implemented in source code (packages) rather than scripts, and for automated systems (CI/CD, deployment tooling) that need to act on data requirements without running Julia simulation code.
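A sketch of both directions using the `TOML` standard library, covering only `[[datasets]]` entries; the `Metadata` field names (`name`, `dataset`, `start_date`, `end_date`) and the `dataset_from_name` lookup are assumptions, not the package's confirmed API:

```julia
using TOML, Dates

# Code → Manifest: serialize a collection of Metadata to TOML.
function write_manifest(path::AbstractString, metadata_collection)
    entries = [Dict("variable"   => string(m.name),
                    "dataset"    => string(nameof(typeof(m.dataset))),
                    "start_date" => string(Date(m.start_date)),
                    "end_date"   => string(Date(m.end_date)))
               for m in metadata_collection]
    open(io -> TOML.print(io, Dict("datasets" => entries)), path, "w")
end

# Manifest → Downloads: rebuild Metadata and download each dataset.
function download_datasets(path::AbstractString)
    for entry in get(TOML.parsefile(path), "datasets", [])
        meta = Metadata(Symbol(entry["variable"]);
                        dataset = dataset_from_name(entry["dataset"]),  # hypothetical lookup
                        start_date = DateTime(entry["start_date"]),
                        end_date = DateTime(entry["end_date"]))
        download_dataset(meta)
    end
end
```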
## Example cluster workflow
```julia
# download_data.jl: run on the login node
using NumericalEarth
using Dates

atm = ERA5PrescribedAtmosphere_metadata(;
    start_date = DateTime(2020, 1, 1),
    end_date = DateTime(2020, 12, 31),
    region = BoundingBox(longitude=(200, 220), latitude=(35, 55)))

ocean_T = Metadata(:temperature; dataset = GLORYSDaily(),
                   start_date = DateTime(2020, 1, 1),
                   end_date = DateTime(2020, 12, 31))

download_datasets(atm..., ocean_T)
```

```bash
# On the login node:
julia --project download_data.jl

# Submit the job with the offline guard:
NUMERICALEARTH_DATA=existing sbatch run_simulation.sh
```
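For completeness, the batch script itself could set the guard, so the job fails fast even if the submission environment isn't propagated; a sketch with illustrative `#SBATCH` values:

```bash
#!/bin/bash
# run_simulation.sh (sketch; resource values are illustrative)
#SBATCH --job-name=numericalearth-run
#SBATCH --nodes=1
#SBATCH --time=12:00:00

# Error out immediately if any required dataset is missing,
# instead of attempting a download from a compute node.
export NUMERICALEARTH_DATA=existing
srun julia --project run_simulation.jl
```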
## Implementation order

1. `NUMERICALEARTH_DATA=existing` guard (small, high value)
2. Metadata extraction functions + `download_datasets` convenience (enables pre-download scripts)
3. `DataManifest.toml` read/write (enables automation and infrastructure integration)