Add data loading pipeline for training and prediction#3

Open
aaTman wants to merge 139 commits into ai2es:master from aaTman:feat/dataloader-config

Conversation


@aaTman aaTman commented Feb 20, 2026

Summary

  • Introduces src/fronts/data/config.py with a composable DataConfig system: ERA5PredictorConfig, FrontsDataConfig, DataConfig, PredictConfig, and TimeSelection
  • ERA5 variables now use a unified levels list (e.g. ["surface", 1000, 950, 900, 850]) — the surface/pressure-level distinction is handled automatically via SURFACE_VARIABLE_MAP and SURFACE_ONLY_VARIABLES constants in code, eliminating separate surface_variables/pressure_variables YAML fields
  • DataConfig handles train/val/test year splits, normalization, shuffling, and produces tf.data.Dataset objects via xbatcher
  • PredictConfig supports three time-selection modes: most-recent timestep, explicit timestamps, and inclusive date ranges
  • Updated configs/1702.yaml with populated data: block; added configs/predict_1702.yaml for inference
  • Fixed input_sizes/target_sizes type annotation in batch.py (tuple → dict)
  • Wired DataConfig into TrainConfig.build() in train.py
  • 49 new tests in tests/test_data_config.py; updated mocks in conftest.py and test_config_ingestion.py
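The unified levels scheme described above can be sketched as follows. This is an illustrative sketch only: the constant names SURFACE_VARIABLE_MAP and SURFACE_ONLY_VARIABLES come from the PR description, but their contents and the helper function here are assumptions, not code from src/fronts/data/config.py.

```python
# Illustrative sketch of the unified-levels split; the real constants live in
# src/fronts/data/config.py. The entries below are assumed, not copied.
SURFACE_VARIABLE_MAP = {
    # pressure-level variable name -> its surface counterpart in the ARCO store
    "temperature": "2m_temperature",
    "u_component_of_wind": "10m_u_component_of_wind",
}
SURFACE_ONLY_VARIABLES = {"mean_sea_level_pressure"}

def split_levels(variable: str, levels: list) -> tuple[list[str], list[int]]:
    """Split a unified levels list like ["surface", 1000, 950] into the
    surface variable names and pressure levels to request."""
    surface, pressure = [], []
    for level in levels:
        if level == "surface":
            if variable in SURFACE_VARIABLE_MAP:
                surface.append(SURFACE_VARIABLE_MAP[variable])
        else:
            pressure.append(int(level))
    if variable in SURFACE_ONLY_VARIABLES:
        # Surface-only fields have no pressure-level component.
        return [variable], []
    return surface, pressure

surface, pressure = split_levels("temperature", ["surface", 1000, 950, 900, 850])
# surface == ["2m_temperature"], pressure == [1000, 950, 900, 850]
```

This is why the separate surface_variables/pressure_variables YAML fields become unnecessary: one levels list per variable carries both pieces of information.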

Test plan

  • All 52 tests pass (pytest tests/test_data_config.py tests/test_config_ingestion.py)
  • Verify ERA5 zarr access against the real ARCO store with actual credentials
  • Smoke-test DataConfig.build() end-to-end with a small year range
  • Smoke-test PredictConfig.build() with most_recent: true

🤖 Generated with Claude Code

andrewjustin and others added 30 commits September 28, 2023 23:25
Signed-off-by: Andrew Justin <andrewjustin@ou.edu>
Signed-off-by: andrewjustin <76486321+andrewjustin@users.noreply.github.com>
aaTman and others added 30 commits February 11, 2026 15:30
Add cloud batching tooling + ARCO ERA5
- Fix inverted require logic in open_config_yaml_as_dataclass
- Fix args.train_config → args.train_config_path attribute name
- Fix wandb=wandb passing module instead of self.wandb
- Fix mutable default list in Trainer.__init__
- Fix build_init → build_init_config method name
- Fix ModelConfig.build() parameter names to match Model.__init__
- Fix Model.build() nested inside __init__ → proper method
- Fix `if not None` → `if self.x is not None` in config builders
- Fix MetricConfig name Literal to match registry keys
- Fix callbacks YAML section to match CallbacksConfig fields
- Fix locals()[arg] → getattr(self, arg) in UNet classes
- Fix `import tf.keras` → `import tensorflow as tf`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pixi test environment (no TF/CUDA required, runs on osx-arm64)
- Reorganize pixi features: heavy deps in train feature, test env standalone
- Add TF/wandb mock conftest for testing without GPU dependencies
- 106 tests covering:
  - open_config_yaml_as_dataclass (loading, require logic, nested parsing)
  - BaseConfig and all registry subclasses (registries, build dispatch)
  - Nullable config builders (BiasVectorConfig, KernelMatrixConfig, ConvOutputConfig)
  - WandBConfig, CallbacksConfig, TrainConfig, Trainer
  - ModelConfig and Model (param alignment, build method, validation)
  - Full YAML -> dacite -> TrainConfig -> Trainer end-to-end pipeline
  - Actual 1702.yaml config file ingestion

Additional fixes discovered during testing:
- Fix activations.Lisht -> activations.LiSHT (case mismatch)
- Fix os.environ["WANDB_KEY"] -> os.environ.get() (crash at import)
- Add dacite.Config(cast=[tuple]) for YAML list->tuple conversion
- Populate fronts/layers/__init__.py with re-exports from modules.py
- Fix duplicate ruff lint key in pyproject.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces a composable DataConfig system that wires ERA5 zarr data,
front label netCDF files, and xbatcher into tf.data.Dataset objects for
train/val/test splits, plus a PredictConfig for inference-time data loading.

Key additions:
- src/fronts/data/config.py: ERA5PredictorConfig, FrontsDataConfig,
  DataConfig, PredictConfig, TimeSelection, and _stack_era5_variables().
  ERA5 variables use a unified `levels` list (e.g. ["surface", 1000, 950])
  with SURFACE_VARIABLE_MAP and SURFACE_ONLY_VARIABLES constants handling
  the surface/pressure-level distinction automatically in code.
  TimeSelection supports most-recent, explicit timestamps, and date ranges.
- configs/1702.yaml: populated data: block with unified variables/levels schema.
- configs/predict_1702.yaml: new prediction config with all three time-selection
  modes documented.
- src/fronts/data/batch.py: fix input_sizes/target_sizes type to dict[str, int].
- src/fronts/train.py: wire DataConfig into TrainConfig.build(); update dacite
  config to cast datetime.datetime.
- tests/conftest.py: add geospatial, xbatcher, and tf.data mocks.
- tests/test_data_config.py: 49 new tests covering all config classes.
- tests/test_config_ingestion.py: patch DataConfig.build() to prevent real zarr
  access during existing TrainConfig round-trip tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tf.keras.Activation and tf.keras.Layer are not top-level attributes in
TF 2.x — the correct paths are tf.keras.layers.Activation and
tf.keras.layers.Layer. Although the reference appeared only in a type
annotation on the ActivationConfig generic base class, the annotation is
evaluated when the class is defined, so it caused an AttributeError at
module import time on the cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
normalize_dataset expects a "pressure_level" dimension and legacy short-form
keys (e.g. "T_850", "u_1000"). Our stacked ERA5 dataset uses dimension "level"
and full ARCO variable names ("temperature", "u_component_of_wind", etc.), and
the normalization constants have no surface-level entries at all (acknowledged
by the existing TODO in data_utils.py). Calling it would raise a KeyError or
silently produce incorrect values.

Disable the normalize_dataset call in DataConfig.build() and PredictConfig.build()
with a TODO comment. Normalization constants and the normalize_dataset function
need to be updated for the new naming scheme before re-enabling.

Update the two PredictConfig tests that asserted normalize_dataset was called,
since it no longer is.
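The naming mismatch can be seen in a small sketch. The short-form mapping below is illustrative (only "T_850"/"u_1000" appear in the text above); it is not the repository's actual normalization table.

```python
# Illustrative: legacy normalization keys vs. the stacked ARCO dataset's
# full variable names. The mapping entries are assumed examples.
LEGACY_SHORT_NAMES = {"temperature": "T", "u_component_of_wind": "u"}

def legacy_key(arco_name: str, pressure_level: int) -> str:
    """Build the legacy short-form key normalize_dataset expects, e.g. "T_850"."""
    short = LEGACY_SHORT_NAMES.get(arco_name)
    if short is None:
        # Surface fields and unmapped variables have no normalization entry,
        # which is why calling normalize_dataset would raise a KeyError.
        raise KeyError(f"no legacy normalization key for {arco_name!r}")
    return f"{short}_{pressure_level}"
```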

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The front netCDF files on the cluster are organised into monthly
subdirectories (e.g. netcdf/200701/) rather than sitting flat in the
root directory.  Update the glob pattern to try the subdirectory layout
first ({directory}/{year}*/*.nc) and fall back to the flat layout
({directory}/*{year}*.nc) if no files are found that way.
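The fallback logic described above can be sketched like this. The glob patterns are taken from the commit message; the helper name is made up for illustration.

```python
from pathlib import Path

def find_front_files(directory: str, year: int) -> list[Path]:
    """Try monthly subdirectories (e.g. netcdf/200701/) first, then fall
    back to a flat layout if no files match. Helper name is hypothetical."""
    root = Path(directory)
    files = sorted(root.glob(f"{year}*/*.nc"))    # {directory}/{year}*/*.nc
    if not files:
        files = sorted(root.glob(f"*{year}*.nc"))  # {directory}/*{year}*.nc
    return files
```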

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The front netCDF files don't exist separately — labels were baked into
pre-built tf.data.Dataset snapshots under
  raw_front_data/tf_datasets/5class-7lvl-conus/{year}-{month}_tf/

Add TFDatasetConfig which loads these directly via tf.data.Dataset.load(),
concatenating all monthly subdirs matching the requested years.

DataConfig now supports two mutually exclusive paths:
  - tf_dataset: key → delegates to TFDatasetConfig (fast, for existing data)
  - era5 + fronts + batch keys → full ARCO ERA5 pipeline (future use)
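The snapshot-discovery part of the tf_dataset path can be sketched in plain Python. The {year}-{month}_tf layout comes from the commit message; the helper name is hypothetical, and the actual loading step (shown as a comment) requires TensorFlow.

```python
from pathlib import Path

def monthly_snapshot_dirs(root: str, years: list[int]) -> list[Path]:
    """Collect {year}-{month}_tf snapshot directories for the requested
    years, in sorted order. Helper name is illustrative, not from the PR."""
    base = Path(root)
    dirs: list[Path] = []
    for year in years:
        dirs.extend(sorted(base.glob(f"{year}-*_tf")))
    return dirs

# TFDatasetConfig would then load and concatenate the snapshots, roughly:
#   datasets = [tf.data.Dataset.load(str(d)) for d in dirs]
#   combined = functools.reduce(lambda a, b: a.concatenate(b), datasets)
```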

Also add configs/1702_tf.yaml pointing at the on-cluster TF datasets with
the same training/validation year split as the old script. Use this config
with the Slurm script to get training running immediately:

  pixi run python -m fronts.train --train_config_path configs/1702_tf.yaml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>