Add data loading pipeline for training and prediction#3

Open
aaTman wants to merge 139 commits into ai2es:master from aaTman:feat/dataloader-config

Conversation


@aaTman aaTman commented Feb 20, 2026

Summary

  • Introduces src/fronts/data/config.py with a composable DataConfig system: ERA5PredictorConfig, FrontsDataConfig, DataConfig, PredictConfig, and TimeSelection
  • ERA5 variables now use a unified levels list (e.g. ["surface", 1000, 950, 900, 850]) — the surface/pressure-level distinction is handled automatically via SURFACE_VARIABLE_MAP and SURFACE_ONLY_VARIABLES constants in code, eliminating separate surface_variables/pressure_variables YAML fields
  • DataConfig handles train/val/test year splits, normalization, shuffling, and produces tf.data.Dataset objects via xbatcher
  • PredictConfig supports three time-selection modes: most-recent timestep, explicit timestamps, and inclusive date ranges
  • Updated configs/1702.yaml with populated data: block; added configs/predict_1702.yaml for inference
  • Fixed input_sizes/target_sizes type annotation in batch.py (tuple → dict)
  • Wired DataConfig into TrainConfig.build() in train.py
  • 49 new tests in tests/test_data_config.py; updated mocks in conftest.py and test_config_ingestion.py
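The unified levels scheme described above can be sketched as follows. This is an illustrative sketch only: the constant names SURFACE_VARIABLE_MAP and SURFACE_ONLY_VARIABLES come from the PR description, but their contents and the helper function here are assumptions, not code from src/fronts/data/config.py.

```python
# Illustrative sketch of the unified-levels split; the real constants live in
# src/fronts/data/config.py. The entries below are assumed, not copied.
SURFACE_VARIABLE_MAP = {
    # pressure-level variable name -> its surface counterpart in the ARCO store
    "temperature": "2m_temperature",
    "u_component_of_wind": "10m_u_component_of_wind",
}
SURFACE_ONLY_VARIABLES = {"mean_sea_level_pressure"}

def split_levels(variable: str, levels: list) -> tuple[list[str], list[int]]:
    """Split a unified levels list like ["surface", 1000, 950] into the
    surface variable names and pressure levels to request."""
    surface, pressure = [], []
    for level in levels:
        if level == "surface":
            if variable in SURFACE_VARIABLE_MAP:
                surface.append(SURFACE_VARIABLE_MAP[variable])
        else:
            pressure.append(int(level))
    if variable in SURFACE_ONLY_VARIABLES:
        # Surface-only fields have no pressure-level component.
        return [variable], []
    return surface, pressure

surface, pressure = split_levels("temperature", ["surface", 1000, 950, 900, 850])
# surface == ["2m_temperature"], pressure == [1000, 950, 900, 850]
```

This is why the separate surface_variables/pressure_variables YAML fields become unnecessary: one levels list per variable carries both pieces of information.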

Test plan

  • All 52 tests pass (pytest tests/test_data_config.py tests/test_config_ingestion.py)
  • Verify ERA5 zarr access against the real ARCO store with actual credentials
  • Smoke-test DataConfig.build() end-to-end with a small year range
  • Smoke-test PredictConfig.build() with most_recent: true

🤖 Generated with Claude Code

andrewjustin and others added 30 commits September 28, 2023 23:25
Signed-off-by: Andrew Justin <andrewjustin@ou.edu>
Signed-off-by: andrewjustin <76486321+andrewjustin@users.noreply.github.com>
aaTman and others added 30 commits February 11, 2026 15:30
Add cloud batching tooling + ARCO ERA5
- Fix inverted require logic in open_config_yaml_as_dataclass
- Fix args.train_config → args.train_config_path attribute name
- Fix wandb=wandb passing module instead of self.wandb
- Fix mutable default list in Trainer.__init__
- Fix build_init → build_init_config method name
- Fix ModelConfig.build() parameter names to match Model.__init__
- Fix Model.build() nested inside __init__ → proper method
- Fix `if not None` → `if self.x is not None` in config builders
- Fix MetricConfig name Literal to match registry keys
- Fix callbacks YAML section to match CallbacksConfig fields
- Fix locals()[arg] → getattr(self, arg) in UNet classes
- Fix `import tf.keras` → `import tensorflow as tf`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pixi test environment (no TF/CUDA required, runs on osx-arm64)
- Reorganize pixi features: heavy deps in train feature, test env standalone
- Add TF/wandb mock conftest for testing without GPU dependencies
- 106 tests covering:
  - open_config_yaml_as_dataclass (loading, require logic, nested parsing)
  - BaseConfig and all registry subclasses (registries, build dispatch)
  - Nullable config builders (BiasVectorConfig, KernelMatrixConfig, ConvOutputConfig)
  - WandBConfig, CallbacksConfig, TrainConfig, Trainer
  - ModelConfig and Model (param alignment, build method, validation)
  - Full YAML -> dacite -> TrainConfig -> Trainer end-to-end pipeline
  - Actual 1702.yaml config file ingestion

Additional fixes discovered during testing:
- Fix activations.Lisht -> activations.LiSHT (case mismatch)
- Fix os.environ["WANDB_KEY"] -> os.environ.get() (crash at import)
- Add dacite.Config(cast=[tuple]) for YAML list->tuple conversion
- Populate fronts/layers/__init__.py with re-exports from modules.py
- Fix duplicate ruff lint key in pyproject.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces a composable DataConfig system that wires ERA5 zarr data,
front label netCDF files, and xbatcher into tf.data.Dataset objects for
train/val/test splits, plus a PredictConfig for inference-time data loading.

Key additions:
- src/fronts/data/config.py: ERA5PredictorConfig, FrontsDataConfig,
  DataConfig, PredictConfig, TimeSelection, and _stack_era5_variables().
  ERA5 variables use a unified `levels` list (e.g. ["surface", 1000, 950])
  with SURFACE_VARIABLE_MAP and SURFACE_ONLY_VARIABLES constants handling
  the surface/pressure-level distinction automatically in code.
  TimeSelection supports most-recent, explicit timestamps, and date ranges.
- configs/1702.yaml: populated data: block with unified variables/levels schema.
- configs/predict_1702.yaml: new prediction config with all three time-selection
  modes documented.
- src/fronts/data/batch.py: fix input_sizes/target_sizes type to dict[str, int].
- src/fronts/train.py: wire DataConfig into TrainConfig.build(); update dacite
  config to cast datetime.datetime.
- tests/conftest.py: add geospatial, xbatcher, and tf.data mocks.
- tests/test_data_config.py: 49 new tests covering all config classes.
- tests/test_config_ingestion.py: patch DataConfig.build() to prevent real zarr
  access during existing TrainConfig round-trip tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tf.keras.Activation and tf.keras.Layer are not top-level attributes in
TF 2.x — the correct paths are tf.keras.layers.Activation and
tf.keras.layers.Layer. Although the reference appeared only in a type
annotation on the ActivationConfig generic base class, the annotation is
evaluated when the class is defined, so it caused an AttributeError at
module import time on the cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
normalize_dataset expects a "pressure_level" dimension and legacy short-form
keys (e.g. "T_850", "u_1000"). Our stacked ERA5 dataset uses dimension "level"
and full ARCO variable names ("temperature", "u_component_of_wind", etc.), and
the normalization constants have no surface-level entries at all (acknowledged
by the existing TODO in data_utils.py). Calling it would raise a KeyError or
silently produce incorrect values.

Disable the normalize_dataset call in DataConfig.build() and PredictConfig.build()
with a TODO comment. Normalization constants and the normalize_dataset function
need to be updated for the new naming scheme before re-enabling.

Update the two PredictConfig tests that asserted normalize_dataset was called,
since it no longer is.
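The naming mismatch can be seen in a small sketch. The short-form mapping below is illustrative (only "T_850"/"u_1000" appear in the text above); it is not the repository's actual normalization table.

```python
# Illustrative: legacy normalization keys vs. the stacked ARCO dataset's
# full variable names. The mapping entries are assumed examples.
LEGACY_SHORT_NAMES = {"temperature": "T", "u_component_of_wind": "u"}

def legacy_key(arco_name: str, pressure_level: int) -> str:
    """Build the legacy short-form key normalize_dataset expects, e.g. "T_850"."""
    short = LEGACY_SHORT_NAMES.get(arco_name)
    if short is None:
        # Surface fields and unmapped variables have no normalization entry,
        # which is why calling normalize_dataset would raise a KeyError.
        raise KeyError(f"no legacy normalization key for {arco_name!r}")
    return f"{short}_{pressure_level}"
```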

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The front netCDF files on the cluster are organised into monthly
subdirectories (e.g. netcdf/200701/) rather than sitting flat in the
root directory.  Update the glob pattern to try the subdirectory layout
first ({directory}/{year}*/*.nc) and fall back to the flat layout
({directory}/*{year}*.nc) if no files are found that way.
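The fallback logic described above can be sketched like this. The glob patterns are taken from the commit message; the helper name is made up for illustration.

```python
from pathlib import Path

def find_front_files(directory: str, year: int) -> list[Path]:
    """Try monthly subdirectories (e.g. netcdf/200701/) first, then fall
    back to a flat layout if no files match. Helper name is hypothetical."""
    root = Path(directory)
    files = sorted(root.glob(f"{year}*/*.nc"))    # {directory}/{year}*/*.nc
    if not files:
        files = sorted(root.glob(f"*{year}*.nc"))  # {directory}/*{year}*.nc
    return files
```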

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The front netCDF files don't exist separately — labels were baked into
pre-built tf.data.Dataset snapshots under
  raw_front_data/tf_datasets/5class-7lvl-conus/{year}-{month}_tf/

Add TFDatasetConfig which loads these directly via tf.data.Dataset.load(),
concatenating all monthly subdirs matching the requested years.

DataConfig now supports two mutually exclusive paths:
  - tf_dataset: key → delegates to TFDatasetConfig (fast, for existing data)
  - era5 + fronts + batch keys → full ARCO ERA5 pipeline (future use)
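The snapshot-discovery part of the tf_dataset path can be sketched in plain Python. The {year}-{month}_tf layout comes from the commit message; the helper name is hypothetical, and the actual loading step (shown as a comment) requires TensorFlow.

```python
from pathlib import Path

def monthly_snapshot_dirs(root: str, years: list[int]) -> list[Path]:
    """Collect {year}-{month}_tf snapshot directories for the requested
    years, in sorted order. Helper name is illustrative, not from the PR."""
    base = Path(root)
    dirs: list[Path] = []
    for year in years:
        dirs.extend(sorted(base.glob(f"{year}-*_tf")))
    return dirs

# TFDatasetConfig would then load and concatenate the snapshots, roughly:
#   datasets = [tf.data.Dataset.load(str(d)) for d in dirs]
#   combined = functools.reduce(lambda a, b: a.concatenate(b), datasets)
```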

Also add configs/1702_tf.yaml pointing at the on-cluster TF datasets with
the same training/validation year split as the old script. Use this config
with the Slurm script to get training running immediately:

  pixi run python -m fronts.train --train_config_path configs/1702_tf.yaml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>