Add data loading pipeline for training and prediction #3
Open
aaTman wants to merge 139 commits into ai2es:master from
Conversation
Signed-off-by: Andrew Justin <andrewjustin@ou.edu>
Signed-off-by: andrewjustin <76486321+andrewjustin@users.noreply.github.com>
Add cloud batching tooling + ARCO ERA5
…or metrics and activations
…d" to represent both classes and functions
- Fix inverted require logic in open_config_yaml_as_dataclass
- Fix args.train_config → args.train_config_path attribute name
- Fix wandb=wandb passing module instead of self.wandb
- Fix mutable default list in Trainer.__init__
- Fix build_init → build_init_config method name
- Fix ModelConfig.build() parameter names to match Model.__init__
- Fix Model.build() nested inside __init__ → proper method
- Fix `if not None` → `if self.x is not None` in config builders
- Fix MetricConfig name Literal to match registry keys
- Fix callbacks YAML section to match CallbacksConfig fields
- Fix locals()[arg] → getattr(self, arg) in UNet classes
- Fix `import tf.keras` → `import tensorflow as tf`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pixi test environment (no TF/CUDA required, runs on osx-arm64)
- Reorganize pixi features: heavy deps in train feature, test env standalone
- Add TF/wandb mock conftest for testing without GPU dependencies
- 106 tests covering:
  - open_config_yaml_as_dataclass (loading, require logic, nested parsing)
  - BaseConfig and all registry subclasses (registries, build dispatch)
  - Nullable config builders (BiasVectorConfig, KernelMatrixConfig, ConvOutputConfig)
  - WandBConfig, CallbacksConfig, TrainConfig, Trainer
  - ModelConfig and Model (param alignment, build method, validation)
  - Full YAML -> dacite -> TrainConfig -> Trainer end-to-end pipeline
  - Actual 1702.yaml config file ingestion

Additional fixes discovered during testing:
- Fix activations.Lisht -> activations.LiSHT (case mismatch)
- Fix os.environ["WANDB_KEY"] -> os.environ.get() (crash at import)
- Add dacite.Config(cast=[tuple]) for YAML list->tuple conversion
- Populate fronts/layers/__init__.py with re-exports from modules.py
- Fix duplicate ruff lint key in pyproject.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces a composable DataConfig system that wires ERA5 zarr data, front label netCDF files, and xbatcher into tf.data.Dataset objects for train/val/test splits, plus a PredictConfig for inference-time data loading.

Key additions:
- src/fronts/data/config.py: ERA5PredictorConfig, FrontsDataConfig, DataConfig, PredictConfig, TimeSelection, and _stack_era5_variables(). ERA5 variables use a unified `levels` list (e.g. ["surface", 1000, 950]) with SURFACE_VARIABLE_MAP and SURFACE_ONLY_VARIABLES constants handling the surface/pressure-level distinction automatically in code. TimeSelection supports most-recent, explicit timestamps, and date ranges.
- configs/1702.yaml: populated data: block with unified variables/levels schema.
- configs/predict_1702.yaml: new prediction config with all three time-selection modes documented.
- src/fronts/data/batch.py: fix input_sizes/target_sizes type to dict[str, int].
- src/fronts/train.py: wire DataConfig into TrainConfig.build(); update dacite config to cast datetime.datetime.
- tests/conftest.py: add geospatial, xbatcher, and tf.data mocks.
- tests/test_data_config.py: 49 new tests covering all config classes.
- tests/test_config_ingestion.py: patch DataConfig.build() to prevent real zarr access during existing TrainConfig round-trip tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
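The unified `levels` split described above can be sketched as follows. The constant names mirror SURFACE_VARIABLE_MAP / SURFACE_ONLY_VARIABLES from the commit, but their contents here and the `split_levels` helper are illustrative assumptions, not the PR's actual implementation.

```python
# Hypothetical mapping of a pressure-level variable to its surface-level
# ARCO ERA5 counterpart (contents are illustrative, not the real table).
SURFACE_VARIABLE_MAP = {
    "temperature": "2m_temperature",
    "u_component_of_wind": "10m_u_component_of_wind",
    "v_component_of_wind": "10m_v_component_of_wind",
}

# Variables that only exist at the surface (no pressure-level analogue).
SURFACE_ONLY_VARIABLES = {"mean_sea_level_pressure"}


def split_levels(variables: list[str], levels: list) -> tuple[list[str], dict[str, list[int]]]:
    """Split a unified levels list like ["surface", 1000, 950] into
    surface variable names and a {variable: pressure_levels} mapping."""
    pressure_levels = [lvl for lvl in levels if lvl != "surface"]
    want_surface = "surface" in levels

    surface_vars: list[str] = []
    pressure_vars: dict[str, list[int]] = {}
    for var in variables:
        if var in SURFACE_ONLY_VARIABLES:
            # Surface-only variables never get pressure levels.
            surface_vars.append(var)
            continue
        if want_surface and var in SURFACE_VARIABLE_MAP:
            surface_vars.append(SURFACE_VARIABLE_MAP[var])
        if pressure_levels:
            pressure_vars[var] = pressure_levels
    return surface_vars, pressure_vars
```

The benefit of this shape is that the YAML stays a single flat list, and the surface/pressure distinction lives in one place in code.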
tf.keras.Activation and tf.keras.Layer are not top-level attributes in TF 2.x; the correct paths are tf.keras.layers.Activation and tf.keras.layers.Layer. Although this was a type-annotation-only reference in the ActivationConfig generic base class, annotations are evaluated eagerly unless deferred, so it caused an AttributeError at module import time on the cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
normalize_dataset expects a "pressure_level" dimension and legacy short-form
keys (e.g. "T_850", "u_1000"). Our stacked ERA5 dataset uses dimension "level"
and full ARCO variable names ("temperature", "u_component_of_wind", etc.), and
the normalization constants have no surface-level entries at all (acknowledged
by the existing TODO in data_utils.py). Calling it would raise a KeyError or
silently produce incorrect values.
Disable the normalize_dataset call in DataConfig.build() and PredictConfig.build()
with a TODO comment. Normalization constants and the normalize_dataset function
need to be updated for the new naming scheme before re-enabling.
Update the two PredictConfig tests that asserted normalize_dataset was called,
since it no longer is.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
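Before re-enabling normalization, a bridge between the two naming schemes will be needed. A minimal sketch of what parsing a legacy key could look like is below; the prefix table and `parse_legacy_key` helper are assumptions for illustration, not the project's actual normalization update.

```python
# Hypothetical prefix table from legacy short-form keys to ARCO names.
LEGACY_PREFIX_TO_ARCO = {
    "T": "temperature",
    "u": "u_component_of_wind",
    "v": "v_component_of_wind",
    "q": "specific_humidity",
}


def parse_legacy_key(key: str) -> tuple[str, int]:
    """Split a legacy key like "T_850" into (ARCO variable name, level)."""
    prefix, level = key.rsplit("_", 1)
    return LEGACY_PREFIX_TO_ARCO[prefix], int(level)
```

Surface-level entries have no legacy counterpart at all (per the TODO in data_utils.py), so those constants would have to be computed fresh rather than translated.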
The front netCDF files on the cluster are organised into monthly
subdirectories (e.g. netcdf/200701/) rather than sitting flat in the
root directory. Update the glob pattern to try the subdirectory layout
first ({directory}/{year}*/*.nc) and fall back to the flat layout
({directory}/*{year}*.nc) if no files are found that way.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The front netCDF files don't exist separately — labels were baked into
pre-built tf.data.Dataset snapshots under
raw_front_data/tf_datasets/5class-7lvl-conus/{year}-{month}_tf/
Add TFDatasetConfig which loads these directly via tf.data.Dataset.load(),
concatenating all monthly subdirs matching the requested years.
DataConfig now supports two mutually exclusive paths:
- tf_dataset: key → delegates to TFDatasetConfig (fast, for existing data)
- era5 + fronts + batch keys → full ARCO ERA5 pipeline (future use)
Also add configs/1702_tf.yaml pointing at the on-cluster TF datasets with
the same training/validation year split as the old script. Use this config
with the Slurm script to get training running immediately:
pixi run python -m fronts.train --train_config_path configs/1702_tf.yaml
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- src/fronts/data/config.py with a composable DataConfig system: ERA5PredictorConfig, FrontsDataConfig, DataConfig, PredictConfig, and TimeSelection
- ERA5 variables use a unified `levels` list (e.g. ["surface", 1000, 950, 900, 850]); the surface/pressure-level distinction is handled automatically via SURFACE_VARIABLE_MAP and SURFACE_ONLY_VARIABLES constants in code, eliminating separate surface_variables/pressure_variables YAML fields
- DataConfig handles train/val/test year splits, normalization, shuffling, and produces tf.data.Dataset objects via xbatcher
- PredictConfig supports three time-selection modes: most-recent timestep, explicit timestamps, and inclusive date ranges
- Updated configs/1702.yaml with populated data: block; added configs/predict_1702.yaml for inference
- Fixed input_sizes/target_sizes type annotation in batch.py (tuple → dict)
- Wired DataConfig into TrainConfig.build() in train.py
- Added tests/test_data_config.py; updated mocks in conftest.py and test_config_ingestion.py

Test plan
- Run unit tests (pytest tests/test_data_config.py tests/test_config_ingestion.py)
- Exercise DataConfig.build() end-to-end with a small year range
- Exercise PredictConfig.build() with most_recent: true

🤖 Generated with Claude Code
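The three time-selection modes exercised in the test plan could be expressed in a predict config roughly as follows. This fragment is a hypothetical sketch; the exact field names are assumptions, and the real configs/predict_1702.yaml should be consulted. Only one mode would be set at a time.

```yaml
time_selection:
  # Mode 1: most recent available timestep
  most_recent: true

  # Mode 2: explicit timestamps
  # timestamps:
  #   - 2023-01-01T00:00:00
  #   - 2023-01-01T06:00:00

  # Mode 3: inclusive date range
  # start: 2023-01-01T00:00:00
  # end: 2023-01-31T18:00:00
```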