Welcome to my repository on AI-based modeling for Green Stormwater Infrastructure (GSI) performance prediction. This project implements a configurable, model-agnostic experiment pipeline for training, evaluating, and comparing multiple models under one structure.
Green Stormwater Infrastructure (GSI) plays a key role in reducing urban flooding, improving water quality, and supporting urban ecosystems. However, traditional process-based models are computationally expensive and hard to generalize. This project aims to overcome these limitations by providing a shared ML/DL framework with model configurability, reproducibility, and consistent diagnostics.
This repository supports training and testing of the following models:
- ✅ Random Forest (RF) — non-parametric ensemble learning
- ✅ Data-Driven LSTM (LSTM) — sequence modeling for time-series data
- ✅ Physics-Informed LSTM (PILSTM) — deep learning with physical constraints
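The physics-informed idea behind the PILSTM is to combine a standard data-fit loss with a penalty for violating a physical constraint. A minimal sketch of that idea follows; the simplified mass-balance constraint (outflow should not exceed inflow minus evapotranspiration) and all function names are illustrative, not the repository's actual implementation:

```python
import numpy as np

def physics_informed_loss(y_true, y_pred, inflow, et, weight=0.1):
    """Data-fit MSE plus a penalty on a (hypothetical) mass-balance residual.

    The physics term penalizes predictions where outflow exceeds
    inflow minus evapotranspiration, i.e. water appearing from nowhere.
    """
    data_loss = np.mean((y_true - y_pred) ** 2)
    # Residual of a simplified water balance: outflow <= inflow - ET
    violation = np.maximum(y_pred - (inflow - et), 0.0)
    physics_loss = np.mean(violation ** 2)
    return data_loss + weight * physics_loss

# Toy example: the first prediction violates the balance, the second does not
y_true = np.array([1.0, 2.0])
y_pred = np.array([2.0, 2.0])
inflow = np.array([2.0, 3.0])
et = np.array([0.2, 0.5])
loss = physics_informed_loss(y_true, y_pred, inflow, et)
```

In the actual PILSTM the equivalent penalty would be expressed with Keras tensor operations inside a custom loss, so it can be backpropagated through during training.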
Model selection is fully controlled via config/config.yaml or CLI arguments. The same pipeline now handles:
- data loading and validation
- storm-based train/test splitting
- feature scaling
- tabular or sequence feature generation
- model training / loading
- metrics, residuals, and bootstrap uncertainty
- artifact saving
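Model selection and pipeline options of this kind can be expressed in a single config file. The fragment below is illustrative only; the actual keys in config/config.yaml may differ:

```yaml
# Illustrative config/config.yaml fragment — actual keys may differ
model: lstm            # one of: rf, lstm, pilstm
data:
  path: data/raw/filtered_storms_df.csv
  split: storm         # storm-based train/test splitting
lstm:
  epochs: 50
  window: 24
bootstrap:
  n_samples: 200
```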
├── config/ # Experiment and model configuration
├── data/ # Raw and processed data (DO NOT COMMIT large files)
├── notebooks/ # Exploration notebooks; no longer the source of truth
├── results/
│ ├── experiments/ # Predictions, metrics, plots, and bootstrap outputs by model
│ └── models/ # Serialized trained models
├── src/
│ ├── analysis/ # Metrics and bootstrap uncertainty
│ ├── data/ # Dataset loading and storm-level splitting
│ ├── features/ # Tabular and sequence feature builders
│ ├── models/ # Model adapters and registry
│ ├── pipeline/ # Shared training / evaluation orchestration
│ ├── plots/ # Reusable diagnostics plots
│ ├── train.py # CLI entrypoint for training
│ └── test.py # CLI entrypoint for evaluation
└── README.md
This project should run in a project-local Python 3.10 virtual environment (Python 3.11 is also acceptable if it is a stable release build). The repo currently pins tensorflow==2.14.0, which does not support Python 3.12 or later.
```bash
bash scripts/setup_env.sh
```

That script will:

- Create `.venv` with `python3.10`
- Upgrade `pip`, `setuptools`, and `wheel`
- Install runtime and notebook dependencies
- Register a Jupyter kernel for VS Code and Jupyter
- Run an environment validation check
If your OS Python was installed without the standard venv module, the script falls back to virtualenv automatically.
```bash
python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements-dev.txt
python -m ipykernel install --user --name gsi-performance-prediction --display-name "Python (.venv) - GSI Performance Prediction"
python scripts/validate_env.py
```

Then, in VS Code / Jupyter:

- Open the repo in WSL
- Select the interpreter at `.venv/bin/python`
- Open a notebook and select the kernel `Python (.venv) - GSI Performance Prediction`
- Do not install project packages into `base`, system Python, or a Conda environment.
- Always use `python -m pip ...` so installs target the active interpreter.
- Keep runtime dependencies in `requirements.txt` and notebook/dev tooling in `requirements-dev.txt`.
- Recreate the virtual environment after major dependency changes, especially around TensorFlow, NumPy, scikit-learn, or Python version changes.
- Commit dependency files and setup scripts, but never commit `.venv/`.
- Retrain and resave serialized models in `results/models/` after changing scikit-learn or TensorFlow/Keras versions.
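As a guard against silently loading models under mismatched library versions, a small check along these lines can be run before deserializing artifacts. The pinned versions and the helper itself are illustrative, not part of the repository:

```python
from importlib.metadata import version, PackageNotFoundError

# Major.minor versions the serialized models were trained with
# (illustrative values — record the real ones alongside each artifact)
EXPECTED = {"scikit-learn": "1.3", "tensorflow": "2.14"}

def check_versions(expected=EXPECTED):
    """Return (package, wanted, installed) tuples for any version mismatch."""
    mismatches = []
    for pkg, want in expected.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            continue  # package not installed; nothing to compare
        if not have.startswith(want):
            mismatches.append((pkg, want, have))
    return mismatches
```

A non-empty return value is a signal to retrain rather than trust the loaded model.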
The repo is no longer intended to be run primarily from notebooks. The notebook workflows have been extracted into reusable source modules so models can be trained and evaluated consistently from the CLI.
Sequence models are now built storm-by-storm, so LSTM windows do not cross StormID boundaries.
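That per-storm windowing can be sketched as follows; the function and argument names are illustrative, while the repository's real sequence builder lives under `src/features/`:

```python
import numpy as np

def build_windows_by_storm(values, storm_ids, window):
    """Build (window, n_features) LSTM inputs without crossing StormID boundaries."""
    X, y = [], []
    for storm in np.unique(storm_ids):
        v = values[storm_ids == storm]           # rows for one storm only
        for i in range(len(v) - window):
            X.append(v[i : i + window])          # input window
            y.append(v[i + window, 0])           # next-step target (first feature)
    return np.array(X), np.array(y)

# Two storms of 5 and 4 timesteps, one feature, window of 3
vals = np.arange(9, dtype=float).reshape(-1, 1)
ids = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2])
X, y = build_windows_by_storm(vals, ids, window=3)
# Storm 1 yields 2 windows, storm 2 yields 1 — none spanning the boundary
```

Because each storm is windowed independently, no training sequence mixes the tail of one storm with the head of the next.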
The notebooks were originally authored in Colab and still contain Google Drive paths such as /content/drive/....
They should now be treated as exploration layers on top of the shared pipeline rather than the place where the core workflow lives.
- `notebooks/LSTM(uni)_BTI_Storms.ipynb` and `notebooks/RF_BTI_Storms.ipynb` should use `data/raw/filtered_storms_df.csv`
- `notebooks/PILSTM_BTI_Storms.ipynb` expects `data/raw/filtered_df_ET_inf.csv`
- If the PILSTM dataset is not present, the `pilstm` CLI path will fail with a clear missing-file message
Train the Random Forest:

```bash
source .venv/bin/activate
python -m src.train --model rf
```

Train the LSTM:

```bash
source .venv/bin/activate
python -m src.train --model lstm
```

Evaluate the Random Forest:

```bash
source .venv/bin/activate
python -m src.test --model rf
```

Train all models:

```bash
source .venv/bin/activate
python -m src.train --all
```

Override config values from the CLI:

```bash
python -m src.train --model lstm --set lstm.epochs=5 --set bootstrap.n_samples=200
```

Skip bootstrap and plots:

```bash
python -m src.train --model rf --skip-bootstrap --skip-plots
```

Each model writes artifacts to `results/experiments/<model_name>/`:

- `predictions.csv`
- `metrics.json`
- `metrics.csv`
- `plots/`
- `bootstrap/`
- `resolved_config.yaml`
- `preprocessor.joblib`
Serialized models are written to `results/models/`.
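The bootstrap outputs above come from resampling predictions and recomputing metrics. A minimal sketch of the idea (the RMSE metric, function name, and defaults here are illustrative, not the repository's `src/analysis` code):

```python
import numpy as np

def bootstrap_rmse_ci(y_true, y_pred, n_samples=200, alpha=0.05, seed=0):
    """Percentile confidence interval for RMSE via paired bootstrap resampling."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)   # resample (truth, prediction) pairs
        err = y_true[idx] - y_pred[idx]
        stats.append(np.sqrt(np.mean(err ** 2)))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
lo, hi = bootstrap_rmse_ci(y_true, y_pred)
```

Resampling pairs (rather than residuals independently) keeps each bootstrap replicate a plausible draw from the original prediction set.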