
GSI-Performance-Prediction

Welcome to my repository on AI-based modeling for Green Stormwater Infrastructure (GSI) performance prediction. This project implements a configurable, model-agnostic experiment pipeline for training, evaluating, and comparing multiple models under one structure.

Why This Matters

Green Stormwater Infrastructure (GSI) plays a key role in reducing urban flooding, improving water quality, and supporting urban ecosystems. However, traditional models are computationally expensive and hard to generalize. This project aims to overcome these limitations by providing a shared ML/DL framework with model configurability, reproducibility, and consistent diagnostics.

Supported Models

This repository supports training and testing of the following models:

  1. Random Forest (RF) — Non-parametric ensemble learning. ✅
  2. Data-Driven LSTM (LSTM) — Sequence modeling for time-series data. ✅
  3. Physics-Informed LSTM (PILSTM) — Deep learning with physical constraints.

Model selection is fully controlled via config/config.yaml or CLI arguments. The same pipeline now handles:

  1. data loading and validation
  2. storm-based train/test splitting
  3. feature scaling
  4. tabular or sequence feature generation
  5. model training / loading
  6. metrics, residuals, and bootstrap uncertainty
  7. artifact saving
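The storm-based train/test split in step 2 can be sketched with scikit-learn's `GroupShuffleSplit`, grouping rows by `StormID` so that every storm lands entirely in either the train or the test set. This is an illustrative sketch, not necessarily the splitter the repo's `src/data/` module uses; the column name `StormID` is taken from the pipeline notes below.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def storm_split(df, group_col="StormID", test_size=0.2, seed=42):
    """Split rows so each storm appears entirely in train or in test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[test_idx]

# Toy frame: 5 storms, 4 timesteps each
df = pd.DataFrame({
    "StormID": np.repeat(np.arange(5), 4),
    "rainfall": np.random.rand(20),
})
train, test = storm_split(df)
assert set(train["StormID"]).isdisjoint(set(test["StormID"]))
```

Splitting by storm rather than by row avoids leakage: rows within one storm are highly autocorrelated, so a random row-level split would let the model effectively "see" test storms during training.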

Project Structure

├── config/                  # Experiment and model configuration
├── data/                    # Raw and processed data (DO NOT COMMIT large files)
├── notebooks/               # Exploration notebooks; no longer the source of truth
├── results/
│   ├── experiments/         # Predictions, metrics, plots, and bootstrap outputs by model
│   └── models/              # Serialized trained models
├── src/
│   ├── analysis/            # Metrics and bootstrap uncertainty
│   ├── data/                # Dataset loading and storm-level splitting
│   ├── features/            # Tabular and sequence feature builders
│   ├── models/              # Model adapters and registry
│   ├── pipeline/            # Shared training / evaluation orchestration
│   ├── plots/               # Reusable diagnostics plots
│   ├── train.py             # CLI entrypoint for training
│   └── test.py              # CLI entrypoint for evaluation
└── README.md

Environment Setup

This project should run in a project-local Python 3.10 virtual environment. The repo currently pins tensorflow==2.14.0, which does not support Python 3.13. Python 3.11 is also acceptable if it is a stable release build.

One-Time Setup

bash scripts/setup_env.sh

That script will:

  1. Create .venv with python3.10
  2. Upgrade pip, setuptools, and wheel
  3. Install runtime and notebook dependencies
  4. Register a Jupyter kernel for VS Code and Jupyter
  5. Run an environment validation check

If your OS Python was installed without the standard venv module, the script falls back to virtualenv automatically.

Manual Setup

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements-dev.txt
python -m ipykernel install --user --name gsi-performance-prediction --display-name "Python (.venv) - GSI Performance Prediction"
python scripts/validate_env.py

VS Code / Notebook Workflow

  1. Open the repo in WSL
  2. Select the interpreter at .venv/bin/python
  3. Open a notebook and select the kernel Python (.venv) - GSI Performance Prediction

Best Practices

  1. Do not install project packages into base, system Python, or a Conda environment.
  2. Always use python -m pip ... so installs target the active interpreter.
  3. Keep runtime dependencies in requirements.txt and notebook/dev tooling in requirements-dev.txt.
  4. Recreate the virtual environment after major dependency changes, especially around TensorFlow, NumPy, scikit-learn, or Python version changes.
  5. Commit dependency files and setup scripts, but never commit .venv/.
  6. Retrain and resave serialized models in results/models/ after changing scikit-learn or TensorFlow/Keras versions.

Pipeline Notes

The repo is no longer intended to be run primarily from notebooks. The notebook workflows have been extracted into reusable source modules so models can be trained and evaluated consistently from the CLI.

Sequence models are now built storm-by-storm, so LSTM windows do not cross StormID boundaries.
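Per-storm windowing can be sketched as follows: group the frame by `StormID` and slide a fixed-length window inside each group, so no window spans two storms. This is a minimal illustration (the function name, `window` length, and column names are assumptions, not the repo's actual API):

```python
import numpy as np
import pandas as pd

def storm_windows(df, feature_cols, target_col, window=3, group_col="StormID"):
    """Build (X, y) sequence windows independently per storm so that
    no window crosses a StormID boundary."""
    X, y = [], []
    for _, storm in df.groupby(group_col, sort=False):
        values = storm[feature_cols].to_numpy()
        targets = storm[target_col].to_numpy()
        for i in range(window, len(storm)):
            X.append(values[i - window:i])  # trailing window of features
            y.append(targets[i])            # next-step target
    return np.asarray(X), np.asarray(y)

# Two storms of lengths 5 and 4 yield (5-3) + (4-3) = 3 windows
df = pd.DataFrame({"StormID": [0] * 5 + [1] * 4,
                   "q": list(range(9)), "y": list(range(9))})
X, y = storm_windows(df, ["q"], "y", window=3)
# X.shape == (3, 3, 1)
```

A naive window over the concatenated frame would produce windows mixing the tail of one storm with the head of the next, which is exactly what this construction prevents.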

Notebook Notes

The notebooks were originally authored in Colab and still contain Google Drive paths such as /content/drive/.... They should now be treated as exploration layers on top of the shared pipeline rather than the place where the core workflow lives.

  1. notebooks/LSTM(uni)_BTI_Storms.ipynb and notebooks/RF_BTI_Storms.ipynb should use data/raw/filtered_storms_df.csv
  2. notebooks/PILSTM_BTI_Storms.ipynb expects data/raw/filtered_df_ET_inf.csv
  3. If the PILSTM dataset is not present, the pilstm CLI path will fail with a clear missing-file message

Usage Example

source .venv/bin/activate

python -m src.train --model rf
python -m src.train --model lstm
python -m src.test --model rf
python -m src.train --all

Useful Overrides

python -m src.train --model lstm --set lstm.epochs=5 --set bootstrap.n_samples=200
python -m src.train --model rf --skip-bootstrap --skip-plots
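A `--set` override with a dotted key typically maps onto a nested config dict. A minimal sketch of that merge, assuming the config is a plain dict and skipping the string-to-number coercion a real CLI would also need:

```python
import copy

def apply_override(config, dotted_key, value):
    """Return a copy of config with config['a']['b'] = value for 'a.b'."""
    config = copy.deepcopy(config)  # leave the base config untouched
    node = config
    keys = dotted_key.split(".")
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config

base = {"lstm": {"epochs": 50}, "bootstrap": {"n_samples": 1000}}
cfg = apply_override(base, "lstm.epochs", 5)
cfg = apply_override(cfg, "bootstrap.n_samples", 200)
```

Copying before mutating keeps the defaults from config/config.yaml intact, so the resolved configuration can be saved separately (see resolved_config.yaml below among the artifacts).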

Saved Artifacts

Each model writes artifacts to results/experiments/<model_name>/:

  1. predictions.csv
  2. metrics.json
  3. metrics.csv
  4. plots/
  5. bootstrap/
  6. resolved_config.yaml
  7. preprocessor.joblib

Serialized models are written to results/models/.
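Artifacts such as preprocessor.joblib can be reloaded with joblib for inference or inspection. A minimal round-trip sketch; the `StandardScaler` and the temporary path are illustrative, not necessarily what the pipeline actually serializes:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a toy preprocessor, persist it the way the pipeline saves
# preprocessor.joblib, then reload and verify identical transforms.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
scaler = StandardScaler().fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessor.joblib"
    joblib.dump(scaler, path)
    restored = joblib.load(path)

np.testing.assert_allclose(scaler.transform(X), restored.transform(X))
```

Note that joblib artifacts are tied to the library versions that produced them, which is why the Best Practices above recommend retraining and resaving after scikit-learn or TensorFlow/Keras upgrades.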
