
PREVAIL: Predictive Response for Emergency Volume Assessment in Incident Locations

An HDSI Capstone Project

Section B19-1: Aditya Surapaneni, Angela Hu, Subika Haider, Suhani Sharma

PREVAIL is a machine-learning pipeline that predicts power outages and crew dispatch requirements for the San Diego Gas & Electric (SDG&E) service territory. It fuses weather station observations with utility outage records, engineers spatio-temporal features using Uber H3 hexagons, and trains XGBoost and ensemble models. The dashboard is live in a separate repository: https://github.com/angela139/prevail-dashboard.

Notes for the TA

  • To get an accurate picture of each member's contribution history, please look at the repository's branches. We each worked on different parts of the project in our own branches and merged them into main once each part was finalized, so not all contributions are reflected in main's history.

  • Additionally, we would like to note that many of our data sources were provided to us directly by our mentors at SDG&E, and due to a data privacy agreement between SDG&E and UCSD we are unable to publish them in this repository.

Thank you!

Project Overview

Electric utilities face significant operational challenges in anticipating and responding to weather-related grid disruptions. Traditional reactive approaches often result in inefficient crew deployment, increased standby costs, and prolonged restoration times during extreme weather events.

To address this operational gap, PREVAIL introduces a framework that forecasts potential grid vulnerabilities over a weekly planning window. Unlike conventional models, this project utilizes a two-stage predictive architecture:

  1. Outage Location Prediction: Identifies geographic areas (hexagonal grid cells) with probable outages due to extreme weather conditions
  2. Crew Size Optimization: Quantifies the precise number of crew members required for restoration in affected areas
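The two-stage flow can be sketched in plain Python. Every function name, feature name, and threshold below is an illustrative stand-in, not the repository's actual API:

```python
def predict_outage_probability(feats):
    # Stage 1 stand-in for the XGBoost outage classifier
    return 0.9 if feats["gust_mph"] > 40 else 0.1

def predict_crew_size(feats):
    # Stage 2 stand-in for the stacked crew-size regressor
    return max(1, round(feats["customers_affected"] / 500))

def plan_dispatch(hex_features, threshold=0.5):
    """Run stage 2 only on the hexes that stage 1 flags as likely outages."""
    return {
        hex_id: predict_crew_size(feats)
        for hex_id, feats in hex_features.items()
        if predict_outage_probability(feats) >= threshold
    }

hexes = {
    "872830828ffffff": {"gust_mph": 55, "customers_affected": 1200},
    "872830829ffffff": {"gust_mph": 12, "customers_affected": 300},
}
print(plan_dispatch(hexes))  # → {'872830828ffffff': 2}
```

The design point is that the expensive crew-size regression only runs on cells the classifier has already flagged, which keeps weekly planning tractable across the whole service territory.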

Dataset

By engineering a novel spatio-temporal linkage between historical outage logs and crew dispatch records using a ZIP code proxy, we constructed a training dataset of over 1,500 verified adverse weather-related responses. This dataset enables:

  • Proactive crew staging - Position crews before incidents occur
  • Resource optimization - Reduce standby costs while enhancing grid reliability
  • Data-driven decision making - Forecasts for operational planning

The system combines weather data, power outage records, and crew deployment information using:

  • Spatial analysis with H3 hexagonal indexing for geographic granularity
  • Time-series modeling with temporal lag features and rolling aggregations
  • Ensemble machine learning including XGBoost, Random Forest, and Lasso regression
  • Interactive visualization through a geospatial dashboard (see prevail-dashboard)

Pipeline Overview

data/all_weather.parquet
        │
        ├──────────────────────────────────────────────────────────┐
        ▼                                                          ▼
master_dataset_hourly_build.py                         outage_span_weather_10311_clean_v2.parquet
  • station → H3 hex mapping (res=7)                   (outage events + nearest-station weather)
  • hex-hour weather aggregation                                    │
  • extreme weather flags (q95/q05)                                │
  • outage start labels per hex-hour                   ◄───────────┘
  • future outage targets (1/3/6/12/24h)
  • lag & rolling features (1/3/6/12/24h)
        │
        ▼
data/master_dataset_hex_hour_v1.parquet  ←─── primary ML dataset
        │
        ├──► outage_prediction_model.py  →  XGBoost outage classifier
        │
        ├──► outage_sort_merge.py  →  data/merged_outage_sort_data.csv
        │         (ZIP code KDTree match + SORT crew dispatch join)
        │
        ├──► crew_size_prediction_approach_1.py  →  api/models/trained/
        │         LASSO → RandomForest → XGBoost(Poisson) → Stacking
        │
        └──► crew_size_final.py  →  data/predictions_final.csv
                  (generates predictions for the dashboard)

(dashboard → https://github.com/angela139/prevail-dashboard)
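The lag and rolling steps in the diagram can be sketched with pandas. The column names ("hex_id", "ts", "wind_mph") and the function name are illustrative assumptions:

```python
import pandas as pd

def add_lag_and_rolling(df, col="wind_mph", windows=(1, 3, 6)):
    """Per-hex lag and history-only rolling features, one column per window."""
    df = df.sort_values(["hex_id", "ts"]).reset_index(drop=True).copy()
    g = df.groupby("hex_id")[col]
    for h in windows:
        df[f"{col}_lag_{h}h"] = g.shift(h)
        # shifting by one hour first keeps every rolling window
        # history-only: the current hour never sees its own value
        roll = g.shift(1).groupby(df["hex_id"]).rolling(h, min_periods=1).mean()
        df[f"{col}_roll_mean_{h}h"] = roll.reset_index(level=0, drop=True)
    return df
```

Grouping by hex id before shifting matters: it prevents the last hour of one hexagon from leaking into the first lag of the next hexagon in the sorted frame.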

Prerequisites

Python Environment

  • Python 3.11+ is required
  • All packages are pinned in requirements.txt
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt

Required Data Files

The following input files must be present before running the pipeline. They are not committed to the repository due to size and data-use agreements.

data/all_weather.parquet
    Consolidated weather station observations (temp °F, wind/gust mph, humidity %)
data/outage_span_weather_10311_clean_v2.parquet
    Outage events joined to nearest weather station observations; includes span/conductor/topology fields
data/SORT/REP_ORD_ORDER.parquet
    SORT work orders
data/SORT/REP_LAB_BUSINESS.parquet
    SORT business/crew type lookup
data/SORT/REP_ASN_ASSIGNMENT.parquet
    SORT crew assignment resource counts
data/2022_Gaz_zcta_national.txt
    USCB ZCTA ZIP code centroid coordinates (2022 Gazetteer)

Pipeline Execution Order

Run these scripts in order to reproduce all output datasets from scratch.

Step 1 — Build the master dataset

python master_v3_creation.py

Inputs: data/outage_span_weather_10311_clean_v2.parquet, data/all_weather.parquet
Outputs: data/master_v3.csv, data/master_v3_cause_thresholds.csv, data/master_v3_cause_thresholds_with_lags.csv

Step 2 — Merge outages with SORT crew dispatch records

python outage_sort_merge.py

Inputs: data/outage_and_weather_data.parquet, data/2022_Gaz_zcta_national.txt, data/SORT/*.parquet
Outputs: data/merged_outage_sort_data.csv
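The nearest-ZIP assignment in this step can be illustrated with a KDTree over ZCTA centroids. This is a hypothetical sketch, and the centroid coordinates below are rough approximations rather than Gazetteer values:

```python
import numpy as np
from scipy.spatial import cKDTree

# Illustrative ZCTA centroids (lat, lon), not the real Gazetteer rows
zip_ids = ["92101", "92103", "92037"]
centroids = np.array([
    [32.7194, -117.1628],   # downtown San Diego
    [32.7470, -117.1670],   # Hillcrest
    [32.8473, -117.2742],   # La Jolla
])
tree = cKDTree(centroids)

# Each outage location is matched to the index of its closest centroid
outages = np.array([[32.72, -117.16], [32.85, -117.27]])
_, nearest = tree.query(outages)
matched_zips = [zip_ids[i] for i in nearest]
print(matched_zips)  # → ['92101', '92037']
```

Once every outage carries a ZIP code, it can be joined to the SORT dispatch records, which identify work locations by ZIP rather than by coordinates.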

Step 3 — Train the outage prediction model

python outage_prediction_model.py

Inputs: data/master_dataset_hex_hour_v1.parquet (via outage_prediction_feature_engineering.py)
Outputs: roc_curve_xgboost.png, console classification report

Step 4 — Train the crew size prediction model

python crew_size_prediction_approach_1.py

Inputs: data/merged_outage_sort_data.csv, data/master_dataset_hex_hour_v1.parquet
Outputs: api/models/trained/ — serialized model artifacts

Step 5 — Generate dashboard predictions

python crew_size_final.py

Inputs: data/merged_outage_sort_data.csv, data/master_dataset_hex_hour_v1.parquet
Outputs: data/predictions_final.csv — crew size predictions used by the dashboard


Unit Testing

All unit tests live in the tests/ directory. No data files are required — every test uses small synthetic DataFrames defined in tests/conftest.py so the full suite can be run immediately after cloning, before any data is present.

Running the Tests

# Run every test with verbose per-test output (recommended)
pytest tests/ -v

# Run only one module at a time
pytest tests/test_feature_engineering.py -v
pytest tests/test_spatial_utils.py -v
pytest tests/test_preprocessing.py -v

Expected: 67 tests pass in under 10 seconds.

Test File Reference

tests/conftest.py (shared fixtures)
    Defines all synthetic DataFrames reused across the three test files. Read this first to understand the shape and values of every test input.
tests/test_feature_engineering.py (tests master_dataset_hourly_build.py)
    Covers mode_or_nan (categorical aggregation edge cases); add_extreme_weather_flags_hourly (q95/q05 thresholds, binary output, union logic); add_lag_and_rolling_features_hourly (exact lag values, history-only rolling mean/max, column naming); add_future_outage_targets (forward-looking target correctness, no current-hour leakage).
tests/test_spatial_utils.py (tests utils/spatial_utils.py)
    Covers lat_lon_to_h3 (determinism, resolution separation); add_hex_ids_to_df (column creation, null safety, custom column names); map_outages_to_zip_codes (KDTree nearest-ZIP assignment, up/down lat-lon fallback); clean_zip_codes (ZIP+4 strip, float suffix strip, nan drop, non-5-digit drop).
tests/test_preprocessing.py (tests preprocessing/crew_data_cleaning.py)
    Covers remove_duplicates (count, custom subset, idempotency); infer_missing_durations (90-min span filled, non-zero duration protected, temp column removed); filter_weather_related_outages (flag=True kept, flag=False dropped); convert_numeric_columns (coercion, unparsable → NaN); convert_datetime_columns_full (tz-aware output, NaT on bad input).
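The q95/q05 flag logic those feature-engineering tests exercise can be shown with a simplified stand-in (the real add_extreme_weather_flags_hourly operates per hex-hour; the function below only keeps the quantile and union logic):

```python
import pandas as pd

def add_extreme_weather_flags(df, cols=("gust_mph",), hi_q=0.95, lo_q=0.05):
    """Flag rows above the 95th or below the 5th percentile of each column."""
    df = df.copy()
    flag_cols = []
    for c in cols:
        hi, lo = df[c].quantile(hi_q), df[c].quantile(lo_q)
        df[f"{c}_extreme"] = ((df[c] >= hi) | (df[c] <= lo)).astype(int)
        flag_cols.append(f"{c}_extreme")
    # union logic: any single extreme column sets the overall flag
    df["extreme_weather"] = df[flag_cols].max(axis=1)
    return df

demo = pd.DataFrame({"gust_mph": [float(v) for v in range(1, 21)]})
flagged = add_extreme_weather_flags(demo)
print(flagged["extreme_weather"].sum())  # → 2 (only lowest and highest rows)
```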

How fixtures work (tests/conftest.py)

pytest automatically loads conftest.py before any test file runs. Any function decorated with @pytest.fixture in that file is available as a parameter in any test method — pytest injects the return value automatically. For example, hourly_weather_df is a 12-row DataFrame (2 hexes × 6 hours) where the last row of each hex is deliberately set to extreme weather values so the threshold tests have a predictable ground truth.
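A toy version of that fixture pattern (the real hourly_weather_df in tests/conftest.py is larger, with 2 hexes × 6 hours; the values here are invented):

```python
import pandas as pd
import pytest

def make_hourly_weather_df():
    # last row is deliberately extreme so a threshold test has
    # a predictable ground truth
    return pd.DataFrame({
        "hex_id": ["872830828ffffff"] * 3,
        "gust_mph": [10.0, 12.0, 80.0],
    })

# registering the builder as a fixture lets any test request it by name
hourly_weather_df = pytest.fixture(make_hourly_weather_df)

def test_last_row_is_extreme(hourly_weather_df):
    assert hourly_weather_df["gust_mph"].idxmax() == 2
```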


Project Structure

PREVAIL/
├── master_v3_creation.py               # Step 1: per-outage summary dataset
├── outage_prediction_data.py           # Weekly/daily weather+outage joins
├── outage_prediction_feature_engineering.py
├── outage_prediction_model.py          # XGBoost outage classifier
├── outage_sort_merge.py                # Outage ↔ SORT crew dispatch merge
├── crew_size_prediction_approach_1.py  # LASSO → RF → XGBoost → Stacking
├── crew_size_prediction_approach_2.py
├── crew_size_final.py                  # Generates predictions for the dashboard
├── crew_feature_engineering.py
├── requirements.txt
├── README.md
│
├── preprocessing/
│   ├── weather_preprocessing.py
│   ├── crew_data_cleaning.py
│   ├── crew_data_loading.py
│   └── sort_data_processing.py
│
├── utils/
│   ├── pipeline_utils.py
│   └── spatial_utils.py
│
├── tests/                              # Unit tests (no data files required)
│   ├── conftest.py
│   ├── test_feature_engineering.py
│   ├── test_spatial_utils.py
│   └── test_preprocessing.py
│
├── data/                               # Input + output datasets (not in git)
└── logs/                               # Pipeline run logs (auto-created)

Future Work

  • Real-Time API Integration: The most immediate enhancement would be integrating a live weather API. By streaming real-time meteorological data directly into the inference pipeline, the model could generate dynamic, on-the-fly workforce forecasts as storms evolve, rather than relying on batch-processed historical telemetry.

  • Logistical Refinement: The deployment pipeline could be further optimized by incorporating real-time traffic and road closure data. Integrating these variables would allow the model to adjust staging recommendations based on the actual travel time required for crews to reach an incident location during adverse conditions.

  • Multimodal Failure Prediction: While the current framework focuses on personnel counts, the underlying Poisson architecture could be expanded to predict specific equipment failures — such as transformer blowouts versus vegetation-related line faults. Mapping specific hardware needs alongside crew sizes would provide a more holistic logistical solution for emergency response.

