An HDSI Capstone Project
Section B19-1: Aditya Surapaneni, Angela Hu, Subika Haider, Suhani Sharma
PREVAIL is a machine-learning pipeline that predicts power outages and crew dispatch requirements for the San Diego Gas & Electric (SDG&E) service territory. It fuses weather station observations with utility outage records, engineers spatio-temporal features using Uber H3 hexagons, and trains XGBoost and ensemble models. The dashboard is live in a separate repository: https://github.com/angela139/prevail-dashboard.
Notes for the TA
- To get an accurate picture of each member's contribution history, please look at the individual branches in the repository. We each developed different parts of the project on our own branches and merged them into main once finalized, so not all contributions are reflected in the main branch's history.
- Additionally, many of our data sources were given to us directly by our mentors at SDG&E; due to a data privacy agreement between SDG&E and UCSD, we are not able to publish them in this repo.
Thank you!
Electric utilities face significant operational challenges in anticipating and responding to weather-related grid disruptions. Traditional reactive approaches often result in inefficient crew deployment, increased standby costs, and prolonged restoration times during extreme weather events.
To address this operational gap, PREVAIL introduces a framework that forecasts potential grid vulnerabilities over a weekly planning window. Unlike conventional models, this project utilizes a two-stage predictive architecture:
- Outage Location Prediction: Identifies geographic areas (hexagonal grid cells) with probable outages due to extreme weather conditions
- Crew Size Optimization: Estimates the number of crew members required for restoration in affected areas
By engineering a novel spatio-temporal linkage between historical outage logs and crew dispatch records using a ZIP code proxy, we constructed a training dataset of over 1,500 verified adverse weather-related responses. This dataset enables:
- Proactive crew staging - Position crews before incidents occur
- Resource optimization - Reduce standby costs while enhancing grid reliability
- Data-driven decision making - Supply forecasts to ground operational planning
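The ZIP code proxy linkage above can be sketched as a nearest-centroid lookup. This is a minimal illustration, not the pipeline's actual code: the coordinates and column layout below are invented, and it assumes `scipy` and `numpy` are available (the real logic lives in `map_outages_to_zip_codes` in `utils/spatial_utils.py`).

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical ZCTA centroids (lat, lon), as the Gazetteer file provides.
zip_codes = ["92101", "92037", "92026"]
centroids = np.array([
    [32.7192, -117.1629],  # 92101 (downtown San Diego)
    [32.8328, -117.2713],  # 92037 (La Jolla)
    [33.1700, -117.1200],  # 92026 (Escondido, approximate)
])

# Hypothetical outage coordinates needing a ZIP assignment.
outages = np.array([
    [32.7200, -117.1600],
    [33.1500, -117.1100],
])

tree = cKDTree(centroids)          # index ZIP centroids once
_, idx = tree.query(outages, k=1)  # nearest centroid per outage
nearest_zip = [zip_codes[i] for i in idx]
print(nearest_zip)  # -> ['92101', '92026']
```

Euclidean distance on raw lat/lon degrees is only a rough proxy for ground distance, which is acceptable here because the lookup only needs the *nearest* of a few local centroids, not true distances.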
The system combines weather data, power outage records, and crew deployment information using:
- Spatial analysis with H3 hexagonal indexing for geographic granularity
- Time-series modeling with temporal lag features and rolling aggregations
- Ensemble machine learning including XGBoost, Random Forest, and Lasso regression
- Interactive visualization through a geospatial dashboard (see prevail-dashboard)
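The lag and rolling-aggregation idea can be sketched in a few lines of pandas. The column and feature names below are illustrative, not the pipeline's actual schema; the key detail is shifting before rolling so each feature uses only past hours:

```python
import pandas as pd

# Toy hex-hour series: one hex, six hourly wind observations.
df = pd.DataFrame({
    "hex_id": ["abc"] * 6,
    "hour": pd.date_range("2024-01-01", periods=6, freq="h"),
    "wind_mph": [5.0, 8.0, 12.0, 30.0, 9.0, 7.0],
})

# Lag feature: the value one hour earlier, computed per hex.
df["wind_lag_1h"] = df.groupby("hex_id")["wind_mph"].shift(1)

# History-only rolling mean: shift(1) first so the current hour is excluded,
# preventing target leakage when predicting the current/future hour.
df["wind_roll_mean_3h"] = (
    df.groupby("hex_id")["wind_mph"]
      .transform(lambda s: s.shift(1).rolling(3).mean())
)
```

The first rows of each hex are NaN by construction, since no history exists yet for them.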
data/all_weather.parquet
│
├──────────────────────────────────────────────────────────┐
▼ ▼
master_dataset_hourly_build.py outage_span_weather_10311_clean_v2.parquet
• station → H3 hex mapping (res=7) (outage events + nearest-station weather)
• hex-hour weather aggregation │
• extreme weather flags (q95/q05) │
• outage start labels per hex-hour ◄───────────┘
• future outage targets (1/3/6/12/24h)
• lag & rolling features (1/3/6/12/24h)
│
▼
data/master_dataset_hex_hour_v1.parquet ←─── primary ML dataset
│
├──► outage_prediction_model.py → XGBoost outage classifier
│
├──► outage_sort_merge.py → data/merged_outage_sort_data.csv
│ (ZIP code KDTree match + SORT crew dispatch join)
│
├──► crew_size_prediction_approach_1.py → api/models/trained/
│ LASSO → RandomForest → XGBoost(Poisson) → Stacking
│
└──► crew_size_final.py → data/predictions_final.csv
(generates predictions for the dashboard)
(dashboard → https://github.com/angela139/prevail-dashboard)
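The extreme weather flags (q95/q05) in the diagram amount to quantile thresholding with union logic. A minimal sketch with invented column names and toy data, not the pipeline's actual thresholds:

```python
import pandas as pd

weather = pd.DataFrame({
    "gust_mph":     [10, 12, 11, 13, 9, 14, 10, 11, 12, 55],  # one extreme gust
    "humidity_pct": [40, 42, 38, 41, 39, 43, 40, 5, 42, 41],  # one extreme dry hour
})

# Upper-tail threshold for gusts, lower-tail threshold for humidity.
q95_gust = weather["gust_mph"].quantile(0.95)
q05_hum = weather["humidity_pct"].quantile(0.05)

# Union logic: flag hours that are extremely windy OR extremely dry.
weather["extreme_flag"] = (
    (weather["gust_mph"] >= q95_gust) | (weather["humidity_pct"] <= q05_hum)
).astype(int)
```

On this toy data exactly two rows are flagged: the 55 mph gust and the 5% humidity hour.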
- Python 3.11+ is required
- All packages are pinned in `requirements.txt`
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt

The following input files must be present before running the pipeline. They are not committed to the repository due to size and data-use agreements.
| File | Description |
|---|---|
| `data/all_weather.parquet` | Consolidated weather station observations (temp °F, wind/gust mph, humidity %) |
| `data/outage_span_weather_10311_clean_v2.parquet` | Outage events joined to nearest weather station observations; includes span/conductor/topology fields |
| `data/SORT/REP_ORD_ORDER.parquet` | SORT work orders |
| `data/SORT/REP_LAB_BUSINESS.parquet` | SORT business/crew type lookup |
| `data/SORT/REP_ASN_ASSIGNMENT.parquet` | SORT crew assignment resource counts |
| `data/2022_Gaz_zcta_national.txt` | USCB ZCTA ZIP code centroid coordinates (2022 Gazetteer) |
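Since these inputs are not committed, a quick preflight check can confirm they all exist before any pipeline script runs. This helper is a convenience sketch, not part of the repository:

```python
from pathlib import Path

# Required input files, relative to the repository root (from the table above).
REQUIRED = [
    "data/all_weather.parquet",
    "data/outage_span_weather_10311_clean_v2.parquet",
    "data/SORT/REP_ORD_ORDER.parquet",
    "data/SORT/REP_LAB_BUSINESS.parquet",
    "data/SORT/REP_ASN_ASSIGNMENT.parquet",
    "data/2022_Gaz_zcta_national.txt",
]

def missing_inputs(root: str = ".") -> list[str]:
    """Return the required input paths that are absent under `root`."""
    base = Path(root)
    return [p for p in REQUIRED if not (base / p).exists()]

if __name__ == "__main__":
    gaps = missing_inputs()
    if gaps:
        print("Missing required inputs:")
        for p in gaps:
            print("  -", p)
    else:
        print("All required inputs present.")
```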
Run these scripts in order to reproduce all output datasets from scratch.
python master_v3_creation.py

Inputs: data/outage_span_weather_10311_clean_v2.parquet, data/all_weather.parquet
Outputs: data/master_v3.csv, data/master_v3_cause_thresholds.csv, data/master_v3_cause_thresholds_with_lags.csv
python outage_sort_merge.py

Inputs: data/outage_and_weather_data.parquet, data/2022_Gaz_zcta_national.txt, data/SORT/*.parquet
Outputs: data/merged_outage_sort_data.csv
python outage_prediction_model.py

Inputs: data/master_dataset_hex_hour_v1.parquet (via outage_prediction_feature_engineering.py)
Outputs: roc_curve_xgboost.png, console classification report
python crew_size_prediction_approach_1.py

Inputs: data/merged_outage_sort_data.csv, data/master_dataset_hex_hour_v1.parquet
Outputs: api/models/trained/ — serialized model artifacts
python crew_size_final.py

Inputs: data/merged_outage_sort_data.csv, data/master_dataset_hex_hour_v1.parquet
Outputs: data/predictions_final.csv — crew size predictions used by the dashboard
All unit tests live in the tests/ directory. No data files are required — every test uses small synthetic DataFrames defined in tests/conftest.py so the full suite can be run immediately after cloning, before any data is present.
# Run every test with verbose per-test output (recommended)
pytest tests/ -v
# Run only one module at a time
pytest tests/test_feature_engineering.py -v
pytest tests/test_spatial_utils.py -v
pytest tests/test_preprocessing.py -v

Expected: 67 tests pass in under 10 seconds.
| File | Source module tested | What it covers |
|---|---|---|
| `tests/conftest.py` | (shared fixtures) | Defines all synthetic DataFrames reused across the three test files. Read this first to understand the shape and values of every test input. |
| `tests/test_feature_engineering.py` | `master_dataset_hourly_build.py` | `mode_or_nan` (categorical aggregation edge cases); `add_extreme_weather_flags_hourly` (q95/q05 thresholds, binary output, union logic); `add_lag_and_rolling_features_hourly` (exact lag values, history-only rolling mean/max, column naming); `add_future_outage_targets` (forward-looking target correctness, no current-hour leakage) |
| `tests/test_spatial_utils.py` | `utils/spatial_utils.py` | `lat_lon_to_h3` (determinism, resolution separation); `add_hex_ids_to_df` (column creation, null safety, custom column names); `map_outages_to_zip_codes` (KDTree nearest-ZIP assignment, up/down lat-lon fallback); `clean_zip_codes` (ZIP+4 strip, float suffix strip, nan drop, non-5-digit drop) |
| `tests/test_preprocessing.py` | `preprocessing/crew_data_cleaning.py` | `remove_duplicates` (count, custom subset, idempotency); `infer_missing_durations` (90-min span filled, non-zero duration protected, temp column removed); `filter_weather_related_outages` (flag=True kept, flag=False dropped); `convert_numeric_columns` (coercion, unparsable → NaN); `convert_datetime_columns_full` (tz-aware output, NaT on bad input) |
pytest automatically loads conftest.py before any test file runs. Any function decorated with @pytest.fixture in that file is available as a parameter in any test method — pytest injects the return value automatically. For example, hourly_weather_df is a 12-row DataFrame (2 hexes × 6 hours) where the last row of each hex is deliberately set to extreme weather values so the threshold tests have a predictable ground truth.
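A hypothetical mini-version of this pattern (the fixture name, columns, and values below are invented for illustration; the project's real fixtures live in `tests/conftest.py`):

```python
# conftest.py (illustrative)
import pandas as pd
import pytest

@pytest.fixture
def tiny_weather_df():
    # Two hexes x two hours; the last row of each hex gets an extreme gust,
    # mirroring how the real fixtures plant predictable ground truth.
    return pd.DataFrame({
        "hex_id": ["a", "a", "b", "b"],
        "gust_mph": [10.0, 60.0, 12.0, 65.0],
    })

# test_example.py (illustrative)
def test_max_gust_per_hex(tiny_weather_df):
    # pytest matches the parameter name to the fixture and injects its
    # return value automatically -- no explicit call needed.
    maxes = tiny_weather_df.groupby("hex_id")["gust_mph"].max()
    assert maxes.loc["a"] == 60.0 and maxes.loc["b"] == 65.0
```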
PREVAIL/
├── master_v3_creation.py # Step 1: per-outage summary dataset
├── outage_prediction_data.py # Weekly/daily weather+outage joins
├── outage_prediction_feature_engineering.py
├── outage_prediction_model.py # XGBoost outage classifier
├── outage_sort_merge.py # Outage ↔ SORT crew dispatch merge
├── crew_size_prediction_approach_1.py # LASSO → RF → XGBoost → Stacking
├── crew_size_prediction_approach_2.py
├── crew_size_final.py # Generates predictions for the dashboard
├── crew_feature_engineering.py
├── requirements.txt
├── README.md
│
├── preprocessing/
│ ├── weather_preprocessing.py
│ ├── crew_data_cleaning.py
│ ├── crew_data_loading.py
│ └── sort_data_processing.py
│
├── utils/
│ ├── pipeline_utils.py
│ └── spatial_utils.py
│
├── tests/ # Unit tests (no data files required)
│ ├── conftest.py
│ ├── test_feature_engineering.py
│ ├── test_spatial_utils.py
│ └── test_preprocessing.py
│
├── data/ # Input + output datasets (not in git)
└── logs/ # Pipeline run logs (auto-created)
- Real-Time API Integration: The most immediate enhancement would be integrating a live weather API. By streaming real-time meteorological data directly into the inference pipeline, the model could generate dynamic, on-the-fly workforce forecasts as storms evolve, rather than relying on batch-processed historical telemetry.
- Logistical Refinement: The deployment pipeline could be further optimized by incorporating real-time traffic and road closure data. Integrating these variables would allow the model to adjust staging recommendations based on the actual travel time required for crews to reach an incident location during adverse conditions.
- Multimodal Failure Prediction: While the current framework focuses on personnel counts, the underlying Poisson architecture could be expanded to predict specific equipment failures, such as transformer blowouts versus vegetation-related line faults. Mapping specific hardware needs alongside crew sizes would provide a more holistic logistical solution for emergency response.