Parametric cost estimation for LA Basin natural gas pipeline projects.
ML regression model · Azure Static Web Apps · Python Azure Functions · BLS PPI inflation adjustment · AACE Class 5 compliant output
Most pipeline cost estimates at early project stages are gut-feel spreadsheets. This tool replaces that with a production-deployed ML regression model trained on 418 completed LA Basin pipeline projects, wrapped in a full-stack web application that generates AACE Class 5 compliant estimates in seconds.
It is actively used internally at a major Southern California natural gas utility for capital project budget planning and portfolio-level cost screening.
- Instant parametric estimates — diameter, length, dig count, project type, and city as inputs; project cost + contingency + total budget as output
- Cost escalation engine — compounds costs forward to any construction quarter (2025–2035) using a configurable inflation rate
- AACE Class 5 accuracy range — −50% / +100% bounds rendered on an interactive range track
- Full Basis of Estimate (BOE) report — exportable PDF with TIC cost breakdown, methodology, assumptions, and accuracy statement; AACE RP 34R-05 compliant
- Model trace panel — per-feature scaled values, coefficients, and contributions; full transparency into what is driving each estimate
- Historical dataset explorer — filterable, sortable paginated table of the 418 training-set records with city and project type filters
- Excel export — two-sheet workbook (Estimate Summary + Cost Breakdown) with model performance metadata
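The escalation and accuracy-range arithmetic behind the bullets above can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and simple annual compounding over fractional years is an assumption — the deployed engine's exact compounding convention may differ.

```python
def escalate(cost_2025: float, rate_pct: float, years_from_base: float) -> float:
    """Compound a 2025-dollar cost forward to the construction quarter.

    Assumes simple annual compounding over fractional years (an assumption;
    the deployed engine's convention may differ).
    """
    return cost_2025 * (1 + rate_pct / 100) ** years_from_base


def aace_class5_range(total_budget: float) -> dict:
    """AACE Class 5 accuracy band: -50% / +100% around the point estimate."""
    return {"low": 0.5 * total_budget,
            "estimate": total_budget,
            "high": 2.0 * total_budget}
```

The −50% / +100% multipliers reproduce the range track figures in the API response example further down.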
Browser (Azure Static Web App)
│
├── index.html + css/styles.css
└── js/
├── api.js → POST /api/predict, GET /api/health
├── main.js → input handling, escalation preview
├── ui.js → rendering layer (estimate cards, BOE, trace table)
├── data.js → dataset explorer
└── export.js → Excel (SheetJS) and PDF exports
Azure Functions (Python v2) — /api
├── predict/ → POST /api/predict
│ └── validates inputs → calls run_estimate() → returns JSON
└── health/ → GET /api/health
api/shared/
├── predictor.py → singleton model loader (Azure Blob Storage), run_estimate()
└── m5_transformer.py → sklearn transformer (OHE + feature engineering + scaler)
Azure Blob Storage
└── model-artifacts/
├── pipeline.joblib
└── model_metadata.json
train/
├── train.py → full training pipeline (BLS PPI fetch → Lasso selection → LR → joblib)
└── data/ProjectData_LA.xlsx
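The "validates inputs" step in predict/ can be sketched like this, using the field names from the request example later in this README. The exact checks and error format in the deployed function are assumptions here.

```python
# Sketch of the input-validation step in predict/__init__.py (illustrative;
# the deployed function's checks and error format may differ).
REQUIRED_FIELDS = {
    "pipe_diameter": (int, float),
    "pipe_length": (int, float),
    "num_digs": int,
    "project_type": str,
    "city": str,
}


def validate(payload: dict) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS
              if name not in payload]
    errors += [f"wrong type for {name}"
               for name, types in REQUIRED_FIELDS.items()
               if name in payload and not isinstance(payload[name], types)]
    return errors
```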
| Property | Value |
|---|---|
| Algorithm | Lasso feature selection → Linear Regression on log-transformed cost |
| Training set | 418 LA Basin pipeline projects · 19 cities · 4 districts |
| Test R² (log scale) | 0.888 |
| MAE | $44K (2025 dollars) |
| MAPE | 19.0% |
| CV R² (5-fold) | 0.899 ± 0.015 |
| Features selected | 24 of 37 candidates |
| Inflation adjustment | BLS PPI series WPUIP2311001 (Oil & Gas Field Machinery) — actuals normalised to 2025 dollars before training |
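The PPI normalisation in the table's last row scales each historical actual by the ratio of the 2025 index level to the index level in the project's year. A minimal sketch with made-up index values — the real levels come from the BLS public API at training time:

```python
# Illustrative PPI index levels by year; the real series (WPUIP2311001)
# is fetched from the BLS public API during training.
PPI = {2019: 180.0, 2021: 205.0, 2023: 240.0, 2025: 260.0}


def to_2025_dollars(actual_cost: float, project_year: int) -> float:
    """Normalise a historical actual to 2025 dollars via the PPI ratio."""
    return actual_cost * PPI[2025] / PPI[project_year]
```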
Feature engineering: log(diameter × length), diameter², digs × diameter, log(digs), log(length) — plus one-hot encoded project type, city, and district. Lasso with 5-fold CV selects the final feature set; a plain LinearRegression is then fit on the selected features for full interpretability.
Why log-transform? Pipeline costs span nearly two orders of magnitude. Log-transforming the target produces normally distributed residuals and better-behaved regression coefficients, with exp() used at inference to return dollar values.
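The fit-on-log, exp-at-inference pattern can be sketched on toy data (illustrative numbers, not the project dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy inputs [diameter, length]; costs span ~two orders of magnitude.
X = np.array([[4, 200], [8, 800], [12, 3000], [16, 9000]], dtype=float)
y = np.array([60_000, 300_000, 1_200_000, 4_800_000], dtype=float)

model = LinearRegression().fit(X, np.log(y))   # train on log cost
pred_dollars = np.exp(model.predict(X))        # exp() back to dollars at inference
```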
| Layer | Technology |
|---|---|
| Frontend hosting | Azure Static Web Apps |
| Backend compute | Azure Functions (Python v2) |
| Model storage | Azure Blob Storage |
| ML framework | scikit-learn 1.5.2 |
| Data | pandas, numpy |
| Model persistence | joblib |
| Inflation data | BLS Public API (no key required) |
| Excel export | SheetJS (browser-side) |
| PDF export | Browser print API |
| CI/CD | GitHub Actions → Azure SWA |
swa-estimator/
├── index.html
├── css/styles.css
├── js/
│ ├── api.js
│ ├── data.js
│ ├── dataset.json
│ ├── export.js
│ ├── main.js
│ └── ui.js
├── staticwebapp.config.json
├── api/
│ ├── host.json
│ ├── requirements.txt
│ ├── predict/__init__.py ← POST /api/predict
│ ├── health/__init__.py ← GET /api/health
│ └── shared/
│ ├── predictor.py
│ ├── m5_transformer.py
│ └── model/ ← populated from Azure Blob at cold-start
│ ├── pipeline.joblib
│ └── model_metadata.json
└── train/
├── train.py
└── data/ProjectData_LA.xlsx
# Install dependencies
npm install -g @azure/static-web-apps-cli
pip install -r api/requirements.txt
# Start local emulator (serves frontend + functions together)
swa start . --api-location api
# → http://localhost:4280

Model artifacts are downloaded from Azure Blob Storage at cold-start via `AZURE_STORAGE_CONNECTION_STRING`. For local development, set this in `api/local.settings.json`.
cd swa-estimator/
source api/.venv/bin/activate
python -m train.train

This will:
- Fetch current BLS PPI data (falls back to hardcoded values if offline)
- Normalise all historical actuals to 2025 dollars
- Engineer features and run LassoCV feature selection
- Fit a LinearRegression on the selected features
- Run smoke tests and print a summary
- Save `pipeline.joblib` and `model_metadata.json` to `api/shared/model/`
Then upload the artifacts to Azure Blob Storage for the Functions backend to consume.
1. Create the Azure Static Web App (Azure Portal → Static Web Apps → Create)
   - App location: `/`
   - API location: `api`
   - Output location: (leave blank)
2. Add the GitHub secret `AZURE_STATIC_WEB_APPS_API_TOKEN`
3. Set application settings in the Azure Portal (Functions → Configuration):
   - `AZURE_STORAGE_CONNECTION_STRING`
   - `BLOB_CONTAINER_NAME` = `model-artifacts`
   - `BLOB_MODEL_NAME` = `pipeline.joblib`
   - `BLOB_META_NAME` = `model_metadata.json`
4. Push to `main` — GitHub Actions handles the rest.
`GET /api/health` — returns model load status and a metadata summary.

`POST /api/predict`

Request:
{
"pipe_diameter": 8,
"pipe_length": 1356,
"num_digs": 7,
"project_type": "Pipe Replacement",
"city": "Glendale",
"contingency_pct": 10,
"construction_year": 2027,
"construction_quarter": 2
}

Response (abbreviated):
{
"costs": {
"base_cost_2025": 485000,
"project_cost": 524600,
"contingency_amt": 52460,
"total_budget": 577060
},
"escalation": {
"years_from_base": 2.25,
"escalation_factor": 1.0824,
"inflation_rate_pct": 4.0
},
"aace_range": { "low": 288530, "estimate": 577060, "high": 1154120 },
"trace": {
"feature_names": ["Digs × Diameter", "..."],
"contributions": [0.0312, "..."]
}
}

Singleton model loading — predictor.py uses a module-level _pipeline variable so the model is loaded once at Azure Functions cold-start and reused across all subsequent invocations. Avoids the ~2–3s joblib load penalty on every request.
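A minimal sketch of the pattern. The loader is parameterised here purely so the sketch is self-contained; the real module calls `joblib.load` directly.

```python
from typing import Any, Callable

_pipeline: Any = None  # module-level cache: survives across warm invocations


def get_pipeline(path: str, load: Callable[[str], Any]) -> Any:
    """Load the model once per worker process; every later call reuses it."""
    global _pipeline
    if _pipeline is None:       # only true on the first call after cold start
        _pipeline = load(path)  # the expensive deserialisation, paid once
    return _pipeline
```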
Blob Storage over bundled artifacts — The trained pipeline.joblib is not committed to the repo. It is uploaded to Azure Blob Storage post-training and pulled down to /tmp/model/ at cold-start. This keeps the scikit-learn version decoupled from the deployment process and makes model updates a blob upload, not a redeploy.
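The cold-start pull could look roughly like this with the `azure-storage-blob` SDK, using the container and blob names from the application settings above. The function name and cache-check logic are illustrative, not the repository's actual code.

```python
import os
from pathlib import Path

MODEL_DIR = Path("/tmp/model")


def ensure_model(container: str = "model-artifacts",
                 blob_name: str = "pipeline.joblib") -> Path:
    """Download the artifact on cold start; warm invocations hit the cache."""
    target = MODEL_DIR / blob_name
    if target.exists():                      # warm worker: already cached
        return target
    # Imported lazily so unit tests don't need the Azure SDK installed.
    from azure.storage.blob import BlobServiceClient
    client = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"])
    MODEL_DIR.mkdir(parents=True, exist_ok=True)
    data = client.get_blob_client(container, blob_name).download_blob().readall()
    target.write_bytes(data)
    return target
```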
Hardcoded trace coefficients — predictor.py maintains a parallel set of COEFS and FEATURE_LABELS for the explanation trace panel. This avoids introspecting joblib internals for a UI feature and makes the explanation layer robust to future model packaging changes.
Lasso then LinearRegression — LassoCV is used purely for feature selection (zero-out coefficients), not as the final estimator. A plain LinearRegression is then fit on the selected feature subset, giving clean, interpretable coefficients with no regularisation shrinkage bias in the final estimates.
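The two-stage fit can be sketched on synthetic data — LassoCV zeroes out uninformative columns, then an unregularised LinearRegression is refit on the survivors:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for the candidate feature matrix
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Stage 1: LassoCV for selection only — features with non-zero coefs survive.
selector = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(selector.coef_)

# Stage 2: refit a plain LinearRegression on the selected subset, so the
# final coefficients carry no shrinkage bias.
final = LinearRegression().fit(X[:, selected], y)
```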
- Re-calibrate model with incoming project actuals as the dataset grows
- Upgrade to AACE Class 3 accuracy as project definition data becomes available
- Add confidence interval bands to the estimate output
- Extend city coverage beyond the current 19 LA Basin cities
- Role-based access control for internal deployment
Built and maintained by Priyank Rao — Data Scientist / ML Engineer
Portfolio · GitHub
This tool is an internal planning instrument. All estimates are AACE Class 5 parametric estimates suitable for screening and feasibility purposes only. Not for project authorisation.