DebugJedi/CostEstimator-Pipeline_Projects
SWA Estimator — Pipeline Cost Intelligence Platform

Parametric cost estimation for LA Basin natural gas pipeline projects.
ML regression model · Azure Static Web Apps · Python Azure Functions · BLS PPI inflation adjustment · AACE Class 5 compliant output


What This Is

Most pipeline cost estimates at early project stages are gut-feel spreadsheets. This tool replaces that with a production-deployed ML regression model trained on 418 completed LA Basin pipeline projects, wrapped in a full-stack web application that generates AACE Class 5 compliant estimates in seconds.

It is actively used internally at a major Southern California natural gas utility for capital project budget planning and portfolio-level cost screening.


Live Features

  • Instant parametric estimates — diameter, length, dig count, project type, and city as inputs; project cost + contingency + total budget as output
  • Cost escalation engine — compounds costs forward to any construction quarter (2025–2035) using a configurable inflation rate
  • AACE Class 5 accuracy range — −50% / +100% bounds rendered on an interactive range track
  • Full Basis of Estimate (BOE) report — exportable PDF with TIC cost breakdown, methodology, assumptions, and accuracy statement; AACE RP 34R-05 compliant
  • Model trace panel — per-feature scaled values, coefficients, and contributions; full transparency into what is driving each estimate
  • Historical dataset explorer — filterable, sortable, paginated table of 431 training-set records with city and project type filters
  • Excel export — two-sheet workbook (Estimate Summary + Cost Breakdown) with model performance metadata
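The cost escalation engine above can be sketched as simple compound growth. The exact compounding convention used by the app is an assumption here (annual rate with a fractional-year exponent), though the quarter arithmetic reproduces the `years_from_base` of 2.25 that 2027 Q2 yields in the API example later in this README:

```python
# Sketch of the escalation step: compound a 2025 base cost forward to the
# chosen construction quarter. Compounding convention is an assumption.

def escalate(base_cost_2025: float, inflation_rate_pct: float,
             construction_year: int, construction_quarter: int) -> float:
    """Compound base_cost_2025 forward from 2025 Q1 to the target quarter."""
    years_from_base = (construction_year - 2025) + (construction_quarter - 1) / 4
    factor = (1 + inflation_rate_pct / 100) ** years_from_base
    return base_cost_2025 * factor

escalate(100_000, 4.0, 2027, 2)   # 2.25 years at 4% annual compounding
```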

Architecture

Browser (Azure Static Web App)
    │
    ├── index.html + css/styles.css
    └── js/
        ├── api.js       →  POST /api/predict, GET /api/health
        ├── main.js      →  input handling, escalation preview
        ├── ui.js        →  rendering layer (estimate cards, BOE, trace table)
        ├── data.js      →  dataset explorer
        └── export.js    →  Excel (SheetJS) and PDF exports

Azure Functions (Python v2) — /api
    ├── predict/         →  POST /api/predict
    │   └── validates inputs → calls run_estimate() → returns JSON
    └── health/          →  GET /api/health

api/shared/
    ├── predictor.py     →  singleton model loader (Azure Blob Storage), run_estimate()
    └── m5_transformer.py →  sklearn transformer (OHE + feature engineering + scaler)

Azure Blob Storage
    └── model-artifacts/
        ├── pipeline.joblib
        └── model_metadata.json

train/
    ├── train.py         →  full training pipeline (BLS PPI fetch → Lasso selection → LR → joblib)
    └── data/ProjectData_LA.xlsx

The Model (M5)

| Property | Value |
| --- | --- |
| Algorithm | Lasso feature selection → Linear Regression on log-transformed cost |
| Training set | 418 LA Basin pipeline projects · 19 cities · 4 districts |
| Test R² (log scale) | 0.888 |
| MAE | $44K (2025 dollars) |
| MAPE | 19.0% |
| CV R² (5-fold) | 0.899 ± 0.015 |
| Features selected | 24 of 37 candidates |
| Inflation adjustment | BLS PPI series WPUIP2311001 (Oil & Gas Field Machinery) — actuals normalised to 2025 dollars before training |

Feature engineering: log(diameter × length), diameter², digs × diameter, log(digs), log(length) — plus one-hot encoded project type, city, and district. Lasso with 5-fold CV selects the final feature set; a plain LinearRegression is then fit on the selected features for full interpretability.
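A minimal sketch of that two-stage fit, using synthetic data and illustrative column choices rather than the project's actual schema:

```python
# Engineer features, let LassoCV zero out weak ones, then refit a plain
# LinearRegression on the surviving subset (selection, not shrinkage, in the
# final model). Data and coefficients below are synthetic.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n = 200
diameter = rng.uniform(2, 36, n)
length = rng.uniform(100, 5000, n)
digs = rng.integers(1, 20, n)

# Engineered features mirroring the README's list, plus a noise column
# that Lasso should be able to drop.
X = np.column_stack([
    np.log(diameter * length),
    diameter ** 2,
    digs * diameter,
    np.log(digs),
    np.log(length),
    rng.normal(size=n),
])
y = 10 + 0.8 * np.log(diameter * length) + 0.05 * digs * diameter \
    + rng.normal(scale=0.1, size=n)

lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)        # surviving feature indices
final = LinearRegression().fit(X[:, selected], y)  # unshrunk, interpretable fit
```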

Why log-transform? Pipeline costs span nearly two orders of magnitude. Log-transforming the target produces normally distributed residuals and better-behaved regression coefficients, with exp() used at inference to return dollar values.
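The inference-side consequence is that predictions come back in log space and must be exponentiated. A toy illustration (multiplicative cost growth, so the log target is exactly linear):

```python
# Fit on log(cost), exponentiate at inference to recover dollars.
# Data here is synthetic and deliberately multiplicative.
import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([[1.0], [2.0], [4.0], [8.0], [16.0]])
cost = 50_000 * 1.5 ** size[:, 0]            # spans ~2 orders of magnitude

model = LinearRegression().fit(size, np.log(cost))     # fit on log target
dollars = np.exp(model.predict(np.array([[10.0]])))    # exp() back to dollars
```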


Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend hosting | Azure Static Web Apps |
| Backend compute | Azure Functions (Python v2) |
| Model storage | Azure Blob Storage |
| ML framework | scikit-learn 1.5.2 |
| Data | pandas, numpy |
| Model persistence | joblib |
| Inflation data | BLS Public API (no key required) |
| Excel export | SheetJS (browser-side) |
| PDF export | Browser print API |
| CI/CD | GitHub Actions → Azure SWA |

Project Structure

swa-estimator/
├── index.html
├── css/styles.css
├── js/
│   ├── api.js
│   ├── data.js
│   ├── dataset.json         
│   ├── export.js
│   ├── main.js
│   └── ui.js
├── staticwebapp.config.json
├── api/
│   ├── host.json
│   ├── requirements.txt
│   ├── predict/__init__.py   ← POST /api/predict
│   ├── health/__init__.py    ← GET /api/health
│   └── shared/
│       ├── predictor.py
│       ├── m5_transformer.py
│       └── model/            ← populated from Azure Blob at cold-start
│           ├── pipeline.joblib
│           └── model_metadata.json
└── train/
    ├── train.py
    └── data/ProjectData_LA.xlsx

Local Development

# Install dependencies
npm install -g @azure/static-web-apps-cli
pip install -r api/requirements.txt

# Start local emulator (serves frontend + functions together)
swa start . --api-location api
# → http://localhost:4280

Model artifacts are downloaded from Azure Blob Storage at cold-start via AZURE_STORAGE_CONNECTION_STRING. For local development, set this in api/local.settings.json.
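A minimal api/local.settings.json using the setting names from the Deployment section (the connection string is a placeholder you fill in):

```json
{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AZURE_STORAGE_CONNECTION_STRING": "<your-connection-string>",
    "BLOB_CONTAINER_NAME": "model-artifacts",
    "BLOB_MODEL_NAME": "pipeline.joblib",
    "BLOB_META_NAME": "model_metadata.json"
  }
}
```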


Training the Model

cd swa-estimator/
source api/.venv/bin/activate
python -m train.train

This will:

  1. Fetch current BLS PPI data (falls back to hardcoded values if offline)
  2. Normalise all historical actuals to 2025 dollars
  3. Engineer features and run LassoCV feature selection
  4. Fit a LinearRegression on the selected features
  5. Run smoke tests and print a summary
  6. Save pipeline.joblib and model_metadata.json to api/shared/model/

Then upload the artifacts to Azure Blob Storage for the Functions backend to consume.
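The normalisation in step 2 is a straightforward index ratio. A sketch, with placeholder PPI values rather than real BLS data:

```python
# Scale a historical actual into 2025 dollars using the PPI ratio.
# The index values below are illustrative placeholders, not real BLS data.

def normalise_to_2025(cost: float, year: int, ppi: dict) -> float:
    """Rebase a historical cost to 2025 dollars via the PPI index ratio."""
    return cost * ppi[2025] / ppi[year]

ppi_index = {2020: 180.0, 2022: 210.0, 2025: 240.0}
normalise_to_2025(300_000, 2020, ppi_index)   # 300k * 240/180 = 400k
```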


Deployment

  1. Create Azure Static Web App (Azure Portal → Static Web Apps → Create)

    • App location: /
    • API location: api
    • Output location: (leave blank)
  2. Add GitHub secret: AZURE_STATIC_WEB_APPS_API_TOKEN

  3. Set application settings in Azure Portal (Functions → Configuration):

    AZURE_STORAGE_CONNECTION_STRING
    BLOB_CONTAINER_NAME = model-artifacts
    BLOB_MODEL_NAME     = pipeline.joblib
    BLOB_META_NAME      = model_metadata.json
    
  4. Push to main — GitHub Actions handles the rest.
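Step 4 relies on the workflow GitHub generates when the Static Web App is created. A typical fragment with the locations from step 1 (the generated file may differ in detail):

```yaml
# Fragment of the GitHub Actions workflow for Azure SWA deployment.
jobs:
  build_and_deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: Azure/static-web-apps-deploy@v1
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN }}
          action: "upload"
          app_location: "/"
          api_location: "api"
          output_location: ""
```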


API Reference

GET /api/health

Returns model load status and metadata summary.

POST /api/predict

Request:

{
  "pipe_diameter":        8,
  "pipe_length":          1356,
  "num_digs":             7,
  "project_type":         "Pipe Replacement",
  "city":                 "Glendale",
  "contingency_pct":      10,
  "construction_year":    2027,
  "construction_quarter": 2
}

Response (abbreviated):

{
  "costs": {
    "base_cost_2025":  485000,
    "project_cost":    524600,
    "contingency_amt":  52460,
    "total_budget":    577060
  },
  "escalation": {
    "years_from_base":   2.25,
    "escalation_factor": 1.0824,
    "inflation_rate_pct": 4.0
  },
  "aace_range": { "low": 288530, "estimate": 577060, "high": 1154120 },
  "trace": {
    "feature_names": ["Digs × Diameter", "..."],
    "contributions":  [0.0312, "..."]
  }
}
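The arithmetic implied by the sample response is simple to verify: contingency is applied to the project cost, and the AACE Class 5 range is −50% / +100% around the total budget. The helper names below are illustrative, not the app's actual code:

```python
# Reproduce the cost roll-up and AACE range from the sample response above.

def total_budget(project_cost: float, contingency_pct: float) -> float:
    """Project cost plus percentage contingency."""
    contingency = project_cost * contingency_pct / 100
    return project_cost + contingency

def aace_class5_range(estimate: float) -> dict:
    """AACE Class 5 screening bounds: -50% / +100% of the estimate."""
    return {"low": estimate * 0.5, "estimate": estimate, "high": estimate * 2.0}

total_budget(524_600, 10)     # → 577060.0, matching "total_budget" above
aace_class5_range(577_060)    # → low 288530, high 1154120, matching "aace_range"
```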

Design Decisions Worth Noting

Singleton model loading — predictor.py uses a module-level _pipeline variable so the model is loaded once at Azure Functions cold-start and reused across all subsequent invocations. This avoids the ~2–3s joblib load penalty on every request.
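The shape of that pattern, with a stand-in loader in place of the real Azure Blob download (the actual predictor.py internals are an assumption here):

```python
# Module-level singleton: load once per worker process, reuse thereafter.
_pipeline = None

def get_pipeline():
    """Return the cached model, loading it on the first call only."""
    global _pipeline
    if _pipeline is None:
        _pipeline = _load_from_blob()   # expensive: blob download + deserialise
    return _pipeline

def _load_from_blob():
    return object()  # stand-in for the downloaded, deserialised pipeline
```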

Blob Storage over bundled artifacts — The trained pipeline.joblib is not committed to the repo. It is uploaded to Azure Blob Storage post-training and pulled down to /tmp/model/ at cold-start. This keeps the scikit-learn version decoupled from the deployment process and makes model updates a blob upload, not a redeploy.

Hardcoded trace coefficients — predictor.py maintains a parallel set of COEFS and FEATURE_LABELS for the explanation trace panel. This avoids introspecting joblib internals for a UI feature and makes the explanation layer robust to future model packaging changes.

Lasso then LinearRegression — LassoCV is used purely for feature selection (zero-out coefficients), not as the final estimator. A plain LinearRegression is then fit on the selected feature subset, giving clean, interpretable coefficients with no regularisation shrinkage bias in the final estimates.


Roadmap

  • Re-calibrate model with incoming project actuals as the dataset grows
  • Upgrade to AACE Class 3 accuracy as project definition data becomes available
  • Add confidence interval bands to the estimate output
  • Extend city coverage beyond the current 19 LA Basin cities
  • Role-based access control for internal deployment

Author

Built and maintained by Priyank Rao — Data Scientist / ML Engineer
Portfolio · GitHub


This tool is an internal planning instrument. All estimates are AACE Class 5 parametric estimates suitable for screening and feasibility purposes only. Not for project authorisation.
