A machine learning system for wildfire risk prediction using ensemble methods that combine XGBoost and PyTorch LSTM models. The system analyzes weather time-series data to forecast fire risk 6 hours in advance.
This system predicts wildfire risk in California using:
- Weather station data from CIMIS (California Irrigation Management Information System)
- Historical fire records from Wikipedia and CAL FIRE
- Ensemble modeling combining XGBoost and attention-based LSTM
- 6-hour prediction horizon using 24-hour weather sequences
- Gradient boosting for feature interactions
- Station-based splitting to prevent data leakage
- Cyclical time encoding for seasonal patterns (sin/cos)
- Lag features (previous day weather conditions)
- Class balancing for rare fire events (~5% positive class)
- Bidirectional LSTM with custom attention mechanism
- 24-hour sequences → 6-hour prediction horizon
- Focal loss for imbalanced data
- Early stopping with validation monitoring
- Weighted Average: Grid search optimization
- Stacked Meta-Learning: Logistic regression meta-classifier
- Calibration analysis and reliability curves
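The weighted-average ensemble can be sketched as a one-dimensional grid search over a blend weight. This is a minimal illustration, not the notebook's code: the function names, the toy validation arrays, and the choice of log loss as the selection metric are all assumptions.

```python
import numpy as np

def log_loss(y_true, p, eps=1e-12):
    """Binary cross-entropy averaged over samples."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def best_blend_weight(y_val, p_xgb, p_lstm, n_steps=21):
    """Grid-search w in [0, 1] for the blend w * p_xgb + (1 - w) * p_lstm."""
    best_w, best_loss = 0.0, float("inf")
    for w in np.linspace(0.0, 1.0, n_steps):
        loss = log_loss(y_val, w * p_xgb + (1 - w) * p_lstm)
        if loss < best_loss:
            best_w, best_loss = float(w), loss
    return best_w, best_loss

# Toy validation labels and model probabilities (made up for illustration).
y_val = np.array([0, 0, 1, 0, 1, 0, 0, 1])
p_xgb = np.array([0.1, 0.3, 0.7, 0.2, 0.6, 0.1, 0.4, 0.8])
p_lstm = np.array([0.2, 0.1, 0.6, 0.3, 0.9, 0.2, 0.2, 0.5])

w, loss = best_blend_weight(y_val, p_xgb, p_lstm)
print(f"blend weight for XGBoost: {w:.2f}, validation log loss: {loss:.4f}")
```

Because the grid includes w = 0 and w = 1, the selected blend is never worse on the validation set than either model alone. The weight should be chosen on held-out validation data, not on the test stations.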
pip install xgboost joblib scikit-learn matplotlib seaborn
pip install torch pandas numpy

- Data Processing: dataset/colab-conditions-df-cleaning.ipynb
- XGBoost Training: fire_risk_xgboost_implementation.ipynb
- Ensemble Evaluation: ensemble_xgboost_lstm.ipynb
fire_risk_prediction/
├── fire_risk_xgboost_implementation.ipynb   # XGBoost model
├── ensemble_xgboost_lstm.ipynb              # PyTorch LSTM + Ensemble
├── dataset/
│   ├── colab-conditions-df-cleaning.ipynb   # Data preprocessing
│   └── conditions_df.csv                    # Processed weather/fire data
└── README.md
- Cyclical Encoding: Month/day-of-year as sin/cos pairs
- Station Statistics: Historical fire rates per weather station
- Lag Features: Weather conditions from the previous 24 hours
- Region Encoding: Geographic information via one-hot encoding
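Cyclical encoding maps a periodic value onto the unit circle so that the end of a cycle sits next to its start (December next to January). A minimal sketch, assuming a helper named `cyclical_encode` that is not taken from the notebooks:

```python
import numpy as np

def cyclical_encode(values, period):
    """Return (sin, cos) features for values that repeat every `period` units."""
    values = np.asarray(values, dtype=float)
    angle = 2.0 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

# Months 1 (January), 6 (June), and 12 (December).
month_sin, month_cos = cyclical_encode([1, 6, 12], period=12)

def distance(i, j):
    """Euclidean distance between two encoded months."""
    return float(np.hypot(month_sin[i] - month_sin[j], month_cos[i] - month_cos[j]))

print("Jan to Dec:", round(distance(0, 2), 3))
print("Jan to Jun:", round(distance(0, 1), 3))
```

The same function applies to day-of-year with `period=365` (or 366 in leap years). Both the sin and cos columns are needed, since either one alone maps two different points in the cycle to the same value.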
- Station-based splitting: 80% stations for training, 20% for testing
- Class balancing: scale_pos_weight for XGBoost, focal loss for LSTM
- Sequence construction: 24-hour windows for LSTM input
- Missing value imputation: Median imputation for numerical features
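Station-based splitting keeps every row from a given station on one side of the split, so the model is never tested on a station it saw during training. The sketch below illustrates the idea with the standard library only. The station IDs, labels, and function names are hypothetical, and the weight formula (negatives divided by positives) is the common XGBoost recommendation rather than a value confirmed by the notebooks.

```python
import random

def split_by_station(station_ids, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that no station appears in both."""
    stations = sorted(set(station_ids))
    random.Random(seed).shuffle(stations)
    n_test = max(1, round(len(stations) * test_fraction))
    test_stations = set(stations[:n_test])
    train_idx = [i for i, s in enumerate(station_ids) if s not in test_stations]
    test_idx = [i for i, s in enumerate(station_ids) if s in test_stations]
    return train_idx, test_idx

def compute_scale_pos_weight(labels):
    """Ratio of negative to positive samples, for XGBoost's scale_pos_weight."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("no positive samples in the training labels")
    return (len(labels) - n_pos) / n_pos

# Toy data: 10 hypothetical stations, 20 rows each, one fire row per station (5%).
station_ids = [f"station_{s}" for s in range(10) for _ in range(20)]
labels = [1 if i % 20 == 0 else 0 for i in range(len(station_ids))]

train_idx, test_idx = split_by_station(station_ids)
train_stations = {station_ids[i] for i in train_idx}
test_stations = {station_ids[i] for i in test_idx}
weight = compute_scale_pos_weight([labels[i] for i in train_idx])
print(len(train_stations), len(test_stations), weight)
```

Computing the weight from the training rows only keeps information about the test stations out of the model configuration.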
| Model | Approach | Key Strength |
|---|---|---|
| XGBoost | Feature-based | Complex feature interactions |
| LSTM | Sequential | Temporal pattern recognition |
| Ensemble | Combined | Best of both approaches |
Note: This table is qualitative. Run the notebooks to see actual performance metrics.
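xgb_params = {
    "n_estimators": 500,
    "max_depth": 5,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "tree_method": "hist",
    # Placeholder for class-imbalance handling: XGBoost expects a number here,
    # typically n_negative / n_positive computed from the training labels.
    "scale_pos_weight": "auto"
}

- Input: 24-hour weather sequences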
- Hidden: 64-unit bidirectional LSTM
- Attention: Custom attention mechanism
- Output: Single probability (fire risk in next 6 hours)
- Loss: Focal loss for imbalanced classification
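Focal loss down-weights examples the model already classifies confidently, so the rare fire events contribute more to the gradient. The notebooks presumably implement it in PyTorch; the version below is a NumPy sketch of the standard formula, with `gamma=2.0` and `alpha=0.25` as assumed defaults rather than the values used in training.

```python
import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    # p_t is the probability the model assigned to the true class.
    p_t = np.where(y_true == 1, p, 1 - p)
    # alpha_t weights positives by alpha and negatives by (1 - alpha).
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

# A confidently correct positive contributes far less than a poorly predicted one.
easy = focal_loss([1], [0.9])
hard = focal_loss([1], [0.3])
print(f"easy example: {easy:.5f}, hard example: {hard:.5f}")
```

Setting `gamma=0` removes the focusing term and leaves an alpha-weighted cross-entropy, which is a convenient sanity check when tuning.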
- Weather Data: CIMIS (California Irrigation Management Information System)
- Fire Records: Wikipedia fire tables + CAL FIRE historical data
- Time Period: 2018-2020
- Geographic Coverage: California weather stations
- Early Warning: 6-hour advance fire risk alerts
- Resource Planning: Firefighting resource deployment
- Research: Weather pattern analysis for fire prediction
The system includes comprehensive evaluation:
- ROC and Precision-Recall curves
- Confusion matrices and classification reports
- Feature importance analysis (XGBoost + SHAP)
- Attention weight visualization (LSTM)
- Model calibration analysis
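Calibration analysis compares predicted probabilities with observed fire frequencies inside probability bins. A minimal sketch of the binning behind a reliability curve follows; the function name, bin count, and toy inputs are assumptions, and scikit-learn's `calibration_curve` offers similar functionality.

```python
import numpy as np

def reliability_curve(y_true, p, n_bins=5):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0 .. n_bins - 1.
    bin_idx = np.digitize(p, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            rows.append((float(p[mask].mean()), float(y_true[mask].mean()), int(mask.sum())))
    return rows

# Toy predictions (made up for illustration).
rows = reliability_curve([0, 1, 1], [0.1, 0.9, 0.95], n_bins=2)
for mean_pred, observed_rate, count in rows:
    print(f"predicted {mean_pred:.3f}, observed {observed_rate:.3f}, n={count}")
```

For a well-calibrated model the first two values in each row are close. With a positive class near 5%, most predictions fall in the lowest bins, so the upper bins may hold too few samples to be reliable.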
- Real-time weather data integration
- Extended prediction horizons (12-24 hours)
- Additional weather variables (soil moisture, drought indices)
- Spatial modeling for fire spread prediction