- Ahmed Al-Ansi - @Nashawiyat - H00418777
- Amir Hafiy - @Amirhafiy27 - H00391253
- Chong Shing Boa - @hohoho123 - H00456192
- Wee Zhen Hao - @zhenhao23 - H00410783
- Chia Zheng Rong - @Zen - H00452635
The project is titled WildFive Team Report: Using Dataset to Predict the Possibility of WildFire.
The primary objective is to conduct a comparative analysis of two distinct machine learning methodologies for wildfire prediction: a statistical approach using classical ML algorithms on tabular environmental data (WildfireDB) and a visual approach using Convolutional Neural Networks (CNNs) on satellite imagery (Canadian Wildfire Satellite Images Dataset). The project aims to determine whether environmental metrics or visual spatial data provide a more reliable basis for forecasting wildfire behavior (spread/presence).
- WildfireDB (Tabular Environmental Data)
  - Source: A large-scale tabular dataset created by researchers from the University of California, Vanderbilt University, and Stanford University.
  - Licence: Creative Commons Attribution 4.0 International
  - Example 1: Contains approximately 17.8 million wildfire event records from 2012 to 2017, described by 149 features.
  - Example 2: Features include environmental variables such as `WSPF_ave` (wind speed) and `PRES_ave` (pressure), alongside the binary target variable `fire_spread`.
- Canadian Wildfire Satellite Images Dataset (Image Data)
  - Source: Published by Abdelghani Aaba on Kaggle, compiled from open-source Canadian government archives.
  - Link: https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset
  - Licence: Creative Commons Attribution (CC-BY) 4.0
  - Example 1: Contains 42,850 labeled RGB satellite images (originally 350x350 pixels).
  - Example 2: Images are categorized into two classes: 'Wildfire' (53%) and 'No Wildfire' (47%), providing a balanced foundation for deep learning.
- Week 4 (D1): Project Pitch Completion: Finalize datasets, confirm project objectives, and present the initial proposal.
- Week 6: Data Preprocessing and EDA Complete: Finalize the cleaning pipeline, feature engineering for WildfireDB, and image standardization for the Canadian Wildfire Dataset.
- Week 8: Baseline Models (R3) Training and Evaluation Complete: Implement, train, and benchmark Logistic Regression, Random Forest, and K-NN on the WildfireDB dataset.
- Week 10: Neural Network Models (R4) Training and Evaluation Complete: Implement transfer learning with VGG16, ResNet-50, and EfficientNet-B3 on the Canadian Wildfire Dataset. Perform Grad-CAM visualization.
- Week 11 (D2): Final Report and Code Submission: Compile final results, complete the 6-page report, and package the complete, runnable code repository.
- Week 12 (D3/D4): Project Presentation and Peer Assessment: Deliver the presentation and submit the peer assessment form.
- Python 3.8 or higher
- pip
- Git
- Clone the repository:

  ```bash
  git clone https://github.com/F20DL-2025-26/f20dl-cw-ay25-26-wildfive.git
  cd f20dl-cw-ay25-26-wildfive
  ```

- Install required dependencies:

  ```bash
  # Core dependencies for tabular data models
  pip install pandas numpy scikit-learn imbalanced-learn joblib

  # For geospatial data processing (WildfireDB)
  pip install shapely

  # For image processing and deep learning models
  pip install tensorflow keras pillow matplotlib seaborn tqdm

  # Alternatively, install all dependencies at once:
  pip install pandas numpy scikit-learn imbalanced-learn joblib shapely tensorflow keras pillow matplotlib seaborn tqdm
  ```

- Download the datasets:
- WildfireDB:
- Original: Download from Zenodo
- Preprocessed: Available at Google Drive
- Canadian Wildfire Satellite Images:
- Original: Download from Kaggle
- Preprocessed: Available at Google Drive
Note: Sample data for testing is provided in the `data/sample_data/` directory for quick experimentation without downloading the full datasets.
The pipeline consists of two separate flows for the tabular and image datasets.
For Tabular Data (WildfireDB):

```bash
# Preprocess the WildfireDB dataset from the raw data into the three processed splits (training/validation/testing)
python scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py
```

For Image Data (Canadian Wildfire Satellite Images):

```bash
# Resize satellite images to 300x300
python scripts/preprocessing/satellite\ images/resize.py
```

For Exploratory Data Analysis, open and run the preprocessing notebook:

```bash
jupyter notebook notebooks/EDA_preprocessing_and_logistic_regression.ipynb
```

Execution: Run `python scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py`
The preprocessing pipeline applies the following transformations:
- Remove Useless Column: Remove the 149th unnamed column from the original dataset (17.8 million records).
- Temporal Filtering: Extract data from 2016-2017 only, filtering out 2012-2015 records to focus on the most recent patterns.
- Feature Engineering:
  - Remove `TEMP_ave` (100% missing values) and `Neighbour_acq_time` (90% missing, redundant)
  - Create the binary target variable `fire_spread` from `Neighbour_frp` (1 = spread; 0 = no spread), removing `Neighbour_frp` in the process
- Remove Incomplete Rows: Drop all rows with any remaining NaN values to ensure data quality.
- Process Multi-Value and Spatial Columns:
  - `frp` and `acq_time`: Extract maximum and average values from comma-separated entries
  - `Shape` and `Neighbour_Shape`: Convert WKT polygon strings to area values (`Shape_area`, `Neighbour_Shape_area`)
- Date Feature Extraction: Split `acq_date` into `acq_date_year`, `acq_date_month`, `acq_date_day`, and `acq_date_dayofyear` for temporal pattern analysis.
- Balanced Sampling and Dataset Splitting:
  - Training set: 80,000 samples (40,000 spread + 40,000 no-spread)
  - Validation set: 20,000 samples (10,000 spread + 10,000 no-spread)
  - Test set: All remaining 2016-2017 data (naturally imbalanced)
- Shuffling: Shuffle all three datasets to ensure random distribution.
Note: Model-specific preprocessing (e.g., StandardScaler for KNN, SMOTE for Random Forest) is applied within the individual training scripts.

Output: Generates three preprocessed datasets (`features_array_2016-2017_training.csv`, `features_array_2016-2017_validation.csv`, `features_array_2016-2017_testing.csv`).
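The pipeline above can be sketched with pandas. This is a minimal illustration, not the project's `preprocess.py`: the rule deriving `fire_spread` (non-zero `Neighbour_frp`), the omission of the multi-value/WKT column parsing, and the split sizes are assumptions.

```python
import pandas as pd

def preprocess_wildfiredb(df, n_train_per_class=40_000, n_val_per_class=10_000, seed=42):
    """Sketch of the WildfireDB tabular pipeline (illustrative only)."""
    # Temporal filtering: keep 2016-2017 records only
    df = df.copy()
    df["acq_date"] = pd.to_datetime(df["acq_date"])
    df = df[df["acq_date"].dt.year.isin([2016, 2017])].copy()

    # Binary target from Neighbour_frp (assumed rule: non-zero value = spread),
    # then drop it together with the mostly-missing columns
    df["fire_spread"] = (df["Neighbour_frp"].fillna(0) > 0).astype(int)
    df = df.drop(columns=["TEMP_ave", "Neighbour_acq_time", "Neighbour_frp"],
                 errors="ignore")

    # Drop rows with any remaining missing values
    df = df.dropna()

    # Date feature extraction, then drop the raw date column
    dt = df["acq_date"].dt
    df = df.assign(acq_date_year=dt.year, acq_date_month=dt.month,
                   acq_date_day=dt.day, acq_date_dayofyear=dt.dayofyear)
    df = df.drop(columns=["acq_date"])

    # Balanced sampling: equal class counts for train and validation,
    # the remainder becomes the naturally imbalanced test set
    n_tr = min(n_train_per_class, df["fire_spread"].value_counts().min())
    train = df.groupby("fire_spread").sample(n=n_tr, random_state=seed)
    rest = df.drop(index=train.index)
    n_va = min(n_val_per_class, rest["fire_spread"].value_counts().min())
    valid = rest.groupby("fire_spread").sample(n=n_va, random_state=seed)
    test = rest.drop(index=valid.index)

    # Shuffle all three splits
    shuffle = lambda d: d.sample(frac=1, random_state=seed).reset_index(drop=True)
    return shuffle(train), shuffle(valid), shuffle(test)
```

The balanced `groupby(...).sample` keeps training and validation at a 50/50 class ratio while leaving the test set's natural imbalance intact, matching the split description above.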
Execution: Run `python scripts/preprocessing/satellite\ images/resize.py`
- Loading and Verification: Load 42,850 labeled RGB satellite images. Corrupted or unreadable images are removed.
- Resizing and Standardization: All images are resized to 300 × 300 pixels to match CNN model input requirements.
- Dataset Splitting: Split into:
  - 70% Training
  - 15% Validation
  - 15% Test
- CNN Preprocessing: Images are processed according to the requirements of the transfer-learning models:
  - VGG16
  - ResNet-50
  - EfficientNet-B3
Output: Generates a `resized_dataset/` folder with preprocessed images organized by train/valid/test splits.
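The loading, verification, and resizing steps can be sketched with Pillow. The flat `*.jpg` layout and function name are assumptions; the actual `resize.py` handles the real folder structure:

```python
from pathlib import Path
from PIL import Image

def resize_images(src_dir: str, dst_dir: str, size=(300, 300)) -> int:
    """Resize every JPEG under src_dir to `size`, skipping unreadable files.

    Returns the number of images successfully written (illustrative sketch).
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    kept = 0
    for path in sorted(Path(src_dir).glob("*.jpg")):
        try:
            # First pass: detect corrupted files without decoding fully
            with Image.open(path) as img:
                img.verify()
            # verify() invalidates the handle, so reopen to resize
            with Image.open(path) as img:
                img.convert("RGB").resize(size, Image.LANCZOS).save(dst / path.name)
            kept += 1
        except OSError:
            continue  # drop corrupted or unreadable images, as the pipeline describes
    return kept
```

Separating `verify()` from the resize pass mirrors the "Loading and Verification" step: broken files are dropped before standardization.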
f20dl-cw-ay25-26-wildfive/
│
├── README.md # Project documentation
├── .gitignore
├── .editorconfig
│
├── data/ # Data directory
│ ├── datasets.md # Dataset links and information
│ └── sample_data/ # Sample data for testing
│ ├── CNN/ # CNN sample images
│ │ ├── wildfire/ # Sample wildfire images
│ │ └── no_wildfire/ # Sample non-wildfire images
│ ├── KNN/ # KNN sample data
│ │ └── features_array_testing_set_new_sample.csv
│ └── Random Forest + Logistic Regression/
│ └── features_array_2016-2017_testing_sample.csv
│
├── notebooks/ # Jupyter notebooks for analysis
│ ├── Baseline models (wildfireDB)/ # Classical ML models
│ │ ├── EDA_preprocessing_and_logistic_regression.ipynb
│ │ ├── KNN.ipynb
│ │ ├── Random Forest Testing.ipynb
│ │ └── Training RF Model.ipynb
│ └── CNN/ # Deep learning models
│ ├── 01_load_and_display_data.ipynb
│ ├── 02_pretrained_models_resnet50_12k.ipynb
│ ├── 03_pretrained_models_EfficientNetB3_12k.ipynb
│ ├── 04_model_with_cam_resnet50_12k.ipynb
│ ├── 05_pretrained_models_vgg16_12k.ipynb
│ └── 06_cnn_saved_models_inference.ipynb
│
├── scripts/ # Python scripts
│ ├── preprocessing/
│ │ └── satellite images/ # Image preprocessing
│ │ ├── count_validation.py # Validate image counts
│ │ ├── load_dataset_example.py # Dataset loading example
│ │ ├── README.txt # Preprocessing instructions
│ │ └── resize.py # Image resizing script
│ └── Preprocessing-training-prediction-wildfireDB/
│ ├── preprocess.py # Main preprocessing
│ ├── LR-training.py # Logistic Regression training
│ ├── LR-prediction.py # Logistic Regression prediction
│ ├── RF-training.py # Random Forest training
│ ├── RF-prediction.py # Random Forest prediction
│ ├── KNN-training.py # K-NN training
│ └── KNN-prediction.py # K-NN prediction
│
├── trained models/ # Trained model storage
│ └── trained_models.md # Google Drive links to all trained models
│
└── documentation/ # Project documentation and updates
The project focuses on predicting wildfire occurrence by comparing classical machine learning models (Logistic Regression, Random Forest, K-NN) on tabular data (WildfireDB) against deep learning models (VGG16, ResNet-50, EfficientNet-B3) trained on satellite images. The goal is to determine which approach yields superior predictive performance.
Location:
- `README.md` (Initial proposal)
- `documentation/` (Weekly updates)
This includes cleaning the WildfireDB dataset, addressing severe imbalance via strategic sampling, and performing EDA to identify issues such as 100% missingness in certain columns.
Location:
- `notebooks/EDA_preprocessing_and_logistic_regression.ipynb` (WildfireDB exploratory data analysis, preprocessing, feature engineering, problem discovery, and Logistic Regression baseline notebook)
- `scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py` (Tabular data preprocessing)
- `scripts/preprocessing/satellite images/resize.py` (Image resizing)
Three classical ML models were applied on the WildfireDB dataset:
- Logistic Regression
- Random Forest
- K-NN
Models were evaluated using ROC-AUC, PR-AUC, Precision, Recall, and F1. Random Forest performed best, achieving an ROC-AUC of 0.899.
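The metric set used for these comparisons maps directly onto scikit-learn; here `y_true` and `y_score` are placeholders for a model's test labels and predicted probabilities, and the 0.5 threshold is an assumption:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_recall_fscore_support)

def evaluate(y_true, y_score, threshold: float = 0.5) -> dict:
    """Compute the report's metric set for one binary classifier."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return {
        "ROC-AUC": roc_auc_score(y_true, y_score),           # ranking quality
        "PR-AUC": average_precision_score(y_true, y_score),  # robust under class imbalance
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }
```

ROC-AUC and PR-AUC are computed from the raw scores, while precision/recall/F1 depend on the chosen decision threshold, which is why both kinds of metric are reported.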
Location:
- `notebooks/Baseline models (wildfireDB)/EDA_preprocessing_and_logistic_regression.ipynb` (also includes Logistic Regression baseline model training and prediction evaluation)
- `scripts/Preprocessing-training-prediction-wildfireDB/LR-training.py` (Logistic Regression training script)
- `scripts/Preprocessing-training-prediction-wildfireDB/LR-prediction.py` (Logistic Regression prediction script)
- `notebooks/Baseline models (wildfireDB)/Training RF Model.ipynb` (Random Forest training notebook)
- `notebooks/Baseline models (wildfireDB)/Random Forest Testing.ipynb` (Random Forest prediction evaluation notebook)
- `scripts/Preprocessing-training-prediction-wildfireDB/RF-training.py` (Random Forest training script)
- `scripts/Preprocessing-training-prediction-wildfireDB/RF-prediction.py` (Random Forest prediction script)
- `notebooks/Baseline models (wildfireDB)/KNN.ipynb` (K-NN training and prediction evaluation notebook)
- `scripts/Preprocessing-training-prediction-wildfireDB/KNN-training.py` (K-NN training script)
- `scripts/Preprocessing-training-prediction-wildfireDB/KNN-prediction.py` (K-NN prediction script)
Trained Models: All trained Baseline and CNN models are available on Google Drive (see `trained models/trained_models.md`)
Three transfer-learning CNN models were trained:
- VGG16
- ResNet-50
- EfficientNet-B3
A standard classification head was added to each. Model interpretability was ensured using Grad-CAM visualizations to confirm focus on relevant wildfire visual cues.
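The classification head is described above only as "standard"; a typical Keras construction, shown here with ResNet-50 as an illustration (layer sizes and dropout rate are assumptions, not the project's exact head), might be:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_transfer_model(input_shape=(300, 300, 3), weights="imagenet"):
    """Frozen ResNet-50 backbone with a small binary-classification head (sketch)."""
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=input_shape)
    base.trainable = False  # freeze pretrained weights for feature extraction

    # Standard head: pooled features -> dense -> dropout -> sigmoid output
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # wildfire vs. no wildfire

    model = models.Model(base.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The same head pattern applies to the VGG16 and EfficientNet-B3 backbones by swapping the `tf.keras.applications` constructor, with each model's own input preprocessing applied beforehand.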
Location:
- `notebooks/CNN/01_load_and_display_data.ipynb` (Initial data loading and visualization)
- `notebooks/CNN/02_pretrained_models_resnet50_12k.ipynb` (ResNet-50 training and evaluation)
- `notebooks/CNN/03_pretrained_models_EfficientNetB3_12k.ipynb` (EfficientNet-B3 training and evaluation)
- `notebooks/CNN/05_pretrained_models_vgg16_12k.ipynb` (VGG16 training and evaluation)
- `notebooks/CNN/04_model_with_cam_resnet50_12k.ipynb` (Grad-CAM visualization for model interpretability)
- `notebooks/CNN/06_cnn_saved_models_inference.ipynb` (Inference using trained CNN models)
Trained Models: All trained Baseline and CNN models are available on Google Drive (see `trained models/trained_models.md`)
Weekly updates are kept in the `documentation/` directory.