
Data Mining and Machine Learning Group Coursework

Group Members

  1. Ahmed Al-Ansi - @Nashawiyat - H00418777
  2. Amir Hafiy - @Amirhafiy27 - H00391253
  3. Chong Shing Boa - @hohoho123 - H00456192
  4. Wee Zhen Hao - @zhenhao23 - H00410783
  5. Chia Zheng Rong - @Zen - H00452635

Initial Project Proposal

The project is titled WildFive Team Report: Using Dataset to Predict the Possibility of WildFire.

The primary objective is to conduct a comparative analysis of two distinct machine learning methodologies for wildfire prediction: a statistical approach using classical ML algorithms on tabular environmental data (WildfireDB) and a visual approach using Convolutional Neural Networks (CNNs) on satellite imagery (Canadian Wildfire Satellite Images Dataset). The project aims to determine whether environmental metrics or visual spatial data provide a more reliable basis for forecasting wildfire behavior (spread/presence).

Source of Datasets

  1. WildfireDB (Tabular Environmental Data)
  • Source: A large-scale tabular dataset created by researchers from the University of California, Vanderbilt University, and Stanford University.

  • Link: https://zenodo.org/records/5636429

  • Licence: Creative Commons Attribution 4.0 International

  • Example 1: Contains approximately 17.8 million wildfire event records spanning 2012 to 2017 (roughly 11.3 million of them from 2012-2015), described by 149 features.

  • Example 2: Features include environmental variables like WSPF_ave (Wind Speed) and PRES_ave (Pressure), alongside the binary target variable fire_spread.

  2. Canadian Wildfire Satellite Images Dataset (Image Data)
  • Source: Published by Abdelghani Aaba on Kaggle, compiled from open-source Canadian government archives.

  • Link: https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset

  • Licence: Creative Commons Attribution (CC-BY) 4.0

  • Example 1: Contains 42,850 labeled RGB satellite images (originally 350×350 pixels).

  • Example 2: Images are categorized into two classes: 'Wildfire' (53%) and 'No Wildfire' (47%), providing a balanced foundation for deep learning.

Milestones

  1. Week 4 (D1): Project Pitch Completion: Finalize datasets, confirm project objectives, and present the initial proposal.

  2. Week 6: Data Preprocessing and EDA Complete: Finalize cleaning pipeline, feature engineering for WildfireDB, and image standardization for the Canadian Wildfire Dataset.

  3. Week 8: Baseline Models (R3) Training and Evaluation Complete: Implement, train, and benchmark Logistic Regression, Random Forest, and K-NN on the WildfireDB dataset.

  4. Week 10: Neural Network Models (R4) Training and Evaluation Complete: Implement Transfer Learning with VGG16, ResNet-50, and EfficientNet-B3 on the Canadian Wildfire Dataset. Perform Grad-CAM visualization.

  5. Week 11 (D2): Final Report and Code Submission: Compile final results, complete the 6-page report, and package the complete, runnable code repository.

  6. Week 12 (D3/D4): Project Presentation and Peer Assessment: Deliver the presentation and submit the peer assessment form.

Installing the project

Prerequisites

  • Python 3.8 or higher
  • pip
  • Git

Installation Steps

  1. Clone the repository:
git clone https://github.com/F20DL-2025-26/f20dl-cw-ay25-26-wildfive.git
cd f20dl-cw-ay25-26-wildfive
  2. Install required dependencies:
# Core dependencies for tabular data models
pip install pandas numpy scikit-learn imbalanced-learn joblib

# For geospatial data processing (WildfireDB)
pip install shapely

# For image processing and deep learning models
pip install tensorflow keras pillow matplotlib seaborn tqdm

# Alternatively, install all dependencies at once:
pip install pandas numpy scikit-learn imbalanced-learn joblib shapely tensorflow keras pillow matplotlib seaborn tqdm
  3. Download the datasets:
  • WildfireDB: https://zenodo.org/records/5636429
  • Canadian Wildfire Satellite Images: https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset

Note: Sample data for testing is provided in the data/sample_data/ directory for quick experimentation without downloading the full datasets.
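
A minimal sketch of loading one of the provided samples with pandas, assuming the file layout shown in the Project Structure section below and the fire_spread target column described in the tabular pipeline:

import pandas as pd

# Load the Random Forest / Logistic Regression sample split
sample = pd.read_csv(
    "data/sample_data/Random Forest + Logistic Regression/"
    "features_array_2016-2017_testing_sample.csv"
)

# Separate features from the binary target (assumes the column is named fire_spread)
X = sample.drop(columns=["fire_spread"])
y = sample["fire_spread"]
print(X.shape, y.value_counts())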

Data Preparation Pipeline

The pipeline consists of two separate flows for the tabular and image datasets.

Running the Complete Pipeline

For Tabular Data (WildfireDB):

# Preprocess the raw WildfireDB dataset into the three processed splits (training/validation/testing)
python scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py

For Image Data (Canadian Wildfire Satellite Images):

# Resize satellite images to 300x300
python scripts/preprocessing/satellite\ images/resize.py

For Exploratory Data Analysis:

Open and run the preprocessing notebook:

jupyter notebook notebooks/EDA_preprocessing_and_logistic_regression.ipynb

Tabular Data Pipeline (WildfireDB)

Execution: Run python scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py

The preprocessing pipeline applies the following transformations:

  • Remove Redundant Column: Drop the unnamed 149th column from the original dataset (17.8 million records).

  • Temporal Filtering: Extract data from 2016-2017 only, filtering out 2012-2015 records to focus on the most recent patterns.

  • Feature Engineering:

    • Remove TEMP_ave (100% missing values) and Neighbour_acq_time (90% missing, redundant)
    • Create binary target variable fire_spread from Neighbour_frp (1 = spread; 0 = no spread), removing Neighbour_frp in the process
  • Remove Incomplete Rows: Drop all rows with any remaining NaN values to ensure data quality.

  • Process Multi-Value and Spatial Columns:

    • frp and acq_time: Extract maximum and average values from comma-separated entries
    • Shape and Neighbour_Shape: Convert WKT polygon strings to area values (Shape_area, Neighbour_Shape_area)
  • Date Feature Extraction: Split acq_date into acq_date_year, acq_date_month, acq_date_day, and acq_date_dayofyear for temporal pattern analysis.

  • Balanced Sampling and Dataset Splitting:

    • Training set: 80,000 samples (40,000 spread + 40,000 no-spread)
    • Validation set: 20,000 samples (10,000 spread + 10,000 no-spread)
    • Test set: All remaining 2016-2017 data (naturally imbalanced)
  • Shuffling: Shuffle all three datasets to ensure random distribution.
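
A minimal sketch of the core transformations above, assuming the column names described in this list (TEMP_ave, Neighbour_acq_time, Neighbour_frp, frp, acq_time, Shape, acq_date); how fire_spread is derived from Neighbour_frp here (any non-missing neighbour FRP counts as spread) is an assumption, and the actual logic lives in scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py:

import pandas as pd
from shapely import wkt

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Temporal filtering: keep 2016-2017 records only
    df["acq_date"] = pd.to_datetime(df["acq_date"])
    df = df[df["acq_date"].dt.year.isin([2016, 2017])].copy()

    # Drop columns with excessive missingness
    df = df.drop(columns=["TEMP_ave", "Neighbour_acq_time"], errors="ignore")

    # Binary target (assumption: a non-missing neighbour FRP value is treated as spread)
    df["fire_spread"] = df["Neighbour_frp"].notna().astype(int)
    df = df.drop(columns=["Neighbour_frp"])

    # Multi-value columns: reduce comma-separated entries to max and average
    for col in ["frp", "acq_time"]:
        values = df[col].astype(str).str.split(",").apply(lambda v: [float(x) for x in v])
        df[f"{col}_max"] = values.apply(max)
        df[f"{col}_ave"] = values.apply(lambda v: sum(v) / len(v))
        df = df.drop(columns=[col])

    # Spatial column: convert WKT polygon strings to areas
    df["Shape_area"] = df["Shape"].apply(lambda s: wkt.loads(s).area)
    df = df.drop(columns=["Shape"])

    # Date feature extraction
    df["acq_date_year"] = df["acq_date"].dt.year
    df["acq_date_month"] = df["acq_date"].dt.month
    df["acq_date_day"] = df["acq_date"].dt.day
    df["acq_date_dayofyear"] = df["acq_date"].dt.dayofyear
    df = df.drop(columns=["acq_date"])

    # Remove any remaining incomplete rows
    return df.dropna()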

Note

Model-specific preprocessing (e.g., StandardScaler for KNN, SMOTE for Random Forest) is applied within individual training scripts.
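
A minimal sketch of that model-specific step, assuming the split CSVs produced by preprocess.py and a fire_spread target column; the exact parameters are set in the individual training scripts:

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler

# Load the splits written by preprocess.py
train = pd.read_csv("features_array_2016-2017_training.csv")
valid = pd.read_csv("features_array_2016-2017_validation.csv")
X_train, y_train = train.drop(columns=["fire_spread"]), train["fire_spread"]
X_valid = valid.drop(columns=["fire_spread"])

# K-NN is distance-based, so standardise features (fit on training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

# Random Forest: rebalance the training classes with SMOTE before fitting
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)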

Output: Generates three preprocessed datasets (features_array_2016-2017_training.csv, features_array_2016-2017_validation.csv, features_array_2016-2017_testing.csv).


Image Data Pipeline (Canadian Wildfire Satellite Images)

Execution: Run python scripts/preprocessing/satellite\ images/resize.py

  • Loading and Verification: Load 42,850 labeled RGB satellite images. Corrupted or unreadable images are removed.

  • Resizing and Standardization: All images are resized to 300 × 300 to match CNN model requirements.

  • Dataset Splitting: Split into:

    • 70% Training
    • 15% Validation
    • 15% Test
  • CNN Preprocessing: Images are processed according to the requirements of the Transfer Learning models:

    • VGG16
    • ResNet-50
    • EfficientNet-B3

Output: Generates a resized_dataset/ folder with preprocessed images organized by train/valid/test splits.
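
A minimal sketch of the resize step and the model-specific input preprocessing, assuming class-labelled subfolders of .jpg files under a dataset/ directory (ResNet-50 shown; VGG16 and EfficientNet-B3 have analogous preprocess_input functions); the actual logic lives in resize.py and the CNN notebooks:

from pathlib import Path

import numpy as np
from PIL import Image
from tensorflow.keras.applications.resnet50 import preprocess_input as resnet_preprocess

SRC, DST, SIZE = Path("dataset"), Path("resized_dataset"), (300, 300)

# Resize every readable image to 300x300, dropping corrupted files
for path in SRC.rglob("*.jpg"):
    out = DST / path.relative_to(SRC)
    out.parent.mkdir(parents=True, exist_ok=True)
    try:
        with Image.open(path) as img:
            img.convert("RGB").resize(SIZE).save(out)
    except OSError:
        continue  # unreadable or corrupted image

# Apply the backbone-specific preprocessing to one example image
example = np.array(Image.open(next(DST.rglob("*.jpg"))), dtype=np.float32)
batch = resnet_preprocess(example[np.newaxis, ...])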


Project Structure

f20dl-cw-ay25-26-wildfive/
│
├── README.md                          # Project documentation
├── .gitignore
├── .editorconfig
│
├── data/                              # Data directory
│   ├── datasets.md                    # Dataset links and information
│   └── sample_data/                   # Sample data for testing
│       ├── CNN/                       # CNN sample images
│       │   ├── wildfire/              # Sample wildfire images
│       │   └── no_wildfire/           # Sample non-wildfire images
│       ├── KNN/                       # KNN sample data
│       │   └── features_array_testing_set_new_sample.csv
│       └── Random Forest + Logistic Regression/
│           └── features_array_2016-2017_testing_sample.csv
│
├── notebooks/                         # Jupyter notebooks for analysis
│   ├── Baseline models (wildfireDB)/ # Classical ML models
│   │   ├── EDA_preprocessing_and_logistic_regression.ipynb
│   │   ├── KNN.ipynb
│   │   ├── Random Forest Testing.ipynb
│   │   └── Training RF Model.ipynb
│   └── CNN/                           # Deep learning models
│       ├── 01_load_and_display_data.ipynb
│       ├── 02_pretrained_models_resnet50_12k.ipynb
│       ├── 03_pretrained_models_EfficientNetB3_12k.ipynb
│       ├── 04_model_with_cam_resnet50_12k.ipynb
│       ├── 05_pretrained_models_vgg16_12k.ipynb
│       └── 06_cnn_saved_models_inference.ipynb
│
├── scripts/                           # Python scripts
│   ├── preprocessing/
│   │   └── satellite images/          # Image preprocessing
│   │       ├── count_validation.py    # Validate image counts
│   │       ├── load_dataset_example.py # Dataset loading example
│   │       ├── README.txt             # Preprocessing instructions
│   │       └── resize.py              # Image resizing script
│   └── Preprocessing-training-prediction-wildfireDB/
│       ├── preprocess.py              # Main preprocessing
│       ├── LR-training.py             # Logistic Regression training
│       ├── LR-prediction.py           # Logistic Regression prediction
│       ├── RF-training.py             # Random Forest training
│       ├── RF-prediction.py           # Random Forest prediction
│       ├── KNN-training.py            # K-NN training
│       └── KNN-prediction.py          # K-NN prediction
│
├── trained models/                    # Trained model storage
│   └── trained_models.md              # Google Drive links to all trained models
│
└── documentation/                     # Project documentation and updates

Coursework Requirements

R1. Project Topic and Direction

The project focuses on predicting wildfire occurrence by comparing classical machine learning models (Logistic Regression, Random Forest, K-NN) on tabular data (WildfireDB) against deep learning models (VGG16, ResNet-50, EfficientNet-B3) trained on satellite images. The goal is to determine which approach yields superior predictive performance.

Location:

  • README.md (Initial proposal)
  • documentation/ (Weekly updates)

R2. Data Analysis and Exploration

This includes cleaning the WildfireDB dataset, addressing severe class imbalance via strategic sampling, and performing EDA to identify issues such as 100% missingness in certain columns.

Location:

  • notebooks/EDA_preprocessing_and_logistic_regression.ipynb (WildfireDB Exploratory Data Analysis, Preprocessing, Feature Engineering, Problem Discovery, and Logistic Regression baseline notebook)
  • scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py (Tabular data preprocessing)
  • scripts/preprocessing/satellite images/resize.py (Image resizing)

R3. Baseline Training and Evaluation Experiments

Three classical ML models were applied on the WildfireDB dataset:

  • Logistic Regression
  • Random Forest
  • K-NN

Models were evaluated using ROC-AUC, PR-AUC, Precision, Recall, and F1. Random Forest performed best, achieving an ROC-AUC of 0.899.
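
A minimal sketch of how those metrics can be computed with scikit-learn, assuming a fitted classifier clf and a held-out test split (the reported numbers come from the notebooks and prediction scripts listed below):

from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_prob = clf.predict_proba(X_test)[:, 1]   # probability of the positive (fire spread) class
y_pred = (y_prob >= 0.5).astype(int)       # default 0.5 threshold shown as an assumption

print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
print("PR-AUC   :", average_precision_score(y_test, y_prob))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))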

Location:

  • notebooks/Baseline models (wildfireDB)/EDA_preprocessing_and_logistic_regression.ipynb (Notebook also includes Logistic Regression baseline model training and prediction evaluation)
  • scripts/Preprocessing-training-prediction-wildfireDB/LR-training.py (Logistic Regression training script)
  • scripts/Preprocessing-training-prediction-wildfireDB/LR-prediction.py (Logistic Regression prediction script)
  • notebooks/Baseline models (wildfireDB)/Training RF Model.ipynb (Random Forest training notebook)
  • notebooks/Baseline models (wildfireDB)/Random Forest Testing.ipynb (Random Forest prediction evaluation notebook)
  • scripts/Preprocessing-training-prediction-wildfireDB/RF-training.py (Random Forest training script)
  • scripts/Preprocessing-training-prediction-wildfireDB/RF-prediction.py (Random Forest prediction script)
  • notebooks/Baseline models (wildfireDB)/KNN.ipynb (K-NN training and prediction evaluation notebook)
  • scripts/Preprocessing-training-prediction-wildfireDB/KNN-training.py (K-NN training script)
  • scripts/Preprocessing-training-prediction-wildfireDB/KNN-prediction.py (K-NN prediction script)

Trained Models: All trained Baseline and CNN models are available on Google Drive (see trained models/trained_models.md)


R4. Neural Networks

Three transfer-learning CNN models were trained:

  • VGG16
  • ResNet-50
  • EfficientNet-B3

A standard classification head was added to each backbone. Model interpretability was assessed with Grad-CAM visualizations to confirm that the models focus on relevant wildfire visual cues.
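
A minimal sketch of the "standard classification head" idea for one backbone (ResNet-50 shown; the layer sizes, dropout rate, and frozen-base choice are assumptions, not the exact notebook configuration):

import tensorflow as tf
from tensorflow.keras import layers, models

# Pretrained ResNet-50 backbone without its ImageNet classifier
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(300, 300, 3)
)
base.trainable = False  # freeze the backbone for transfer learning

# Small classification head for the binary wildfire / no-wildfire task
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])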

Location:

  • notebooks/CNN/01_load_and_display_data.ipynb (Initial data loading and visualization)
  • notebooks/CNN/02_pretrained_models_resnet50_12k.ipynb (ResNet-50 training and evaluation)
  • notebooks/CNN/03_pretrained_models_EfficientNetB3_12k.ipynb (EfficientNet-B3 training and evaluation)
  • notebooks/CNN/05_pretrained_models_vgg16_12k.ipynb (VGG16 training and evaluation)
  • notebooks/CNN/04_model_with_cam_resnet50_12k.ipynb (Grad-CAM visualization for model interpretability)
  • notebooks/CNN/06_cnn_saved_models_inference.ipynb (Inference using trained CNN models)

Trained Models: All trained Baseline and CNN models are available on Google Drive (see trained models/trained_models.md)


Documentation

Weekly updates are kept in the documentation/ directory.
