- Ahmed Al-Ansi - @Nashawiyat - H00418777
- Amir Hafiy - @Amirhafiy27 - H00391253
- Chong Shing Boa - @hohoho123 - H00456192
- Wee Zhen Hao - @zhenhao23 - H00410783
- Chia Zheng Rong - @Zen - H00452635
The project is titled WildFive Team Report: Using Dataset to Predict the Possibility of WildFire.
The primary objective is to conduct a comparative analysis of two distinct machine learning methodologies for wildfire prediction: a statistical approach using classical ML algorithms on tabular environmental data (WildfireDB) and a visual approach using Convolutional Neural Networks (CNNs) on satellite imagery (Canadian Wildfire Satellite Images Dataset). The project aims to determine whether environmental metrics or visual spatial data provide a more reliable basis for forecasting wildfire behavior (spread/presence).
- WildfireDB (Tabular Environmental Data)
  - Source: A large-scale tabular dataset created by researchers from the University of California, Vanderbilt University, and Stanford University.
  - Licence: Creative Commons Attribution 4.0 International
  - Example 1: Contains approximately 17.8 million wildfire event records from 2012 to 2017, described by 149 features.
  - Example 2: Features include environmental variables such as `WSPF_ave` (wind speed) and `PRES_ave` (pressure), alongside the binary target variable `fire_spread`.
- Canadian Wildfire Satellite Images Dataset (Image Data)
  - Source: Published by Abdelghani Aaba on Kaggle, compiled from open-source Canadian government archives.
  - Link: https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset
  - Licence: Creative Commons Attribution (CC-BY) 4.0
  - Example 1: Contains 42,850 labeled RGB satellite images (originally 350x350 pixels).
  - Example 2: Images are categorized into two classes: 'Wildfire' (53%) and 'No Wildfire' (47%), providing a balanced foundation for deep learning.
- Week 4 (D1): Project Pitch Completion: Finalize datasets, confirm project objectives, and present the initial proposal.
- Week 6: Data Preprocessing and EDA Complete: Finalize the cleaning pipeline, feature engineering for WildfireDB, and image standardization for the Canadian Wildfire Dataset.
- Week 8: Baseline Models (R3) Training and Evaluation Complete: Implement, train, and benchmark Logistic Regression, Random Forest, and K-NN on the WildfireDB dataset.
- Week 10: Neural Network Models (R4) Training and Evaluation Complete: Implement transfer learning with VGG16, ResNet-50, and EfficientNet-B3 on the Canadian Wildfire Dataset. Perform Grad-CAM visualization.
- Week 11 (D2): Final Report and Code Submission: Compile final results, complete the 6-page report, and package the complete, runnable code repository.
- Week 12 (D3/D4): Project Presentation and Peer Assessment: Deliver the presentation and submit the peer assessment form.
- Python 3.8 or higher
- pip
- Git
- Clone the repository:

  ```bash
  git clone https://github.com/F20DL-2025-26/f20dl-cw-ay25-26-wildfive.git
  cd f20dl-cw-ay25-26-wildfive
  ```

- Install required dependencies:

  ```bash
  # Core dependencies for tabular data models
  pip install pandas numpy scikit-learn imbalanced-learn joblib

  # For geospatial data processing (WildfireDB)
  pip install shapely

  # For image processing and deep learning models
  pip install tensorflow keras pillow matplotlib seaborn tqdm

  # Alternatively, install all dependencies at once:
  pip install pandas numpy scikit-learn imbalanced-learn joblib shapely tensorflow keras pillow matplotlib seaborn tqdm
  ```

- Download the datasets:
- WildfireDB:
- Original: Download from Zenodo
- Preprocessed: Available at Google Drive
- Canadian Wildfire Satellite Images:
- Original: Download from Kaggle
- Preprocessed: Available at Google Drive
Note: Sample data for testing is provided in the `data/sample_data/` directory for quick experimentation without downloading the full datasets.
The pipeline consists of two separate flows for the tabular and image datasets.
For Tabular Data (WildfireDB):

```bash
# Preprocess the WildfireDB dataset from the raw data into the three processed splits (training/validation/testing)
python scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py
```

For Image Data (Canadian Wildfire Satellite Images):

```bash
# Resize satellite images to 300x300
python scripts/preprocessing/satellite\ images/resize.py
```

For Exploratory Data Analysis, open and run the preprocessing notebook:

```bash
jupyter notebook notebooks/EDA_preprocessing_and_logistic_regression.ipynb
```

Execution: Run `python scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py`
The preprocessing pipeline applies the following transformations:
- Remove Useless Column: Remove the 149th unnamed column from the original dataset (17.8 million records).
- Temporal Filtering: Extract data from 2016-2017 only, filtering out 2012-2015 records to focus on the most recent patterns.
- Feature Engineering:
  - Remove `TEMP_ave` (100% missing values) and `Neighbour_acq_time` (90% missing, redundant)
  - Create the binary target variable `fire_spread` from `Neighbour_frp` (1 = spread; 0 = no spread), removing `Neighbour_frp` in the process
- Remove Incomplete Rows: Drop all rows with any remaining NaN values to ensure data quality.
- Process Multi-Value and Spatial Columns:
  - `frp` and `acq_time`: Extract maximum and average values from comma-separated entries
  - `Shape` and `Neighbour_Shape`: Convert WKT polygon strings to area values (`Shape_area`, `Neighbour_Shape_area`)
- Date Feature Extraction: Split `acq_date` into `acq_date_year`, `acq_date_month`, `acq_date_day`, and `acq_date_dayofyear` for temporal pattern analysis.
- Balanced Sampling and Dataset Splitting:
  - Training set: 80,000 samples (40,000 spread + 40,000 no-spread)
  - Validation set: 20,000 samples (10,000 spread + 10,000 no-spread)
  - Test set: All remaining 2016-2017 data (naturally imbalanced)
- Shuffling: Shuffle all three datasets to ensure random distribution.
Note: Model-specific preprocessing (e.g., StandardScaler for KNN, SMOTE for Random Forest) is applied within the individual training scripts.

Output: Generates three preprocessed datasets (`features_array_2016-2017_training.csv`, `features_array_2016-2017_validation.csv`, `features_array_2016-2017_testing.csv`).
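The pipeline above can be sketched with pandas. This is a minimal illustration, not the project's `preprocess.py`: the rule deriving `fire_spread` (non-zero `Neighbour_frp`), the omission of the multi-value/WKT column parsing, and the split sizes are assumptions.

```python
import pandas as pd

def preprocess_wildfiredb(df, n_train_per_class=40_000, n_val_per_class=10_000, seed=42):
    """Sketch of the WildfireDB tabular pipeline (illustrative only)."""
    # Temporal filtering: keep 2016-2017 records only
    df = df.copy()
    df["acq_date"] = pd.to_datetime(df["acq_date"])
    df = df[df["acq_date"].dt.year.isin([2016, 2017])].copy()

    # Binary target from Neighbour_frp (assumed rule: non-zero value = spread),
    # then drop it together with the mostly-missing columns
    df["fire_spread"] = (df["Neighbour_frp"].fillna(0) > 0).astype(int)
    df = df.drop(columns=["TEMP_ave", "Neighbour_acq_time", "Neighbour_frp"],
                 errors="ignore")

    # Drop rows with any remaining missing values
    df = df.dropna()

    # Date feature extraction, then drop the raw date column
    dt = df["acq_date"].dt
    df = df.assign(acq_date_year=dt.year, acq_date_month=dt.month,
                   acq_date_day=dt.day, acq_date_dayofyear=dt.dayofyear)
    df = df.drop(columns=["acq_date"])

    # Balanced sampling: equal class counts for train and validation,
    # the remainder becomes the naturally imbalanced test set
    n_tr = min(n_train_per_class, df["fire_spread"].value_counts().min())
    train = df.groupby("fire_spread").sample(n=n_tr, random_state=seed)
    rest = df.drop(index=train.index)
    n_va = min(n_val_per_class, rest["fire_spread"].value_counts().min())
    valid = rest.groupby("fire_spread").sample(n=n_va, random_state=seed)
    test = rest.drop(index=valid.index)

    # Shuffle all three splits
    shuffle = lambda d: d.sample(frac=1, random_state=seed).reset_index(drop=True)
    return shuffle(train), shuffle(valid), shuffle(test)
```

The balanced `groupby(...).sample` keeps training and validation at a 50/50 class ratio while leaving the test set's natural imbalance intact, matching the split description above.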
Execution: Run `python scripts/preprocessing/satellite\ images/resize.py`
- Loading and Verification: Load 42,850 labeled RGB satellite images. Corrupted or unreadable images are removed.
- Resizing and Standardization: All images are resized to 300 × 300 pixels to match CNN model input requirements.
- Dataset Splitting: Split into:
  - 70% Training
  - 15% Validation
  - 15% Test
- CNN Preprocessing: Images are processed according to the requirements of the transfer-learning models:
  - VGG16
  - ResNet-50
  - EfficientNet-B3
Output: Generates a `resized_dataset/` folder with preprocessed images organized by train/valid/test splits.
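The loading, verification, and resizing steps can be sketched with Pillow. The flat `*.jpg` layout and function name are assumptions; the actual `resize.py` handles the real folder structure:

```python
from pathlib import Path
from PIL import Image

def resize_images(src_dir: str, dst_dir: str, size=(300, 300)) -> int:
    """Resize every JPEG under src_dir to `size`, skipping unreadable files.

    Returns the number of images successfully written (illustrative sketch).
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    kept = 0
    for path in sorted(Path(src_dir).glob("*.jpg")):
        try:
            # First pass: detect corrupted files without decoding fully
            with Image.open(path) as img:
                img.verify()
            # verify() invalidates the handle, so reopen to resize
            with Image.open(path) as img:
                img.convert("RGB").resize(size, Image.LANCZOS).save(dst / path.name)
            kept += 1
        except OSError:
            continue  # drop corrupted or unreadable images, as the pipeline describes
    return kept
```

Separating `verify()` from the resize pass mirrors the "Loading and Verification" step: broken files are dropped before standardization.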
f20dl-cw-ay25-26-wildfive/
│
├── README.md # Project documentation
├── .gitignore
├── .editorconfig
│
├── data/ # Data directory
│ ├── datasets.md # Dataset links and information
│ └── sample_data/ # Sample data for testing
│ ├── CNN/ # CNN sample images
│ │ ├── wildfire/ # Sample wildfire images
│ │ └── no_wildfire/ # Sample non-wildfire images
│ ├── KNN/ # KNN sample data
│ │ └── features_array_testing_set_new_sample.csv
│ └── Random Forest + Logistic Regression/
│ └── features_array_2016-2017_testing_sample.csv
│
├── notebooks/ # Jupyter notebooks for analysis
│ ├── Baseline models (wildfireDB)/ # Classical ML models
│ │ ├── EDA_preprocessing_and_logistic_regression.ipynb
│ │ ├── KNN.ipynb
│ │ ├── Random Forest Testing.ipynb
│ │ └── Training RF Model.ipynb
│ └── CNN/ # Deep learning models
│ ├── 01_load_and_display_data.ipynb
│ ├── 02_pretrained_models_resnet50_12k.ipynb
│ ├── 03_pretrained_models_EfficientNetB3_12k.ipynb
│ ├── 04_model_with_cam_resnet50_12k.ipynb
│ ├── 05_pretrained_models_vgg16_12k.ipynb
│ └── 06_cnn_saved_models_inference.ipynb
│
├── scripts/ # Python scripts
│ ├── preprocessing/
│ │ └── satellite images/ # Image preprocessing
│ │ ├── count_validation.py # Validate image counts
│ │ ├── load_dataset_example.py # Dataset loading example
│ │ ├── README.txt # Preprocessing instructions
│ │ └── resize.py # Image resizing script
│ └── Preprocessing-training-prediction-wildfireDB/
│ ├── preprocess.py # Main preprocessing
│ ├── LR-training.py # Logistic Regression training
│ ├── LR-prediction.py # Logistic Regression prediction
│ ├── RF-training.py # Random Forest training
│ ├── RF-prediction.py # Random Forest prediction
│ ├── KNN-training.py # K-NN training
│ └── KNN-prediction.py # K-NN prediction
│
├── trained models/ # Trained model storage
│ └── trained_models.md # Google Drive links to all trained models
│
└── documentation/ # Project documentation and updates
The project focuses on predicting wildfire occurrence by comparing classical machine learning models (Logistic Regression, Random Forest, K-NN) on tabular data (WildfireDB) against deep learning models (VGG16, ResNet-50, EfficientNet-B3) trained on satellite images. The goal is to determine which approach yields superior predictive performance.
Location:
- `README.md` (Initial proposal)
- `documentation/` (Weekly updates)
This includes cleaning the WildfireDB dataset, addressing severe imbalance via strategic sampling, and performing EDA to identify issues such as 100% missingness in certain columns.
Location:
- `notebooks/EDA_preprocessing_and_logistic_regression.ipynb` (WildfireDB exploratory data analysis, preprocessing, feature engineering, problem discovery, and Logistic Regression baseline notebook)
- `scripts/Preprocessing-training-prediction-wildfireDB/preprocess.py` (Tabular data preprocessing)
- `scripts/preprocessing/satellite images/resize.py` (Image resizing)
Three classical ML models were applied on the WildfireDB dataset:
- Logistic Regression
- Random Forest
- K-NN
Models were evaluated using ROC-AUC, PR-AUC, Precision, Recall, and F1. Random Forest performed best, achieving an ROC-AUC of 0.899.
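The metric set used for these comparisons maps directly onto scikit-learn; here `y_true` and `y_score` are placeholders for a model's test labels and predicted probabilities, and the 0.5 threshold is an assumption:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_recall_fscore_support)

def evaluate(y_true, y_score, threshold: float = 0.5) -> dict:
    """Compute the report's metric set for one binary classifier."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return {
        "ROC-AUC": roc_auc_score(y_true, y_score),           # ranking quality
        "PR-AUC": average_precision_score(y_true, y_score),  # robust under class imbalance
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }
```

ROC-AUC and PR-AUC are computed from the raw scores, while precision/recall/F1 depend on the chosen decision threshold, which is why both kinds of metric are reported.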
Location:
- `notebooks/Baseline models (wildfireDB)/EDA_preprocessing_and_logistic_regression.ipynb` (also includes Logistic Regression baseline model training and prediction evaluation)
- `scripts/Preprocessing-training-prediction-wildfireDB/LR-training.py` (Logistic Regression training script)
- `scripts/Preprocessing-training-prediction-wildfireDB/LR-prediction.py` (Logistic Regression prediction script)
- `notebooks/Baseline models (wildfireDB)/Training RF Model.ipynb` (Random Forest training notebook)
- `notebooks/Baseline models (wildfireDB)/Random Forest Testing.ipynb` (Random Forest prediction evaluation notebook)
- `scripts/Preprocessing-training-prediction-wildfireDB/RF-training.py` (Random Forest training script)
- `scripts/Preprocessing-training-prediction-wildfireDB/RF-prediction.py` (Random Forest prediction script)
- `notebooks/Baseline models (wildfireDB)/KNN.ipynb` (K-NN training and prediction evaluation notebook)
- `scripts/Preprocessing-training-prediction-wildfireDB/KNN-training.py` (K-NN training script)
- `scripts/Preprocessing-training-prediction-wildfireDB/KNN-prediction.py` (K-NN prediction script)
Trained Models: All trained Baseline and CNN models are available on Google Drive (see `trained models/trained_models.md`)
Three transfer-learning CNN models were trained:
- VGG16
- ResNet-50
- EfficientNet-B3
A standard classification head was added to each. Model interpretability was ensured using Grad-CAM visualizations to confirm focus on relevant wildfire visual cues.
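The classification head is described above only as "standard"; a typical Keras construction, shown here with ResNet-50 as an illustration (layer sizes and dropout rate are assumptions, not the project's exact head), might be:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_transfer_model(input_shape=(300, 300, 3), weights="imagenet"):
    """Frozen ResNet-50 backbone with a small binary-classification head (sketch)."""
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=input_shape)
    base.trainable = False  # freeze pretrained weights for feature extraction

    # Standard head: pooled features -> dense -> dropout -> sigmoid output
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # wildfire vs. no wildfire

    model = models.Model(base.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The same head pattern applies to the VGG16 and EfficientNet-B3 backbones by swapping the `tf.keras.applications` constructor, with each model's own input preprocessing applied beforehand.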
Location:
- `notebooks/CNN/01_load_and_display_data.ipynb` (Initial data loading and visualization)
- `notebooks/CNN/02_pretrained_models_resnet50_12k.ipynb` (ResNet-50 training and evaluation)
- `notebooks/CNN/03_pretrained_models_EfficientNetB3_12k.ipynb` (EfficientNet-B3 training and evaluation)
- `notebooks/CNN/05_pretrained_models_vgg16_12k.ipynb` (VGG16 training and evaluation)
- `notebooks/CNN/04_model_with_cam_resnet50_12k.ipynb` (Grad-CAM visualization for model interpretability)
- `notebooks/CNN/06_cnn_saved_models_inference.ipynb` (Inference using trained CNN models)
Trained Models: All trained Baseline and CNN models are available on Google Drive (see `trained models/trained_models.md`)
Weekly updates are kept in the `documentation/` directory.