grgttdln/Breast-Cancer-Classification


Prediction and Diagnosis of Breast Cancer Using Different Machine Learning Classification Techniques

Overview

This repository contains the code and resources for the research paper titled "Prediction and Diagnosis of Breast Cancer Using Different Machine Learning Classification Techniques". The study focuses on applying and comparing various machine learning algorithms to classify breast tumors as benign or malignant using the Breast Cancer Wisconsin (Diagnostic) dataset.

Abstract

Breast cancer is a leading cause of cancer-related deaths globally, making early and accurate diagnosis vital. This study utilizes Machine Learning (ML) as a tool for improved breast cancer classification. Four ML algorithms were implemented:

  • Logistic Regression (LR)
  • K-Nearest Neighbors (KNN)
  • Support Vector Machine (SVM)
  • Random Forest Classifier (RFC)

These models were trained and evaluated using three feature selection approaches: full features, reduced features, and the top 10 most important features. Hyperparameter tuning was performed using GridSearchCV, and L2 regularization was applied. Performance was assessed using standard metrics (Accuracy, Precision, Recall, F1-score, ROC-AUC). Statistical validation using the Friedman test and cross-validated F1-scores was also conducted. The results indicated that SVM with full features achieved the highest F1-score (0.9912), making it the most effective model in this study.

Dataset

  • Source: Breast Cancer Wisconsin (Diagnostic) Dataset from the UCI Machine Learning Repository, originally collected at the University of Wisconsin Hospitals, Madison, by Dr. William H. Wolberg et al.
  • Characteristics: Contains 569 samples and 30 features derived from digitized images of fine needle aspirates (FNAs) of breast masses. Features describe cell nuclei characteristics.
  • Target Variable: Diagnosis (Malignant or Benign), encoded as 1 and 0 respectively.
  • Preprocessing:
    • No missing values were found.
    • Label Encoding was applied to the target variable.
    • Feature scaling (StandardScaler for LR/SVM, MinMaxScaler for KNN) was applied based on model requirements. RFC did not require scaling.
    • The dataset was split into 60% training, 20% validation, and 20% testing sets.
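The preprocessing steps above can be sketched as follows. The repository fetches the data via ucimlrepo; for a self-contained example, this sketch uses scikit-learn's bundled copy of the same Wisconsin (Diagnostic) dataset, and the random seed and split parameters are illustrative assumptions, not the study's exact values.

```python
# Sketch of the preprocessing pipeline (assumptions noted in the text above).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data                       # 569 samples x 30 features
# scikit-learn encodes malignant as 0; flip so malignant = 1, benign = 0
y = (data.target == 0).astype(int)

# 60% train / 20% validation / 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

# Fit the scaler on the training split only, then apply it everywhere
# (StandardScaler shown here; MinMaxScaler would replace it for KNN)
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s, X_test_s = (
    scaler.transform(s) for s in (X_train, X_val, X_test))

print(X_train.shape, X_val.shape, X_test.shape)
```

Fitting the scaler only on the training split avoids leaking validation/test statistics into training, which matters when comparing models this closely.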

Methodology

  1. Data Loading & Preprocessing: The dataset was loaded, checked for missing values, scaled, and split.
  2. Exploratory Data Analysis (EDA): Class distribution, feature histograms, and feature correlations were analyzed.
  3. Model Building: Four classification algorithms (LR, KNN, SVM, RFC) were trained. Training was performed under three feature scenarios:
    • Case 1: Full Features
    • Case 2: Reduced Features (Feature Selection)
    • Case 3: Top 10 Features
  4. Hyperparameter Tuning: GridSearchCV was used for optimization.
  5. Evaluation: Models were evaluated using Accuracy, Precision, Recall, F1-Score, and ROC-AUC metrics. Confusion matrices and ROC curves were generated.
  6. Statistical Validation: Friedman test was applied.
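Steps 3–5 can be sketched for one model as below, again using scikit-learn's bundled copy of the dataset. The parameter grid and seed are illustrative assumptions, not the values tuned in the study; in SVM and LR, the C parameter controls the L2 regularization strength mentioned in the abstract.

```python
# Sketch of model building, GridSearchCV tuning, and evaluation for SVM.
# The parameter grid below is illustrative, not the one used in the paper.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling inside the pipeline so each CV fold is scaled independently
pipe = make_pipeline(StandardScaler(), SVC(probability=True))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    scoring="f1",
    cv=5,
)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]
print(grid.best_params_)
print(classification_report(y_test, y_pred))   # Accuracy/Precision/Recall/F1
print("ROC-AUC:", roc_auc_score(y_test, y_prob))
```

The same pattern applies to LR, KNN, and RFC, swapping the estimator, scaler, and parameter grid per model.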

Key Findings

  • The Support Vector Machine (SVM) model trained with the full feature set achieved the best performance, with an F1-score of 0.9912.
  • Logistic Regression performed best with the Top 10 Features set.
  • KNN performed best with the Full Features set.
  • Feature scaling and selection significantly impacted model performance. StandardScaler worked well for LR and SVM, while MinMaxScaler was optimal for KNN.
  • All models demonstrated strong classification capabilities, with high AUC values close to 1.
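The statistical validation step can be sketched as below: the four models are scored with the same cross-validation folds, and the per-fold F1-scores are compared with the Friedman test. Hyperparameters here are library defaults for illustration, not the tuned values from the study.

```python
# Sketch of the Friedman test over cross-validated F1-scores (illustrative
# default hyperparameters, not the tuned values from the paper).
from scipy.stats import friedmanchisquare
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "KNN": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RFC": RandomForestClassifier(random_state=42),  # no scaling needed
}

# One F1-score per fold per model; cross_val_score with an int cv uses the
# same deterministic stratified folds for every model
scores = {name: cross_val_score(m, X, y, cv=10, scoring="f1")
          for name, m in models.items()}

stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
```

A significant Friedman result would justify post-hoc pairwise comparisons (e.g. via scikit-posthocs, as listed under Technologies Used).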

Technologies Used

  • Language: Python
  • Environment: Anaconda, Jupyter Notebook
  • Core Libraries:
    • scikit-learn (for ML models, metrics, preprocessing, GridSearchCV)
    • pandas (for data manipulation)
    • NumPy (for numerical operations)
    • Matplotlib & seaborn (for visualization)
    • ucimlrepo (to fetch the dataset)
    • scipy.stats & scikit-posthocs (for statistical tests)

How to Run (Example Structure)

  1. Clone the repository:
    git clone <repository-url>
    cd <repository-directory>
  2. Set up the environment: (Recommend using Conda)
    # Create and activate conda environment (if environment.yml is provided)
    # conda env create -f environment.yml
    # conda activate <env_name>
    
    # Or install dependencies manually (if requirements.txt is provided)
    # pip install -r requirements.txt
  3. Run the Jupyter Notebook:
    jupyter notebook "breast_cancer.ipynb"
    (Note: adapt the notebook filename if it differs.)
  4. Explore the code: The notebook contains the data loading, preprocessing, model training, evaluation, and visualization steps outlined in the paper.

Authors

  • Georgette Dalen S. Cadiz
  • Joshua G. Dampil
  • Wince Larcen M. Rivano
  • Adriel E. Groyon

College of Computer and Information Science, Mapúa Malayan Colleges Laguna, Philippines
