grgttdln/Breast-Cancer-Classification


Prediction and Diagnosis of Breast Cancer Using Different Machine Learning Classification Techniques

Overview

This repository contains the code and resources for the research paper titled "Prediction and Diagnosis of Breast Cancer Using Different Machine Learning Classification Techniques". The study focuses on applying and comparing various machine learning algorithms to classify breast tumors as benign or malignant using the Breast Cancer Wisconsin (Diagnostic) dataset.

Abstract

Breast cancer is a leading cause of cancer-related deaths globally, making early and accurate diagnosis vital. This study utilizes Machine Learning (ML) as a tool for improved breast cancer classification. Four ML algorithms were implemented:

  • Logistic Regression (LR)
  • K-Nearest Neighbors (KNN)
  • Support Vector Machine (SVM)
  • Random Forest Classifier (RFC)

These models were trained and evaluated using three feature selection approaches: full features, reduced features, and the top 10 most important features. Hyperparameter tuning was performed using GridSearchCV, and L2 regularization was applied. Performance was assessed using standard metrics (Accuracy, Precision, Recall, F1-score, ROC-AUC). Statistical validation using the Friedman test and cross-validated F1-scores was also conducted. The results indicated that SVM with full features achieved the highest F1-score (0.9912), making it the most effective model in this study.

Dataset

  • Source: Breast Cancer Wisconsin (Diagnostic) Dataset from the UCI Machine Learning Repository, originally collected at the University of Wisconsin Hospitals, Madison, by Dr. William H. Wolberg et al.
  • Characteristics: Contains 569 samples and 30 features derived from digitized images of fine needle aspirates (FNAs) of breast masses. Features describe cell nuclei characteristics.
  • Target Variable: Diagnosis (Malignant or Benign), encoded as 1 and 0 respectively.
  • Preprocessing:
    • No missing values were found.
    • Label Encoding was applied to the target variable.
    • Feature scaling (StandardScaler for LR/SVM, MinMaxScaler for KNN) was applied based on model requirements. RFC did not require scaling.
    • The dataset was split into 60% training, 20% validation, and 20% testing sets.
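The preprocessing steps above can be sketched as follows. The repository fetches the data via ucimlrepo; for a self-contained example, this sketch uses scikit-learn's bundled copy of the same Wisconsin (Diagnostic) dataset, and the random seed and split parameters are illustrative assumptions, not the study's exact values.

```python
# Sketch of the preprocessing pipeline (assumptions noted in the text above).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data                       # 569 samples x 30 features
# scikit-learn encodes malignant as 0; flip so malignant = 1, benign = 0
y = (data.target == 0).astype(int)

# 60% train / 20% validation / 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

# Fit the scaler on the training split only, then apply it everywhere
# (StandardScaler shown here; MinMaxScaler would replace it for KNN)
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s, X_test_s = (
    scaler.transform(s) for s in (X_train, X_val, X_test))

print(X_train.shape, X_val.shape, X_test.shape)
```

Fitting the scaler only on the training split avoids leaking validation/test statistics into training, which matters when comparing models this closely.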

Methodology

  1. Data Loading & Preprocessing: The dataset was loaded, checked for missing values, scaled, and split.
  2. Exploratory Data Analysis (EDA): Class distribution, feature histograms, and feature correlations were analyzed.
  3. Model Building: Four classification algorithms (LR, KNN, SVM, RFC) were trained. Training was performed under three feature scenarios:
    • Case 1: Full Features
    • Case 2: Reduced Features (Feature Selection)
    • Case 3: Top 10 Features
  4. Hyperparameter Tuning: GridSearchCV was used for optimization.
  5. Evaluation: Models were evaluated using Accuracy, Precision, Recall, F1-Score, and ROC-AUC metrics. Confusion matrices and ROC curves were generated.
  6. Statistical Validation: Friedman test was applied.
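Steps 3–5 can be sketched for one model as below, again using scikit-learn's bundled copy of the dataset. The parameter grid and seed are illustrative assumptions, not the values tuned in the study; in SVM and LR, the C parameter controls the L2 regularization strength mentioned in the abstract.

```python
# Sketch of model building, GridSearchCV tuning, and evaluation for SVM.
# The parameter grid below is illustrative, not the one used in the paper.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling inside the pipeline so each CV fold is scaled independently
pipe = make_pipeline(StandardScaler(), SVC(probability=True))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    scoring="f1",
    cv=5,
)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]
print(grid.best_params_)
print(classification_report(y_test, y_pred))   # Accuracy/Precision/Recall/F1
print("ROC-AUC:", roc_auc_score(y_test, y_prob))
```

The same pattern applies to LR, KNN, and RFC, swapping the estimator, scaler, and parameter grid per model.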

Key Findings

  • The Support Vector Machine (SVM) model trained with the full feature set achieved the best performance, with an F1-score of 0.9912.
  • Logistic Regression performed best with the Top 10 Features set.
  • KNN performed best with the Full Features set.
  • Feature scaling and selection significantly impacted model performance. StandardScaler worked well for LR and SVM, while MinMaxScaler was optimal for KNN.
  • All models demonstrated strong classification capabilities, with high AUC values close to 1.
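The statistical validation step can be sketched as below: the four models are scored with the same cross-validation folds, and the per-fold F1-scores are compared with the Friedman test. Hyperparameters here are library defaults for illustration, not the tuned values from the study.

```python
# Sketch of the Friedman test over cross-validated F1-scores (illustrative
# default hyperparameters, not the tuned values from the paper).
from scipy.stats import friedmanchisquare
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "KNN": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RFC": RandomForestClassifier(random_state=42),  # no scaling needed
}

# One F1-score per fold per model; cross_val_score with an int cv uses the
# same deterministic stratified folds for every model
scores = {name: cross_val_score(m, X, y, cv=10, scoring="f1")
          for name, m in models.items()}

stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
```

A significant Friedman result would justify post-hoc pairwise comparisons (e.g. via scikit-posthocs, as listed under Technologies Used).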

Technologies Used

  • Language: Python
  • Environment: Anaconda, Jupyter Notebook
  • Core Libraries:
    • scikit-learn (for ML models, metrics, preprocessing, GridSearchCV)
    • pandas (for data manipulation)
    • NumPy (for numerical operations)
    • Matplotlib & seaborn (for visualization)
    • ucimlrepo (to fetch the dataset)
    • scipy.stats & scikit-posthocs (for statistical tests)

How to Run (Example Structure)

  1. Clone the repository:
    git clone <repository-url>
    cd <repository-directory>
  2. Set up the environment: (Recommend using Conda)
    # Create and activate conda environment (if environment.yml is provided)
    # conda env create -f environment.yml
    # conda activate <env_name>
    
    # Or install dependencies manually (if requirements.txt is provided)
    # pip install -r requirements.txt
  3. Run the Jupyter Notebook:
    jupyter notebook "breast_cancer.ipynb"
    (Note: adapt the notebook filename if it differs.)
  4. Explore the code: The notebook contains the data loading, preprocessing, model training, evaluation, and visualization steps outlined in the paper.

Authors

  • Georgette Dalen S. Cadiz
  • Joshua G. Dampil
  • Wince Larcen M. Rivano
  • Adriel E. Groyon

College of Computer and Information Science, Mapúa Malayan Colleges Laguna, Philippines
