naholav/dbscan-vs-kmeans

Clustering Algorithms Comparison

This repository compares the K-Means and DBSCAN clustering algorithms. It demonstrates how data standardization affects clustering performance and includes visualizations of the resulting clusters and metrics.


Project Overview

This project evaluates the performance of clustering algorithms on standardized and non-standardized datasets. Key objectives include:

  1. Clustering Analysis: Compare the performance of K-Means and DBSCAN.
  2. Impact of Standardization: Highlight how standardizing the data affects clustering results.
  3. Visualizations: Generate plots to visualize clusters and performance metrics.

Datasets

1. pricerunner_aggregate 2.csv

  • Contains product and merchant information used for clustering.
  • Key Columns:
    • Product ID
    • Merchant ID

2. Clustered Data Outputs

  • non-standardize_k-means_clustered_data.csv: Results of K-Means clustering without standardization.
  • standardize_k-means_clustered_data.csv: Results of K-Means clustering with standardization.
  • dbscan_clustered_data.csv: Results of DBSCAN clustering.

Python Scripts

1. non-standardize_k-means.py

  • Performs K-Means clustering without standardizing the data.
  • Outputs clustering metrics and visualizations.

2. standardize_k-means.py

  • Performs K-Means clustering on standardized data.
  • Highlights the impact of standardization on clustering.
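The effect of standardization can be sketched on toy data. The example below is a minimal illustration, not the repository's script: the synthetic features, `n_clusters=2`, and the variable names are all assumptions. It builds one informative small-range feature and one large-range noise feature, then fits K-Means on both the raw and the standardized version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not from the repository): two groups that differ
# only in the first, small-range feature; the second feature is large-scale noise.
rng = np.random.default_rng(0)
small = np.concatenate([rng.normal(0.0, 0.1, 50), rng.normal(1.0, 0.1, 50)])
large = rng.normal(0.0, 1000.0, 100)
X = np.column_stack([small, large])

# Standardize: each column is rescaled to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means on raw and on standardized features (k=2 is an assumed value).
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```

Because K-Means relies on Euclidean distance, the large-range column tends to dominate the raw fit, while the standardized fit gives both columns comparable weight.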

3. dbscan.py

  • Implements DBSCAN clustering algorithm.
  • Includes clustering metrics and visualizations for non-standardized data.
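As a rough sketch of how DBSCAN behaves, the example below runs scikit-learn's `DBSCAN` on synthetic points. The data and the `eps` and `min_samples` values are assumptions chosen for illustration, not the settings used in `dbscan.py`. Unlike K-Means, DBSCAN does not take a cluster count, and it labels points in sparse regions as noise (`-1`).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in data: two dense blobs plus one far-away outlier.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(30, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([blob_a, blob_b, outlier])

# eps is the neighborhood radius; min_samples is the number of neighbors
# a point needs to count as a core point (both values are assumed here).
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# Noise points get the label -1, so exclude it when counting clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```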

Performance Metrics

1. Silhouette Score

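  • Measures how similar each point is to its own cluster compared with the nearest other cluster.
  • Range: -1 (points likely assigned to the wrong cluster) to 1 (dense, well-separated clusters); values near 0 indicate overlapping clusters.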

2. Davies-Bouldin Index

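  • Evaluates the compactness and separation of clusters by comparing each cluster's internal scatter with its distance to the most similar other cluster.
  • Lower values indicate better clustering performance; the minimum is 0.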
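Both metrics are available in scikit-learn as `silhouette_score` and `davies_bouldin_score`. The sketch below, on synthetic data with assumed variable names, scores a sensible labeling and a deliberately poor one to show which direction each metric moves.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in data: two well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(30, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.1, size=(30, 2)),
])

# A labeling that matches the blobs, and one that alternates 0/1
# so that each "cluster" mixes points from both blobs.
good_labels = np.array([0] * 30 + [1] * 30)
bad_labels = np.arange(60) % 2

good_sil = silhouette_score(X, good_labels)
bad_sil = silhouette_score(X, bad_labels)
good_db = davies_bouldin_score(X, good_labels)
bad_db = davies_bouldin_score(X, bad_labels)

print(f"silhouette:     good={good_sil:.3f}  bad={bad_sil:.3f}")
print(f"davies-bouldin: good={good_db:.3f}  bad={bad_db:.3f}")
```

The matching labeling should score a higher Silhouette Score and a lower Davies-Bouldin Index than the alternating one. Note that both functions need at least two distinct labels, so a DBSCAN result that collapses to a single cluster cannot be scored this way.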

Visualizations

Key Visualizations:

  1. K-Means vs DBSCAN Performance (Non-Standardized)

  2. Impact of Standardization on K-Means (Standardized vs Non-Standardized)

  3. K-Means vs DBSCAN Performance (Standardized)


Installation

Step 1: Clone the Repository

git clone https://github.com/<your_username>/<repository_name>.git
cd <repository_name>

Step 2: Install Dependencies

Install the required Python libraries:

pip install -r requirements.txt

How to Run

1. Running Non-Standardized K-Means

  • Execute the script:
    python non-standardize_k-means.py
  • Ensure the input file path and output file path are correctly set within the script:
    file_path = 'path_to_your_file'  # Replace with the path to your input dataset
    output_file_path = 'path_to_output_file'  # Replace with the path where the results will be saved

2. Running Standardized K-Means

  • Execute the script:
    python standardize_k-means.py
  • Ensure the input file path and output file path are correctly set within the script:
    file_path = 'path_to_your_file'  # Replace with the path to your input dataset
    output_file_path = 'path_to_output_file'  # Replace with the path where the results will be saved

3. Running DBSCAN

  • Execute the script:
    python dbscan.py
  • Ensure the input file path and output file path are correctly set within the script:
    file_path = 'path_to_your_file'  # Replace with the path to your input dataset
    output_file_path = 'path_to_output_file'  # Replace with the path where the results will be saved
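The role of the two path variables can be sketched end to end. This is a minimal illustration under stated assumptions, not the code in the scripts: the toy rows are made up, the `Cluster` output column name is hypothetical, and the temporary directory exists only so the sketch runs on its own. The column names follow the Key Columns listed under Datasets.

```python
import os
import tempfile

import pandas as pd
from sklearn.cluster import KMeans

# Write a tiny made-up CSV so the sketch is self-contained; in practice
# file_path would point at your real input dataset.
workdir = tempfile.mkdtemp()
file_path = os.path.join(workdir, "input.csv")
output_file_path = os.path.join(workdir, "clustered.csv")
pd.DataFrame({
    "Product ID": [1, 2, 3, 101, 102, 103],
    "Merchant ID": [1, 1, 2, 50, 51, 51],
}).to_csv(file_path, index=False)

# Read the input, cluster on the two ID columns, and save the labeled rows.
df = pd.read_csv(file_path)
features = df[["Product ID", "Merchant ID"]]
# "Cluster" is a hypothetical column name; n_clusters=2 is an assumed value.
df["Cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
df.to_csv(output_file_path, index=False)
```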

Project Structure

project_directory/
├── images/
│   ├── non-standardize_k-means_vs_dbscan.png
│   ├── non-standardize_vs_standardize.png
│   ├── standardize_k-means_vs_dbscan.png
│   ├── dbscan.png
│   ├── k-means_non-standardized.png
│   └── k-means_standardize.png
├── data/
│   ├── pricerunner_aggregate 2.csv
│   ├── non-standardize_k-means_clustered_data.csv
│   ├── standardize_k-means_clustered_data.csv
│   └── dbscan_clustered_data.csv
├── non-standardize_k-means.py
├── standardize_k-means.py
├── dbscan.py
├── requirements.txt
└── README.md

Key Notes

  1. The repository contains a requirements.txt file listing all necessary Python libraries. Make sure to install the dependencies before running any scripts.
  2. Visualizations generated by the scripts are automatically saved in the images/ folder.
  3. Replace all instances of 'path_to_your_file' and 'path_to_output_file' in the scripts with the correct file paths for your dataset and desired output.

About

Comparison of K-Means and DBSCAN clustering algorithms with analysis of data standardization impact on performance
