Clustering Algorithms Comparison

This repository compares the K-Means and DBSCAN clustering algorithms on the same dataset. It shows how data standardization affects clustering performance and includes plots of the resulting clusters and evaluation metrics.

Project Overview

This project evaluates the performance of clustering algorithms on standardized and non-standardized datasets. Key objectives include:

- Clustering Analysis: Compare the performance of K-Means and DBSCAN.
- Impact of Standardization: Highlight how standardizing the data affects clustering results.
- Visualizations: Generate plots to visualize clusters and performance metrics.

Datasets

1. `pricerunner_aggregate 2.csv`

Contains product and merchant information used for clustering.
Key Columns:
- Product ID
- Merchant ID

2. Clustered Data Outputs

- non-standardize_k-means_clustered_data.csv: Results of K-Means clustering without standardization.
- standardize_k-means_clustered_data.csv: Results of K-Means clustering with standardization.
- dbscan_clustered_data.csv: Results of DBSCAN clustering.

Python Scripts

1. `non-standardize_k-means.py`

Performs K-Means clustering without standardizing the data.
Outputs clustering metrics and visualizations.
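To make the K-Means step concrete, here is a minimal pure-Python sketch of Lloyd's algorithm. It is an illustration only and not necessarily how the script in this repository is written; the function name `kmeans`, the first-k-points seeding, and the sample points are assumptions chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    # Assumption: seed the centroids with the first k points (deterministic, not robust).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical data: two tight groups of two points each.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9)]
labels, centroids = kmeans(points, k=2)
```

Because the distances are computed on raw feature values, a feature with a large numeric range dominates the assignment step, which is why the standardized variant can give different clusters.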

2. `standardize_k-means.py`

Performs K-Means clustering on standardized data.
Highlights the impact of standardization on clustering.
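Standardization here can be read as z-scoring each feature column. The sketch below uses only the standard library; the function name `standardize` and the sample rows are hypothetical, and the actual script may rely on a library scaler instead.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its population std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 so it maps to 0.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical rows whose two features live on very different scales.
rows = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = standardize(rows)
```

After scaling, both columns have mean 0 and unit standard deviation, so neither feature dominates a Euclidean distance.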

3. `dbscan.py`

Implements the DBSCAN clustering algorithm.
Includes clustering metrics and visualizations for non-standardized data.
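For readers unfamiliar with DBSCAN, the following is a minimal pure-Python sketch of the algorithm, not the repository's implementation. The parameter names `eps` and `min_samples`, the `NOISE` constant, and the sample points are assumptions made for illustration.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id, or NOISE (-1) for outliers."""

    def neighbors(i):
        # The point itself is included and counts toward min_samples.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_samples:
                queue.extend(nearby)  # j is a core point: keep expanding
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

Unlike K-Means, the number of clusters is not fixed in advance, and points that are not density-reachable from any core point are labeled as noise.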

Performance Metrics

1. Silhouette Score

Measures how similar each point is to its own cluster compared with the nearest neighboring cluster.
Range: -1 (poor clustering) to 1 (ideal clustering); values near 0 indicate overlapping clusters.
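The definition can be sketched in a few lines of pure Python. This is an illustrative implementation under stated assumptions (Euclidean distance, at least two clusters, a score of 0 for singleton clusters); the function name `silhouette_score` and the sample data are chosen for the example, and the scripts may compute the metric with a library instead.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean over all points of (b - a) / max(a, b)."""
    clusters = defaultdict(list)
    for p, label in zip(points, labels):
        clusters[label].append(p)

    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is in the sum, hence len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(mean_dist(p, m) for other, m in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical data: two well-separated pairs.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

With these points, the labeling that matches the geometry scores close to 1, while the labeling that splits each pair scores below 0.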

2. Davies-Bouldin Index

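Evaluates the compactness and separation of clusters by averaging, over all clusters, the worst-case ratio of within-cluster scatter to the distance between cluster centroids.
Lower values indicate better clustering performance; the minimum is 0.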
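As a worked illustration of the index, the sketch below computes it in pure Python with Euclidean distances. The function name `davies_bouldin_index` and the sample data are assumptions for the example; it expects at least two clusters with distinct centroids and is not taken from the repository's scripts.

```python
import math
from collections import defaultdict


def davies_bouldin_index(points, labels):
    """Average, over clusters, of the worst (s_i + s_j) / d(c_i, c_j) ratio."""
    clusters = defaultdict(list)
    for p, label in zip(points, labels):
        clusters[label].append(p)

    # Centroid of each cluster (column-wise mean of its members).
    centroids = {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter s_i: mean distance from the members to their centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


# Hypothetical data: two well-separated pairs.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = davies_bouldin_index(points, [0, 0, 1, 1])  # compact, far apart
bad = davies_bouldin_index(points, [0, 1, 0, 1])   # spread out, centroids close
```

Here the labeling that matches the geometry yields a small index, while the labeling that mixes the two groups yields a much larger one.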

Visualizations

Key Visualizations:

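- K-Means vs DBSCAN Performance (Non-Standardized): images/non-standardize_k-means_vs_dbscan.png
- Impact of Standardization on K-Means: images/non-standardize_vs_standardize.png
- K-Means vs DBSCAN Performance (Standardized): images/standardize_k-means_vs_dbscan.png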

Installation

Step 1: Clone the Repository

git clone https://github.com/<your_username>/<repository_name>.git
cd <repository_name>

Step 2: Install Dependencies

Install the required Python libraries:

pip install -r requirements.txt

How to Run

1. Running Non-Standardized K-Means

Execute the script:
```
python non-standardize_k-means.py
```

Ensure the input file path and output file path are correctly set within the script:

file_path = 'path_to_your_file'  # Replace with the path to your input dataset
output_file_path = 'path_to_output_file'  # Replace with the path where the results will be saved

2. Running Standardized K-Means

Execute the script:
```
python standardize_k-means.py
```

Ensure the input file path and output file path are correctly set within the script:

file_path = 'path_to_your_file'  # Replace with the path to your input dataset
output_file_path = 'path_to_output_file'  # Replace with the path where the results will be saved

3. Running DBSCAN

Execute the script:
```
python dbscan.py
```

Ensure the input file path and output file path are correctly set within the script:

file_path = 'path_to_your_file'  # Replace with the path to your input dataset
output_file_path = 'path_to_output_file'  # Replace with the path where the results will be saved
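One way to set both variables without hard-coding absolute locations is sketched below with `pathlib`. It assumes the script is launched from the repository root and that the `data/` folder matches the layout in the Project Structure section; pairing the raw dataset with the DBSCAN output file is only an illustration, and the same pattern applies to the other two scripts.

```python
from pathlib import Path

# Assumption: the current working directory is the repository root.
project_dir = Path.cwd()
data_dir = project_dir / "data"

# Input dataset and output location, built relative to the data/ folder.
file_path = data_dir / "pricerunner_aggregate 2.csv"
output_file_path = data_dir / "dbscan_clustered_data.csv"
```

Building the paths this way also sidesteps quoting problems caused by the space in the dataset's file name.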

Project Structure

project_directory/
├── images/
│   ├── non-standardize_k-means_vs_dbscan.png
│   ├── non-standardize_vs_standardize.png
│   ├── standardize_k-means_vs_dbscan.png
│   ├── dbscan.png
│   ├── k-means_non-standardized.png
│   ├── k-means_standardize.png
├── data/
│   ├── pricerunner_aggregate 2.csv
│   ├── non-standardize_k-means_clustered_data.csv
│   ├── standardize_k-means_clustered_data.csv
│   ├── dbscan_clustered_data.csv
├── non-standardize_k-means.py
├── standardize_k-means.py
├── dbscan.py
├── requirements.txt
└── README.md

Key Notes

- The repository contains a requirements.txt file listing all necessary Python libraries. Make sure to install the dependencies before running any scripts.
- Visualizations generated by the scripts are automatically saved in the images/ folder.
- Replace all instances of 'path_to_your_file' and 'path_to_output_file' in the scripts with the correct file paths for your dataset and desired output.
