This repository provides a framework for clustering analysis using K-Means and DBSCAN algorithms. The project demonstrates the effect of data standardization on clustering performance and includes visualizations for deeper insights into the results.
This project evaluates the performance of clustering algorithms on standardized and non-standardized datasets. Key objectives include:
- Clustering Analysis: Compare the performance of K-Means and DBSCAN.
- Impact of Standardization: Highlight how standardizing the data affects clustering results.
- Visualizations: Generate plots to visualize clusters and performance metrics.
- Contains product and merchant information used for clustering.
- Key Columns:
  - Product ID
  - Merchant ID
- non-standardize_k-means_clustered_data.csv: Results of K-Means clustering without standardization.
- standardize_k-means_clustered_data.csv: Results of K-Means clustering with standardization.
- dbscan_clustered_data.csv: Results of DBSCAN clustering.
- Performs K-Means clustering without standardizing the data.
- Outputs clustering metrics and visualizations.
- Performs K-Means clustering on standardized data.
- Highlights the impact of standardization on clustering.
- Implements DBSCAN clustering algorithm.
- Includes clustering metrics and visualizations for non-standardized data.
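
K-Means and DBSCAN (with a Euclidean metric) both rely on distances between points, so a feature with a large numeric range can dominate the result. The sketch below illustrates this with z-score standardization on a tiny, made-up dataset. The column meanings and values are hypothetical and not taken from the project data, and whether the scripts standardize in exactly this way is an assumption. scikit-learn's `StandardScaler` applies the same transform (mean 0, population standard deviation 1 per column).

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


# Hypothetical toy rows: (large-scale feature, small-scale feature).
rows = [(100.0, 1.0), (110.0, 5.0), (900.0, 1.2), (950.0, 4.8)]

# Before scaling, the large-scale feature decides which points look "close".
raw_ab = math.dist(rows[0], rows[1])  # differ mostly in the small-scale feature
raw_ac = math.dist(rows[0], rows[2])  # differ mostly in the large-scale feature

# After scaling, both features contribute on a comparable footing.
scaled = standardize(rows)
scaled_ab = math.dist(scaled[0], scaled[1])
scaled_ac = math.dist(scaled[0], scaled[2])

print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.2f}, d(a, c) = {scaled_ac:.2f}")
```

On the raw values the two distances differ by well over an order of magnitude; after standardization they are of similar size, which is why cluster assignments can change between the two K-Means scripts.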
- Measures how similar each point is to its own cluster compared with other clusters.
- Range: -1 (poor clustering) to 1 (ideal clustering).
- Evaluates the compactness and separation of clusters.
- Lower values indicate better clustering performance.
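
The two metric descriptions above match the silhouette score (range -1 to 1, higher is better) and the Davies-Bouldin index (lower is better). The metric names are not stated in this section, so that mapping is an assumption. The following is a minimal pure-Python sketch of both definitions on hypothetical toy points; it is meant to show what the numbers measure, not to reproduce the project's scripts. scikit-learn exposes the same metrics as `sklearn.metrics.silhouette_score` and `sklearn.metrics.davies_bouldin_score`.

```python
import math
from statistics import mean


def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean(math.dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = group_by_label(points, labels)
    centroids = {
        label: tuple(mean(axis) for axis in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = {
        label: mean(math.dist(p, centroids[label]) for p in members)
        for label, members in clusters.items()
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return mean(worst_ratios)


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]   # labels follow the two groups
mixed_labels = [0, 1, 0, 1, 0, 1]  # labels ignore the groups

for name, labels in [("good", good_labels), ("mixed", mixed_labels)]:
    print(name, silhouette(points, labels), davies_bouldin(points, labels))
```

The labeling that follows the two groups gets a higher silhouette score and a lower Davies-Bouldin index than the mixed labeling, which is the direction each metric is read in when comparing the K-Means and DBSCAN results.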
git clone https://github.com/<your_username>/<repository_name>.git
cd <repository_name>
Install the required Python libraries:
pip install -r requirements.txt
- Execute the script:
python non-standardize_k-means.py
- Ensure the input file path and output file path are correctly set within the script:
file_path = 'path_to_your_file'  # Replace with the path to your input dataset
output_file_path = 'path_to_output_file'  # Replace with the path where the results will be saved
- Execute the script:
python standardize_k-means.py
- Ensure the input file path and output file path are correctly set within the script:
file_path = 'path_to_your_file'  # Replace with the path to your input dataset
output_file_path = 'path_to_output_file'  # Replace with the path where the results will be saved
- Execute the script:
python dbscan.py
- Ensure the input file path and output file path are correctly set within the script:
file_path = 'path_to_your_file'  # Replace with the path to your input dataset
output_file_path = 'path_to_output_file'  # Replace with the path where the results will be saved
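
Because all three scripts depend on these two variables, it can help to check them before running anything. The helper below is a hypothetical addition, not part of the repository: `find_path_problems` is an illustrative name, and it only assumes that `file_path` should point to an existing file and that the directory of `output_file_path` should already exist.

```python
from pathlib import Path

# Placeholders as they appear in the scripts; replace them with real paths.
file_path = 'path_to_your_file'
output_file_path = 'path_to_output_file'


def find_path_problems(file_path, output_file_path):
    """Return a list of problems; an empty list means the paths look usable."""
    problems = []
    if not Path(file_path).is_file():
        problems.append(f"Input dataset not found: {file_path}")
    # The output file itself may not exist yet, but its directory should.
    output_dir = Path(output_file_path).parent
    if not output_dir.is_dir():
        problems.append(f"Output directory does not exist: {output_dir}")
    return problems


for problem in find_path_problems(file_path, output_file_path):
    print(problem)
```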
project_directory/
├── images/
│ ├── non-standardize_k-means_vs_dbscan.png
│ ├── non-standardize_vs_standardize.png
│ ├── standardize_k-means_vs_dbscan.png
│ ├── dbscan.png
│ ├── k-means_non-standardized.png
│ └── k-means_standardize.png
├── data/
│ ├── pricerunner_aggregate 2.csv
│ ├── non-standardize_k-means_clustered_data.csv
│ ├── standardize_k-means_clustered_data.csv
│ └── dbscan_clustered_data.csv
├── non-standardize_k-means.py
├── standardize_k-means.py
├── dbscan.py
├── requirements.txt
└── README.md
- The repository contains a requirements.txt file listing all necessary Python libraries. Make sure to install the dependencies before running any scripts.
- Visualizations generated by the scripts are automatically saved in the images/ folder.
- Replace all instances of 'path_to_your_file' and 'path_to_output_file' in the scripts with the correct file paths for your dataset and desired output.


