Scottish Haggis Data Analysis Project

Overview

Comprehensive machine learning analysis of Scottish Haggis morphological data for a data visualization and analysis assignment. This project combines exploratory data analysis, unsupervised learning (K-Means and DBSCAN clustering), and supervised machine learning (Decision Tree, Random Forest, KNN, Logistic Regression, and Linear Regression) to understand species distributions and predict biological characteristics.

Files

  • scottish_haggis_2025.csv: Raw dataset (344 observations, 3 species, 3 islands, 2023-2025)
  • tanush_analysis.ipynb: Complete Jupyter notebook with all analysis stages

Notebook Structure

The notebook contains 8 main stages:

  1. Management Summary & Introduction: Executive summary, dataset description, research objectives
  2. Data Preparation & Quality Assessment: Missing value analysis, species-specific imputation, outlier detection (1.5×IQR method), feature engineering (see the preparation sketch after this list)
  3. Exploratory Data Analysis (EDA): Univariate/bivariate analysis, species-island associations, correlation analysis, scaling/encoding justification
  4. Unsupervised Learning: K-Means Clustering: Elbow and Silhouette methods for k selection, PCA visualization (77.9% variance explained), cluster-species comparison, DBSCAN Comparative Analysis
  5. Supervised Learning I: Decision Tree Classification: Tree visualization, feature importance, confusion matrix, Hyperparameter Tuning (GridSearchCV), Cost-Complexity Pruning (ccp_alpha), Random Forest (Ensemble Method)
  6. Supervised Learning II: KNN & Logistic Regression: Optimal k selection (k=6), comprehensive coefficient interpretation with biological insights, comparative analysis
  7. Supervised Learning III: Linear Regression: Body mass prediction from non-invasive measurements, comprehensive diagnostic plots, formal statistical tests (Shapiro-Wilk, VIF, Breusch-Pagan)
  8. Cross-Stage Analysis & Conclusions: Methodological consolidation (Section 7.0), linking findings across stages (Section 7.1), conservation insights, model selection guidance
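
The Stage 2 preparation steps can be summarised in a minimal sketch. Column names such as species and tail_length_mm are assumptions about the raw file, not taken from the notebook:

import pandas as pd

df = pd.read_csv("scottish_haggis_2025.csv")
numeric_cols = df.select_dtypes("number").columns

# Species-specific imputation: fill each numeric gap with the median of its own species
df[numeric_cols] = df.groupby("species")[numeric_cols].transform(lambda s: s.fillna(s.median()))

# Outlier screening with the 1.5 * IQR rule, applied per numeric column
for col in numeric_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    n_out = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f"{col}: {n_out} potential outliers")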

Key Features

  • Data Preparation: Species-specific imputation with biological rationale, StandardScaler for distance-based methods, one-hot encoding, explicit outlier analysis methodology
  • Feature Engineering: Three engineered features (tail_to_body_ratio, bmi, head_size_index) with documented impact across all stages (see the sketch after this list)
  • Unsupervised Learning: Elbow and Silhouette methods for k selection, PCA visualization (PC1: 57.5%, PC2: 20.4%), DBSCAN comparative analysis
  • Supervised Learning: Decision Tree, Random Forest (ensemble), KNN, Logistic Regression with comprehensive coefficient interpretation; Linear Regression with full diagnostics and assumption validation
  • Analysis: Cross-stage narrative synthesis, biological interpretation, critical justifications, comprehensive model comparison
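
A minimal sketch of the feature engineering and encoding/scaling steps, continuing from the preparation sketch above. Only tail_to_body_ratio is shown; bmi and head_size_index are built analogously in the notebook, and the raw column names are assumptions:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# df: the prepared frame from the Stage 2 sketch above; measurement column names
# are assumptions about the CSV
df["tail_to_body_ratio"] = df["tail_length_mm"] / df["body_length_mm"]

# Target and feature matrix: one-hot encode the remaining categoricals (e.g. island)
# and standardise numeric columns for the distance-based methods (K-Means, KNN, PCA)
y = df["species"]
X = pd.get_dummies(df.drop(columns=["species"]), drop_first=True)
numeric_cols = X.select_dtypes("number").columns
X[numeric_cols] = StandardScaler().fit_transform(X[numeric_cols])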

Setup Instructions

1. Create Virtual Environment

python3 -m venv venv

2. Activate Virtual Environment

On macOS/Linux:

source venv/bin/activate

On Windows:

venv\Scripts\activate

3. Install Dependencies

pip install pandas numpy matplotlib seaborn scikit-learn jupyter scipy statsmodels

4. Launch Jupyter

jupyter notebook tanush_analysis.ipynb

5. Run All Cells

In Jupyter: Kernel → Restart & Run All
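
If you prefer a non-interactive run, nbconvert (installed alongside the jupyter package above) can execute the notebook in place:

jupyter nbconvert --to notebook --execute --inplace tanush_analysis.ipynb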

6. Deactivate Virtual Environment (when done)

deactivate

Expected Results

Classification Performance

  • Random Forest: 91.3% accuracy
  • KNN (k=6): 91.3% accuracy
  • Tuned Decision Tree: 91.3% accuracy
  • Logistic Regression: 89.9% accuracy
  • Base Decision Tree: 88.4% accuracy
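
A hedged sketch of the tuning workflow behind the "Tuned Decision Tree" figure, using GridSearchCV over ccp_alpha and max_depth. The grid values, split proportions, and random seed are illustrative assumptions rather than the notebook's exact settings; X and y come from the preparation sketches above:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# X, y: the feature matrix and species labels from the preparation sketches
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

param_grid = {"max_depth": [3, 4, 5, None], "ccp_alpha": [0.0, 0.005, 0.01, 0.02]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Test accuracy:", accuracy_score(y_test, grid.best_estimator_.predict(X_test)))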

Regression Performance

  • R² Score: 0.769 (76.9% variance explained)
  • MAE: 294g (mean absolute error)
  • RMSE: 359g (root mean squared error)
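
These metrics can be reproduced with scikit-learn roughly as follows (a sketch, assuming X_body holds the non-invasive measurement columns and y_mass the body mass in grams; the split settings are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# X_body: non-invasive measurement columns, y_mass: body mass in grams
# (both are assumptions about the notebook's exact column choices)
Xtr, Xte, ytr, yte = train_test_split(X_body, y_mass, test_size=0.2, random_state=42)
pred = LinearRegression().fit(Xtr, ytr).predict(Xte)
print("R2  :", r2_score(yte, pred))
print("MAE :", mean_absolute_error(yte, pred))
print("RMSE:", np.sqrt(mean_squared_error(yte, pred)))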

Clustering Results

  • K-Means (k=3): Silhouette score 0.358, 92.6% WildRambler purity
  • DBSCAN: Silhouette score 0.401 (highest), 2 clusters with 3 noise points
  • PCA: 77.9% variance explained (PC1: 57.5%, PC2: 20.4%)
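
A brief sketch of how these clustering and PCA figures are typically obtained; the DBSCAN eps and min_samples values are illustrative, not the notebook's tuned parameters:

from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# X and numeric_cols come from the preparation sketch; clustering uses only the
# standardised numeric measurements
X_num = X[numeric_cols]

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_num)
print("K-Means silhouette:", silhouette_score(X_num, km_labels))

db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_num)  # eps/min_samples illustrative

pca = PCA(n_components=2).fit(X_num)
print("PC1, PC2 variance explained:", pca.explained_variance_ratio_)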

Feature Importance

  • Tail length and nose length dominate across all models
  • Engineered features contribute 20.2% of total importance in Random Forest
  • Feature engineering improves accuracy by +1.4 percentage points
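
Importance scores like those above can be read directly off a fitted Random Forest, for example (reusing the training split from the tuning sketch above):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train: the training split from the tuning sketch above
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))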

Key Insights

  1. Species are biologically distinct: Clustering recovers species groups without labels (92.6% purity)
  2. Tail length is the strongest predictor: Consistent across all models (coefficient +0.87 for WildRambler)
  3. Species-island associations: WildRambler (Skye), BogSniffler (Shetland), Macduff (generalist)
  4. Non-invasive monitoring: Body mass predictable from measurements (R² = 0.769)
  5. Linear separability: Logistic Regression achieves 89.9% accuracy, confirming strong morphological separation
  6. Model robustness: Consistent high performance across diverse algorithms validates morphological distinctiveness

Technical Highlights

  • Advanced Techniques: Random Forest ensemble, hyperparameter tuning (GridSearchCV), cost-complexity pruning (ccp_alpha)
  • Comparative Analysis: DBSCAN vs K-Means, comprehensive model comparison with trade-offs
  • Regression Diagnostics: Formal statistical tests (Shapiro-Wilk, VIF, Breusch-Pagan) with critical discussion (see the diagnostics sketch after this list)
  • Cross-Stage Linking: Section 7.1 synthesizes findings across all stages, building compelling narrative
  • Biological Interpretation: Comprehensive coefficient interpretation with high-level conclusions about data predictability
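
A sketch of the regression diagnostics mentioned above, using scipy and statsmodels. X_body and y_mass are the regression design from the regression sketch; this is an illustrative outline, not the notebook's exact code:

import pandas as pd
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X_body, y_mass: the regression design from the regression sketch above
X_const = sm.add_constant(X_body)
ols = sm.OLS(y_mass, X_const).fit()

print("Shapiro-Wilk (residual normality):", shapiro(ols.resid))
print("Breusch-Pagan (heteroscedasticity):", het_breuschpagan(ols.resid, X_const))
vif = pd.Series([variance_inflation_factor(X_const.values, i)
                 for i in range(X_const.shape[1])], index=X_const.columns)
print("VIF per predictor:")
print(vif)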

Requirements

  • Python 3.7+
  • pandas >= 1.3.0
  • numpy >= 1.21.0
  • matplotlib >= 3.4.0
  • seaborn >= 0.11.0
  • scikit-learn >= 1.0.0
  • scipy >= 1.7.0
  • statsmodels >= 0.13.0
  • jupyter >= 1.0.0
