Comprehensive machine learning analysis of Scottish Haggis morphological data for a data visualization and analysis assignment. This project combines exploratory data analysis, unsupervised learning (K-Means and DBSCAN clustering), and supervised machine learning (Decision Tree, Random Forest, KNN, Logistic Regression, and Linear Regression) to understand species distributions and predict biological characteristics.
Files:
- `scottish_haggis_2025.csv`: Raw dataset (344 observations, 3 species, 3 islands, 2023-2025)
- `tanush_analysis.ipynb`: Complete Jupyter notebook with all analysis stages

The notebook covers the following main stages:
- Management Summary & Introduction: Executive summary, dataset description, research objectives
- Data Preparation & Quality Assessment: Missing value analysis, species-specific imputation, outlier detection (1.5×IQR method), feature engineering (sketched after this list)
- Exploratory Data Analysis (EDA): Univariate/bivariate analysis, species-island associations, correlation analysis, scaling/encoding justification
- Unsupervised Learning (K-Means Clustering): Elbow and silhouette methods for k selection, PCA visualization (77.9% variance explained), cluster-species comparison, comparative DBSCAN analysis
- Supervised Learning I (Decision Tree Classification): Tree visualization, feature importance, confusion matrix, hyperparameter tuning (GridSearchCV), cost-complexity pruning (ccp_alpha), Random Forest (ensemble method)
- Supervised Learning II (KNN & Logistic Regression): Optimal k selection (k=6), comprehensive coefficient interpretation with biological insights, comparative analysis
- Supervised Learning III (Linear Regression): Body mass prediction from non-invasive measurements, comprehensive diagnostic plots, formal statistical tests (Shapiro-Wilk, VIF, Breusch-Pagan)
- Cross-Stage Analysis & Conclusions: Methodological consolidation (Section 7.0), linking findings across stages (Section 7.1), conservation insights, model selection guidance
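As a rough illustration of the preparation stage flagged above, here is a minimal sketch; the column names `species` and `body_mass_g` are assumptions about the dataset, not confirmed by the notebook:

```python
# Minimal sketch of the preparation step; "species" and "body_mass_g" are
# assumed column names, not confirmed by the notebook.
import pandas as pd

df = pd.read_csv("scottish_haggis_2025.csv")

# Species-specific imputation: fill each numeric gap with that species' median
num_cols = df.select_dtypes("number").columns
df[num_cols] = df.groupby("species")[num_cols].transform(
    lambda col: col.fillna(col.median())
)

# 1.5×IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["body_mass_g"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["body_mass_g"] < q1 - 1.5 * iqr) | (df["body_mass_g"] > q3 + 1.5 * iqr)
print(f"{mask.sum()} body-mass values flagged as outliers")
```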
Methodological highlights:
- Data Preparation: Species-specific imputation with biological rationale, StandardScaler for distance-based methods, one-hot encoding, explicit outlier analysis methodology
- Feature Engineering: Three engineered features (tail_to_body_ratio, bmi, head_size_index) with documented impact across all stages
- Unsupervised Learning: Elbow and silhouette methods for k selection, PCA visualization (PC1: 57.5%, PC2: 20.4%), DBSCAN comparative analysis (see the sketch after this list)
- Supervised Learning: Decision Tree, Random Forest (ensemble), KNN, Logistic Regression with comprehensive coefficient interpretation; Linear Regression with full diagnostics and assumption validation
- Analysis: Cross-stage narrative synthesis, biological interpretation, critical justifications, comprehensive model comparison
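As referenced in the unsupervised-learning item above, the k-selection and PCA steps might look like the following sketch, reusing `df` and `num_cols` from the preparation sketch; the candidate range of k is illustrative:

```python
# Sketch of k selection and PCA, reusing df/num_cols from the sketch above.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Distance-based methods need standardized features
X = StandardScaler().fit_transform(df[num_cols])

# Elbow (inertia) and silhouette curves over candidate k
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))

# Two-component PCA for the cluster plots; explained_variance_ratio_ gives
# each component's share of total variance (PC1 + PC2 in the notebook: ~78%)
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
```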
To reproduce the analysis, create and activate a virtual environment:
```bash
python3 -m venv venv
```
On macOS/Linux:
```bash
source venv/bin/activate
```
On Windows:
```bash
venv\Scripts\activate
```
Then install the dependencies and launch the notebook:
```bash
pip install pandas numpy matplotlib seaborn scikit-learn jupyter scipy statsmodels
jupyter notebook tanush_analysis.ipynb
```
In Jupyter, run Kernel → Restart & Run All.
When finished, deactivate the environment:
```bash
deactivate
```
Classification accuracy by model:
- Random Forest: 91.3% accuracy
- KNN (k=6): 91.3% accuracy
- Tuned Decision Tree: 91.3% accuracy
- Logistic Regression: 89.9% accuracy
- Base Decision Tree: 88.4% accuracy
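The tuned figure above comes from a grid search over tree hyperparameters. A minimal sketch of that step, reusing the standardized matrix `X` from the earlier sketch and assuming the label column is named `species` (both hypothetical names):

```python
# Sketch of the tuning step behind the "Tuned Decision Tree" figure, reusing
# X from the clustering sketch; "species" is an assumed label column name.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, df["species"], test_size=0.2, stratify=df["species"], random_state=42
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 4, 5, None],
                "min_samples_split": [2, 5, 10],
                "ccp_alpha": [0.0, 0.005, 0.01]},  # cost-complexity pruning
    cv=5,
)
grid.fit(X_train, y_train)
print("tuned accuracy:", accuracy_score(y_test, grid.predict(X_test)))
```

GridSearchCV refits the best parameter combination on the whole training split by default (`refit=True`), so `grid.predict` already uses the winning tree.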
Linear Regression (body mass prediction):
- R² score: 0.769 (76.9% of variance explained)
- MAE (mean absolute error): 294 g
- RMSE (root mean squared error): 359 g
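For reference, the three reported metrics are conventionally computed as below; the predictor and target column names are assumptions, and the in-sample fit is purely illustrative:

```python
# How the three reported metrics are computed; predictor/target column names
# are assumptions, and the in-sample fit is purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

predictors = ["tail_length_mm", "nose_length_mm"]      # hypothetical names
reg = LinearRegression().fit(df[predictors], df["body_mass_g"])
pred = reg.predict(df[predictors])

print("R²  :", r2_score(df["body_mass_g"], pred))       # variance explained
print("MAE :", mean_absolute_error(df["body_mass_g"], pred))           # grams
print("RMSE:", np.sqrt(mean_squared_error(df["body_mass_g"], pred)))   # grams
```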
Clustering:
- K-Means (k=3): Silhouette score 0.358, 92.6% WildRambler purity
- DBSCAN: Silhouette score 0.401 (highest), 2 clusters with 3 noise points
- PCA: 77.9% variance explained (PC1: 57.5%, PC2: 20.4%)
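A sketch of how the two clusterings might be scored side by side on the standardized matrix `X`; the DBSCAN eps/min_samples values are placeholders, not the notebook's tuned choices:

```python
# Side-by-side clustering scores on the standardized matrix X; the DBSCAN
# eps/min_samples values are placeholders, not the notebook's tuned choices.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)

core = db_labels != -1  # DBSCAN marks noise as -1; score non-noise points only
print("k-means silhouette:", round(silhouette_score(X, km_labels), 3))
print("dbscan silhouette :", round(silhouette_score(X[core], db_labels[core]), 3))
print("noise points      :", int((~core).sum()))
```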
Feature importance:
- Tail length and nose length dominate across all models
- Engineered features contribute 20.2% of total importance in Random Forest
- Feature engineering improves accuracy by +1.4 percentage points
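The engineered-feature share could be read off a fitted forest as below, reusing `X_train`/`y_train` from the tuning sketch and assuming the three engineered columns are among `num_cols` in the prepared frame:

```python
# Reading the engineered-feature share off a fitted forest; reuses
# X_train/y_train from the tuning sketch and assumes the engineered columns
# are among num_cols in the prepared frame.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
imp = pd.Series(rf.feature_importances_, index=num_cols).sort_values(ascending=False)

engineered = ["tail_to_body_ratio", "bmi", "head_size_index"]
print(imp.head())                                   # top individual features
print("engineered share:", imp[engineered].sum())   # joint contribution
```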
Key findings:
- Species are biologically distinct: Clustering recovers species groups without labels (92.6% purity)
- Tail length is the strongest predictor: Consistent across all models (logistic regression coefficient +0.87 for the WildRambler class)
- Species-island associations: WildRambler (Skye), BogSniffler (Shetland), Macduff (generalist)
- Non-invasive monitoring: Body mass predictable from measurements (R² = 0.769)
- Linear separability: Logistic Regression achieves 89.9% accuracy, confirming strong morphological separation
- Model robustness: Consistent high performance across diverse algorithms validates morphological distinctiveness
Analysis strengths:
- Advanced Techniques: Random Forest ensemble, hyperparameter tuning (GridSearchCV), cost-complexity pruning (ccp_alpha)
- Comparative Analysis: DBSCAN vs K-Means, comprehensive model comparison with trade-offs
- Regression Diagnostics: Formal statistical tests (Shapiro-Wilk, VIF, Breusch-Pagan) with critical discussion (sketched after this list)
- Cross-Stage Linking: Section 7.1 synthesizes findings across all stages, building compelling narrative
- Biological Interpretation: Comprehensive coefficient interpretation with high-level conclusions about data predictability
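For concreteness, the three formal diagnostics named above map onto standard scipy/statsmodels calls; the predictor and target column names are the same assumptions as in the metrics sketch:

```python
# The three formal diagnostics as standard scipy/statsmodels calls; predictor
# and target column names are the same assumptions as in the metrics sketch.
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

Xd = sm.add_constant(df[["tail_length_mm", "nose_length_mm"]])
ols = sm.OLS(df["body_mass_g"], Xd).fit()

print("Shapiro-Wilk :", shapiro(ols.resid))               # residual normality
print("Breusch-Pagan:", het_breuschpagan(ols.resid, Xd))  # homoscedasticity
for i, col in enumerate(Xd.columns):
    if col != "const":                                    # VIF per predictor
        print(col, ":", variance_inflation_factor(Xd.values, i))
```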
Requirements:
- Python 3.7+
- pandas >= 1.3.0
- numpy >= 1.21.0
- matplotlib >= 3.4.0
- seaborn >= 0.11.0
- scikit-learn >= 1.0.0
- scipy >= 1.7.0
- statsmodels >= 0.13.0
- jupyter >= 1.0.0