Comprehensive machine learning analysis of Scottish Haggis morphological data for a data visualization and analysis assignment. This project combines exploratory data analysis, unsupervised learning (K-Means and DBSCAN clustering), and supervised machine learning (Decision Tree, Random Forest, KNN, Logistic Regression, and Linear Regression) to understand species distributions and predict biological characteristics.
Files:
- `scottish_haggis_2025.csv`: Raw dataset (344 observations, 3 species, 3 islands, 2023-2025)
- `tanush_analysis.ipynb`: Complete Jupyter notebook with all analysis stages

The notebook covers the following main stages:
- Management Summary & Introduction: Executive summary, dataset description, research objectives
- Data Preparation & Quality Assessment: Missing value analysis, species-specific imputation, outlier detection (1.5×IQR method), feature engineering (sketched after this list)
- Exploratory Data Analysis (EDA): Univariate/bivariate analysis, species-island associations, correlation analysis, scaling/encoding justification
- Unsupervised Learning (K-Means Clustering): Elbow and silhouette methods for k selection, PCA visualization (77.9% variance explained), cluster-species comparison, comparative DBSCAN analysis
- Supervised Learning I (Decision Tree Classification): Tree visualization, feature importance, confusion matrix, hyperparameter tuning (GridSearchCV), cost-complexity pruning (ccp_alpha), Random Forest (ensemble method)
- Supervised Learning II (KNN & Logistic Regression): Optimal k selection (k=6), comprehensive coefficient interpretation with biological insights, comparative analysis
- Supervised Learning III (Linear Regression): Body mass prediction from non-invasive measurements, comprehensive diagnostic plots, formal statistical tests (Shapiro-Wilk, VIF, Breusch-Pagan)
- Cross-Stage Analysis & Conclusions: Methodological consolidation (Section 7.0), linking findings across stages (Section 7.1), conservation insights, model selection guidance
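As a rough illustration of the preparation stage flagged above, here is a minimal sketch; the column names `species` and `body_mass_g` are assumptions about the dataset, not confirmed by the notebook:

```python
# Minimal sketch of the preparation step; "species" and "body_mass_g" are
# assumed column names, not confirmed by the notebook.
import pandas as pd

df = pd.read_csv("scottish_haggis_2025.csv")

# Species-specific imputation: fill each numeric gap with that species' median
num_cols = df.select_dtypes("number").columns
df[num_cols] = df.groupby("species")[num_cols].transform(
    lambda col: col.fillna(col.median())
)

# 1.5×IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["body_mass_g"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["body_mass_g"] < q1 - 1.5 * iqr) | (df["body_mass_g"] > q3 + 1.5 * iqr)
print(f"{mask.sum()} body-mass values flagged as outliers")
```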
Methodological highlights:
- Data Preparation: Species-specific imputation with biological rationale, StandardScaler for distance-based methods, one-hot encoding, explicit outlier analysis methodology
- Feature Engineering: Three engineered features (tail_to_body_ratio, bmi, head_size_index) with documented impact across all stages
- Unsupervised Learning: Elbow and silhouette methods for k selection, PCA visualization (PC1: 57.5%, PC2: 20.4%), DBSCAN comparative analysis (see the sketch after this list)
- Supervised Learning: Decision Tree, Random Forest (ensemble), KNN, Logistic Regression with comprehensive coefficient interpretation; Linear Regression with full diagnostics and assumption validation
- Analysis: Cross-stage narrative synthesis, biological interpretation, critical justifications, comprehensive model comparison
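As referenced in the unsupervised-learning item above, the k-selection and PCA steps might look like the following sketch, reusing `df` and `num_cols` from the preparation sketch; the candidate range of k is illustrative:

```python
# Sketch of k selection and PCA, reusing df/num_cols from the sketch above.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Distance-based methods need standardized features
X = StandardScaler().fit_transform(df[num_cols])

# Elbow (inertia) and silhouette curves over candidate k
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))

# Two-component PCA for the cluster plots; explained_variance_ratio_ gives
# each component's share of total variance (PC1 + PC2 in the notebook: ~78%)
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
```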
To reproduce the analysis, create and activate a virtual environment:
```bash
python3 -m venv venv
```
On macOS/Linux:
```bash
source venv/bin/activate
```
On Windows:
```bash
venv\Scripts\activate
```
Then install the dependencies and launch the notebook:
```bash
pip install pandas numpy matplotlib seaborn scikit-learn jupyter scipy statsmodels
jupyter notebook tanush_analysis.ipynb
```
In Jupyter, run Kernel → Restart & Run All.
When finished, deactivate the environment:
```bash
deactivate
```
Classification accuracy by model:
- Random Forest: 91.3% accuracy
- KNN (k=6): 91.3% accuracy
- Tuned Decision Tree: 91.3% accuracy
- Logistic Regression: 89.9% accuracy
- Base Decision Tree: 88.4% accuracy
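The tuned figure above comes from a grid search over tree hyperparameters. A minimal sketch of that step, reusing the standardized matrix `X` from the earlier sketch and assuming the label column is named `species` (both hypothetical names):

```python
# Sketch of the tuning step behind the "Tuned Decision Tree" figure, reusing
# X from the clustering sketch; "species" is an assumed label column name.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, df["species"], test_size=0.2, stratify=df["species"], random_state=42
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 4, 5, None],
                "min_samples_split": [2, 5, 10],
                "ccp_alpha": [0.0, 0.005, 0.01]},  # cost-complexity pruning
    cv=5,
)
grid.fit(X_train, y_train)
print("tuned accuracy:", accuracy_score(y_test, grid.predict(X_test)))
```

GridSearchCV refits the best parameter combination on the whole training split by default (`refit=True`), so `grid.predict` already uses the winning tree.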
Linear Regression (body mass prediction):
- R² score: 0.769 (76.9% of variance explained)
- MAE (mean absolute error): 294 g
- RMSE (root mean squared error): 359 g
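For reference, the three reported metrics are conventionally computed as below; the predictor and target column names are assumptions, and the in-sample fit is purely illustrative:

```python
# How the three reported metrics are computed; predictor/target column names
# are assumptions, and the in-sample fit is purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

predictors = ["tail_length_mm", "nose_length_mm"]      # hypothetical names
reg = LinearRegression().fit(df[predictors], df["body_mass_g"])
pred = reg.predict(df[predictors])

print("R²  :", r2_score(df["body_mass_g"], pred))       # variance explained
print("MAE :", mean_absolute_error(df["body_mass_g"], pred))           # grams
print("RMSE:", np.sqrt(mean_squared_error(df["body_mass_g"], pred)))   # grams
```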
Clustering:
- K-Means (k=3): Silhouette score 0.358, 92.6% WildRambler purity
- DBSCAN: Silhouette score 0.401 (highest), 2 clusters with 3 noise points
- PCA: 77.9% variance explained (PC1: 57.5%, PC2: 20.4%)
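A sketch of how the two clusterings might be scored side by side on the standardized matrix `X`; the DBSCAN eps/min_samples values are placeholders, not the notebook's tuned choices:

```python
# Side-by-side clustering scores on the standardized matrix X; the DBSCAN
# eps/min_samples values are placeholders, not the notebook's tuned choices.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)

core = db_labels != -1  # DBSCAN marks noise as -1; score non-noise points only
print("k-means silhouette:", round(silhouette_score(X, km_labels), 3))
print("dbscan silhouette :", round(silhouette_score(X[core], db_labels[core]), 3))
print("noise points      :", int((~core).sum()))
```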
Feature importance:
- Tail length and nose length dominate across all models
- Engineered features contribute 20.2% of total importance in Random Forest
- Feature engineering improves accuracy by +1.4 percentage points
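The engineered-feature share could be read off a fitted forest as below, reusing `X_train`/`y_train` from the tuning sketch and assuming the three engineered columns are among `num_cols` in the prepared frame:

```python
# Reading the engineered-feature share off a fitted forest; reuses
# X_train/y_train from the tuning sketch and assumes the engineered columns
# are among num_cols in the prepared frame.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
imp = pd.Series(rf.feature_importances_, index=num_cols).sort_values(ascending=False)

engineered = ["tail_to_body_ratio", "bmi", "head_size_index"]
print(imp.head())                                   # top individual features
print("engineered share:", imp[engineered].sum())   # joint contribution
```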
Key findings:
- Species are biologically distinct: Clustering recovers species groups without labels (92.6% purity)
- Tail length is the strongest predictor: Consistent across all models (logistic regression coefficient +0.87 for the WildRambler class)
- Species-island associations: WildRambler (Skye), BogSniffler (Shetland), Macduff (generalist)
- Non-invasive monitoring: Body mass predictable from measurements (R² = 0.769)
- Linear separability: Logistic Regression achieves 89.9% accuracy, confirming strong morphological separation
- Model robustness: Consistent high performance across diverse algorithms validates morphological distinctiveness
Analysis strengths:
- Advanced Techniques: Random Forest ensemble, hyperparameter tuning (GridSearchCV), cost-complexity pruning (ccp_alpha)
- Comparative Analysis: DBSCAN vs K-Means, comprehensive model comparison with trade-offs
- Regression Diagnostics: Formal statistical tests (Shapiro-Wilk, VIF, Breusch-Pagan) with critical discussion (sketched after this list)
- Cross-Stage Linking: Section 7.1 synthesizes findings across all stages, building compelling narrative
- Biological Interpretation: Comprehensive coefficient interpretation with high-level conclusions about data predictability
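For concreteness, the three formal diagnostics named above map onto standard scipy/statsmodels calls; the predictor and target column names are the same assumptions as in the metrics sketch:

```python
# The three formal diagnostics as standard scipy/statsmodels calls; predictor
# and target column names are the same assumptions as in the metrics sketch.
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

Xd = sm.add_constant(df[["tail_length_mm", "nose_length_mm"]])
ols = sm.OLS(df["body_mass_g"], Xd).fit()

print("Shapiro-Wilk :", shapiro(ols.resid))               # residual normality
print("Breusch-Pagan:", het_breuschpagan(ols.resid, Xd))  # homoscedasticity
for i, col in enumerate(Xd.columns):
    if col != "const":                                    # VIF per predictor
        print(col, ":", variance_inflation_factor(Xd.values, i))
```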
Requirements:
- Python 3.7+
- pandas >= 1.3.0
- numpy >= 1.21.0
- matplotlib >= 3.4.0
- seaborn >= 0.11.0
- scikit-learn >= 1.0.0
- scipy >= 1.7.0
- statsmodels >= 0.13.0
- jupyter >= 1.0.0