This repository applies supervised machine learning techniques to classify individuals into insurance cost tiers (Low, Medium, High) using the Medical Cost Personal Dataset from Kaggle.
The task is to predict which tier of insurance cost a person will fall into — Low, Medium, or High — based on demographic and lifestyle features. This transforms the regression-based charges prediction into a multiclass classification task for better interpretability and segmentation.
We reframed the dataset using cost tier binning and trained models including Logistic Regression, Random Forest, and K-Nearest Neighbors (KNN). Each model was trained using pipelines with preprocessing, followed by GridSearchCV-based hyperparameter tuning.
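The workflow described above can be sketched as follows. The synthetic frame stands in for `insurance.csv` so the example is self-contained, and the grid values are illustrative, not the notebook's exact search space:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for insurance.csv so the sketch runs end-to-end.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n),
    "sex": rng.choice(["male", "female"], n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(["southwest", "southeast", "northwest", "northeast"], n),
})
# Cost tiers via equal-frequency binning of a charges-like score.
score = 250 * df["age"] + 300 * df["bmi"] + 20000 * (df["smoker"] == "yes")
df["cost_tier"] = pd.qcut(score, q=3, labels=["Low", "Medium", "High"])

num_cols, cat_cols = ["age", "bmi", "children"], ["sex", "smoker", "region"]

# Pipeline: preprocessing (scaling + one-hot encoding) feeding a classifier.
pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
pipe = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=42))])

# GridSearchCV-based hyperparameter tuning over the whole pipeline.
search = GridSearchCV(
    pipe,
    {"clf__n_estimators": [100, 200], "clf__max_depth": [None, 10]},
    cv=5,
    scoring="f1_macro",
)
search.fit(df[num_cols + cat_cols], df["cost_tier"])
```

Tuning the pipeline as a whole (rather than the bare classifier) keeps the preprocessing inside each cross-validation fold, which avoids leaking validation statistics into the scaler and encoder.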
Best Model: Tuned Random Forest
- Validation Accuracy: 91.04%
- Macro F1 Score: 0.91
This model outperformed all others across accuracy, precision, and F1 score metrics.
- Type: Tabular CSV (demographic and health data)
- Input: Age, sex, smoker status, region, BMI, children
- Output: Insurance cost tier (`Low`, `Medium`, `High`)
- Size: 1,338 records
- Split: 802 train / 268 validation / 268 test (stratified)
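A 60/20/20 stratified split like the one above can be produced with two chained `train_test_split` calls (a sketch; the random seed is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_three_way_split(X, y, seed=42):
    """Split into ~60/20/20 train/validation/test sets, stratified on y."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

# With 1,338 stratified records this yields the 802/268/268 split above.
X = np.arange(1338).reshape(-1, 1)
y = np.repeat(["Low", "Medium", "High"], 446)
X_train, X_val, X_test, y_train, y_val, y_test = stratified_three_way_split(X, y)
```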
- One-hot encoding of categorical features
- Standard scaling of numerical features
- Quantile binning of `charges` into three cost tiers
- Winsorization to control outliers in BMI
- Histograms: Distribution of age, BMI, and children by cost tier
- Count plots: Smoker, region, and sex distributions by cost tier
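The quantile binning and winsorization steps listed above can be sketched as follows (the 1st/99th-percentile clip limits and the synthetic demo data are assumptions; the notebook may use different bounds):

```python
import numpy as np
import pandas as pd

def winsorize(series, lower=0.01, upper=0.99):
    """Clip values to the given quantiles to tame outliers."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lo, hi)

# Demo on synthetic values shaped like the real columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "bmi": rng.normal(30, 6, 500),        # roughly BMI-like
    "charges": rng.lognormal(9, 1, 500),  # right-skewed, charges-like
})
df["bmi"] = winsorize(df["bmi"])
# Equal-frequency (quantile) binning into three cost tiers.
df["cost_tier"] = pd.qcut(df["charges"], q=3, labels=["Low", "Medium", "High"])
```

`pd.qcut` guarantees roughly equal class sizes, which keeps the three-tier classification problem balanced by construction.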
We visualized all numerical features using grouped histograms overlaid by `cost_tier`. Each subplot shows how the values of a given feature are distributed across the three insurance cost tiers (Low, Medium, High).
Observations:
- `bmi` shows a noticeable rightward shift for the High tier. Individuals in this group tend to have higher BMI values, which may correlate with higher medical risk.
- `age` also trends older in the High tier, suggesting age plays a significant role in determining insurance cost.
- `children` has a more balanced distribution and shows a weaker relationship with the target, but the Medium tier has a wider spread.
This helps identify which features may carry more weight in prediction.
We visualized the categorical features using count plots, grouped by `cost_tier` via the hue parameter.
Observations:
- `smoker` stands out as the most predictive categorical feature. Nearly all smokers fall into the High cost tier, showing strong influence.
- `sex` and `region` show relatively balanced distributions across cost tiers, suggesting they are likely less influential features on their own.
- These plots confirm the importance of visual EDA in identifying which features are likely to improve model performance.
- Histogram and bar plots revealed key relationships between features and the target.
- Strong signals: `smoker`, `bmi`, and `age`
- Weaker signals: `region`, `children`, and possibly `sex`
- Input: Demographic and lifestyle features
- Output: Multiclass label (
Low,Medium,High) - Task: Supervised classification
- Logistic Regression
- Random Forest
- K-Nearest Neighbors (KNN)
### Logistic Regression – Performance Summary
- The Logistic Regression model achieved 83.96% validation accuracy and a macro F1 score of 0.83, which is excellent for a 3-class classification task.
- Low and Medium tiers were predicted very well: Low reached F1 = 0.90, and Medium reached F1 = 0.84 with extremely high recall (0.97).
- The High tier had the highest precision (0.98) but lower recall (0.62), meaning the model is cautious in assigning this label.
- These results suggest the model effectively captures cost patterns based on features like `smoker`, `bmi`, and `age`.
- The confusion matrix reveals that most errors involve the High tier, which may require additional feature engineering or model tuning in future iterations.
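The per-tier precision/recall/F1 figures and confusion matrices quoted in these summaries are standard scikit-learn reports; a minimal sketch, with `model`, `X_val`, and `y_val` as placeholders:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

def evaluate(model, X_val, y_val, labels=("Low", "Medium", "High")):
    """Print per-tier precision/recall/F1 and the confusion matrix,
    then return the macro F1 score."""
    y_pred = model.predict(X_val)
    print(classification_report(y_val, y_pred, labels=list(labels)))
    # Rows are true tiers, columns are predicted tiers, in the given order.
    print(confusion_matrix(y_val, y_pred, labels=list(labels)))
    return f1_score(y_val, y_pred, average="macro")
```

Passing an explicit `labels` order keeps the confusion-matrix rows in Low/Medium/High order rather than alphabetical.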
### Random Forest – Performance Summary
The Random Forest model achieved 90% validation accuracy and a macro F1 score of 0.90, making it the strongest performer among all models tested.
- High Tier: F1 = 0.91, with high precision (0.94) and recall (0.89)
- Low Tier: F1 = 0.90, with strong recall (0.93)
- Medium Tier: F1 = 0.89, with balanced precision and recall
Random Forest performs exceptionally well due to its ensemble learning approach, which enhances stability and generalization across both numerical and categorical features.
The confusion matrix reveals that most misclassifications occur between Medium and High tiers, which is expected due to feature overlap.
This model is a strong candidate for deployment given its robust performance, interpretability, and high accuracy across all classes.

### K-Nearest Neighbors (KNN) – Performance Summary
The KNN model achieved 82% validation accuracy and a macro F1 score of 0.82, performing comparably to Logistic Regression.
- High Tier: F1 = 0.80, with moderate recall (0.74), showing it occasionally confuses High with Medium or Low.
- Low Tier: F1 = 0.85, with strong recall (0.88), making it the most confidently predicted class.
- Medium Tier: F1 = 0.80, with decent performance, though some misclassification still occurs.
KNN is easy to implement and interpret but can be sensitive to feature scaling and class overlap.
The confusion matrix suggests Medium and High tiers are frequently confused — likely due to shared characteristics in features like age, BMI, or smoker.
While KNN may not be the top performer, it provides a solid, interpretable benchmark and can still contribute to ensemble modeling or hybrid pipelines.
This section compares the performance of three baseline models: Logistic Regression, Random Forest, and K-Nearest Neighbors (KNN) on the task of classifying insurance cost tiers.
| Model | Accuracy | Macro F1 Score |
|---|---|---|
| Logistic Regression | 83.96% | 0.83 |
| Random Forest | 90.00% | 0.90 |
| KNN | 82.00% | 0.82 |
- Random Forest is the best-performing model overall with the highest accuracy and F1 score, showing strong generalization across all tiers.
- Logistic Regression performs well, especially on the Low and Medium tiers, with a very high precision for High tier but slightly lower recall.
- KNN, while easy to implement and interpret, shows slightly lower performance, struggling more with class overlap—especially between Medium and other tiers.
These results provide a strong starting point. Further performance gains may be achieved through hyperparameter tuning, feature engineering, or ensemble methods.
- Logistic Regression: `C`, `penalty`, `solver`
- Random Forest: `n_estimators`, `max_depth`, `min_samples_split`, `max_features`
- KNN: `n_neighbors`, `weights`, `metric`
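These search spaces can be written as `GridSearchCV` parameter grids; the value ranges below are illustrative assumptions, not the notebook's exact grids:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grids over the hyperparameters listed above.
param_grids = {
    "logreg": (LogisticRegression(max_iter=1000), {
        "C": [0.01, 0.1, 1, 10],
        "penalty": ["l2"],
        "solver": ["lbfgs", "liblinear"],
    }),
    "rf": (RandomForestClassifier(random_state=42), {
        "n_estimators": [100, 200],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5],
        "max_features": ["sqrt", "log2"],
    }),
    "knn": (KNeighborsClassifier(), {
        "n_neighbors": [3, 5, 9],
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan"],
    }),
}

# One 5-fold search per model, scored on macro F1 as in the report.
searches = {
    name: GridSearchCV(est, grid, cv=5, scoring="f1_macro")
    for name, (est, grid) in param_grids.items()
}
```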
- Software: Python 3, scikit-learn, pandas, matplotlib, seaborn
- Environment: Jupyter Notebook (Anaconda/Kaggle)
- Training Method: 5-fold GridSearchCV for hyperparameter optimization
- Duration: Each model trained in under 10 minutes
| Model | Accuracy | Macro F1 Score |
|---|---|---|
| Logistic Regression | 0.87 | 0.87 |
| Random Forest (Tuned) | 0.91 | 0.91 |
| KNN (Tuned) | 0.85 | 0.85 |
For the tuned Logistic Regression, precision on the High tier was exceptionally high (1.00), but its recall was more modest (0.72), indicating the model is cautious about labeling someone as "High" cost.
This tuned model used L2 regularization with a smaller `C` value (`C = 0.1`, i.e. stronger regularization) to generalize better than the untuned version.
The Random Forest model was tuned with:
- `n_estimators = 200`
- `max_features = 'sqrt'`
- `min_samples_split = 5`
It shows excellent class balance and minimal misclassification, especially between overlapping tiers like Medium and High.
The tuned KNN model achieved 84.70% validation accuracy and a macro F1 score of 0.85.
- High Tier: F1 = 0.85
- Low Tier: F1 = 0.87
- Medium Tier: F1 = 0.82
Best Parameters:
- `n_neighbors = 9`
- `weights = 'distance'`
- `metric = 'euclidean'`
While performance improved over the baseline, the Medium tier still shows overlap. KNN remains useful for fast and interpretable predictions.
- Random Forest showed consistently strong results across all classes.
- The Tuned Random Forest model offers the best balance between performance and robustness.
- Logistic Regression had high precision but slightly lower recall on the "High" tier.
- KNN was interpretable and lightweight, but sensitive to scaling and neighborhood size.
- Add SHAP or LIME for prediction explanations.
- Try gradient boosting algorithms like XGBoost or LightGBM.
- Integrate additional socioeconomic features like income or location.
- Deploy model using Flask or FastAPI.
### Step 1: Clone the Repository
```shell
git clone https://github.com/yourusername/insurance-cost-tier-classification.git
cd insurance-cost-tier-classification
```

### Step 2: Install Dependencies
```shell
pip install -r requirements.txt
```
### Step 3: Download Dataset
Download `insurance.csv` and place it in the project root.

### Step 4: Run Notebook
Open `Cost Tiers.ipynb` and run all cells to:
- Preprocess data
- Train models
- Visualize results
- View final evaluation
---
```
.
├── insurance.csv       # Raw dataset from Kaggle
├── Cost Tiers.ipynb    # Main notebook with full project pipeline
├── README.md           # Project documentation
└── requirements.txt    # Python dependencies
```











