Employee Retention Analyzer is a machine-learning (ML) project that uses HR analytics data to predict the likelihood of employee attrition. The model achieves this through a classification task, using the 'left' attribute from the dataset as the target and employee characteristics as features. The project provides HR teams with actionable insights for improving retention strategies.
- Dataset Content
- Business Requirements
- Hypothesis
- Mapping Business Requirements to Data Visualisation and ML Tasks
- ML Business Case
- Epics and User Stories
- Dashboard Design
- Technologies Used
- Testing
- Issues
- Unfixed Bugs
- Deployment
- Credits
- Acknowledgements
## Dataset Content

The dataset is sourced from Kaggle and has been adjusted for this project. Each row represents an employee and each column contains employee attributes. The dataset includes information about:
- Employee satisfaction levels
- Performance evaluation scores
- Number of projects
- Average monthly hours
- Time at company
- Work accidents
- Promotions
- Department and salary level
- Whether they left the company
| Attribute | Information | Type/Units |
|---|---|---|
| satisfaction_level | Employee's satisfaction rating | Float (0-1) |
| last_evaluation | Last performance evaluation score | Float (0-1) |
| number_project | Number of projects assigned | Integer |
| average_monthly_hours | Average monthly work hours | Integer |
| time_spend_company | Years at the company | Integer |
| Work_accident | Whether they had a work accident | Binary (0/1) |
| promotion_last_5years | Whether promoted in last 5 years | Binary (0/1) |
| Departments | Department employee works in | Categorical |
| salary | Salary level | Categorical (low/medium/high) |
| left | Whether the employee left | Binary (0/1) |
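As a quick illustration of the schema above, the snippet below builds a tiny in-memory sample with the same columns and runs the kind of structural checks (shape, class balance) that open the analysis. The values are illustrative only, not rows from the dataset:

```python
import pandas as pd

# Tiny in-memory sample mirroring the dataset schema (illustrative values only)
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "last_evaluation": [0.53, 0.86, 0.88],
    "number_project": [2, 5, 7],
    "average_monthly_hours": [157, 262, 272],
    "time_spend_company": [3, 6, 4],
    "Work_accident": [0, 0, 0],
    "promotion_last_5years": [0, 0, 0],
    "Departments": ["sales", "sales", "hr"],
    "salary": ["low", "medium", "low"],
    "left": [1, 0, 1],
})

# Quick structural checks: overall shape and class balance of the target
print(df.shape)
print(df["left"].value_counts())
```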
## Business Requirements

Employee turnover is a significant challenge for organizations, with replacement costs estimated at 1.5-2x the departing employee's salary. Early identification of attrition risks can enable proactive retention strategies.
Business Requirement 1 - The client wants to identify which factors contribute most significantly to employee turnover, focusing on key predictors of departure.
Business Requirement 2 - The client needs a tool to predict whether current employees are at risk of leaving based on their characteristics and behavior patterns.
Business Requirement 3 - The client needs actionable insights and clear intervention triggers for HR to develop targeted retention strategies.
## Hypothesis

Hypothesis 1:
- We suspect that satisfaction level and workload (project count/monthly hours) are the strongest predictors of turnover.
- Validation: A correlation analysis that shows relationship strength between these features and the target 'left'.
Hypothesis 2:
- We suspect that employees with high workload and low satisfaction have the highest departure risk.
- Validation: Analysis of feature importance and interaction effects through ML model evaluation.
Hypothesis 3:
- We suspect that employees in years 1-3 with low salaries have higher attrition rates.
- Validation: Statistical analysis and visualization of departure rates across tenure and salary bands.
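The Hypothesis 3 check can be sketched as a group-by of departure rates across tenure and salary bands. This is a minimal illustration on made-up rows, not the real analysis:

```python
import pandas as pd

# Illustrative sample; the real validation runs on the full HR dataset
df = pd.DataFrame({
    "time_spend_company": [2, 2, 3, 5, 6, 2, 3, 8],
    "salary": ["low", "low", "medium", "high", "high", "low", "low", "medium"],
    "left": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Bucket tenure into early (1-3 years) vs established (4+ years)
df["tenure_band"] = pd.cut(df["time_spend_company"], bins=[0, 3, 100],
                           labels=["1-3 yrs", "4+ yrs"])

# Mean of the binary target per group = departure rate per tenure/salary band
rates = df.groupby(["tenure_band", "salary"], observed=True)["left"].mean()
print(rates)
```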
## Mapping Business Requirements to Data Visualisation and ML Tasks

Business Requirement 1: Data Visualization and Correlation Study
- We need to perform a correlation study to identify key attrition factors
- Pearson correlation for linear relationships in numerical variables
- Spearman correlation for monotonic relationships
- PPS (Predictive Power Score) analysis for categorical variables
- Feature importance analysis from ML model
- This will be done in the Data Visualization and Preparation Epic
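The Pearson and Spearman parts of the study can be sketched with pandas alone; the frame and values below are toy stand-ins for the real dataset, and the PPS step would use the ppscore library:

```python
import pandas as pd

# Toy numeric sample; the study runs on the full dataset
df = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.3, 0.2, 0.4, 0.85],
    "average_monthly_hours": [150, 160, 280, 300, 270, 155],
    "left": [0, 0, 1, 1, 1, 0],
})

# Pearson captures linear association, Spearman monotonic association
pearson = df.corr(method="pearson")["left"].drop("left")
spearman = df.corr(method="spearman")["left"].drop("left")
print(pearson.sort_values())
print(spearman.sort_values())
# The PPS heatmap for categorical predictors comes from ppscore.matrix(df)
```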
Business Requirement 2: Classification Model
- We need to predict binary outcome (stay/leave)
- Build supervised classification model
- Implement ML pipeline with preprocessing and prediction
- Optimize hyperparameters for best performance
- This will be executed in Model Training Epic
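The steps above can be sketched as a scikit-learn pipeline with preprocessing followed by a hyperparameter grid search. The synthetic data, chosen algorithm and parameter grid are illustrative placeholders, not the project's final configuration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; the real pipeline is fit on the HR dataset
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "satisfaction_level": rng.random(200),
    "average_monthly_hours": rng.integers(120, 310, 200),
    "salary": rng.choice(["low", "medium", "high"], 200),
})
y = (X["satisfaction_level"] < 0.5).astype(int)  # toy target

# Preprocessing: scale numeric features, one-hot encode categoricals
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["satisfaction_level", "average_monthly_hours"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["salary"]),
])
pipe = Pipeline([("prep", preprocess),
                 ("model", RandomForestClassifier(random_state=0))])

# Split, then tune hyperparameters with F1 as the scoring metric
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
search = GridSearchCV(pipe, {"model__n_estimators": [50, 100]},
                      scoring="f1", cv=3)
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```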
Business Requirement 3: Actionable Insights
- Identify critical thresholds for key metrics
- Develop clear intervention triggers
- Create visualization dashboard for HR
- This spans both Analysis and Dashboard Development Epics
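One possible shape for the intervention triggers is a small rule-based helper. The threshold values below are hypothetical placeholders; the real triggers should come from the correlation study and the model's feature-importance results:

```python
def retention_flags(satisfaction, monthly_hours, years_at_company, salary):
    """Return a list of hypothetical intervention triggers for one employee.

    All cut-offs are illustrative placeholders, not findings from the study.
    """
    flags = []
    if satisfaction < 0.4:
        flags.append("low satisfaction: schedule a 1:1 check-in")
    if monthly_hours > 250:
        flags.append("overwork: review project load")
    if years_at_company <= 3 and salary == "low":
        flags.append("early-tenure low salary: review compensation")
    return flags

print(retention_flags(0.3, 270, 2, "low"))
```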
## ML Business Case

- We want an ML model to predict whether an employee is likely to leave based on their current attributes and behavior patterns. The target variable 'left' is binary (0: stayed, 1: left).
- We will build a classification model, a supervised model with two-class, single-label output matching the target.
- The model success metrics are:
- At least 95% F1 score on both train and test sets
- High precision to minimize false alarms
- High recall to catch actual flight risks
- The model will be considered a failure if:
- The model fails to achieve 90% F1 score
- False positive rate exceeds 15% (too many false alarms)
- Features aren't interpretable for HR use
- The model output is defined as a flag indicating if an employee is likely to leave and the associated probability.
- The training data contains:
- 4,998 employee records with 9 attributes + target
- Mix of numerical and categorical features
- Data from past employee records including both retained and departed staff
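The success and failure criteria above map directly onto standard scikit-learn metrics. The predictions below are hypothetical, purely to show how F1 and the false positive rate would be computed:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical predictions vs. ground truth (1 = employee left)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

f1 = f1_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# False positive rate = FP / (FP + TN), checked against the 15% ceiling
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)
print(f"F1={f1:.2f} precision={precision:.2f} recall={recall:.2f} FPR={fpr:.2f}")
```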
## Epics and User Stories

The project was split into 5 Epics based on data analysis and ML tasks, with user stories to enable an agile development methodology.
- User Story - As a data analyst, I can load employee data from local directories to begin analysis.
- User Story - As a data analyst, I can examine data quality to determine necessary preprocessing steps.
- User Story - As a data scientist, I can analyze correlations between features and attrition (Business Requirement 1).
- User Story - As a data analyst, I can handle missing values and outliers to prepare data for modeling.
- User Story - As a data analyst, I can check class balance in the target variable.
- User Story - As a data scientist, I can engineer features to improve model performance.
- User Story - As a data scientist, I can prepare features for ML pipeline implementation.
- User Story - As a data scientist, I can split data appropriately for model training.
- User Story - As a data engineer, I can build ML pipeline with preprocessing steps.
- User Story - As a data engineer, I can select and optimize algorithms for prediction (Business Requirement 2).
- User Story - As a data scientist, I can tune hyperparameters for optimal performance.
- User Story - As a data scientist, I can validate model performance on test data.
- User Story - As a data scientist, I can analyze feature importance for insights (Business Requirement 1).
- User Story - As a user, I can view project overview and business requirements.
- User Story - As a user, I can see hypothesis validation results.
- User Story - As a user, I can input employee data for predictions (Business Requirement 2).
- User Story - As a technical user, I can examine correlation analysis (Business Requirement 1).
- User Story - As a technical user, I can review model performance metrics.
- User Story - As an HR user, I can get clear retention recommendations (Business Requirement 3).
- User Story - As a user, I can access the dashboard through a web interface.
- User Story - As a developer, I can deploy the project following documentation.
## Dashboard Design

- Section 1 - Summary
- Introduction to project goals
- Dataset description and source
- Link to readme
- Section 2 - Business Requirements
- Business context
- Specific requirements
- Expected outcomes
- Present the three project hypotheses
- Show validation results
- Visualize key findings
- Address Business Requirement 1
- Show dataset overview
- Present correlation analysis
- Display PPS heatmap
- Feature distribution analysis
- Key conclusions
- Address Business Requirement 2
- Input widgets for employee data
- Prediction interface
- Risk assessment display
- Performance metrics summary
- Pipeline description
- Feature importance analysis
- Train/test results documentation
## Technologies Used

The technologies used throughout the development are listed below:
- Pandas - Data manipulation and analysis
- Numpy - Numerical computing and array operations
- Matplotlib - Data visualization and plotting
- Seaborn - Statistical data visualization
- Scikit-learn - Machine learning algorithms and tools
- Feature-engine - Feature engineering and selection
- XGBoost - Gradient boosting framework
- Streamlit - Web application framework
- Plotly - Interactive visualizations
- ppscore - Predictive power score calculations
- Joblib - Pipeline persistence
- Git - Version control
- GitHub - Code repository and project management
- VS Code - IDE for development
- Heroku - Application deployment platform
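As an aside on pipeline persistence, a fitted pipeline can be saved with Joblib and reloaded by the dashboard. The toy pipeline, training values and file name below are illustrative:

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A toy two-step pipeline standing in for the trained attrition pipeline
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])
# Features: [satisfaction_level, average_monthly_hours]; target: left (0/1)
pipe.fit([[0.2, 150], [0.9, 300], [0.3, 160], [0.8, 280]], [1, 0, 1, 0])

# Persist for reuse in the dashboard (file name is hypothetical)
joblib.dump(pipe, "attrition_pipeline.pkl")
loaded = joblib.load("attrition_pipeline.pkl")
print(loaded.predict([[0.25, 155]]))
```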
## Testing

- Dashboard was manually tested against user stories
- Each feature verified for functionality and usability
As a non-technical user, I can view a project summary that describes the project, dataset and business requirements.
| Feature | Action | Expected Result | Actual Result |
|---|---|---|---|
| Project summary page | View landing page | Clear project overview displayed | Works as expected |
| Navigation | Click through sections | Smooth section transitions | Works as expected |
| Business requirements | View requirements section | Requirements clearly listed | Works as expected |
As a technical user, I can access the correlation analysis and model performance metrics.
| Feature | Action | Expected Result | Actual Result |
|---|---|---|---|
| Correlation page | Navigate to analysis | Display correlation heatmaps | Works as expected |
| Feature importance | View importance plots | Show feature rankings | Works as expected |
| Performance metrics | Check model metrics | Display accuracy scores | Works as expected |
As an HR user, I can input employee data and receive attrition predictions.
| Feature | Action | Expected Result | Actual Result |
|---|---|---|---|
| Prediction interface | Enter employee data | All inputs accept values | Works as expected |
| Run prediction | Click predict button | Show prediction result | Works as expected |
| Risk assessment | View prediction details | Display risk factors | Works as expected |
- All Python code validated against the PEP8 style guide
- Frontend validated for responsiveness
- Data pipeline tested for consistency
## Issues

- Initial deployment of the model failed with error: "'DecisionTreeClassifier' object has no attribute 'monotonic_cst'"
- This occurred due to a version mismatch between local development (scikit-learn 1.5.0) and Heroku's default scikit-learn version
- The solution was to update requirements.txt to specify:
scikit-learn>=1.5.0
## Unfixed Bugs

- No known bugs at time of deployment
- All identified issues have been resolved
## Deployment

- The App live link is: Employee Retention Analyzer
The project was deployed to Heroku using the following steps:
- Within your working directory, ensure there is a setup.sh file containing the following:
mkdir -p ~/.streamlit/
echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml
- Within your working directory, ensure there is a runtime.txt file containing a Heroku-20 stack supported version of Python.
python-3.10.12
- Within your working directory, ensure there is a Procfile file containing the following:
web: sh setup.sh && streamlit run app.py
- Ensure your requirements.txt file contains all the packages necessary to run the Streamlit dashboard.
- Update your .gitignore and .slugignore files with any files/directories that you do not want uploading to GitHub or are unnecessary for deployment.
- Log in to Heroku or create an account if you do not already have one.
- Click the New button on the dashboard and from the dropdown menu select "Create new app".
- Enter a suitable app name and select your region, then click the Create app button.
- Once the app has been created, navigate to the Deploy tab.
- At the Deploy tab, in the Deployment method section select GitHub.
- Enter your repository name and click Search. Once it is found, click Connect.
- Navigate to the bottom of the Deploy page to the Manual deploy section and select main from the branch dropdown menu.
- Click the Deploy Branch button to begin deployment.
- The deployment process should happen smoothly if all deployment files are fully functional. Click the button Open App at the top of the page to access your App.
- If the build fails, check the build log carefully to troubleshoot what went wrong.
If you wish to fork or clone this repository, please follow the instructions below:
- In the top right of the main repository page, click the Fork button.
- Under Owner, select the desired owner from the dropdown menu.
- OPTIONAL: Change the default name of the repository in order to distinguish it.
- OPTIONAL: In the Description field, enter a description for the forked repository.
- Ensure the 'Copy the main branch only' checkbox is selected.
- Click the Create fork button.
- On the main repository page, click the Code button.
- Copy the HTTPS URL from the resulting dropdown menu.
- In your IDE terminal, navigate to the directory you want the cloned repository to be created.
- In your IDE terminal, type `git clone` and paste the copied URL.
- Hit Enter to create the cloned repository.
WARNING: The packages listed in the requirements.txt file are limited to those necessary for the deployment of the dashboard to Heroku, due to the limit on the slug size.
In order to ensure all the correct dependencies are installed in your local environment, run the following command in the terminal:
pip install -r full-requirements.txt
## Credits

- The custom function for checking the effect of data cleaning on distribution was partially taken from the Code Institute "Data Analytics Packages - ML: feature-engine" module.
- David Langer Hyperparameter Tuning video was used to help define the hyperparameter values used for optimisation.
- The custom function for carrying out hyperparameter optimisation was taken from the Code Institute "Data Analytics Packages - ML: Scikit-learn" module.
- The custom function for displaying the confusion matrix and analysing model performance was taken from the Code Institute "Data Analytics Packages - ML: Scikit-learn" module.
- The multi-page class was taken from the Code Institute "Data Analysis & Machine Learning Toolkit" streamlit lessons.
## Acknowledgements

- Thanks to my mentor Mo Shami for his invaluable guidance and detailed feedback throughout this project, which greatly contributed to its successful development and implementation.