Author: Vivek Kumar
This document provides a comprehensive overview of the end-to-end machine learning (ML) pipeline developed for the PPG Project.
It details each stage of the process, from data extraction and preprocessing to model training and evaluation, ensuring clarity and reproducibility for future reference.
Here is the quick glance of the ML pipeline:

- Introduction
- PPG Overview
- HRV Analysis
- Data Collection
- Data Extraction
- Data Preprocessing
- Feature Reduction
- Model Evaluation
- Results
- Conclusion
The PPG Project aims to develop a machine learning pipeline for analyzing physiological data to detect anxiety levels.
This documentation outlines the workflow, from raw data collection to model evaluation, providing detailed explanations for each step and the scripts involved.
Photoplethysmography (PPG) is a non-invasive optical method to measure blood volume changes in the skin. It monitors heart rate and blood oxygen (SpO2) levels.
- Light Emission: An LED shines light into the skin.
- Absorption and Reflection: Blood absorbs light, and the rest is reflected back.
- Detection: A sensor measures the reflected light.
- Blood Volume Changes: With each heartbeat, blood volume changes, altering light absorption.
- Signal Generation: The sensor converts light changes into an electrical signal (PPG waveform).
- Data Analysis: The PPG signal is analyzed to determine heart rate and blood oxygen levels (SpO2).
- Heart Rate Monitoring: Detects heartbeats to calculate heart rate.
- SpO2 Monitoring: Uses dual-wavelength light to estimate blood oxygen levels.
- Non-Invasive: Suitable for continuous use in wearables like fitness trackers and smartwatches.
- Heart Rate Monitoring
- SpO2 Monitoring
- HRV Analysis
- Sleep Monitoring
- Fitness Tracking
- Cardiovascular Detection
- Stress Monitoring
- Remote Patient Monitoring
- Blood Pressure Estimation
- Peripheral Circulation Monitoring
- Medical Research
Heart Rate Variability (HRV) of a PPG (Photoplethysmogram) signal is the measurement of the variation in time intervals between successive pulse peaks reflecting the heartbeats. HRV provides insights into autonomic nervous system function and cardiovascular health.
- Cardiovascular Health: Higher HRV indicates better cardiovascular health and resilience.
- Stress and Recovery Monitoring: Tracks stress levels and recovery in everyday life and athletic training.
- Mental Health Monitoring: Lower HRV is associated with anxiety, depression, and other mental health conditions.
- Early Health Warnings: Changes in HRV can provide early warnings of potential health issues.
- Chronic Condition Management: Useful in managing conditions like diabetes, hypertension, and heart disease.
- Research and Clinical Studies: Widely used to study the ANS and its impact on various physiological and pathological conditions.
Here are some examples of HRV metrics which one can get from any body signal.
- MeanNN: The average length of the intervals between normal heartbeats.
- SDNN: Standard deviation of the NN intervals.
- RMSSD: Square root of the mean squared differences between successive NN intervals.
- pNN50: The proportion of successive NN intervals that differ by more than 50 milliseconds.
- LF (Low Frequency): Represents the power in the low-frequency range (0.04–0.15 Hz).
- HF (High Frequency): Represents the power in the high-frequency range (0.15–0.4 Hz).
- LF/HF Ratio: Provides insights into the balance between sympathetic and parasympathetic activities.
We conducted a survey to collect PPG data from 101 participants, capturing their PPG time series data during various activities. This project specifically uses activity1 data from all participants.
- Directory:
Raw- Subdirectory:
GSR_data- Description: Contains directories for each participant (e.g.,
101_GSR,102_GSR), with each directory holding CSV files for different activities. For this project, we are using theactivity.csvfile from each participant's directory.
- Description: Contains directories for each participant (e.g.,
- File:
participant_response.xlsx- Description: Contains self-assessment anxiety test data collected from participants.
- Subdirectory:
- Feature Extraction:
- Script:
FeatureExtraction.py - Description: Extracts relevant features from the raw data.
- Script:
- Label Extraction:
- Script:
LabelExtraction.py - Description: Extracts corresponding labels for the features.
- Script:
- Raw Features and Labels:
- Files:
features.csv,labels.csv - Description: CSV files containing extracted features and labels from the raw data.
- Files:
- Script:
MergingData.py - Output File:
merged.csv - Description: Merges the extracted features and labels into a single dataset for analysis.
- File:
merged.csv - Description: Combined dataset containing both features and labels.
- Script:
FRbyMissing.py - Output File:
reduced_I.csv - Description: Removes features based on missing values.
- Files:
corr_mat_I.xlsxcorr_hmap_I.png
- Script:
FRbyCorr.py - Description: Plots correlation matrix for initial 85 features.
- Script:
FRbyCorr.py - Output File:
reduced_II.csv - Description: Further reduces features based on correlation analysis.
- Files:
corr_mat_II.xlsxcorr_hmap_II.png
- Script:
FRbyCorr.py - Description: Plots correlation matrix for the reduced set of 41 features.
- Scripts:
- Various models with KNN Imputer and Robust Scaler
- Hyperparameter tuning scripts
- Description: Evaluates different models on the preprocessed data with hyperparameter tuning to identify the best-performing model.
- Output:
model_comparison.pdf - Description: PDF file containing the comparison of different models' performance.
- File:
model_comparison.pdf - Description: Document summarizing the performance of different models, including metrics and visualizations.
This documentation provides a detailed walkthrough of the ML pipeline for the PPG Project, from raw data extraction to model evaluation. By following this guide, the entire process can be reproduced and validated, ensuring transparency and reliability in the analysis.