
🌟 Pulsar Star Classification using Machine Learning


An end-to-end machine learning pipeline to identify pulsar stars from radio telescope data, achieving 92% recall and minimizing false negatives to maximize astronomical discoveries.


🔭 Overview

Pulsars are highly magnetized rotating neutron stars that emit beams of electromagnetic radiation. With only ~3,000 pulsars discovered out of an estimated 100,000 in our galaxy, automated detection is crucial for accelerating astronomical research.

This project builds a machine learning classifier to identify pulsar candidates from radio telescope data, reducing manual review workload by 98% while maintaining a 92% detection rate (recall).

🎯 Key Achievements

  • ✅ 92% Recall - Successfully detects 284 out of 308 pulsars
  • ✅ 24 False Negatives - Lowest miss rate across all tested models
  • ✅ 96.68% Accuracy - Strong overall performance
  • ✅ Handles Severe Class Imbalance - 91:9 negative-to-positive ratio

🎯 Problem Statement

Why This Matters

Pulsars provide insights into:

  • Extreme physics (neutron stars, gravitational waves)
  • Tests of general relativity
  • Deep space navigation systems
  • Understanding stellar evolution

The Challenge

  1. Severe Class Imbalance: Only 9.16% of candidates are actual pulsars
  2. High-Stakes Decisions: Missing a pulsar = lost scientific discovery
  3. Asymmetric Cost: False negatives (missed discoveries) are worse than false positives (false alarms)

Solution Approach

Build a machine learning classifier that prioritizes recall (maximizing pulsar detection) while maintaining acceptable precision (minimizing wasted telescope follow-up time).


📊 Dataset

Source: HTRU2 Dataset from UCI Machine Learning Repository

Specifications

  • Size: 17,898 observations
  • Features: 8 continuous variables
  • Target: Binary (0 = Not Pulsar, 1 = Pulsar)
  • Class Distribution:
    • Not Pulsar: 16,259 (90.84%)
    • Pulsar: 1,639 (9.16%)
    • Imbalance Ratio: ~10:1

Features

Integrated Profile Statistics:

  • Mean, Standard Deviation, Kurtosis, Skewness of integrated pulse profile

DM-SNR Curve Statistics:

  • Mean, Standard Deviation, Kurtosis, Skewness of DM-SNR curve
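With pandas, loading the data is a one-liner; the raw file ships without a header row, so column names must be assigned. A minimal sketch (two illustrative rows stand in for the real CSV, and the column names are this sketch's own choice, not necessarily the repo's):

```python
import io
import pandas as pd

# HTRU2 is a headerless CSV: eight statistics plus the 0/1 label.
# Two illustrative rows stand in here for data/raw/HTRU_2.csv.
sample_csv = io.StringIO(
    "140.5625,55.68,-0.2346,-0.6996,3.1998,19.1104,7.9755,74.2422,0\n"
    "121.1563,48.3730,0.0375,0.3233,1.6773,14.8602,10.5765,127.3936,1\n"
)
columns = [
    "mean_profile", "std_profile", "kurtosis_profile", "skewness_profile",
    "mean_dmsnr", "std_dmsnr", "kurtosis_dmsnr", "skewness_dmsnr",
    "target",
]
df = pd.read_csv(sample_csv, header=None, names=columns)

X = df.drop(columns="target")   # 8 continuous features
y = df["target"]                # 0 = not pulsar, 1 = pulsar
```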

🔬 Methodology

Phase 1: Exploratory Data Analysis

  • Statistical analysis and distribution checks
  • Correlation analysis and feature relationships
  • Outlier detection and class imbalance assessment
  • Visualization of feature separability

Key Finding: Kurtosis features capture pulse shape characteristics critical for classification.
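That finding is easy to check directly: grouping the kurtosis feature by class shows pulsars sitting far from the non-pulsar bulk. A toy sketch with made-up values (real HTRU2 rows show the same pattern):

```python
import pandas as pd

# Toy frame standing in for HTRU2 -- kurtosis of the integrated profile
# tends to sit much higher for pulsars than for non-pulsars
df = pd.DataFrame({
    "kurtosis_profile": [-0.23, 0.10, -0.05, 2.9, 3.4, 4.1],
    "target": [0, 0, 0, 1, 1, 1],
})

# Per-class mean of the feature reveals the separation at a glance
summary = df.groupby("target")["kurtosis_profile"].mean()
print(summary)
```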

Phase 2: Data Preprocessing

Outlier Handling

Decision: Retained outliers
Rationale: Pulsars are extreme objects; their "outlier" values represent real astronomical signals, not measurement errors.

Feature Scaling

Method: RobustScaler
Rationale: Uses median and IQR, unaffected by outliers (unlike StandardScaler which uses mean/std).

# Proper train-test split to prevent data leakage
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on training data only
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse training medians/IQRs

Class Imbalance Handling

Approaches Tested:

  1. SMOTE (Synthetic Minority Over-sampling Technique)

    • Created synthetic minority samples
    • Balanced training set to 50:50 ratio
    • Result: Good baseline performance
  2. Class Weights (Final Choice) ✅

    • Used XGBoost's scale_pos_weight parameter
    • Trained on original imbalanced data with weighted loss
    • Result: Best performance (24 false negatives)

scale_pos_weight = len(y_train[y_train==0]) / len(y_train[y_train==1])
# ≈ 9.76 - tells XGBoost to treat the minority class as ~9.76x more important

Phase 3: Model Development

Systematically evaluated multiple algorithms:

1. Logistic Regression (Baseline)

  • Purpose: Establish minimum performance benchmark
  • Result: 96.76% accuracy, 91% recall, 29 FN
  • Conclusion: Strong baseline, room for improvement

2. Random Forest

  • Purpose: Capture non-linear relationships and feature interactions
  • Result: 97.79% accuracy (highest), 90% recall, 30 FN
  • Notable: Revealed Kurtosis_Integrated as most important feature (33%)

3. XGBoost (Final Model) ⭐

  • Purpose: State-of-the-art gradient boosting with class imbalance handling
  • Result: 96.68% accuracy, 92% recall (highest), 24 FN (lowest)
  • Decision: Selected for minimizing false negatives

Hyperparameter Tuning Experiment

Used RandomizedSearchCV to optimize XGBoost:

  • Unexpected Result: Tuning increased false negatives (27→31)
  • Decision: Retained simpler model with better FN performance
  • Learning: Default parameters often well-designed; blind tuning can hurt target metrics

📈 Results

Final Model Performance

Model: XGBoost with Class Weights
════════════════════════════════════════

Accuracy:  96.68%
Recall:    92.21% ⭐ (Highest)
Precision: 74.93%
F1-Score:  82.75%

Confusion Matrix:
                 Predicted
              Not Pulsar  |  Pulsar
Actual ──────────────────┼──────────
Not Pulsar    3177  ✅   |   95  ❌
Pulsar          24  ❌   |  284  ✅

False Negatives: 24 (7.79% of pulsars missed)
False Positives: 95 (acceptable for discovery science)
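Figures like these come straight from scikit-learn's standard evaluation utilities; given true labels and predictions (toy values below), the counts unpack as:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels/predictions; in the project these come from the held-out
# split and model.predict(X_test_scaled)
y_test = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]

# sklearn's binary confusion matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(classification_report(y_test, y_pred, target_names=["Not Pulsar", "Pulsar"]))
```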

Model Comparison

Model                 Accuracy   Recall   Precision   False Negatives
Logistic Regression   96.76%     91%      76%         29
Random Forest         97.79%     90%      85%         30
XGBoost (Final)       96.68%     92%      75%         24 ⭐

Business Impact

Without ML:

  • Manual review of 17,898 candidates
  • Time-consuming and error-prone
  • Potential missed pulsars due to fatigue

With ML System:

  • Flags 379 high-probability candidates (98% reduction)
  • 284 true pulsars + 95 false alarms
  • Only 24 pulsars missed (7.79%)
  • Result: 98% workload reduction, 92% detection rate

💡 Key Learnings

Technical Insights

  1. Class Imbalance: scale_pos_weight outperformed SMOTE for XGBoost
  2. Feature Engineering: Model-based importance differed from visual EDA
  3. Scaling: RobustScaler essential for astronomical data with outliers
  4. Data Leakage: Always split before preprocessing
  5. Evaluation: Accuracy misleading for imbalanced data; recall critical

Methodological Insights

  1. Hyperparameter Tuning: Doesn't always improve target metrics
  2. Multiple Approaches: SMOTE vs class weights - context determines best choice
  3. Domain Knowledge: False negatives worse than false positives in discovery science
  4. Model Selection: Best model โ‰  highest accuracy; depends on business objective

🚀 Installation

Prerequisites

Python 3.8+

Setup

  1. Clone the repository

git clone https://github.com/[YOUR_USERNAME]/pulsar-star-classification.git
cd pulsar-star-classification

  2. Install dependencies

pip install -r requirements.txt

Or install manually:

pip install numpy pandas scikit-learn xgboost imbalanced-learn matplotlib seaborn

  3. Download dataset

# Automatically downloads the HTRU2 dataset
python download_data.py

💻 Usage

Training the Model

# Run complete pipeline
python train_model.py

# Or step-by-step in Jupyter
jupyter notebook notebooks/pulsar_classification.ipynb

Making Predictions

import pickle
import numpy as np

# Load model and scaler
with open('models/xgboost_final.pkl', 'rb') as f:
    model = pickle.load(f)
    
with open('models/scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Prepare new data (8 feature values in the training column order)
new_data = np.array([[140.5, 55.7, -0.23, -0.70, 3.2, 19.1, 8.0, 74.2]])
scaled_data = scaler.transform(new_data)

# Predict
prediction = model.predict(scaled_data)
probability = model.predict_proba(scaled_data)

print(f"Prediction: {'Pulsar' if prediction[0] == 1 else 'Not Pulsar'}")
print(f"Confidence: {probability[0][prediction[0]]:.2%}")

๐Ÿ“ Project Structure

pulsar-star-classification/
│
├── data/
│   ├── raw/                    # Original HTRU2 dataset
│   └── processed/              # Preprocessed data
│
├── notebooks/
│   ├── 01_EDA.ipynb           # Exploratory Data Analysis
│   ├── 02_preprocessing.ipynb # Data preprocessing
│   ├── 03_modeling.ipynb      # Model training & comparison
│   └── 04_evaluation.ipynb    # Final evaluation & analysis
│
├── src/
│   ├── data_preprocessing.py  # Preprocessing functions
│   ├── model_training.py      # Model training utilities
│   ├── evaluation.py          # Evaluation metrics
│   └── utils.py               # Helper functions
│
├── models/
│   ├── xgboost_final.pkl      # Final trained model
│   └── scaler.pkl             # Fitted RobustScaler
│
├── visualizations/
│   ├── eda_plots/             # EDA visualizations
│   ├── model_comparison/      # Model performance plots
│   └── confusion_matrices/    # Confusion matrices
│
├── requirements.txt           # Python dependencies
├── train_model.py             # Training script
├── download_data.py           # Dataset download script
├── README.md                  # This file
└── LICENSE                    # MIT License

๐Ÿ› ๏ธ Technologies

Core Libraries:

  • NumPy - Numerical computing
  • Pandas - Data manipulation
  • Scikit-learn - ML algorithms, preprocessing, evaluation
  • XGBoost - Gradient boosting implementation
  • imbalanced-learn - SMOTE and imbalance handling

Visualization:

  • Matplotlib - Basic plotting
  • Seaborn - Statistical visualizations

Development:

  • Jupyter - Interactive development
  • PyCharm - Local IDE
  • Git/GitHub - Version control

🔮 Future Improvements

Short-term

  • Ensemble voting (combine Logistic Regression + RF + XGBoost)
  • SHAP values for model interpretability
  • Streamlit web app for easy predictions

Long-term

  • Deep learning approach (1D CNN on raw signal data)
  • Anomaly detection framework (Isolation Forest)
  • Real-time prediction API
  • Integration with telescope data pipelines

📧 Contact

Ansul Suryawanshi
📧 Email: ansul2612@gmail.com
💼 LinkedIn: Ansul Suryawanshi
🐙 GitHub: Ansul-S


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🌟 Acknowledgments

  • Dataset: HTRU2 from UCI Machine Learning Repository
  • Inspiration: Real-world astronomical discovery challenges
  • Learning: Hands-on project-based learning approach

โญ If you found this project helpful, please consider giving it a star!


Built with ❤️ and a lot of effort
