An end-to-end machine learning pipeline to identify pulsar stars from radio telescope data, achieving 92% recall and minimizing false negatives to maximize astronomical discoveries.
- Overview
- Problem Statement
- Dataset
- Methodology
- Results
- Key Learnings
- Installation
- Usage
- Project Structure
- Technologies
- Future Improvements
- Contact
Pulsars are highly magnetized rotating neutron stars that emit beams of electromagnetic radiation. With only ~3,000 pulsars discovered out of an estimated 100,000 in our galaxy, automated detection is crucial for accelerating astronomical research.
This project builds a machine learning classifier to identify pulsar candidates from radio telescope data, reducing manual review workload by 98% while detecting 92% of true pulsars.
- ✅ 92% Recall - Successfully detects 284 out of 308 pulsars
- ✅ 24 False Negatives - Lowest miss rate across all tested models
- ✅ 96.68% Accuracy - Strong overall performance
- ✅ Handles Severe Class Imbalance - 91:9 negative-to-positive ratio
Pulsars provide insights into:
- Extreme physics (neutron stars, gravitational waves)
- Tests of general relativity
- Deep space navigation systems
- Understanding stellar evolution
- Severe Class Imbalance: Only 9.16% of candidates are actual pulsars
- High-Stakes Decisions: Missing a pulsar = lost scientific discovery
- Asymmetric Cost: False negatives (missed discoveries) are worse than false positives (false alarms)
Build a machine learning classifier that prioritizes recall (maximizing pulsar detection) while maintaining acceptable precision (minimizing wasted telescope follow-up time).
Source: HTRU2 Dataset from UCI Machine Learning Repository
- Size: 17,898 observations
- Features: 8 continuous variables
- Target: Binary (0 = Not Pulsar, 1 = Pulsar)
- Class Distribution:
- Not Pulsar: 16,259 (90.84%)
- Pulsar: 1,639 (9.16%)
- Imbalance Ratio: ~10:1
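A quick sanity check of the distribution figures above (plain Python, using the counts from the dataset description):

```python
# Class counts from the HTRU2 dataset description
total, not_pulsar, pulsar = 17898, 16259, 1639

assert not_pulsar + pulsar == total
print(f"Not Pulsar: {not_pulsar / total:.2%}")          # 90.84%
print(f"Pulsar:     {pulsar / total:.2%}")              # 9.16%
print(f"Imbalance ratio: {not_pulsar / pulsar:.1f}:1")  # ~10:1
```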
Integrated Profile Statistics:
- Mean, Standard Deviation, Kurtosis, Skewness of integrated pulse profile
DM-SNR Curve Statistics:
- Mean, Standard Deviation, Kurtosis, Skewness of DM-SNR curve
- Statistical analysis and distribution checks
- Correlation analysis and feature relationships
- Outlier detection and class imbalance assessment
- Visualization of feature separability
Key Finding: Kurtosis features capture pulse shape characteristics critical for classification.
Decision: Retained outliers
Rationale: Pulsars are extreme objects; their "outlier" values represent real astronomical signals, not measurement errors.
Method: RobustScaler
Rationale: Uses median and IQR, unaffected by outliers (unlike StandardScaler which uses mean/std).
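The difference is easy to see with a toy example (illustrative numbers only, not project data): one extreme value inflates the mean and standard deviation, so StandardScaler squashes the typical points into a tiny band, while median/IQR scaling preserves their spread.

```python
import statistics

# Toy feature: five typical values plus one extreme "pulsar-like" outlier
values = [10, 11, 12, 13, 14, 500]

mean, std = statistics.mean(values), statistics.stdev(values)
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
iqr = q3 - q1

standard = [(v - mean) / std for v in values[:5]]  # StandardScaler-style
robust = [(v - median) / iqr for v in values[:5]]  # RobustScaler-style

# The outlier inflates std (~199) but barely touches the IQR (2.5)
print(f"std-scaled spread:    {max(standard) - min(standard):.3f}")  # 0.020
print(f"robust-scaled spread: {max(robust) - min(robust):.3f}")      # 1.600
```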
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Proper train-test split to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit scaler on training data only
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use training parameters
```

Approaches Tested:
1. SMOTE (Synthetic Minority Over-sampling)
- Created synthetic minority samples
- Balanced training set to 50:50 ratio
- Result: Good baseline performance
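For reference, the core SMOTE idea (interpolate between a minority point and one of its minority-class neighbors) can be sketched in a few lines. The project itself used `imblearn`'s `SMOTE`; this toy version only illustrates the mechanism:

```python
import random

def smote_sample(x, neighbor, rng=random.Random(42)):
    """Create one synthetic minority point on the segment between x and a neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

# Two minority-class points (toy 2-D feature vectors)
a, b = [1.0, 2.0], [3.0, 6.0]
synthetic = smote_sample(a, b)
print(synthetic)  # lies on the line segment between a and b
```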
2. Class Weights (Final Choice) ✅
- Used XGBoost's `scale_pos_weight` parameter
- Trained on original imbalanced data with weighted loss
- Result: Best performance (24 false negatives)
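The effect of a positive-class weight can be sketched as a weighted binary log loss, where errors on the minority (pulsar) class are penalized more heavily. This is an illustration of the idea, not XGBoost's exact internal objective:

```python
import math

def weighted_log_loss(y_true, p_pred, pos_weight=9.76):
    """Binary log loss with the positive-class term up-weighted."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident miss of a pulsar vs an equally confident false alarm
miss_pulsar = weighted_log_loss([1], [0.1])
false_alarm = weighted_log_loss([0], [0.9])
print(miss_pulsar / false_alarm)  # the miss costs pos_weight (9.76x) more
```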
```python
scale_pos_weight = len(y_train[y_train==0]) / len(y_train[y_train==1])
# ≈ 9.76 - tells XGBoost to treat the minority class as 9.76x more important
```

Systematically evaluated multiple algorithms:
1. Logistic Regression (Baseline)
- Purpose: Establish a minimum performance benchmark
- Result: 96.76% accuracy, 91% recall, 29 FN
- Conclusion: Strong baseline, room for improvement
2. Random Forest
- Purpose: Capture non-linear relationships and feature interactions
- Result: 97.79% accuracy (highest), 90% recall, 30 FN
- Notable: Revealed Kurtosis_Integrated as most important feature (33%)
3. XGBoost with Class Weights
- Purpose: State-of-the-art gradient boosting with class imbalance handling
- Result: 96.70% accuracy, 92% recall (highest), 24 FN (lowest)
- Decision: Selected for minimizing false negatives
Used RandomizedSearchCV to optimize XGBoost:
- Unexpected Result: Tuning increased false negatives (27→31)
- Decision: Retained simpler model with better FN performance
- Learning: Default parameters often well-designed; blind tuning can hurt target metrics
```
Model: XGBoost with Class Weights
────────────────────────────────────────
Accuracy:  96.70%
Recall:    92.21% ⭐ (Highest)
Precision: 74.93%
F1-Score:  82.75%
```
Confusion Matrix:

```
                       Predicted
                  Not Pulsar | Pulsar
Actual      ────────────────┼────────
Not Pulsar         3178     |    94
Pulsar               24     |   284
```

False Negatives: 24 (7.79% of pulsars missed)
False Positives: 94 (acceptable for discovery science)
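The headline metrics follow directly from the counts in the confusion matrix above; a quick recomputation in plain Python:

```python
tn, fp, fn, tp = 3178, 94, 24, 284  # counts from the confusion matrix

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"Recall:    {recall:.2%}")    # 92.21%
print(f"Precision: {precision:.2%}")
print(f"Accuracy:  {accuracy:.2%}")  # 96.70%
```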
| Model | Accuracy | Recall | Precision | False Negatives |
|---|---|---|---|---|
| Logistic Regression | 96.76% | 91% | 76% | 29 |
| Random Forest | 97.79% | 90% | 85% | 30 |
| XGBoost (Final) | 96.70% | 92% | 75% | 24 ⭐ |
Without ML:
- Manual review of 17,898 candidates
- Time-consuming and error-prone
- Potential missed pulsars due to fatigue
With ML System:
- Flags 378 high-probability candidates (98% reduction)
- 284 true pulsars + 94 false alarms
- Only 24 pulsars missed (7.79%)
- Result: 98% workload reduction, 92% detection rate
- Class Imbalance: `scale_pos_weight` outperformed SMOTE for XGBoost
- Feature Engineering: Model-based importance differed from visual EDA
- Scaling: RobustScaler essential for astronomical data with outliers
- Data Leakage: Always split before preprocessing
- Evaluation: Accuracy misleading for imbalanced data; recall critical
- Hyperparameter Tuning: Doesn't always improve target metrics
- Multiple Approaches: SMOTE vs class weights - context determines best choice
- Domain Knowledge: False negatives worse than false positives in discovery science
- Model Selection: Best model ≠ highest accuracy; depends on business objective
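The point about accuracy being misleading is easy to demonstrate with this dataset's class distribution: a degenerate model that labels every candidate "not pulsar" still scores ~91% accuracy while detecting zero pulsars.

```python
# Class counts from the HTRU2 dataset
not_pulsar, pulsar = 16259, 1639
total = not_pulsar + pulsar

# "Always predict negative" baseline
accuracy = not_pulsar / total
recall = 0 / pulsar

print(f"Accuracy: {accuracy:.2%}")  # 90.84% - looks strong
print(f"Recall:   {recall:.2%}")    # 0.00%  - scientifically useless
```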
Prerequisites: Python 3.8+

- Clone the repository

```bash
git clone https://github.com/[YOUR_USERNAME]/pulsar-star-classification.git
cd pulsar-star-classification
```

- Install dependencies

```bash
pip install -r requirements.txt
```

Or install manually:

```bash
pip install numpy pandas scikit-learn xgboost imbalanced-learn matplotlib seaborn
```

- Download dataset

```bash
# Automatically downloads HTRU2 dataset
python download_data.py
```

```bash
# Run complete pipeline
python train_model.py

# Or step-by-step in Jupyter
jupyter notebook notebooks/pulsar_classification.ipynb
```

```python
import pickle

import numpy as np

# Load model and scaler
with open('models/xgboost_final.pkl', 'rb') as f:
    model = pickle.load(f)
with open('models/scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Prepare new data (8 features in dataset order)
new_data = np.array([[140.5, 55.7, -0.23, -0.70, 3.2, 19.1, 8.0, 74.2]])
scaled_data = scaler.transform(new_data)

# Predict
prediction = model.predict(scaled_data)
probability = model.predict_proba(scaled_data)
print(f"Prediction: {'Pulsar' if prediction[0] == 1 else 'Not Pulsar'}")
print(f"Confidence: {probability[0][prediction[0]]:.2%}")
```
```
pulsar-star-classification/
│
├── data/
│   ├── raw/                      # Original HTRU2 dataset
│   └── processed/                # Preprocessed data
│
├── notebooks/
│   ├── 01_EDA.ipynb              # Exploratory Data Analysis
│   ├── 02_preprocessing.ipynb    # Data preprocessing
│   ├── 03_modeling.ipynb         # Model training & comparison
│   └── 04_evaluation.ipynb       # Final evaluation & analysis
│
├── src/
│   ├── data_preprocessing.py     # Preprocessing functions
│   ├── model_training.py         # Model training utilities
│   ├── evaluation.py             # Evaluation metrics
│   └── utils.py                  # Helper functions
│
├── models/
│   ├── xgboost_final.pkl         # Final trained model
│   └── scaler.pkl                # Fitted RobustScaler
│
├── visualizations/
│   ├── eda_plots/                # EDA visualizations
│   ├── model_comparison/         # Model performance plots
│   └── confusion_matrices/       # Confusion matrices
│
├── requirements.txt              # Python dependencies
├── train_model.py                # Training script
├── download_data.py              # Dataset download script
├── README.md                     # This file
└── LICENSE                       # MIT License
```
Core Libraries:
- NumPy - Numerical computing
- Pandas - Data manipulation
- Scikit-learn - ML algorithms, preprocessing, evaluation
- XGBoost - Gradient boosting implementation
- imbalanced-learn - SMOTE and imbalance handling
Visualization:
- Matplotlib - Basic plotting
- Seaborn - Statistical visualizations
Development:
- Jupyter - Interactive development
- PyCharm - Local IDE
- Git/GitHub - Version control
- Ensemble voting (combine Logistic Regression + RF + XGBoost)
- SHAP values for model interpretability
- Streamlit web app for easy predictions
- Deep learning approach (1D CNN on raw signal data)
- Anomaly detection framework (Isolation Forest)
- Real-time prediction API
- Integration with telescope data pipelines
- R. J. Lyon et al. (2016). "Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach." Monthly Notices of the Royal Astronomical Society.
- Dataset: UCI Machine Learning Repository - HTRU2
- XGBoost Documentation: https://xgboost.readthedocs.io/
[Ansul Suryawanshi]
- Email: [ansul2612@gmail.com]
- LinkedIn: Ansul Suryawanshi
- GitHub: Ansul-S
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset: HTRU2 from UCI Machine Learning Repository
- Inspiration: Real-world astronomical discovery challenges
- Learning: Hands-on project-based learning approach
⭐ If you found this project helpful, please consider giving it a star!
Built with ❤️ and a lot of effort