A comprehensive data science project for pharmaceutical sales forecasting using advanced machine learning techniques, featuring interactive dashboards and automated reporting.
- Project Overview
- Interactive Dashboard
- Business Problem
- Key Features
- Dataset Description
- Methodology
- Results & Insights
- Visualizations
- Project Structure
- Installation & Usage
- Technologies Used
- Future Improvements
This project addresses a critical challenge in the pharmaceutical industry: accurate sales forecasting across multiple drug categories. By leveraging time series analysis and machine learning, I developed predictive models to support New Product Development (NPD) and strategic decision-making.
- Forecast monthly pharmaceutical sales across 8 ATC drug categories
- Compare baseline (Prophet) vs. advanced (XGBoost) models
- Generate actionable insights for NPD strategy
- Create interactive dashboards for stakeholder presentations
- Toggle between 8 drug categories (M01AB, M01AE, N02BA, N02BE, N05B, N05C, R03, R06)
- Compare Actual vs Prophet vs XGBoost predictions
- Interactive hover tooltips showing exact values
- Zoom and pan capabilities for detailed analysis
- Responsive design that works on mobile and desktop
Pharmaceutical companies need accurate sales forecasts to:
- Optimize inventory management
- Plan production schedules
- Identify high-growth product categories
- Support New Product Development decisions
- Allocate marketing resources effectively
Developed a dual-model forecasting system combining:
- Prophet (Facebook's time series tool) - Baseline model
- XGBoost with feature engineering - Advanced model
- Comprehensive statistical analysis of 2,106 daily sales records
- Temporal pattern identification (trends, seasonality)
- Missing value analysis and data quality checks
- Interactive visualizations with Plotly
- Lag Features: Historical sales values (1-month lag)
- Rolling Averages: 3-month moving averages
- Temporal Features: Year, Month extraction
- Category-Specific Features: Per-drug engineered variables
- Baseline Model: Prophet with yearly seasonality
- Advanced Model: XGBoost with Optuna hyperparameter tuning
- Validation: Chronological train-test split (80/20), with the most recent months held out
- Evaluation Metrics: MAE, RMSE for model comparison
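For reference, the two evaluation metrics can be written out in a few lines of plain Python. This is a minimal sketch rather than the notebook's code, which relies on scikit-learn's helpers; the function names `mae` and `rmse` and the sample numbers are illustrative assumptions.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error: the average size of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more than MAE does."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up monthly sales, for illustration only
actual = [100.0, 120.0, 90.0, 110.0]
predicted = [110.0, 115.0, 100.0, 90.0]

print(mae(actual, predicted))   # 11.25
print(rmse(actual, predicted))  # 12.5
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few months are badly missed, which is why reporting both is useful.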
Pharmaceutical Sales Data (Kaggle)
- Period: January 2014 - October 2019
- Granularity: Daily → Aggregated to Monthly
- Records: 2,106 daily entries → 70 monthly aggregates
| Category | Description | Avg Daily Sales |
|---|---|---|
| M01AB | Anti-inflammatory (non-steroids) | 5.03 |
| M01AE | Propionic acid derivatives | 3.90 |
| N02BA | Salicylic acid derivatives | 3.88 |
| N02BE | Anilides (Pain relievers) | 29.92 ⭐ |
| N05B | Anxiolytics | 8.85 |
| N05C | Hypnotics and sedatives | 0.59 |
| R03 | Anti-asthmatics | 5.51 |
| R06 | Antihistamines | 2.90 |
Note: N02BE (Pain relievers) has the highest sales volume, making it a prime candidate for NPD focus.
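import pandas as pd
# Date conversion & resampling to monthly frequency
# ('ME' requires pandas 2.2+; use 'M' on older versions)
df['datum'] = pd.to_datetime(df['datum'])
df_monthly = df.resample('ME', on='datum').sum()
# Feature engineering on the monthly series ('sales' stands for one category column)
df_monthly['lag1'] = df_monthly['sales'].shift(1)
df_monthly['ma3'] = df_monthly['sales'].rolling(3).mean()
- Seasonality: Yearly patterns enabled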
- Training: 54 months of historical data
- Forecasting: 14-month test period
- Best For: Capturing trend + seasonality
- Hyperparameter Tuning: Optuna (20 trials)
- Features: Lags + Moving Averages
- Optimization Target: Minimize MAE
- Best For: Capturing complex non-linear patterns
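The preprocessing and split steps above can be tried end to end on synthetic data. The sketch below is illustrative only: the random `sales` column, the exact start and end dates, and the use of `dropna` before splitting are assumptions, chosen because they reproduce the 2,106 daily rows, 70 months, and 54/14 month split quoted in this README. Whether the notebook arrives at those numbers the same way is not confirmed. Months are grouped with `to_period("M")`, which gives the same totals as a month-end resample without depending on a pandas version-specific alias.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily data (dates and values are assumptions).
dates = pd.date_range("2014-01-02", "2019-10-08", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"datum": dates, "sales": rng.poisson(30, len(dates))})

# Daily -> monthly totals, one row per calendar month.
monthly = df.groupby(df["datum"].dt.to_period("M"))["sales"].sum().to_frame()

# Features as in the README snippet: previous-month sales and a 3-month
# moving average. Note that this rolling window includes the current month;
# shift the series by one first if the feature must use past values only.
monthly["lag1"] = monthly["sales"].shift(1)
monthly["ma3"] = monthly["sales"].rolling(3).mean()
monthly["year"] = monthly.index.year
monthly["month"] = monthly.index.month

# The first two months have incomplete features and are dropped.
features = monthly.dropna()

# Chronological 80/20 split: no shuffling, the test set is the latest months.
split = int(len(features) * 0.8)
train, test = features.iloc[:split], features.iloc[split:]

print(len(df), len(monthly), len(train), len(test))  # 2106 70 54 14
```

Splitting by position rather than at random keeps every test month strictly later than every training month, which is what a forecasting evaluation needs.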
from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error
MAE = mean_absolute_error(actual, predicted)
RMSE = sqrt(mean_squared_error(actual, predicted))

| Model | MAE | RMSE | Performance |
|---|---|---|---|
| Prophet | 313.74 | 434.05 | Baseline |
| XGBoost | 163.80 ⭐ | 208.06 ⭐ | 47.8% lower MAE |
Key Finding: XGBoost roughly halves Prophet's error (MAE 163.80 vs. 313.74), most likely because the lag and moving-average features capture recent sales patterns.
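The improvement figure can be checked directly from the reported errors. The short sketch below uses only the numbers in the results table; the helper name `pct_reduction` is illustrative.

```python
# Test-set errors as reported in the results table
prophet_errors = {"MAE": 313.74, "RMSE": 434.05}
xgb_errors = {"MAE": 163.80, "RMSE": 208.06}


def pct_reduction(baseline, model):
    """Relative error reduction of `model` over `baseline`, in percent."""
    return (baseline - model) / baseline * 100


for metric in ("MAE", "RMSE"):
    change = pct_reduction(prophet_errors[metric], xgb_errors[metric])
    print(f"{metric}: {change:.1f}% lower")
# MAE: 47.8% lower
# RMSE: 52.1% lower
```

The 47.8% figure therefore refers to MAE; measured by RMSE the reduction is slightly larger.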
Top Predictors (across all categories):
- Lag Features (previous month sales) - 45% importance
- 3-Month Moving Average - 30% importance
- Seasonal Components - 25% importance
- Priority Categories:
  - N02BE (Pain Relievers): Highest volume + stable demand
  - N05B (Anxiolytics): Growing trend detected
  - R03 (Anti-asthmatics): Seasonal peaks exploitable
- Inventory Optimization:
  - Use XGBoost forecasts for procurement planning
  - Stock 10-15% buffer for high-variance categories (N05C)
- Marketing Strategy:
  - Align campaigns with predicted demand peaks
  - Focus resources on N02BE category expansion
- Future Model Enhancements:
  - Integrate external data (holidays, promotions)
  - Implement ensemble methods (Prophet + XGBoost)
  - Add competitor pricing features
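The 10-15% buffer recommended above for high-variance categories is simple arithmetic on top of a forecast. A minimal sketch, assuming a proportional buffer; the function name `buffered_order` and the forecast value are hypothetical, not taken from the notebook.

```python
def buffered_order(forecast, buffer=0.10):
    """Forecast demand plus a proportional safety buffer."""
    if not 0 <= buffer <= 1:
        raise ValueError("buffer must be a fraction between 0 and 1")
    return forecast * (1 + buffer)


forecast_units = 20.0  # hypothetical monthly forecast for one category
low = buffered_order(forecast_units, 0.10)
high = buffered_order(forecast_units, 0.15)
print(f"Order between {low:.1f} and {high:.1f} units")
# Order between 22.0 and 23.0 units
```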
- XGBoost captures short-term fluctuations better than Prophet
- Both models identify seasonal patterns successfully
- Prophet tends to over-predict during low-demand periods
- XGBoost maintains accuracy across volatile market conditions
pharma-sales-forecasting/
│
├── pharmasales.ipynb # Main analysis notebook
├── sales_dashboard.html # Interactive dashboard
├── Sales_N02BE.png # Results visualization
├── XGboot_feature_N02BE.PNG # Feature importance chart
├── requirements.txt # Python dependencies
├── README.md # This file
├── LICENSE # Project license
└── .gitignore # Git exclusions
- Python 3.11+
- pip package manager
- Jupyter Notebook
git clone https://github.com/Omneya21/Pharma-Sales-Forecasting.git
cd Pharma-Sales-Forecasting
pip install -r requirements.txt
# Open Jupyter Notebook
jupyter notebook pharmasales.ipynb
Option 1: View Online (Recommended)
https://htmlpreview.github.io/?https://github.com/Omneya21/Pharma-Sales-Forecasting/blob/main/sales_dashboard.html
Option 2: View Locally
# Simply double-click sales_dashboard.html
# Or open in browser:
firefox sales_dashboard.html # Linux
open sales_dashboard.html # Mac
start sales_dashboard.html # Windows
- pandas (2.0+) - Data manipulation
- numpy (1.24+) - Numerical computing
- matplotlib / seaborn - Static visualizations
- plotly - Interactive dashboards
- Prophet (1.1+) - Time series baseline
- XGBoost (2.0+) - Gradient boosting
- scikit-learn (1.3+) - Model evaluation
- Optuna (3.0+) - Hyperparameter optimization
- Jupyter Notebook - Interactive analysis
- Git - Version control
- Python - Primary language
- Implement ensemble methods (stacking Prophet + XGBoost)
- Add LSTM/GRU for deep learning comparison
- Integrate external features (holidays, weather, economic indicators)
- Develop multi-step ahead forecasts (3-6 months)
- Add competitor pricing data
- Include promotional campaign indicators
- Incorporate seasonality index per category
- Create product lifecycle stage features
- Deploy model as REST API (FastAPI)
- Automate retraining pipeline (MLflow)
- Build real-time dashboard (Streamlit)
- Implement CI/CD with GitHub Actions
- Add demand forecasting for new product launches
- Create what-if scenario analysis tool
- Develop ROI calculator for NPD decisions