Predict Every Batch. Prevent Every Failure.
Team Knights — Aditya Rana · Aryan Pratap Singh · Sandeep Kumar
"One system. Four modules. Seven predictions. Zero guesswork."
- Overview
- Key Features
- Project Structure
- Tech Stack
- Dataset
- Modules
- Quick Start
- Running the Pipeline
- API Reference
- Dashboard
- Results
- Documentation
Modern manufacturing plants generate massive amounts of process data every batch — yet most of it goes unanalyzed. Energy spikes, quality drops, and equipment failures continue to surprise operators, costing time, money, and carbon.
PRISM is an AI-driven manufacturing intelligence system built for pharmaceutical tablet manufacturing that predicts batch quality, yield, and energy consumption before production completes — while continuously monitoring equipment health through power and vibration patterns to catch failures before they happen.
Given process parameters (compression force, machine speed, drying temperature, etc.), PRISM:
- Predicts all quality, yield, and performance targets before the batch completes
- Monitors energy consumption patterns phase-by-phase to detect anomalies during the batch
- Explains every prediction using SHAP values so operators know why, not just what
- Tracks carbon footprint with adaptive targets aligned to regulatory requirements
Domain: Pharmaceutical tablet manufacturing
Dataset: 60 production batches (T001–T060) + 1 minute-by-minute sensor log
Target Accuracy: R² ≥ 0.90 across all primary quality targets (achieved: ≥ 0.95 on XGBoost)
Hackathon: National AI/ML Hackathon by AVEVA — Team Knights
| Feature | Description |
|---|---|
| 🎯 Multi-Target Prediction | Simultaneously predicts Hardness, Friability, Dissolution Rate, Content Uniformity, Disintegration Time, Tablet Weight, and Energy (kWh) |
| ⚡ Energy Pattern Analysis | Phase-wise power + vibration monitoring with Isolation Forest & LSTM Autoencoder anomaly detection |
| 🔍 SHAP Explainability | Per-prediction and global feature importance — operators understand why, not just what |
| 🌍 Carbon Footprint Tracker | CO₂e per batch using India CEA grid factor (0.716 kg/kWh) with adaptive target setting |
| 🎛️ What-If Optimizer | Real-time slider-based parameter explorer — predictions update in < 100ms |
| 📊 Batch Fingerprinting | Radar chart comparison of any two batches against the all-time best |
| 📉 CUSUM Drift Detection | Detects gradual quality/energy degradation across batches over time |
| 🧪 Composite Quality Score | Single 0–100 score blending all quality targets into one actionable metric |
| 📈 Benchmark Dashboard | Full model performance report (R², MAE, RMSE, MAPE) with anomaly detector metrics |
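The CUSUM drift detector listed above can be sketched as a one-sided cumulative-sum monitor on per-batch energy. This is a minimal illustration on synthetic data — the slack `k`, threshold `h`, and the simulated energy series are invented for the sketch, not the project's actual values:

```python
import numpy as np

def cusum_upward(values, target, k=0.5, h=5.0):
    """One-sided upward CUSUM: flag indices where cumulative drift above
    `target` exceeds h (in standard-deviation units); k is per-step slack."""
    sigma = np.std(values) or 1.0
    s, alarms = 0.0, []
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target) / sigma - k)
        if s > h:
            alarms.append(i)
    return alarms

# synthetic energy-per-batch series: stable around 70 kWh for 30 batches,
# then a slow upward drift — the gradual degradation CUSUM is built to catch
rng = np.random.default_rng(4)
energy = np.concatenate([
    rng.normal(70, 1.0, 30),
    rng.normal(70, 1.0, 30) + np.linspace(0, 6, 30),
])
alarms = cusum_upward(energy, target=70.0)
```

A single out-of-spec batch will not trip the alarm; only a sustained shift accumulates past `h`, which is what distinguishes drift detection from simple thresholding.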
```
manufacturing-intelligence/
│
├── 📄 README.md
├── 📄 SETUP.md ← Step-by-step environment setup guide
├── 📄 PIPELINE.md ← Plain-English pipeline explanation
├── 📄 BENCHMARK.md ← Model benchmark report with metrics
│
├── 📂 data/
│ ├── raw/
│ │ ├── _h_batch_process_data.xlsx ← Sensor log (T001, 211 min × 11 cols)
│ │ └── _h_batch_production_data.xlsx ← Batch records (60 batches × 15 cols)
│ ├── processed/
│ │ ├── merged_dataset.csv ← Final ML input (60 × ~22 features)
│ │ ├── phase_features.csv ← Phase-aggregated sensor features
│ │ ├── batch_outcomes.csv ← Cleaned outcomes + derived targets
│ │ └── carbon_history.csv ← Per-batch CO₂e with adaptive targets
│ └── simulated/
│ └── simulated_sensors.csv ← Physics-based sensor data T001–T060
│
├── 📂 notebooks/ ← Core analysis (run in order)
│ ├── 01_EDA.ipynb ← Exploratory data analysis
│ ├── 02_feature_engineering.ipynb ← Simulation + feature extraction
│ ├── 03_multitarget_models.ipynb ← Model training + evaluation
│ ├── 04_anomaly_detection.ipynb ← Isolation Forest + LSTM Autoencoder
│ └── 05_explainability.ipynb ← SHAP analysis + plots
│
├── 📂 analysis/ ← Deep-dive analysis notebooks
│ ├── 01_data_profiling.ipynb ← Full stats, missing values, outliers
│ ├── 02_correlation_deep_dive.ipynb ← Pearson/Spearman/VIF analysis
│ ├── 03_phase_energy_analysis.ipynb ← Phase energy breakdown, CUSUM drift
│ ├── 04_model_comparison.ipynb ← CV scores, residuals, timing benchmarks
│ └── 05_business_impact.ipynb ← ROI, carbon savings, grid scenarios
│
├── 📂 src/
│ ├── config.py ← Constants, paths, thresholds
│ ├── preprocessing.py ← Load, validate, normalize
│ ├── simulate_sensors.py ← Physics-based T002–T060 simulation
│ ├── feature_engineering.py ← Phase aggregation, FFT, derived features
│ ├── multi_target_model.py ← XGBoost + RF + MLP + stacking ensemble
│ ├── anomaly_detector.py ← Isolation Forest + LSTM Autoencoder
│ ├── shap_explainer.py ← SHAP value computation + plots
│ ├── carbon_calculator.py ← CO₂e calculation + adaptive targets
│ ├── run_pipeline.py ← Master training script
│ └── utils.py ← Shared helpers
│
├── 📂 models/ ← Serialized trained models (after pipeline run)
│ ├── xgb_multitarget.pkl
│ ├── rf_multitarget.pkl
│ ├── mlp_model.keras
│ ├── stacking_meta.pkl
│ ├── isolation_forest.pkl
│ ├── lstm_autoencoder.keras
│ ├── scaler.pkl
│ ├── shap_values.pkl
│ ├── lstm_threshold.json
│ ├── lstm_norm_params.json
│ ├── evaluation_results.json ← Per-target R², MAE, RMSE, MAPE
│ └── pipeline_summary.json ← Full run summary
│
├── 📂 api/
│ ├── main.py ← FastAPI app + all route handlers
│ └── schemas.py ← Pydantic request/response models
│
├── 📂 dashboard/ ← Next.js web dashboard (React + TypeScript)
│ ├── package.json
│ ├── next.config.ts
│ ├── tsconfig.json
│ └── src/
│ ├── app/
│ │ ├── layout.tsx ← Root layout + navigation
│ │ ├── page.tsx ← Home redirect
│ │ ├── ClientLayout.tsx ← Tab-based navigation shell
│ │ └── globals.css
│ ├── components/
│ │ ├── MetricCard.tsx ← Reusable metric display card
│ │ ├── Slider.tsx ← Parameter input slider
│ │ └── tabs/
│ │ ├── PredictionsTab.tsx ← Tab 1: Quality & energy predictions
│ │ ├── EnergyTab.tsx ← Tab 2: Phase charts + anomaly alerts
│ │ ├── ComparisonTab.tsx ← Tab 3: Radar chart batch comparison
│ │ ├── CarbonTab.tsx ← Tab 4: CO₂e trends + targets
│ │ ├── WhatIfTab.tsx ← Tab 5: Real-time parameter explorer
│ │ └── BenchmarkTab.tsx ← Tab 6: Full model benchmark report
│ └── lib/
│
├── 📂 tests/
│ ├── test_preprocessing.py
│ ├── test_models.py
│ └── test_api.py
│
├── 📂 docs/
│ ├── ARCHITECTURE.md ← System design + diagrams
│ ├── IMPLEMENTATION_PLAN.md ← Build guide + code skeletons + timeline
│ └── PROJECT_DOCUMENTATION.md ← Strategy, research rationale, business impact
│
└── 📄 requirements.txt
```
| Layer | Technology | Purpose |
|---|---|---|
| Data Processing | pandas, numpy, openpyxl | Tabular data manipulation, Excel reading |
| ML — Gradient Boosting | xgboost 2.0 | Primary multi-output prediction model |
| ML — Ensemble | scikit-learn | Random Forest, Ridge meta-learner, Isolation Forest, scalers |
| Deep Learning | tensorflow 2.15 / keras | LSTM Autoencoder for sequential anomaly detection, MLP regression |
| Hyperparameter Tuning | optuna | Bayesian search over XGBoost hyperparameters (50 trials) |
| Explainability | shap | TreeExplainer for XGBoost; beeswarm, waterfall, bar plots |
| API Backend | fastapi, uvicorn, pydantic | REST endpoints; auto Swagger docs; < 100ms inference |
| Web Dashboard | Next.js 16, React 19, TypeScript | 6-tab interactive dashboard consuming the FastAPI backend |
| Charts | recharts, plotly | Time-series, radar, and bar charts |
| Serialization | joblib | Model persistence across sessions |
| Signal Processing | scipy | FFT analysis of vibration signals for motor health |
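The scipy FFT role can be illustrated on a synthetic minute-resolution vibration trace. The 211-sample length loosely mirrors the T001 sensor log, but the signal itself and its 20-minute periodic component are invented for this sketch:

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

# 211 samples taken 60 s apart, like the minute-by-minute sensor log;
# baseline vibration plus a slow periodic component and measurement noise
n, dt = 211, 60.0
t = np.arange(n) * dt
rng = np.random.default_rng(3)
signal = 4.5 + 0.8 * np.sin(2 * np.pi * t / (20 * 60)) + rng.normal(0, 0.1, n)

# remove the DC offset, then look at the magnitude spectrum
spectrum = np.abs(rfft(signal - signal.mean()))
freqs = rfftfreq(n, d=dt)               # frequencies in Hz

dominant_hz = freqs[np.argmax(spectrum)]
period_min = 1 / dominant_hz / 60       # strongest periodic component, in minutes
```

At one sample per minute, the recoverable periods sit in the minutes-to-hours range (process cycling, drift), which is what the feature engineering targets; true motor-shaft frequencies would need a much faster sampling rate.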
**Sensor log (`_h_batch_process_data.xlsx`)**
- 211 rows × 11 columns, 1 batch (T001), no missing values
- Captures: Temperature, Pressure, Humidity, Motor Speed, Compression Force, Flow Rate, Power Consumption (kW), Vibration (mm/s)
- Covers 8 sequential manufacturing phases over 211 minutes
| Phase | Energy Used | Key Signal |
|---|---|---|
| Compression | 38.69 kWh (50.4%) 🔴 | Highest energy — #1 optimization target |
| Milling | 9.00 kWh (11.7%) | Highest vibration (9.79 mm/s) |
| Drying | 10.09 kWh (13.1%) | Temperature + time sensitive |
| Others | 18.96 kWh (24.7%) | Lower priority |
**Batch production data (`_h_batch_production_data.xlsx`)**
- 60 rows × 15 columns (T001–T060), no missing values
- 8 input features: Granulation Time, Binder Amount, Drying Temp, Drying Time, Compression Force, Machine Speed, Lubricant Concentration, Moisture Content
- 6 output targets: Hardness, Friability, Dissolution Rate, Content Uniformity, Disintegration Time, Tablet Weight
Note: Feature-target correlations of 0.96–0.99 across most pairs — this dataset is highly structured and models reliably exceed R² = 0.93 on primary targets.
Ensemble of XGBoost + Random Forest → Ridge stacking meta-learner. Predicts 7 targets simultaneously from 8 process parameters + phase features. Uses 5-fold cross-validation and Optuna hyperparameter search. MLP is trained but excluded from the stacking ensemble due to overfitting on the 60-sample dataset.
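The stacking pattern can be sketched per target as follows. This is a minimal illustration on synthetic data — `GradientBoostingRegressor` stands in for XGBoost so the snippet needs only scikit-learn, and the synthetic features/target are invented; the real implementation lives in `src/multi_target_model.py`:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                 # 60 batches x 8 process parameters
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=60)  # one synthetic target

base_models = {
    "gbr": GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
    "rf": RandomForestRegressor(n_estimators=200, random_state=0),
}

# 5-fold out-of-fold predictions keep the meta-learner honest: every base-model
# prediction it trains on was made by a fold that never saw that row
oof = np.column_stack(
    [cross_val_predict(m, X, y, cv=5) for m in base_models.values()]
)
meta = Ridge(alpha=1.0).fit(oof, y)          # per-target Ridge meta-learner

# refit base models on all data for inference
for m in base_models.values():
    m.fit(X, y)

def predict(X_new):
    stacked = np.column_stack([m.predict(X_new) for m in base_models.values()])
    return meta.predict(stacked)
```

One such meta-learner is fitted per target, which is how a single pipeline run yields all seven predictions.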
Two-layer anomaly detection:
- Isolation Forest — fast batch-level screening (~5ms)
- LSTM Autoencoder — deep sequential pattern analysis (~50ms)
Trained on physics-simulated sensor data for all 60 batches. Root cause attribution via domain-knowledge rule engine (bearing wear, motor overload, process drift).
shap.TreeExplainer on XGBoost models provides exact Shapley values per feature per prediction. Beeswarm plots for global insights; waterfall plots for per-batch explanations.
Converts predicted Energy_kWh to Carbon_kgCO2e using India CEA grid factor (0.716 kg/kWh). Adaptive target setting: dynamically adjusts goals based on best 10th-percentile operational performance vs regulatory floor. Supports India / EU / US / Renewable grid scenarios.
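The conversion and target logic reduce to a few lines. The history values and the regulatory floor below are illustrative placeholders, not project numbers; only the 0.716 kg/kWh factor comes from the source:

```python
import numpy as np

GRID_FACTOR_KG_PER_KWH = 0.716   # India CEA average grid emission factor

def batch_co2e(energy_kwh, grid_factor=GRID_FACTOR_KG_PER_KWH):
    """Convert batch energy consumption (kWh) to kg CO2-equivalent."""
    return energy_kwh * grid_factor

def adaptive_target(history_kwh, regulatory_floor_kwh=55.0):
    """Target = best 10th-percentile historical energy, clipped so it never
    undercuts the regulatory floor (floor value here is hypothetical)."""
    p10 = float(np.percentile(history_kwh, 10))
    return max(p10, regulatory_floor_kwh)

history = [72.4, 68.1, 75.0, 70.2, 66.8, 71.5, 69.9, 74.3, 67.4, 73.0]
target_kwh = adaptive_target(history)
co2e = batch_co2e(72.4)   # 72.4 kWh -> ~51.8 kg CO2e
```

Switching grid scenarios (EU / US / Renewable) only changes `grid_factor`; the rest of the logic is unchanged.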
Full guide: see `SETUP.md` for step-by-step instructions.
```bash
git clone https://github.com/your-team/manufacturing-intelligence.git
cd manufacturing-intelligence

python -m venv venv
venv\Scripts\Activate.ps1        # Windows PowerShell
# source venv/bin/activate       # macOS / Linux

pip install -r requirements.txt
```

Place the raw Excel files in `data/raw/`:

```
data/raw/
├── _h_batch_process_data.xlsx
└── _h_batch_production_data.xlsx
```
Train the models:

```bash
# Quick run (~2 min, no tuning):
python src/run_pipeline.py

# With Optuna XGBoost tuning (~7 min):
python src/run_pipeline.py --tune
```

Start the API server:

```bash
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
```

API docs available at: http://localhost:8000/docs

Launch the dashboard:

```bash
cd dashboard
npm install   # first time only
npm run dev
```

Dashboard available at: http://localhost:3000
Run the tests:

```bash
pytest tests/ -v
```

`run_pipeline.py` executes every step end-to-end:

```bash
python src/run_pipeline.py
```

Or run the steps individually:

```bash
# Step 1: Load & validate raw data
python -c "from src.preprocessing import load_data, validate_data; load_data()"

# Step 2: Sensor simulation for T002–T060
python src/simulate_sensors.py

# Step 3: Feature engineering
python src/feature_engineering.py

# Step 4: Train multi-target prediction models
python src/multi_target_model.py

# Step 5: Train anomaly detection models
python src/anomaly_detector.py

# Step 6: Compute SHAP values + generate plots
python src/shap_explainer.py

# Step 7: Build carbon footprint history
python src/carbon_calculator.py
```

The pipeline writes the following artifacts:

| File | Description |
|---|---|
| `models/xgb_multitarget.pkl` | XGBoost multi-output model |
| `models/rf_multitarget.pkl` | Random Forest model |
| `models/mlp_model.keras` | MLP neural network (Keras format) |
| `models/stacking_meta.pkl` | Stacking ensemble bundle |
| `models/isolation_forest.pkl` | Isolation Forest anomaly detector |
| `models/lstm_autoencoder.keras` | LSTM Autoencoder (Keras format) |
| `models/scaler.pkl` | StandardScaler for feature normalization |
| `models/shap_values.pkl` | Pre-computed SHAP values |
| `models/lstm_threshold.json` | LSTM anomaly reconstruction threshold |
| `models/evaluation_results.json` | Per-target R², MAE, RMSE, MAPE |
| `models/pipeline_summary.json` | Full run summary |
Base URL: http://localhost:8000
Interactive docs: http://localhost:8000/docs
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/health` | Model load status |
| POST | `/api/predict` | Predict all quality targets + energy |
| POST | `/api/anomaly` | Detect energy anomalies for a batch |
| GET | `/api/explain/{batch_id}` | SHAP feature contributions |
| GET | `/api/carbon/{batch_id}` | CO₂e + adaptive target for a batch |
| GET | `/api/batches` | List all available batch IDs |
| GET | `/api/carbon_history` | Full carbon history (all batches) |
| GET | `/api/model_metrics` | Full benchmark: R², MAE, anomaly metrics |
```bash
curl -X POST "http://localhost:8000/api/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "granulation_time": 16,
    "binder_amount": 9.0,
    "drying_temp": 60,
    "drying_time": 29,
    "compression_force": 12.0,
    "machine_speed": 170,
    "lubricant_conc": 1.2,
    "moisture_content": 2.0
  }'
```

Response:

```json
{
  "hardness": 89.4,
  "friability": 0.81,
  "dissolution_rate": 90.7,
  "content_uniformity": 98.2,
  "disintegration_time": 8.3,
  "tablet_weight": 202.1,
  "energy_kwh": 72.4,
  "carbon_kg_co2e": 51.8,
  "composite_quality_score": 82.3
}
```

Check a batch for energy anomalies:

```bash
curl -X POST "http://localhost:8000/api/anomaly" \
  -H "Content-Type: application/json" \
  -d '{ "batch_id": "T045" }'
```

Get SHAP feature contributions for a batch and target:

```bash
curl "http://localhost:8000/api/explain/T023?target=Dissolution_Rate"
```

Get the carbon footprint for a batch under a grid scenario:

```bash
curl "http://localhost:8000/api/carbon/T045?grid=India"
```

Fetch the full benchmark data (regression R²/MAE/RMSE/MAPE per model & target, anomaly metrics):

```bash
curl "http://localhost:8000/api/model_metrics"
```
The Next.js web dashboard (http://localhost:3000) has 6 tabs, all consuming the FastAPI backend:
| Tab | Description |
|---|---|
| 🔮 Predictions | Enter 8 process parameters → get all quality predictions + Composite Quality Score |
| ⚡ Energy Monitor | Select any batch → view phase-wise power/vibration chart + anomaly score + root cause alerts |
| 📊 Batch Comparison | Compare any two batches side-by-side via normalized radar charts + delta table |
| 🌍 Carbon Footprint | Trend chart of CO₂e across all batches + adaptive target line + grid selector (India/EU/US/Renewable) |
| 🎛️ What-If Optimizer | Move sliders for any parameter → predictions update live in < 100ms |
| 📈 Benchmark | Full model performance report — R², MAE, RMSE, MAPE per model & target; anomaly detector metrics |
```bash
cd dashboard
npm install   # first time only
npm run dev   # http://localhost:3000
```

| Target | XGBoost R² | RF R² | Stacking R² |
|---|---|---|---|
| Hardness | 0.9895 | 0.9826 | 0.9896 |
| Friability | 0.9810 | 0.9530 | 0.9722 |
| Dissolution Rate | 0.9902 | 0.9727 | 0.9832 |
| Content Uniformity | 0.9926 | 0.9765 | 0.9919 |
| Disintegration Time | 0.9869 | 0.9733 | 0.9876 |
| Tablet Weight | 0.9327 | 0.9000 | 0.9571 |
| Energy kWh | 0.8094 | 0.8479 | 0.7799 |
| Overall Mean R² | 0.9546 | 0.9437 | 0.9516 |
Production model: Stacking Ensemble (XGBoost + RandomForest + per-target Ridge meta-learner, 5-fold OOF). MLP excluded — severely overfits on n=60 dataset (overall R² = –10.15).
| Model | Precision | Recall | F1 | AUC-ROC |
|---|---|---|---|---|
| Isolation Forest | 16.67% | 16.67% | 0.167 | 0.682 |
| LSTM Autoencoder | 10.00% | 100% | 0.182 | 0.324 |
Low precision is expected — severe class imbalance (6/60 anomalous). LSTM recall of 100% means zero missed anomalies. See `BENCHMARK.md` for the full analysis.
| Metric | Saving |
|---|---|
| Energy reduction (8–10% per batch) | ~4,350 kWh/year |
| Carbon reduction | ~3,100 kg CO₂e/year |
| Batch rejection prevention (est. 30 fewer/year) | ~₹15 lakh/year |
| Early anomaly detection | Prevents catastrophic equipment failure |
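The carbon figure in the table follows directly from the energy figure and the 0.716 kg/kWh grid factor used throughout:

```python
# sanity check: annual carbon saving implied by the annual energy saving
energy_saved_kwh_per_year = 4350   # from the table above
grid_factor = 0.716                # India CEA average, kg CO2e per kWh

carbon_saved_kg = energy_saved_kwh_per_year * grid_factor
print(round(carbon_saved_kg))      # 3115, reported above as ~3,100 kg CO2e/year
```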
| File | Contents |
|---|---|
| `SETUP.md` | Environment setup, data placement, running all services |
| `PIPELINE.md` | Plain-English explanation of every pipeline step |
| `BENCHMARK.md` | Full model performance benchmark with metric definitions |
| `docs/ARCHITECTURE.md` | System architecture diagrams, layer breakdowns, API schemas |
| `docs/IMPLEMENTATION_PLAN.md` | Dataset analysis, code skeletons, training strategy, timeline |
| `docs/PROJECT_DOCUMENTATION.md` | Problem statement, tech stack rationale, business impact, references |
- Federated Learning — train across multiple factories without sharing raw batch data
- Digital Twin — couple predictive models with physics-based tablet press simulation
- Reinforcement Learning — RL agent learns optimal parameter settings through reward signals
- Edge Deployment — quantize LSTM Autoencoder to ONNX for IIoT edge node deployment
- Real-Time IIoT Pipeline — Apache Kafka → stream processor → live inference → Grafana
- Real-Time Carbon API — integrate Electricity Maps API for marginal (not average) emission factors
- Sensor time-series data exists for T001 only; T002–T060 are physics-based simulations
- MLP overfit severely on n=60 dataset (R² = –10.15) and is excluded from production ensemble
- `Energy_kWh` is physics-derived (not directly measured); lower R² (0.78–0.85) is expected
- Anomaly precision/recall appear low due to extreme class imbalance (10% anomaly rate)
- Carbon calculation uses annual-average grid emission factor, not real-time marginal
- n=60 batches is small; production system requires 500+ batches for robust generalization
Built by Team Knights for the National AI/ML Hackathon by AVEVA
PRISM — Predictive Reliability & Intelligence for Smart Manufacturing