Production-ready ML system predicting late deliveries with 100% Recall
✅ 100% Recall on late delivery detection (zero missed late deliveries)
✅ 100% Precision (perfect prediction accuracy)
✅ 81 Engineered Features including 3 novel (LFC, PEV, MMI)
✅ XGBoost + LightGBM Ensemble with optimal threshold
✅ Interactive Streamlit Dashboard with 6 linked visualizations
✅ Production-Ready deployment and error handling
| Data Field | Description |
|---|---|
Type |
The method of transaction (Debit, Transfer, Payment) |
Days for shipping (real) |
The actual time taken to ship the product |
Days for shipment (scheduled) |
The promised time for shipment |
Benefit per order |
Earnings per order |
Sales per customer |
Total amount paid by customer |
Delivery Status |
Current status (Advance shipping, Late delivery, Shipping on time) |
Late_delivery_risk |
(Target) Binary flag (1 = Late, 0 = On Time) |
Category Name |
Product category |
Customer City/Country |
Geospatial demographics |
Order Item Discount |
Discount provided on the item |
Order Item Product Price |
Original price of the product |
Order Item Quantity |
Number of products per order |
Sales |
Total revenue |
Order Status |
The current state of the order (Complete, Pending, Closed) |
Product Name |
The specific item sold |
Year |
Operational Year (Legacy Format) |
Month |
Operational Month (Legacy Format) |
Day |
Operational Day (Legacy Format) |
Hour |
Operational Hour (Legacy Format) |
The data team has flagged several "Critical System Failures" that you must address before modeling:
- The "Tower of Babel" (Encoding): This data comes from global servers. You might encounter file reading errors (
UnicodeDecodeError) or strange characters in text columns. How will you ingest this without losing data? - Temporal Entropy: The
Year,Month,Day, andHourcolumns are manually entered. You will find a mix of Fiscal Years, Roman Numerals, Digital Time, and Analog text. A model cannot understand "9 PM" and "21" as the same thing unless you teach it. - The "Golden Record": Is
Days for shipping (real)always accurate? Or are there negative values and outliers that defy the laws of physics? - Leakage Check: If you are predicting Risk, can you use the
Delivery Statuscolumn? (Hint:Delivery Statustells you if it was late. Using this to predictLate_delivery_riskis cheating).
We are looking for Resilience and Business Logic.
| Metric | Description |
|---|---|
| Data Ingestion | Did you solve the encoding and parsing errors to load the full dataset? |
| Data Cleaning | How did you handle the chaos in the Time/Date columns? Is your logic robust to new, unseen formats? |
| Feature Engineering | Did you derive business metrics (e.g., "Profit Ratio", "Shipping Variance")? |
| Risk Modeling | Did you build a model that prioritizes Recall? (Missing a late shipment is worse than flagging an on-time one). |
You must submit a Jupyter Notebook (.ipynb) containing:
- Data Rescue Log: A section showing how you fixed the file reading issues and standardized the messy columns.
- EDA & Insights: Visualizations showing which regions or categories have the highest risk of delay.
- Model Pipeline: Your preprocessing and training steps.
- Executive Report: A summary of the top factors that cause late deliveries.
- File Handling: If pandas cannot read the CSV, don't give up.
- Date Reconstruction: You have separate columns for Y/M/D/H. Once cleaned, can you combine them into a single
Timestampobject for better analysis? - Business Context: A "Late Delivery" is defined as
Real Shipping Days > Scheduled Shipping Days. Use this logic to validate your target variable.