Done #13

37 changes: 37 additions & 0 deletions .gitignore
@@ -0,0 +1,37 @@
# Visual Studio cache and user settings
.vs/
*.suo
*.user
*.userosscache
*.sln.docstates
*.VC.db

# Visual Studio .db & .wal files
*.db
*.db-shm
*.db-wal
*.vsidx

# Python __pycache__ directory and .pyc files
__pycache__/
*.py[cod]

# SQLite database files (optional, if kept locally)
*.sqlite

# JetBrains IDEs (optional)
.idea/

# VS Code settings (optional)
.vscode/

# System files
.DS_Store
Thumbs.db

# Logs and temporary files
*.log
*.tmp

# Exclude JSON (only if you really want this)
*.json
Binary file added .vs/project-2-eda-sql/v17/.wsuo
Binary file added .vs/slnx.sqlite
Binary file added Naamloze presentatie.pptx
74 changes: 44 additions & 30 deletions README.md
@@ -1,45 +1,59 @@
![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)
# Delivery Insights Dashboard

# Business Challenge: EDA and SQL
## Overview

## Introduction
**Delivery Insights Dashboard** is a modular data analytics application focused on the procurement process and supplier performance. It transforms raw JSON datasets from internal APIs into insightful, interactive visualizations. The tool is intended to help supply chain analysts, procurement teams, and operational managers understand where deliveries deviate from planning and identify improvement opportunities.

A data project's lifecycle has many phases; it is rarely an isolated analysis in a single tool.
In this project you will carry out an analysis using both Python and SQL to obtain the final result, exploring the behavior of each tool along the way.
The application is built using:

## Project Overview
- **Python** for backend data transformation and analytics
- **Pandas & NumPy** for data manipulation
- **Streamlit** for the interactive frontend UI
- **Plotly** for visually rich, customizable charts
- **SciPy** for statistical testing and significance analysis

Pick a dataset from our common dataset repos and break your work into these major steps:
1. Pick a topic and choose a dataset on that topic. Build around 10 business questions to answer about this topic.
   - Try to build the questions before knowing everything about the data.
   - If that is not possible, do step 2 first.
2. Data Analysis: understand your dataset and create a report (Word document) about it.
3. Data Exploration and Business Understanding:
   - Import your dataset into SQL.
   - Answer your business questions with SQL queries.
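The SQL import in step 3 can be sketched as follows; a minimal example, assuming a toy DataFrame and an SQLite database (the table, column names, and values are placeholders, not part of the brief):

```python
import sqlite3

import pandas as pd

# Hypothetical toy dataset standing in for the chosen topic dataset.
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [100.0, 250.0, 80.0]})

# Load the DataFrame into SQL so the business questions can be
# answered with plain SQL queries afterwards.
con = sqlite3.connect(":memory:")
df.to_sql("orders", con, if_exists="replace", index=False)

# Example business question: what is the total order amount?
total = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

For a real project you would point `to_sql` at your database server instead of an in-memory SQLite connection.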
All data transformations and visual outputs are generated dynamically from user input, allowing deep-dive exploration without writing code.

---

## Dataset repos
## Features

- [Kaggle](https://www.kaggle.com/)
- [Machine Learning Repository](https://archive.ics.uci.edu/)
- [Pordata](https://www.pordata.pt/)
- [And many more](https://medium.com/@LearnPythonProgramming/best-data-sources-for-datasets-beyond-kaggle-98aac51e971e)
### ✅ Automated Data Ingestion
- Downloads and caches procurement and delivery datasets from local JSON endpoints
- Ensures repeatable and fail-safe fetching using fallback and logging logic
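The fallback-and-logging idea can be sketched as follows; the endpoint URL and cache path are purely illustrative, not the project's actual configuration:

```python
import json
import logging
import pathlib
import urllib.request

log = logging.getLogger("ingestion")

def fetch_json(url: str, cache: pathlib.Path) -> dict:
    """Fetch JSON from `url` and cache it locally; fall back to the
    cached copy when the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            data = json.load(resp)
        cache.write_text(json.dumps(data))  # refresh the cache on success
        return data
    except OSError as exc:
        log.warning("Fetch failed (%s); using cached copy", exc)
        return json.loads(cache.read_text())
```

Because `urllib` errors subclass `OSError`, a dead endpoint degrades to the last cached snapshot instead of crashing the dashboard.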

### 🧼 Robust Data Cleaning
- Utilizes a reusable `DataFrameCleaner` utility to standardize column types and formats
- Handles datetime conversion, missing values, string normalization, and invalid data filtering
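The cleaning steps above boil down to operations like these; a sketch of the equivalent pandas calls, with made-up column names and data:

```python
import pandas as pd

# Made-up raw data showing the typical problems the cleaner handles.
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "not a date"],
    "qty": ["3", "null"],
})

# String "None"/"null" placeholders become real missing values.
raw = raw.replace(["None", "null"], pd.NA)

# Invalid datetimes and numbers are coerced to NaT/NaN instead of raising.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce", utc=True)
raw["qty"] = pd.to_numeric(raw["qty"], errors="coerce")
```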

## Bonus
### 📦 Delivery Performance Tracking
- Calculates **expected vs. actual delivery dates** per order line
- Derives key indicators such as:
- Whether a line was **fully delivered**
- Number of deliveries per order line
- Delay in days (positive or negative) relative to expected delivery
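These indicators can be derived with plain pandas arithmetic; a sketch with illustrative column names and values:

```python
import pandas as pd

# Hypothetical order lines; column names are invented for illustration.
lines = pd.DataFrame({
    "expected": pd.to_datetime(["2024-03-01", "2024-03-10"]),
    "delivered": pd.to_datetime(["2024-03-04", "2024-03-08"]),
    "qty_ordered": [10, 5],
    "qty_received": [10, 3],
})

# Delay in days: positive means late, negative means early.
lines["delay_days"] = (lines["delivered"] - lines["expected"]).dt.days

# An order line counts as fully delivered once the received
# quantity covers the ordered quantity.
lines["fully_delivered"] = lines["qty_received"] >= lines["qty_ordered"]
```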

- Bonus points if you augment your data with data you obtain through web scraping
- Bonus points if you include visualizations from Python and/or Tableau in the final presentation
### 📊 Advanced Visualizations
- Uses **Plotly** for bar, line, and stacked visualizations
- Includes supplier filtering, top-X percent segmentation, and missing value detection
- Shows both **order-level** and **order-line-level** analyses

## Deliverables
### 📈 Timeliness & Trends
- Tracks monthly delivery frequency per supplier
- Visualizes how suppliers perform over time
- Automatically highlights most active suppliers
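The monthly frequency tracking reduces to a groupby over delivery months; a sketch with an invented deliveries table:

```python
import pandas as pd

# Illustrative deliveries table; supplier names and dates are made up.
deliveries = pd.DataFrame({
    "supplier": ["Acme", "Acme", "Beta", "Acme"],
    "delivered": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-01-12", "2024-02-03"]
    ),
})

# Count deliveries per supplier per calendar month.
monthly = (
    deliveries
    .groupby(["supplier", deliveries["delivered"].dt.to_period("M")])
    .size()
    .rename("deliveries")
    .reset_index()
)
```

Sorting `monthly` by the count column then surfaces the most active suppliers automatically.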

1. **Python code:** well-documented Python code that performs the analysis and the SQL upload.
2. **SQL text file (.sql):** a well-commented file with all the queries answering the business questions.
3. **Short presentation**, structured as follows:
   - Intro slides: introduce the problem and the datasets
   - Data cleaning and assumptions
   - Business questions and SQL queries (one slide per question with a screenshot of the query and its answer is enough)
4. **PDF document** with any notes you want to share.
### 📉 Statistical Insights
- Performs chi-squared tests for independence between delivery categories and responsible staff
- Calculates **Cramér’s V** to evaluate effect strength
- Flags statistically significant results and displays contingency tables interactively
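The test behind this tab can be sketched as follows; the contingency table is invented for illustration, and Cramér's V is computed directly from the chi-squared statistic:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: delivery category (rows: on-time, late)
# versus responsible staff member (columns).
table = np.array([[30, 10],
                  [20, 40]])

chi2, p, dof, _ = chi2_contingency(table)

# Cramér's V: chi-squared normalised by sample size and table dimensions,
# giving an effect-size measure between 0 (no association) and 1.
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

significant = p < 0.05  # flag statistically significant results
```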

### 🧭 Interactive Filtering
- Year selector: isolate one or multiple years of delivery data
- Supplier selector: choose specific suppliers or rely on automatic relevance filtering (top %)
- Modular layout in Streamlit tabs for clarity and drilldown
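The relevance filter can be sketched without the Streamlit widgets (the supplier names, years, and 50% cutoff below are invented); in the app, `selected_years` would come from the year selector:

```python
import pandas as pd

# Illustrative orders table.
orders = pd.DataFrame({
    "supplier": ["Acme", "Beta", "Acme", "Gamma", "Acme", "Beta"],
    "year": [2023, 2023, 2024, 2024, 2024, 2024],
})

# Year filter, standing in for the Streamlit year selector.
selected_years = [2024]
subset = orders[orders["year"].isin(selected_years)]

# "Top %" relevance filter: keep the busiest suppliers covering,
# say, the top 50% of order volume.
counts = subset["supplier"].value_counts()
top = counts[counts.cumsum() <= counts.sum() * 0.5].index
```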

---

## Project Structure

Binary file added __pycache__/cleanup.cpython-312.pyc
Binary file added __pycache__/cleanup.cpython-313.pyc
Binary file added __pycache__/eda_service.cpython-312.pyc
Binary file added __pycache__/eda_service.cpython-313.pyc
Binary file added __pycache__/loader.cpython-312.pyc
Binary file added __pycache__/loader.cpython-313.pyc
Binary file added __pycache__/ui.cpython-312.pyc
Binary file added __pycache__/ui.cpython-313.pyc
141 changes: 141 additions & 0 deletions cleanup.py
@@ -0,0 +1,141 @@
import pandas as pd
import numpy as np

class DataFrameCleaner:
    def __init__(self, df: pd.DataFrame, name: str = "DataFrame", log_enabled: bool = False):
        """
        Initialize the cleaner using the input DataFrame directly (no copy).

        Parameters:
        - df: The input pandas DataFrame to clean (modified in-place).
        - name: Optional name used in logs to identify this cleaner instance.
        - log_enabled: If True, log messages will be printed to stdout.
        """
        self.df = df  # Direct use, no .copy()
        self.name = name
        self.log_enabled = log_enabled

    def _log(self, message: str):
        """
        Internal helper to print log messages only when logging is enabled.
        """
        if self.log_enabled:
            print(message)

    def drop_columns(self, columns: list):
        """
        Drop specified columns from the DataFrame if they exist.

        Parameters:
        - columns: List of column names to drop.
        """
        self._log(f"\n=== {self.name} — Dropping Specified Columns ===")
        existing_cols = [col for col in columns if col in self.df.columns]
        missing_cols = [col for col in columns if col not in self.df.columns]

        if existing_cols:
            self.df.drop(columns=existing_cols, inplace=True)
            self._log(f"Dropped columns: {', '.join(existing_cols)}")
        else:
            self._log("No columns to drop.")

        if missing_cols:
            self._log(f"Skipped (not found): {', '.join(missing_cols)}")

    def apply_dtype_mapping(self, mapping: dict = None):
        """
        Apply data type conversions to columns as specified in the mapping.

        Parameters:
        - mapping: Dictionary where keys are column names and values are target data types.
                   Supported types: 'datetime', 'numeric', 'str', 'bool', or any valid numpy/pandas dtype.
        """
        self._log(f"\n=== {self.name} — Applying Type Mappings ===")
        if mapping is None:
            self._log("No mapping provided.")
            return

        converted = []
        failed = []

        # Only keep columns that exist in the DataFrame
        valid_mapping = {col: typ for col, typ in mapping.items() if col in self.df.columns}
        skipped = [col for col in mapping if col not in self.df.columns]

        # Convert all datetime columns in bulk
        datetime_cols = [col for col, typ in valid_mapping.items() if typ == 'datetime']
        if datetime_cols:
            try:
                self.df[datetime_cols] = self.df[datetime_cols].apply(
                    pd.to_datetime, errors='coerce', utc=True
                )
                converted.extend([(col, 'datetime') for col in datetime_cols])
            except Exception:
                failed.extend(datetime_cols)

        # Convert all 'str' and 'bool' columns using bulk astype
        astype_map = {col: typ for col, typ in valid_mapping.items() if typ in ['str', 'bool']}
        if astype_map:
            try:
                self.df = self.df.astype(astype_map)
                converted.extend(astype_map.items())
            except Exception:
                failed.extend(astype_map.keys())

        # Handle other types individually (e.g. 'numeric', custom dtypes)
        for col, typ in valid_mapping.items():
            if typ in ['datetime', 'str', 'bool']:
                continue  # Already handled
            try:
                if typ == 'numeric':
                    self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
                else:
                    self.df[col] = self.df[col].astype(typ)
                converted.append((col, typ))
            except Exception:
                failed.append(col)

        # Logging results
        if converted:
            self._log("Converted columns: " + ", ".join(f"{col}: {typ}" for col, typ in converted))
        if skipped:
            self._log(f"Skipped (not in DataFrame): {', '.join(skipped)}")
        if failed:
            self._log(f"Failed to convert: {', '.join(failed)}")

    def rename_columns(self, rename_map: dict):
        """
        Rename columns in the DataFrame using a provided mapping.

        Parameters:
        - rename_map: Dictionary mapping old column names to new names.
        """
        self._log(f"\n=== {self.name} — Renaming Columns ===")
        existing = {k: v for k, v in rename_map.items() if k in self.df.columns}
        missing = [k for k in rename_map if k not in self.df.columns]

        if existing:
            self.df.rename(columns=existing, inplace=True)
            self._log("Renamed columns: " + ", ".join(f"{k} -> {v}" for k, v in existing.items()))
        else:
            self._log("No columns were renamed.")

        if missing:
            self._log(f"Skipped (not found): {', '.join(missing)}")

    def normalize_nones(self):
        """
        Replace string values 'None' and 'null' (as text) with pandas NA (missing values).
        """
        self._log(f"\n=== {self.name} — Replacing 'None'/'null' strings with NaN ===")
        self.df.replace(["None", "null"], pd.NA, inplace=True)

    def get_cleaned_df(self):
        """
        Return the cleaned DataFrame.

        Returns:
        - pandas DataFrame after all applied transformations.
        """
        return self.df