Done #13

37 changes: 37 additions & 0 deletions .gitignore
@@ -0,0 +1,37 @@
# Visual Studio cache and user settings
.vs/
*.suo
*.user
*.userosscache
*.sln.docstates
*.VC.db

# Visual Studio .db & .wal files
*.db
*.db-shm
*.db-wal
*.vsidx

# Python __pycache__ directory and .pyc files
__pycache__/
*.py[cod]

# SQLite database files (optional, if kept locally)
*.sqlite

# JetBrains IDEs (optional)
.idea/

# VS Code settings (optional)
.vscode/

# System files
.DS_Store
Thumbs.db

# Logs and temporary files
*.log
*.tmp

# Exclude JSON (only if you really want this)
*.json
Binary file added .vs/project-2-eda-sql/v17/.wsuo
Binary file added .vs/slnx.sqlite
Binary file added Naamloze presentatie.pptx
74 changes: 44 additions & 30 deletions README.md
@@ -1,45 +1,59 @@
![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)
# Delivery Insights Dashboard

# Business Challenge: EDA and SQL
## Overview

## Introduction
**Delivery Insights Dashboard** is a modular data analytics application focused on the procurement process and supplier performance. It transforms raw JSON datasets from internal APIs into insightful, interactive visualizations. The tool is intended to help supply chain analysts, procurement teams, and operational managers understand where deliveries deviate from planning and identify improvement opportunities.

A data project's lifecycle has many phases; it is rarely an isolated analysis in a single tool.
In this project you will carry out an analysis using both Python and SQL to obtain the final result, exploring the behavior of each tool along the way.
The application is built using:

## Project Overview
- **Python** for backend data transformation and analytics
- **Pandas & NumPy** for data manipulation
- **Streamlit** for the interactive frontend UI
- **Plotly** for visually rich, customizable charts
- **SciPy** for statistical testing and significance analysis

Pick a dataset from our common dataset repos and break your work into these major steps:
1. Pick a topic and choose a dataset on that topic. Build around 10 business questions to answer about this topic.
   - Try to build the questions before knowing everything about the data.
   - If that is not possible, do step 2 first.
2. Data Analysis: understand your dataset and create a report (Word document) about it.
3. Data Exploration and Business Understanding:
   - Import your dataset into SQL.
   - Answer your business questions with SQL queries.
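The SQL import in step 3 can be sketched as follows; a minimal example, assuming a toy DataFrame and an SQLite database (the table, column names, and values are placeholders, not part of the brief):

```python
import sqlite3

import pandas as pd

# Hypothetical toy dataset standing in for the chosen topic dataset.
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [100.0, 250.0, 80.0]})

# Load the DataFrame into SQL so the business questions can be
# answered with plain SQL queries afterwards.
con = sqlite3.connect(":memory:")
df.to_sql("orders", con, if_exists="replace", index=False)

# Example business question: what is the total order amount?
total = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

For a real project you would point `to_sql` at your database server instead of an in-memory SQLite connection.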
All data transformations and visual outputs are generated dynamically from user input, allowing deep-dive exploration without writing code.

---

## Dataset repos
## Features

- [Kaggle](https://www.kaggle.com/)
- [Machine Learning Repository](https://archive.ics.uci.edu/)
- [Pordata](https://www.pordata.pt/)
- [And many more](https://medium.com/@LearnPythonProgramming/best-data-sources-for-datasets-beyond-kaggle-98aac51e971e)
### ✅ Automated Data Ingestion
- Downloads and caches procurement and delivery datasets from local JSON endpoints
- Ensures repeatable and fail-safe fetching using fallback and logging logic
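The fallback-and-logging idea can be sketched as follows; the endpoint URL and cache path are purely illustrative, not the project's actual configuration:

```python
import json
import logging
import pathlib
import urllib.request

log = logging.getLogger("ingestion")

def fetch_json(url: str, cache: pathlib.Path) -> dict:
    """Fetch JSON from `url` and cache it locally; fall back to the
    cached copy when the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            data = json.load(resp)
        cache.write_text(json.dumps(data))  # refresh the cache on success
        return data
    except OSError as exc:
        log.warning("Fetch failed (%s); using cached copy", exc)
        return json.loads(cache.read_text())
```

Because `urllib` errors subclass `OSError`, a dead endpoint degrades to the last cached snapshot instead of crashing the dashboard.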

### 🧼 Robust Data Cleaning
- Utilizes a reusable `DataFrameCleaner` utility to standardize column types and formats
- Handles datetime conversion, missing values, string normalization, and invalid data filtering
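The cleaning steps above boil down to operations like these; a sketch of the equivalent pandas calls, with made-up column names and data:

```python
import pandas as pd

# Made-up raw data showing the typical problems the cleaner handles.
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "not a date"],
    "qty": ["3", "null"],
})

# String "None"/"null" placeholders become real missing values.
raw = raw.replace(["None", "null"], pd.NA)

# Invalid datetimes and numbers are coerced to NaT/NaN instead of raising.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce", utc=True)
raw["qty"] = pd.to_numeric(raw["qty"], errors="coerce")
```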

## Bonus
### 📦 Delivery Performance Tracking
- Calculates **expected vs. actual delivery dates** per order line
- Derives key indicators such as:
- Whether a line was **fully delivered**
- Number of deliveries per order line
- Delay in days (positive or negative) relative to expected delivery
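These indicators can be derived with plain pandas arithmetic; a sketch with illustrative column names and values:

```python
import pandas as pd

# Hypothetical order lines; column names are invented for illustration.
lines = pd.DataFrame({
    "expected": pd.to_datetime(["2024-03-01", "2024-03-10"]),
    "delivered": pd.to_datetime(["2024-03-04", "2024-03-08"]),
    "qty_ordered": [10, 5],
    "qty_received": [10, 3],
})

# Delay in days: positive means late, negative means early.
lines["delay_days"] = (lines["delivered"] - lines["expected"]).dt.days

# An order line counts as fully delivered once the received
# quantity covers the ordered quantity.
lines["fully_delivered"] = lines["qty_received"] >= lines["qty_ordered"]
```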

- Bonus points if you augment your data with data you obtain through web scraping
- Bonus points if you include visualizations from Python and/or Tableau in the final presentation
### 📊 Advanced Visualizations
- Uses **Plotly** for bar, line, and stacked visualizations
- Includes supplier filtering, top-X percent segmentation, and missing value detection
- Shows both **order-level** and **order-line-level** analyses

## Deliverables
### 📈 Timeliness & Trends
- Tracks monthly delivery frequency per supplier
- Visualizes how suppliers perform over time
- Automatically highlights most active suppliers
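The monthly frequency tracking reduces to a groupby over delivery months; a sketch with an invented deliveries table:

```python
import pandas as pd

# Illustrative deliveries table; supplier names and dates are made up.
deliveries = pd.DataFrame({
    "supplier": ["Acme", "Acme", "Beta", "Acme"],
    "delivered": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-01-12", "2024-02-03"]
    ),
})

# Count deliveries per supplier per calendar month.
monthly = (
    deliveries
    .groupby(["supplier", deliveries["delivered"].dt.to_period("M")])
    .size()
    .rename("deliveries")
    .reset_index()
)
```

Sorting `monthly` by the count column then surfaces the most active suppliers automatically.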

1. **Python code:** well-documented Python code that performs the analysis and the SQL upload.
2. **SQL text file (.sql):** a well-commented file with all the queries answering the business questions.
3. **Short presentation**, structured as follows:
   - Intro slides: introduce the problem and the datasets
   - Data cleaning and assumptions
   - Business questions and SQL queries (one slide per question with a screenshot of the query and its answer is enough)
4. **PDF document** with any notes you want to share.
### 📉 Statistical Insights
- Performs chi-squared tests for independence between delivery categories and responsible staff
- Calculates **Cramér’s V** to evaluate effect strength
- Flags statistically significant results and displays contingency tables interactively
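The test behind this tab can be sketched as follows; the contingency table is invented for illustration, and Cramér's V is computed directly from the chi-squared statistic:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: delivery category (rows: on-time, late)
# versus responsible staff member (columns).
table = np.array([[30, 10],
                  [20, 40]])

chi2, p, dof, _ = chi2_contingency(table)

# Cramér's V: chi-squared normalised by sample size and table dimensions,
# giving an effect-size measure between 0 (no association) and 1.
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

significant = p < 0.05  # flag statistically significant results
```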

### 🧭 Interactive Filtering
- Year selector: isolate one or multiple years of delivery data
- Supplier selector: choose specific suppliers or rely on automatic relevance filtering (top %)
- Modular layout in Streamlit tabs for clarity and drilldown
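The relevance filter can be sketched without the Streamlit widgets (the supplier names, years, and 50% cutoff below are invented); in the app, `selected_years` would come from the year selector:

```python
import pandas as pd

# Illustrative orders table.
orders = pd.DataFrame({
    "supplier": ["Acme", "Beta", "Acme", "Gamma", "Acme", "Beta"],
    "year": [2023, 2023, 2024, 2024, 2024, 2024],
})

# Year filter, standing in for the Streamlit year selector.
selected_years = [2024]
subset = orders[orders["year"].isin(selected_years)]

# "Top %" relevance filter: keep the busiest suppliers covering,
# say, the top 50% of order volume.
counts = subset["supplier"].value_counts()
top = counts[counts.cumsum() <= counts.sum() * 0.5].index
```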

---

## Project Structure

Binary file added __pycache__/cleanup.cpython-312.pyc
Binary file added __pycache__/cleanup.cpython-313.pyc
Binary file added __pycache__/eda_service.cpython-312.pyc
Binary file added __pycache__/eda_service.cpython-313.pyc
Binary file added __pycache__/loader.cpython-312.pyc
Binary file added __pycache__/loader.cpython-313.pyc
Binary file added __pycache__/ui.cpython-312.pyc
Binary file added __pycache__/ui.cpython-313.pyc
141 changes: 141 additions & 0 deletions cleanup.py
@@ -0,0 +1,141 @@
import pandas as pd
import numpy as np

class DataFrameCleaner:
    def __init__(self, df: pd.DataFrame, name: str = "DataFrame", log_enabled: bool = False):
        """
        Initialize the cleaner using the input DataFrame directly (no copy).

        Parameters:
        - df: The input pandas DataFrame to clean (modified in-place).
        - name: Optional name used in logs to identify this cleaner instance.
        - log_enabled: If True, log messages will be printed to stdout.
        """
        self.df = df  # Direct use, no .copy()
        self.name = name
        self.log_enabled = log_enabled

    def _log(self, message: str):
        """
        Internal helper to print log messages only when logging is enabled.
        """
        if self.log_enabled:
            print(message)

    def drop_columns(self, columns: list):
        """
        Drop specified columns from the DataFrame if they exist.

        Parameters:
        - columns: List of column names to drop.
        """
        self._log(f"\n=== {self.name} — Dropping Specified Columns ===")
        existing_cols = [col for col in columns if col in self.df.columns]
        missing_cols = [col for col in columns if col not in self.df.columns]

        if existing_cols:
            self.df.drop(columns=existing_cols, inplace=True)
            self._log(f"Dropped columns: {', '.join(existing_cols)}")
        else:
            self._log("No columns to drop.")

        if missing_cols:
            self._log(f"Skipped (not found): {', '.join(missing_cols)}")

    def apply_dtype_mapping(self, mapping: dict = None):
        """
        Apply data type conversions to columns as specified in the mapping.

        Parameters:
        - mapping: Dictionary where keys are column names and values are target data types.
                   Supported types: 'datetime', 'numeric', 'str', 'bool', or any valid numpy/pandas dtype.
        """
        self._log(f"\n=== {self.name} — Applying Type Mappings ===")
        if mapping is None:
            self._log("No mapping provided.")
            return

        converted = []
        failed = []

        # Only keep columns that exist in the DataFrame
        valid_mapping = {col: typ for col, typ in mapping.items() if col in self.df.columns}
        skipped = [col for col in mapping if col not in self.df.columns]

        # Convert all datetime columns in bulk
        datetime_cols = [col for col, typ in valid_mapping.items() if typ == 'datetime']
        if datetime_cols:
            try:
                self.df[datetime_cols] = self.df[datetime_cols].apply(
                    pd.to_datetime, errors='coerce', utc=True
                )
                converted.extend([(col, 'datetime') for col in datetime_cols])
            except Exception:
                failed.extend(datetime_cols)

        # Convert all 'str' and 'bool' columns using bulk astype
        astype_map = {col: typ for col, typ in valid_mapping.items() if typ in ['str', 'bool']}
        if astype_map:
            try:
                self.df = self.df.astype(astype_map)
                converted.extend(astype_map.items())
            except Exception:
                failed.extend(astype_map.keys())

        # Handle other types individually (e.g. 'numeric', custom dtypes)
        for col, typ in valid_mapping.items():
            if typ in ['datetime', 'str', 'bool']:
                continue  # Already handled
            try:
                if typ == 'numeric':
                    self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
                else:
                    self.df[col] = self.df[col].astype(typ)
                converted.append((col, typ))
            except Exception:
                failed.append(col)

        # Logging results
        if converted:
            self._log("Converted columns: " + ", ".join(f"{col}: {typ}" for col, typ in converted))
        if skipped:
            self._log(f"Skipped (not in DataFrame): {', '.join(skipped)}")
        if failed:
            self._log(f"Failed to convert: {', '.join(failed)}")

    def rename_columns(self, rename_map: dict):
        """
        Rename columns in the DataFrame using a provided mapping.

        Parameters:
        - rename_map: Dictionary mapping old column names to new names.
        """
        self._log(f"\n=== {self.name} — Renaming Columns ===")
        existing = {k: v for k, v in rename_map.items() if k in self.df.columns}
        missing = [k for k in rename_map if k not in self.df.columns]

        if existing:
            self.df.rename(columns=existing, inplace=True)
            self._log("Renamed columns: " + ", ".join(f"{k} -> {v}" for k, v in existing.items()))
        else:
            self._log("No columns were renamed.")

        if missing:
            self._log(f"Skipped (not found): {', '.join(missing)}")

    def normalize_nones(self):
        """
        Replace string values 'None' and 'null' (as text) with pandas NA (missing values).
        """
        self._log(f"\n=== {self.name} — Replacing 'None'/'null' strings with NaN ===")
        self.df.replace(["None", "null"], pd.NA, inplace=True)

    def get_cleaned_df(self):
        """
        Return the cleaned DataFrame.

        Returns:
        - pandas DataFrame after all applied transformations.
        """
        return self.df