Retail Revenue Analytics Mart (dbt + DuckDB)

📌 Overview

This project builds a production-style analytics mart from raw transactional retail data using dbt.

Using the UCI Online Retail dataset (2010–2011 UK e-commerce transactions), I transform raw invoice line items into a clean, tested star schema and generate revenue reporting models with built-in data quality enforcement.

This project demonstrates how modern Analytics Engineers:

  • Separate raw ingestion from transformation logic
  • Define clear model grain and data contracts
  • Implement automated data quality checks
  • Build documented, reproducible analytics pipelines
  • Design warehouse-portable dbt projects (DuckDB → Snowflake)

🎯 Business Context

Retail transactional data often contains:

  • Returns and cancellations
  • Negative quantities
  • Inconsistent identifiers
  • Revenue miscalculations due to unhandled edge cases

Without proper modeling and validation, downstream dashboards and financial reporting can be incorrect.

This project solves that by:

  • Standardizing raw data in a staging layer
  • Modeling clear fact and dimension tables
  • Explicitly handling cancellation logic
  • Enforcing integrity via dbt tests
  • Producing trustworthy revenue reporting outputs

🏗 Data Model

Raw Layer (Seed)

  • raw.online_retail
  • Grain: one row per invoice line item

Staging Layer

  • stg_online_retail
  • Typed and standardized fields
  • Derived fields:
    • invoice_date
    • line_revenue
    • is_cancellation (business rule logic)
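A sketch of how the staging model might derive these fields (the model name and raw column names are assumptions based on the UCI dataset, not verbatim project code):

```sql
-- models/staging/stg_online_retail.sql (illustrative sketch; names assumed)
with source as (
    select * from {{ ref('online_retail') }}  -- the seeded raw table
)

select
    cast("InvoiceNo"   as varchar)         as invoice_id,
    cast("StockCode"   as varchar)         as product_id,
    cast("CustomerID"  as varchar)         as customer_id,
    cast("InvoiceDate" as timestamp)       as invoice_date,
    cast("Quantity"    as integer)         as quantity,
    cast("UnitPrice"   as decimal(10, 2))  as unit_price,
    cast("Quantity"    as integer)
      * cast("UnitPrice" as decimal(10, 2)) as line_revenue,
    -- business rule: cancelled invoices are prefixed with "C" in the source data
    case
        when cast("InvoiceNo" as varchar) like 'C%' then true
        else false
    end as is_cancellation
from source
```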

Mart Layer (Star Schema)

Dimensions

  • dim_customers — one row per customer
  • dim_products — one row per product (StockCode)
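For illustration, a dimension at this grain could be built by deduplicating the staging model (the description column is an assumption from the dataset, and the SQL is a sketch rather than the project's actual model):

```sql
-- models/marts/dim_products.sql (illustrative sketch)
select
    product_id,
    max(description) as product_description  -- collapse to one row per StockCode
from {{ ref('stg_online_retail') }}
where product_id is not null
group by product_id
```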

Fact

  • fct_order_items — one row per invoice line item
    • Revenue cleaned and business-validated

Reporting

  • rpt_daily_revenue — aggregated daily revenue, excluding cancellation rows
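The reporting model could then be a straightforward aggregate over the fact table, filtered on the cancellation flag (a sketch, assuming the column names introduced in the staging layer):

```sql
-- models/marts/rpt_daily_revenue.sql (illustrative sketch)
select
    cast(invoice_date as date)  as revenue_date,
    count(distinct invoice_id)  as order_count,
    sum(line_revenue)           as daily_revenue
from {{ ref('fct_order_items') }}
where not is_cancellation
group by 1
order by 1
```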

🛠 Tech Stack

  • dbt Core
  • DuckDB (local analytics warehouse)
  • Snowflake-ready configuration
  • SQL
  • YAML (schema + test configuration)
  • Git / GitHub

Dataset Source: UCI ML Repository – Online Retail

🚀 How to Run Locally

1️⃣ Clone repository

git clone <your-repo-url>
cd retail_analytics_dbt

2️⃣ Create virtual environment

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install dbt-core dbt-duckdb

3️⃣ Configure dbt profile

Create ~/.dbt/profiles.yml:

retail_analytics_dbt:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "retail.duckdb"
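Because dbt resolves the warehouse through the profile, the Snowflake migration path amounts to adding a second target to the same profile; everything under `prod:` below is a placeholder, not real configuration from this project:

```yaml
retail_analytics_dbt:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "retail.duckdb"
    prod:                          # hypothetical Snowflake target
      type: snowflake
      account: "<your-account>"
      user: "<your-user>"
      password: "<your-password>"
      database: "ANALYTICS"
      warehouse: "TRANSFORMING"
      schema: "RETAIL"
```

Running `dbt build --target prod` would then execute the same models against Snowflake.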

4️⃣ Verify the connection

dbt debug

5️⃣ Build the full pipeline

dbt seed
dbt build

6️⃣ View documentation & lineage

dbt docs generate
dbt docs serve

✅ Data Quality & Testing Strategy

This project treats data tests as contracts, not afterthoughts.

Generic Tests

  • unique and not_null on dimension keys
  • relationships tests between fact and dimensions
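In dbt, these generic tests live in a schema YAML file; a sketch of what the relevant entries might look like (model and column names assumed):

```yaml
# models/marts/schema.yml (illustrative sketch)
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: fct_order_items
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```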

Business Logic Test

Custom SQL test:

  • Ensures non-cancellation revenue is never negative
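A singular dbt test expresses this rule as a query that returns the violating rows; dbt fails the test if any rows come back. A sketch, with the file name and columns assumed:

```sql
-- tests/assert_non_cancellation_revenue_not_negative.sql (illustrative sketch)
select *
from {{ ref('fct_order_items') }}
where not is_cancellation
  and line_revenue < 0
```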

Cancellation Handling Logic

The dataset contains returns that appear as:

  • Invoice numbers beginning with “C”
  • Negative quantities
  • Negative revenue patterns

The staging layer introduces a derived is_cancellation flag to ensure:

  • Revenue reporting excludes returns
  • Downstream aggregates are financially accurate

📊 Results & Impact

✔ Built a fully reproducible analytics mart from raw retail data
✔ Enforced primary and foreign key integrity
✔ Identified and corrected dataset-specific revenue edge cases
✔ Designed project for seamless warehouse portability
✔ Full dataset (~540k rows) builds locally in under 1 second

This pipeline ensures revenue metrics are reliable before reaching BI dashboards or executive reporting.

🧠 Key Learnings

  • Clearly defining model grain prevents metric inflation
  • Data tests should reflect business logic, not just schema constraints
  • Cancellation logic must be explicitly modeled in retail datasets
  • Warehouse abstraction in dbt enables seamless platform migration
  • Lineage documentation improves transparency and trust

📌 Why This Project Matters

This is not a dashboard exercise.

It demonstrates:

  • Production-style analytics engineering practices
  • Version-controlled transformation logic
  • Data contract enforcement
  • Business-aware modeling
  • Reproducible local development
  • Clean migration path to an enterprise warehouse

👤 Author

Anju Mercian
Platform Engineer | Data Platform | Modern Data Stack
