Multinational Retail Data Centralisation

An ETL pipeline designed to streamline and centralize data processing operations across various retail locations of a multinational company, enabling efficient and robust data analysis for strategic decision-making.

Skills and Technologies

This project demanded a wide array of skills and a strong command over various technologies, including but not limited to:

Python Programming: Comprehensive use of Python for data manipulation and pipeline creation.
Data Extraction & Transformation: Proficiency in extracting data from diverse sources and transforming it for analytical readiness.
Data Cleaning: In-depth understanding and application of data cleaning techniques to ensure the integrity of data.
PostgreSQL: Extensive use of PostgreSQL for data storage and complex SQL queries for data retrieval.
Pandas & NumPy: Employed these powerful Python libraries for data analysis and manipulation tasks.
Jupyter Notebooks: Made use of Jupyter Notebooks for iterative coding, debugging and initial data exploration.
Version Control: Managed code and changes using Git, demonstrating best practices in continuous integration and deployment.
Database Design: Skills in conceptualizing and implementing database schemas and relationships.
Object-Oriented Programming (OOP): Applied OOP concepts for creating maintainable and reusable code.
Virtual Environments: Used virtual environments for managing dependencies and ensuring consistent project setups.

Through this project, I have demonstrated my ability to learn and adapt to various technologies quickly, showing my commitment to personal and professional growth.

Introduction

This repository contains the implementation of an ETL pipeline for consolidating sales data across multiple geographic locations of a multinational retail corporation. The pipeline is responsible for extracting data from various sources, cleaning and transforming the data, and loading it into a centralized PostgreSQL database to support business intelligence and analytics.

Installation

Instructions on setting up the environment and installing the necessary dependencies for running the ETL pipeline.

# Clone the repository
git clone https://github.com/ASEIcode/multinational-retail-data-centralisation.git

# Navigate to the repository directory
cd multinational-retail-data-centralisation

# Set up a Python virtual environment (optional but recommended)
python -m venv venv  # the second "venv" is the directory where the environment is created
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

# Install the required packages
pip install -r requirements.txt

Usage

main.ipynb is a notebook containing all of the code blocks to run the ETL pipeline. There are also markdown comments at each critical stage to help you keep track of what each script does.

Configuration

Before running the pipeline in main.ipynb, you'll need two YAML configuration files:

sales_data_db_creds.yaml

This contains the credentials to access your PostgreSQL database.

 HOST: hostname
 PASSWORD: password
 USER: username
 DATABASE: sales_data
 PORT: port number

db_creds.yaml

This contains the credentials to access your Amazon RDS database, used by the data extractor methods that read from AWS RDS.
```
 RDS_HOST: host address
 RDS_PASSWORD: xxxxx
 RDS_USER: username
 RDS_DATABASE: database name
 RDS_PORT: port number
```
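
As a rough illustration of how these two files might be consumed, the sketch below reads a flat YAML file and builds a PostgreSQL connection URL from it. The function names `load_creds` and `build_postgres_url` are hypothetical and are not taken from this repository, and the use of PyYAML for reading the file is an assumption. The `prefix` argument is only there so the same helper can handle both the plain keys (`HOST`, `USER`, ...) and the `RDS_`-prefixed keys shown above.

```python
from urllib.parse import quote_plus


def load_creds(path: str) -> dict:
    """Read a flat YAML credentials file into a dict (assumes PyYAML is installed)."""
    import yaml  # third-party package: pip install pyyaml

    with open(path, "r", encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def build_postgres_url(creds: dict, prefix: str = "") -> str:
    """Build a postgresql:// URL from keys such as HOST or RDS_HOST.

    `prefix` is a hypothetical convenience: pass "RDS_" for db_creds.yaml
    and leave it empty for sales_data_db_creds.yaml.
    """
    # Percent-encode user and password so characters like "@" or ":" are safe.
    user = quote_plus(str(creds[f"{prefix}USER"]))
    password = quote_plus(str(creds[f"{prefix}PASSWORD"]))
    host = creds[f"{prefix}HOST"]
    port = creds[f"{prefix}PORT"]
    database = creds[f"{prefix}DATABASE"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

For example, `build_postgres_url(load_creds("db_creds.yaml"), prefix="RDS_")` would produce a URL of the form `postgresql://user:password@host:port/database`, which libraries such as SQLAlchemy accept when creating an engine.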

SQL Files

The SQL folder contains two files:

alter_tables_dtypes_add_keys.sql: Contains all the SQL queries used to create the database schema (changing data types and making any other changes needed before adding primary and foreign keys to the tables). Run this file first.
business_questions_queries.sql: A collection of business insight queries that could be used to make better data-driven decisions.

Known issues and Future improvements

Phone numbers appear in many different formats across the tables. Regex could be used to clean and standardise them.
Try / except blocks could catch errors more elegantly in the extraction and upload classes.
has_numbers and has_alpha are used more than once and could be written into the class as methods to avoid duplication.
The retrieve_stores_data method in the DataExtractor class currently has to make 451 consecutive requests and append each result to a list. This takes a long time. Parallel processing could be employed here to make it more efficient.
Currently each cleaning method is written specifically to fit the dummy data. To clean future data from similar sources automatically, rather than under an engineer's manual supervision, the cleaning would need extra logic to handle a wider variety of possible errors.
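
Some of the improvements above could look roughly like the sketch below. It is not the code in this repository: the bodies of `has_numbers` and `has_alpha` are guesses at what helpers with those names do, `standardise_phone` and `fetch_all` are hypothetical names, and the phone format chosen (digits only, with an optional leading `+` and any `(0)` dropped) is an assumption about what "standardised" should mean. `fetch_one` stands in for whatever single-store request retrieve_stores_data makes today.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def has_numbers(text: str) -> bool:
    """Return True if the string contains at least one digit (assumed behaviour)."""
    return any(ch.isdigit() for ch in text)


def has_alpha(text: str) -> bool:
    """Return True if the string contains at least one letter (assumed behaviour)."""
    return any(ch.isalpha() for ch in text)


def standardise_phone(raw: str) -> str:
    """Reduce a phone number to digits, keeping a leading "+" if present."""
    text = raw.strip().replace("(0)", "")  # drop the optional trunk "(0)"
    prefix = "+" if text.startswith("+") else ""
    return prefix + re.sub(r"\D", "", text)  # remove every non-digit character


def fetch_all(
    fetch_one: Callable[[int], T],
    store_numbers: Iterable[int],
    max_workers: int = 8,
) -> List[T]:
    """Call fetch_one for each store number on a thread pool.

    Executor.map returns results in input order, so the output list lines up
    with store_numbers even though the requests overlap in time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, store_numbers))
```

Threads suit this case because the work is dominated by waiting on network responses. A sensible `max_workers` value depends on any rate limits the API enforces, which are not documented here.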

Contact

Adam Evans

Email: adamevansjs@gmail.com
