This project aims to provide a production-ready data pipeline using MySQL databases and SQL-based transformations, building on our original AdventureWorks-inspired schema, the AdventureSales database. This document outlines how these databases fit into our updated ETL/ELT pipeline, describes the key tables, and explains how to configure and run the pipeline.
- Currently under development in our pipeline; no tables have been created yet.
- Future plans may include tables such as `SalesOrder`, `SalesInvoice`, or others depending on business requirements.
- Tables:
  - Customers: Holds customer data such as names, addresses, and contact information.
  - SalesOrderDetail: Represents line items for sales orders (product, quantity, price, etc.).
  - SalesTerritory: Defines geographical or market segments (e.g., regions) to facilitate sales analytics.
  - PurchaseOrderHeader: Stores high-level information about each purchase order (e.g., vendor, order date, status).
  - PurchaseOrderDetail: Contains line-item details for each purchase order (e.g., product, quantity, cost).
  - ShipMethod: Enumerates shipping/carrier methods for orders, enabling logistics and cost analysis.
- Views (if any):
  - Could aggregate or transform sales data for dashboards and reports.
- Data Ingestion (Extract):
  - We source raw data (e.g., CSV files, APIs) that map to each of the databases.
  - For example, vendor data feeds into `AdventurePurchase`, sales transactions feed into `AdventureSales`, and HR employee records flow into `HumanResources`.
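As a sketch of the extract step, the snippet below parses a raw vendor CSV feed into row dictionaries before routing them to the appropriate database. The feed contents and column names are hypothetical, not part of the repository:

```python
import csv
import io

def extract_rows(csv_text):
    """Parse a raw CSV feed (with a header row) into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical vendor feed destined for the AdventurePurchase database.
raw_feed = "VendorID,VendorName\n101,Contoso Supply\n102,Fabrikam Parts\n"
rows = extract_rows(raw_feed)
print(len(rows))                 # number of extracted records
print(rows[0]["VendorName"])
```

In a real run the same parsing would be applied to files under `data/` or to API responses before loading.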
- Data Loading (Load/ELT):
  - Data is either loaded directly into staging tables or appended to the final tables in each database.
  - Python scripts or SQL `LOAD DATA` commands (bulk inserts) may be used to populate these tables.
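A minimal sketch of the bulk-load step. The helper below only composes the `LOAD DATA LOCAL INFILE` statement (the table and path shown are placeholders); executing it would require a live MySQL connection through a driver such as mysql-connector-python, with `local_infile` enabled on the server:

```python
def build_load_statement(csv_path, table, database):
    """Compose a MySQL LOAD DATA statement for a CSV file with a header row."""
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}' "
        f"INTO TABLE {database}.{table} "
        "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES"
    )

sql = build_load_statement("data/ShipMethod.csv", "ShipMethod", "AdventureSales")
print(sql)
```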
- Transformations (Transform):
  - SQL transformations (e.g., in `sql/transformations.sql`) standardize data formats, apply business rules, and link tables across the different databases when needed.
  - For example, a transformation could merge region codes in `SalesTerritory` with shipping zones used in `AdventurePurchase`.
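To illustrate the kind of cross-table transformation described above, the sketch below joins region codes to shipping zones, with SQLite standing in for MySQL so it runs self-contained; the `ShippingZone` table and all column values are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical stand-ins for SalesTerritory and a shipping-zone lookup.
cur.execute("CREATE TABLE SalesTerritory (TerritoryID INTEGER, RegionCode TEXT)")
cur.execute("CREATE TABLE ShippingZone (RegionCode TEXT, Zone TEXT)")
cur.executemany("INSERT INTO SalesTerritory VALUES (?, ?)",
                [(1, "NA"), (2, "EU")])
cur.executemany("INSERT INTO ShippingZone VALUES (?, ?)",
                [("NA", "Zone-1"), ("EU", "Zone-2")])
# Merge region codes with shipping zones, as the pipeline transformation would.
cur.execute("""
    SELECT t.TerritoryID, t.RegionCode, z.Zone
    FROM SalesTerritory t
    JOIN ShippingZone z ON z.RegionCode = t.RegionCode
    ORDER BY t.TerritoryID
""")
merged = cur.fetchall()
print(merged)  # [(1, 'NA', 'Zone-1'), (2, 'EU', 'Zone-2')]
```

The equivalent MySQL statement would live in `sql/transformations.sql` and write into a reporting table rather than printing.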
- Analytics and Reporting:
  - Once the data is consistent and cleansed, it can be queried to produce dashboards, aggregated tables, or views.
  - Tools like Power BI, Tableau, or Python libraries can connect to these MySQL databases for deeper analysis.
- MySQL Server: Version 8.0 or later.
- Python: If using Python scripts for automation, version 3.7+ is recommended.
- MySQL Workbench or VSCode MySQL Shell Extension: Useful for running queries and reviewing schemas.
- Clone the Repository
git clone https://github.com/Gabeleo24/ADS-507/tree/main/MySQL%20Pipeline
- Set Up the MySQL Database
  - Download and install MySQL Server from the MySQL Downloads page.
  - Run the SQL script (`sql/create-databases.sql`) to create your production-ready schema.
  - Load the CSV data files for each table (located in the `data/` folder) using the provided bulk-load (`LOAD DATA`) commands or your preferred method.
- Configure the Pipeline
  - Edit the configuration file in the `config/` directory to set database credentials and file paths.
  - Run the ETL/ELT script:
    python src/pipeline.py
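The repository ships an example `config/db_config.yaml`, but its exact keys are not shown in this document, so the fragment below is a hypothetical illustration of the kind of settings it might hold; match the keys to what `pipeline.py` actually reads:

```yaml
# Hypothetical example configuration; adjust keys to the real pipeline.
database:
  host: localhost
  port: 3306
  user: pipeline_user
  password: change-me
  databases:
    sales: AdventureSales
    purchase: AdventurePurchase
paths:
  data_dir: data/
  sql_dir: sql/
```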
```
/MySQL Pipeline
├── js/
│   ├── Node.js
├── data/
│   ├── PurchaseOrderHeader.csv
│   ├── PurchaseOrderDetail.csv
│   ├── ShipMethod.csv
│   ├── Customers.csv
│   ├── SalesOrderDetail.csv
│   ├── SalesTerritory.csv
│   ├── Employee.csv
│   └── ...
├── docs/
├── sql/
│   ├── create-databases.sql
│   ├── transformations.sql
│   ├── ...
├── src/
│   └── pipeline.py
├── tests/
├── Dockerfile
├── README.md
└── config/
    └── db_config.yaml (example)
```
- Start the Database:
  - Start MySQL using your local installation or via Docker:
    docker run -d --name mysql-server -p 3306:3306 -e MYSQL_ROOT_PASSWORD=yourpassword mysql:latest
- Load the Data:
  - Run the SQL scripts to create the database schema and load the CSV data.
- Execute the Pipeline:
  - Navigate to the `src/` directory and run:
    python pipeline.py
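For orientation, `pipeline.py` presumably chains the extract, load, and transform steps described earlier. The skeleton below is a hypothetical sketch of that flow with stubbed data, not the repository's actual implementation:

```python
def extract():
    """Gather raw records from CSV files or APIs (stubbed here)."""
    return [{"RegionCode": "NA", "Amount": 120.0}]

def load(records):
    """Stage records; a real run would issue LOAD DATA or INSERT statements."""
    return list(records)

def transform(staged):
    """Apply business rules, e.g., tag each record with a shipping zone."""
    zones = {"NA": "Zone-1", "EU": "Zone-2"}
    return [dict(r, Zone=zones.get(r["RegionCode"], "Unknown")) for r in staged]

def run_pipeline():
    return transform(load(extract()))

result = run_pipeline()
print(result)
```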
- View Outputs:
  - Outputs such as visualizations or email triggers will be available based on the pipeline configuration.
- Logs: Check your logs folder (e.g., logs/) for pipeline execution statuses or errors.
- MySQL Monitoring: Tools such as MySQL Workbench, VS Code’s MySQL extension, or third-party dashboards (e.g., Percona Monitoring) can track server load, query performance, etc.
- Error Handling: Implement robust exception handling in your ETL scripts. If a load or transformation fails, the pipeline should log the error and optionally send notifications (e.g., via email or Slack).
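One way to sketch the suggested error handling: wrap each step so failures are logged and surfaced through a notification hook. The notifier here is a stub standing in for a real email or Slack integration:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def notify(message):
    """Stub notification hook; swap in email or Slack delivery as needed."""
    print(f"NOTIFY: {message}")

def run_step(name, step, *args):
    """Run one pipeline step, logging and notifying on failure."""
    try:
        return step(*args)
    except Exception as exc:
        log.error("Step %s failed: %s", name, exc)
        notify(f"Pipeline step {name} failed: {exc}")
        raise

def failing_load():
    raise RuntimeError("bad CSV row")

# Example: a failing load is logged and re-raised for the caller to handle.
try:
    run_step("load", failing_load)
except RuntimeError as exc:
    outcome = str(exc)
print(outcome)
```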
- Scalability:
  - Partition large tables or implement data archiving for older records.
- Cloud Integration:
  - Migrate the MySQL instances to AWS RDS, Azure Database for MySQL, or GCP Cloud SQL.
- CI/CD:
  - Automate schema changes and data loading via a continuous integration pipeline (e.g., GitHub Actions).
- Data Governance & Quality:
  - Implement validation rules and data quality checks within the pipeline.
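A minimal sketch of such a data-quality check, run against rows before they are loaded; the rule names, columns, and sample data are illustrative only:

```python
def validate_rows(rows):
    """Return a list of rule violations for purchase-order style rows."""
    errors = []
    for i, row in enumerate(rows):
        if not row.get("OrderID"):
            errors.append(f"row {i}: missing OrderID")
        if row.get("Quantity", 0) <= 0:
            errors.append(f"row {i}: non-positive Quantity")
    return errors

sample = [
    {"OrderID": "PO-1", "Quantity": 5},
    {"OrderID": "", "Quantity": 0},
]
issues = validate_rows(sample)
print(issues)
```

In the pipeline, a non-empty result would feed the error-handling and notification path rather than being printed.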
This project is licensed under the MIT License.
- Contributor 1: Duy Nguyen
- Contributor 2: Jorge Roldan
- Contributor 3: Gabriel Gallardo