A Python-based ETL data pipeline built to analyze Walmart's e-commerce supply and demand patterns around public holidays. This project merges sales data from a PostgreSQL database with complementary feature data, transforms and aggregates it, and exports clean CSV outputs.
Walmart is the largest retailer in the United States. By the end of 2022, e-commerce represented $80 billion in sales, 13% of total company revenue. Public holidays such as the Super Bowl, Labor Day, Thanksgiving, and Christmas significantly affect weekly sales.
This pipeline was built to support analysis of those holiday-driven patterns.
```
walmart-data-pipeline/
│
├── notebook.ipynb        # Main Jupyter notebook (10 cells)
├── extra_data.parquet    # Complementary features dataset
├── clean_data.csv        # Output: cleaned & transformed data
├── agg_data.csv          # Output: average monthly sales
└── README.md
```
The `grocery_sales` PostgreSQL table:

| Column | Description |
|---|---|
| index | Unique row ID |
| Store_ID | Store number |
| Date | Week of sales |
| Weekly_Sales | Sales for that store/week |
The `extra_data.parquet` complementary features:

| Column | Description |
|---|---|
| IsHoliday | 1 if the week contains a public holiday, 0 if not |
| Temperature | Temperature on the day of sale |
| Fuel_Price | Cost of fuel in the region |
| CPI | Consumer Price Index |
| Unemployment | Prevailing unemployment rate |
| MarkDown1–4 | Number of promotional markdowns |
| Dept | Department number in each store |
| Size | Size of the store |
| Type | Type of store (based on Size) |
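Both inputs end up as pandas DataFrames. The `grocery_sales` table must first be fetched from the database; the sketch below shows the general `pd.read_sql` pattern, with an in-memory SQLite database standing in for the real PostgreSQL instance so the example is self-contained (the stand-in rows are made up for illustration):

```python
import sqlite3

import pandas as pd

# In the project the connection would be a PostgreSQL one (e.g. via
# psycopg2); an in-memory SQLite database stands in here so the
# sketch runs anywhere.
conn = sqlite3.connect(":memory:")

# Tiny stand-in for the real grocery_sales table
pd.DataFrame({
    "index": [0, 1],
    "Store_ID": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Weekly_Sales": [24924.50, 46039.49],
}).to_sql("grocery_sales", conn, index=False)

# The same query the extraction step below relies on
grocery_sales = pd.read_sql("SELECT * FROM grocery_sales", conn)
print(grocery_sales.shape)  # (2, 4)
```

Swapping the SQLite connection for a PostgreSQL one leaves the `read_sql` call unchanged, which is why the notebook can treat extraction as a single query.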
The pipeline is implemented across 10 notebook cells:
Fetches all data from the `grocery_sales` PostgreSQL table:

```sql
SELECT * FROM grocery_sales
```

Reads `extra_data.parquet` and merges it with `grocery_sales` on the `index` column:

```python
def extract(store_data, extra_data):
    extra_df = pd.read_parquet(extra_data)
    merged_df = store_data.merge(extra_df, on="index")
    return merged_df

merged_df = extract(grocery_sales, "extra_data.parquet")
```

Cleans and reshapes the merged data:
- Fills missing numerical values with the column median
- Adds a `Month` column extracted from the `Date` field
- Keeps only rows where `Weekly_Sales > 10,000`
- Selects only the 7 required columns
```python
def transform(raw_data):
    raw_data.fillna(raw_data.select_dtypes(include='number').median(), inplace=True)
    raw_data["Month"] = pd.to_datetime(raw_data["Date"]).dt.month
    raw_data = raw_data[raw_data["Weekly_Sales"] > 10000]
    raw_data = raw_data[["Store_ID", "Month", "Dept", "IsHoliday",
                         "Weekly_Sales", "CPI", "Unemployment"]]
    return raw_data

clean_data = transform(merged_df)
```

Calculates average weekly sales per calendar month using a pandas method chain:
```python
def avg_weekly_sales_per_month(clean_data):
    agg_data = (clean_data[["Month", "Weekly_Sales"]]
                .groupby("Month")
                .agg(Avg_Sales=("Weekly_Sales", "mean"))
                .reset_index()
                .round(2))
    return agg_data

agg_data = avg_weekly_sales_per_month(clean_data)
```

Saves both DataFrames as CSV files without the row index:
```python
def load(full_data, full_data_file_path, agg_data, agg_data_file_path):
    full_data.to_csv(full_data_file_path, index=False)
    agg_data.to_csv(agg_data_file_path, index=False)

load(clean_data, "clean_data.csv", agg_data, "agg_data.csv")
```

Verifies that the output files exist in the current working directory:
```python
import os

def validation(file_path):
    return os.path.exists(file_path)

validation("clean_data.csv")  # Expected: True
validation("agg_data.csv")    # Expected: True
```

The `clean_data.csv` output schema:

| Column | Description |
|---|---|
| Store_ID | Unique store identifier |
| Month | Calendar month (1–12) |
| Dept | Department number |
| IsHoliday | Holiday week flag |
| Weekly_Sales | Weekly sales (filtered > $10,000) |
| CPI | Consumer Price Index |
| Unemployment | Unemployment rate |
The `agg_data.csv` output schema:

| Column | Description |
|---|---|
| Month | Calendar month (1–12) |
| Avg_Sales | Average weekly sales for that month (2 decimal places) |
Sample output:
| Month | Avg_Sales |
|---|---|
| 1.0 | 33174.18 |
| 2.0 | 34333.33 |
| ... | ... |
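The aggregation's method chain can be exercised on a toy DataFrame to show how `agg_data` acquires this shape (the figures below are made up for illustration, not real Walmart numbers):

```python
import pandas as pd

# Made-up weekly sales spanning two months
toy = pd.DataFrame({
    "Month": [1, 1, 2],
    "Weekly_Sales": [30000.0, 36000.0, 34333.333],
})

# Same chain as avg_weekly_sales_per_month: group, average, flatten, round
agg = (toy[["Month", "Weekly_Sales"]]
       .groupby("Month")
       .agg(Avg_Sales=("Weekly_Sales", "mean"))
       .reset_index()
       .round(2))

# Month 1 averages to 33000.0, Month 2 rounds to 34333.33
print(agg)
```

The named-aggregation form `agg(Avg_Sales=("Weekly_Sales", "mean"))` is what gives the output column its `Avg_Sales` label without a separate rename step.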
| Tool | Purpose |
|---|---|
| Python 3 | Core programming language |
| Pandas | Data manipulation and transformation |
| SQL / PostgreSQL | Source data extraction |
| Parquet | Efficient columnar data storage |
| os | File system validation |
| Jupyter Notebook | Interactive development environment |
1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/walmart-data-pipeline.git
   cd walmart-data-pipeline
   ```

2. Install dependencies:

   ```bash
   pip install pandas pyarrow psycopg2-binary jupyter
   ```

3. Open the notebook:

   ```bash
   jupyter notebook notebook.ipynb
   ```

4. Run all cells in order (Cell 1 → Cell 10).

5. Check the outputs: `clean_data.csv` and `agg_data.csv`.
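Beyond checking that the files exist, the outputs can be validated against the schemas documented above. A small sketch; the `check_agg_schema` helper is illustrative, not part of the notebook:

```python
import pandas as pd

def check_agg_schema(df: pd.DataFrame) -> bool:
    """Validate the documented agg_data shape: two columns, Month 1-12, 2-dp averages."""
    if list(df.columns) != ["Month", "Avg_Sales"]:
        return False
    months_ok = df["Month"].between(1, 12).all()
    rounded_ok = (df["Avg_Sales"] == df["Avg_Sales"].round(2)).all()
    return bool(months_ok and rounded_ok)

# In the project this would run against the real file:
#   check_agg_schema(pd.read_csv("agg_data.csv"))
demo = pd.DataFrame({"Month": [1.0, 2.0], "Avg_Sales": [33174.18, 34333.33]})
print(check_agg_schema(demo))  # True
```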
This project was completed as part of a DataCamp data engineering learning path.