Note: The code for this project cannot be shared publicly due to confidentiality agreements.
A four-stage pipeline on AWS — ingest → store → analyze → deploy-as-code.
Uses S3, Lambda, SQS, EventBridge, IAM, and CDK (Python). Mirrors real-world data pipeline flows for scalability and easy maintenance.
- A. Local Jupyter Notebook (Python CDK)
- B. AWS CloudShell (Python CDK)
- C. GitHub Actions CI/CD (automated deploys — in progress)
- One Lambda ingests data directly from the BLS and DataUSA APIs.
- An S3 bucket stores raw and processed outputs.
- Another Lambda joins the datasets, applies hashing for integrity/deduplication, and generates summary reports.
- EventBridge triggers the ingest Lambda on a daily schedule.
- When a new file lands in S3, the bucket sends a notification to SQS.
- The queue holds the event until the report Lambda processes it (see the CDK sketch below).
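Since the project code is private, here is a minimal CDK v2 sketch of that wiring; the construct names, runtime, and asset path are placeholders, not the project's actual values.

```python
from aws_cdk import Duration, Stack
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_s3 as s3
from aws_cdk import aws_s3_notifications as s3n
from aws_cdk import aws_sqs as sqs
from aws_cdk.aws_lambda_event_sources import SqsEventSource
from constructs import Construct


class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        bucket = s3.Bucket(self, "DataBucket")
        queue = sqs.Queue(self, "ReportQueue", visibility_timeout=Duration.minutes(5))

        ingest_fn = _lambda.Function(
            self, "IngestFn",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="ingest.handler",
            code=_lambda.Code.from_asset("lambda"),  # placeholder asset path
        )
        report_fn = _lambda.Function(
            self, "ReportFn",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="report.handler",
            code=_lambda.Code.from_asset("lambda"),
        )

        # Daily EventBridge schedule triggers the ingest Lambda.
        events.Rule(
            self, "DailyIngest",
            schedule=events.Schedule.rate(Duration.days(1)),
            targets=[targets.LambdaFunction(ingest_fn)],
        )

        # New objects in the bucket send a notification to SQS,
        # and the queue drives the report Lambda.
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED, s3n.SqsDestination(queue)
        )
        report_fn.add_event_source(SqsEventSource(queue))

        # Scoped-down access: each function only touches what it needs.
        bucket.grant_put(ingest_fn)
        bucket.grant_read_write(report_fn)
```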
Goal: Deploy fast and debug faster.
- Pipeline: Git push → Actions → Build/Test → AWS
View CI/CD workflows (in progress)
Uses the BLS Public API to fetch productivity & inflation data and bulk files.
- Uses a compliant User-Agent header and file hash checks to skip unchanged data (sketched below)
- Stores JSON results in Amazon S3
- Enhanced Sync version keeps S3 current, automatically adding, updating, and deleting objects
- View Notebook – Enhanced Sync Version
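Since the notebook itself can't be shared, here is an illustrative sketch of the fetch-and-skip logic; the bucket name and User-Agent contact string are placeholders.

```python
import hashlib

import boto3
import requests

BUCKET = "my-bls-raw-bucket"  # placeholder bucket name
# BLS asks API clients to identify themselves; the contact address is a placeholder.
HEADERS = {"User-Agent": "bls-pipeline/1.0 (contact: you@example.com)"}

s3 = boto3.client("s3")


def fetch_series(series_id: str) -> bytes:
    """Fetch one series from the BLS Public API as raw JSON bytes."""
    resp = requests.get(
        f"https://api.bls.gov/publicAPI/v2/timeseries/data/{series_id}",
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content


def upload_if_changed(series_id: str) -> bool:
    """Upload to S3 only when the content hash differs from the stored copy."""
    body = fetch_series(series_id)
    digest = hashlib.sha256(body).hexdigest()
    key = f"raw/{series_id}.json"
    try:
        head = s3.head_object(Bucket=BUCKET, Key=key)
        if head["Metadata"].get("sha256") == digest:
            return False  # unchanged; skip the write
    except s3.exceptions.ClientError:
        pass  # object doesn't exist yet; fall through to upload
    s3.put_object(Bucket=BUCKET, Key=key, Body=body, Metadata={"sha256": digest})
    return True
```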
Automates pulling API data from BLS and dropping JSON into S3 on a monthly schedule using AWS Lambda and Amazon EventBridge (a minimal handler is sketched below).
Acts as a bridge between Part 1 and Part 3 data analysis.
View Script
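A minimal sketch of what the scheduled handler could look like; the environment variables and default series ID are assumptions, not the project's real configuration.

```python
import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")
USER_AGENT = "bls-pipeline/1.0 (contact: you@example.com)"  # placeholder contact


def handler(event, context):
    """Invoked by the EventBridge schedule; writes fresh BLS JSON to S3."""
    bucket = os.environ["BUCKET_NAME"]  # assumed to be set by the stack
    series_ids = os.environ.get("SERIES_IDS", "CUUR0000SA0").split(",")
    for series_id in series_ids:
        url = f"https://api.bls.gov/publicAPI/v2/timeseries/data/{series_id}"
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=30) as resp:
            payload = json.load(resp)
        s3.put_object(
            Bucket=bucket,
            Key=f"raw/{series_id}.json",
            Body=json.dumps(payload).encode("utf-8"),
        )
    return {"written": len(series_ids)}
```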
Loads data from S3 into Pandas inside a Jupyter notebook, where it's cleaned, merged, and transformed before producing summary reports (see the sketch below).
Enhanced Sync Version – In Progress
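A rough sketch of that load-clean-merge flow, assuming hypothetical object keys (raw/bls.json, raw/datausa.json) and a shared year column; the real schema isn't public.

```python
import json

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "my-bls-raw-bucket"  # placeholder


def load_json_df(key: str) -> pd.DataFrame:
    """Read a JSON object from S3 into a flat DataFrame."""
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return pd.json_normalize(json.load(obj["Body"]))


bls = load_json_df("raw/bls.json")      # assumed key
pop = load_json_df("raw/datausa.json")  # assumed key

# Clean the join column, merge the two sources, and summarize.
bls["year"] = bls["year"].astype(int)
pop["year"] = pop["year"].astype(int)
merged = bls.merge(pop, on="year", how="inner")
report = merged.groupby("year", as_index=False).agg({"value": "mean"})  # "value" is assumed
print(report.head())
```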
Automates the above steps. The SQS queue is actively mapped to two Lambda functions.
Both event source mappings are Enabled, confirming that the event-driven pipeline is live: S3 → SQS → Lambda (a quick check is shown below).
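One way to spot-check those mappings with boto3 (the function names below are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Expect State == "Enabled" for the SQS mapping on each function.
for fn in ["report-lambda", "sync-lambda"]:  # placeholder names
    resp = lambda_client.list_event_source_mappings(FunctionName=fn)
    for m in resp["EventSourceMappings"]:
        print(fn, m["EventSourceArn"], m["State"])
```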

Runs directly from a Jupyter Notebook with minimal or no CloudShell usage.
This approach is easier to iterate on, test, and document.
View Notebook
No local setup is required.
View Deployment Logs (sanitized)
The CloudFormation stack below shows a fully deployed AWS data pipeline:

- Amazon S3 — buckets for both raw and processed BLS datasets
- AWS Lambda — pulls API data and drops it into S3
- Amazon SQS — queue for event-driven report processing (Part 4)
- Amazon EventBridge — kicks off Lambda runs on a set schedule
- AWS IAM — scoped-down roles for Lambda, S3, and SQS access
- AWS CDK — spins up the stack (Lambda, S3, SQS) as code
- AWS Glue Data Catalog — keeps S3 datasets organized with schemas
- Amazon Athena — runs SQL queries directly on S3 data via the Glue catalog (see the sketch after this list)
- Secrets: GitHub and Kaggle secrets; AWS Secrets Manager
- Languages & SDKs: Python, Pandas, Boto3 (AWS SDK for Python)
- Sources: BLS Public API + bulk files; DataUSA API
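To round out the Athena piece, here is a hedged boto3 sketch that runs one SQL query over the Glue-cataloged S3 data; the database, table, and output location are placeholders.

```python
import time

import boto3

athena = boto3.client("athena")

# Kick off a query against the (placeholder) Glue database and table.
qid = athena.start_query_execution(
    QueryString="SELECT year, AVG(value) AS avg_value FROM bls_data GROUP BY year",
    QueryExecutionContext={"Database": "bls_catalog"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```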