
AWS Data Engineering Pipeline (Four-Part)

Note: The code for this project cannot be shared publicly due to confidentiality agreements.


A four-stage pipeline on AWS — ingest → store → analyze → deploy-as-code.
Uses S3, Lambda, SQS, EventBridge, IAM, and CDK (Python). Mirrors real-world data pipeline flows for scalability and easy maintenance.

Three Deployment Methods

  • A. Local Jupyter Notebook (Python CDK)
  • B. AWS CloudShell (Python CDK)
  • C. GitHub Actions CI/CD (automated deploys — in progress)

Pipeline Overview:

  • One Lambda ingests data directly from the BLS and DataUSA APIs.
  • An S3 bucket stores raw and processed outputs.
  • Another Lambda joins the datasets, applies hashing for integrity/deduplication, and generates summary reports.
  • EventBridge triggers the ingest Lambda on a daily schedule.
  • When a new file lands in S3, the bucket sends a notification to SQS.
  • The queue holds the event until the report Lambda processes it (see the handler sketch after this list).
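
The project's code is private, so the following is only a minimal sketch of how a report Lambda might unpack an S3 event delivered through SQS. Note that the SQS message body wraps the original S3 notification as a JSON string; everything else here is illustrative:

```python
import json

def handler(event, context):
    """Report Lambda entry point: consumes S3 object-created events via SQS."""
    for sqs_record in event["Records"]:            # one entry per SQS message
        s3_event = json.loads(sqs_record["body"])  # the body wraps the original S3 event
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            print(f"New object landed: s3://{bucket}/{key}")
            # ...load the file, join the datasets, and write the summary report
```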

CI/CD with GitHub Actions

Goal: Deploy fast and debug faster.


1. API Data from BLS → AWS S3

Uses the BLS public API to fetch productivity and inflation data, plus bulk files.

  • Uses a compliant User-Agent and file-hash checks to skip unchanged data (sketched after this list)
  • Stores JSON results in Amazon S3
  • Enhanced Sync version keeps S3 updated—adds, updates, and deletes automatically
  • View Notebook – Enhanced Sync Version
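
Since the actual notebook is private, here is a rough sketch of the hash-check-and-upload pattern described above. The bucket name, metadata scheme, and contact address are assumptions:

```python
import hashlib

import boto3
import requests
from botocore.exceptions import ClientError

S3_BUCKET = "bls-raw-data"  # hypothetical bucket name
HEADERS = {"User-Agent": "data-pipeline/1.0 (you@example.com)"}  # BLS asks callers to identify themselves

s3 = boto3.client("s3")

def fetch_and_store(url: str, key: str) -> bool:
    """Download a BLS file and upload it to S3 only if its content hash changed."""
    body = requests.get(url, headers=HEADERS, timeout=30).content
    digest = hashlib.sha256(body).hexdigest()

    # Compare against the hash stored as object metadata on the previous upload
    try:
        head = s3.head_object(Bucket=S3_BUCKET, Key=key)
        if head["Metadata"].get("sha256") == digest:
            return False  # unchanged upstream, skip the write
    except ClientError:
        pass  # first run: the object does not exist yet

    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=body, Metadata={"sha256": digest})
    return True
```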

2. API Request via AWS Lambda → S3

Automates pulling API data from BLS and dropping JSON into S3 on a monthly schedule using AWS Lambda and Amazon EventBridge. Acts as a bridge between Part 1 ingestion and Part 3 data analysis (a minimal handler sketch appears below).
View Script
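
A hypothetical shape for such a handler, not the project's actual script. The environment variable name and BLS series ID are illustrative (PRS85006092 is a public BLS productivity series):

```python
import json
import os
from datetime import datetime, timezone

import boto3
import urllib3  # bundled with the Lambda Python runtime

http = urllib3.PoolManager()
s3 = boto3.client("s3")

BLS_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"
BUCKET = os.environ["RAW_BUCKET"]  # hypothetical env var, set by the stack

def handler(event, context):
    """Fires on the EventBridge schedule: pull BLS series, drop raw JSON into S3."""
    payload = {"seriesid": ["PRS85006092"], "startyear": "2023", "endyear": "2025"}
    resp = http.request(
        "POST", BLS_URL,
        body=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    stamp = datetime.now(timezone.utc).strftime("%Y-%m")
    s3.put_object(Bucket=BUCKET, Key=f"raw/bls/{stamp}.json", Body=resp.data)
    return {"status": resp.status}
```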

3. Data Processing and Analysis

Loads data from S3 into a Pandas notebook where it’s cleaned, merged, and transformed before producing summary reports.

Enhanced Sync Version - in progress
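
A rough pandas sketch of the load-clean-merge step described above. The object keys and column names ("year", "value") are assumptions, since the real notebook is private:

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "bls-raw-data"  # hypothetical

# Read both raw JSON drops straight out of S3
bls = pd.read_json(s3.get_object(Bucket=BUCKET, Key="raw/bls/latest.json")["Body"])
pop = pd.read_json(s3.get_object(Bucket=BUCKET, Key="raw/datausa/latest.json")["Body"])

# Clean, merge on a shared year column, and summarize
bls["year"] = bls["year"].astype(int)
merged = bls.merge(pop, on="year", how="inner")
report = merged.groupby("year")["value"].agg(["mean", "std"]).round(3)
print(report)
```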

4. Infrastructure as Code — AWS CDK Deployment

Automates the steps above as infrastructure code. The SQS queue is actively mapped to two Lambda functions, and both event source mappings are Enabled, confirming that the event-driven pipeline is live: S3 → SQS → Lambda. A minimal stack sketch follows.
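The real stack is private; the sketch below shows how this wiring might look in CDK v2 Python. Construct IDs, the runtime version, and the asset path are assumptions:

```python
from aws_cdk import (
    Duration, Stack,
    aws_events as events,
    aws_events_targets as targets,
    aws_lambda as _lambda,
    aws_lambda_event_sources as sources,
    aws_s3 as s3,
    aws_s3_notifications as s3n,
    aws_sqs as sqs,
)
from constructs import Construct

class BlsPipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        bucket = s3.Bucket(self, "DataBucket")
        queue = sqs.Queue(self, "ReportQueue", visibility_timeout=Duration.minutes(5))

        ingest_fn = _lambda.Function(
            self, "IngestFn",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="ingest.handler",
            code=_lambda.Code.from_asset("lambda"),
        )
        report_fn = _lambda.Function(
            self, "ReportFn",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="report.handler",
            code=_lambda.Code.from_asset("lambda"),
        )

        # EventBridge fires the ingest Lambda on a daily schedule
        events.Rule(
            self, "DailyIngest",
            schedule=events.Schedule.rate(Duration.days(1)),
            targets=[targets.LambdaFunction(ingest_fn)],
        )

        # New S3 objects notify SQS; the report Lambda drains the queue
        bucket.add_event_notification(s3.EventType.OBJECT_CREATED, s3n.SqsDestination(queue))
        report_fn.add_event_source(sources.SqsEventSource(queue))
```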

Method A: Python CDK (Local Jupyter Notebook)

Runs directly from a Jupyter Notebook with minimal or no CloudShell usage.
This approach is easier to iterate on, test, and document.
View Notebook
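
One possible shape for such a notebook cell, assuming the stack class from Part 4 and an already-bootstrapped environment. The module name and CLI flags here are illustrative, not the project's own:

```python
import subprocess

from aws_cdk import App
from bls_pipeline_stack import BlsPipelineStack  # hypothetical module name

app = App(outdir="cdk.out")
BlsPipelineStack(app, "BlsPipelineStack")
app.synth()  # writes the CloudFormation template into cdk.out/

# Deploy the pre-synthesized assembly without a cdk.json entry point
subprocess.run(
    ["cdk", "deploy", "--app", "cdk.out", "BlsPipelineStack",
     "--require-approval", "never"],
    check=True,
)
```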

Method B: Python CDK (AWS CloudShell)

No local setup is required.
View Deployment Logs (sanitized)

The resulting CloudFormation stack, bls_pipeline_stack, confirms a fully deployed AWS data pipeline.


AWS Tech Stack

  • Amazon S3 — buckets for both raw and processed BLS datasets
  • AWS Lambda — pulls API data and drops it into S3
  • Amazon SQS — queue for event-driven report processing (Part 4)
  • Amazon EventBridge — kicks off Lambda runs on a set schedule
  • AWS IAM — scoped-down roles for Lambda, S3, and SQS access
  • AWS CDK — spins up the stack (Lambda, S3, SQS) as code
  • AWS Glue Data Catalog — keeps S3 datasets organized with schemas
  • Amazon Athena — runs SQL queries directly on S3 data via the Glue catalog (query sketch after this list)
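
For illustration, launching such a query from Python might look like the following; the database, table, and results-bucket names are made up:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database/table names registered in the Glue Data Catalog
resp = athena.start_query_execution(
    QueryString="SELECT year, AVG(value) AS avg_value FROM bls_data GROUP BY year",
    QueryExecutionContext={"Database": "bls_catalog"},
    ResultConfiguration={"OutputLocation": "s3://bls-query-results/"},
)
print(resp["QueryExecutionId"])  # poll this ID for results
```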

Security, SDKs & Data Sources

  • Secrets: GitHub and Kaggle Secrets; AWS Secrets Manager (retrieval sketch after this list)
  • SDKs: Python, Pandas, Boto3 (AWS SDK for Python)
  • Sources: BLS Public API + bulk files; DataUSA API
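
As a hedged example, pulling a BLS API key from Secrets Manager could look like this. The secret name and JSON layout are assumptions (registrationkey matches the parameter name the BLS v2 API expects):

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

# Hypothetical secret holding the BLS API registration key as JSON
value = secrets.get_secret_value(SecretId="bls/api-key")
api_key = json.loads(value["SecretString"])["registrationkey"]
```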
