A complete data pipeline to Extract, Transform, and Load (ETL) Reddit data into an AWS Redshift data warehouse using modern cloud and orchestration tools.
This pipeline illustrates the complete lifecycle of ingesting data from Reddit subreddits and preparing it for advanced analysis and visualization in a cloud-based data warehouse.
Tools: Docker, Airflow, AWS S3, AWS Glue, AWS Redshift
Libraries/Tech: praw, pandas, numpy, s3fs, pytest, unittest, logging
- Scrape data from Reddit using the `praw` API.
- Transform the data with `pandas`.
- Push the raw data to AWS S3 using `s3fs`.
- Automate steps 1–3 using Airflow DAGs.
- Run Spark-based ETL jobs using AWS Glue.
- Apply additional transformations and move cleaned data to another S3 bucket.
- Load the final dataset into Amazon Redshift, making it ready for BI tools like Power BI or Tableau.
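The extract → transform → load steps above can be sketched as follows. This is a minimal illustration, not the repo's actual DAG code: the function names, field selection, and bucket path are hypothetical, and real runs need Reddit API and AWS credentials.

```python
import pandas as pd


def transform_posts(posts: list[dict]) -> pd.DataFrame:
    """Normalize raw submission dicts into a tidy DataFrame."""
    df = pd.DataFrame(posts)
    # Reddit timestamps arrive as Unix epoch seconds
    df["created_utc"] = pd.to_datetime(df["created_utc"], unit="s")
    df["title"] = df["title"].str.strip()
    return df[["id", "title", "score", "num_comments", "created_utc"]]


def extract_posts(subreddit: str = "dataengineering", limit: int = 100) -> list[dict]:
    """Pull hot submissions via praw (requires Reddit API credentials)."""
    import praw  # imported lazily so transform_posts stays testable offline

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="reddit_etl",
    )
    return [
        {
            "id": s.id,
            "title": s.title,
            "score": s.score,
            "num_comments": s.num_comments,
            "created_utc": s.created_utc,
        }
        for s in reddit.subreddit(subreddit).hot(limit=limit)
    ]


def load_to_s3(df: pd.DataFrame, path: str = "s3://my-raw-bucket/reddit.csv") -> None:
    """Write raw data to S3; pandas delegates s3:// paths to s3fs."""
    df.to_csv(path, index=False)
```

In the pipeline, Airflow would call these three functions as separate tasks, with Glue picking up the raw CSV from S3 for the Spark-based stage.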
1. Clone the repository:

   ```shell
   git clone https://github.com/NotAbdelrahmanelsayed/Reddit_ETL.git
   cd Reddit_ETL
   ```

2. Create the configuration file:

   ```shell
   touch config/config.conf
   ```

3. Set up your credentials:
   - Get Reddit API keys from reddit.com/prefs/apps
   - Get AWS credentials as described in this AWS CLI guide

4. Fill in `config/config.conf`. Use `config_example.conf` as a reference template.

5. Start Airflow and services:

   ```shell
   docker compose up -d
   ```

6. Access the Airflow UI: open your browser and go to http://localhost:8080

7. Trigger the DAG to run the full ETL process.
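For reference, a filled-in `config/config.conf` might look like the fragment below. The section and key names here are illustrative; mirror the exact keys used in `config_example.conf`.

```ini
; Illustrative layout only — copy the actual keys from config_example.conf
[reddit]
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
user_agent = reddit_etl

[aws]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
region = us-east-1
```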
- Spent 15+ hours debugging Airflow, which taught me more than most tutorials.
- Deep-dived into AWS documentation, improving my understanding of real-world AWS usage.
- Explored various AWS settings, which opened my eyes to configuration flexibility for different use cases.
- Learned the basics of Apache Spark while scripting AWS Glue jobs.
- Practiced creative testing using mocking and patching to isolate components cleanly.
- Enhanced my skills in Docker and Docker Compose while managing the multi-container setup.
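The mocking-and-patching approach mentioned above can be sketched like this: replace the `praw` client with a `MagicMock` so the extract logic is tested without network access or credentials. The function name `extract_titles` is illustrative, not the repo's actual API.

```python
from unittest.mock import MagicMock


def extract_titles(reddit, subreddit: str = "dataengineering", limit: int = 2) -> list[str]:
    """Collect submission titles — the unit under test."""
    return [s.title for s in reddit.subreddit(subreddit).hot(limit=limit)]


def test_extract_titles():
    # Build a fake praw.Reddit: subreddit(...).hot(...) returns one fake post
    fake_reddit = MagicMock()
    fake_post = MagicMock()
    fake_post.title = "mocked post"
    fake_reddit.subreddit.return_value.hot.return_value = [fake_post]

    assert extract_titles(fake_reddit) == ["mocked post"]
    fake_reddit.subreddit.assert_called_once_with("dataengineering")
```

Injecting the client as a parameter (rather than constructing it inside the function) is what makes this isolation clean.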
