A complete data pipeline to Extract, Transform, and Load (ETL) Reddit data into an AWS Redshift data warehouse using modern cloud and orchestration tools.
This pipeline illustrates the complete lifecycle of ingesting data from Reddit subreddits and preparing it for advanced analysis and visualization in a cloud-based data warehouse.
Tools: Docker, Airflow, AWS S3, AWS Glue, AWS Redshift
Libraries/Tech: praw, pandas, numpy, s3fs, pytest, unittest, logging
- Scrape data from Reddit using the `praw` API.
- Transform the data with `pandas`.
- Push the raw data to AWS S3 using `s3fs`.
- Automate steps 1–3 using Airflow DAGs.
- Run Spark-based ETL jobs using AWS Glue.
- Apply additional transformations and move cleaned data to another S3 bucket.
- Load the final dataset into Amazon Redshift, making it ready for BI tools like Power BI or Tableau.
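The extract → transform → load steps above can be sketched as follows. This is a minimal illustration, not the repo's actual DAG code: the function names, field selection, and bucket path are hypothetical, and real runs need Reddit API and AWS credentials.

```python
import pandas as pd


def transform_posts(posts: list[dict]) -> pd.DataFrame:
    """Normalize raw submission dicts into a tidy DataFrame."""
    df = pd.DataFrame(posts)
    # Reddit timestamps arrive as Unix epoch seconds
    df["created_utc"] = pd.to_datetime(df["created_utc"], unit="s")
    df["title"] = df["title"].str.strip()
    return df[["id", "title", "score", "num_comments", "created_utc"]]


def extract_posts(subreddit: str = "dataengineering", limit: int = 100) -> list[dict]:
    """Pull hot submissions via praw (requires Reddit API credentials)."""
    import praw  # imported lazily so transform_posts stays testable offline

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="reddit_etl",
    )
    return [
        {
            "id": s.id,
            "title": s.title,
            "score": s.score,
            "num_comments": s.num_comments,
            "created_utc": s.created_utc,
        }
        for s in reddit.subreddit(subreddit).hot(limit=limit)
    ]


def load_to_s3(df: pd.DataFrame, path: str = "s3://my-raw-bucket/reddit.csv") -> None:
    """Write raw data to S3; pandas delegates s3:// paths to s3fs."""
    df.to_csv(path, index=False)
```

In the pipeline, Airflow would call these three functions as separate tasks, with Glue picking up the raw CSV from S3 for the Spark-based stage.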
1. Clone the repository:

   ```shell
   git clone https://github.com/NotAbdelrahmanelsayed/Reddit_ETL.git
   cd Reddit_ETL
   ```

2. Create the configuration file:

   ```shell
   touch config/config.conf
   ```

3. Set up your credentials:
   - Get Reddit API keys from reddit.com/prefs/apps
   - Get AWS credentials as described in this AWS CLI guide

4. Fill in `config/config.conf`. Use `config_example.conf` as a reference template.

5. Start Airflow and services:

   ```shell
   docker compose up -d
   ```

6. Access the Airflow UI: open your browser and go to http://localhost:8080

7. Trigger the DAG to run the full ETL process.
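For reference, a filled-in `config/config.conf` might look like the fragment below. The section and key names here are illustrative; mirror the exact keys used in `config_example.conf`.

```ini
; Illustrative layout only — copy the actual keys from config_example.conf
[reddit]
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
user_agent = reddit_etl

[aws]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
region = us-east-1
```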
- Spent 15+ hours debugging Airflow, which taught me more than most tutorials.
- Deep-dived into AWS documentation, improving my understanding of real-world AWS usage.
- Explored various AWS settings, which opened my eyes to configuration flexibility for different use cases.
- Learned the basics of Apache Spark while scripting AWS Glue jobs.
- Practiced creative testing using mocking and patching to isolate components cleanly.
- Enhanced my skills in Docker and Docker Compose while managing the multi-container setup.
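The mocking-and-patching approach mentioned above can be sketched like this: replace the `praw` client with a `MagicMock` so the extract logic is tested without network access or credentials. The function name `extract_titles` is illustrative, not the repo's actual API.

```python
from unittest.mock import MagicMock


def extract_titles(reddit, subreddit: str = "dataengineering", limit: int = 2) -> list[str]:
    """Collect submission titles — the unit under test."""
    return [s.title for s in reddit.subreddit(subreddit).hot(limit=limit)]


def test_extract_titles():
    # Build a fake praw.Reddit: subreddit(...).hot(...) returns one fake post
    fake_reddit = MagicMock()
    fake_post = MagicMock()
    fake_post.title = "mocked post"
    fake_reddit.subreddit.return_value.hot.return_value = [fake_post]

    assert extract_titles(fake_reddit) == ["mocked post"]
    fake_reddit.subreddit.assert_called_once_with("dataengineering")
```

Injecting the client as a parameter (rather than constructing it inside the function) is what makes this isolation clean.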
