The purpose of this project is to understand what, where, and how each user listens to songs, using metadata generated from the Million Song Dataset. The analytical goal is to learn, from listening habits, what drives free-tier users to upgrade to the paid tier and why paid users downgrade to the free tier.
- Subset of the real data from the Million Song Dataset
- Each file is in JSON format and contains metadata about a song and the artist of that song
- Log files are in JSON format, generated by an event simulator based on the songs in the dataset above
- These simulate activity logs from a music streaming app based on specified configurations
- The log datasets are partitioned by year and month
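As a sketch of what one simulated log event might look like, the records can be parsed with the standard `json` module. The field names used here (`page`, `ts`, `userId`, `level`) are assumptions based on typical event-simulator output, not confirmed by this README:

```python
import json

# A hypothetical log line of the kind the event simulator emits;
# the exact field names are assumptions for illustration.
raw = '{"userId": "26", "level": "free", "page": "NextSong", "ts": 1541106106796, "song": "Some Song"}'

event = json.loads(raw)

# Only "NextSong" events represent an actual song play, so the ETL
# would typically filter on that page value before loading songplays.
if event.get("page") == "NextSong":
    print(event["userId"], event["level"], event["ts"])
```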
We have 1 fact table (songplays) and 4 dimension tables (users, songs, artists, time).

- songplays: Stores information about each song play in the log data
  - Aggregates the crucial information of what and when each user is listening
  - `play_id` as PRIMARY KEY, added through `monotonically_increasing_id`; dropped any duplicates from the table as each record is unique
  - Partitioned by `year` and then `month`
- users: Stores the information of each unique user of the app
  - `user_id` as PRIMARY KEY; dropped any duplicates from the table as each user_id is associated with only one user
- songs: Stores the information of each unique song in the music database
  - `song_id` as PRIMARY KEY; dropped any duplicates from the table as each song_id is associated with only one song
  - Partitioned by `year` and then `artist`
- artists: Stores the information of each unique artist in the music database
  - `artist_id` as PRIMARY KEY; dropped any duplicates from the table as each artist_id is associated with only one artist
- time: Timestamps of records in songplays broken down into specific units
  - Allows us to better understand during what period of the day/week/month/year people listen to music
  - `start_time` as PRIMARY KEY; dropped any duplicates from the table as each start_time carries the same date-time info
  - Partitioned by `year` and then `month`
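The time-unit breakdown described for the `time` table can be sketched in plain Python. The millisecond-epoch input and the exact column set are assumptions based on the common shape of such log data:

```python
from datetime import datetime, timezone

def breakdown(ts_ms: int) -> dict:
    """Break a millisecond epoch timestamp into the units stored in `time`."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": dt.isoformat(),
        "hour": dt.hour,
        "day": dt.day,
        "week": dt.isocalendar()[1],   # ISO week number
        "month": dt.month,
        "year": dt.year,
        "weekday": dt.weekday(),       # Monday == 0
    }

print(breakdown(1541106106796))
```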
- Used Postgres to create the database schema and perform the ETL data pipeline
- Performed ETL to load data from S3 to Redshift
- Performed ELT to load data from S3 to Redshift using Spark
- Used Apache Airflow to automate and monitor the data warehouse ETL pipelines
- Built a dynamic, reusable ETL pipeline that allows easy backfills and enforces data quality checks
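One common form of data quality check in such Airflow pipelines is a row-count assertion after each load. A minimal, framework-free sketch follows; the `run_query` callable is a hypothetical stand-in for whatever database hook the pipeline uses:

```python
def has_rows(run_query, table: str) -> None:
    """Fail loudly if `table` is empty after a load -- the kind of check
    an Airflow data-quality task would run after each stage.
    `run_query` is any callable that executes SQL and returns rows
    (a hypothetical stand-in for a Redshift/Postgres hook)."""
    records = run_query(f"SELECT COUNT(*) FROM {table}")
    if not records or not records[0] or records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} returned no rows")

# Usage with a fake query runner standing in for a real connection:
fake = lambda sql: [(42,)]
has_rows(fake, "songplays")  # passes silently
```

Keeping the check a plain function makes it reusable across tables and easy to wire into backfill runs, since the same task can be mapped over any list of table names.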
