Spotify Real-Time Data Analysis (Kafka → MinIO → Airflow → Snowflake → dbt)

A small end-to-end pipeline that simulates Spotify events, streams them through Kafka, lands raw data in MinIO, loads into Snowflake (Bronze) via Airflow, and transforms into curated views (Silver/Gold) using dbt.

Architecture

  • Simulator: simulator/producer.py → produces events to Kafka topic spotify-events
  • Landing: consumer/kafka-to-minio.py → consumes from Kafka and uploads JSON to MinIO under bronze/date=.../hour=.../
  • Orchestration / Load: Airflow DAG docker/dags/minio-to-kafka.py → reads from MinIO and inserts into Snowflake BRONZE table
  • Transformations: dbt project spotify_analysis/ → creates TRANSFORM schema views:
    • transform.spotify_silver
    • transform.top_songs
    • transform.user_engagement
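The events themselves are plain JSON records. The gold models expect columns such as song_id, song_name, device_type, and event_ts (see troubleshooting below), so a simulated event presumably carries those fields; the remaining field names in this sketch are guesses, not taken from producer.py:

```python
import json
import random
import time
import uuid

def make_event() -> dict:
    """One simulated listening event (field names are illustrative)."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": f"user_{random.randint(1, 100)}",
        "song_id": f"song_{random.randint(1, 500)}",
        "song_name": random.choice(["Track A", "Track B", "Track C"]),
        "device_type": random.choice(["mobile", "desktop", "web"]),
        "event_ts": int(time.time()),  # Unix epoch seconds
    }

print(json.dumps(make_event(), indent=2))
```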

Prerequisites

  • Docker + Docker Compose
  • Python 3 (local)
  • Snowflake account + warehouse + role access
  • dbt Snowflake adapter (pip3 install dbt-snowflake)

Configuration (do not commit secrets)

  • Docker/Airflow env: docker/.env
  • Airflow DAG env (MinIO + Snowflake): docker/dags/.env
  • Consumer env: consumer/.env
  • dbt profile: ~/.dbt/profiles.yml

Keep passwords/tokens out of Git. Use placeholders in committed files.
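For the dbt profile, a placeholder ~/.dbt/profiles.yml might look like the following. The profile name, target name, and thread count are assumptions; the database and schema match the Snowflake objects used elsewhere in this README. Replace the angle-bracket placeholders with your own values:

```yaml
spotify_analysis:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <SNOWFLAKE_ACCOUNT>
      user: <SNOWFLAKE_USER>
      password: <SNOWFLAKE_PASSWORD>
      role: <SNOWFLAKE_ROLE>
      warehouse: <SNOWFLAKE_WAREHOUSE>
      database: SPOTIFY_DB
      schema: TRANSFORM
      threads: 4
```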

Run the pipeline

1) Start services (Kafka, MinIO, Airflow, Postgres)

cd docker
docker-compose up -d

2) Start the simulator (Kafka producer)

cd simulator
python3 producer.py
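The exact contents of producer.py aren't reproduced here; a minimal sketch of a producer loop, assuming the kafka-python client library (pip3 install kafka-python) and the spotify-events topic from the architecture section:

```python
import json
import time

TOPIC = "spotify-events"  # topic name from the architecture section

def serialize(event: dict) -> bytes:
    """JSON-encode an event for the Kafka wire format."""
    return json.dumps(event).encode("utf-8")

def run(bootstrap: str = "localhost:9092") -> None:
    # kafka-python is an assumed client library, not confirmed by the repo
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=bootstrap, value_serializer=serialize)
    while True:
        producer.send(TOPIC, {"song_id": "song_1", "event_ts": int(time.time())})
        producer.flush()
        time.sleep(1)

if __name__ == "__main__":
    run()
```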

3) Start the consumer (Kafka → MinIO)

cd consumer
python3 kafka-to-minio.py
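The part of the consumer worth noting is the bronze/date=.../hour=... object key layout. A sketch of how such a key could be derived from an event timestamp (the helper name is made up; the actual consumer may partition differently):

```python
from datetime import datetime, timezone

def object_key(event_ts: int, event_id: str) -> str:
    """Build a MinIO object key partitioned by UTC date and hour."""
    dt = datetime.fromtimestamp(event_ts, tz=timezone.utc)
    return f"bronze/date={dt:%Y-%m-%d}/hour={dt:%H}/{event_id}.json"

print(object_key(0, "abc"))  # bronze/date=1970-01-01/hour=00/abc.json
```

Keys shaped this way let the Airflow load and any later queries prune by date/hour prefix instead of listing the whole bucket.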

Verify files land in MinIO:

docker exec -it minio sh -c "mc alias set myminio http://localhost:9000 <MINIO_USER> <MINIO_PASS> && mc ls myminio/spotify/ --recursive"

4) Trigger the Airflow DAG (MinIO → Snowflake Bronze)

  • Open Airflow UI → DAG spotify_minio_to_snowflake_bronze → Trigger DAG
  • DAG file: docker/dags/minio-to-kafka.py
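Inside the DAG, the load step has to turn the JSON files read from MinIO into Snowflake inserts. A sketch of a helper that builds a parameterized statement for the Snowflake Python connector (which uses %s placeholders); the single RAW_JSON column is an assumption about the bronze table layout, not confirmed by the repo:

```python
import json

def build_insert(events: list[dict], table: str = "SPOTIFY_EVENTS_BRONZE") -> tuple[str, list[tuple]]:
    """Build a parameterized INSERT suitable for cursor.executemany().

    Assumes the bronze table keeps each raw payload in a single
    column; RAW_JSON is a made-up column name.
    """
    # Snowflake requires INSERT ... SELECT (not VALUES) to call PARSE_JSON
    sql = f"INSERT INTO {table} (RAW_JSON) SELECT PARSE_JSON(%s)"
    params = [(json.dumps(e),) for e in events]
    return sql, params

sql, params = build_insert([{"song_id": "song_1", "event_ts": 0}])
print(sql)
```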

Verify in Snowflake:

USE DATABASE SPOTIFY_DB;
USE SCHEMA BRONZE;
SELECT COUNT(*) FROM SPOTIFY_EVENTS_BRONZE;

5) Run dbt transformations (Silver/Gold)

cd spotify_analysis
dbt debug
dbt run

Verify in Snowflake:

USE DATABASE SPOTIFY_DB;
USE SCHEMA TRANSFORM;
SHOW VIEWS;
SELECT * FROM SPOTIFY_SILVER LIMIT 10;
SELECT * FROM TOP_SONGS LIMIT 10;
SELECT * FROM USER_ENGAGEMENT LIMIT 10;

Common troubleshooting

  • Airflow task fails: NoSuchBucket → Create bucket spotify in MinIO and ensure consumer is writing.
  • Airflow can’t find MinIO objects → Ensure MINIO_PREFIX matches upload prefix (e.g., bronze/).
  • dbt “invalid identifier …” → Column names in Gold models must match Silver model outputs (song_id, song_name, device_type, event_ts, etc.).
  • dbt can’t connect to Snowflake → Check ~/.dbt/profiles.yml credentials and run dbt debug.

Key files

  • Producer: simulator/producer.py
  • Consumer: consumer/kafka-to-minio.py
  • Airflow DAG: docker/dags/minio-to-kafka.py
  • dbt project: spotify_analysis/
    • Silver: spotify_analysis/models/silver/spotify_silver.sql
    • Gold: spotify_analysis/models/gold/top_songs.sql, spotify_analysis/models/gold/user_engagement.sql
    • Sources: spotify_analysis/models/sources.yml
