This repository hosts all the necessary resources for the Advanced Data Systems Project 1 titled "Mastering Snowflake: Sentiment Analysis and Performance Experiments".
- python_udtf/naive_bayes_udtf.py: Implementation of Naive Bayes using a UDTF in Python.
- python_udtf/naive_bayes_udtf.sql: Naive Bayes implementation using a UDTF in Python, adapted for Snowflake.
- snowflake_sql/naive_bayes_sql.sql: Naive Bayes implementation in SQL.
- tpch_benchmark/tpch_benchmark.ipynb: Jupyter notebook where the Performance Experiments Using TPC-H are performed.
- tpch_benchmark/query_execution_times.csv: Query execution times for all queries across all possible combinations.
- tpch_benchmark/average_query_execution_times.csv: Average query execution times after three runs across all possible combinations.
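The relationship between the two CSV files in tpch_benchmark (per-run timings and their averages over three runs) can be sketched roughly as follows. This is only an illustration, not the notebook's actual code: the column names `query`, `configuration`, `run`, and `seconds`, the label `config_a`, and the sample values are all hypothetical, and the real CSV headers may differ.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample data with hypothetical column names (not taken from the real CSV).
SAMPLE_RUNS = """query,configuration,run,seconds
Q1,config_a,1,1.20
Q1,config_a,2,1.00
Q1,config_a,3,1.10
Q2,config_a,1,0.50
Q2,config_a,2,0.70
Q2,config_a,3,0.60
"""


def average_execution_times(rows):
    """Group per-run timings by (query, configuration) and average each group."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["query"], row["configuration"])
        grouped[key].append(float(row["seconds"]))
    return {key: mean(times) for key, times in grouped.items()}


averages = average_execution_times(csv.DictReader(io.StringIO(SAMPLE_RUNS)))
for (query, configuration), seconds in sorted(averages.items()):
    print(f"{query} {configuration} {seconds:.2f}")
```

To try this on the real file, replace the in-memory sample with an open file handle and adjust the column names to match its header.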
Note: The training and test data from the Yelp Review dataset (https://huggingface.co/datasets/Yelp/yelp_review_full) were uploaded to Snowflake at the beginning of this project and are not available here on GitHub due to their size.
- "plots": Directory containing images and plots used in the report.
- "report": Directory containing the report of this assignment in PDF format.
Prerequisites: Ensure Python 3.12.6 is installed on your machine. Other versions might work, but this project was developed with 3.12.6.
1. Create and Activate a Virtual Environment
Create a Virtual Environment in the root directory of this project by running the following commands:
- For macOS/Linux:
    python3 -m venv .venv
    source .venv/bin/activate
- For Windows:
    python -m venv .venv
    .venv\Scripts\activate
2. Install Required Packages
When the virtual environment is activated, install all necessary packages by running:
pip install -r requirements.txt
1. Running the TPC-H Benchmark Notebook:
To recreate the results using the tpch_benchmark/tpch_benchmark.ipynb notebook:
- Run the whole tpch_benchmark.ipynb notebook, making sure to use your own "user", "account", and "password" when connecting to Snowflake (more details inside the notebook).
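One way to supply your own "user", "account", and "password" without hard-coding them is to read them from environment variables. The sketch below assumes the connection is made with the snowflake-connector-python package. The environment variable names and both helper functions are hypothetical, and the notebook itself may handle credentials differently.

```python
import os

# Hypothetical environment variable names; the notebook may read credentials differently.
ENV_KEYS = {
    "user": "SNOWFLAKE_USER",
    "account": "SNOWFLAKE_ACCOUNT",
    "password": "SNOWFLAKE_PASSWORD",
}


def load_connection_params(env=os.environ):
    """Collect user, account, and password, failing early if any is missing."""
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {param: env[name] for param, name in ENV_KEYS.items()}


def connect(env=os.environ):
    """Open a Snowflake connection (requires snowflake-connector-python)."""
    # Imported lazily so load_connection_params works without the package installed.
    import snowflake.connector

    return snowflake.connector.connect(**load_connection_params(env))
```

With the three variables exported in your shell, calling `connect()` returns a connection object. Keeping credentials out of the notebook also avoids committing them by accident.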
2. Executing Naive Bayes Implementations:
The Naive Bayes implementations, both in SQL and as a UDTF in Python, need to be executed within Snowflake.