This project implements a scalable, hybrid travel recommendation system natively designed for PySpark and Databricks. It combines Content-Based Filtering (via user-driven UI widgets) and Collaborative Filtering (using the Alternating Least Squares or ALS algorithm) to deliver highly personalized, top-15 travel destination rankings.
The system applies strict user-defined filters and includes intelligent fallback logic: if the initial constraints yield zero results, it relaxes them step by step so the pipeline keeps returning relevant recommendations instead of failing.
- Hybrid Recommendation Architecture: Fuses content-based pre-filtering with matrix factorization.
- Interactive UI Widgets: Built-in Databricks widgets allow users to define preferences:
  - Max Budget (USD/Day)
  - Climate Type
  - Travel Type
  - Preferred Season
- Intelligent Fallback Logic: Automatically relaxes strict "Climate" and "Season" constraints sequentially if multi-factor limits cannot be met, protecting essential budget and travel-type requirements.
- Synthetic Interaction Generation: Bootstraps the ALS model with dynamically generated baseline interactions (15,000 synthetic ratings) across 1,000 simulated users.
- Evaluation Metrics: Validates predictive performance objectively using Root Mean Square Error (RMSE).
- Rich Output Format: Joins predictions back to the core dataset to display 17 dimensions of relevant destination data (e.g., predicted ratings, safety indexes, average temperature, visa requirements, currency).
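The fallback cascade described above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the dict-based rows and the helper name `filter_with_fallback` are hypothetical, but the relaxation order (climate first, then season, with budget and travel type always enforced) follows the feature description.

```python
# Minimal sketch of the fallback cascade: optional constraints are dropped
# in a fixed order (climate first, then season) until results appear,
# while budget and travel type are always enforced.
def filter_with_fallback(destinations, prefs):
    def matches(dest, use_climate, use_season):
        if dest["avg_cost_usd_per_day"] > prefs["max_budget"]:
            return False
        if dest["travel_type"] != prefs["travel_type"]:
            return False
        if use_climate and dest["climate"] != prefs["climate"]:
            return False
        if use_season and dest["best_season"] != prefs["season"]:
            return False
        return True

    # Try strict filtering, then relax climate, then relax season as well.
    for use_climate, use_season in [(True, True), (False, True), (False, False)]:
        hits = [d for d in destinations if matches(d, use_climate, use_season)]
        if hits:
            return hits
    return []
```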
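The synthetic bootstrap can likewise be sketched in plain Python. The counts (15,000 ratings, 1,000 users) come from the feature list above, but the 1–5 rating scale and uniform random sampling are assumptions, as is the helper name:

```python
import random

# Generate 15,000 synthetic (user_id, destination, rating) interactions
# across 1,000 simulated users, to bootstrap ALS before real feedback
# exists. The 1-5 scale and uniform sampling are illustrative assumptions.
def generate_interactions(destinations, n_users=1000, n_ratings=15000, seed=42):
    rng = random.Random(seed)
    return [
        (rng.randrange(n_users),      # user_id
         rng.choice(destinations),    # destination name
         float(rng.randint(1, 5)))    # rating
        for _ in range(n_ratings)
    ]
```

In the actual pipeline these rows would be converted into a Spark DataFrame (e.g. via `spark.createDataFrame`) before being fed to ALS.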
- Platform: Apache Spark on Databricks.
- Runtime: Databricks Runtime (DBR) 10.4 LTS ML or higher is recommended to guarantee standard `pyspark.ml` and `dbutils.widgets` support.
The primary recommendation model relies on a destination details dataset named ds1_grok.
Ensure the dataset features the following core dimensions to function properly:
- `country`
- `destination`
- `avg_cost_usd_per_day` (numeric)
- `travel_type`
- `best_season`
- `climate`
- Other dimensional columns required in the final output (`top_attraction`, `currency`, `safety_rating`, etc.)
Note: The code first attempts to load the dataset as a native Databricks table (`spark.table("ds1_grok")`). If that is unavailable, it gracefully falls back to reading the raw CSV directly from DBFS (`/FileStore/tables/ds1_grok.csv`).
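The loading behavior in the note above can be wrapped in a small helper. The function name is illustrative; the table name and CSV path are those documented above:

```python
def load_destinations(spark, table_name="ds1_grok",
                      csv_path="/FileStore/tables/ds1_grok.csv"):
    """Prefer the registered Spark table; fall back to the raw CSV on DBFS."""
    try:
        return spark.table(table_name)
    except Exception:
        # Table not registered: read the CSV directly, inferring the schema.
        return spark.read.csv(csv_path, header=True, inferSchema=True)
```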
- Import the Code: Create a new PySpark Notebook in your Databricks workspace and paste the script.
- Mount/Upload Data: Register `ds1_grok.csv` as a Spark table or upload the file to DBFS at `/FileStore/tables/`.
- Configure Dashboard Widgets: Run the first block to instantiate the UI. Interactive dropdown widgets will appear at the top of your notebook. Select your preferences.
- Execute to Train & Predict: Run the remainder of the notebook.
  - Pre-filtering isolates valid destinations.
  - ALS collaborative filtering trains on the remaining candidates.
  - An RMSE evaluation prints to standard output.
  - The ranked top-15 destinations render via Databricks' `display()` function.
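A widget setup matching the steps above might look like the following. It assumes the standard Databricks `dbutils.widgets` API; the widget names, defaults, and choice lists are illustrative, and `dbutils` is passed in explicitly so the helpers stay testable outside a notebook:

```python
# Sketch of the widget setup; names, defaults, and choices are assumptions.
def create_widgets(dbutils):
    dbutils.widgets.text("max_budget", "150", "Max Budget (USD/Day)")
    dbutils.widgets.dropdown("climate", "Tropical",
                             ["Tropical", "Temperate", "Arid", "Cold"],
                             "Climate Type")
    dbutils.widgets.dropdown("travel_type", "Beach",
                             ["Beach", "City", "Adventure", "Cultural"],
                             "Travel Type")
    dbutils.widgets.dropdown("season", "Summer",
                             ["Spring", "Summer", "Autumn", "Winter"],
                             "Preferred Season")

def read_preferences(dbutils):
    """Capture the current widget state as a plain dict."""
    return {
        "max_budget": float(dbutils.widgets.get("max_budget")),
        "climate": dbutils.widgets.get("climate"),
        "travel_type": dbutils.widgets.get("travel_type"),
        "season": dbutils.widgets.get("season"),
    }
```

In a Databricks notebook, `dbutils` is available globally, so the calls would simply be `create_widgets(dbutils)` followed by `read_preferences(dbutils)`.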
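For reference, the RMSE printed during execution is the usual root mean squared difference between predicted and actual ratings. In the notebook it comes from Spark's `RegressionEvaluator`, but the metric itself reduces to:

```python
import math

def rmse(pairs):
    """Root Mean Square Error over (predicted, actual) rating pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))
```

Lower is better; an RMSE of 1.0 on a 1–5 rating scale means predictions are off by one star on average (in the quadratic-mean sense).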
- Widget Ingestion: Captures stateful user preferences.
- Normalization: Concatenates and cleans destination strings, enforcing uniqueness.
- Content Filtering: Narrows down destinations via strict threshold logic or fallback cascades.
- Collaborative Filtering (ALS):
  - Implements string indexing of destinations.
  - Splits interactions into train/test sets via Spark ML.
  - Trains the matrix factorization model.
- Prediction & Window Ranking: Scores predicted user affinity across the filtered destinations and ranks the top results with a window function, keeping performance overhead low.
- Data Presentation: Structures and explicitly types all final outputs for robust end-user display.