Movie Recommender System Using Machine Learning

Overview

In a world overloaded with content, recommendation systems help users quickly find what they love.
This project uses content-based filtering powered by cosine similarity to recommend the most similar movies based on metadata.

The entire data analysis and model development pipeline follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework, which keeps the work structured and reproducible from business understanding through to deployment.

Built with:

  • Python
  • Machine Learning (Cosine Similarity + Bag-of-Words)
  • NLTK (Porter Stemmer)
  • Streamlit
  • TMDB 5000 dataset

Project Architecture

Movie Chosen → Tag Construction → Text Vectorisation (BoW)
             → Cosine Similarity Search → Top 5 Similar Movies

Tag Construction combines five metadata sources per movie:

tags = overview_words + genres + keywords + top_3_cast + director

Live Demo

Streamlit App


Methodology: CRISP-DM

The notebook (Movie_Recommender_CRISP_DM.ipynb) is structured across six CRISP-DM phases:

| Phase | Description | Key Output |
|---|---|---|
| 1. Business Understanding | Define problem & success criteria | Content-based filter; cold-start safe; explainable |
| 2. Data Understanding | Explore raw data & missingness | 4,803 movies + credits; JSON-encoded features identified |
| 3. Data Preparation | Parse, engineer & normalise features | Unified tags field; stemmed & lowercased |
| 4. Modelling | Vectorise & compute similarity | CountVectorizer (5k features) → (4,805 × 4,805) matrix |
| 5. Evaluation | Sanity checks & genre analysis | Franchise/genre coherence confirmed |
| 6. Deployment | Serialise artefacts | pickle + gzip files ready for Streamlit |

Visualisations

The notebook contains 12 visualisations across all phases:

| # | Chart | Phase |
|---|---|---|
| 1 | CRISP-DM Process Flow Diagram | Business Understanding |
| 2 | Missing Data Overview (both datasets) | Data Understanding |
| 3 | Top 15 Genres & Top 20 Keywords | Data Preparation |
| 4 | Most Prolific Actors & Directors | Data Preparation |
| 5 | Corpus Temporal & Rating Distributions | Data Preparation |
| 6 | Top 30 Vocabulary Terms (post-stemming) | Modelling |
| 7 | Word Cloud of Movie Tags Corpus | Modelling |
| 8 | Similarity Matrix Heatmap & Score Distribution | Modelling |
| 9 | Recommendation Scores (4 test queries, 2×2 grid) | Evaluation |
| 10 | Genre Profile Matrix (query vs recommendations) | Evaluation |
| 11 | Top 10 Most Similar Movie Pairs | Evaluation |
| 12 | Artefact Size: Raw vs Compressed | Deployment |

Types of Recommendation Systems

1. Content-Based Filtering (used in this project)

  • Uses item attributes (genre, cast, keywords, director, plot)
  • No user history required — cold-start safe
  • Fully explainable recommendations
  • Used in: YouTube, Spotify, Twitter

2. Collaborative Filtering

  • "People similar to you liked…"
  • Based on user–item interactions
  • Prone to the cold-start problem

3. Hybrid Systems

  • Best of both worlds
  • Used in: Netflix, Amazon, TikTok

Dataset Used

TMDB 5000 Movies Dataset
https://www.kaggle.com/tmdb/tmdb-movie-metadata

| File | Rows | Key Fields |
|---|---|---|
| tmdb_5000_movies.csv | 4,803 | title, overview, genres, keywords, vote_average, release_date |
| tmdb_5000_credits.csv | 4,803 | title, movie_id, cast (JSON), crew (JSON) |
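As a rough illustration of how the two files could be combined, the sketch below joins two tiny in-memory frames on the shared `title` column. The join key and the handling of missing overviews are assumptions; the notebook may merge on `movie_id` or treat missing rows differently. The sample rows are hand-written stand-ins, not copied from the dataset.

```python
import pandas as pd

# Hand-written stand-ins for the two CSV files. In the project these would
# presumably come from pd.read_csv("data/tmdb_5000_movies.csv") and
# pd.read_csv("data/tmdb_5000_credits.csv").
movies = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "overview": ["A marine is dispatched to a distant moon.", None],
})
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Film A", "Film B"],
    "cast": ['[{"name": "Sam Worthington"}]', '[{"name": "Johnny Depp"}]'],
})

# Assumption: the join key is the shared `title` column.
merged = movies.merge(credits, on="title")

# Count missing values per column, then drop rows without an overview.
missing_per_column = merged.isnull().sum()
cleaned = merged.dropna(subset=["overview"])
```

Note that a join on `title` can return more rows than either input when two films share a title, so row counts are worth checking after the merge.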

ML Core: How It Works

Step 1 — Feature Engineering

Five metadata sources are combined per movie into a single tags string:

overview  +  genres  +  keywords  +  top_3_cast  +  director

Multi-word names are collapsed to prevent token splits:
"Sam Worthington" → "SamWorthington" | "Science Fiction" → "ScienceFiction"
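The tag construction can be sketched roughly as follows. This assumes each JSON-encoded column holds a list of objects with a `name` key, and that crew entries carry a `job` key identifying the director. The helper names (`names`, `directors`, `build_tags`) and the sample row are hypothetical and exist only for illustration.

```python
import json

def names(json_text, limit=None):
    """Return the `name` of each entry in a JSON-encoded list, with spaces removed."""
    entries = json.loads(json_text)
    if limit is not None:
        entries = entries[:limit]
    return [entry["name"].replace(" ", "") for entry in entries]

def directors(crew_json):
    """Return collapsed names of crew members whose job is 'Director'."""
    return [
        member["name"].replace(" ", "")
        for member in json.loads(crew_json)
        if member.get("job") == "Director"
    ]

def build_tags(row):
    """Concatenate the five metadata sources into one space-separated string."""
    tokens = (
        row["overview"].split()
        + names(row["genres"])
        + names(row["keywords"])
        + names(row["cast"], limit=3)   # top 3 cast members only
        + directors(row["crew"])
    )
    return " ".join(tokens)

# Hand-written sample row (not copied from the dataset).
row = {
    "overview": "A paraplegic Marine is dispatched to Pandora",
    "genres": '[{"name": "Action"}, {"name": "Science Fiction"}]',
    "keywords": '[{"name": "space war"}]',
    "cast": '[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}, '
            '{"name": "Sigourney Weaver"}, {"name": "Stephen Lang"}]',
    "crew": '[{"name": "James Cameron", "job": "Director"}, '
            '{"name": "Jane Doe", "job": "Editor"}]',
}
tags = build_tags(row)
```

Collapsing names before vectorisation means "SamWorthington" is counted as one term, so two films are not matched merely because they both feature someone called "Sam".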

Step 2 — Text Normalisation

  • Lowercasing: "Action" and "action" are treated identically
  • Porter Stemming: "adventure" → "adventur", "dispatched" → "dispatch"
  • Stop-word removal: "the", "is", "and" are filtered out
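The three normalisation steps above might look like this with NLTK's `PorterStemmer`. The tiny stop-word set and the `normalise` helper are illustrative assumptions, and the order of the steps in the notebook may differ (for example, stop words may be removed later by the vectoriser).

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative stop-word set only; a real run would use a full English list.
STOP_WORDS = {"the", "is", "and", "a", "to"}

def normalise(tags):
    """Lowercase the text, drop stop words, and stem each remaining token."""
    tokens = tags.lower().split()
    kept = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(stemmer.stem(token) for token in kept)

normalised = normalise("The Marine is dispatched and the adventure begins")
```

Stemming trades readability for recall: the stems are not always dictionary words, but inflected forms of the same word end up in the same vocabulary slot.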

Step 3 — Vectorisation (Bag-of-Words)

CountVectorizer with max_features=5000 transforms each movie's tags into a 5,000-dimensional vector.
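A minimal sketch of this step on a three-movie toy corpus is shown below. The corpus strings are made up, and passing `stop_words="english"` is an assumption about where stop-word filtering happens in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-normalised tag strings standing in for the real corpus.
corpus = [
    "action adventur sciencefict samworthington jamescameron",
    "action adventur pirat johnnydepp goreverbinski",
    "romanc drama ship leonardodicaprio jamescameron",
]

# max_features caps the vocabulary at the most frequent terms.
# stop_words="english" is an assumption, not confirmed by the source.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
vectors = vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps each term to its column index in the count matrix.
cameron_column = vectorizer.vocabulary_["jamescameron"]
cameron_counts = vectors[:, cameron_column]
```

Each row of `vectors` is one movie and each column is one vocabulary term, so the toy matrix has far fewer than 5,000 columns; the cap only matters once the vocabulary grows beyond it.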

Step 4 — Cosine Similarity

$$\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

A symmetric (4,805 × 4,805) matrix is computed once and stored. At query time, a single row lookup returns the top-5 most similar movies.

| Score | Meaning |
|---|---|
| 1.0 | Identical content fingerprint |
| 0.7+ | Highly similar (e.g. franchise sequels) |
| 0.0 | No shared content signals |
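The formula and the query-time lookup can be sketched with NumPy as follows. The function names `cosine_similarity_matrix` and `recommend`, the toy titles, and the count vectors are all hypothetical; the notebook may well use scikit-learn's `cosine_similarity` instead of computing the matrix by hand.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a count matrix."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    unit = vectors / norms           # scale every row to unit length
    return unit @ unit.T             # dot products of unit vectors

def recommend(title, titles, similarity, top_n=5):
    """Return the top_n titles most similar to `title`, excluding itself."""
    index = titles.index(title)
    ranked = np.argsort(similarity[index])[::-1]   # highest score first
    return [titles[i] for i in ranked if i != index][:top_n]

# Toy data: Film A and Film B share two terms, Film C shares none.
titles = ["Film A", "Film B", "Film C"]
vectors = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
similarity = cosine_similarity_matrix(vectors)
top_match = recommend("Film A", titles, similarity, top_n=1)
```

Because the matrix is precomputed, a query only needs to read one row and sort it, which is what keeps the Streamlit lookup cheap.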

Project Structure

Movie-Recommender-System-Using-Machine-Learning/

├── artifacts/
│   ├── movie_dict.pkl.gz       # Serialised movie metadata
│   └── similarity.pkl.gz       # Precomputed similarity matrix

├── data/
│   ├── tmdb_5000_movies.csv
│   └── tmdb_5000_credits.csv

├── demo/

├── Movie_Recommender_CRISP_DM.ipynb   # Full CRISP-DM analysis + 12 visualisations
├── app.py
├── requirements.txt
├── README.md
├── LICENSE
└── .gitignore

Installation Guide

1. Clone Repository

git clone https://github.com/Edwinkorir38/Movie-Recommender-System-Using-Machine-Learning.git

2. Create a Conda Environment

conda create -n movie python=3.10 -y
conda activate movie

3. Install Dependencies

pip install -r requirements.txt

For the word cloud visualisation in the notebook, also run:

pip install wordcloud

4. Recreate ML Model

Run the notebook end-to-end:

jupyter notebook Movie_Recommender_CRISP_DM.ipynb

This will:

  • Load and merge both TMDB CSV files
  • Parse JSON-encoded columns (genres, keywords, cast, crew)
  • Engineer the unified tags feature
  • Apply Porter Stemming and stop-word removal
  • Fit CountVectorizer and compute the cosine similarity matrix
  • Generate all 12 visualisations
  • Save artifacts/movie_dict.pkl.gz and artifacts/similarity.pkl.gz
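The final step, writing gzip-compressed pickle files, could be done along these lines. The helper names `save_artifact` and `load_artifact` and the sample dictionary are hypothetical; the sketch writes to a temporary directory rather than to `artifacts/`.

```python
import gzip
import pickle
import tempfile
from pathlib import Path

def save_artifact(obj, path):
    """Pickle `obj` and write it through a gzip stream."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load_artifact(path):
    """Read a gzip-compressed pickle back into memory."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)

# Stand-in for the real movie metadata dictionary.
movie_dict = {"title": ["Film A", "Film B"], "tags": ["action adventur", "romanc drama"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "movie_dict.pkl.gz"
    save_artifact(movie_dict, path)
    restored = load_artifact(path)
```

Pickle files can execute arbitrary code when loaded, so only load artefacts you generated yourself or obtained from a trusted source.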

5. Run the Web App

streamlit run app.py

Contact

Name Edwin Korir
Email ekorir99@gmail.com
GitHub github.com/Edwinkorir38
LinkedIn linkedin.com/in/edwin-korir-90a794382

About

A content-based movie recommendation system built using Python, Scikit-learn, and Streamlit. The system analyzes movie metadata and uses cosine similarity to suggest films similar to a user-selected title. It includes a clean Streamlit interface and integrates with the TMDB API to display posters and additional movie details.
