In a world overloaded with content, recommendation systems help users quickly find what they love.
This project uses content-based filtering powered by cosine similarity to recommend the most similar movies based on metadata.
The entire data analysis and model development pipeline follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework — ensuring the work is structured, reproducible, and production-ready.
Built with:
- Python
- Machine Learning (Cosine Similarity + Bag-of-Words)
- NLTK (Porter Stemmer)
- Streamlit
- TMDB 5000 dataset
Movie Chosen → Tag Construction → Text Vectorisation (BoW)
→ Cosine Similarity Search → Top 5 Similar Movies
Tag Construction combines five metadata sources per movie:
tags = overview_words + genres + keywords + top_3_cast + director
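As a sketch, assuming each metadata field has already been parsed into Python lists (the field and function names below are illustrative, not the notebook's actual identifiers):

```python
# Illustrative sketch of tag construction; field names are assumptions,
# not the notebook's actual column names.
def build_tags(overview, genres, keywords, cast, crew):
    top_cast = cast[:3]  # keep only the three top-billed actors
    directors = [m["name"] for m in crew if m.get("job") == "Director"]
    tokens = overview.split() + genres + keywords + top_cast + directors
    return " ".join(tokens).lower()

tags = build_tags(
    overview="A paraplegic Marine is dispatched to the moon Pandora",
    genres=["Action", "ScienceFiction"],
    keywords=["spacewar", "future"],
    cast=["SamWorthington", "ZoeSaldana", "SigourneyWeaver", "StephenLang"],
    crew=[{"name": "JamesCameron", "job": "Director"}],
)
```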
| UI Preview | Recommendations |
|---|---|
| ![]() | ![]() |
The notebook (Movie_Recommender_CRISP_DM.ipynb) is structured across six CRISP-DM phases:
| Phase | Description | Key Output |
|---|---|---|
| 1. Business Understanding | Define problem & success criteria | Content-based filter; cold-start safe; explainable |
| 2. Data Understanding | Explore raw data & missingness | 4,803 movies + credits; JSON-encoded features identified |
| 3. Data Preparation | Parse, engineer & normalise features | Unified tags field; stemmed & lowercased |
| 4. Modelling | Vectorise & compute similarity | CountVectorizer (5k features) → (4,805 × 4,805) matrix |
| 5. Evaluation | Sanity checks & genre analysis | Franchise/genre coherence confirmed |
| 6. Deployment | Serialise artefacts | pickle + gzip files ready for Streamlit |
The notebook contains 12 visualisations across all phases:
| # | Chart | Phase |
|---|---|---|
| 1 | CRISP-DM Process Flow Diagram | Business Understanding |
| 2 | Missing Data Overview (both datasets) | Data Understanding |
| 3 | Top 15 Genres & Top 20 Keywords | Data Preparation |
| 4 | Most Prolific Actors & Directors | Data Preparation |
| 5 | Corpus Temporal & Rating Distributions | Data Preparation |
| 6 | Top 30 Vocabulary Terms (post-stemming) | Modelling |
| 7 | Word Cloud of Movie Tags Corpus | Modelling |
| 8 | Similarity Matrix Heatmap & Score Distribution | Modelling |
| 9 | Recommendation Scores (4 test queries, 2×2 grid) | Evaluation |
| 10 | Genre Profile Matrix (query vs recommendations) | Evaluation |
| 11 | Top 10 Most Similar Movie Pairs | Evaluation |
| 12 | Artefact Size: Raw vs Compressed | Deployment |
**Content-Based Filtering (this project)**
- Uses item attributes (genre, cast, keywords, director, plot)
- No user history required — cold-start safe
- Fully explainable recommendations
- Used in: YouTube, Spotify, Twitter

**Collaborative Filtering**
- "People similar to you liked…"
- Based on user–item interactions
- Prone to the cold-start problem

**Hybrid Filtering**
- Best of both worlds
- Used in: Netflix, Amazon, TikTok
TMDB 5000 Movies Dataset
https://www.kaggle.com/tmdb/tmdb-movie-metadata
| File | Rows | Key Fields |
|---|---|---|
| `tmdb_5000_movies.csv` | 4,803 | title, overview, genres, keywords, vote_average, release_date |
| `tmdb_5000_credits.csv` | 4,803 | title, movie_id, cast (JSON), crew (JSON) |
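The JSON-encoded columns can be parsed with the standard library; a minimal sketch (`parse_names` is an illustrative helper, not the notebook's exact function):

```python
import ast

# genres/keywords/cast/crew cells hold JSON-like strings such as
# '[{"id": 28, "name": "Action"}, ...]'; extract just the names.
def parse_names(cell):
    return [entry["name"] for entry in ast.literal_eval(cell)]

row = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'
parse_names(row)  # → ['Action', 'Science Fiction']
```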
Five metadata sources are combined per movie into a single tags string:
overview + genres + keywords + top_3_cast + director
Multi-word names are collapsed to prevent token splits:
"Sam Worthington" → "SamWorthington" | "Science Fiction" → "ScienceFiction"
- Lowercasing — `"Action"` and `"action"` treated identically
- Porter Stemming — `"adventure"` → `"adventur"`, `"dispatched"` → `"dispatch"`
- Stop-word removal — `"the"`, `"is"`, `"and"` filtered out
CountVectorizer with max_features=5000 transforms each movie's tags into a 5,000-dimensional vector.
A symmetric (4,805 × 4,805) matrix is computed once and stored. At query time, a single row lookup returns the top-5 most similar movies.
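These two steps can be sketched with scikit-learn on a toy corpus (three made-up tag strings; the real run uses all movies and produces the full symmetric matrix):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tags = [  # three toy, already-normalised tag strings
    "action adventur spacewar samworthington jamescameron",
    "action adventur alien jamescameron",
    "romanc drama love",
]
vectors = CountVectorizer(max_features=5000).fit_transform(tags)
similarity = cosine_similarity(vectors)  # symmetric (3 × 3) here
```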
| Score | Meaning |
|---|---|
| 1.0 | Identical content fingerprint |
| 0.7+ | Highly similar (e.g. franchise sequels) |
| 0.0 | No shared content signals |
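The row-lookup step described above can be sketched as follows; the `recommend` helper, toy matrix, and titles are illustrative, not the repo's actual code:

```python
def recommend(idx, similarity, titles, k=5):
    """Return the k titles most similar to titles[idx], excluding itself."""
    scores = sorted(enumerate(similarity[idx]), key=lambda p: p[1], reverse=True)
    return [titles[i] for i, _ in scores[1:k + 1]]  # skip the movie itself

titles = ["Avatar", "Aliens", "Titanic"]
toy = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
recommend(0, toy, titles, k=2)  # → ['Aliens', 'Titanic']
```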
Movie-Recommender-System-Using-Machine-Learning/
│
├── artifacts/
│ ├── movie_dict.pkl.gz # Serialised movie metadata
│ └── similarity.pkl.gz # Precomputed similarity matrix
│
├── data/
│ ├── tmdb_5000_movies.csv
│ └── tmdb_5000_credits.csv
│
├── demo/
│
├── Movie_Recommender_CRISP_DM.ipynb # Full CRISP-DM analysis + 12 visualisations
├── app.py
├── requirements.txt
├── README.md
├── LICENSE
└── .gitignore

```bash
git clone https://github.com/Edwinkorir38/Movie-Recommender-System-Using-Machine-Learning.git
conda create -n movie python=3.10 -y
conda activate movie
pip install -r requirements.txt
```

For the word cloud visualisation in the notebook, also run:

```bash
pip install wordcloud
```
Run the notebook end-to-end:
```bash
jupyter notebook Movie_Recommender_CRISP_DM.ipynb
```

This will:
- Load and merge both TMDB CSV files
- Parse JSON-encoded columns (genres, keywords, cast, crew)
- Engineer the unified `tags` feature
- Apply Porter Stemming and stop-word removal
- Fit CountVectorizer and compute the cosine similarity matrix
- Generate all 12 visualisations
- Save `artifacts/movie_dict.pkl.gz` and `artifacts/similarity.pkl.gz`
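The serialisation step is a gzip-compressed pickle round-trip; a minimal sketch (toy data, filename follows the `artifacts/` layout above):

```python
import gzip
import pickle

# Toy stand-in for the real movie metadata dictionary.
movie_dict = {"title": ["Avatar"], "movie_id": [19995]}

with gzip.open("movie_dict.pkl.gz", "wb") as f:
    pickle.dump(movie_dict, f)

with gzip.open("movie_dict.pkl.gz", "rb") as f:
    restored = pickle.load(f)
```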
```bash
streamlit run app.py
```

| Name | Edwin Korir |
|---|---|
| Email | ekorir99@gmail.com |
| GitHub | github.com/Edwinkorir38 |
| LinkedIn | linkedin.com/in/edwin-korir-90a794382 |



