In a world overloaded with content, recommendation systems help users quickly find what they love.
This project uses content-based filtering powered by cosine similarity to recommend the most similar movies based on metadata.
The entire data analysis and model development pipeline follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework — ensuring the work is structured, reproducible, and production-ready.
Built with:
- Python
- Machine Learning (Cosine Similarity + Bag-of-Words)
- NLTK (Porter Stemmer)
- Streamlit
- TMDB 5000 dataset
Movie Chosen → Tag Construction → Text Vectorisation (BoW)
→ Cosine Similarity Search → Top 5 Similar Movies
Tag Construction combines five metadata sources per movie:
tags = overview_words + genres + keywords + top_3_cast + director
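As a sketch, assuming each metadata field has already been parsed into Python lists (the field and function names below are illustrative, not the notebook's actual identifiers):

```python
# Illustrative sketch of tag construction; field names are assumptions,
# not the notebook's actual column names.
def build_tags(overview, genres, keywords, cast, crew):
    top_cast = cast[:3]  # keep only the three top-billed actors
    directors = [m["name"] for m in crew if m.get("job") == "Director"]
    tokens = overview.split() + genres + keywords + top_cast + directors
    return " ".join(tokens).lower()

tags = build_tags(
    overview="A paraplegic Marine is dispatched to the moon Pandora",
    genres=["Action", "ScienceFiction"],
    keywords=["spacewar", "future"],
    cast=["SamWorthington", "ZoeSaldana", "SigourneyWeaver", "StephenLang"],
    crew=[{"name": "JamesCameron", "job": "Director"}],
)
```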
| UI Preview | Recommendations |
|---|---|
| ![]() | ![]() |
The notebook (Movie_Recommender_CRISP_DM.ipynb) is structured across six CRISP-DM phases:
| Phase | Description | Key Output |
|---|---|---|
| 1. Business Understanding | Define problem & success criteria | Content-based filter; cold-start safe; explainable |
| 2. Data Understanding | Explore raw data & missingness | 4,803 movies + credits; JSON-encoded features identified |
| 3. Data Preparation | Parse, engineer & normalise features | Unified tags field; stemmed & lowercased |
| 4. Modelling | Vectorise & compute similarity | CountVectorizer (5k features) → (4,805 × 4,805) matrix |
| 5. Evaluation | Sanity checks & genre analysis | Franchise/genre coherence confirmed |
| 6. Deployment | Serialise artefacts | pickle + gzip files ready for Streamlit |
The notebook contains 12 visualisations across all phases:
| # | Chart | Phase |
|---|---|---|
| 1 | CRISP-DM Process Flow Diagram | Business Understanding |
| 2 | Missing Data Overview (both datasets) | Data Understanding |
| 3 | Top 15 Genres & Top 20 Keywords | Data Preparation |
| 4 | Most Prolific Actors & Directors | Data Preparation |
| 5 | Corpus Temporal & Rating Distributions | Data Preparation |
| 6 | Top 30 Vocabulary Terms (post-stemming) | Modelling |
| 7 | Word Cloud of Movie Tags Corpus | Modelling |
| 8 | Similarity Matrix Heatmap & Score Distribution | Modelling |
| 9 | Recommendation Scores (4 test queries, 2×2 grid) | Evaluation |
| 10 | Genre Profile Matrix (query vs recommendations) | Evaluation |
| 11 | Top 10 Most Similar Movie Pairs | Evaluation |
| 12 | Artefact Size: Raw vs Compressed | Deployment |
**Content-Based Filtering (this project)**
- Uses item attributes (genre, cast, keywords, director, plot)
- No user history required — cold-start safe
- Fully explainable recommendations
- Used in: YouTube, Spotify, Twitter

**Collaborative Filtering**
- "People similar to you liked…"
- Based on user–item interactions
- Prone to the cold-start problem

**Hybrid Filtering**
- Best of both worlds
- Used in: Netflix, Amazon, TikTok
TMDB 5000 Movies Dataset
https://www.kaggle.com/tmdb/tmdb-movie-metadata
| File | Rows | Key Fields |
|---|---|---|
| `tmdb_5000_movies.csv` | 4,803 | title, overview, genres, keywords, vote_average, release_date |
| `tmdb_5000_credits.csv` | 4,803 | title, movie_id, cast (JSON), crew (JSON) |
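The JSON-encoded columns can be parsed with the standard library; a minimal sketch (`parse_names` is an illustrative helper, not the notebook's exact function):

```python
import ast

# genres/keywords/cast/crew cells hold JSON-like strings such as
# '[{"id": 28, "name": "Action"}, ...]'; extract just the names.
def parse_names(cell):
    return [entry["name"] for entry in ast.literal_eval(cell)]

row = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'
parse_names(row)  # → ['Action', 'Science Fiction']
```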
Five metadata sources are combined per movie into a single tags string:
overview + genres + keywords + top_3_cast + director
Multi-word names are collapsed to prevent token splits:
"Sam Worthington" → "SamWorthington" | "Science Fiction" → "ScienceFiction"
- Lowercasing — `"Action"` and `"action"` treated identically
- Porter Stemming — `"adventure"` → `"adventur"`, `"dispatched"` → `"dispatch"`
- Stop-word removal — `"the"`, `"is"`, `"and"` filtered out
CountVectorizer with max_features=5000 transforms each movie's tags into a 5,000-dimensional vector.
A symmetric (4,805 × 4,805) matrix is computed once and stored. At query time, a single row lookup returns the top-5 most similar movies.
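These two steps can be sketched with scikit-learn on a toy corpus (three made-up tag strings; the real run uses all movies and produces the full symmetric matrix):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tags = [  # three toy, already-normalised tag strings
    "action adventur spacewar samworthington jamescameron",
    "action adventur alien jamescameron",
    "romanc drama love",
]
vectors = CountVectorizer(max_features=5000).fit_transform(tags)
similarity = cosine_similarity(vectors)  # symmetric (3 × 3) here
```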
| Score | Meaning |
|---|---|
| 1.0 | Identical content fingerprint |
| 0.7+ | Highly similar (e.g. franchise sequels) |
| 0.0 | No shared content signals |
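The row-lookup step described above can be sketched as follows; the `recommend` helper, toy matrix, and titles are illustrative, not the repo's actual code:

```python
def recommend(idx, similarity, titles, k=5):
    """Return the k titles most similar to titles[idx], excluding itself."""
    scores = sorted(enumerate(similarity[idx]), key=lambda p: p[1], reverse=True)
    return [titles[i] for i, _ in scores[1:k + 1]]  # skip the movie itself

titles = ["Avatar", "Aliens", "Titanic"]
toy = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
recommend(0, toy, titles, k=2)  # → ['Aliens', 'Titanic']
```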
Movie-Recommender-System-Using-Machine-Learning/
│
├── artifacts/
│ ├── movie_dict.pkl.gz # Serialised movie metadata
│ └── similarity.pkl.gz # Precomputed similarity matrix
│
├── data/
│ ├── tmdb_5000_movies.csv
│ └── tmdb_5000_credits.csv
│
├── demo/
│
├── Movie_Recommender_CRISP_DM.ipynb # Full CRISP-DM analysis + 12 visualisations
├── app.py
├── requirements.txt
├── README.md
├── LICENSE
└── .gitignore

```bash
git clone https://github.com/Edwinkorir38/Movie-Recommender-System-Using-Machine-Learning.git
conda create -n movie python=3.10 -y
conda activate movie
pip install -r requirements.txt
```

For the word cloud visualisation in the notebook, also run:

```bash
pip install wordcloud
```
Run the notebook end-to-end:
```bash
jupyter notebook Movie_Recommender_CRISP_DM.ipynb
```

This will:
- Load and merge both TMDB CSV files
- Parse JSON-encoded columns (genres, keywords, cast, crew)
- Engineer the unified `tags` feature
- Apply Porter Stemming and stop-word removal
- Fit CountVectorizer and compute the cosine similarity matrix
- Generate all 12 visualisations
- Save `artifacts/movie_dict.pkl.gz` and `artifacts/similarity.pkl.gz`
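The serialisation step is a gzip-compressed pickle round-trip; a minimal sketch (toy data, filename follows the `artifacts/` layout above):

```python
import gzip
import pickle

# Toy stand-in for the real movie metadata dictionary.
movie_dict = {"title": ["Avatar"], "movie_id": [19995]}

with gzip.open("movie_dict.pkl.gz", "wb") as f:
    pickle.dump(movie_dict, f)

with gzip.open("movie_dict.pkl.gz", "rb") as f:
    restored = pickle.load(f)
```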
```bash
streamlit run app.py
```

| Name | Edwin Korir |
|---|---|
| Email | ekorir99@gmail.com |
| GitHub | github.com/Edwinkorir38 |
| LinkedIn | linkedin.com/in/edwin-korir-90a794382 |



