This repository contains the official implementation of my bachelor's thesis project at Isfahan University of Technology. The project presents a data analysis and machine learning framework for predicting film sales and for building a content-based recommender system tailored to Iranian cinema.
Over the last decade, data and artificial intelligence have revolutionized industries from healthcare to entertainment. Inspired by platforms such as Netflix and Filimo, this project explores how machine learning and data-driven systems can enhance Iran’s cinematic landscape.
The work integrates:
- Data collection from Iranian film databases (Soureh Cinema, Cinematicket, etc.)
- Exploratory and statistical analysis to uncover viewing and revenue patterns
- Predictive modeling for box office success
- Content-based recommendation using textual similarity
- Interactive visualization through Power BI and Streamlit
.
├── data/ # datasets
├── notebooks/ # Jupyter notebooks for EDA, ML, and recommender
├── src/ # Python source code
├── reports/ # thesis & visual report
├── demo/ # demo video
├── requirements.txt # Python dependencies
└── README.md # project overview
The dataset covers Iranian films released between 2011 and 2021 (the decade 1390–1400 SH), collected and cleaned from multiple sources. Its variables fall into three groups:
| Type | Example Variables | Description |
|---|---|---|
| Qualitative | Title, Genre, Director, Stars | Film metadata |
| Quantitative | Sale, Audience Count, IMDb Rating, Duration | Numerical predictors |
| Popularity Indicators | Instagram Followers of Lead Actors | Measures celebrity influence |
- Web scraping and manual compilation from national cinema sources
- Cleaning, normalization, and handling of missing values
- Correlation matrices and scatter plots for feature relationships
- Trend analysis of top-grossing genres, directors, and actors
- Outlier detection and treatment
- Missing value imputation using Random Forest Regression
- One-Hot Encoding for categorical features
- Normalization of numerical variables
- Logistic Regression – baseline classifier
- Stochastic Gradient Descent (SGD) classifier – linear model trained efficiently with SGD
- K-Nearest Neighbors (KNN) – best overall performance (F1 = 0.75)
- Random Forest – strong interpretability (F1 ≈ 0.74)
- Gradient Boosting – moderate performance (F1 ≈ 0.65)
The KNN classifier was the best-performing model for predicting film sales success.
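
The preprocessing and KNN steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the thesis code: the toy records, the two features (genre and duration), the binary "successful" label, and the helper names `one_hot`, `min_max`, `build_encoder`, and `knn_predict` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy records: (genre, duration in minutes, label),
# where label 1 means "commercially successful". Not real data.
FILMS = [
    ("Comedy", 95, 1),
    ("Comedy", 100, 1),
    ("Social", 110, 0),
    ("Social", 120, 0),
    ("Comedy", 90, 1),
    ("Social", 105, 0),
]


def one_hot(value, categories):
    """One-hot encode a categorical value against a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max(value, lo, hi):
    """Scale a number into [0, 1] using the training minimum and maximum."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0


def build_encoder(rows):
    """Fit the encoding on training rows and return an encode(genre, duration) function."""
    genres = sorted({g for g, _ in rows})
    durations = [d for _, d in rows]
    lo, hi = min(durations), max(durations)

    def encode(genre, duration):
        return one_hot(genre, genres) + [min_max(duration, lo, hi)]

    return encode


def knn_predict(train_X, train_y, x, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


encode = build_encoder([(g, d) for g, d, _ in FILMS])
X = [encode(g, d) for g, d, _ in FILMS]
y = [label for _, _, label in FILMS]
prediction = knn_predict(X, y, encode("Comedy", 98), k=3)
```

Scaling the numeric feature before computing distances matters for KNN, because an unscaled duration in minutes would otherwise dominate the one-hot genre columns.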
A content-based recommender was built using:
- TF-IDF to vectorize film metadata (genre, director, year, stars)
- Cosine Similarity to measure similarity between films
The system returns the top-N (default = 5) most similar movies to a selected title, handling typos via fuzzy string matching.
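
A minimal, standard-library sketch of this pipeline is shown below. It is an assumption-laden illustration rather than the project's implementation: the catalogue titles and metadata strings are invented, the function names `tfidf_vectors`, `cosine`, and `recommend` are hypothetical, and `difflib.get_close_matches` stands in for whatever fuzzy matcher the project actually uses.

```python
import difflib
import math
from collections import Counter

# Hypothetical catalogue: title -> metadata string (genre, director, year, stars).
CATALOGUE = {
    "Alpha Story": "comedy director_one 2015 actor_x actor_y",
    "Beta Journey": "comedy director_one 2016 actor_x actor_z",
    "Gamma Night": "social director_two 2018 actor_w actor_v",
    "Delta House": "social director_two 2019 actor_w actor_y",
}


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts of token -> weight) for each title."""
    tokenized = {title: text.split() for title, text in docs.items()}
    n = len(tokenized)
    # Document frequency: in how many films each token appears.
    df = Counter(tok for toks in tokenized.values() for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    vectors = {}
    for title, toks in tokenized.items():
        tf = Counter(toks)
        vectors[title] = {tok: count * idf[tok] for tok, count in tf.items()}
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(query, docs, top_n=5):
    """Return up to top_n titles most similar to the (possibly misspelled) query."""
    match = difflib.get_close_matches(query, list(docs), n=1, cutoff=0.6)
    if not match:
        return []  # no title is close enough to the query
    title = match[0]
    vectors = tfidf_vectors(docs)
    scores = [
        (other, cosine(vectors[title], vectors[other]))
        for other in docs
        if other != title
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:top_n]]


# The misspelled query still resolves to "Alpha Story".
suggestions = recommend("Alfa Story", CATALOGUE, top_n=2)
```

Rare tokens such as a specific director or actor receive a higher IDF weight than common ones such as a popular genre, so films that share a director rank above films that share only a genre.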
- Power BI dashboard for data insights
- Streamlit application for real-time recommendation demo
- Best Predictive Model: K-Nearest Neighbors
  - F1 Score = 0.75, Accuracy = 0.72
- Most Influential Factors:
  - Genre (especially Comedy and Social)
  - Film Duration
  - Ticket Price
  - Popularity of Lead Actors
- Recommender Validation: produced thematically coherent recommendations outperforming Filimo’s native system.
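
For readers unfamiliar with the metrics reported above, the following sketch shows how F1 and accuracy are derived from binary predictions. The labels are made up for illustration and do not reproduce the thesis results; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 1 = successful film, 0 = not successful.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

F1 ignores true negatives, which is why it can differ from accuracy when the classes are imbalanced or when errors concentrate on one class.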
Author: Amir Masoud Almasi

Supervisor: Prof. Reyhaneh Reikhtehgaran

Department of Mathematical Sciences, Isfahan University of Technology