🎬 Data Analysis, Sales Prediction, and Iranian Movies Recommendation System

This repository contains the official implementation of my bachelor's thesis project at Isfahan University of Technology. The project presents a data analysis and machine learning framework for predicting film sales, together with a content-based recommender system tailored to Iranian cinema.

🧠 Overview

Over the last decade, data and artificial intelligence have revolutionized industries from healthcare to entertainment. Inspired by platforms such as Netflix and Filimo, this project explores how machine learning and data-driven systems can enhance Iran’s cinematic landscape.

The work integrates:

Data collection from Iranian film databases (Soureh Cinema, Cinematicket, etc.)
Exploratory and statistical analysis to uncover viewing and revenue patterns
Predictive modeling for box office success
Content-based recommendation using textual similarity
Interactive visualization through Power BI and Streamlit

📂 Structure


.
├── data/                          # datasets
├── notebooks/                     # Jupyter notebooks for EDA, ML, and recommender
├── src/                           # Python source code
├── reports/                       # thesis & visual report
├── demo/                          # demo video
├── requirements.txt               # Python dependencies
└── README.md                      # project overview

📊 Dataset

The dataset covers Iranian films released from 2011 to 2021 (the 1390–1400 SH decade), collected and cleaned from multiple sources, including Soureh Cinema and Cinematicket.

🎬 Features (16 total)

| Type | Example Variables | Description |
| --- | --- | --- |
| Qualitative | Title, Genre, Director, Stars | Film metadata |
| Quantitative | Sale, Audience Count, IMDb Rating, Duration | Numerical predictors |
| Popularity Indicators | Instagram Followers of Lead Actors | Measures celebrity influence |

🧩 Methodology

1. Data Collection

Web scraping and manual compilation from national cinema sources
Cleaning, normalization, and handling of missing values

2. Data Visualization

Correlation matrices and scatter plots for feature relationships
Trend analysis of top-grossing genres, directors, and actors
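
As a rough illustration of the correlation step, the sketch below computes a Pearson correlation matrix in plain Python. The column names (`duration_min`, `audience_k`, `sale_m`) and all values are invented for the example and are not taken from the project's dataset; the notebooks may well use a library such as pandas instead.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {name: {name: r}} for every pair of numeric columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

# Made-up values, for illustration only (not from the real dataset).
films = {
    "duration_min": [90, 100, 110, 120, 95],
    "audience_k": [150, 300, 280, 500, 120],
    "sale_m": [300, 600, 560, 1000, 240],  # exactly 2 x audience_k in this toy data
}

matrix = correlation_matrix(films)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Because `sale_m` is an exact multiple of `audience_k` in this toy data, their correlation comes out as 1; real columns would show weaker relationships.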

3. Data Preprocessing

Outlier detection and treatment
Missing value imputation using Random Forest Regression
One-Hot Encoding for categorical features
Normalization of numerical variables
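
Two of the preprocessing steps listed above, one-hot encoding and normalization, can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the sample rows are invented, min-max scaling is assumed as the normalization method (the thesis may use a different one), and the Random Forest imputation step is left out.

```python
def one_hot(values):
    """One-hot encode a list of category labels; returns (categories, rows)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def min_max(values):
    """Rescale numbers linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sample rows (made up for illustration).
genres = ["Comedy", "Social", "Comedy", "Drama"]
durations = [90, 120, 105, 100]

categories, genre_rows = one_hot(genres)
scaled = min_max(durations)

# Combine the encoded genre columns and the scaled duration into feature rows.
features = [row + [s] for row, s in zip(genre_rows, scaled)]
print(categories)
print(features)
```

Scaling matters in particular for distance-based models such as KNN, where a feature with a large numeric range would otherwise dominate the distance.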

4. Machine Learning Models

Logistic Regression – baseline classifier
Stochastic Gradient Descent (SGD) – linear classifier trained efficiently with SGD
K-Nearest Neighbors (KNN) – best overall performance (F1 = 0.75)
Random Forest – close second (F1 ≈ 0.74), with strong interpretability
Gradient Boosting – moderate performance (F1 ≈ 0.65)

Among the models tested, the KNN classifier performed best at predicting film sales success.
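
To make the KNN and F1 terminology concrete, here is a small self-contained sketch of k-nearest-neighbors classification and the F1 metric. It is not the project's code: the two features, the labels, and the definition of "successful" are hypothetical, and the real pipeline presumably relies on a machine learning library rather than hand-written functions.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up, already-normalized features: (duration, lead-actor popularity).
# Label 1 = "successful", 0 = "not successful" (hypothetical definition).
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
train_y = [0, 0, 0, 1, 1, 1]

test_X = [(0.85, 0.8), (0.1, 0.1), (0.75, 0.85), (0.2, 0.25)]
test_y = [1, 0, 1, 1]  # the last film is a deliberately hard case

pred_y = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
score = f1_score(test_y, pred_y)
print(pred_y, round(score, 2))
```

In this toy data the last test film is misclassified, so recall drops below 1 while precision stays at 1, which is the kind of trade-off the F1 score summarizes.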

5. Recommendation System

A content-based recommender was built using:

TF-IDF to vectorize film metadata (genre, director, year, stars)
Cosine Similarity to measure similarity between films

The system returns the top-N (default = 5) most similar movies to a selected title, handling typos via fuzzy string matching.
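
The idea described above can be sketched with the standard library alone. This is an illustrative approximation, not the repository's implementation: the catalogue entries are invented placeholders, the smoothed IDF formula is one common variant chosen for the example, and `difflib.get_close_matches` stands in for whatever fuzzy matching library the project actually uses.

```python
import difflib
import math
from collections import Counter

# Hypothetical catalogue: title -> metadata string (genre, director, year, stars).
# All entries are invented placeholders.
FILMS = {
    "Film A": "comedy director_one 2015 actor_x actor_y",
    "Film B": "comedy director_one 2016 actor_x actor_z",
    "Film C": "social director_two 2018 actor_w actor_v",
    "Film D": "social director_two 2019 actor_w actor_y",
}

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a {title: text} mapping."""
    tokens = {title: text.split() for title, text in docs.items()}
    n_docs = len(tokens)
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    # Smoothed IDF, so every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return {
        title: {t: count * idf[t] for t, count in Counter(words).items()}
        for title, words in tokens.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(query, docs=FILMS, top_n=5):
    """Return up to top_n titles most similar to the closest match for query."""
    # Fuzzy title lookup: tolerate typos in the requested title.
    match = difflib.get_close_matches(query, list(docs), n=1, cutoff=0.6)
    if not match:
        return []
    title = match[0]
    vectors = tfidf_vectors(docs)
    scores = {t: cosine(vectors[title], v) for t, v in vectors.items() if t != title}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("Flim A", top_n=2))  # title misspelled on purpose
```

Here the misspelled query resolves to "Film A", and "Film B" ranks first because it shares the genre, director, and one lead actor with it. An unrecognizable title returns an empty list rather than a guess.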

6. Visualization and Deployment

Power BI dashboard for data insights
Streamlit application for real-time recommendation demo

🧪 Results

Best Predictive Model: K-Nearest Neighbors
- F1 Score = 0.75, Accuracy = 0.72
Most Influential Factors:
- Genre (especially Comedy and Social)
- Film Duration
- Ticket Price
- Popularity of Lead Actors
Recommender Validation: the system produced thematically coherent recommendations and is reported to outperform Filimo’s native system.

👥 Author & Supervision

Author:
Amir Masoud Almasi

Supervisor:
Prof. Reyhaneh Reikhtehgaran
Department of Mathematical Sciences
Isfahan University of Technology
