Welcome to this project on Persian text emotion classification! This notebook outlines a complete workflow for exploring, cleaning, and modeling Persian text data to predict emotional categories. Harnessing the Hazm library for linguistic preprocessing, FastText for semantic embeddings, and scikit-learn for classic ML algorithms, we aim to deliver robust and interpretable results.
Note: the complete code is available in the notebook file, but its outputs have been cleared. To see the outputs, refer to the code+output version.
Dataset: 4,924 Persian sentences, each labeled with one of four emotions: SAD, HAPPY, ANGRY, or OTHER.
| | text | mode |
|---|---|---|
| 0 | کی گفته مرد گریه نمیکنه!؟!؟ سیلم امشب سیل #اصفهان | SAD |
| 1 | عکسی که چند روز پیش گذاشته بودم این فیلم الانش... | OTHER |
| 2 | تنهاییم شبیه تنهاییه ظهرای بچگیم شده وقتی که ه... | SAD |
| 3 | خوبه تمام قسمتهای گوشی رو محافظت میکنه | HAPPY |
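The inspection above can be reproduced with pandas. A minimal sketch follows; the actual dataset file name is not given in this write-up, so a tiny inline frame stands in for it:

```python
import pandas as pd

# Tiny inline stand-in for the real dataset; in the notebook this would be
# a pd.read_csv call on the (unspecified) dataset file.
df = pd.DataFrame({
    "text": [
        "کی گفته مرد گریه نمیکنه!؟!؟ سیلم امشب سیل #اصفهان",
        "خوبه تمام قسمتهای گوشی رو محافظت میکنه",
    ],
    "mode": ["SAD", "HAPPY"],
})

print(df.shape)                    # (2, 2)
print(df["mode"].value_counts())   # class distribution over the emotions
```

Checking the class distribution early matters here: with four emotion classes, a skewed `mode` column would argue for stratified splits later in the pipeline.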
In this stage, we leverage the powerful Hazm library—designed specifically for Persian text processing—to clean and standardize our dataset. Proper preprocessing is crucial for improving model performance and ensuring that linguistic nuances of Persian are accurately captured.
Below are the sequential steps applied to each sentence:
1. **Removing Repeated Characters**: Excessive repetition (e.g., سلامممممممم) is reduced to a single character occurrence (سلام) to avoid bias from elongated expressions.
2. **Replacing English Numbers with Persian Numbers**: All English digits (0–9) are converted to their Persian counterparts (۰–۹) to maintain numeric consistency.
3. **Removing Diacritics from Words**: Diacritical marks (e.g., َ ً ُ ٌ) are stripped to normalize word forms and simplify tokenization.
4. **Correcting Spacing in Sentences**: Extra spaces and missing spaces around punctuation are fixed to adhere to standard Persian orthography.
5. **Normalizing the Text**: General normalization, including unifying characters (e.g., Arabic vs. Persian variants), lowercasing any Latin text, and trimming whitespace.
6. **Removing Stop Words**: Common Persian stop words (e.g., و، از، به) are filtered out, allowing the model to focus on semantically rich terms.
7. **Removing Specific Characters**: A predefined set of irrelevant punctuation and symbols (e.g., !؟،؛…) is removed to reduce noise.
8. **Lemmatization**: Using Hazm's `Lemmatizer`, words are reduced to their base form (e.g., میروم → رفتن), decreasing feature dimensionality.
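The character-level steps can be sketched with the standard library alone. The regexes and digit mapping below are illustrative assumptions, not Hazm internals; Hazm's `Normalizer`, `Lemmatizer`, and stop-word list would cover the remaining steps:

```python
import re

# Step 1: collapse runs of a repeated character to a single occurrence.
# Note: this also collapses legitimate doubled letters; matching three or
# more repeats (r"(.)\1{2,}") is a common softer alternative.
def squeeze_repeats(text: str) -> str:
    return re.sub(r"(.)\1+", r"\1", text)

# Step 2: map English digits to their Persian counterparts.
EN_TO_FA = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")
def english_to_persian_digits(text: str) -> str:
    return text.translate(EN_TO_FA)

# Step 3: strip Arabic diacritics (fatha, kasra, damma, tanwin, shadda, sukun),
# which occupy the Unicode range U+064B–U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
def remove_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)

print(squeeze_repeats("سلامممممممم"))     # سلام
print(english_to_persian_digits("2021"))  # ۲۰۲۱
```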
Example Transformation:
| Original Text | After Preprocessing |
|---|---|
| سلامممممممم! حالتون چطوره؟؟؟ از ۲۰۲۱ دارم تمرین میکنم. | سلام حالتون چطوره از ۲۰۲۱ دارم تمرین میکنم |
| کی گفته مرد گریه نمیکنه!؟!؟ سیلم امشب سیل | کی مرد گریه نمیکنه سیلم امشب سیل |
| همه چیز تمومه ۴ ماهه که دارمش ازش خیلی راضیم | همه چیز تمومه ۴ ماهه دارمش راضی |
In this step, we transform text data into numerical features using two approaches and prepare sample test texts.
1. **Label Encoding the Target Feature 🔢**: Convert categorical `mode` labels into numerical codes with `LabelEncoder`, storing them in `mode_decoded` and dropping the original `mode` column.
2. **Word Tokenization 📝**: Break sentences into individual tokens using Hazm's `WordTokenizer`.
3. **Normalizing Tokens 🔤**: Standardize spacing and orthography across tokens for consistency.
4. **Word-to-Vector Conversion 🚀**:
   - **FastText Embeddings**: Map each token to a 300-dimensional dense vector via Hazm's `WordEmbedding` and aggregate (e.g., mean) into a sentence vector.
   - **TF-IDF Transformation**: Use scikit-learn's `TfidfVectorizer` to create sparse vectors reflecting term importance.
5. **Large Array Construction 📊**: Combine vectorized features and encoded labels into a single `large_array`, then reshape and index it to build a new DataFrame `df2` ready for modeling.
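The mean-aggregation step can be sketched as follows, with a random dictionary standing in for Hazm's `WordEmbedding` lookups; the `sentence_vector` helper and its out-of-vocabulary handling are illustrative assumptions:

```python
import numpy as np

DIM = 300
rng = np.random.default_rng(0)

# Stand-in for a FastText model: a dict from token to a 300-d vector.
# In the notebook this role is played by Hazm's WordEmbedding lookups.
vocab = {tok: rng.normal(size=DIM) for tok in ["من", "امروز", "خیلی", "خوشحالم"]}

def sentence_vector(tokens, lookup, dim=DIM):
    """Mean-pool token vectors; skip OOV tokens, return zeros for empty input."""
    vecs = [lookup[t] for t in tokens if t in lookup]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

vec = sentence_vector(["من", "امروز", "خیلی", "خوشحالم"], vocab)
print(vec.shape)  # (300,)
```

Mean pooling keeps every sentence at a fixed `(300,)` shape regardless of length, which is what lets the dense vectors be stacked into the single large array described above.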
Below are some sample sentences processed through our feature creation pipeline:
| Raw Text | FastText Vector Shape | TF-IDF Vector Nonzeros |
|---|---|---|
| من امروز خیلی خوشحالم | (300,) | 5 |
| این موقعیت برام استرسزاست | (300,) | 6 |
| نمیتونم باور کنم این اتفاق افتاد | (300,) | 7 |
In this unified training pipeline, we:
- Train and evaluate multiple classifiers
- Tune hyperparameters and test on unseen data
Split the data 80/20 into train/test sets. Use stratified k-fold CV on training data to evaluate classifiers:
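A minimal sketch of this split-plus-CV setup with scikit-learn, using synthetic 300-dimensional features in place of the real sentence vectors (only the SVC configuration is taken from the results below; everything else is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 400 samples of 300-d features across 4 emotion classes.
X, y = make_classification(n_samples=400, n_features=300, n_informative=30,
                           n_classes=4, random_state=42)

# 80/20 split, stratified so class proportions match in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Stratified 5-fold CV on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(SVC(C=1, kernel="rbf"), X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```

Stratification matters twice here: in the 80/20 split so the test set mirrors the label distribution, and in each CV fold so no fold is starved of a minority emotion.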
| Model | Vector Type | Best Params | CV Accuracy (Mean ± SD) |
|---|---|---|---|
| DecisionTreeClassifier | FastText | `{'criterion': 'gini', 'max_depth': 5, 'min_samples_split': 2}` | 0.44 ± 0.05 |
| RandomForestClassifier | FastText | `{'n_estimators': 200, 'min_samples_split': 4}` | 0.57 ± 0.02 |
| SVC | FastText | `{'C': 1, 'kernel': 'rbf'}` | 0.62 ± 0.02 |
| KNeighborsClassifier | FastText | `{'n_neighbors': 7, 'weights': 'distance'}` | 0.54 ± 0.01 |
| ExtraTreesClassifier | FastText | `{'n_estimators': 200, 'min_samples_split': 5}` | 0.57 ± 0.01 |
| HistGradientBoostingClassifier | FastText | `{}` (default) | 0.60 ± 0.01 |
| VotingClassifier | Ensemble | `{}` (default) | 0.61 ± 0.02 |
| GradientBoostingClassifier | FastText | `{}` (default) | 0.61 ± 0.02 |
| XGBClassifier | FastText | `{'learning_rate': 0.3, 'max_depth': 5}` | 0.58 ± 0.01 |
Insight: SVC achieves the best CV accuracy (0.62), with VotingClassifier and GradientBoostingClassifier close behind (0.61).
Perform grid search to fine-tune hyperparameters for top models. Then evaluate on the held-out test set:
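The tuning step can be sketched with `GridSearchCV`. The grid below is illustrative, as the exact grid used in the notebook is not shown, and synthetic data again stands in for the sentence vectors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the training features and 4-class labels.
X, y = make_classification(n_samples=300, n_features=50, n_informative=20,
                           n_classes=4, random_state=0)

# Illustrative grid around the best SVC values reported above.
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
search = GridSearchCV(
    SVC(), param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```

With `refit=True` (the default), `search.best_estimator_` is refit on the full training data and can be evaluated directly on the held-out test set.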
| Model | Test Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| RandomForestClassifier | 0.56 | 0.58 | 0.55 | 0.56 |
| SVC | 0.61 | 0.63 | 0.60 | 0.61 |
| HistGradientBoostingClassifier | 0.59 | 0.60 | 0.58 | 0.59 |
| VotingClassifier | 0.60 | 0.62 | 0.59 | 0.60 |
| XGBClassifier | 0.57 | 0.59 | 0.56 | 0.57 |
The SVM confusion matrix and test report (confusion matrix figure omitted here):

- Accuracy: 0.6439
- Weighted-average F1 Score: 0.6484
To validate our pipeline, we feed unlabeled sentences into the trained model and inspect predicted emotions:
| Sentence | Predicted Emotion | Notes |
|---|---|---|
| بسیار نرم و لطیف بوده و کیفیت بالایی داره. | 😊 HAPPY | Positive product review, model captures joy. |
| اصلا رنگش با چیزی که تو عکس بود خیلی فرق داشت | 😠 ANGRY | Color mismatch complaint; model flags anger correctly. |
| دلم میخواد زیبا باشم و دوست داشته بشم :( | 😢 SAD | Expresses longing and sadness; model picks up sadness. |
| لج بازیو بذار کنار یه فرصت دیگه بهت میدم | 😐 OTHER | Ambiguous tone; defaulted to OTHER category. |
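A toy end-to-end check of this kind can be sketched with a TF-IDF + SVC pipeline; the four-sentence corpus below is invented for illustration and is not the project data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy corpus, one preprocessed sentence per emotion class.
texts = ["خیلی خوشحالم امروز", "خیلی ناراحتم", "عصبانی شدم از دستت", "کتاب روی میز است"]
labels = ["HAPPY", "SAD", "ANGRY", "OTHER"]

# TF-IDF features into an SVC, mirroring the sparse-vector branch of the
# pipeline (the notebook also feeds FastText sentence vectors to SVC).
clf = make_pipeline(TfidfVectorizer(), SVC(C=1, kernel="rbf"))
clf.fit(texts, labels)

# Unlabeled sentence in, predicted emotion label out.
pred = clf.predict(["امروز خوشحالم"])[0]
print(pred)
```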
This Persian text emotion classification project demonstrates a full end-to-end pipeline—from raw data exploration and meticulous preprocessing with Hazm, through comprehensive feature engineering using both FastText embeddings and TF-IDF, to rigorous model selection, hyperparameter tuning, and evaluation. Key insights include:
- Data Quality Matters: Thorough cleaning and normalization steps significantly improve representation consistency.
- Semantic Vectors vs. TF-IDF: FastText embeddings yielded richer contextual features, slightly outperforming TF-IDF across most classifiers.
- Model Diversity: Ensemble methods like VotingClassifier and robust classifiers like SVM provided the best generalization performance.
Moving forward, further enhancements could include deep learning architectures (e.g., Transformers), advanced hyperparameter optimization, and deployment in a production environment using Flask or FastAPI. This framework is extensible and can be adapted to other Persian NLP tasks.


