Welcome to this project on Persian text emotion classification! This notebook outlines a complete workflow for exploring, cleaning, and modeling Persian text data to predict emotional categories. Harnessing the Hazm library for linguistic preprocessing, FastText for semantic embeddings, and scikit-learn for classic ML algorithms, we aim to deliver robust and interpretable results.
Note: the complete code is available in the notebook file, but its outputs have been cleared. To see the outputs, refer to the code+output version.
Dataset: 4,924 Persian sentences, each labeled with one of four emotions: SAD, HAPPY, ANGRY, or OTHER.
| | text | mode |
|---|---|---|
| 0 | کی گفته مرد گریه نمیکنه!؟!؟ سیلم امشب سیل #اصفهان | SAD |
| 1 | عکسی که چند روز پیش گذاشته بودم این فیلم الانش... | OTHER |
| 2 | تنهاییم شبیه تنهاییه ظهرای بچگیم شده وقتی که ه... | SAD |
| 3 | خوبه تمام قسمتهای گوشی رو محافظت میکنه | HAPPY |
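The inspection above can be reproduced with pandas. A minimal sketch follows; the actual dataset file name is not given in this write-up, so a tiny inline frame stands in for it:

```python
import pandas as pd

# Tiny inline stand-in for the real dataset; in the notebook this would be
# a pd.read_csv call on the (unspecified) dataset file.
df = pd.DataFrame({
    "text": [
        "کی گفته مرد گریه نمیکنه!؟!؟ سیلم امشب سیل #اصفهان",
        "خوبه تمام قسمتهای گوشی رو محافظت میکنه",
    ],
    "mode": ["SAD", "HAPPY"],
})

print(df.shape)                    # (2, 2)
print(df["mode"].value_counts())   # class distribution over the emotions
```

Checking the class distribution early matters here: with four emotion classes, a skewed `mode` column would argue for stratified splits later in the pipeline.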
In this stage, we leverage the powerful Hazm library—designed specifically for Persian text processing—to clean and standardize our dataset. Proper preprocessing is crucial for improving model performance and ensuring that linguistic nuances of Persian are accurately captured.
Below are the sequential steps applied to each sentence:
1. **Removing Repeated Characters**: Excessive repetition (e.g., سلامممممممم) is reduced to a single character occurrence (سلام) to avoid bias from elongated expressions.
2. **Replacing English Numbers with Persian Numbers**: All English digits (0–9) are converted to their Persian counterparts (۰–۹) to maintain numeric consistency.
3. **Removing Diacritics from Words**: Diacritical marks (e.g., َ ً ُ ٌ) are stripped to normalize word forms and simplify tokenization.
4. **Correcting Spacing in Sentences**: Extra spaces and missing spaces around punctuation are fixed to adhere to standard Persian orthography.
5. **Normalizing the Text**: General normalization, including unifying characters (e.g., Arabic vs. Persian variants), lowercasing any Latin text, and trimming whitespace.
6. **Removing Stop Words**: Common Persian stop words (e.g., و، از، به) are filtered out, allowing the model to focus on semantically rich terms.
7. **Removing Specific Characters**: A predefined set of irrelevant punctuation and symbols (e.g., !؟،؛…) is removed to reduce noise.
8. **Lemmatization**: Using Hazm's `Lemmatizer`, words are reduced to their base form (e.g., میروم → رفتن), decreasing feature dimensionality.
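The character-level steps can be sketched with the standard library alone. The regexes and digit mapping below are illustrative assumptions, not Hazm internals; Hazm's `Normalizer`, `Lemmatizer`, and stop-word list would cover the remaining steps:

```python
import re

# Step 1: collapse runs of a repeated character to a single occurrence.
# Note: this also collapses legitimate doubled letters; matching three or
# more repeats (r"(.)\1{2,}") is a common softer alternative.
def squeeze_repeats(text: str) -> str:
    return re.sub(r"(.)\1+", r"\1", text)

# Step 2: map English digits to their Persian counterparts.
EN_TO_FA = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")
def english_to_persian_digits(text: str) -> str:
    return text.translate(EN_TO_FA)

# Step 3: strip Arabic diacritics (fatha, kasra, damma, tanwin, shadda, sukun),
# which occupy the Unicode range U+064B–U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
def remove_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)

print(squeeze_repeats("سلامممممممم"))     # سلام
print(english_to_persian_digits("2021"))  # ۲۰۲۱
```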
Example Transformation:
| Original Text | After Preprocessing |
|---|---|
| سلامممممممم! حالتون چطوره؟؟؟ از ۲۰۲۱ دارم تمرین میکنم. | سلام حالتون چطوره از ۲۰۲۱ دارم تمرین میکنم |
| کی گفته مرد گریه نمیکنه!؟!؟ سیلم امشب سیل | کی مرد گریه نمیکنه سیلم امشب سیل |
| همه چیز تمومه ۴ ماهه که دارمش ازش خیلی راضیم | همه چیز تمومه ۴ ماهه دارمش راضی |
In this step, we transform text data into numerical features using two approaches and prepare sample test texts.
1. **Label Encoding the Target Feature 🔢**: Convert categorical `mode` labels into numerical codes with `LabelEncoder`, storing them in `mode_decoded` and dropping the original `mode` column.
2. **Word Tokenization 📝**: Break sentences into individual tokens using Hazm's `WordTokenizer`.
3. **Normalizing Tokens 🔤**: Standardize spacing and orthography across tokens for consistency.
4. **Word-to-Vector Conversion 🚀**:
   - **FastText Embeddings**: Map each token to a 300-dimensional dense vector via Hazm's `WordEmbedding` and aggregate (e.g., mean) into a sentence vector.
   - **TF-IDF Transformation**: Use scikit-learn's `TfidfVectorizer` to create sparse vectors reflecting term importance.
5. **Large Array Construction 📊**: Combine vectorized features and encoded labels into a single `large_array`, then reshape and index it to build a new DataFrame `df2` ready for modeling.
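The mean-aggregation step can be sketched as follows, with a random dictionary standing in for Hazm's `WordEmbedding` lookups; the `sentence_vector` helper and its out-of-vocabulary handling are illustrative assumptions:

```python
import numpy as np

DIM = 300
rng = np.random.default_rng(0)

# Stand-in for a FastText model: a dict from token to a 300-d vector.
# In the notebook this role is played by Hazm's WordEmbedding lookups.
vocab = {tok: rng.normal(size=DIM) for tok in ["من", "امروز", "خیلی", "خوشحالم"]}

def sentence_vector(tokens, lookup, dim=DIM):
    """Mean-pool token vectors; skip OOV tokens, return zeros for empty input."""
    vecs = [lookup[t] for t in tokens if t in lookup]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

vec = sentence_vector(["من", "امروز", "خیلی", "خوشحالم"], vocab)
print(vec.shape)  # (300,)
```

Mean pooling keeps every sentence at a fixed `(300,)` shape regardless of length, which is what lets the dense vectors be stacked into the single large array described above.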
Below are some sample sentences processed through our feature creation pipeline:
| Raw Text | FastText Vector Shape | TF-IDF Vector Nonzeros |
|---|---|---|
| من امروز خیلی خوشحالم | (300,) | 5 |
| این موقعیت برام استرسزاست | (300,) | 6 |
| نمیتونم باور کنم این اتفاق افتاد | (300,) | 7 |
In this unified training pipeline, we:
- Train and evaluate multiple classifiers
- Tune hyperparameters and test on unseen data
Split the data 80/20 into train/test sets. Use stratified k-fold CV on training data to evaluate classifiers:
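A minimal sketch of this split-plus-CV setup with scikit-learn, using synthetic 300-dimensional features in place of the real sentence vectors (only the SVC configuration is taken from the results below; everything else is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 400 samples of 300-d features across 4 emotion classes.
X, y = make_classification(n_samples=400, n_features=300, n_informative=30,
                           n_classes=4, random_state=42)

# 80/20 split, stratified so class proportions match in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Stratified 5-fold CV on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(SVC(C=1, kernel="rbf"), X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```

Stratification matters twice here: in the 80/20 split so the test set mirrors the label distribution, and in each CV fold so no fold is starved of a minority emotion.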
| Model | Vector Type | Best Params | CV Accuracy (Mean ± SD) |
|---|---|---|---|
| DecisionTreeClassifier | FastText | `{'criterion': 'gini', 'max_depth': 5, 'min_samples_split': 2}` | 0.44 ± 0.05 |
| RandomForestClassifier | FastText | `{'n_estimators': 200, 'min_samples_split': 4}` | 0.57 ± 0.02 |
| SVC | FastText | `{'C': 1, 'kernel': 'rbf'}` | 0.62 ± 0.02 |
| KNeighborsClassifier | FastText | `{'n_neighbors': 7, 'weights': 'distance'}` | 0.54 ± 0.01 |
| ExtraTreesClassifier | FastText | `{'n_estimators': 200, 'min_samples_split': 5}` | 0.57 ± 0.01 |
| HistGradientBoostingClassifier | FastText | `{}` (default) | 0.60 ± 0.01 |
| VotingClassifier | Ensemble | `{}` (default) | 0.61 ± 0.02 |
| GradientBoostingClassifier | FastText | `{}` (default) | 0.61 ± 0.02 |
| XGBClassifier | FastText | `{'learning_rate': 0.3, 'max_depth': 5}` | 0.58 ± 0.01 |
Insight: SVC achieves the best CV accuracy (0.62), with VotingClassifier and GradientBoostingClassifier close behind (0.61).
Perform grid search to fine-tune hyperparameters for top models. Then evaluate on the held-out test set:
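The tuning step can be sketched with `GridSearchCV`. The grid below is illustrative, as the exact grid used in the notebook is not shown, and synthetic data again stands in for the sentence vectors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the training features and 4-class labels.
X, y = make_classification(n_samples=300, n_features=50, n_informative=20,
                           n_classes=4, random_state=0)

# Illustrative grid around the best SVC values reported above.
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
search = GridSearchCV(
    SVC(), param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```

With `refit=True` (the default), `search.best_estimator_` is refit on the full training data and can be evaluated directly on the held-out test set.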
| Model | Test Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| RandomForestClassifier | 0.56 | 0.58 | 0.55 | 0.56 |
| SVC | 0.61 | 0.63 | 0.60 | 0.61 |
| HistGradientBoostingClassifier | 0.59 | 0.60 | 0.58 | 0.59 |
| VotingClassifier | 0.60 | 0.62 | 0.59 | 0.60 |
| XGBClassifier | 0.57 | 0.59 | 0.56 | 0.57 |
The SVM confusion matrix and test report (confusion matrix figure omitted here):

- Accuracy: 0.6439
- Weighted-average F1 Score: 0.6484
To validate our pipeline, we feed unlabeled sentences into the trained model and inspect predicted emotions:
| Sentence | Predicted Emotion | Notes |
|---|---|---|
| بسیار نرم و لطیف بوده و کیفیت بالایی داره. | 😊 HAPPY | Positive product review, model captures joy. |
| اصلا رنگش با چیزی که تو عکس بود خیلی فرق داشت | 😠 ANGRY | Color mismatch complaint; model flags anger correctly. |
| دلم میخواد زیبا باشم و دوست داشته بشم :( | 😢 SAD | Expresses longing and sadness; model picks up sadness. |
| لج بازیو بذار کنار یه فرصت دیگه بهت میدم | 😐 OTHER | Ambiguous tone; defaulted to OTHER category. |
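A toy end-to-end check of this kind can be sketched with a TF-IDF + SVC pipeline; the four-sentence corpus below is invented for illustration and is not the project data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy corpus, one preprocessed sentence per emotion class.
texts = ["خیلی خوشحالم امروز", "خیلی ناراحتم", "عصبانی شدم از دستت", "کتاب روی میز است"]
labels = ["HAPPY", "SAD", "ANGRY", "OTHER"]

# TF-IDF features into an SVC, mirroring the sparse-vector branch of the
# pipeline (the notebook also feeds FastText sentence vectors to SVC).
clf = make_pipeline(TfidfVectorizer(), SVC(C=1, kernel="rbf"))
clf.fit(texts, labels)

# Unlabeled sentence in, predicted emotion label out.
pred = clf.predict(["امروز خوشحالم"])[0]
print(pred)
```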
This Persian text emotion classification project demonstrates a full end-to-end pipeline—from raw data exploration and meticulous preprocessing with Hazm, through comprehensive feature engineering using both FastText embeddings and TF-IDF, to rigorous model selection, hyperparameter tuning, and evaluation. Key insights include:
- Data Quality Matters: Thorough cleaning and normalization steps significantly improve representation consistency.
- Semantic Vectors vs. TF-IDF: FastText embeddings yielded richer contextual features, slightly outperforming TF-IDF across most classifiers.
- Model Diversity: Ensemble methods like VotingClassifier and robust classifiers like SVM provided the best generalization performance.
Moving forward, further enhancements could include deep learning architectures (e.g., Transformers), advanced hyperparameter optimization, and deployment in a production environment using Flask or FastAPI. This framework is extensible and can be adapted to other Persian NLP tasks.


