Description
Describe the bug
When training models, the code used raw string labels (e.g., "true", "false", "half-true") without encoding them. Some classifiers (e.g., LogisticRegression, RandomForestClassifier) expect numerical labels, which caused compatibility issues.
Additionally, TF-IDF vectorization was handled outside the model definitions, leading to duplicated preprocessing logic and potential inconsistencies.
To Reproduce
Steps to reproduce the behavior:
Run the training script with the original code.
Use a classifier like LogisticRegression or RandomForestClassifier.
Observe that the model may fail or behave unexpectedly due to string labels.
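The steps above can be sketched roughly as follows; the texts and labels here are hypothetical stand-ins for the project's dataset, but the pattern (TF-IDF applied outside the model, raw string labels fed to the classifier) matches the bug description:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data standing in for the real dataset
texts = [
    "the economy grew last year",
    "crime rates have fallen",
    "taxes were cut in half",
]
labels = ["true", "false", "half-true"]  # raw string labels, never encoded

# TF-IDF handled outside the model definition (the duplicated preprocessing)
vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# String labels passed straight to the classifier
clf = LogisticRegression()
clf.fit(X, labels)
```

Any code that later assumes integer labels (metrics, plotting, other classifiers) then has to re-handle the string classes separately, which is where the inconsistencies showed up.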
Expected behavior
Labels should be encoded into integers so all classifiers can handle them.
Preprocessing (TF-IDF) should be part of the model pipeline, ensuring consistent and reusable workflows.
Environment (please complete the following):
OS: [e.g., Windows 11, Ubuntu 22.04]
Python Version: 3.x
Libraries:
scikit-learn
pandas
matplotlib
seaborn
Additional context
This bug was resolved by:
Adding Label Encoding with LabelEncoder to convert string labels into numeric form.
Refactoring models into Pipelines (TfidfVectorizer + classifier), reducing preprocessing duplication and improving maintainability.
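A minimal sketch of the resolved version, again with hypothetical data; the pipeline step names (`"tfidf"`, `"clf"`) are illustrative choices, not names taken from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical data standing in for the real dataset
texts = [
    "the economy grew last year",
    "crime rates have fallen",
    "taxes were cut in half",
    "unemployment is at a record low",
]
labels = ["true", "false", "half-true", "true"]

# 1. Encode string labels into integers so any classifier can consume them
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# 2. Bundle TF-IDF and the classifier into one Pipeline so preprocessing
#    is defined once and applied consistently at fit and predict time
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, y)

# Predictions come back as integers; inverse_transform recovers the strings
preds = encoder.inverse_transform(model.predict(texts))
```

Swapping in RandomForestClassifier (or any other estimator) only changes the `"clf"` step; the TF-IDF preprocessing and label handling stay identical.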