
Training pipeline breaks without Label Encoding and integrated TF-IDF #68

@shambhavi0608

Description


Describe the bug

When training models, the code passed raw string labels (e.g., "true", "false", "half-true") to the classifiers without encoding them. Some classifiers (e.g., LogisticRegression, RandomForestClassifier) expect numerical labels, which caused compatibility issues.
Additionally, TF-IDF vectorization was handled outside the model definitions, so the same preprocessing logic was duplicated for each model and could drift out of sync between training and prediction.

To Reproduce

Steps to reproduce the behavior:

Run the training script with the original code.

Use a classifier like LogisticRegression or RandomForestClassifier.

Observe that the model may fail or behave unexpectedly due to string labels.

Expected behavior

Labels should be encoded into integers so all classifiers can handle them.

Preprocessing (TF-IDF) should be part of the model pipeline, ensuring consistent and reusable workflows.
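The expected encoding step can be sketched with scikit-learn's LabelEncoder, using the label strings from the bug description (the sample list here is illustrative):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["true", "false", "half-true", "true"]

le = LabelEncoder()
y = le.fit_transform(labels)

# LabelEncoder sorts classes alphabetically: ['false', 'half-true', 'true']
print(list(le.classes_))
print(list(y))  # [2, 0, 1, 2]
```

Because the encoder stores the class order, `le.inverse_transform` can map predictions back to the original string labels for reporting.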


Environment (please complete the following):

OS: [e.g., Windows 11, Ubuntu 22.04]

Python Version: 3.x

Libraries:

scikit-learn

pandas

matplotlib

seaborn

Additional context

This bug was resolved by:

Adding Label Encoding with LabelEncoder to convert string labels into numeric form.

Refactoring models into Pipelines (TfidfVectorizer + classifier), reducing preprocessing duplication and improving maintainability.
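The fix described above can be sketched as follows; the sample texts and labels are hypothetical stand-ins for the project's dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data standing in for the project's dataset
texts = [
    "the statement is accurate",
    "this claim is completely fabricated",
    "partially correct but missing context",
    "verified by multiple sources",
]
labels = ["true", "false", "half-true", "true"]

# Encode string labels to integers once, up front
le = LabelEncoder()
y = le.fit_transform(labels)

# TF-IDF now lives inside the pipeline, so the same preprocessing
# is applied consistently at fit time and predict time
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, y)

# Decode integer predictions back to the original string labels
preds = le.inverse_transform(model.predict(texts))
```

Swapping `LogisticRegression` for `RandomForestClassifier` (or any other estimator) requires changing only the final pipeline step, which is what removes the duplicated preprocessing.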
