Description
Describe the bug
When training models, the code used raw string labels (e.g., "true", "false", "half-true") without encoding them. Some classifiers (e.g., LogisticRegression, RandomForestClassifier) expect numerical labels, which caused compatibility issues.
Additionally, TF-IDF vectorization was handled outside the model definitions, leading to duplicated preprocessing logic and potential inconsistencies.
To Reproduce
Steps to reproduce the behavior:
Run the training script with the original code.
Use a classifier like LogisticRegression or RandomForestClassifier.
Observe that the model may fail or behave unexpectedly due to string labels.
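The steps above can be sketched roughly as follows; the texts and labels here are hypothetical stand-ins for the project's dataset, but the pattern (TF-IDF applied outside the model, raw string labels fed to the classifier) matches the bug description:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data standing in for the real dataset
texts = [
    "the economy grew last year",
    "crime rates have fallen",
    "taxes were cut in half",
]
labels = ["true", "false", "half-true"]  # raw string labels, never encoded

# TF-IDF handled outside the model definition (the duplicated preprocessing)
vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# String labels passed straight to the classifier
clf = LogisticRegression()
clf.fit(X, labels)
```

Any code that later assumes integer labels (metrics, plotting, other classifiers) then has to re-handle the string classes separately, which is where the inconsistencies showed up.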
Expected behavior
Labels should be encoded into integers so all classifiers can handle them.
Preprocessing (TF-IDF) should be part of the model pipeline, ensuring consistent and reusable workflows.
Environment (please complete the following):
OS: [e.g., Windows 11, Ubuntu 22.04]
Python Version: 3.x
Libraries:
scikit-learn
pandas
matplotlib
seaborn
Additional context
This bug was resolved by:
Adding Label Encoding with LabelEncoder to convert string labels into numeric form.
Refactoring models into Pipelines (TfidfVectorizer + classifier), reducing preprocessing duplication and improving maintainability.
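A minimal sketch of the resolved version, again with hypothetical data; the pipeline step names (`"tfidf"`, `"clf"`) are illustrative choices, not names taken from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical data standing in for the real dataset
texts = [
    "the economy grew last year",
    "crime rates have fallen",
    "taxes were cut in half",
    "unemployment is at a record low",
]
labels = ["true", "false", "half-true", "true"]

# 1. Encode string labels into integers so any classifier can consume them
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# 2. Bundle TF-IDF and the classifier into one Pipeline so preprocessing
#    is defined once and applied consistently at fit and predict time
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, y)

# Predictions come back as integers; inverse_transform recovers the strings
preds = encoder.inverse_transform(model.predict(texts))
```

Swapping in RandomForestClassifier (or any other estimator) only changes the `"clf"` step; the TF-IDF preprocessing and label handling stay identical.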