Fix: Implemented Label Encoding and Unified ML Models using Pipelines #139

Draft
itsvishwasj wants to merge 1 commit into Deepika14145:main from itsvishwasj:fix-ml-pipelines-gssoc

Conversation

@itsvishwasj

Fix: Implemented Label Encoding and Unified ML Models using Pipelines

🛠️ Pull Request Template

🏷️ PR Type

  • 🐞 Bug Fix
  • 🛠️ Improvement / Refactor

🔗 Related Issue


📝 Rationale / Motivation

This PR fully resolves Issue #68, which reported errors and instability in the classical machine learning model training workflow.

  1. String Label Error: Fixed runtime errors caused by passing non-numeric string labels (e.g., "half-true") directly to scikit-learn classifiers. Labels are now converted to integers with LabelEncoder before training.
  2. Preprocessing Duplication: Removed inconsistent, redundant TfidfVectorizer steps by bundling the vectorizer and the classifier into a single Pipeline object.

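Together, these changes remove the string-label crash and ensure the same TF-IDF transformation is applied at training and prediction time, which also makes the training code easier to follow.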
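The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the toy statements, the variable names, and the choice of LogisticRegression with max_iter=1000 are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy statements and labels (hypothetical; the real scripts load their own data).
texts = [
    "taxes were cut last year",
    "the moon is made of cheese",
    "unemployment fell slightly",
    "aliens run the senate",
]
labels = ["true", "pants-fire", "half-true", "pants-fire"]

# Map string labels to integers; encoder.classes_ keeps the original names
# in sorted order.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Bundle the vectorizer and the classifier so they are always fitted and
# applied together.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, y)

# Predictions come back as integers; inverse_transform restores the strings.
predicted = encoder.inverse_transform(model.predict(["aliens made of cheese"]))
print(predicted[0])
```

Because the Pipeline receives raw text, callers never handle the TF-IDF matrix directly, so there is no separate vectorizer to keep in sync between training and prediction.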


✨ Description of Changes

  • Core Fixes Applied:

    • Label Encoding: Implemented sklearn.preprocessing.LabelEncoder to convert the six unique string labels to integers across all affected files.
    • Pipeline Integration: Refactored training logic to use sklearn.pipeline.Pipeline, combining TfidfVectorizer with the classifier (NB, LR, RF) in a single object.
  • Files Modified:

    • scripts/fake_news_logreg_rf.py: Implemented Pipelines for LogisticRegression, RandomForestClassifier, and MultinomialNB.
    • module/liar-data-analysis.py: Implemented Label Encoding and Pipelines for analysis examples.
    • module/fake-news-detection-using-nb.ipynb: Applied Label Encoding and the MultinomialNB Pipeline logic.
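One way to give all three models the same preprocessing is to build each Pipeline from a shared recipe. The sketch below is an assumption about how that could look; the helper name build_pipelines, the hyperparameters, and the toy data are hypothetical and not taken from the modified files.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipelines():
    """Return one TF-IDF + classifier Pipeline per model (hypothetical helper)."""
    classifiers = {
        "NB": MultinomialNB(),
        "LR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    }
    # Each Pipeline gets its own vectorizer instance, so the models do not
    # share fitted state.
    return {
        name: Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
        for name, clf in classifiers.items()
    }


# Toy data with already-encoded integer labels (hypothetical).
texts = [
    "taxes were cut last year",
    "the moon is made of cheese",
    "unemployment fell slightly",
    "aliens run the senate",
]
y = [0, 1, 0, 1]

pipelines = build_pipelines()
for name, pipe in pipelines.items():
    pipe.fit(texts, y)
    print(name, pipe.predict(["cheese taxes"]))
```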

🧪 Testing Instructions

  1. Pull this branch and ensure your dependencies are installed.
  2. Run the main comparison script:
    python scripts/fake_news_logreg_rf.py
  3. Expected Results:
    • The script must execute without crashing (no more errors about string labels).
    • Accuracy scores and classification reports for all three models (NB, LR, RF) should be printed to the console.
    • New or updated result files (.md, confusion matrix PNGs) should be generated in the results/ directory.
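For reference, accuracy and a classification report with readable class names can be produced from encoded labels along the following lines. This is an illustrative sketch with made-up labels and predictions, not the script's actual reporting code.

```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

# Hypothetical true and predicted labels, encoded with the same encoder.
encoder = LabelEncoder()
y_true = encoder.fit_transform(["true", "false", "half-true", "false"])
y_pred = encoder.transform(["true", "false", "false", "false"])

# Three of the four predictions match the true labels.
accuracy = accuracy_score(y_true, y_pred)

# Pass the integer labels and the original class names explicitly so the
# report shows strings such as "half-true" instead of 0, 1, 2.
report = classification_report(
    y_true,
    y_pred,
    labels=list(range(len(encoder.classes_))),
    target_names=list(encoder.classes_),
    zero_division=0,
)
print(f"Accuracy: {accuracy:.2f}")
print(report)
```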

👀 Impact Assessment


⚡ Checklist

  • Code follows project’s coding style and guidelines
  • Changes are tested locally
  • Automated tests added/updated (N/A)
  • Documentation updated (N/A)
  • User-facing changes are documented (N/A)
  • Related issue linked
  • No new warnings/errors introduced

⚠️ Breaking Changes

None. This is an internal fix that preserves the input/output of the scripts.


🎯 Priority / Impact Level

  • Priority: High (Critical bug fix)
  • Impact: Medium (Internal code structure)

…r classical ML models (closes #<issue-number>)
@vercel

vercel bot commented Oct 26, 2025

@itsvishwasj is attempting to deploy a commit to the deepika's projects Team on Vercel.

A member of the Team first needs to authorize it.

@itsvishwasj itsvishwasj marked this pull request as draft October 26, 2025 07:02

Development

Successfully merging this pull request may close these issues.

Training pipeline breaks without Label Encoding and integrated TF-IDF
