
📰 News Article Semantic Classification

Macro F1-Score: 0.741 on the public test set — achieving competitive performance through careful feature engineering, without deep learning.

A supervised NLP pipeline for classifying news articles into 7 semantic categories, developed as part of the Data Science & Machine Learning Laboratory course at Politecnico di Torino (Winter 2026).

Authors: Leonardo Passafiume (s358616), Lucio Baiocchi (s360244)


🎯 Problem Statement

Given a dataset of ~80,000 web-scraped news articles — each with metadata including source, title, timestamp, and PageRank — the goal is to build a classifier that accurately categorizes them into:

| Label | Category | Samples |
|-------|----------|---------|
| 0 | International News | 23,542 |
| 1 | Business | 10,588 |
| 2 | Technology | 11,161 |
| 3 | Entertainment | 9,977 |
| 4 | Sports | 8,574 |
| 5 | General News | 13,053 |
| 6 | Health | 3,102 |

Key challenges include class imbalance (Health has ~7× fewer samples than International News), lexical overlap between semantically close categories (International vs. General News share 67% of their top-100 keywords), and noisy HTML artifacts embedded in the raw text.
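The ~7× imbalance can be counteracted with inverse-frequency class weights. The repository does not spell out its exact handling, so this is a minimal sketch using scikit-learn's `compute_class_weight` with the counts from the table above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts from the table above (classes 0..6)
counts = [23542, 10588, 11161, 9977, 8574, 13053, 3102]
y = np.repeat(np.arange(7), counts)

# "balanced" weights each class inversely to its frequency:
#   w_c = n_samples / (n_classes * n_c)
weights = compute_class_weight("balanced", classes=np.arange(7), y=y)
for label, w in zip(range(7), weights):
    print(f"class {label}: weight {w:.2f}")
# Health (class 6) ends up weighted roughly 7.6x heavier than class 0;
# SGDClassifier can consume this directly via class_weight="balanced".
```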


🧠 Methodology

1. Preprocessing & Feature Engineering

  • No-Clean HTML Strategy — Instead of stripping all HTML, semantic content (e.g., image alt-text) is preserved as additional signal.
  • Source Boosting — High-purity sources (e.g., ESPN → Sports, CNET → Technology) are amplified via text repetition, effectively turning a text classification problem into a simpler metadata-routing task for reliable publishers.
  • Temporal Features — Day-of-week and time-of-day (morning / afternoon / night) are extracted from timestamps and one-hot encoded.
  • Length Features — Log-transformed article and title lengths serve as lightweight discriminative signals.
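A minimal sketch of the temporal and length features listed above. The column names (`timestamp`, `text`, `title`), the helper name `add_meta_features`, and the time-of-day cutoffs are assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd

def add_meta_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive temporal and length features."""
    ts = pd.to_datetime(df["timestamp"])
    df["day_of_week"] = ts.dt.day_name()
    # Bucket the hour into morning / afternoon / night (assumed cutoffs)
    df["time_of_day"] = pd.cut(ts.dt.hour,
                               bins=[-1, 11, 17, 23],
                               labels=["morning", "afternoon", "night"])
    # Log-transform lengths to tame their heavy-tailed distributions
    df["log_text_len"] = np.log1p(df["text"].str.len())
    df["log_title_len"] = np.log1p(df["title"].str.len())
    # One-hot encode the categorical temporal features
    return pd.get_dummies(df, columns=["day_of_week", "time_of_day"])
```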

2. Hybrid TF-IDF Vectorization

A dual-branch text representation captures complementary patterns:

| Branch | Analyzer | N-gram Range | Max Features | SelectKBest |
|--------|----------|--------------|--------------|-------------|
| Word-level | word | (1, 2) | 100,000 | 60,000 |
| Char-level | char | (3, 5) | 50,000 | 25,000 |

  • Word-level features use SnowballStemmer to reduce vocabulary and capture semantic similarity.
  • Character-level n-grams add robustness to misspellings and capture sub-word morphological patterns.
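The dual-branch setup can be sketched as a scikit-learn `FeatureUnion`. The vectorizer settings mirror the table; `make_hybrid_vectorizer` and the `chi2` scoring choice are assumptions, and the word-level stemming step is omitted here:

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def make_hybrid_vectorizer(k_word=60000, k_char=25000):
    """Hypothetical factory for the dual-branch TF-IDF representation."""
    word_branch = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                  max_features=100000)),
        ("select", SelectKBest(chi2, k=k_word)),
    ])
    char_branch = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                                  max_features=50000)),
        ("select", SelectKBest(chi2, k=k_char)),
    ])
    # Concatenate both sparse feature blocks side by side
    return FeatureUnion([("word", word_branch), ("char", char_branch)])
```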

3. Model: SVM with Stochastic Gradient Descent

The classifier is an SGDClassifier with modified_huber loss, chosen for its efficiency on high-dimensional sparse data (100,000+ features) and its flexible regularization options.

Hyperparameters were tuned via 5-fold Grid Search over a wide search space including preprocessing parameters and classifier settings.
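A sketch of the tuning setup, assuming a standard `GridSearchCV` over the classifier. The grid below is illustrative only; the report's actual search space (which also spans preprocessing parameters) is not reproduced here, and the TF-IDF matrix is replaced by synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative search space (assumed, not the report's grid)
param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3],
    "penalty": ["l2", "elasticnet"],
}
clf = SGDClassifier(loss="modified_huber", max_iter=1000, random_state=42)
search = GridSearchCV(clf, param_grid, cv=5,
                      scoring="f1_macro", n_jobs=-1)

# Synthetic stand-in for the real sparse TF-IDF matrix
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
search.fit(X, y)
print("best params:", search.best_params_)
print("best macro F1:", round(search.best_score_, 3))
```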


📊 Results

| Model | Validation (Macro F1) | Public Score |
|-------|-----------------------|--------------|
| Naive Baseline | 0.443 | 0.443 |
| Random Forest | 0.689 | |
| SVM (LinearSVC) | 0.703 | |
| SVM (SGDClassifier) | 0.726 | 0.741 |

Per-Class Highlights

  • 🏆 Sports (Class 4): 95% accuracy — cleanly separated via source boosting and distinctive vocabulary.
  • 🖥️ Technology (Class 2): 85% accuracy — strong domain-specific terms and high-purity sources.
  • ⚠️ Entertainment (Class 3): 51% accuracy — significant confusion with International and General News due to lexical overlap.
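Per-class figures like these can be read off scikit-learn's `classification_report`. The labels below are toy values for illustration, not the project's actual predictions:

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Toy predictions over the 7-label scheme (illustrative only)
y_true = np.array([4, 4, 2, 2, 3, 3, 3, 0])
y_pred = np.array([4, 4, 2, 0, 3, 0, 5, 0])

# Per-class precision/recall/F1, with zero_division=0 so classes
# that were predicted but never occur do not raise warnings
print(classification_report(y_true, y_pred, zero_division=0))
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print("Macro F1:", round(macro, 3))
```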

🗂️ Project Structure

```
NLP_SemanticClassification/
├── main.py                         # Full pipeline: preprocessing → training → prediction
├── development.csv                 # Training set (79,997 labeled samples)
├── evaluation.csv                  # Test set (20,000 unlabeled samples)
├── report_exam_winter_2026.pdf     # Academic report with full analysis
└── README.md
```

🚀 Quick Start

Prerequisites

```bash
pip install numpy pandas scikit-learn nltk
```

Then download the NLTK tokenizer data from a Python session:

```python
import nltk
nltk.download('punkt')
```

Generate Submission

```bash
python main.py
```

This will:

  1. Load and preprocess development.csv (deduplication, NaN handling, feature extraction)
  2. Build the hybrid TF-IDF + metadata pipeline
  3. Train the SGDClassifier on the full development set
  4. Predict labels for evaluation.csv
  5. Save results to submission.csv

🔑 Key Insights

  1. Metadata is underrated. Source boosting alone accounts for a significant chunk of performance — some publishers map almost 1:1 to a category.
  2. Character n-grams complement word n-grams. They capture patterns that survive noisy tokenization and HTML artifacts.
  3. Linear models scale better. On 100K+ sparse features, SGD-based SVMs outperform tree-based ensembles both in accuracy and training speed.
  4. Semantic overlap is the bottleneck. The confusion between International/General News and Entertainment is structural — resolving it likely requires contextual embeddings beyond bag-of-words.

📜 License

This project was developed for academic purposes at Politecnico di Torino.

📎 References

  • Salton & Buckley (1988) — Term-weighting approaches in automatic text retrieval
  • Joachims (1998) — Text categorization with SVMs
  • Bottou (2010) — Large-scale ML with SGD
  • Kim et al. (2019) — Categorical metadata representation for customized text classification
