
📰 News Article Semantic Classification

Macro F1-Score: 0.741 on the public test set — achieving competitive performance through careful feature engineering, without deep learning.

A supervised NLP pipeline for classifying news articles into 7 semantic categories, developed as part of the Data Science & Machine Learning Laboratory course at Politecnico di Torino (Winter 2026).

Authors: Leonardo Passafiume (s358616), Lucio Baiocchi (s360244)


🎯 Problem Statement

Given a dataset of ~80,000 web-scraped news articles — each with metadata including source, title, timestamp, and PageRank — the goal is to build a classifier that accurately categorizes them into:

| Label | Category | Samples |
|-------|----------|---------|
| 0 | International News | 23,542 |
| 1 | Business | 10,588 |
| 2 | Technology | 11,161 |
| 3 | Entertainment | 9,977 |
| 4 | Sports | 8,574 |
| 5 | General News | 13,053 |
| 6 | Health | 3,102 |

Key challenges include class imbalance (Health has ~7× fewer samples than International News), lexical overlap between semantically close categories (International vs. General News share 67% of their top-100 keywords), and noisy HTML artifacts embedded in the raw text.
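The ~7× imbalance can be counteracted with inverse-frequency class weights. The repository does not spell out its exact handling, so this is a minimal sketch using scikit-learn's `compute_class_weight` with the counts from the table above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts from the table above (classes 0..6)
counts = [23542, 10588, 11161, 9977, 8574, 13053, 3102]
y = np.repeat(np.arange(7), counts)

# "balanced" weights each class inversely to its frequency:
#   w_c = n_samples / (n_classes * n_c)
weights = compute_class_weight("balanced", classes=np.arange(7), y=y)
for label, w in zip(range(7), weights):
    print(f"class {label}: weight {w:.2f}")
# Health (class 6) ends up weighted roughly 7.6x heavier than class 0;
# SGDClassifier can consume this directly via class_weight="balanced".
```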


🧠 Methodology

1. Preprocessing & Feature Engineering

  • No-Clean HTML Strategy — Instead of stripping all HTML, semantic content (e.g., image alt-text) is preserved as additional signal.
  • Source Boosting — High-purity sources (e.g., ESPN → Sports, CNET → Technology) are amplified via text repetition, effectively turning a text classification problem into a simpler metadata-routing task for reliable publishers.
  • Temporal Features — Day-of-week and time-of-day (morning / afternoon / night) are extracted from timestamps and one-hot encoded.
  • Length Features — Log-transformed article and title lengths serve as lightweight discriminative signals.
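A minimal sketch of the temporal and length features listed above. The column names (`timestamp`, `text`, `title`), the helper name `add_meta_features`, and the time-of-day cutoffs are assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd

def add_meta_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive temporal and length features."""
    ts = pd.to_datetime(df["timestamp"])
    df["day_of_week"] = ts.dt.day_name()
    # Bucket the hour into morning / afternoon / night (assumed cutoffs)
    df["time_of_day"] = pd.cut(ts.dt.hour,
                               bins=[-1, 11, 17, 23],
                               labels=["morning", "afternoon", "night"])
    # Log-transform lengths to tame their heavy-tailed distributions
    df["log_text_len"] = np.log1p(df["text"].str.len())
    df["log_title_len"] = np.log1p(df["title"].str.len())
    # One-hot encode the categorical temporal features
    return pd.get_dummies(df, columns=["day_of_week", "time_of_day"])
```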

2. Hybrid TF-IDF Vectorization

A dual-branch text representation captures complementary patterns:

| Branch | Analyzer | N-gram Range | Max Features | SelectKBest |
|--------|----------|--------------|--------------|-------------|
| Word-level | word | (1, 2) | 100,000 | 60,000 |
| Char-level | char | (3, 5) | 50,000 | 25,000 |

  • Word-level features use SnowballStemmer to reduce vocabulary and capture semantic similarity.
  • Character-level n-grams add robustness to misspellings and capture sub-word morphological patterns.
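The dual-branch setup can be sketched as a scikit-learn `FeatureUnion`. The vectorizer settings mirror the table; `make_hybrid_vectorizer` and the `chi2` scoring choice are assumptions, and the word-level stemming step is omitted here:

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def make_hybrid_vectorizer(k_word=60000, k_char=25000):
    """Hypothetical factory for the dual-branch TF-IDF representation."""
    word_branch = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                  max_features=100000)),
        ("select", SelectKBest(chi2, k=k_word)),
    ])
    char_branch = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                                  max_features=50000)),
        ("select", SelectKBest(chi2, k=k_char)),
    ])
    # Concatenate both sparse feature blocks side by side
    return FeatureUnion([("word", word_branch), ("char", char_branch)])
```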

3. Model: SVM with Stochastic Gradient Descent

The classifier is an SGDClassifier with modified_huber loss, chosen for its efficiency on high-dimensional sparse data (100,000+ features) and its flexible regularization options.

Hyperparameters were tuned via 5-fold Grid Search over a wide search space including preprocessing parameters and classifier settings.
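A sketch of the tuning setup, assuming a standard `GridSearchCV` over the classifier. The grid below is illustrative only; the report's actual search space (which also spans preprocessing parameters) is not reproduced here, and the TF-IDF matrix is replaced by synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative search space (assumed, not the report's grid)
param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3],
    "penalty": ["l2", "elasticnet"],
}
clf = SGDClassifier(loss="modified_huber", max_iter=1000, random_state=42)
search = GridSearchCV(clf, param_grid, cv=5,
                      scoring="f1_macro", n_jobs=-1)

# Synthetic stand-in for the real sparse TF-IDF matrix
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
search.fit(X, y)
print("best params:", search.best_params_)
print("best macro F1:", round(search.best_score_, 3))
```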


📊 Results

| Model | Validation (Macro F1) | Public Score |
|-------|-----------------------|--------------|
| Naive Baseline | 0.443 | 0.443 |
| Random Forest | 0.689 | |
| SVM (LinearSVC) | 0.703 | |
| SVM (SGDClassifier) | 0.726 | 0.741 |

Per-Class Highlights

  • 🏆 Sports (Class 4): 95% accuracy — cleanly separated via source boosting and distinctive vocabulary.
  • 🖥️ Technology (Class 2): 85% accuracy — strong domain-specific terms and high-purity sources.
  • ⚠️ Entertainment (Class 3): 51% accuracy — significant confusion with International and General News due to lexical overlap.
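Per-class figures like these can be read off scikit-learn's `classification_report`. The labels below are toy values for illustration, not the project's actual predictions:

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Toy predictions over the 7-label scheme (illustrative only)
y_true = np.array([4, 4, 2, 2, 3, 3, 3, 0])
y_pred = np.array([4, 4, 2, 0, 3, 0, 5, 0])

# Per-class precision/recall/F1, with zero_division=0 so classes
# that were predicted but never occur do not raise warnings
print(classification_report(y_true, y_pred, zero_division=0))
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print("Macro F1:", round(macro, 3))
```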

🗂️ Project Structure

```
NLP_SemanticClassification/
├── main.py                         # Full pipeline: preprocessing → training → prediction
├── development.csv                 # Training set (79,997 labeled samples)
├── evaluation.csv                  # Test set (20,000 unlabeled samples)
├── report_exam_winter_2026.pdf     # Academic report with full analysis
└── README.md
```

🚀 Quick Start

Prerequisites

```bash
pip install numpy pandas scikit-learn nltk
```

Then download the NLTK tokenizer data from a Python session:

```python
import nltk
nltk.download('punkt')
```

Generate Submission

```bash
python main.py
```

This will:

  1. Load and preprocess development.csv (deduplication, NaN handling, feature extraction)
  2. Build the hybrid TF-IDF + metadata pipeline
  3. Train the SGDClassifier on the full development set
  4. Predict labels for evaluation.csv
  5. Save results to submission.csv

🔑 Key Insights

  1. Metadata is underrated. Source boosting alone accounts for a significant chunk of performance — some publishers map almost 1:1 to a category.
  2. Character n-grams complement word n-grams. They capture patterns that survive noisy tokenization and HTML artifacts.
  3. Linear models scale better. On 100K+ sparse features, SGD-based SVMs outperform tree-based ensembles both in accuracy and training speed.
  4. Semantic overlap is the bottleneck. The confusion between International/General News and Entertainment is structural — resolving it likely requires contextual embeddings beyond bag-of-words.

📜 License

This project was developed for academic purposes at Politecnico di Torino.

📎 References

  • Salton & Buckley (1988) — Term-weighting approaches in automatic text retrieval
  • Joachims (1998) — Text categorization with SVMs
  • Bottou (2010) — Large-scale ML with SGD
  • Kim et al. (2019) — Categorical metadata representation for customized text classification
