Macro F1-Score: 0.741 on the public test set — achieving competitive performance through careful feature engineering, without deep learning.
A supervised NLP pipeline for classifying news articles into 7 semantic categories, developed as part of the Data Science & Machine Learning Laboratory course at Politecnico di Torino (Winter 2026).
Authors: Leonardo Passafiume (s358616), Lucio Baiocchi (s360244)
Given a dataset of ~80,000 web-scraped news articles — each with metadata including source, title, timestamp, and PageRank — the goal is to build a classifier that accurately categorizes them into:
| Label | Category | Samples |
|---|---|---|
| 0 | International News | 23,542 |
| 1 | Business | 10,588 |
| 2 | Technology | 11,161 |
| 3 | Entertainment | 9,977 |
| 4 | Sports | 8,574 |
| 5 | General News | 13,053 |
| 6 | Health | 3,102 |
Key challenges include class imbalance (Health has ~7× fewer samples than International News), lexical overlap between semantically close categories (International vs. General News share 67% of their top-100 keywords), and noisy HTML artifacts embedded in the raw text.
- No-Clean HTML Strategy — Instead of stripping all HTML, semantic content (e.g., image alt-text) is preserved as additional signal.
- Source Boosting — High-purity sources (e.g., ESPN → Sports, CNET → Technology) are amplified via text repetition, effectively turning a text classification problem into a simpler metadata-routing task for reliable publishers.
- Temporal Features — Day-of-week and time-of-day (morning / afternoon / night) are extracted from timestamps and one-hot encoded.
- Length Features — Log-transformed article and title lengths serve as lightweight discriminative signals.
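The metadata features above can be sketched with pandas. This is a minimal illustration, not the project's exact code: the column names (`timestamp`, `text`, `title`), the publisher list, and the repeat counts in `SOURCE_BOOST` are all assumptions.

```python
import numpy as np
import pandas as pd

# Assumed publisher -> repeat-count map; the real list and counts come from the report
SOURCE_BOOST = {"espn.com": 3, "cnet.com": 3}

def boost_source(text: str, source: str) -> str:
    """Repeat a source-derived token so TF-IDF up-weights high-purity publishers."""
    token = " " + source.replace(".", "_")
    return text + token * SOURCE_BOOST.get(source, 0)

def add_metadata_features(df: pd.DataFrame) -> pd.DataFrame:
    """Temporal and length features; column names are assumed."""
    ts = pd.to_datetime(df["timestamp"], errors="coerce")
    out = df.copy()
    # Day-of-week, one-hot encoded
    out = out.join(pd.get_dummies(ts.dt.dayofweek, prefix="dow"))
    # Coarse time-of-day buckets: morning / afternoon / night
    tod = pd.cut(ts.dt.hour, bins=[-1, 11, 17, 23],
                 labels=["morning", "afternoon", "night"])
    out = out.join(pd.get_dummies(tod, prefix="tod"))
    # Log-transformed lengths as lightweight discriminative signals
    out["log_text_len"] = np.log1p(df["text"].str.len().fillna(0))
    out["log_title_len"] = np.log1p(df["title"].str.len().fillna(0))
    return out
```

The log transform keeps length features on a scale comparable to TF-IDF weights instead of letting raw character counts dominate.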
A dual-branch text representation captures complementary patterns:
| Branch | Analyzer | N-gram Range | Max Features | SelectKBest |
|---|---|---|---|---|
| Word-level | word | (1, 2) | 100,000 | 60,000 |
| Char-level | char | (3, 5) | 50,000 | 25,000 |
- Word-level features use SnowballStemmer to reduce vocabulary and capture semantic similarity.
- Character-level n-grams add robustness to misspellings and capture sub-word morphological patterns.
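The dual-branch representation in the table above can be assembled with scikit-learn's `FeatureUnion`; this sketch omits the SnowballStemmer step for brevity and parameterizes `k` so it can be scaled down for testing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import FeatureUnion, Pipeline

def make_text_features(word_k: int = 60_000, char_k: int = 25_000) -> FeatureUnion:
    """Two complementary TF-IDF views, each followed by chi-squared selection."""
    return FeatureUnion([
        ("word", Pipeline([
            ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                      max_features=100_000)),
            ("select", SelectKBest(chi2, k=word_k)),
        ])),
        ("char", Pipeline([
            ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                                      max_features=50_000)),
            ("select", SelectKBest(chi2, k=char_k)),
        ])),
    ])
```

`chi2` scoring works here because TF-IDF weights are non-negative; the union simply concatenates the two selected sparse matrices column-wise.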
The final model is an SGDClassifier with modified_huber loss, chosen for its efficiency on high-dimensional sparse data (100,000+ features) and its flexible regularization options.
Hyperparameters were tuned via 5-fold Grid Search over a wide search space including preprocessing parameters and classifier settings.
| Model | Validation (Macro F1) | Public Score |
|---|---|---|
| Naive Baseline | 0.443 | 0.443 |
| Random Forest | 0.689 | — |
| SVM (LinearSVC) | 0.703 | — |
| SVM (SGDClassifier) ✅ | 0.726 | 0.741 |
- 🏆 Sports (Class 4): 95% accuracy — cleanly separated via source boosting and distinctive vocabulary.
- 🖥️ Technology (Class 2): 85% accuracy — strong domain-specific terms and high-purity sources.
- ⚠️ Entertainment (Class 3): 51% accuracy — significant confusion with International and General News due to lexical overlap.
```
NLP_SemanticClassification/
├── main.py                        # Full pipeline: preprocessing → training → prediction
├── development.csv                # Training set (79,997 labeled samples)
├── evaluation.csv                 # Test set (20,000 unlabeled samples)
├── report_exam_winter_2026.pdf    # Academic report with full analysis
└── README.md
```
Install the dependencies:

```bash
pip install numpy pandas scikit-learn nltk
```

Download the required NLTK tokenizer data:

```python
import nltk
nltk.download('punkt')
```

Then run the full pipeline:

```bash
python main.py
```

This will:
- Load and preprocess `development.csv` (deduplication, NaN handling, feature extraction)
- Build the hybrid TF-IDF + metadata pipeline
- Train the SGDClassifier on the full development set
- Predict labels for `evaluation.csv`
- Save the results to `submission.csv`
- Metadata is underrated. Source boosting alone accounts for a significant chunk of performance — some publishers map almost 1:1 to a category.
- Character n-grams complement word n-grams. They capture patterns that survive noisy tokenization and HTML artifacts.
- Linear models scale better. On 100K+ sparse features, SGD-based SVMs outperform tree-based ensembles both in accuracy and training speed.
- Semantic overlap is the bottleneck. The confusion between International/General News and Entertainment is structural — resolving it likely requires contextual embeddings beyond bag-of-words.
This project was developed for academic purposes at Politecnico di Torino.
- Salton & Buckley (1988) — Term-weighting approaches in automatic text retrieval
- Joachims (1998) — Text categorization with SVMs
- Bottou (2010) — Large-scale ML with SGD
- Kim et al. (2019) — Categorical metadata representation for customized text classification