Raw Data
↓
┌─────────────────────┐
│ 1. CLEANING         │ Remove bad data, handle missing values
│ - Deduplication     │
│ - Error removal     │
│ - Imputation        │ ← Specialized cleaning step
└─────────────────────┘
↓
┌─────────────────────┐
│ 2. PREPROCESSING    │ Transform data format/representation
│ - Tokenization      │
│ - Normalization     │
│ - Feature extraction│
└─────────────────────┘
↓
┌─────────────────────┐
│ 3. ANNOTATION       │ Add labels (can happen at different stages)
│ - Human labeling    │
│ - Machine labels    │
└─────────────────────┘
↓
┌─────────────────────┐
│ 4. ANNOTATION       │ Assess annotation quality
│    ANALYSIS         │
└─────────────────────┘
↓
Final Dataset
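
The four stages above can be sketched in Python as a rough illustration. This is a minimal sketch under stated assumptions, not a definitive implementation: the record fields ("text", "rating"), the function names, the mean-based imputation, the rating threshold used for machine labels, and the use of simple percent agreement as the quality measure are all hypothetical choices made for the example.

```python
from statistics import mean


def clean(records):
    """Stage 1: remove errors and duplicates, then impute missing ratings."""
    seen, kept = set(), []
    for r in records:
        text = (r.get("text") or "").strip()
        if not text or text in seen:  # error removal, then deduplication
            continue
        seen.add(text)
        kept.append({"text": text, "rating": r.get("rating")})
    # Imputation (assumed strategy): fill missing ratings with the mean.
    known = [r["rating"] for r in kept if r["rating"] is not None]
    fill = mean(known) if known else 0
    for r in kept:
        if r["rating"] is None:
            r["rating"] = fill
    return kept


def preprocess(records):
    """Stage 2: normalize, tokenize, and extract a simple feature."""
    out = []
    for r in records:
        tokens = r["text"].lower().split()  # normalization + tokenization
        out.append({**r, "tokens": tokens, "n_tokens": len(tokens)})
    return out


def machine_label(record, threshold=4):
    """Stage 3: a toy rule-based labeler (hypothetical threshold)."""
    return "positive" if record["rating"] >= threshold else "negative"


def agreement(labels_a, labels_b):
    """Stage 4: fraction of items on which two label sets agree."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

A typical run would chain the stages: call clean() on the raw records, pass the result to preprocess(), label each record with machine_label(), and compare those labels with a set of human labels using agreement(). Percent agreement is only the simplest possible check; a chance-corrected measure may be a better fit for real annotation analysis.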