This project applies Natural Language Processing (NLP) and Unsupervised Machine Learning to group thousands of job postings from LinkedIn into meaningful clusters.
It demonstrates how unsupervised techniques like K-Means and Agglomerative Clustering can uncover hidden job market trends based on textual similarities in descriptions.
The objective is to automatically cluster job descriptions, without prior labels, into categories that reveal underlying employment domains such as Technology, Healthcare, Retail, and Management.
- Source: LinkedIn Job Postings (2023–2024)
- Total Records: 110,906 (10,000 sampled for analysis)
- Main Columns:
  - title
  - description
  - company_name
  - location
- Removed nulls, duplicates, and short descriptions (< 50 chars)
- Combined title + description into a single text column
- Cleaned text (lowercasing, punctuation removal, and stopword removal)
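
The filtering and cleaning steps above could be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the `text` column name, the `min_chars` parameter, and the tiny `STOPWORDS` set are assumptions made for the example (the project relies on SpaCy's stopword list).

```python
import re

import pandas as pd

# Hypothetical, deliberately tiny stopword list used only for this sketch.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "is", "are", "we"}


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def prepare_postings(df: pd.DataFrame, min_chars: int = 50) -> pd.DataFrame:
    """Drop nulls, duplicates, and short descriptions, then build a cleaned text column."""
    out = df.dropna(subset=["title", "description"])
    out = out.drop_duplicates(subset=["title", "description"])
    out = out[out["description"].str.len() >= min_chars].copy()
    # Combine title and description into one column (assumed name: "text").
    out["text"] = (out["title"] + " " + out["description"]).map(clean_text)
    return out.reset_index(drop=True)
```

Calling `prepare_postings(df)` on a DataFrame with `title` and `description` columns returns only the rows that survive the filters, with the cleaned text in the `text` column.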
Used SpaCy for advanced text preprocessing:
- Tokenization
- Lemmatization
- Stopword removal
- Sentence normalization
- Generation of dense vector embeddings
These embeddings capture semantic similarity between job descriptions that keyword matching alone would miss.
Two feature extraction methods were tested and compared:
| Feature Type | Description |
|---|---|
| TF-IDF | Weights each term by its frequency within a posting relative to its rarity across all postings |
| SpaCy Word Embeddings | Creates dense semantic representations |
Both were scaled and evaluated using unsupervised clustering algorithms.
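
As a rough sketch of the TF-IDF branch, the snippet below builds and scales a TF-IDF matrix with scikit-learn. The `max_features` value and the use of scikit-learn's built-in English stopword list are assumptions for illustration; the notebook's settings may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler


def tfidf_features(texts, max_features=5000):
    """Return a scaled sparse TF-IDF matrix and the fitted vectorizer."""
    vectorizer = TfidfVectorizer(max_features=max_features, stop_words="english")
    tfidf = vectorizer.fit_transform(texts)  # sparse, shape (n_docs, n_terms)
    # with_mean=False keeps the matrix sparse; centering would densify it.
    scaled = StandardScaler(with_mean=False).fit_transform(tfidf)
    return scaled, vectorizer
```

The returned matrix can be passed directly to a clustering algorithm, and `vectorizer.vocabulary_` maps each retained term to its column index.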
| Model | Features | Silhouette Score | DB Index | Observation |
|---|---|---|---|---|
| K-Means | SpaCy Embeddings | 0.155 | 1.968 | ✅ Best-performing model |
| K-Means | TF-IDF | 0.132 | 2.21 | Moderate performance |
| Agglomerative | SpaCy Embeddings | 0.110 | 2.42 | Lower separation |
Evaluation Metrics:
- Silhouette Score: ranges from -1 to 1; higher = better separation
- Davies–Bouldin Index (DBI): lower = more compact, better-separated clusters
✅ Best setup: K-Means with SpaCy embeddings (k = 6)
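
The comparison above could be reproduced along these lines. This is a sketch under assumptions: the feature matrix `X` stands for either the embedding or the TF-IDF features, the synthetic blobs in the demo are a stand-in rather than the project's data, and the `n_init` and `seed` values are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def evaluate_clusterers(X, k=6, seed=42):
    """Fit both algorithms with k clusters and return {name: (silhouette, dbi)}."""
    models = {
        "K-Means": KMeans(n_clusters=k, n_init=10, random_state=seed),
        "Agglomerative": AgglomerativeClustering(n_clusters=k),
    }
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(X)
        results[name] = (
            silhouette_score(X, labels),      # higher is better
            davies_bouldin_score(X, labels),  # lower is better
        )
    return results


if __name__ == "__main__":
    # Synthetic stand-in for the real feature matrix.
    X_demo, _ = make_blobs(n_samples=600, centers=6, n_features=20, random_state=0)
    for name, (sil, dbi) in evaluate_clusterers(X_demo).items():
        print(f"{name:14s} silhouette={sil:.3f}  DBI={dbi:.3f}")
```

Running the same function for several values of `k` is one way to check whether k = 6 remains the best choice on a given sample.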
Dimensionality reduction via PCA was applied for visualization.
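
A minimal sketch of the PCA projection used for plotting, assuming the features are held in a dense NumPy array; the function name `project_2d` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA


def project_2d(X, seed=42):
    """Project features onto the first two principal components."""
    pca = PCA(n_components=2, random_state=seed)
    coords = pca.fit_transform(X)  # shape (n_samples, 2)
    return coords, pca.explained_variance_ratio_
```

The two columns of `coords` can be used as the x and y values of a scatter plot colored by cluster label. The explained variance ratio indicates how much of the original structure the 2D view retains, which is worth checking before reading too much into the plot.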
Top 6 Cluster Themes:
| Cluster | Theme | Keywords |
|---|---|---|
| 0 | Management & Operations | “Manager”, “Customer”, “Process” |
| 1 | Retail & Sales | “Customer service”, “Store”, “Product” |
| 2 | Hospitality | “Hotel”, “Experience”, “Spanish” |
| 3 | Tech & Software | “Design”, “Cloud”, “Application” |
| 4 | Healthcare | “Nurse”, “Care”, “Physician” |
| 5 | Manufacturing | “Equipment”, “Quality”, “Factory” |
Visualizations include:
- Silhouette plots
- PCA 2D scatter plots
- Word frequency analysis per cluster
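
The per-cluster word frequency analysis can be approximated with a simple counter, as in the sketch below. It assumes the texts are already cleaned and whitespace-tokenizable, and the function name `top_words_per_cluster` is hypothetical.

```python
from collections import Counter, defaultdict


def top_words_per_cluster(texts, labels, n=3):
    """Return the n most frequent words for each cluster label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.split())
    return {
        label: [word for word, _ in counts[label].most_common(n)]
        for label in sorted(counts)
    }
```

Inspecting the top words for each label is one way to arrive at theme names like those in the cluster table.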
- SpaCy embeddings outperformed TF-IDF in producing semantically coherent clusters.
- Clear domain-based segmentation emerged even without labels.
- Excellent foundation for AI-driven job recommendation or labor market analytics.
- Combining embeddings + TF-IDF improved interpretability.
- Implement BERT or fastText embeddings for deeper contextual understanding.
- Experiment with dynamic clustering for time-based job evolution.
- Integrate interactive visualization dashboards (Plotly / Power BI).
- Python 3.10+
- SpaCy for NLP processing
- Scikit-learn for clustering algorithms
- Pandas, NumPy for data wrangling
- Matplotlib, Seaborn for visualization
- Jupyter Notebook for experimentation
# 1️⃣ Install dependencies
pip install -r requirements.txt
# 2️⃣ Open Jupyter Notebook
jupyter notebook EiEiKhaing_1290619_Project2.ipynb
# 3️⃣ Run all cells to preprocess, cluster, and visualize results
NLP_Clustering_JobPostings/
│
├── EiEiKhaing_1290619_Project2.ipynb
├── README.md
├── requirements.txt
└── Clustering Job Postings Presentation.pptx
Ei Ei Khaing
Graduate Certificate in Artificial Intelligence & Machine Learning
Fanshawe College | London, Ontario, Canada
NLP, Unsupervised Learning, K-Means, Agglomerative Clustering, SpaCy, TF-IDF, Job Market Analysis, Text Mining, Fanshawe College