💼 NLP Job Postings Clustering using SpaCy and Unsupervised Learning

This project applies Natural Language Processing (NLP) and Unsupervised Machine Learning to group thousands of job postings from LinkedIn into meaningful clusters.
It demonstrates how unsupervised techniques like K-Means and Agglomerative Clustering can uncover hidden job market trends based on textual similarities in descriptions.

🎯 Objective

To automatically cluster job descriptions into categories that reveal underlying employment domains such as Technology, Healthcare, Retail, and Management, without prior labels.

📊 Dataset Overview

Source: LinkedIn Job Postings (2023–2024)
Total Records: 110,906 (10,000 sampled for analysis)
Main Columns:
- title
- description
- company_name
- location

🔹 Preprocessing Steps

- Removed nulls, duplicates, and short descriptions (under 50 characters)
- Combined title and description into a single text column
- Cleaned the text by lowercasing and removing punctuation and stopwords
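
The steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the `clean_text` and `preprocess` helpers, the tiny `STOPWORDS` set, the choice to deduplicate on title plus description, and the sample rows are all assumptions made for the example.

```python
import re
import pandas as pd

# Tiny illustrative stop-word set (an assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "with", "is", "are"}

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

def preprocess(df: pd.DataFrame, min_chars: int = 50) -> pd.DataFrame:
    """Filter out unusable rows, then build one cleaned text column."""
    df = df.dropna(subset=["title", "description"])
    df = df.drop_duplicates(subset=["title", "description"])
    df = df[df["description"].str.len() >= min_chars].copy()
    df["text"] = (df["title"] + " " + df["description"]).map(clean_text)
    return df.reset_index(drop=True)

# Made-up rows: one valid posting, one duplicate, one too short, one missing a title.
postings = pd.DataFrame({
    "title": ["Registered Nurse", "Registered Nurse", "Cashier", None],
    "description": [
        "Provide patient care in a busy hospital ward, working with physicians and staff.",
        "Provide patient care in a busy hospital ward, working with physicians and staff.",
        "Too short.",
        "Description without a title that is long enough to pass the length filter easily.",
    ],
})
cleaned = preprocess(postings)
print(cleaned["text"].tolist())
```

Only the first row survives here: the second is a duplicate, the third is under 50 characters, and the fourth has no title.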

⚙️ Text Processing with SpaCy

Used SpaCy for advanced text preprocessing:

- Tokenization
- Lemmatization
- Stopword removal
- Sentence normalization
- Generation of dense vector embeddings

These embeddings capture semantic similarity between job descriptions, so postings that use different wording for similar roles can still land close together, which plain keyword matching cannot do.
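
A rough sketch of this token-level processing is shown below. It is illustrative only: the `normalize` helper is hypothetical, and the README does not say which SpaCy pipeline was used. The example runs on `spacy.blank("en")`, which needs no model download but only tokenizes; real lemmas and vectors need a pipeline that ships word vectors, for instance `en_core_web_md` (an assumed choice).

```python
import spacy

def normalize(doc) -> list[str]:
    """Return lowercased lemmas, skipping stop words, punctuation, and whitespace."""
    tokens = []
    for tok in doc:
        if tok.is_stop or tok.is_punct or tok.is_space:
            continue
        # A blank pipeline has no lemmatizer, so lemma_ may be empty;
        # fall back to the surface text in that case.
        tokens.append((tok.lemma_ or tok.text).lower())
    return tokens

# Blank English pipeline: tokenizer and stop-word flags only (no lemmas, no vectors).
# Swap in spacy.load("en_core_web_md") (assumed model) for lemmas and embeddings.
nlp = spacy.blank("en")
doc = nlp("the nurses are caring for patients.")
tokens = normalize(doc)
print(tokens)

# With a vectors-enabled pipeline, doc.vector defaults to the average of the
# token vectors and can serve as a dense embedding of the whole posting.
```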

🧩 Feature Engineering

Two feature extraction methods were tested and compared:

| Feature Type | Description |
| --- | --- |
| TF-IDF | Weights each term by its frequency within a posting and its rarity across postings |
| SpaCy Word Embeddings | Creates dense semantic representations |

Both feature sets were scaled and then evaluated with unsupervised clustering algorithms.
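
As a hedged sketch of how the two feature sets might be built and scaled with scikit-learn: the vectorizer settings, the 300-dimensional embedding size, and the random matrix standing in for real document embeddings are all assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Made-up cleaned postings for illustration.
texts = [
    "registered nurse patient care hospital",
    "software engineer cloud application design",
    "store associate customer service retail",
    "nurse practitioner clinic patient care",
]

# Sparse TF-IDF features; these settings are illustrative assumptions.
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)

# with_mean=False keeps the matrix sparse (centering would densify it).
X_tfidf_scaled = StandardScaler(with_mean=False).fit_transform(X_tfidf)

# Stand-in for dense document embeddings: random values, 300 dimensions assumed.
rng = np.random.default_rng(0)
X_embed = rng.normal(size=(len(texts), 300))
X_embed_scaled = StandardScaler().fit_transform(X_embed)

print(X_tfidf_scaled.shape, X_embed_scaled.shape)
```

Scaling the sparse TF-IDF matrix without centering is one common choice; projecting it to a dense lower-dimensional space first would be another.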

🤖 Models and Evaluation

| Model | Features | Silhouette Score | Davies–Bouldin Index | Observation |
| --- | --- | --- | --- | --- |
| K-Means | SpaCy Embeddings | 0.155 | 1.968 | ✅ Best-performing model |
| K-Means | TF-IDF | 0.132 | 2.21 | Moderate performance |
| Agglomerative | SpaCy Embeddings | 0.110 | 2.42 | Lower separation |

Evaluation Metrics:

- Silhouette Score: ranges from -1 to 1; higher means better-separated clusters
- Davies–Bouldin Index (DBI): non-negative; lower means more compact, better-separated clusters
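
For reference, the standard definitions of the two metrics are given below. The symbols are introduced here for illustration: $a(i)$ is the mean distance from sample $i$ to the other members of its cluster, $b(i)$ is the mean distance from $i$ to the members of the nearest other cluster, $\sigma_i$ is the mean distance of cluster $i$'s points to its centroid $c_i$, and $k$ is the number of clusters.

```latex
% Silhouette coefficient of sample i; the reported score is the mean over all n samples.
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
S = \frac{1}{n} \sum_{i=1}^{n} s(i)

% Davies-Bouldin index: average, over clusters, of the worst-case
% ratio of within-cluster scatter to between-centroid distance.
\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}
```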

✅ Best setup: K-Means with SpaCy embeddings (k = 6)
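
The comparison above could be reproduced along these lines. This is a sketch under stated assumptions: synthetic blobs from `make_blobs` stand in for the scaled feature matrix, and the `n_init`, `random_state`, and `linkage="ward"` settings are illustrative choices, so the printed scores will not match the table.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the scaled feature matrix (not the project's data).
X, _ = make_blobs(n_samples=600, centers=6, n_features=20, random_state=42)

models = {
    "K-Means": KMeans(n_clusters=6, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=6, linkage="ward"),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    results[name] = {
        "silhouette": silhouette_score(X, labels),   # higher is better
        "dbi": davies_bouldin_score(X, labels),      # lower is better
    }
    print(f"{name}: silhouette={results[name]['silhouette']:.3f}, "
          f"DBI={results[name]['dbi']:.3f}")
```

Wrapping the loop in a sweep over several values of `n_clusters` would be one way to compare candidate values of k.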

📈 Visualization & Cluster Insights

Dimensionality reduction via PCA was applied for visualization.
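
A minimal sketch of the projection step, assuming a dense feature matrix; the random matrix here is only a stand-in for the scaled features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the scaled feature matrix (200 postings, 50 features assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Project onto the two directions of highest variance for plotting.
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X)

print(coords.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance kept in 2D
```

The two columns of `coords` can then be passed to a scatter plot and colored by cluster label. When the retained variance is small, the 2D picture understates how well the clusters separate in the full feature space.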

Cluster themes (k = 6):

| Cluster | Theme | Keywords |
| --- | --- | --- |
| 0 | Management & Operations | “Manager”, “Customer”, “Process” |
| 1 | Retail & Sales | “Customer service”, “Store”, “Product” |
| 2 | Hospitality | “Hotel”, “Experience”, “Spanish” |
| 3 | Tech & Software | “Design”, “Cloud”, “Application” |
| 4 | Healthcare | “Nurse”, “Care”, “Physician” |
| 5 | Manufacturing | “Equipment”, “Quality”, “Factory” |

Visualizations include:

- Silhouette plots
- PCA 2D scatter plots
- Word frequency analysis per cluster
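
The per-cluster word frequency analysis can be approximated with a simple counter, as in the sketch below. The `top_terms_per_cluster` helper and the sample texts and labels are hypothetical; the notebook may derive its keywords differently.

```python
from collections import Counter, defaultdict

def top_terms_per_cluster(texts, labels, n=3):
    """Count whitespace-separated tokens per cluster and return the n most common."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.split())
    return {
        label: [term for term, _ in counter.most_common(n)]
        for label, counter in sorted(counts.items())
    }

# Made-up cleaned texts and cluster assignments for illustration.
texts = [
    "nurse patient care",
    "nurse care physician care",
    "cloud application design",
    "cloud design cloud",
]
labels = [0, 0, 1, 1]
top = top_terms_per_cluster(texts, labels, n=2)
print(top)  # {0: ['care', 'nurse'], 1: ['cloud', 'design']}
```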

💡 Key Findings

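- SpaCy embeddings outperformed TF-IDF on both metrics (silhouette 0.155 vs. 0.132; DBI 1.968 vs. 2.21) and produced more semantically coherent clusters.
- Recognizable domain-based segments emerged without labels, although silhouette scores at or below 0.155 indicate substantial overlap between clusters.
- The clusters offer a foundation for AI-driven job recommendation or labor market analytics.
- Combining embeddings with TF-IDF improved interpretability.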

🚀 Future Enhancements

- Try BERT embeddings for contextual representations, or fastText embeddings for subword-aware ones.
- Experiment with dynamic clustering to track how job categories evolve over time.
- Integrate interactive visualization dashboards (Plotly / Power BI).

🧰 Tech Stack

- Python 3.10+
- SpaCy for NLP processing
- Scikit-learn for clustering algorithms
- Pandas, NumPy for data wrangling
- Matplotlib, Seaborn for visualization
- Jupyter Notebook for experimentation

⚙️ How to Run

# 1️⃣ Install dependencies
pip install -r requirements.txt

# 2️⃣ Open Jupyter Notebook
jupyter notebook EiEiKhaing_1290619_Project2.ipynb

# 3️⃣ Run all cells to preprocess, cluster, and visualize results

📁 Project Structure

NLP_Clustering_JobPostings/
│
├── EiEiKhaing_1290619_Project2.ipynb
├── README.md
├── requirements.txt
└── Clustering Job Postings Presentation.pptx

👩‍💻 Author

Ei Ei Khaing
Graduate Certificate in Artificial Intelligence & Machine Learning
Fanshawe College | London, Ontario, Canada

🔗 LinkedIn
💻 GitHub

🏷️ Keywords

NLP, Unsupervised Learning, K-Means, Agglomerative Clustering, SpaCy, TF-IDF, Job Market Analysis, Text Mining, Fanshawe College
