9eek9/NLP_Clustering_JobPostings

💼 NLP Job Postings Clustering using SpaCy and Unsupervised Learning

This project applies Natural Language Processing (NLP) and Unsupervised Machine Learning to group thousands of job postings from LinkedIn into meaningful clusters.
It demonstrates how unsupervised techniques like K-Means and Agglomerative Clustering can uncover hidden job market trends based on textual similarities in descriptions.


🎯 Objective

To automatically cluster job descriptions into categories that reveal underlying employment domains such as Technology, Healthcare, Retail, and Management, without prior labels.


📊 Dataset Overview

  • Source: LinkedIn Job Postings (2023–2024)
  • Total Records: 110,906 (10,000 sampled for analysis)
  • Main Columns:
    • title
    • description
    • company_name
    • location

🔹 Preprocessing Steps

  • Removed nulls, duplicates, and short descriptions (< 50 chars)
  • Combined title + description into a single text column
  • Cleaned text (lowercasing, punctuation removal, and stopword removal)
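
The steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the `STOPWORDS` set, the `clean_text` and `preprocess` helpers, and the sample rows are all hypothetical, and the column names `title` and `description` are taken from the dataset overview.

```python
import string

import pandas as pd

# Tiny stopword list for illustration only; the project's real list is not shown.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "is", "are"}
# Map every punctuation character to a space so words stay separated.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


def preprocess(df: pd.DataFrame, min_chars: int = 50) -> pd.DataFrame:
    """Drop nulls, duplicates, and short descriptions, then build a text column."""
    out = df.dropna(subset=["title", "description"])
    out = out.drop_duplicates(subset=["title", "description"])
    out = out[out["description"].str.len() >= min_chars].copy()
    out["text"] = (out["title"] + " " + out["description"]).map(clean_text)
    return out.reset_index(drop=True)


long_description = (
    "Build and maintain cloud data pipelines for the analytics team, "
    "using Python & SQL."
)
sample = pd.DataFrame(
    {
        "title": ["Data Engineer", "Data Engineer", "Nurse", "Cashier"],
        "description": [long_description, long_description, None, "Too short."],
    }
)
cleaned = preprocess(sample)
print(cleaned["text"].tolist())
```

Only one of the four sample rows survives: the duplicate, the null description, and the short description are each filtered out by a different step.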

⚙️ Text Processing with SpaCy

Used SpaCy for advanced text preprocessing:

  • Tokenization
  • Lemmatization
  • Stopword removal
  • Sentence normalization
  • Generation of dense vector embeddings

These embeddings capture semantic similarity between job descriptions that exact keyword matching misses.
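
To show how a document-level vector can arise from word vectors, here is a small NumPy sketch of mean pooling, which is what spaCy's `Doc.vector` does by default for pipelines with static word vectors. The 3-dimensional `TOY_VECTORS` table and the `embed` and `cosine` helpers are invented for illustration; real pipelines use far higher-dimensional vectors, and the source does not say which spaCy model was loaded.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for this example.
TOY_VECTORS = {
    "nurse": np.array([0.9, 0.1, 0.0]),
    "patient": np.array([0.8, 0.2, 0.1]),
    "care": np.array([0.7, 0.3, 0.0]),
    "cloud": np.array([0.0, 0.1, 0.9]),
    "software": np.array([0.1, 0.0, 0.8]),
    "engineer": np.array([0.2, 0.1, 0.7]),
}


def embed(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return np.zeros(3)
    return np.mean(known, axis=0)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


nursing = embed(["nurse", "patient", "care"])
caregiver = embed(["care", "patient"])
developer = embed(["cloud", "software", "engineer"])

print(cosine(nursing, caregiver), cosine(nursing, developer))
```

The two healthcare-style token lists end up much closer to each other than either is to the software one, even though they do not share every word.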


🧩 Feature Engineering

Two feature extraction methods were tested and compared:

| Feature Type | Description |
|---|---|
| TF-IDF | Captures word frequency importance across jobs |
| SpaCy Word Embeddings | Creates dense semantic representations |

Both were scaled and evaluated using unsupervised clustering algorithms.
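
A rough sketch of the two feature paths with scikit-learn follows. The example documents, the `max_features` value, and the random matrix standing in for the embedding features are assumptions made for illustration; the source does not state which scaler or vectorizer settings were used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

docs = [
    "registered nurse patient care hospital",
    "nurse practitioner clinic patient care",
    "software engineer cloud application design",
    "cloud engineer software deployment",
]

# Sparse TF-IDF matrix: one row per posting, one column per vocabulary term.
tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(docs)

# with_mean=False keeps the matrix sparse; centering would densify it.
X_tfidf_scaled = StandardScaler(with_mean=False).fit_transform(X_tfidf)

# Stand-in for dense document embeddings (random values, illustration only).
rng = np.random.default_rng(0)
X_embed = rng.normal(size=(len(docs), 8))
X_embed_scaled = StandardScaler().fit_transform(X_embed)

print(X_tfidf_scaled.shape, X_embed_scaled.shape)
```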


🤖 Models and Evaluation

| Model | Features | Silhouette Score | DB Index | Observation |
|---|---|---|---|---|
| K-Means | SpaCy Embeddings | 0.155 | 1.968 | ✅ Best-performing model |
| K-Means | TF-IDF | 0.132 | 2.21 | Moderate performance |
| Agglomerative | SpaCy Embeddings | 0.110 | 2.42 | Lower separation |

Evaluation Metrics:

  • Silhouette Score: ranges from -1 to 1; higher = more cohesive, better-separated clusters
  • Davies–Bouldin Index (DBI): lower = more compact, better-separated clusters
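
For reference, the standard definitions of the two metrics are given here. The symbols are introduced only for this note: $a(i)$ is the mean distance from point $i$ to the other points in its cluster, $b(i)$ is the smallest mean distance from $i$ to the points of any other cluster, $\sigma_i$ is the mean distance of cluster $i$'s points to its centroid $c_i$, and $k$ is the number of clusters.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \text{Silhouette Score} = \frac{1}{n} \sum_{i=1}^{n} s(i)
$$

$$
\text{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}
$$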

Best setup: K-Means with SpaCy embeddings (k = 6)
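
A comparison like the one in the table can be sketched with scikit-learn as below. This is an assumption-laden illustration rather than the notebook's code: the data is synthetic (three well-separated blobs from `make_blobs`), so the best `k` here is 3 and not the project's 6, and the `evaluate` helper and the range of `k` values are hypothetical.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the scaled feature matrix (three well-separated blobs).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)


def evaluate(model, X):
    """Fit a clusterer and return (silhouette score, Davies-Bouldin index)."""
    labels = model.fit_predict(X)
    return silhouette_score(X, labels), davies_bouldin_score(X, labels)


results = {}
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    results[("kmeans", k)] = evaluate(kmeans, X)
    results[("agglomerative", k)] = evaluate(AgglomerativeClustering(n_clusters=k), X)

# Pick the configuration with the highest silhouette score.
best = max(results, key=lambda key: results[key][0])
print(best, results[best])
```

Selecting on silhouette alone is one possible design choice; checking that the Davies-Bouldin index agrees guards against a single metric favoring a degenerate split.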


📈 Visualization & Cluster Insights

Dimensionality reduction via PCA was applied for visualization.
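
A minimal sketch of that projection, assuming scikit-learn's `PCA` and Matplotlib. The random matrix `X`, the random `labels`, and the `plot_clusters` helper are placeholders for the real features and cluster assignments.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # stand-in for the feature matrix
labels = rng.integers(0, 6, size=200)   # stand-in for cluster assignments

# Project the features onto the two directions of greatest variance.
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()


def plot_clusters(coords, labels):
    """Scatter the 2D projection, coloring each point by its cluster label."""
    import matplotlib.pyplot as plt  # imported lazily so the PCA step runs without it

    fig, ax = plt.subplots(figsize=(6, 5))
    scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.legend(*scatter.legend_elements(), title="Cluster")
    return fig


print(f"Variance explained by 2 components: {explained:.1%}")
```

Calling `plot_clusters(coords, labels)` draws the scatter plot. The explained-variance figure is worth reporting next to it, since a 2D view of high-dimensional features can hide separation that exists in the full space.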

Top 6 Cluster Themes:

| Cluster | Theme | Keywords |
|---|---|---|
| 0 | Management & Operations | “Manager”, “Customer”, “Process” |
| 1 | Retail & Sales | “Customer service”, “Store”, “Product” |
| 2 | Hospitality | “Hotel”, “Experience”, “Spanish” |
| 3 | Tech & Software | “Design”, “Cloud”, “Application” |
| 4 | Healthcare | “Nurse”, “Care”, “Physician” |
| 5 | Manufacturing | “Equipment”, “Quality”, “Factory” |
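
One simple way to derive keywords like these is to count the most frequent tokens within each cluster. The sketch below uses only the standard library; the `top_terms` helper and the toy texts and labels are hypothetical, and the source does not say exactly how the keywords above were chosen.

```python
from collections import Counter, defaultdict


def top_terms(texts, labels, n=3):
    """Return the n most frequent tokens for each cluster label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.split())
    return {
        label: [term for term, _ in counter.most_common(n)]
        for label, counter in sorted(counts.items())
    }


# Toy cleaned texts with made-up cluster assignments.
texts = [
    "nurse patient care nurse",
    "care physician nurse",
    "cloud application design cloud",
    "application cloud",
]
labels = [4, 4, 3, 3]

themes = top_terms(texts, labels, n=2)
print(themes)
```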

Visualizations include:

  • Silhouette plots
  • PCA 2D scatter plots
  • Word frequency analysis per cluster

💡 Key Findings

  • SpaCy embeddings outperformed TF-IDF in producing semantically coherent clusters.
  • Clear domain-based segmentation emerged even without labels.
  • The resulting clusters offer a starting point for AI-driven job recommendation or labor market analytics.
  • Combining embeddings + TF-IDF improved interpretability.

🚀 Future Enhancements

  • Implement BERT (contextual) or fastText (subword-aware) embeddings for richer text representations.
  • Experiment with dynamic clustering for time-based job evolution.
  • Integrate interactive visualization dashboards (Plotly / Power BI).

🧰 Tech Stack

  • Python 3.10+
  • SpaCy for NLP processing
  • Scikit-learn for clustering algorithms
  • Pandas, NumPy for data wrangling
  • Matplotlib, Seaborn for visualization
  • Jupyter Notebook for experimentation

⚙️ How to Run

# 1️⃣ Install dependencies
pip install -r requirements.txt

# 2️⃣ Open Jupyter Notebook
jupyter notebook EiEiKhaing_1290619_Project2.ipynb

# 3️⃣ Run all cells to preprocess, cluster, and visualize results

📁 Project Structure

NLP_Clustering_JobPostings/
│
├── EiEiKhaing_1290619_Project2.ipynb
├── README.md
├── requirements.txt
└── Clustering Job Postings Presentation.pptx

👩‍💻 Author

Ei Ei Khaing
Graduate Certificate in Artificial Intelligence & Machine Learning
Fanshawe College | London, Ontario, Canada

🔗 LinkedIn
💻 GitHub


🏷️ Keywords

NLP Unsupervised Learning K-Means Agglomerative Clustering SpaCy TF-IDF Job Market Analysis Text Mining Fanshawe College
