Skip to content

Latest commit

ย 

History

History
99 lines (67 loc) ยท 4.79 KB

File metadata and controls

99 lines (67 loc) ยท 4.79 KB

๐Ÿ” Exploratory Data Analysis w/ SQL: Job Market Analytics

EDA Project Overview

A SQL project analyzing the data engineer job market using real world job posting data. It demonstrates my ability to write production-quality analytical SQL, design efficient queries, and turn business questions into data-driven insights.


๐Ÿงพ Executive Summary (For Hiring Managers)

  • โœ… Project scope: Built 3 analytical queries that answer key questions about the data engineer job market
  • โœ… Data modeling: Used multi-table joins across fact and dimension tables to extract insights
  • โœ… Analytics: Applied aggregations, filtering, and sorting to find top skills by demand, salary, and overall value
  • โœ… Outcomes: Delivered actionable insights on SQL/Python dominance, cloud trends, and salary patterns

If you only have a minute, review these:

  1. 01_top_demanded_skills.sql โ€“ demand analysis with multi-table joins
  2. 02_top_paying_skills.sql โ€“ salary analysis with aggregations
  3. 03_optimal_skills.sql โ€“ combined demand/salary optimization query

๐Ÿงฉ Problem & Context

Job market analysts need to answer questions like:

  • ๐ŸŽฏ Most in-demand: Which skills are most in-demand for data engineers?
  • ๐Ÿ’ฐ Highest paid: Which skills command the highest salaries?
  • โš–๏ธ Best trade-off: What is the optimal skill set balancing demand and compensation?

This project analyzes a data warehouse built using a star schema design. The warehouse structure consists of:

Data Warehouse Schema

  • Fact Table: job_postings_fact - Central table containing job posting details (job titles, locations, salaries, dates, etc.)
  • Dimension Tables:
    • company_dim - Company information linked to job postings
    • skills_dim - Skills catalog with skill names and types
  • Bridge Table: skills_job_dim - Resolves the many-to-many relationship between job postings and skills

By querying across these interconnected tables, I extracted insights about skill demand, salary patterns, and optimal skill combinations for data engineering roles.


๐Ÿงฐ Tech Stack

  • ๐Ÿค Query Engine: DuckDB for fast OLAP-style analytical queries
  • ๐Ÿงฎ Language: SQL (ANSI-style with analytical functions)
  • ๐Ÿ“Š Data Model: Star schema with fact + dimension + bridge tables
  • ๐Ÿ› ๏ธ Development: VS Code for SQL editing + Terminal for DuckDB CLI
  • ๐Ÿ“ฆ Version Control: Git/GitHub for versioned SQL scripts

๐Ÿ“‚ Repository Structure

1_EDA/
โ”œโ”€โ”€ 01_top_demanded_skills.sql    # Demand analysis query
โ”œโ”€โ”€ 02_top_paying_skills.sql      # Salary analysis query
โ”œโ”€โ”€ 03_optimal_skills.sql         # Combined demand/salary optimization
โ””โ”€โ”€ README.md                     # You are here

๐Ÿ— Analysis Overview

Query Structure

  1. Top Demanded Skills โ€“ Identifies the 10 most in-demand skills for remote data engineer positions
  2. Top Paying Skills โ€“ Analyzes the 25 highest-paying skills with salary and demand metrics
  3. Optimal Skills โ€“ Calculates an optimal score using natural log of demand combined with median salary to identify the most valuable skills to learn

Key Insights

  • ๐Ÿง  Core languages: SQL and Python each appear in ~29,000 job postings, making them the most demanded skills
  • โ˜๏ธ Cloud platforms: AWS and Azure are critical for modern data engineering roles-
  • ๐Ÿงฑ Infra & tooling: Kubernetes, Docker, and Terraform are associated with premium salaries
  • ๐Ÿ”ฅ Big data tools: Apache Spark shows strong demand with competitive compensation

๐Ÿ’ป SQL Skills Demonstrated

Query Design & Optimization

  • Complex Joins: Multi-table INNER JOIN operations across job_postings_fact, skills_job_dim, and skills_dim
  • Aggregations: COUNT(), MEDIAN(), ROUND() for statistical analysis
  • Filtering: Boolean logic with WHERE clauses and multiple conditions (job_title_short, job_work_from_home, salary_year_avg IS NOT NULL)
  • Sorting & Limiting: ORDER BY with DESC and LIMIT for top-N analysis

Data Analysis Techniques

  • Grouping: GROUP BY for categorical analysis by skill
  • Mathematical Functions: LN() for natural logarithm transformation to normalize demand metrics
  • Calculated Metrics: Derived optimal score combining log-transformed demand with median salary
  • HAVING Clause: Filtering aggregated results (skills with >= 100 postings)
  • NULL Handling: Proper filtering of incomplete records (salary_year_avg IS NOT NULL)