24 commits
f0a458e
updated
Sep 22, 2025
105b1a7
merged and dropped
Sep 22, 2025
ea048c9
Changed files naming convention
luispabloaiello-da Sep 23, 2025
ce18e8d
Merge pull request #1 from SeenoUrbanism/luispabloaiello-da
luispabloaiello-da Sep 23, 2025
7350693
first cleaning
Sep 23, 2025
2dd755f
Merge pull request #2 from SeenoUrbanism/sina
SeenoUrbanism Sep 23, 2025
6238a8e
changes before merge
feitenjj Sep 23, 2025
47e5ae7
Merge branch 'main' of https://github.com/SeenoUrbanism/first_project
feitenjj Sep 23, 2025
1cb3b1d
2nd day - cleaning
feitenjj Sep 23, 2025
d127ba1
Merge pull request #3 from SeenoUrbanism/janna
SeenoUrbanism Sep 23, 2025
009b995
functions.py
luispabloaiello-da Sep 24, 2025
8223ff1
functions.py
luispabloaiello-da Sep 24, 2025
9966a7b
Merge pull request #4 from SeenoUrbanism/luispabloaiello-da
SeenoUrbanism Sep 24, 2025
a89b965
EDA Analysis
luispabloaiello-da Sep 24, 2025
c0257ac
Merge pull request #5 from SeenoUrbanism/luispabloaiello-da
SeenoUrbanism Sep 24, 2025
9fa00d4
Create new notebooks fro insights and SQL and add comments and descri…
luispabloaiello-da Sep 24, 2025
0a5a36b
4th day
luispabloaiello-da Sep 25, 2025
be51f5f
4th day
luispabloaiello-da Sep 25, 2025
9d6fd1c
Merge pull request #6 from SeenoUrbanism/luispabloaiello-da
SeenoUrbanism Sep 25, 2025
40e3a10
last day edits
Sep 25, 2025
1f1305b
Merge pull request #7 from SeenoUrbanism/sina
SeenoUrbanism Sep 25, 2025
6b301bc
last day ultimatum
luispabloaiello-da Sep 26, 2025
04a0b2b
last version not yet finished
luispabloaiello-da Sep 26, 2025
8231b2f
project completed
luispabloaiello-da Sep 28, 2025
1 change: 1 addition & 0 deletions .python-version
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
3.13
77 changes: 77 additions & 0 deletions README - Copy.md
@@ -0,0 +1,77 @@
# Project overview
...

# Installation

1. **Clone the repository**:

```bash
git clone https://github.com/YourUsername/repository_name.git
```

2. **Install UV**

If you're a macOS/Linux user, type:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

If you're a Windows user, open an Anaconda PowerShell Prompt and type:

```bash
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

3. **Create an environment**

```bash
uv venv
```

By default, `uv venv` creates the environment in a `.venv` directory.

4. **Activate the environment**

If you're a macOS/Linux user with a bash or zsh shell, type:

```bash
source .venv/bin/activate
```

If you're a macOS/Linux user with a csh/tcsh shell, type:

```bash
source .venv/bin/activate.csh
```

If you're a Windows user, type:

```bash
.venv\Scripts\activate
```

5. **Install dependencies**:

```bash
uv pip install -r requirements.txt
```

# Questions
...

# Dataset
...

## Main dataset issues

- ...
- ...
- ...

## Solutions for the dataset issues
...

# Conclusions
...

# Next steps
...
153 changes: 102 additions & 51 deletions README.md
@@ -1,77 +1,128 @@
# NYC Jobs Data Analysis & Data Cleaning Project

## Overview

This project focuses on cleaning, transforming, and analyzing public job postings for New York City, using a structured Python (Pandas) workflow and custom functions for robust data wrangling. The source data includes large CSVs with diverse job roles, salary ranges, and categorical features. The main goal is to provide clean, analysis-ready datasets for insights into salary distributions, contract types, agency trends, and demand for data-centric skills.

The project is organized into modular Jupyter notebooks and Python scripts, with a special emphasis on reproducible preprocessing and targeted extraction of roles related to data analysis, engineering, and modern data skills.

The workflow is structured around dynamic file paths managed via a YAML configuration file, ensuring portability and easy customization.

---

## Data Sources

- **NYC Jobs CSV files:** Two separate CSVs with thousands of job postings, each containing up to 30 columns.
- **YAML configuration:** Manages all input/output paths for raw and cleaned data, as well as output figures.

---

## Main Files and Structure

### 1. `data_wrangling.ipynb`

- Loads raw CSVs and YAML config for path management.
- Applies cleaning functions from `functions.py`:
- Standardizes column names.
- Drops duplicates by job ID.
- Removes unnecessary columns.
- Cleans punctuation and normalizes text (job titles, skills).
- Converts dates to pandas datetime format.
- Filters jobs by:
- **Business title** (extracts roles like "data analyst", "data engineer").
- **Preferred skills** (finds mentions of SQL, Python, BI, Tableau, ML, etc.).
- Outputs three grouped CSVs:
- All other jobs.
- Data analyst/engineer roles.
- Jobs requiring specific data skills.
- Includes summary tables for nulls, column types, and value counts.
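The cleaning-and-splitting steps above can be sketched with plain pandas. This is a minimal, self-contained illustration on toy data; the real CSVs have up to 30 columns, and the exact column names and regex pattern in the notebook may differ:

```python
import pandas as pd

# Toy stand-in for the raw postings CSV; real column names may differ.
df = pd.DataFrame({
    "Job ID": [1, 1, 2, 3],
    "Business Title": ["Senior Data Analyst", "Senior Data Analyst",
                       "Data Engineer II", "Park Ranger"],
    "Preferred Skills": ["SQL, Tableau", "SQL, Tableau", "Python, Spark", None],
})

# Standardize column names: lowercase with underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Drop duplicate postings by job id.
df = df.drop_duplicates(subset="job_id")

# Split rows by a business-title regex (illustrative pattern).
title_mask = df["business_title"].str.contains(
    r"data analyst|data engineer", case=False, na=False
)
data_roles = df[title_mask]
other_jobs = df[~title_mask]

# Each group would then be written to the CSV paths defined in config.yaml.
print(len(data_roles), len(other_jobs))  # 2 1
```

In the notebook, each of the resulting groups is saved to its own cleaned CSV for downstream analysis.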

### 2. `functions.py`

- Contains all custom data cleaning and transformation functions:
- Standardize column names.
- Drop duplicates.
- Concatenate DataFrames.
- Remove punctuation and lowercase.
- Drop irrelevant columns.
- Regex-based row filtering.
- Standardize dates.
- Functions are written for flexible, repeatable use in notebooks.
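Two helpers in the spirit of `functions.py` might look like the sketch below; the actual names and signatures in the repository may differ:

```python
import pandas as pd

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase column names and replace spaces with underscores."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    return out

def clean_text(series: pd.Series) -> pd.Series:
    """Strip punctuation and lowercase a free-text column."""
    return (series.fillna("")
                  .str.replace(r"[^\w\s]", "", regex=True)
                  .str.lower()
                  .str.strip())

df = pd.DataFrame({"Business Title": ["Data Analyst, Sr.", None]})
df = standardize_columns(df)
df["business_title"] = clean_text(df["business_title"])
print(df["business_title"].tolist())  # ['data analyst sr', '']
```

Keeping such functions out of the notebooks is what makes the cleaning steps repeatable across both input CSVs.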

### 3. `data_insights - Copy.ipynb`

- Loads cleaned CSVs and applies further transformation as needed.
- Explores:
- Salary distributions by role type and skill requirements.
- Contract frequency (annual, hourly, daily) by group.
- Posting trends over time, highlighting recent demand for data talent.
- Produces visualizations (matplotlib, seaborn):
- KDE plots for salary bands.
- Bar charts for contract types and agency hiring.
- Histograms for posting year.
- Includes documented code cells explaining each plot and table and how to interpret results.
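As a rough sketch of the posting-year histogram (toy dates and plain matplotlib here; the notebook reads the cleaned CSVs and also uses seaborn):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the example runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy posting dates; the real notebook loads these from the cleaned CSVs.
dates = pd.to_datetime(pd.Series(
    ["2021-03-01", "2022-06-15", "2023-01-20", "2023-09-05", "2024-02-11"]
))

fig, ax = plt.subplots()
years = dates.dt.year
ax.hist(years, bins=range(years.min(), years.max() + 2), edgecolor="black")
ax.set_xlabel("Posting year")
ax.set_ylabel("Number of postings")
ax.set_title("Job postings per year")
# Figures are saved to the paths defined in config.yaml.
```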

---

## Configuration

- All file paths for raw inputs, cleaned outputs, and figures are managed in `config.yaml`:
- Update this file to change inputs and outputs without modifying notebook logic.
- Example outputs managed via YAML include:
- Cleaned CSVs by job type and skill.
- Figures for salary, agency, and trends.
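Loading those paths might look like the following sketch using PyYAML's `safe_load`; the keys mirror `config.yaml`, but for self-containment the snippet parses an inline string rather than the actual file:

```python
import yaml  # PyYAML

# Inline stand-in for config.yaml; the real file lives alongside the notebooks.
config_text = """
input_data:
  file1: "../data/raw/raw_data_Jobs_NYC_Postings.csv"
  file2: "../data/raw/raw_data_Jobs_NYC_Postings_2.csv"
output_data:
  file2: "../data/clean/cleaned_data_file_data_analyst.csv"
  fig1: "../figures/figure1.jpeg"
"""

cfg = yaml.safe_load(config_text)

# Notebooks read paths from the parsed dict instead of hard-coding them.
raw_path = cfg["input_data"]["file1"]
print(raw_path)
```

Because every notebook resolves paths through this dict, moving the data only requires editing the YAML, not the code.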

---

## Key Features

- **Robust Data Cleaning:** Handles missing values, inconsistent text, irrelevant columns, and duplicates.
- **Skill Filtering:** Extracts jobs by business title and by presence of data-related keywords in the skills field using regex.
- **Date Normalization:** Converts multiple date formats to pandas datetime for time-series analysis.
- **Modular Outputs:** Splits the cleaned data into logical groups for focused analysis.
- **Configurable Workflow:** Uses YAML for paths, making the notebooks portable and reusable.
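The date-normalization feature, for instance, reduces to a `pd.to_datetime` call; `errors="coerce"` turns unparseable entries into `NaT` instead of raising (a minimal sketch with illustrative data):

```python
import pandas as pd

# Illustrative posting-date column with one malformed entry.
raw_dates = pd.Series(["2023-09-05", "2024-02-11", "not a date"])

# errors="coerce" converts unparseable values to NaT, so the column
# becomes a proper datetime64 series ready for time-series analysis.
posting_date = pd.to_datetime(raw_dates, errors="coerce")
print(posting_date.dtype)         # datetime64[ns]
print(posting_date.isna().sum())  # 1 malformed entry became NaT
```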

---

## Example Insights

- **Salary Distribution:** Data analyst and engineering roles have higher median starting salaries compared to general postings. Jobs mentioning modern data skills also tend to offer higher pay.
- **Contract Type:** Most data-related roles are annual contracts; hourly/daily contracts are rare.
- **Trends Over Time:** Demand for data-centric jobs is increasing, with more postings in recent years.
- **Agency Hiring:** Top agencies hiring for data roles can be visualized and ranked using the included analysis.

---

## How to Use

1. **Configure file paths:** Update `config.yaml` as needed for input/output CSVs and figures.
2. **Run `data_wrangling.ipynb`:** This notebook processes raw data into analysis-ready CSVs.
3. **Run `data_insights.ipynb`:** Explore cleaned datasets and generate summary tables and visualizations.
4. **Customize filtering:** Adjust regex patterns or column selections in notebooks/scripts to focus on different roles or skills if needed.
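For step 4, widening or narrowing the match is just a matter of editing the pattern; a hypothetical skills pattern might look like this (the actual pattern in the notebooks may differ):

```python
import pandas as pd

# Hypothetical skills pattern; extend the alternation to target other stacks.
SKILLS_PATTERN = r"\bsql\b|\bpython\b|tableau|power bi|machine learning"

skills = pd.Series(["SQL and Tableau", "Excel only", "Python, ML pipelines"])
mask = skills.str.contains(SKILLS_PATTERN, case=False, na=False)
print(mask.tolist())  # [True, False, True]
```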

---

## Files Included

- `data_wrangling.ipynb` — Main notebook for data cleaning and preparation.
- `functions.py` — Library of custom preprocessing functions.
- `data_insights.ipynb` — Notebook for data analysis and visualization.
- `config.yaml` — Centralized config for all input/output file locations.
- **CSV Outputs:** Cleaned, grouped datasets for further analysis (paths managed with YAML).
- **Figures:** Visual outputs saved per YAML config.

## Presentation Slides
[Link to Slides](https://www.canva.com/design/DAG0Ra3GTHo/xIG_axW6IWI_54hR0ECAoQ/edit?utm_content=DAG0Ra3GTHo&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton)

---

## Authors
- Janna Julian
- Sina Yazdi
- Luis Pablo Aiello

---

## License

This repository is for educational, analytical, and non-commercial purposes only. Data is derived from publicly available NYC jobs datasets.
Binary file added anaconda_projects/db/project_filebrowser.db
Binary file not shown.
15 changes: 13 additions & 2 deletions config.yaml
@@ -1,5 +1,16 @@
 input_data:
-  file: "../data/raw/raw_data_file.csv"
+  file1: "../data/raw/raw_data_Jobs_NYC_Postings.csv"
+  file2: "../data/raw/raw_data_Jobs_NYC_Postings_2.csv"

 output_data:
-  file: "../data/clean/cleaned_data_file.csv"
+  file1: "../data/clean/cleaned_data_file_nyc_postings.csv"
+  file2: "../data/clean/cleaned_data_file_data_analyst.csv"
+  file3: "../data/clean/cleaned_data_file_keywords.csv"
+  file4: "../data/clean/cleaned_data_file_overlap.csv"
+  fig1: "../figures/figure1.jpeg"
+  fig2: "../figures/figure2.jpeg"
+  fig3: "../figures/figure3.jpeg"
+  fig4: "../figures/figure4.jpeg"
+  fig5: "../figures/figure5.jpeg"
+  fig6: "../figures/figure6.jpeg"
+  fig7: "../figures/figure7.jpeg"
Empty file removed data/clean/cleaned_data_file.csv
Empty file.