This project implements a two-stage Python pipeline that scrapes product information from the Phasics website, saves it to a structured JSON file, and then performs a follow-up analysis on the collected data.
- Two-Stage Pipeline: The project is split into two distinct, single-responsibility scripts:
  - `scraper.py`: Handles all web scraping and data extraction.
  - `analyzer.py`: Handles all data processing, analysis, and presentation.
- Parsing Engine: It uses a rule-based parsing engine built on regex to classify each specification string into a structured data format.
  - It identifies and parses different data types: numbers, ranges, and multi-part dimensions.
  - It preserves the full context of the data, capturing units, qualifiers (like `<`, `~`), and parenthetical extra information.
  - This is all managed through a schema using Python's `TypedDict`.
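The actual schema and rules live in `scraper.py` and are not shown here; the following is a minimal sketch of what a `TypedDict` schema plus regex classification could look like. The `ParsedSpec` name, its fields, and the `classify` helper are illustrative assumptions, not the project's real API.

```python
import re
from typing import TypedDict

class ParsedSpec(TypedDict, total=False):
    """Hypothetical schema; field names are illustrative, not the project's."""
    kind: str          # "number" | "range" | "dimensions" | "text"
    value: float       # for kind == "number"
    low: float         # for kind == "range"
    high: float
    parts: list[float] # for kind == "dimensions", e.g. "10 x 10 mm"
    unit: str
    qualifier: str     # leading qualifier such as "<" or "~"
    extra: str         # parenthetical extra information

def classify(spec: str) -> ParsedSpec:
    # Peel off a leading qualifier and a trailing parenthetical, if present.
    qualifier = ""
    if m := re.match(r"^\s*([<>~])\s*(.*)$", spec):
        qualifier, spec = m.group(1), m.group(2)
    extra = ""
    if m := re.search(r"\(([^)]*)\)\s*$", spec):
        extra, spec = m.group(1), spec[:m.start()].strip()

    num = r"\d+(?:\.\d+)?"
    if m := re.fullmatch(rf"({num})\s*-\s*({num})\s*(\S*)", spec):
        return ParsedSpec(kind="range", low=float(m[1]), high=float(m[2]),
                          unit=m[3], qualifier=qualifier, extra=extra)
    if m := re.fullmatch(rf"({num}(?:\s*x\s*{num})+)\s*(\S*)", spec):
        parts = [float(p) for p in re.split(r"\s*x\s*", m[1])]
        return ParsedSpec(kind="dimensions", parts=parts, unit=m[2],
                          qualifier=qualifier, extra=extra)
    if m := re.fullmatch(rf"({num})\s*(\S*)", spec):
        return ParsedSpec(kind="number", value=float(m[1]), unit=m[2],
                          qualifier=qualifier, extra=extra)
    return ParsedSpec(kind="text", qualifier=qualifier, extra=extra)
```

For example, `classify("400 - 1100 nm")` would yield a `range` record with its unit preserved, while `classify("< 5 nm")` keeps the `<` qualifier alongside the numeric value.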
- Python 3.10+
- We recommend creating and activating a new virtual environment (e.g., with `venv`).
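On a Unix-like shell, the recommended environment setup might look like this (the `.venv` directory name is just a common convention):

```shell
# Create a virtual environment in ./.venv (any name works)
python3 -m venv .venv

# Activate it (on Windows use: .venv\Scripts\activate)
source .venv/bin/activate
```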
Install the project dependencies.

```bash
# 1. (Optional but recommended) Create and activate a virtual environment
# 2. Install dependencies from the lock file
pip install -r requirements.txt
```

The pipeline is a two-step process.
Step 1: Run the Scraper
This will fetch the data from the website and create the `data/products.json` file.
```bash
python scraper.py
```

Step 2: Run the Analyzer
This will load the generated JSON file, process the data, and print the final sorted table to your terminal.
```bash
python analyzer.py
```

There are spelling mistakes on the website (e.g., "wavelenght" instead of "wavelength") and specifications that are the same but named differently across products ("accuracy" vs. "accuracy (absolute)"). I decided to ignore these issues, as they don't seem to be within the scope of this exercise.
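If these inconsistencies ever did need handling, one option would be a small alias map applied to the spec keys before analysis. This is purely a hypothetical sketch (the project intentionally skips this step, and the alias entries are just the two examples from the note above):

```python
# Hypothetical cleanup pass; not part of the actual pipeline.
SPEC_KEY_ALIASES = {
    "wavelenght": "wavelength",         # spelling mistake on the website
    "accuracy (absolute)": "accuracy",  # same spec, named differently
}

def normalize_keys(specs: dict[str, str]) -> dict[str, str]:
    """Map known misspellings and variant names onto canonical spec keys."""
    return {
        SPEC_KEY_ALIASES.get(k.strip().lower(), k.strip().lower()): v
        for k, v in specs.items()
    }
```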