This project implements a two-stage Python pipeline that scrapes product information from the Phasics website, saves it to a structured JSON file, and then performs a follow-up analysis on the collected data.
- Two-Stage Pipeline: The project is split into two distinct, single-responsibility scripts:
  - `scraper.py`: Handles all web scraping and data extraction.
  - `analyzer.py`: Handles all data processing, analysis, and presentation.
- Parsing Engine: It uses a rule-based parsing engine built on regex to classify each specification string into a structured data format.
  - It identifies and parses different data types: numbers, ranges, and multi-part dimensions.
  - It preserves the full context of the data, capturing units, qualifiers (like `<`, `~`), and parenthetical extra information.
  - This is all managed through a schema using Python's `TypedDict`.
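The actual schema and rules live in `scraper.py` and are not shown here; the following is a minimal sketch of what a `TypedDict` schema plus regex classification could look like. The `ParsedSpec` name, its fields, and the `classify` helper are illustrative assumptions, not the project's real API.

```python
import re
from typing import TypedDict

class ParsedSpec(TypedDict, total=False):
    """Hypothetical schema; field names are illustrative, not the project's."""
    kind: str          # "number" | "range" | "dimensions" | "text"
    value: float       # for kind == "number"
    low: float         # for kind == "range"
    high: float
    parts: list[float] # for kind == "dimensions", e.g. "10 x 10 mm"
    unit: str
    qualifier: str     # leading qualifier such as "<" or "~"
    extra: str         # parenthetical extra information

def classify(spec: str) -> ParsedSpec:
    # Peel off a leading qualifier and a trailing parenthetical, if present.
    qualifier = ""
    if m := re.match(r"^\s*([<>~])\s*(.*)$", spec):
        qualifier, spec = m.group(1), m.group(2)
    extra = ""
    if m := re.search(r"\(([^)]*)\)\s*$", spec):
        extra, spec = m.group(1), spec[:m.start()].strip()

    num = r"\d+(?:\.\d+)?"
    if m := re.fullmatch(rf"({num})\s*-\s*({num})\s*(\S*)", spec):
        return ParsedSpec(kind="range", low=float(m[1]), high=float(m[2]),
                          unit=m[3], qualifier=qualifier, extra=extra)
    if m := re.fullmatch(rf"({num}(?:\s*x\s*{num})+)\s*(\S*)", spec):
        parts = [float(p) for p in re.split(r"\s*x\s*", m[1])]
        return ParsedSpec(kind="dimensions", parts=parts, unit=m[2],
                          qualifier=qualifier, extra=extra)
    if m := re.fullmatch(rf"({num})\s*(\S*)", spec):
        return ParsedSpec(kind="number", value=float(m[1]), unit=m[2],
                          qualifier=qualifier, extra=extra)
    return ParsedSpec(kind="text", qualifier=qualifier, extra=extra)
```

For example, `classify("400 - 1100 nm")` would yield a `range` record with its unit preserved, while `classify("< 5 nm")` keeps the `<` qualifier alongside the numeric value.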
- Python 3.10+
- We recommend creating and activating a new virtual environment (e.g., with `venv`).
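On a Unix-like shell, the recommended environment setup might look like this (the `.venv` directory name is just a common convention):

```shell
# Create a virtual environment in ./.venv (any name works)
python3 -m venv .venv

# Activate it (on Windows use: .venv\Scripts\activate)
source .venv/bin/activate
```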
Install the project dependencies.

```bash
# 1. (Optional but recommended) Create and activate a virtual environment
# 2. Install dependencies from the lock file
pip install -r requirements.txt
```

The pipeline is a two-step process.
Step 1: Run the Scraper
This will fetch the data from the website and create the `data/products.json` file.
```bash
python scraper.py
```

Step 2: Run the Analyzer
This will load the generated JSON file, process the data, and print the final sorted table to your terminal.
```bash
python analyzer.py
```

There are spelling mistakes on the website (e.g., "wavelenght" instead of "wavelength") and specifications that are the same but named differently across products ("accuracy" vs. "accuracy (absolute)"). I decided to ignore these issues, as they don't seem to be within the scope of this exercise.
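If these inconsistencies ever did need handling, one option would be a small alias map applied to the spec keys before analysis. This is purely a hypothetical sketch (the project intentionally skips this step, and the alias entries are just the two examples from the note above):

```python
# Hypothetical cleanup pass; not part of the actual pipeline.
SPEC_KEY_ALIASES = {
    "wavelenght": "wavelength",         # spelling mistake on the website
    "accuracy (absolute)": "accuracy",  # same spec, named differently
}

def normalize_keys(specs: dict[str, str]) -> dict[str, str]:
    """Map known misspellings and variant names onto canonical spec keys."""
    return {
        SPEC_KEY_ALIASES.get(k.strip().lower(), k.strip().lower()): v
        for k, v in specs.items()
    }
```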