yasio90/meetoptics-ex2

Exercise 2: Web Scraping and Data Analysis Pipeline

This project implements a two-stage Python pipeline that scrapes product information from the Phasics website, saves it to a structured JSON file, and then performs a follow-up analysis on the collected data.

Architectural Highlights & Features

  • Two-Stage Pipeline: The project is split into two distinct, single-responsibility scripts:

    1. scraper.py: Handles all web scraping and data extraction.
    2. analyzer.py: Handles all data processing, analysis, and presentation.
  • Parsing Engine: A rule-based engine built on regular expressions classifies each specification string and converts it into structured data.

    • It identifies and parses different data types: single numbers, ranges, and multi-part dimensions.
    • It preserves the full context of each value, capturing units, qualifiers (like <, ~), and any extra information given in parentheses.
    • The parsed output follows a schema defined with Python's TypedDict.
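The parsing rules described above can be illustrated with a minimal sketch. This is not the code from scraper.py: the names ParsedSpec and parse_spec, the field layout, the regular expressions, and the sample strings are all assumptions chosen for illustration. It only shows how ordered regex rules can sort a specification string into dimensions, a range, a single number, or plain text while keeping the unit, qualifier, and parenthetical note.

```python
import re
from typing import Optional, TypedDict


class ParsedSpec(TypedDict):
    """Hypothetical schema for one parsed specification value."""
    kind: str                  # "dimensions", "range", "number", or "text"
    values: list[float]        # numeric parts, in order of appearance
    unit: Optional[str]        # trailing unit text, if any
    qualifier: Optional[str]   # leading "<", ">", or "~", if any
    extra: Optional[str]       # text found inside parentheses, if any
    raw: str                   # the original, unmodified string


NUM = r"\d+(?:\.\d+)?"  # unsigned integer or decimal (negatives not handled)
EXTRA_RE = re.compile(r"\(([^)]*)\)")
# \u00d7 is the multiplication sign, \u2013 is an en dash.
DIMENSIONS_RE = re.compile(rf"^({NUM}(?:\s*[x\u00d7]\s*{NUM})+)\s*(.*)$")
RANGE_RE = re.compile(rf"^({NUM})\s*(?:-|\u2013|to)\s*({NUM})\s*(.*)$")
NUMBER_RE = re.compile(rf"^([<>~]?)\s*({NUM})\s*(.*)$")


def parse_spec(raw: str) -> ParsedSpec:
    """Classify a specification string using ordered regex rules."""
    text = raw.strip()

    # Pull out the first parenthetical note so it does not pollute the unit.
    extra_match = EXTRA_RE.search(text)
    extra = extra_match.group(1).strip() if extra_match else None
    if extra_match:
        text = (text[:extra_match.start()] + text[extra_match.end():]).strip()

    def result(kind: str, values: list[float],
               unit: str = "", qualifier: str = "") -> ParsedSpec:
        return ParsedSpec(kind=kind, values=values, unit=unit.strip() or None,
                          qualifier=qualifier or None, extra=extra, raw=raw)

    # Rules run from most specific to least specific.
    if m := DIMENSIONS_RE.match(text):
        parts = re.findall(NUM, m.group(1))
        return result("dimensions", [float(p) for p in parts], m.group(2))
    if m := RANGE_RE.match(text):
        return result("range", [float(m.group(1)), float(m.group(2))], m.group(3))
    if m := NUMBER_RE.match(text):
        return result("number", [float(m.group(2))], m.group(3), m.group(1))
    return result("text", [])


if __name__ == "__main__":
    # Made-up sample strings, not taken from the scraped website.
    for sample in ("< 2 nm RMS", "400 - 1100 nm", "13.3 x 13.3 mm (sensor)"):
        print(parse_spec(sample))
```

Ordering the rules from most to least specific matters in this sketch: a dimension string such as "13.3 x 13.3 mm" would otherwise be picked up by the single-number rule.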

How to Run the Project

Prerequisites

  • Python 3.10+
  • We recommend creating and activating a new virtual environment (e.g., with venv).

1. Setup

Install the project dependencies.

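# 1. (Optional but recommended) Create and activate a virtual environment
#    (.venv is an arbitrary directory name)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# 2. Install the dependencies listed in requirements.txt
pip install -r requirements.txt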

2. Running the Pipeline

The pipeline is a two-step process.

Step 1: Run the Scraper

This will fetch the data from the website and create the data/products.json file.

python scraper.py
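As a rough illustration of the final step of this stage, the sketch below writes a list of product records to data/products.json. The function name save_products and the record layout are assumptions for illustration, not the actual contents of scraper.py, and the fetching and parsing steps are left out.

```python
import json
from pathlib import Path


def save_products(products: list[dict], path: str = "data/products.json") -> Path:
    """Write the scraped records to a JSON file, creating the folder if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps unit symbols readable in the file.
        json.dump(products, fh, indent=2, ensure_ascii=False)
    return out
```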

Step 2: Run the Analyzer

This will load the generated JSON file, process the data, and print the final sorted table to your terminal.

python analyzer.py
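The load, process, and print flow could look roughly like the sketch below. The record layout (a "name" plus a "specs" mapping whose entries carry "values" and "unit"), the function names, and the choice to sort by the first numeric value are all assumptions; the real analyzer.py may select and sort its columns differently.

```python
import json
from pathlib import Path


def load_products(path: str = "data/products.json") -> list[dict]:
    """Read the JSON file produced by the scraper stage."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def sorted_rows(products: list[dict], spec_name: str) -> list[tuple[str, float, str]]:
    """Return (name, value, unit) rows sorted by value, skipping missing specs."""
    rows = []
    for product in products:
        spec = product.get("specs", {}).get(spec_name)
        if not spec or not spec.get("values"):
            continue  # this product does not list the requested specification
        rows.append((product["name"], spec["values"][0], spec.get("unit") or ""))
    # Assumes every row uses the same unit; no unit conversion is attempted.
    return sorted(rows, key=lambda row: row[1])


def format_table(rows: list[tuple[str, float, str]], spec_name: str) -> str:
    """Render the rows as a fixed-width text table."""
    header = f"{'Product':<20} {spec_name:>12}"
    lines = [header, "-" * len(header)]
    for name, value, unit in rows:
        lines.append(f"{name:<20} {value:>12g} {unit}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up records in the assumed layout, used instead of the real file.
    demo = [
        {"name": "Sensor B", "specs": {"accuracy": {"values": [20.0], "unit": "nm"}}},
        {"name": "Sensor A", "specs": {"accuracy": {"values": [10.0], "unit": "nm"}}},
        {"name": "Sensor C", "specs": {}},
    ]
    print(format_table(sorted_rows(demo, "accuracy"), "accuracy"))
```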

Notes

The website contains spelling mistakes (e.g., wavelenght instead of wavelength) and specifications that are identical but named differently across products (e.g., accuracy vs. accuracy (absolute)). I chose not to correct or merge these, as doing so seemed outside the scope of this exercise.

About

Basic Python data extraction pipeline that collects information from a website and performs basic numeric analysis
