This project demonstrates the use of a multi-stage Docker build to scrape data from a specified URL using Node.js with Puppeteer and Chromium, and serve the scraped data via a simple Python Flask web server.
/scraper-flask-app
│
├── app.py              # Flask web server for serving scraped data
├── scraper.js          # Node.js script to scrape the provided URL
├── Dockerfile          # Multi-stage Dockerfile for building the image
├── scraped_data.json   # Output file containing the scraped data (generated by the scraper)
└── README.md           # Project documentation
Before you begin, ensure that you have the following installed:
- Docker (see the official installation guide)
- A Docker Hub account
The project consists of two main parts:
- Scraper (Node.js with Puppeteer): A Node.js script (`scraper.js`) that uses Puppeteer to scrape content from a specified URL and stores the output in a JSON file.
- Web Server (Flask): A simple Flask web server (`app.py`) that reads the scraped JSON data and serves it via an HTTP endpoint.
- The scraper script accepts a URL through an environment variable.
- It uses Puppeteer to load the page and scrape content (e.g., the title of the page).
- The scraped data is stored as a JSON file (`scraped_data.json`).
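As an illustration, the output file is a small JSON document. The exact fields depend on what `scraper.js` extracts, so the `url` and `title` keys below are assumptions, not the project's actual schema:

```python
import json

# Hypothetical shape of scraped_data.json; the actual keys depend on
# what scraper.js collects ("url" and "title" here are assumptions).
record = {
    "url": "http://example.com",
    "title": "Example Domain",
}

# Write the record the same way the scraper stage would.
with open("scraped_data.json", "w") as f:
    json.dump(record, f, indent=2)
```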
- The Flask web server reads the `scraped_data.json` file.
- It serves the data through an endpoint (`/scraped_data`) that returns the content as JSON when accessed.
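A minimal sketch of what `app.py` might look like, assuming Flask is installed and `scraped_data.json` sits next to the script (the actual file in this repo may differ):

```python
import json
from flask import Flask, jsonify

app = Flask(__name__)
DATA_FILE = "scraped_data.json"  # produced by the scraper stage


def load_scraped_data(path=DATA_FILE):
    """Read the JSON file written by the scraper."""
    with open(path) as f:
        return json.load(f)


@app.route("/scraped_data")
def scraped_data():
    # Re-read on each request so the endpoint reflects the file's contents.
    return jsonify(load_scraped_data())


if __name__ == "__main__":
    # Bind to 0.0.0.0 so the server is reachable from outside the container.
    app.run(host="0.0.0.0", port=5000)
```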
The Dockerfile includes two stages:
- Scraper Stage: Uses a Node.js image to install Puppeteer and Chromium, and then runs the scraper script.
- Server Stage: Uses a Python image with Flask to serve the scraped content.
Start by cloning this repository to your local machine:
git clone https://github.com/sanjaykadavarath/puppeteer-scraper-flask-app.git
cd puppeteer-scraper-flask-app

The first step in setting up the project is to build the Docker image. You will need to specify the URL you want to scrape via a build argument.

docker build --build-arg SCRAPE_URL=http://example.com -t scraper-flask-app .

- Replace `http://example.com` with the URL you want to scrape.
Once the image is built, run the container on your local machine or a server:
docker run -p 5000:5000 scraper-flask-app

- This command runs the locally built container and maps port `5000` on the host machine to port `5000` inside the container.
After the container starts, you can access the Flask web server by opening a browser and navigating to:
http://localhost:5000/scraped_data
If you're running it on a remote server, replace `localhost` with the server's IP address.
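As a quick sanity check from Python rather than a browser, something like the following fetches and decodes the endpoint; the base URL is an assumption and should match whatever host and port you mapped:

```python
import json
from urllib.request import urlopen


def fetch_scraped_data(base_url="http://localhost:5000"):
    """Fetch and decode the JSON served at /scraped_data."""
    with urlopen(f"{base_url}/scraped_data") as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires the container to be running):
# fetch_scraped_data()
```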
To push the Docker image to Docker Hub, follow these steps:
- Tag the image with your Docker Hub username and repository name:

  docker tag scraper-flask-app sanjaykadavarath/scraper-flask-app:latest

- Push the image to Docker Hub:

  docker push sanjaykadavarath/scraper-flask-app:latest
To run this project on another machine, follow these steps:
- Install Docker on the other machine.

- Log in to Docker Hub on the new machine:

  docker login

- Pull the image from Docker Hub:

  docker pull sanjaykadavarath/scraper-flask-app:latest

- Run the container:

  docker run -p 5000:5000 sanjaykadavarath/scraper-flask-app:latest

- Access the Flask server at `http://<machine-ip>:5000/scraped_data`.
FROM node:16 AS scraper

# Install system dependencies
RUN apt-get update && apt-get install -y wget ca-certificates --no-install-recommends && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Install Puppeteer (downloads a bundled Chromium; depending on the base image,
# Chromium may need additional shared libraries such as libnss3 at runtime)
RUN npm install puppeteer --save

# Copy the scraper script
COPY scraper.js .

# Set the environment variable for the URL to scrape
ARG SCRAPE_URL
ENV SCRAPE_URL=$SCRAPE_URL

# Run the scraper at build time so scraped_data.json is baked into the image
RUN node scraper.js

- This stage installs the necessary dependencies, installs Puppeteer, and runs the `scraper.js` script. Note that `WORKDIR /app` is set before `npm install` so that `node_modules` lives alongside `scraper.js`.
FROM python:3.9-slim AS server
# Install Flask
RUN pip install flask
# Set working directory
WORKDIR /app
# Copy the scraped data and Flask app
COPY --from=scraper /app/scraped_data.json .
COPY app.py .
# Expose the port
EXPOSE 5000
# Run Flask app
CMD ["python", "app.py"]

- This stage copies the `scraped_data.json` file from the first stage and sets up the Flask web server.
- Environment Variable: The scraper script uses the `SCRAPE_URL` environment variable to specify the URL to scrape. You must pass this as a build argument when building the Docker image.
- Dynamic Scraping: The scraper can be adapted to scrape different data by modifying the `scraper.js` script.
- Flask Web Server: The Flask app serves the scraped data as a JSON response at the `/scraped_data` endpoint.
This project is licensed under the MIT License - see the LICENSE file for details.