A robust, modular web-scraping solution built on Node.js, using Puppeteer for headless browser interaction with dynamic pages and Cheerio for efficient DOM parsing. Extracted data is automatically written to a CSV file.
- Headless Browser: Uses Puppeteer to handle dynamic content (AJAX, JavaScript rendering).
- Efficient Parsing: Leverages Cheerio for fast DOM manipulation post-load.
- Modular Code: Built as a reusable class (`AdvancedScraper`).
- CSV Export: Automatically saves results to `scraped_data.csv`.
- Node.js (v14 or higher)
- Clone the repository:

  ```bash
  git clone https://github.com/ewhx-dev/Advanced-Web-Scraper.git
  cd Advanced-Web-Scraper
  ```

- Install dependencies (Puppeteer, Cheerio, csv-writer):

  ```bash
  npm install
  ```
- Customize `scraper.js`:
  - Update the `TARGET_URL` constant with the URL of the website you wish to scrape.
  - Crucially, update the CSS selectors (e.g., `.product-item`, `.product-title`) within the `extractData()` method to match the HTML structure of your target website.
- Run the script:

  ```bash
  npm start
  # OR
  node scraper.js
  ```

  The extracted data will be saved to a file named `scraped_data.csv` in the project root.
This project is licensed under the ISC License. See `package.json` for details.
⚠️ Legal Notice: Always check the website's `robots.txt` file and its terms of service before scraping. Use this tool responsibly and ethically.