SiteGround IP-based captcha blocks scraper — need proxy/rotating IP support #192

@chubes4

Problem

Venues hosted on SiteGround are completely inaccessible from our server IP. SiteGround serves a JS-based captcha challenge (/.well-known/sgcaptcha/) on every URL — homepage, events page, RSS feeds, wp-json API, everything.

Case Study: The High Low (Los Angeles)

  • URL: https://thehighlowbar.com
  • Events: https://thehighlowbar.com/events (The Events Calendar / Tribe Events)
  • Flow ID: 721 (pipeline 26 - Los Angeles Events)
  • Hosting: SiteGround (captcha on all routes)
  • Platform: WordPress + The Events Calendar — would be trivially scrapable if we could reach it

The qualify step got through once (it is not clear why) and found JSON-LD event data. Every subsequent pipeline run returns completed_no_items because the captcha page served in place of the site contains no event HTML.
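To illustrate why a captcha page yields no items, here is a minimal sketch of JSON-LD event extraction. It is not the pipeline's actual extractor: the function name `extract_jsonld_events` is hypothetical, and it assumes the events are emitted as `Event` objects (single or in a top-level array) inside `application/ld+json` script tags. Nested `@graph` structures and event subtypes are deliberately ignored.

```php
<?php
/**
 * Hypothetical helper: collect schema.org Event objects from JSON-LD
 * <script> blocks in an HTML document. Returns an empty array when the
 * page has no such blocks, which is what a captcha page looks like.
 */
function extract_jsonld_events( string $html ): array {
    if ( trim( $html ) === '' ) {
        return array();
    }

    // Real-world HTML is rarely valid, so silence libxml parse warnings.
    $previous = libxml_use_internal_errors( true );
    $doc      = new DOMDocument();
    $doc->loadHTML( $html );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $events = array();
    $xpath  = new DOMXPath( $doc );

    foreach ( $xpath->query( '//script[@type="application/ld+json"]' ) as $node ) {
        $decoded = json_decode( $node->textContent, true );
        if ( ! is_array( $decoded ) ) {
            continue; // Not valid JSON, skip this block.
        }

        // Normalise a single object and a list of objects to one shape.
        $items = isset( $decoded['@type'] ) ? array( $decoded ) : $decoded;

        foreach ( $items as $item ) {
            if ( is_array( $item ) && ( $item['@type'] ?? null ) === 'Event' ) {
                $events[] = $item;
            }
        }
    }

    return $events;
}
```

Run against the captcha response, a function like this returns an empty array, which is consistent with the completed_no_items status seen on flow 721.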

Current scraper behavior

The Universal Web Scraper already detects sgcaptcha in the response:

// UniversalWebScraper.php lines 687-691
$is_captcha = isset( $result['data'] ) && (
    strpos( $result['data'], 'sgcaptcha' ) !== false ||
    strpos( $result['data'], 'cloudflare-challenge' ) !== false ||
    ...
);

It then retries with a fallback request (browser_mode: false), but both modes originate from the same server IP, so the captcha remains.
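The detection check above can be sketched as a standalone function for testing against saved responses. This is a minimal sketch, not the code in UniversalWebScraper.php: the name `response_is_captcha` is hypothetical, and only the two markers quoted above are included, since the remaining conditions are elided in the snippet.

```php
<?php
/**
 * Hypothetical helper: true when the response body contains a known
 * captcha marker. The default markers are the two quoted in the issue;
 * the real check has further conditions that are not shown there.
 */
function response_is_captcha( array $result, array $markers = array( 'sgcaptcha', 'cloudflare-challenge' ) ): bool {
    if ( ! isset( $result['data'] ) || ! is_string( $result['data'] ) ) {
        return false;
    }

    foreach ( $markers as $marker ) {
        if ( strpos( $result['data'], $marker ) !== false ) {
            return true;
        }
    }

    return false;
}
```

Because the check only inspects the body, it gives the same answer for the browser and non-browser modes whenever both receive the same challenge page.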

Proposed solutions (ranked)

1. Scraping API integration (recommended)

Add an optional proxy layer via ScrapingBee, ScraperAPI, or Bright Data. These services route requests through rotating residential IPs, which should get past captchas that are triggered by our server IP.

  • HttpClient gets a new use_proxy option
  • When captcha is detected + proxy is configured, retry through the scraping API
  • Cost: ~$5-25/mo for our volume (~600 daily scrapes, most don't need proxy)
  • Only route blocked sites through the proxy (not every request)
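The retry flow in the bullets above could look roughly like the following. This is a sketch under assumptions, not the HttpClient API: `fetch_with_proxy_fallback` is a hypothetical name, `$fetch` is a stand-in callable for the real client, and the `via_proxy` and `blocked` keys are made up for illustration. Only the `use_proxy` option name comes from the proposal.

```php
<?php
/**
 * Hypothetical sketch: request directly first, and retry through the
 * proxy only when a captcha is detected and a proxy is configured.
 *
 * $fetch is any callable( string $url, array $options ): array that
 * returns array( 'data' => string ); it stands in for HttpClient.
 */
function fetch_with_proxy_fallback( string $url, callable $fetch, bool $proxy_configured ): array {
    $is_captcha = static function ( array $result ): bool {
        return isset( $result['data'] ) && strpos( $result['data'], 'sgcaptcha' ) !== false;
    };

    // First attempt: direct request from the server IP (no proxy cost).
    $result              = $fetch( $url, array( 'use_proxy' => false ) );
    $result['via_proxy'] = false;
    $result['blocked']   = $is_captcha( $result );

    // Unblocked sites, or installs without a proxy, stop here.
    if ( ! $result['blocked'] || ! $proxy_configured ) {
        return $result;
    }

    // Second attempt: same URL, routed through the scraping API.
    $result              = $fetch( $url, array( 'use_proxy' => true ) );
    $result['via_proxy'] = true;
    $result['blocked']   = $is_captcha( $result );

    return $result;
}
```

Keeping the direct request first means the paid proxy is only used for sites that actually serve the captcha, which matches the cost assumption that most of the daily scrapes never need it.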

2. Flag and use alternative sources

For SiteGround-blocked WordPress sites, check if the events exist on:

  • Ticketmaster / Dice.fm (already scraped via aggregator flows)
  • Facebook Events page
  • Google Events listing

This avoids the captcha entirely by using a different source for the same data.

3. Headless browser with cookie handling

Use Playwright/Puppeteer to solve the JS challenge and maintain a session cookie. This is heavy infrastructure for a niche case.

Impact

SiteGround is a major WordPress host. As we scale, we'll hit more venues behind this wall. A proxy integration solves it for all of them at once.

Workaround until fixed

The High Low flow (721) is scheduled daily and will keep returning completed_no_items. It can be left running; once proxy support is added, it should start working without further changes.
