Problem
Venues hosted on SiteGround are completely inaccessible from our server IP. SiteGround serves a JS-based captcha challenge (/.well-known/sgcaptcha/) on every URL — homepage, events page, RSS feeds, wp-json API, everything.
Case Study: The High Low (Los Angeles)
- URL: https://thehighlowbar.com
- Events: https://thehighlowbar.com/events (The Events Calendar / Tribe Events)
- Flow ID: 721 (pipeline 26 - Los Angeles Events)
- Hosting: SiteGround (captcha on all routes)
- Platform: WordPress + The Events Calendar — would be trivially scrapable if we could reach it
The qualify step got through once, for reasons we have not identified, and found JSON-LD event data. Every subsequent pipeline run returns completed_no_items because the captcha page contains no event HTML.
Current scraper behavior
The Universal Web Scraper already detects sgcaptcha in the response:
// UniversalWebScraper.php line 687-691
$is_captcha = isset( $result['data'] ) && (
strpos( $result['data'], 'sgcaptcha' ) !== false ||
strpos( $result['data'], 'cloudflare-challenge' ) !== false ||
...
);
It tries a fallback request (browser_mode: false) but both modes come from the same server IP, so the captcha remains.
Proposed solutions (ranked)
1. Scraping API integration (recommended)
Add an optional proxy layer via ScrapingBee, ScraperAPI, or Bright Data. These services route requests through rotating residential IPs that bypass IP-based captchas.
- HttpClient gets a new use_proxy option
- When captcha is detected + proxy is configured, retry through the scraping API
- Cost: ~$5-25/mo for our volume (~600 daily scrapes, most don't need proxy)
- Only route blocked sites through the proxy (not every request)
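The retry flow in the bullets above could look roughly like the sketch below. It is not the plugin's actual HttpClient API: looks_like_captcha(), fetch_with_proxy_fallback(), and the two fetcher callables are hypothetical names, and only the two markers visible in the existing detection snippet are checked. A null proxy fetcher stands in for "no proxy configured".

```php
<?php
// Sketch only: every name here is hypothetical, not the plugin's real API.

/** Returns true when the body looks like a captcha challenge page. */
function looks_like_captcha( string $body ): bool {
	// Only the two markers shown in the existing detection code.
	foreach ( array( 'sgcaptcha', 'cloudflare-challenge' ) as $marker ) {
		if ( strpos( $body, $marker ) !== false ) {
			return true;
		}
	}
	return false;
}

/**
 * Fetch directly first; retry through the proxy only when blocked.
 *
 * @param callable      $direct_fetch function( string $url ): string
 * @param callable|null $proxy_fetch  Same shape, or null when no proxy is configured.
 */
function fetch_with_proxy_fallback( string $url, callable $direct_fetch, ?callable $proxy_fetch ): array {
	$body = $direct_fetch( $url );

	// Unblocked sites never touch the proxy, which keeps proxy cost low.
	if ( ! looks_like_captcha( $body ) ) {
		return array( 'data' => $body, 'via' => 'direct', 'blocked' => false );
	}

	// Blocked and no proxy configured: report the block to the caller.
	if ( null === $proxy_fetch ) {
		return array( 'data' => $body, 'via' => 'direct', 'blocked' => true );
	}

	// Blocked and proxy available: retry once through the scraping API.
	$proxied = $proxy_fetch( $url );
	return array(
		'data'    => $proxied,
		'via'     => 'proxy',
		'blocked' => looks_like_captcha( $proxied ),
	);
}
```

Passing the fetchers as callables keeps the sketch independent of any particular scraping API. A real integration would wrap the chosen provider's HTTP endpoint in the proxy callable.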
2. Flag and use alternative sources
For SiteGround-blocked WordPress sites, check if the events exist on:
- Ticketmaster / Dice.fm (already scraped via aggregator flows)
- Facebook Events page
- Google Events listing
This avoids the captcha entirely by using a different source for the same data.
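For the "flag" half of this option, one possibility is to give captcha-blocked runs their own status so they can be told apart from flows that are simply empty. The sketch below assumes a hypothetical classify_run() helper and hypothetical 'blocked_captcha' and 'completed' statuses. Only completed_no_items comes from the current pipeline.

```php
<?php
// Sketch only: classify_run() and the 'blocked_captcha' / 'completed'
// statuses are hypothetical; 'completed_no_items' is the existing status.

/** Maps a fetched body and extracted item count to a run status. */
function classify_run( string $body, int $item_count ): string {
	// A SiteGround challenge page means the venue was never actually reached.
	if ( strpos( $body, 'sgcaptcha' ) !== false ) {
		return 'blocked_captcha';
	}
	return $item_count > 0 ? 'completed' : 'completed_no_items';
}
```

Flows that report the blocked status could then be listed and checked against the aggregator sources by hand.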
3. Headless browser with cookie handling
Use Playwright/Puppeteer to solve the JS challenge and maintain a session cookie. Heavy infrastructure for a niche case.
Impact
SiteGround is a major WordPress host. As we scale, we'll hit more venues behind this wall. A proxy integration solves it for all of them at once.
Workaround until fixed
The High Low flow (721) is scheduled daily and will keep returning completed_no_items. It can be left running; once proxy support is added, it should start working without further changes.