SiteGround IP-based captcha blocks scraper — need proxy/rotating IP support #192

## Problem

Venues hosted on SiteGround are completely inaccessible from our server IP. SiteGround serves a JS-based captcha challenge (`/.well-known/sgcaptcha/`) on every URL — homepage, events page, RSS feeds, wp-json API, everything.

## Case Study: The High Low (Los Angeles)

- **URL**: https://thehighlowbar.com
- **Events**: https://thehighlowbar.com/events (The Events Calendar / Tribe Events)
- **Flow ID**: 721 (pipeline 26 - Los Angeles Events)
- **Hosting**: SiteGround (captcha on all routes)
- **Platform**: WordPress + The Events Calendar — would be trivially scrapable if we could reach it

The qualify step got through once (it is unclear why) and found JSON-LD event data. Every subsequent pipeline run returns `completed_no_items` because the captcha page contains no event HTML.
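To illustrate why a captcha page yields no items, here is a minimal sketch of JSON-LD event extraction. The function name `extract_jsonld_events` is hypothetical and this is not the pipeline's actual extractor. It uses a regular expression to keep the example dependency-free; a DOM parser would be sturdier in practice.

```php
<?php
/**
 * Collect schema.org Event objects from JSON-LD <script> blocks.
 * Hypothetical helper for illustration only.
 */
function extract_jsonld_events( string $html ): array {
    $pattern = '#<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>#is';
    if ( ! preg_match_all( $pattern, $html, $matches ) ) {
        return array(); // No JSON-LD blocks at all, e.g. a captcha page.
    }

    $events = array();
    foreach ( $matches[1] as $json ) {
        $data = json_decode( trim( $json ), true );
        if ( ! is_array( $data ) ) {
            continue; // Skip blocks that are not valid JSON.
        }
        // A block may hold a single object or a list of objects.
        $items = isset( $data[0] ) ? $data : array( $data );
        foreach ( $items as $item ) {
            if ( is_array( $item ) && ( $item['@type'] ?? null ) === 'Event' ) {
                $events[] = $item;
            }
        }
    }
    return $events;
}
```

A page that contains only the `/.well-known/sgcaptcha/` challenge has no `application/ld+json` block, so this returns an empty array, which matches the `completed_no_items` outcome described above.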

## Current scraper behavior

The Universal Web Scraper already detects `sgcaptcha` in the response:

```php
// UniversalWebScraper.php, lines 687-691
$is_captcha = isset( $result['data'] ) && (
    strpos( $result['data'], 'sgcaptcha' ) !== false ||
    strpos( $result['data'], 'cloudflare-challenge' ) !== false ||
    ...
);
```

When a captcha is detected, it retries with `browser_mode: false`. Both requests originate from the same server IP, so SiteGround serves the captcha again.
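If it helps with flagging blocked flows, the same check could also report which marker matched. The sketch below is an assumption, not existing code: `detect_challenge_marker` is a hypothetical name, and it lists only the two markers visible in the excerpt above, since the remaining conditions are elided in this issue.

```php
<?php
/**
 * Return the first challenge marker found in a response body, or null.
 * Hypothetical helper; only the two markers shown in the excerpt are listed.
 */
function detect_challenge_marker( string $body ): ?string {
    $markers = array( 'sgcaptcha', 'cloudflare-challenge' );
    foreach ( $markers as $marker ) {
        if ( strpos( $body, $marker ) !== false ) {
            return $marker;
        }
    }
    return null; // No known challenge marker in the body.
}
```

Returning the marker instead of a boolean would let logs distinguish SiteGround blocks from other challenges.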

## Proposed solutions (ranked)

### 1. Scraping API integration (recommended)
Add an optional proxy layer via ScrapingBee, ScraperAPI, or Bright Data. These services route requests through rotating residential IPs that bypass IP-based captchas.

- HttpClient gets a new `use_proxy` option
- When captcha is detected + proxy is configured, retry through the scraping API
- Cost: ~$5-25/mo for our volume (~600 daily scrapes, most of which don't need the proxy)
- Only route blocked sites through the proxy (not every request)
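The retry flow in the list above could look roughly like the sketch below. Everything here is an assumption for discussion: the function names, the option keys (`use_proxy`, `proxy_endpoint`, `proxy_api_key`), and the `api_key`/`url` query parameters are placeholders, not the API of any specific vendor or of our HttpClient. The fetcher is passed in as a callable so the logic can be exercised without network access.

```php
<?php
/**
 * Build the request URL for a scraping API that takes the target as a
 * query parameter. Endpoint and parameter names are placeholders.
 */
function build_proxy_url( string $endpoint, string $api_key, string $target_url ): string {
    return $endpoint . '?' . http_build_query(
        array(
            'api_key' => $api_key,
            'url'     => $target_url,
        )
    );
}

/**
 * Fetch directly, then retry through the proxy only when a captcha is
 * detected and a proxy is configured.
 *
 * @param callable $fetch   fn( string $url ): string, returns the response body.
 * @param array    $options Hypothetical keys: use_proxy, proxy_endpoint, proxy_api_key.
 * @return array Keys: body (string), via_proxy (bool).
 */
function fetch_with_proxy_fallback( callable $fetch, string $url, array $options = array() ): array {
    $body = $fetch( $url );

    // Same two markers as the existing detection excerpt.
    $is_captcha = strpos( $body, 'sgcaptcha' ) !== false
        || strpos( $body, 'cloudflare-challenge' ) !== false;

    $proxy_ready = ! empty( $options['use_proxy'] )
        && ! empty( $options['proxy_endpoint'] )
        && ! empty( $options['proxy_api_key'] );

    if ( ! $is_captcha || ! $proxy_ready ) {
        return array( 'body' => $body, 'via_proxy' => false );
    }

    $proxy_url = build_proxy_url( $options['proxy_endpoint'], $options['proxy_api_key'], $url );
    return array( 'body' => $fetch( $proxy_url ), 'via_proxy' => true );
}
```

Because the proxy is only used after a direct request has already hit a captcha, unblocked sites never consume proxy credits, which is what keeps the cost estimate low.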

### 2. Flag and use alternative sources
For SiteGround-blocked WordPress sites, check if the events exist on:
- Ticketmaster / Dice.fm (already scraped via aggregator flows)
- Facebook Events page
- Google Events listing

This avoids the captcha entirely by using a different source for the same data.

### 3. Headless browser with cookie handling
Use Playwright/Puppeteer to solve the JS challenge and maintain a session cookie. Heavy infrastructure for a niche case.

## Impact

SiteGround is a major WordPress host. As we scale, we'll hit more venues behind this wall. A proxy integration solves it for all of them at once.

## Workaround until fixed

The High Low flow (721) is scheduled daily and will keep returning `completed_no_items`. It can be left running; once proxy support is added, it should start working without further changes.
