High-performance web content extraction engine built in Rust. Its primary purpose is to serve as an external web loader for OpenWebUI, but it is flexible enough for any use case that needs clean content extraction from web pages - RAG pipelines, content indexing, web scraping, archiving, and more.
- OpenWebUI Compatible - Native API support, drop-in replacement
- Multiple Output Formats - Markdown, HTML, plain text, screenshots
- Readability Extraction - Mozilla Readability algorithm for clean article content
- JavaScript Rendering - Chromium-based rendering for JS-heavy sites
- Smart Caching - Built-in response caching with configurable TTL
- Rate Limiting - Per-domain rate limiting and circuit breakers
- Batch Processing - Process multiple URLs concurrently
- Security - SSRF protection, blocked internal IPs, optional API key auth
- Egress Proxy Support - Honors `HTTPS_PROXY`/`HTTP_PROXY`/`NO_PROXY` for both HTTP client and Chromium browser traffic
```bash
docker build -t web-loader-engine .
docker run -d -p 14786:14786 --name web-loader web-loader-engine
```

```yaml
services:
  web-loader:
    build: .
    ports:
      - "14786:14786"
    environment:
      - BROWSER_POOL_SIZE=10
      - CACHE_TTL=3600
      # - API_KEY=your-secret-key
    volumes:
      - screenshots:/app/screenshots
    restart: unless-stopped

volumes:
  screenshots:
```

```bash
docker-compose up -d
```

Or use the prebuilt image:

```yaml
services:
  web-loader:
    image: edgaras0x4e/web-loader-engine:latest
    ports:
      - "14786:14786"
    environment:
      - BROWSER_POOL_SIZE=10
      - CACHE_TTL=3600
      # - API_KEY=your-secret-key
    volumes:
      - screenshots:/app/screenshots
    restart: unless-stopped

volumes:
  screenshots:
```

```bash
docker-compose up -d
```

Then set OpenWebUI's web loader URL to `http://web-loader:14786`.
Requires Rust 1.70+ and Chrome/Chromium installed.

```bash
cp .env.example .env  # Configure settings
cargo build --release
./target/release/web-loader-engine
```

Copy the example environment file and adjust as needed:

```bash
cp .env.example .env
```

Environment variables:
| Variable | Default | Description |
|---|---|---|
| `API_PORT` | `14786` | Server port |
| `API_KEY` | - | Optional API key for authentication |
| `CHROME_PATH` | `/usr/bin/chromium` | Path to Chrome/Chromium binary |
| `BROWSER_POOL_SIZE` | `10` | Concurrent browser pages |
| `REQUEST_TIMEOUT` | `30` | Default timeout in seconds |
| `CACHE_TTL` | `3600` | Cache lifetime in seconds |
| `SCREENSHOT_DIR` | `/app/screenshots` | Screenshot storage path |
| `BROWSER_LOG_LEVEL` | `error` | Log level for the headless browser driver (chromiumoxide). Silences noisy CDP deserialization warnings by default. Accepts `off`, `error`, `warn`, `info`, `debug`, `trace` |
| `DEFAULT_USER_AGENT` | Chrome 120 on Windows | User agent used when no override is provided and rotation is disabled |
| `USER_AGENT_ROTATION` | `off` | Rotation strategy: `off`, `round_robin`, `random` |
| `USER_AGENT_POOL` | - | Inline pool of UAs separated by `\|` or newlines |
| `USER_AGENT_POOL_FILE` | - | Path to a file with one UA per line (lines starting with `#` are comments). Takes precedence over `USER_AGENT_POOL` |
| `HTTPS_PROXY` / `HTTP_PROXY` | - | Egress proxy URL (e.g. `http://proxy:3128`). When set, routes both HTTP client and Chromium traffic through the proxy |
| `NO_PROXY` | - | Comma-separated list of hosts/domains to bypass the proxy (e.g. `localhost,127.0.0.1,*.internal.example.com`) |
`POST /`

```json
{"urls": ["https://example.com/article"]}
```

Returns:

```json
[
  {
    "page_content": "# Article Title\n\nContent...",
    "metadata": {
      "source": "https://example.com/article",
      "title": "Article Title"
    }
  }
]
```

`POST /load`

```json
{"url": "https://example.com"}
```

Response:

```json
{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "# Example Domain\n\nThis domain is for examples...",
  "metadata": {
    "processing_time_ms": 1234,
    "cached": false
  }
}
```

`POST /load/batch`

```json
{"urls": ["https://example.com/1", "https://example.com/2"]}
```

Response:

```json
{
  "results": [
    {
      "url": "https://example.com/1",
      "response": {
        "url": "https://example.com/1",
        "title": "Page Title",
        "content": "...",
        "metadata": {"processing_time_ms": 500, "cached": false}
      }
    }
  ],
  "total_processing_time_ms": 1234
}
```

`GET /health`

| Header | Values | Description |
|---|---|---|
| `x-respond-with` | `markdown`, `html`, `text`, `screenshot`, `pageshot` | Output format |
| `x-wait-for-selector` | CSS selector | Wait for element before extraction |
| `x-target-selector` | CSS selector | Extract only matching content |
| `x-remove-selector` | CSS selector | Remove elements before extraction |
| `x-timeout` | seconds | Request timeout |
| `x-set-cookie` | `name=value` | Set cookies |
| `x-no-cache` | `true` | Bypass cache |
| `x-with-images-summary` | `true` | Include images list |
| `x-with-links-summary` | `true` | Include links list |
| `x-user-agent` | UA string, `rotate`, `default` | Override the user agent for this request. `rotate` forces rotation from the pool even when `USER_AGENT_ROTATION=off`; `default` forces the configured default |
| `Authorization` | `Bearer <key>` | API key (if configured) |
```json
{
  "url": "https://example.com",
  "options": {
    "wait_for_selector": "#content",
    "target_selector": "article",
    "remove_selector": ".ads",
    "timeout": 60
  }
}
```

Set `x-respond-with` to either `screenshot` (viewport only) or `pageshot` (full scrolling page). The API renders the page in headless Chromium, saves the PNG to `SCREENSHOT_DIR`, and returns a relative URL you can fetch from the same server.
```bash
curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-respond-with: screenshot" \
  -d '{"url": "https://example.com"}'
```

Response:

```json
{
  "url": "https://example.com",
  "title": null,
  "content": "",
  "screenshot_url": "/screenshots/httpsexamplecom_441d3714-d010-4eb4-a729-606873b081d9.png",
  "metadata": {"processing_time_ms": 1064, "cached": false}
}
```

Fetch the PNG:

```bash
curl -o page.png \
  http://localhost:14786/screenshots/httpsexamplecom_441d3714-d010-4eb4-a729-606873b081d9.png
```

Full-page capture:

```bash
curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-respond-with: pageshot" \
  -d '{"url": "https://example.com"}'
```

Combine with `x-wait-for-selector` so the screenshot is only taken once a specific element has rendered:

```bash
curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-respond-with: screenshot" \
  -H "x-wait-for-selector: article h1" \
  -d '{"url": "https://example.com/post/123"}'
```

With API key authentication:

```bash
curl -X POST http://localhost:14786/load \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -H "x-respond-with: screenshot" \
  -d '{"url": "https://example.com"}'

curl -H "Authorization: Bearer your-secret-key" \
  -o page.png \
  http://localhost:14786/screenshots/httpsexamplecom_441d3714-d010-4eb4-a729-606873b081d9.png
```

- Files are written to `SCREENSHOT_DIR` (default `/app/screenshots` in Docker, configurable via env)
- Filenames include a UUID to avoid collisions
- When using Docker, mount a volume at `/app/screenshots` to persist captures across container restarts
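Judging by the example URLs above, a filename appears to combine a sanitized form of the URL with a random UUID; a minimal Python sketch of that idea (the function name is mine, and the engine's actual Rust implementation may differ in detail):

```python
import re
import uuid

def screenshot_filename(url: str) -> str:
    """Sketch: strip non-alphanumeric characters from the URL and
    append a random UUID so concurrent captures never collide."""
    slug = re.sub(r"[^a-zA-Z0-9]", "", url)
    return f"{slug}_{uuid.uuid4()}.png"

name = screenshot_filename("https://example.com")
# e.g. "httpsexamplecom_441d3714-d010-4eb4-a729-606873b081d9.png"
```

Because every call draws a fresh UUID, repeated captures of the same URL never overwrite each other.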
Three ways to control which User-Agent is sent with a request:

- Configured default - set `DEFAULT_USER_AGENT` in the environment. Used when rotation is off and no header is provided.
- Rotation pool - set `USER_AGENT_ROTATION=round_robin` or `random`, plus `USER_AGENT_POOL` (or `USER_AGENT_POOL_FILE`). The server picks a different UA per request.
- Per-request override - send `x-user-agent` on the individual call.

Resolution order per request: explicit header > rotation (if enabled) > configured default.
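That resolution order can be sketched as follows (an illustrative Python model; class and method names are mine, not the engine's Rust API):

```python
import itertools
import random

class UserAgentPicker:
    """Sketch of the per-request resolution order:
    explicit header > rotation (if enabled) > configured default."""

    def __init__(self, rotation, pool, default):
        self.rotation = rotation              # "off" | "round_robin" | "random"
        self.pool = pool or []
        self.default = default
        self._cycle = itertools.cycle(self.pool) if self.pool else None

    def pick(self, header=None):
        if header == "default":
            return self.default               # forced configured default
        if header == "rotate" or (header is None and self.rotation != "off"):
            if not self.pool:
                return self.default           # empty pool falls back safely
            if self.rotation == "random":
                return random.choice(self.pool)
            return next(self._cycle)          # round_robin
        if header is not None:
            return header                     # explicit UA string wins
        return self.default

picker = UserAgentPicker("round_robin", ["UA-1", "UA-2"], "Default-UA")
picker.pick("MyBot/1.0")   # → "MyBot/1.0" (explicit header wins)
picker.pick()              # → "UA-1"
picker.pick()              # → "UA-2"
```

Note that `rotate` still draws from the pool even when rotation is `off`, matching the header semantics described above.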
Separate UAs with `|` or newlines:

```bash
USER_AGENT_ROTATION=round_robin
USER_AGENT_POOL="Mozilla/5.0 ...Chrome/120...|Mozilla/5.0 ...Firefox/121..."
```

One UA per line; `#` lines are comments. `USER_AGENT_POOL_FILE` takes precedence over `USER_AGENT_POOL` if both are set.

```bash
USER_AGENT_ROTATION=random
USER_AGENT_POOL_FILE=/etc/web-loader/user-agents.txt
```

Sample `user-agents.txt`:

```
# Desktop Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
# Desktop Firefox
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0
```
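A file in this format can be parsed with logic like the following sketch (assumed behavior based on the description above: blank lines and `#`-prefixed lines are skipped; the function name is illustrative):

```python
def load_ua_pool(text: str) -> list[str]:
    """Parse a user-agents file: one UA per line,
    '#' lines are comments, blank lines are ignored."""
    pool = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            pool.append(line)
    return pool

sample = """\
# Desktop Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/120.0.0.0 Safari/537.36

# Desktop Firefox
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0
"""
load_ua_pool(sample)  # → 2 entries; comments and blank lines dropped
```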
Override with a custom UA:

```bash
curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-user-agent: MyBot/1.0 (+https://example.com/bot)" \
  -d '{"url": "https://httpbin.org/user-agent"}'
```

Force rotation from the pool:

```bash
curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-user-agent: rotate" \
  -d '{"url": "https://httpbin.org/user-agent"}'
```

Force the configured default:

```bash
curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-user-agent: default" \
  -d '{"url": "https://httpbin.org/user-agent"}'
```

While built for OpenWebUI, this works for:
- RAG Pipelines - Clean content for embeddings and retrieval
- Content Archiving - Save readable versions of web pages
- Web Scraping - Extract data from JavaScript-rendered pages
- Screenshot Services - Programmatic page captures
- Search Indexing - Extract text content for indexing
User Agent Rotation & Browser Log Control - Added configurable user agents and a way to silence noisy browser-driver warnings.

- New `USER_AGENT_ROTATION` env var with strategies `off` (default), `round_robin`, `random` - rotates per request
- Provide the pool via `USER_AGENT_POOL` (inline, `|`- or newline-separated) or `USER_AGENT_POOL_FILE` (path to a file, one UA per line, `#` comments supported). The file takes precedence when both are set
- `DEFAULT_USER_AGENT` overrides the hardcoded default used when rotation is off and no header is set
- `x-user-agent` header now accepts special values: `rotate` forces rotation even when `USER_AGENT_ROTATION=off`, and `default` forces the configured default
- Precedence: explicit header → rotation (if enabled) → configured default. Empty pool safely falls back to the default with a warning at startup
- New `BROWSER_LOG_LEVEL` env var (default `error`) silences chromiumoxide's noisy `WS Invalid message` warnings emitted when Chromium sends CDP events the driver doesn't yet model. Accepts `off`, `error`, `warn`, `info`, `debug`, `trace` - operates independently of `RUST_LOG`
Chromium Egress Proxy Support - Chromium now honors `HTTPS_PROXY`/`HTTP_PROXY`/`NO_PROXY` from the environment so the browser's outbound traffic can be routed through an egress proxy.

- On launch, if `HTTPS_PROXY` (or `HTTP_PROXY` as fallback) is set, Chromium is started with `--proxy-server=<url>`
- If `NO_PROXY` is set, its value is translated to Chrome's bypass-list syntax and passed via `--proxy-bypass-list=<list>` (commas → semicolons, `*.domain` → `.domain`)
- When no proxy env vars are set, behavior is unchanged - dev/local runs need no configuration
- The Rust HTTP client (reqwest) already honors these vars natively, so direct HTTP fetches and browser fetches now share the same egress path
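The `NO_PROXY` translation described above (commas → semicolons, `*.domain` → `.domain`) can be sketched like this (a Python illustration of the mapping, not the engine's actual Rust code):

```python
def no_proxy_to_bypass_list(no_proxy: str) -> str:
    """Translate NO_PROXY (comma-separated) into Chrome's
    --proxy-bypass-list syntax: semicolon-separated, with
    '*.domain' wildcards rewritten as '.domain' suffix rules."""
    entries = []
    for host in no_proxy.split(","):
        host = host.strip()
        if not host:
            continue
        if host.startswith("*."):
            host = host[1:]  # "*.internal.example.com" -> ".internal.example.com"
        entries.append(host)
    return ";".join(entries)

no_proxy_to_bypass_list("localhost,127.0.0.1,*.internal.example.com")
# → "localhost;127.0.0.1;.internal.example.com"
```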
Screenshot Delivery Fix - Screenshot URLs returned by the API are now actually reachable.

- Fixed issue where `/load` responses advertised a `screenshot_url` that returned 404 when fetched
- Saved screenshots are now served directly from the configured `SCREENSHOT_DIR`
- Safe by design: path-traversal attempts (e.g. `/screenshots/../etc/passwd`) return 404
- Respects the same API key authentication as the rest of the API when one is configured
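A path-traversal guard of this kind typically resolves the requested path and verifies it stays inside the screenshot directory; a minimal Python sketch of the general technique (names are illustrative, not the engine's code):

```python
import os

def is_safe_screenshot_path(screenshot_dir: str, requested: str) -> bool:
    """Reject any request that resolves outside the screenshot dir."""
    base = os.path.realpath(screenshot_dir)
    target = os.path.realpath(os.path.join(base, requested))
    # A safe target is the base itself or a path strictly under it.
    return target == base or target.startswith(base + os.sep)

is_safe_screenshot_path("/app/screenshots", "page.png")       # → True
is_safe_screenshot_path("/app/screenshots", "../etc/passwd")  # → False
```

Resolving with `realpath` before comparing also neutralizes `..` segments hidden mid-path, such as `sub/../../etc/passwd`.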
Browser Pool Resilience - Fixed critical issue where dead browser connections would cause requests to hang indefinitely.

- Added automatic browser health detection with 5-second timeout on page creation
- Implemented connection error detection for `Ws(AlreadyClosed)` and related WebSocket errors
- Auto-recovery: dead browsers are now automatically recreated on connection failure
- Request-level retry logic (up to 3 retries) for transient connection errors
- Health endpoint now exposes `healthy` status and `recreation_count` for monitoring
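The retry behavior amounts to re-attempting an operation a bounded number of times when a transient connection error is detected; a simplified Python sketch (the real implementation is in Rust and matches specific WebSocket error variants):

```python
class TransientConnectionError(Exception):
    """Stand-in for errors like Ws(AlreadyClosed)."""

def with_retries(operation, max_retries=3):
    """Run `operation`, retrying up to `max_retries` times on
    transient connection errors; re-raise anything else."""
    attempts = 0
    while True:
        try:
            return operation()
        except TransientConnectionError:
            attempts += 1
            if attempts > max_retries:
                raise
            # In the real engine, a dead browser would be
            # recreated here before the next attempt.

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientConnectionError()
    return "ok"

with_retries(flaky)  # → "ok" after two failed attempts
```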
Health response now includes:

```json
{
  "status": "ok",
  "version": "0.1.4",
  "browser_pool": {
    "available": 10,
    "total": 10,
    "healthy": true,
    "recreation_count": 1
  }
}
```

Monitor `recreation_count` increasing to track browser recovery events.
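A monitoring check built on this response could compare successive `recreation_count` values and flag when the pool reports unhealthy; a small Python sketch (function name and return shape are illustrative):

```python
import json

def check_health(payload: str, last_count: int):
    """Parse a /health response; return (healthy, new_count,
    recoveries_since_last_check) for alerting or dashboards."""
    data = json.loads(payload)
    pool = data["browser_pool"]
    count = pool["recreation_count"]
    return pool["healthy"], count, count - last_count

sample = '''{
  "status": "ok",
  "version": "0.1.4",
  "browser_pool": {"available": 10, "total": 10,
                   "healthy": true, "recreation_count": 1}
}'''
check_health(sample, 0)  # → (True, 1, 1)
```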
MIT