Web Loader Engine

High-performance web content extraction engine built in Rust. Its primary purpose is to serve as an external web loader for OpenWebUI, but it is flexible enough for any use case that needs clean content extraction from web pages: RAG pipelines, content indexing, web scraping, archiving, and more.

Features

  • OpenWebUI Compatible - Native API support, drop-in replacement
  • Multiple Output Formats - Markdown, HTML, plain text, screenshots
  • Readability Extraction - Mozilla Readability algorithm for clean article content
  • JavaScript Rendering - Chromium-based rendering for JS-heavy sites
  • Smart Caching - Built-in response caching with configurable TTL
  • Rate Limiting - Per-domain rate limiting and circuit breakers
  • Batch Processing - Process multiple URLs concurrently
  • Security - SSRF protection, blocked internal IPs, optional API key auth
  • Egress Proxy Support - Honors HTTPS_PROXY/HTTP_PROXY/NO_PROXY for both HTTP client and Chromium browser traffic

Quick Start

Docker (Recommended)

docker build -t web-loader-engine .
docker run -d -p 14786:14786 --name web-loader web-loader-engine

Docker Compose

services:
  web-loader:
    build: .
    ports:
      - "14786:14786"
    environment:
      - BROWSER_POOL_SIZE=10
      - CACHE_TTL=3600
      # - API_KEY=your-secret-key
    volumes:
      - screenshots:/app/screenshots
    restart: unless-stopped

volumes:
  screenshots:

docker-compose up -d

Docker Hub (Pre-built Image)

services:
  web-loader:
    image: edgaras0x4e/web-loader-engine:latest
    ports:
      - "14786:14786"
    environment:
      - BROWSER_POOL_SIZE=10
      - CACHE_TTL=3600
      # - API_KEY=your-secret-key
    volumes:
      - screenshots:/app/screenshots
    restart: unless-stopped

volumes:
  screenshots:

docker-compose up -d

Then set OpenWebUI's web loader URL to http://web-loader:14786

From Source

Requires Rust 1.70+ and Chrome/Chromium installed.

cp .env.example .env  # Configure settings
cargo build --release
./target/release/web-loader-engine

Configuration

Copy the example environment file and adjust as needed:

cp .env.example .env

Environment variables:

| Variable | Default | Description |
|---|---|---|
| API_PORT | 14786 | Server port |
| API_KEY | - | Optional API key for authentication |
| CHROME_PATH | /usr/bin/chromium | Path to Chrome/Chromium binary |
| BROWSER_POOL_SIZE | 10 | Concurrent browser pages |
| REQUEST_TIMEOUT | 30 | Default timeout in seconds |
| CACHE_TTL | 3600 | Cache lifetime in seconds |
| SCREENSHOT_DIR | /app/screenshots | Screenshot storage path |
| BROWSER_LOG_LEVEL | error | Log level for the headless browser driver (chromiumoxide). Silences noisy CDP deserialization warnings by default. Accepts off, error, warn, info, debug, trace |
| DEFAULT_USER_AGENT | Chrome 120 on Windows | User agent used when no override is provided and rotation is disabled |
| USER_AGENT_ROTATION | off | Rotation strategy: off, round_robin, random |
| USER_AGENT_POOL | - | Inline pool of UAs separated by \| or newlines |
| USER_AGENT_POOL_FILE | - | Path to a file with one UA per line (lines starting with # are comments). Takes precedence over USER_AGENT_POOL |
| HTTPS_PROXY / HTTP_PROXY | - | Egress proxy URL (e.g. http://proxy:3128). When set, routes both HTTP client and Chromium traffic through the proxy |
| NO_PROXY | - | Comma-separated list of hosts/domains to bypass the proxy (e.g. localhost,127.0.0.1,*.internal.example.com) |
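
For reference, a minimal .env combining the variables above. Values are illustrative, not required settings; the explicit DEFAULT_USER_AGENT string is just an example of a Chrome 120 UA:

```shell
# Server
API_PORT=14786
# API_KEY=your-secret-key

# Browser
CHROME_PATH=/usr/bin/chromium
BROWSER_POOL_SIZE=10
BROWSER_LOG_LEVEL=error

# Timeouts and caching
REQUEST_TIMEOUT=30
CACHE_TTL=3600
SCREENSHOT_DIR=/app/screenshots

# User agents (rotation off by default)
DEFAULT_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
```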

API

OpenWebUI Endpoint

POST /
{"urls": ["https://example.com/article"]}

Returns:

[
  {
    "page_content": "# Article Title\n\nContent...",
    "metadata": {
      "source": "https://example.com/article",
      "title": "Article Title"
    }
  }
]

Single URL

POST /load
{"url": "https://example.com"}

Response:

{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "# Example Domain\n\nThis domain is for examples...",
  "metadata": {
    "processing_time_ms": 1234,
    "cached": false
  }
}

Batch

POST /load/batch
{"urls": ["https://example.com/1", "https://example.com/2"]}

Response:

{
  "results": [
    {
      "url": "https://example.com/1",
      "response": {
        "url": "https://example.com/1",
        "title": "Page Title",
        "content": "...",
        "metadata": {"processing_time_ms": 500, "cached": false}
      }
    }
  ],
  "total_processing_time_ms": 1234
}
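
The nesting (results, then a per-URL response object) is easy to unpack programmatically. A small sketch using the sample payload shown above:

```python
import json

# The sample /load/batch response shown above
raw = """
{
  "results": [
    {
      "url": "https://example.com/1",
      "response": {
        "url": "https://example.com/1",
        "title": "Page Title",
        "content": "...",
        "metadata": {"processing_time_ms": 500, "cached": false}
      }
    }
  ],
  "total_processing_time_ms": 1234
}
"""

batch = json.loads(raw)

# Map each requested URL to its extracted content
contents = {item["url"]: item["response"]["content"] for item in batch["results"]}
```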

Health Check

GET /health

Request Headers

| Header | Values | Description |
|---|---|---|
| x-respond-with | markdown, html, text, screenshot, pageshot | Output format |
| x-wait-for-selector | CSS selector | Wait for element before extraction |
| x-target-selector | CSS selector | Extract only matching content |
| x-remove-selector | CSS selector | Remove elements before extraction |
| x-timeout | seconds | Request timeout |
| x-set-cookie | name=value | Set cookies |
| x-no-cache | true | Bypass cache |
| x-with-images-summary | true | Include images list |
| x-with-links-summary | true | Include links list |
| x-user-agent | UA string, rotate, default | Override the user agent for this request. rotate forces rotation from the pool even when USER_AGENT_ROTATION=off; default forces the configured default |
| Authorization | Bearer <key> | API key (if configured) |

Request Body Options (all optional)

{
  "url": "https://example.com",
  "options": {
    "wait_for_selector": "#content",
    "target_selector": "article",
    "remove_selector": ".ads",
    "timeout": 60
  }
}
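
Headers and body options can be combined in one request. A Python sketch that builds (but does not send) such a request with the standard library, mirroring the options above:

```python
import json
import urllib.request

payload = {
    "url": "https://example.com",
    "options": {
        "wait_for_selector": "#content",
        "target_selector": "article",
        "remove_selector": ".ads",
        "timeout": 60,
    },
}

req = urllib.request.Request(
    "http://localhost:14786/load",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "x-respond-with": "markdown",  # output format, see the header table
        # "Authorization": "Bearer your-secret-key",  # only if API_KEY is set
    },
    method="POST",
)

# To actually send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```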

Screenshots

Set x-respond-with to either screenshot (viewport only) or pageshot (full scrolling page). The API renders the page in headless Chromium, saves the PNG to SCREENSHOT_DIR, and returns a relative URL you can fetch from the same server.

Viewport screenshot

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-respond-with: screenshot" \
  -d '{"url": "https://example.com"}'

Response:

{
  "url": "https://example.com",
  "title": null,
  "content": "",
  "screenshot_url": "/screenshots/httpsexamplecom_441d3714-d010-4eb4-a729-606873b081d9.png",
  "metadata": {"processing_time_ms": 1064, "cached": false}
}

Fetch the PNG:

curl -o page.png \
  http://localhost:14786/screenshots/httpsexamplecom_441d3714-d010-4eb4-a729-606873b081d9.png

Full-page screenshot

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-respond-with: pageshot" \
  -d '{"url": "https://example.com"}'

Wait for content before capturing

Combine with x-wait-for-selector so the screenshot is only taken once a specific element has rendered:

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-respond-with: screenshot" \
  -H "x-wait-for-selector: article h1" \
  -d '{"url": "https://example.com/post/123"}'

With API key

curl -X POST http://localhost:14786/load \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -H "x-respond-with: screenshot" \
  -d '{"url": "https://example.com"}'

curl -H "Authorization: Bearer your-secret-key" \
  -o page.png \
  http://localhost:14786/screenshots/httpsexamplecom_441d3714-d010-4eb4-a729-606873b081d9.png

Storage

  • Files are written to SCREENSHOT_DIR (default /app/screenshots in Docker, configurable via env)
  • Filenames include a UUID to avoid collisions
  • When using Docker, mount a volume at /app/screenshots to persist captures across container restarts

User Agents

Three ways to control which User-Agent is sent with a request:

  1. Configured default - set DEFAULT_USER_AGENT in the environment. Used when rotation is off and no header is provided.
  2. Rotation pool - set USER_AGENT_ROTATION=round_robin or random plus USER_AGENT_POOL (or USER_AGENT_POOL_FILE). The server picks a different UA per request.
  3. Per-request override - send x-user-agent on the individual call.

Resolution order per request: explicit header > rotation (if enabled) > configured default.
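
As a sketch, that resolution order could be implemented like so (illustrative Python only; the actual logic lives in the Rust server, and names here are hypothetical):

```python
import itertools
import random

def make_resolver(default_ua, pool=None, rotation="off"):
    """Resolve a per-request User-Agent: explicit header > rotation > default.

    rotation mirrors USER_AGENT_ROTATION ("off", "round_robin", "random");
    an empty pool falls back to the default.
    """
    pool = list(pool or [])
    cycler = itertools.cycle(pool) if pool else None

    def resolve(header=None):
        if header == "default":          # force the configured default
            return default_ua
        if header == "rotate":           # force rotation even when off
            return next(cycler) if pool else default_ua
        if header:                       # explicit UA string wins
            return header
        if pool and rotation == "round_robin":
            return next(cycler)
        if pool and rotation == "random":
            return random.choice(pool)
        return default_ua

    return resolve

resolve = make_resolver("DefaultUA/1.0", pool=["A/1.0", "B/1.0"], rotation="off")
```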

Pool from an inline list

Separate UAs with | or newlines:

USER_AGENT_ROTATION=round_robin
USER_AGENT_POOL="Mozilla/5.0 ...Chrome/120...|Mozilla/5.0 ...Firefox/121..."

Pool from a file

One UA per line; lines starting with # are comments. Takes precedence over USER_AGENT_POOL if both are set.

USER_AGENT_ROTATION=random
USER_AGENT_POOL_FILE=/etc/web-loader/user-agents.txt

Sample user-agents.txt:

# Desktop Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

# Desktop Firefox
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0
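
The parsing rules are simple enough to sketch (illustrative Python, not the server's actual loader):

```python
def parse_ua_pool(text):
    """One UA per line; blank lines and lines starting with # are skipped."""
    pool = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            pool.append(line)
    return pool

# The sample user-agents.txt from above
sample = """\
# Desktop Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

# Desktop Firefox
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0
"""
```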

Explicit UA per request

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-user-agent: MyBot/1.0 (+https://example.com/bot)" \
  -d '{"url": "https://httpbin.org/user-agent"}'

Force rotation on a single request (even when rotation is off globally)

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-user-agent: rotate" \
  -d '{"url": "https://httpbin.org/user-agent"}'

Force the configured default (bypass rotation for one request)

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-user-agent: default" \
  -d '{"url": "https://httpbin.org/user-agent"}'

Other Use Cases

While built for OpenWebUI, this works for:

  • RAG Pipelines - Clean content for embeddings and retrieval
  • Content Archiving - Save readable versions of web pages
  • Web Scraping - Extract data from JavaScript-rendered pages
  • Screenshot Services - Programmatic page captures
  • Search Indexing - Extract text content for indexing

Changelog

v0.1.4

User Agent Rotation & Browser Log Control - Added configurable user agents and a way to silence warnings.

  • New USER_AGENT_ROTATION env var with strategies off (default), round_robin, random - rotates per request
  • Provide the pool via USER_AGENT_POOL (inline, |- or newline-separated) or USER_AGENT_POOL_FILE (path to a file, one UA per line, # comments supported). The file takes precedence when both are set
  • DEFAULT_USER_AGENT overrides the hardcoded default used when rotation is off and no header is set
  • x-user-agent header now accepts special values: rotate forces rotation even when USER_AGENT_ROTATION=off, and default forces the configured default
  • Precedence: explicit header → rotation (if enabled) → configured default. Empty pool safely falls back to the default with a warning at startup
  • New BROWSER_LOG_LEVEL env var (default error) silences chromiumoxide's noisy WS Invalid message warnings emitted when Chromium sends CDP events the driver doesn't yet model. Accepts off, error, warn, info, debug, trace - operates independently of RUST_LOG

v0.1.3

Chromium Egress Proxy Support - Chromium now honors HTTPS_PROXY/HTTP_PROXY/NO_PROXY from the environment so the browser's outbound traffic can be routed through an egress proxy.

  • On launch, if HTTPS_PROXY (or HTTP_PROXY as fallback) is set, Chromium is started with --proxy-server=<url>
  • If NO_PROXY is set, its value is translated to Chrome's bypass-list syntax and passed via --proxy-bypass-list=<list> (commas become semicolons)
  • When no proxy env vars are set, behavior is unchanged - dev/local runs need no configuration
  • The Rust HTTP client (reqwest) already honors these vars natively, so direct HTTP fetches and browser fetches now share the same egress path
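
The core of that NO_PROXY translation is re-joining the entries with semicolons. A sketch of that step (illustrative only; the server's handling of edge cases may differ):

```python
def no_proxy_to_bypass_list(no_proxy: str) -> str:
    """Translate comma-separated NO_PROXY into Chrome's semicolon-separated
    --proxy-bypass-list syntax, dropping empty entries and stray whitespace."""
    return ";".join(e.strip() for e in no_proxy.split(",") if e.strip())
```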

v0.1.2

Screenshot Delivery Fix - Screenshot URLs returned by the API are now actually reachable.

  • Fixed issue where /load responses advertised a screenshot_url that returned 404 when fetched
  • Saved screenshots are now served directly from the configured SCREENSHOT_DIR
  • Safe by design: path-traversal attempts (e.g. /screenshots/../etc/passwd) return 404
  • Respects the same API key authentication as the rest of the API when one is configured

v0.1.1

Browser Pool Resilience - Fixed critical issue where dead browser connections would cause requests to hang indefinitely.

  • Added automatic browser health detection with 5-second timeout on page creation
  • Implemented connection error detection for Ws(AlreadyClosed) and related WebSocket errors
  • Auto-recovery: dead browsers are now automatically recreated on connection failure
  • Request-level retry logic (up to 3 retries) for transient connection errors
  • Health endpoint now exposes healthy status and recreation_count for monitoring

Health response now includes:

{
  "status": "ok",
  "version": "0.1.4",
  "browser_pool": {
    "available": 10,
    "total": 10,
    "healthy": true,
    "recreation_count": 1
  }
}

Monitor recreation_count: a rising value indicates the pool is recovering dead browsers.
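
One simple way to track recovery events is to diff recreation_count between polls. A sketch against the sample payload above (the polling loop and HTTP call are left out; the helper name is hypothetical):

```python
import json

def browser_pool_alert(health, last_count):
    """Given a parsed /health payload, return (message_or_None, new_count)."""
    pool = health["browser_pool"]
    if not pool["healthy"]:
        return "browser pool unhealthy", pool["recreation_count"]
    delta = pool["recreation_count"] - last_count
    if delta > 0:
        return f"browser recreated {delta} time(s)", pool["recreation_count"]
    return None, pool["recreation_count"]

# The sample /health response shown above
sample = json.loads(
    '{"status": "ok", "version": "0.1.4", "browser_pool": '
    '{"available": 10, "total": 10, "healthy": true, "recreation_count": 1}}'
)
```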

License

MIT
