
⭐ Teracrawl

High-performance web crawler & scraper API optimized for LLMs.

Powered by Browser.cash remote browsers.

Features • Quick Start • API Reference • Configuration • Docker



⚠️ Important: Search functionality (`/crawl` and `/serp/search`) requires a running instance of browser-serp.


📊 Benchmarks

Teracrawl achieves #1 coverage (84.2%) across 14 scraping providers on the scrape-evals benchmark, an open evaluation framework that tests web scrapers against 1,000 diverse URLs for success rate and content quality.


🚀 What is Teracrawl?

Teracrawl is a production-ready API designed to turn websites into clean, LLM-ready Markdown. It handles the complexity of JavaScript rendering, anti-bot measures, and parallel execution, allowing AI systems to access real-time data quickly.

Unlike simple HTML scrapers, Teracrawl uses real managed Chrome browsers, ensuring high success rates even on protected sites.

Why use Teracrawl?

  • 🤖 LLM-Optimized Output: Converts complex HTML into clean, semantic Markdown perfect for RAG and context windows.
  • ⚡ Smart Two-Phase Crawling (see the sketch after this list):
    • Fast Mode: Optimized for static/SSR pages (reuses contexts, blocks heavy assets).
    • Dynamic Mode: Automatic fallback for complex SPAs (waits for hydration/rendering).
  • 🔍 Search & Scrape: Single endpoint to query Google and scrape the top results in parallel.
  • 🏎️ High Concurrency: Built on a robust session pool to handle multiple pages simultaneously.
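
To make the two-phase crawl concrete, here is a minimal sketch using Playwright against a local Chromium as a stand-in for the Browser.cash remote browsers. The function names, fallback heuristic, and timeouts are illustrative, not Teracrawl's actual internals:

import { chromium, type Page } from "playwright";

const FAST_TIMEOUT_MS = 10_000;  // mirrors CRAWL_NAVIGATION_TIMEOUT_MS
const SLOW_TIMEOUT_MS = 20_000;  // mirrors CRAWL_SLOW_TIMEOUT_MS
const MIN_CONTENT_LENGTH = 200;  // mirrors CRAWL_MIN_CONTENT_LENGTH

// Phase 1 ("Fast"): block heavy assets and settle for the initial DOM.
async function fastScrape(page: Page, url: string): Promise<string> {
  await page.route("**/*", (route) => {
    const type = route.request().resourceType();
    return ["image", "media", "font"].includes(type) ? route.abort() : route.continue();
  });
  await page.goto(url, { waitUntil: "domcontentloaded", timeout: FAST_TIMEOUT_MS });
  return page.content();
}

// Phase 2 ("Dynamic"): allow all assets and wait for the SPA to hydrate.
async function dynamicScrape(page: Page, url: string): Promise<string> {
  await page.unroute("**/*");
  await page.goto(url, { waitUntil: "networkidle", timeout: SLOW_TIMEOUT_MS });
  return page.content();
}

async function scrape(url: string): Promise<string> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    const html = await fastScrape(page, url);
    // Heuristic fallback: a near-empty document suggests a client-rendered SPA.
    return html.length >= MIN_CONTENT_LENGTH ? html : await dynamicScrape(page, url);
  } finally {
    await browser.close();
  }
}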

✨ Features

  • Search + Scrape: Query Google and scrape top N results in a single API call.
  • Direct Scraping: Convert any specific URL to Markdown.
  • Smart Content Extraction: Automatically detects main content areas (article, main, etc.) and removes clutter (scripts, styles, navs); see the sketch after this list.
  • Safety & Performance:
    • Blocks ads, trackers, and analytics.
    • Removes base64 images to save token count.
    • Automatic timeout handling and error recovery.
  • Docker Ready: Deploy anywhere with a lightweight container.
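
A minimal sketch of that extraction step, assuming cheerio for DOM queries and turndown for the HTML-to-Markdown conversion; both libraries are illustrative stand-ins rather than confirmed Teracrawl dependencies:

import * as cheerio from "cheerio";
import TurndownService from "turndown";

function extractMarkdown(html: string): string {
  const $ = cheerio.load(html);
  // Strip clutter that wastes tokens and confuses LLMs.
  $("script, style, nav, header, footer, aside, iframe").remove();
  // Prefer a dedicated content container; fall back to the whole body.
  const main =
    $("article").first().html() ??
    $("main").first().html() ??
    $("body").html() ??
    "";
  return new TurndownService().turndown(main);
}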

πŸ› οΈ Quick Start

Prerequisites

  1. Node.js 18+ installed.
  2. A Browser.cash API Key.
  3. A running SERP service like browser-serp on port 8080 (optional; required only for the /crawl and /serp/search endpoints).

Installation

# Clone the repository
git clone https://github.com/BrowserCash/teracrawl.git
cd teracrawl

# Install dependencies
npm install

Configuration

Copy the example environment file and configure your settings:

cp .env.example .env

Open .env and set your BROWSER_API_KEY:

BROWSER_API_KEY=your_browser_cash_api_key_here

Running the Server

# Development mode
npm run dev

# Production build & start
npm run build
npm start

The server will start at http://0.0.0.0:8085.

📚 API Reference

1. Search & Crawl

Performs a Google search and scrapes the content of the top results.

Endpoint: POST /crawl

CURL Request:

curl -X POST http://localhost:8085/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "q": "What is the capital of France?",
    "count": 3
  }'

| Field | Type   | Default  | Description                           |
| ----- | ------ | -------- | ------------------------------------- |
| q     | string | Required | The search query.                     |
| count | number | 3        | Number of results to scrape (max 20). |

Response:

{
  "query": "What is the capital of France?",
  "results": [
    {
      "url": "https://en.wikipedia.org/wiki/Paris",
      "title": "Paris - Wikipedia",
      "markdown": "# Paris\n\nParis is the capital and most populous city of France...",
      "status": "success"
    },
    {
      "url": "https://...",
      "status": "error",
      "error": "Timeout exceeded"
    }
  ]
}
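
The same request from Node.js 18+ (which ships a global fetch), run inside an ES module so top-level await works. Filtering on each result's status field keeps one failed page from sinking the whole batch:

type CrawlResult = {
  url: string;
  title?: string;
  markdown?: string;
  status: "success" | "error";
  error?: string;
};

const res = await fetch("http://localhost:8085/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ q: "What is the capital of France?", count: 3 }),
});
const { results } = (await res.json()) as { query: string; results: CrawlResult[] };

// Keep the pages that scraped cleanly; log the rest.
const pages = results.filter((r) => r.status === "success");
for (const failed of results.filter((r) => r.status === "error")) {
  console.warn(`Skipped ${failed.url}: ${failed.error}`);
}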

2. Single Page Scrape

Scrapes a specific URL and converts it to Markdown.

Endpoint: POST /scrape

CURL Request:

curl -X POST http://localhost:8085/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/post-1"
  }'

Response:

{
  "url": "https://example.com/blog/post-1",
  "title": "My Blog Post",
  "markdown": "# My Blog Post\n\nContent of the post...",
  "status": "success"
}
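
A small Node.js sketch that scrapes a page and persists the Markdown for later RAG ingestion (the helper name and output path are hypothetical):

import { writeFile } from "node:fs/promises";

async function scrapeToFile(url: string, outPath: string): Promise<void> {
  const res = await fetch("http://localhost:8085/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url }),
  });
  const page = (await res.json()) as { status: string; markdown?: string; error?: string };
  if (page.status !== "success" || !page.markdown) {
    throw new Error(`Scrape failed: ${page.error ?? "unknown error"}`);
  }
  await writeFile(outPath, page.markdown, "utf8");
}

await scrapeToFile("https://example.com/blog/post-1", "post-1.md");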

3. SERP Search Only

Proxies a search request to the underlying SERP service without scraping content.

Endpoint: POST /serp/search

CURL Request:

curl -X POST http://localhost:8085/serp/search \
  -H "Content-Type: application/json" \
  -d '{
    "q": "browser automation",
    "count": 5
  }'

Response:

{
  "results": [
    {
      "url": "https://...",
      "title": "Result Title",
      "description": "Result description..."
    }
  ]
}
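
This endpoint is handy when you want to inspect or filter results before paying the scraping cost. A sketch that searches first, then scrapes only the matching URLs (the /docs/ filter is just an example):

const serp = await fetch("http://localhost:8085/serp/search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ q: "browser automation", count: 5 }),
});
const { results } = (await serp.json()) as {
  results: { url: string; title: string; description: string }[];
};

// Scrape only the results we actually care about, e.g. documentation pages.
const docs = results.filter((r) => r.url.includes("/docs/"));
const pages = await Promise.all(
  docs.map((r) =>
    fetch("http://localhost:8085/scrape", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: r.url }),
    }).then((res) => res.json()),
  ),
);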

4. Health Check

Endpoint: GET /health

CURL Request:

curl http://localhost:8085/health

Response:

{
  "ok": true
}

βš™οΈ Configuration

Server & Infrastructure

| Variable         | Default               | Description                                         |
| ---------------- | --------------------- | --------------------------------------------------- |
| BROWSER_API_KEY  | Required              | Your Browser.cash API key.                          |
| PORT             | 8085                  | Port for the API server.                            |
| HOST             | 0.0.0.0               | Host to bind to.                                    |
| SERP_SERVICE_URL | http://localhost:8080 | URL of the upstream SERP/Search service.            |
| POOL_SIZE        | 1                     | Number of concurrent browser sessions to maintain.  |
| DEBUG_LOG        | false                 | Enable verbose logging for debugging.               |
| DATALAB_API_KEY  | Optional              | Datalab API key for PDF-to-Markdown conversion.     |

Crawler Tuning

| Variable                    | Default | Description                                                       |
| --------------------------- | ------- | ----------------------------------------------------------------- |
| CRAWL_TABS_PER_SESSION      | 8       | Max concurrent tabs per browser session.                          |
| CRAWL_MIN_CONTENT_LENGTH    | 200     | Minimum markdown char length to consider a scrape successful.     |
| CRAWL_NAVIGATION_TIMEOUT_MS | 10000   | Timeout for "Fast" scraping mode (ms).                            |
| CRAWL_SLOW_TIMEOUT_MS       | 20000   | Timeout for "Slow" scraping mode (ms).                            |
| CRAWL_JITTER_MS             | 0       | Max random delay (ms) between requests to avoid thundering herd.  |
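
For example, a higher-throughput deployment might tune .env like this (values are illustrative; adjust them to your workload and Browser.cash plan):

BROWSER_API_KEY=your_browser_cash_api_key_here
POOL_SIZE=4
CRAWL_TABS_PER_SESSION=8
CRAWL_SLOW_TIMEOUT_MS=30000
CRAWL_JITTER_MS=250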

🐳 Docker

You can run Teracrawl easily using Docker.

Build & Run

# Build the image
docker build -t teracrawl .

# Run with env file
docker run -p 8085:8085 --env-file .env teracrawl

Docker Compose

version: "3.8"
services:
  teracrawl:
    build: .
    ports:
      - "8085:8085"
    environment:
      - BROWSER_API_KEY=${BROWSER_API_KEY}
      - SERP_SERVICE_URL=http://serp:8080
    depends_on:
      - serp

  serp:
    image: ghcr.io/mega-tera/browser-serp:latest
    ports:
      - "8080:8080"

🤝 Contributing

Contributions are welcome! We appreciate your help in making Teracrawl better.

How to Contribute

  1. Fork the Project: Click the 'Fork' button at the top right of this page.
  2. Create your Feature Branch: git checkout -b feature/AmazingFeature
  3. Commit your Changes: git commit -m 'Add some AmazingFeature'
  4. Push to the Branch: git push origin feature/AmazingFeature
  5. Open a Pull Request: Submit your changes for review.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.