FlyScrape


The Ultimate Node.js Web Scraping & Crawling Engine


FlyScrape is a Node.js package, built on top of Crawl4AI, that makes it easy to integrate powerful scrapers and crawlers directly into your web applications. Designed for the modern web, it provides modular, production-ready tools to extract clean, structured data, ready for RAG pipelines, AI agents, or advanced analytics.

Whether you're building a content aggregator, an AI agent, or a complex data pipeline, FlyScrape simplifies web crawling and scraping while giving you maximum flexibility and performance.

🤔 Why Developers Pick FlyScrape
  • LLM-Ready Output: Generates smart Markdown with headings, tables, code blocks, and citation hints optimized for RAG.
  • Production Grade: Built for reliability with retry strategies, caching, and robust error handling.
  • Full Control: Customize every aspect of the crawl with hooks, custom transformers, and flexible configurations.
  • Anti-Blocking: Integrated stealth techniques to bypass WAFs and bot detection systems.
  • Developer Experience: Fully typed in TypeScript with a modular architecture for easy extensibility.

🚀 Quick Start

1. Installation

npm install @flyrank/flyscrape
# or
yarn add @flyrank/flyscrape
# or
pnpm add @flyrank/flyscrape

2. Basic Crawl

import { AsyncWebCrawler } from "@flyrank/flyscrape";

async function main() {
  const crawler = new AsyncWebCrawler();
  await crawler.start();
  
  // Crawl a URL and get clean Markdown
  const result = await crawler.arun("https://example.com");
  
  if (result.success) {
    console.log(result.markdown);
  }
  
  await crawler.close();
}

main();

3. Content-Only Mode (Smart Cleaning)

Extract only the main article content, removing all UI clutter.

const result = await crawler.arun("https://blog.example.com/guide", {
  contentOnly: true,
  excludeMedia: true, // Remove images/videos
});

✨ Features

📝 Markdown Generation
  • 🧹 Clean Markdown: Generates clean, structured Markdown with accurate formatting.
  • 🎯 Fit Markdown: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
  • 🔗 Citations and References: Converts page links into a numbered reference list with clean citations.
  • 🛠️ Custom Strategies: Users can create their own Markdown generation strategies tailored to specific needs.
  • 📚 BM25 Algorithm: Employs BM25-based filtering for extracting core information and removing irrelevant content (see the sketch after this list).
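
For illustration, a minimal sketch of enabling these behaviors through the `processing.markdown` options shown later in this README. Note that `fitMarkdown`, `citations`, and `bm25Query` are assumed option names, not confirmed API, so check the package reference for the real ones:

const result = await crawler.arun("https://example.com/article", {
  processing: {
    markdown: {
      fitMarkdown: true,          // assumed flag: heuristic noise filtering
      citations: true,            // assumed flag: numbered reference list
      bm25Query: "pricing plans", // assumed option: BM25 relevance filter
    },
  },
});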

🔎 Crawling & Scraping
  • 🖼️ Media Support: Extract images, audio, videos, and responsive image formats like srcset and picture.
  • 🚀 Dynamic Crawling: Execute JavaScript on the page and wait on synchronous or asynchronous conditions before extracting dynamic content.
  • 📸 Screenshots: Capture page screenshots during crawling for debugging or analysis (see the sketch after this list).
  • 📂 Raw Data Crawling: Directly process raw HTML (raw:) or local files (file://).
  • 🔗 Comprehensive Link Extraction: Extracts internal links, external links, and embedded iframe content.
  • 🛠️ Customizable Hooks: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
  • 💾 Caching: Cache data for improved speed and to avoid redundant fetches.
  • 📄 Metadata Extraction: Retrieve structured metadata (OpenGraph, Twitter Cards) from web pages.
  • 📑 IFrame Content Extraction: Seamless extraction from embedded iframe content.
  • 🕵️ Lazy Load Handling: Waits for images to fully load, ensuring no content is missed due to lazy loading.
  • 🔄 Full-Page Scanning: Simulates scrolling to load and capture all dynamic content, perfect for infinite-scroll pages.
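
As a quick illustration of combining several of these, the sketch below requests a screenshot and media extraction in one crawl. `screenshot`, `extractMedia`, and the result field names are assumptions, not verified against the package:

const result = await crawler.arun("https://example.com/gallery", {
  screenshot: true,   // assumed flag: capture a page screenshot
  extractMedia: true, // assumed flag: collect images, video, and srcset entries
});

// Assumed result shape: links, media, and metadata alongside the Markdown.
console.log(result.links?.internal, result.media?.images, result.metadata);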

📊 Structured Data & AI
  • 🧠 AI-Powered Extraction: Seamlessly integrate with OpenAI and other LLMs to extract structured JSON data.
  • 🧹 Smart Content Cleaning: Automatically strips navigation, ads, footers, and boilerplate.
  • 📝 LLM-Ready Markdown: Converts HTML to clean, semantic Markdown, optimized for RAG (Retrieval-Augmented Generation) pipelines.

🕶️ Stealth & Performance
  • 👻 Stealth Mode: Integrated evasion techniques (user-agent rotation, fingerprinting protection) to bypass WAFs.
  • ⚡ Hybrid Caching: Memory- and disk-based caching to speed up repeated crawls.
  • 🚫 Resource Blocking: Block unnecessary assets (images, CSS, fonts) for faster loading (see the sketch below).
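
A minimal sketch of pairing caching with resource blocking for fast repeat crawls. `cacheMode` and `blockResources` are illustrative names only; consult the API reference for the real options:

const result = await crawler.arun("https://example.com", {
  cacheMode: "enabled",                            // assumed: reuse cached responses
  blockResources: ["image", "stylesheet", "font"], // assumed: skip heavy assets
});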

🔬 Advanced Usage Examples

🔄 Session Persistence (Anti-Detection)

Keep your session alive across multiple requests to look like a real user and avoid being blocked.

const sessionId = 'my-session-1';

// First request: Creates session, saves cookies/local storage
await crawler.arun("https://example.com/login", { 
  session_id: sessionId 
});

// Second request: Reuses the same session (cookies are preserved!)
await crawler.arun("https://example.com/dashboard", { 
  session_id: sessionId 
});

// Clean up when done
await crawler.closeSession(sessionId);

⚡ TLS Client / Fast Mode

Use impit under the hood to mimic real browser TLS fingerprints without the overhead of a full browser.

// Fast mode (no browser, but stealthy TLS fingerprint)
const result = await crawler.arun("https://example.com", {
  jsExecution: false // Disables Playwright, enables impit
});

🕶️ Stealth Mode

Enable advanced anti-detection features to bypass WAFs and bot detection systems.

const crawler = new AsyncWebCrawler({
  stealth: true, // Enable stealth mode
  headless: true,
});

await crawler.start();

🛠️ Custom Markdown Strategies

Need full control? Provide a customTransformer to define exactly how HTML maps to Markdown.

const result = await crawler.arun("https://example.com", {
  processing: {
    markdown: {
      customTransformer: (html) => {
        // Your custom logic here
        return myCustomConverter(html);
      }
    }
  }
});
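
For instance, the transformer can delegate to an off-the-shelf converter such as the turndown package (an assumption about your toolchain, not a FlyScrape dependency):

import TurndownService from "turndown";

const turndown = new TurndownService({ headingStyle: "atx" });

const result = await crawler.arun("https://example.com", {
  processing: {
    markdown: {
      // Hand HTML-to-Markdown conversion off to Turndown
      customTransformer: (html) => turndown.turndown(html),
    },
  },
});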

🔄 Dynamic Content & Infinite Scroll

Handle modern SPAs with ease using built-in scrolling and wait strategies.

const result = await crawler.arun("https://infinite-scroll.com", {
  autoScroll: true, // Automatically scroll to bottom
  waitMode: 'networkidle', // Wait for network to settle
});
πŸ› οΈ Lifecycle Hooks

Inject custom logic at key stages of the crawling process.

const result = await crawler.arun("https://example.com", {
  hooks: {
    onPageCreated: async (page) => {
      // Set cookies or modify environment
      await page.context().addCookies([...]);
    },
    onLoad: async (page) => {
      // Interact with the page
      await page.click('#accept-cookies'); 
    }
  }
});

📂 Raw HTML & Local Files

Process raw HTML or local files directly without a web server.

// Raw HTML
await crawler.arun("raw:<html><body><h1>Hello</h1></body></html>");

// Local File
await crawler.arun("file:///path/to/local/file.html");

🧠 Structured Data Extraction (LLM)

Define a schema and let the LLM do the work.

const schema = {
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
    features: { type: "array", items: { type: "string" } }
  }
};

const result = await crawler.arun("https://store.example.com/product/123", {
  extraction: {
    type: "llm",
    schema: schema,
    provider: myOpenAIProvider // Your LLM provider instance
  }
});
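
If the crawl succeeds, the schema-shaped object should be available on the result. The `extracted` field name below is an assumption; check the result type in your editor:

if (result.success) {
  const product = result.extracted; // assumed field for the extracted JSON
  console.log(product.title, product.price, product.features);
}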

🤝 Contributing

We welcome contributions! Please see our Contribution Guidelines for details on how to get started.

📄 License

This project is licensed under the MIT License.


Built with ❤️ by FlyRank