FlyScrape is a Node.js package built on top of Crawl4AI that makes it easy to integrate powerful scrapers and crawlers directly into your web applications. Designed for the modern web, it provides modular, production-ready tools to extract clean, structured data ready for RAG pipelines, AI agents, or advanced analytics.
Whether you're building a content aggregator, an AI agent, or a complex data pipeline, FlyScrape simplifies web crawling and scraping while giving you maximum flexibility and performance.
## Why Developers Pick FlyScrape
- LLM-Ready Output: Generates smart Markdown with headings, tables, code blocks, and citation hints optimized for RAG.
- Production Grade: Built for reliability with retry strategies, caching, and robust error handling.
- Full Control: Customize every aspect of the crawl with hooks, custom transformers, and flexible configurations.
- Anti-Blocking: Integrated stealth techniques to bypass WAFs and bot detection systems.
- Developer Experience: Fully typed in TypeScript with a modular architecture for easy extensibility.
## Installation

```bash
npm install @flyrank/flyscrape
# or
yarn add @flyrank/flyscrape
# or
pnpm add @flyrank/flyscrape
```

## Quick Start

```js
import { AsyncWebCrawler } from "@flyrank/flyscrape";
async function main() {
  const crawler = new AsyncWebCrawler();
  await crawler.start();

  // Crawl a URL and get clean Markdown
  const result = await crawler.arun("https://example.com");

  if (result.success) {
    console.log(result.markdown);
  }

  await crawler.close();
}

main();
```

Extract only the main article content, removing all UI clutter:

```js
const result = await crawler.arun("https://blog.example.com/guide", {
  contentOnly: true,
  excludeMedia: true, // Remove images/videos
});
```

## Markdown Generation
- Clean Markdown: Generates clean, structured Markdown with accurate formatting.
- Fit Markdown: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- Citations and References: Converts page links into a numbered reference list with clean citations.
- Custom Strategies: Users can create their own Markdown generation strategies tailored to specific needs.
- BM25 Algorithm: Employs BM25-based filtering for extracting core information and removing irrelevant content (see the sketch after this list).
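The snippet below sketches how these behaviors might be enabled together. It is an illustration only: the `fit`, `citations`, and `bm25` option names under `processing.markdown` are assumptions (only `customTransformer` is shown elsewhere in this README), not confirmed API.

```js
// Hypothetical configuration: `fit`, `citations`, and `bm25` are assumed
// option names for illustration and may differ from the actual API.
const result = await crawler.arun("https://blog.example.com/guide", {
  processing: {
    markdown: {
      fit: true,       // assumed: heuristic noise filtering ("Fit Markdown")
      citations: true, // assumed: convert links into a numbered reference list
      bm25: { query: "getting started" }, // assumed: BM25 relevance filtering
    },
  },
});

if (result.success) {
  console.log(result.markdown);
}
```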
## Crawling & Scraping
- Media Support: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- Dynamic Crawling: Execute JavaScript and wait (async or sync) for dynamic content before extraction.
- Screenshots: Capture page screenshots during crawling for debugging or analysis (see the sketch after this list).
- Raw Data Crawling: Directly process raw HTML (`raw:`) or local files (`file://`).
- Comprehensive Link Extraction: Extracts internal links, external links, and embedded iframe content.
- Customizable Hooks: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
- Caching: Cache data for improved speed and to avoid redundant fetches.
- Metadata Extraction: Retrieve structured metadata (OpenGraph, Twitter Cards) from web pages.
- IFrame Content Extraction: Seamless extraction from embedded iframe content.
- Lazy Load Handling: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- Full-Page Scanning: Simulates scrolling to load and capture all dynamic content, perfect for infinite-scroll pages.
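A screenshot-plus-metadata crawl might look like the sketch below. The `screenshot` option and the `metadata`, `links`, and `screenshot` fields on the result are assumed names for illustration, not confirmed API.

```js
import { writeFileSync } from "node:fs";

// Hypothetical usage: `screenshot` (option and result field), `metadata`,
// and `links` are assumed names for illustration.
const result = await crawler.arun("https://example.com", {
  screenshot: true, // assumed: capture a page screenshot during the crawl
});

if (result.success) {
  console.log(result.metadata?.ogTitle);                          // assumed OpenGraph field
  console.log(result.links?.internal?.length, "internal links");  // assumed link buckets
  writeFileSync("page.png", Buffer.from(result.screenshot, "base64")); // assumed base64 string
}
```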
## Structured Data & AI
- AI-Powered Extraction: Seamlessly integrate with OpenAI and other LLMs to extract structured JSON data.
- Smart Content Cleaning: Automatically strips navigation, ads, footers, and boilerplate.
- LLM-Ready Markdown: Converts HTML to clean, semantic Markdown optimized for RAG (Retrieval-Augmented Generation) pipelines.
## Stealth & Performance
- Stealth Mode: Integrated evasion techniques (user-agent rotation, fingerprinting protection) to bypass WAFs.
- Hybrid Caching: Memory- and disk-based caching to speed up redundant crawls.
- Resource Blocking: Block unnecessary assets (images, CSS, fonts) for faster loading (see the sketch after this list).
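A minimal sketch of combining the two, assuming `blockResources` and `cache` as option names; neither is confirmed API.

```js
// Hypothetical configuration: `blockResources` and `cache` are assumed
// option names for illustration.
const result = await crawler.arun("https://example.com", {
  blockResources: ["image", "stylesheet", "font"], // assumed: skip heavy assets
  cache: { mode: "enabled" },                      // assumed: reuse cached responses
});
```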
## Session Persistence (Anti-Detection)
Keep your session alive across multiple requests to look like a real user and avoid being blocked.
```js
const sessionId = 'my-session-1';

// First request: creates the session and saves cookies/local storage
await crawler.arun("https://example.com/login", {
  session_id: sessionId,
});

// Second request: reuses the same session (cookies are preserved!)
await crawler.arun("https://example.com/dashboard", {
  session_id: sessionId,
});

// Clean up when done
await crawler.closeSession(sessionId);
```

## TLS Client / Fast Mode
Use impit under the hood to mimic real browser TLS fingerprints without the overhead of a full browser.
```js
// Fast mode (no browser, but stealthy TLS fingerprint)
const result = await crawler.arun("https://example.com", {
  jsExecution: false, // Disables Playwright, enables impit
});
```

## Stealth Mode
Enable advanced anti-detection features to bypass WAFs and bot detection systems.
```js
const crawler = new AsyncWebCrawler({
  stealth: true, // Enable stealth mode
  headless: true,
});

await crawler.start();
```

## Custom Markdown Strategies
Need full control? Provide a `customTransformer` to define exactly how HTML maps to Markdown.

```js
const result = await crawler.arun("https://example.com", {
  processing: {
    markdown: {
      customTransformer: (html) => {
        // Your custom logic here
        return myCustomConverter(html);
      },
    },
  },
});
```

## Dynamic Content & Infinite Scroll
Handle modern SPAs with ease using built-in scrolling and wait strategies.
```js
const result = await crawler.arun("https://infinite-scroll.com", {
  autoScroll: true,        // Automatically scroll to the bottom
  waitMode: 'networkidle', // Wait for the network to settle
});
```

## Lifecycle Hooks
Inject custom logic at key stages of the crawling process.
```js
const result = await crawler.arun("https://example.com", {
  hooks: {
    onPageCreated: async (page) => {
      // Set cookies or modify the environment
      await page.context().addCookies([...]);
    },
    onLoad: async (page) => {
      // Interact with the page
      await page.click('#accept-cookies');
    },
  },
});
```

## Raw HTML & Local Files
Process raw HTML or local files directly without a web server.
```js
// Raw HTML
await crawler.arun("raw:<html><body><h1>Hello</h1></body></html>");

// Local file
await crawler.arun("file:///path/to/local/file.html");
```

## Structured Data Extraction (LLM)
Define a schema and let the LLM do the work.
```js
const schema = {
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
    features: { type: "array", items: { type: "string" } },
  },
};

const result = await crawler.arun("https://store.example.com/product/123", {
  extraction: {
    type: "llm",
    schema: schema,
    provider: myOpenAIProvider, // Your LLM provider instance
  },
});
```

## Contributing

We welcome contributions! Please see our Contribution Guidelines for details on how to get started.
## License

This project is licensed under the MIT License.
