
Commit 73e1e42

feat: update smartcrawler with sitemap functionalities
1 parent cb13062 commit 73e1e42


15 files changed (+795 / -11 lines)


scrapegraph-js/README.md

Lines changed: 30 additions & 2 deletions
@@ -363,6 +363,7 @@ const schema = {
   depth: 2,
   maxPages: 2,
   sameDomainOnly: true,
+  sitemap: true, // Use sitemap for better page discovery
   batchSize: 1,
 });
 console.log('Crawl job started. Response:', crawlResponse);
@@ -392,7 +393,13 @@ const schema = {
 })();
 ```

-You can use a plain JSON schema or a [Zod](https://www.npmjs.com/package/zod) schema for the `schema` parameter. The crawl API supports options for crawl depth, max pages, domain restriction, and batch size.
+You can use a plain JSON schema or a [Zod](https://www.npmjs.com/package/zod) schema for the `schema` parameter. The crawl API supports options for crawl depth, max pages, domain restriction, sitemap discovery, and batch size.
+
+**Sitemap Benefits:**
+- Better page discovery using sitemap.xml
+- More comprehensive website coverage
+- Efficient crawling of structured websites
+- Perfect for e-commerce, news sites, and content-heavy websites

 ### Scraping local HTML

@@ -547,10 +554,31 @@ Searches and extracts information from multiple web sources using AI.

 ### Crawl API

-#### `crawl(apiKey, url, prompt, dataSchema, extractionMode, cacheWebsite, depth, maxPages, sameDomainOnly, sitemap, batchSize)`
+#### `crawl(apiKey, url, prompt, dataSchema, options)`

 Starts a crawl job to extract structured data from a website and its linked pages.

+**Parameters:**
+- `apiKey` (string): Your ScrapeGraph AI API key
+- `url` (string): The starting URL for the crawl
+- `prompt` (string): AI prompt to guide data extraction (required for AI mode)
+- `dataSchema` (object): JSON schema defining extracted data structure (required for AI mode)
+- `options` (object): Optional crawl parameters
+  - `extractionMode` (boolean, default: true): true for AI extraction, false for markdown conversion
+  - `cacheWebsite` (boolean, default: true): Whether to cache website content
+  - `depth` (number, default: 2): Maximum crawl depth (1-10)
+  - `maxPages` (number, default: 2): Maximum pages to crawl (1-100)
+  - `sameDomainOnly` (boolean, default: true): Only crawl pages from the same domain
+  - `sitemap` (boolean, default: false): Use sitemap.xml for better page discovery
+  - `batchSize` (number, default: 1): Batch size for processing pages (1-10)
+  - `renderHeavyJs` (boolean, default: false): Whether to render heavy JavaScript
+
+**Sitemap Benefits:**
+- Better page discovery using sitemap.xml
+- More comprehensive website coverage
+- Efficient crawling of structured websites
+- Perfect for e-commerce, news sites, and content-heavy websites
+
 ### Markdownify

 #### `markdownify(apiKey, url, headers)`
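
Taken together, the signature and options documented above imply call sites like the following minimal sketch. It is not part of this commit's diff; the `scrapegraph-js` import path, target URL, prompt, and schema are illustrative assumptions, while the option names and the polling helper mirror the README and example code in this commit.

```javascript
import { crawl, getCrawlRequest } from 'scrapegraph-js';

const apiKey = process.env.SGAI_APIKEY;

// Placeholder schema: the commit itself does not prescribe one.
const schema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    summary: { type: 'string' },
  },
};

(async () => {
  const crawlResponse = await crawl(apiKey, 'https://example.com', 'Summarize each page', schema, {
    depth: 2,
    maxPages: 2,
    sameDomainOnly: true,
    sitemap: true, // new in this commit: use sitemap.xml for page discovery
    batchSize: 1,
  });
  console.log('Crawl job started. Response:', crawlResponse);

  // The crawl runs server-side; poll the returned id for the final result.
  const crawlId = crawlResponse.id || crawlResponse.task_id || crawlResponse.crawl_id;
  const result = await getCrawlRequest(apiKey, crawlId);
  console.log(result);
})();
```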

scrapegraph-js/examples/crawl/crawl_example.js

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ const prompt = 'What does the company do? and I need text content from there pri
   depth: 2,
   maxPages: 2,
   sameDomainOnly: true,
+  sitemap: true, // Use sitemap for better page discovery
   batchSize: 1,
 });
 console.log('\nCrawl job started. Response:');

scrapegraph-js/examples/crawl/crawl_markdown_example.js

Lines changed: 2 additions & 2 deletions
@@ -98,7 +98,7 @@ async function markdownCrawlingExample() {
   console.log("🤖 AI Prompt: None (no AI processing)");
   console.log("📊 Crawl Depth: 2");
   console.log("📄 Max Pages: 2");
-  console.log("🗺️ Use Sitemap: false");
+  console.log("🗺️ Use Sitemap: true");
   console.log("💡 Mode: Pure HTML to markdown conversion");
   console.log();

@@ -112,7 +112,7 @@ async function markdownCrawlingExample() {
     depth: 2,
     maxPages: 2,
     sameDomainOnly: true,
-    sitemap: false,
+    sitemap: true, // Use sitemap for better page discovery
     // Note: No prompt or dataSchema needed when extractionMode=false
   });

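The markdown-conversion change above only shows the options object; a minimal sketch of the corresponding call might look like the following. It assumes, as the example's own comment suggests, that `prompt` and `dataSchema` can simply be passed as `null` when `extractionMode` is false; the import path and URL are placeholders.

```javascript
import { crawl } from 'scrapegraph-js';

(async () => {
  // extractionMode: false converts pages to markdown; no prompt or dataSchema
  // is needed, so both are passed as null here (assumed to be accepted).
  const response = await crawl(process.env.SGAI_APIKEY, 'https://example.com', null, null, {
    extractionMode: false,
    depth: 2,
    maxPages: 2,
    sameDomainOnly: true,
    sitemap: true, // enabled by this commit in the markdown example as well
  });
  console.log(response);
})();
```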
Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
#!/usr/bin/env node

/**
 * Example demonstrating the ScrapeGraphAI Crawler with sitemap functionality.
 *
 * This example shows how to use the crawler with sitemap enabled for better page discovery:
 * - Sitemap helps discover more pages efficiently
 * - Better coverage of website content
 * - More comprehensive crawling results
 *
 * Requirements:
 * - Node.js 14+
 * - scrapegraph-js
 * - dotenv
 * - A valid API key (set in .env file as SGAI_APIKEY=your_key or environment variable)
 *
 * Usage:
 *   node crawl_sitemap_example.js
 */

import { crawl, getCrawlRequest } from '../index.js';
import 'dotenv/config';

// Example .env file:
// SGAI_APIKEY=your_sgai_api_key

const apiKey = process.env.SGAI_APIKEY;

/**
 * Poll for crawl results with intelligent backoff to avoid rate limits.
 * @param {string} crawlId - The crawl ID to poll for
 * @param {number} maxAttempts - Maximum number of polling attempts
 * @returns {Promise<Object>} The final result or throws an exception on timeout/failure
 */
async function pollForResult(crawlId, maxAttempts = 20) {
  console.log("⏳ Starting to poll for results with rate-limit protection...");

  // Initial wait to give the job time to start processing
  await new Promise(resolve => setTimeout(resolve, 15000));

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const result = await getCrawlRequest(apiKey, crawlId);
      const status = result.status;

      if (status === "success") {
        return result;
      } else if (status === "failed") {
        throw new Error(`Crawl failed: ${result.error || 'Unknown error'}`);
      } else {
        // Calculate progressive wait time: start at 15s, increase gradually
        const baseWait = 15000;
        const progressiveWait = Math.min(60000, baseWait + (attempt * 3000)); // Cap at 60s

        console.log(`⏳ Status: ${status} (attempt ${attempt + 1}/${maxAttempts}) - waiting ${progressiveWait/1000}s...`);
        await new Promise(resolve => setTimeout(resolve, progressiveWait));
      }
    } catch (error) {
      if (error.message.toLowerCase().includes('rate') || error.message.includes('429')) {
        const waitTime = Math.min(90000, 45000 + (attempt * 10000));
        console.log(`⚠️ Rate limit detected in error, waiting ${waitTime/1000}s...`);
        await new Promise(resolve => setTimeout(resolve, waitTime));
        continue;
      } else {
        console.log(`❌ Error polling for results: ${error.message}`);
        if (attempt < maxAttempts - 1) {
          await new Promise(resolve => setTimeout(resolve, 20000)); // Wait before retry
          continue;
        }
        throw error;
      }
    }
  }

  throw new Error(`⏰ Timeout: Job did not complete after ${maxAttempts} attempts`);
}

/**
 * Sitemap-enabled Crawling Example
 *
 * This example demonstrates how to use sitemap for better page discovery.
 * Sitemap helps the crawler find more pages efficiently by using the website's sitemap.xml.
 */
async function sitemapCrawlingExample() {
  console.log("=".repeat(60));
  console.log("SITEMAP-ENABLED CRAWLING EXAMPLE");
  console.log("=".repeat(60));
  console.log("Use case: Comprehensive website crawling with sitemap discovery");
  console.log("Benefits: Better page coverage, more efficient crawling");
  console.log("Features: Sitemap-based page discovery, structured data extraction");
  console.log();

  // Target URL - using a website that likely has a sitemap
  const url = "https://www.giemmeagordo.com/risultati-ricerca-annunci/?sort=newest&search_city=&search_lat=null&search_lng=null&search_category=0&search_type=0&search_min_price=&search_max_price=&bagni=&bagni_comparison=equal&camere=&camere_comparison=equal";

  // Schema for real estate listings
  const schema = {
    "type": "object",
    "properties": {
      "listings": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "price": { "type": "string" },
            "location": { "type": "string" },
            "description": { "type": "string" },
            "features": { "type": "array", "items": { "type": "string" } },
            "url": { "type": "string" }
          }
        }
      }
    }
  };

  const prompt = "Extract all real estate listings with their details including title, price, location, description, and features";

  console.log(`🌐 Target URL: ${url}`);
  console.log("🤖 AI Prompt: Extract real estate listings");
  console.log("📊 Crawl Depth: 1");
  console.log("📄 Max Pages: 10");
  console.log("🗺️ Use Sitemap: true (enabled for better page discovery)");
  console.log("🏠 Same Domain Only: true");
  console.log("💾 Cache Website: true");
  console.log("💡 Mode: AI extraction with sitemap discovery");
  console.log();

  // Start the sitemap-enabled crawl job
  console.log("🚀 Starting sitemap-enabled crawl job...");

  try {
    // Call crawl with sitemap=true for better page discovery
    const response = await crawl(apiKey, url, prompt, schema, {
      extractionMode: true, // AI extraction mode
      depth: 1,
      maxPages: 10,
      sameDomainOnly: true,
      cacheWebsite: true,
      sitemap: true, // Enable sitemap for better page discovery
    });

    const crawlId = response.id || response.task_id || response.crawl_id;

    if (!crawlId) {
      console.log("❌ Failed to start sitemap-enabled crawl job");
      return;
    }

    console.log(`📋 Crawl ID: ${crawlId}`);
    console.log("⏳ Polling for results...");
    console.log();

    // Poll for results with rate-limit protection
    const result = await pollForResult(crawlId, 20);

    console.log("✅ Sitemap-enabled crawl completed successfully!");
    console.log();

    const resultData = result.result || {};
    const llmResult = resultData.llm_result || {};
    const crawledUrls = resultData.crawled_urls || [];
    const creditsUsed = resultData.credits_used || 0;
    const pagesProcessed = resultData.pages_processed || 0;

    // Prepare JSON output
    const jsonOutput = {
      crawl_results: {
        pages_processed: pagesProcessed,
        credits_used: creditsUsed,
        cost_per_page: pagesProcessed > 0 ? creditsUsed / pagesProcessed : 0,
        crawled_urls: crawledUrls,
        sitemap_enabled: true
      },
      extracted_data: llmResult
    };

    // Print JSON output
    console.log("📊 RESULTS IN JSON FORMAT:");
    console.log("-".repeat(40));
    console.log(JSON.stringify(jsonOutput, null, 2));

    // Print summary
    console.log("\n" + "=".repeat(60));
    console.log("📈 CRAWL SUMMARY:");
    console.log("=".repeat(60));
    console.log(`✅ Pages processed: ${pagesProcessed}`);
    console.log(`💰 Credits used: ${creditsUsed}`);
    console.log(`🔗 URLs crawled: ${crawledUrls.length}`);
    console.log(`🗺️ Sitemap enabled: Yes`);
    console.log(`📊 Data extracted: ${llmResult.listings ? llmResult.listings.length : 0} listings found`);

  } catch (error) {
    console.log(`❌ Sitemap-enabled crawl failed: ${error.message}`);
  }
}

/**
 * Main function to run the sitemap crawling example.
 */
async function main() {
  console.log("🌐 ScrapeGraphAI Crawler - Sitemap Example");
  console.log("Comprehensive website crawling with sitemap discovery");
  console.log("=".repeat(60));

  // Check if API key is set
  if (!apiKey) {
    console.log("⚠️ Please set your API key in the environment variable SGAI_APIKEY");
    console.log("   Option 1: Create a .env file with: SGAI_APIKEY=your_api_key_here");
    console.log("   Option 2: Set environment variable: export SGAI_APIKEY=your_api_key_here");
    console.log();
    console.log("   You can get your API key from: https://dashboard.scrapegraphai.com");
    return;
  }

  console.log(`🔑 Using API key: ${apiKey.substring(0, 10)}...`);
  console.log();

  // Run the sitemap crawling example
  await sitemapCrawlingExample();

  console.log("\n" + "=".repeat(60));
  console.log("🎉 Example completed!");
  console.log("💡 This demonstrates sitemap-enabled crawling:");
  console.log("   • Better page discovery using sitemap.xml");
  console.log("   • More comprehensive website coverage");
  console.log("   • Efficient crawling of structured websites");
  console.log("   • Perfect for e-commerce, news sites, and content-heavy websites");
}

// Run the example
main().catch(console.error);

scrapegraph-js/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "name": "scrapegraph-js",
   "author": "ScrapeGraphAI",
-  "version": "0.1.5",
+  "version": "0.1.6",
   "description": "Scrape and extract structured data from a webpage using ScrapeGraphAI's APIs. Supports cookies for authentication, infinite scrolling, and pagination.",
   "repository": {
     "type": "git",

scrapegraph-js/src/crawl.js

Lines changed: 2 additions & 0 deletions
@@ -66,6 +66,7 @@ export async function crawl(
     depth = 2,
     maxPages = 2,
     sameDomainOnly = true,
+    sitemap = false,
     batchSize = 1,
   } = options;

@@ -77,6 +78,7 @@
     depth,
     max_pages: maxPages,
     same_domain_only: sameDomainOnly,
+    sitemap,
     batch_size: batchSize,
     render_heavy_js: renderHeavyJs,
   };
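
For reference, the mapping that this change extends can be sketched as a standalone function: `crawl()` destructures camelCase options with defaults and places them into the snake_case request body, with `sitemap` passed through under the same name. `buildCrawlPayload` is a hypothetical name used only for illustration, not an export of `src/crawl.js`.

```javascript
// Hypothetical helper (not in src/crawl.js) illustrating the option-to-payload
// mapping visible in the diff above; `sitemap` is the field added by this commit.
function buildCrawlPayload({
  depth = 2,
  maxPages = 2,
  sameDomainOnly = true,
  sitemap = false, // new option added in this commit
  batchSize = 1,
  renderHeavyJs = false,
} = {}) {
  return {
    depth,
    max_pages: maxPages,
    same_domain_only: sameDomainOnly,
    sitemap,
    batch_size: batchSize,
    render_heavy_js: renderHeavyJs,
  };
}

console.log(buildCrawlPayload({ sitemap: true }));
// -> { depth: 2, max_pages: 2, same_domain_only: true, sitemap: true, batch_size: 1, render_heavy_js: false }
```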

scrapegraph-py/README.md

Lines changed: 7 additions & 1 deletion
@@ -300,14 +300,20 @@ for page in result["result"]["pages"]:
 - **depth** (default: 2): Maximum crawl depth (1-10)
 - **max_pages** (default: 2): Maximum pages to crawl (1-100)
 - **same_domain_only** (default: True): Only crawl pages from the same domain
-- **sitemap** (default: False): Use sitemap for better page discovery
+- **sitemap** (default: False): Use sitemap.xml for better page discovery and more comprehensive crawling
 - **cache_website** (default: True): Cache website content
 - **batch_size** (optional): Batch size for processing pages (1-10)

 **Cost Comparison:**
 - AI Extraction Mode: ~10 credits per page
 - Markdown Conversion Mode: ~2 credits per page (80% savings!)

+**Sitemap Benefits:**
+- Better page discovery using sitemap.xml
+- More comprehensive website coverage
+- Efficient crawling of structured websites
+- Perfect for e-commerce, news sites, and content-heavy websites
+
 </details>

 ## ⚡ Async Support

scrapegraph-py/examples/crawl/async/async_crawl_example.py

Lines changed: 1 addition & 0 deletions
@@ -67,6 +67,7 @@ async def main():
         depth=2,
         max_pages=2,
         same_domain_only=True,
+        sitemap=True,  # Use sitemap for better page discovery
         # batch_size is optional and will be excluded if not provided
     )
     execution_time = time.time() - start_time

scrapegraph-py/examples/crawl/async/async_crawl_markdown_example.py

Lines changed: 2 additions & 2 deletions
@@ -105,7 +105,7 @@ async def markdown_crawling_example():
     print("🤖 AI Prompt: None (no AI processing)")
     print("📊 Crawl Depth: 2")
     print("📄 Max Pages: 2")
-    print("🗺️ Use Sitemap: False")
+    print("🗺️ Use Sitemap: True")
     print("💡 Mode: Pure HTML to markdown conversion")
     print()

@@ -119,7 +119,7 @@
         depth=2,
         max_pages=2,
         same_domain_only=True,
-        sitemap=False,  # Use sitemap for better coverage
+        sitemap=True,  # Use sitemap for better coverage
         # Note: No prompt or data_schema needed when extraction_mode=False
     )