
Commit 73e1e42

feat: update smartcrawler with sitemap functionalities
1 parent cb13062 commit 73e1e42


15 files changed (+795 / -11 lines)


scrapegraph-js/README.md

Lines changed: 30 additions & 2 deletions
@@ -363,6 +363,7 @@ const schema = {
   depth: 2,
   maxPages: 2,
   sameDomainOnly: true,
+  sitemap: true, // Use sitemap for better page discovery
   batchSize: 1,
 });
 console.log('Crawl job started. Response:', crawlResponse);
@@ -392,7 +393,13 @@ const schema = {
 })();
 ```

-You can use a plain JSON schema or a [Zod](https://www.npmjs.com/package/zod) schema for the `schema` parameter. The crawl API supports options for crawl depth, max pages, domain restriction, and batch size.
+You can use a plain JSON schema or a [Zod](https://www.npmjs.com/package/zod) schema for the `schema` parameter. The crawl API supports options for crawl depth, max pages, domain restriction, sitemap discovery, and batch size.
+
+**Sitemap Benefits:**
+- Better page discovery using sitemap.xml
+- More comprehensive website coverage
+- Efficient crawling of structured websites
+- Perfect for e-commerce, news sites, and content-heavy websites

 ### Scraping local HTML

@@ -547,10 +554,31 @@ Searches and extracts information from multiple web sources using AI.

 ### Crawl API

-#### `crawl(apiKey, url, prompt, dataSchema, extractionMode, cacheWebsite, depth, maxPages, sameDomainOnly, sitemap, batchSize)`
+#### `crawl(apiKey, url, prompt, dataSchema, options)`

 Starts a crawl job to extract structured data from a website and its linked pages.

+**Parameters:**
+- `apiKey` (string): Your ScrapeGraph AI API key
+- `url` (string): The starting URL for the crawl
+- `prompt` (string): AI prompt to guide data extraction (required for AI mode)
+- `dataSchema` (object): JSON schema defining extracted data structure (required for AI mode)
+- `options` (object): Optional crawl parameters
+  - `extractionMode` (boolean, default: true): true for AI extraction, false for markdown conversion
+  - `cacheWebsite` (boolean, default: true): Whether to cache website content
+  - `depth` (number, default: 2): Maximum crawl depth (1-10)
+  - `maxPages` (number, default: 2): Maximum pages to crawl (1-100)
+  - `sameDomainOnly` (boolean, default: true): Only crawl pages from the same domain
+  - `sitemap` (boolean, default: false): Use sitemap.xml for better page discovery
+  - `batchSize` (number, default: 1): Batch size for processing pages (1-10)
+  - `renderHeavyJs` (boolean, default: false): Whether to render heavy JavaScript
+
+**Sitemap Benefits:**
+- Better page discovery using sitemap.xml
+- More comprehensive website coverage
+- Efficient crawling of structured websites
+- Perfect for e-commerce, news sites, and content-heavy websites
+
 ### Markdownify

 #### `markdownify(apiKey, url, headers)`
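
Taken together, the signature and options documented above imply call sites like the following minimal sketch. It is not part of this commit's diff; the `scrapegraph-js` import path, target URL, prompt, and schema are illustrative assumptions, while the option names and the polling helper mirror the README and example code in this commit.

```javascript
import { crawl, getCrawlRequest } from 'scrapegraph-js';

const apiKey = process.env.SGAI_APIKEY;

// Placeholder schema: the commit itself does not prescribe one.
const schema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    summary: { type: 'string' },
  },
};

(async () => {
  const crawlResponse = await crawl(apiKey, 'https://example.com', 'Summarize each page', schema, {
    depth: 2,
    maxPages: 2,
    sameDomainOnly: true,
    sitemap: true, // new in this commit: use sitemap.xml for page discovery
    batchSize: 1,
  });
  console.log('Crawl job started. Response:', crawlResponse);

  // The crawl runs server-side; poll the returned id for the final result.
  const crawlId = crawlResponse.id || crawlResponse.task_id || crawlResponse.crawl_id;
  const result = await getCrawlRequest(apiKey, crawlId);
  console.log(result);
})();
```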

scrapegraph-js/examples/crawl/crawl_example.js

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ const prompt = 'What does the company do? and I need text content from there pri
   depth: 2,
   maxPages: 2,
   sameDomainOnly: true,
+  sitemap: true, // Use sitemap for better page discovery
   batchSize: 1,
 });
 console.log('\nCrawl job started. Response:');

scrapegraph-js/examples/crawl/crawl_markdown_example.js

Lines changed: 2 additions & 2 deletions
@@ -98,7 +98,7 @@ async function markdownCrawlingExample() {
   console.log("🤖 AI Prompt: None (no AI processing)");
   console.log("📊 Crawl Depth: 2");
   console.log("📄 Max Pages: 2");
-  console.log("🗺️ Use Sitemap: false");
+  console.log("🗺️ Use Sitemap: true");
   console.log("💡 Mode: Pure HTML to markdown conversion");
   console.log();

@@ -112,7 +112,7 @@ async function markdownCrawlingExample() {
     depth: 2,
     maxPages: 2,
     sameDomainOnly: true,
-    sitemap: false,
+    sitemap: true, // Use sitemap for better page discovery
     // Note: No prompt or dataSchema needed when extractionMode=false
   });

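The markdown-conversion change above only shows the options object; a minimal sketch of the corresponding call might look like the following. It assumes, as the example's own comment suggests, that `prompt` and `dataSchema` can simply be passed as `null` when `extractionMode` is false; the import path and URL are placeholders.

```javascript
import { crawl } from 'scrapegraph-js';

(async () => {
  // extractionMode: false converts pages to markdown; no prompt or dataSchema
  // is needed, so both are passed as null here (assumed to be accepted).
  const response = await crawl(process.env.SGAI_APIKEY, 'https://example.com', null, null, {
    extractionMode: false,
    depth: 2,
    maxPages: 2,
    sameDomainOnly: true,
    sitemap: true, // enabled by this commit in the markdown example as well
  });
  console.log(response);
})();
```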
Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
#!/usr/bin/env node

/**
 * Example demonstrating the ScrapeGraphAI Crawler with sitemap functionality.
 *
 * This example shows how to use the crawler with sitemap enabled for better page discovery:
 * - Sitemap helps discover more pages efficiently
 * - Better coverage of website content
 * - More comprehensive crawling results
 *
 * Requirements:
 * - Node.js 14+
 * - scrapegraph-js
 * - dotenv
 * - A valid API key (set in .env file as SGAI_APIKEY=your_key or environment variable)
 *
 * Usage:
 *   node crawl_sitemap_example.js
 */

import { crawl, getCrawlRequest } from '../index.js';
import 'dotenv/config';

// Example .env file:
// SGAI_APIKEY=your_sgai_api_key

const apiKey = process.env.SGAI_APIKEY;

/**
 * Poll for crawl results with intelligent backoff to avoid rate limits.
 * @param {string} crawlId - The crawl ID to poll for
 * @param {number} maxAttempts - Maximum number of polling attempts
 * @returns {Promise<Object>} The final result or throws an exception on timeout/failure
 */
async function pollForResult(crawlId, maxAttempts = 20) {
  console.log("⏳ Starting to poll for results with rate-limit protection...");

  // Initial wait to give the job time to start processing
  await new Promise(resolve => setTimeout(resolve, 15000));

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const result = await getCrawlRequest(apiKey, crawlId);
      const status = result.status;

      if (status === "success") {
        return result;
      } else if (status === "failed") {
        throw new Error(`Crawl failed: ${result.error || 'Unknown error'}`);
      } else {
        // Calculate progressive wait time: start at 15s, increase gradually
        const baseWait = 15000;
        const progressiveWait = Math.min(60000, baseWait + (attempt * 3000)); // Cap at 60s

        console.log(`⏳ Status: ${status} (attempt ${attempt + 1}/${maxAttempts}) - waiting ${progressiveWait/1000}s...`);
        await new Promise(resolve => setTimeout(resolve, progressiveWait));
      }
    } catch (error) {
      if (error.message.toLowerCase().includes('rate') || error.message.includes('429')) {
        const waitTime = Math.min(90000, 45000 + (attempt * 10000));
        console.log(`⚠️ Rate limit detected in error, waiting ${waitTime/1000}s...`);
        await new Promise(resolve => setTimeout(resolve, waitTime));
        continue;
      } else {
        console.log(`❌ Error polling for results: ${error.message}`);
        if (attempt < maxAttempts - 1) {
          await new Promise(resolve => setTimeout(resolve, 20000)); // Wait before retry
          continue;
        }
        throw error;
      }
    }
  }

  throw new Error(`⏰ Timeout: Job did not complete after ${maxAttempts} attempts`);
}

/**
 * Sitemap-enabled Crawling Example
 *
 * This example demonstrates how to use sitemap for better page discovery.
 * Sitemap helps the crawler find more pages efficiently by using the website's sitemap.xml.
 */
async function sitemapCrawlingExample() {
  console.log("=".repeat(60));
  console.log("SITEMAP-ENABLED CRAWLING EXAMPLE");
  console.log("=".repeat(60));
  console.log("Use case: Comprehensive website crawling with sitemap discovery");
  console.log("Benefits: Better page coverage, more efficient crawling");
  console.log("Features: Sitemap-based page discovery, structured data extraction");
  console.log();

  // Target URL - using a website that likely has a sitemap
  const url = "https://www.giemmeagordo.com/risultati-ricerca-annunci/?sort=newest&search_city=&search_lat=null&search_lng=null&search_category=0&search_type=0&search_min_price=&search_max_price=&bagni=&bagni_comparison=equal&camere=&camere_comparison=equal";

  // Schema for real estate listings
  const schema = {
    "type": "object",
    "properties": {
      "listings": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "price": { "type": "string" },
            "location": { "type": "string" },
            "description": { "type": "string" },
            "features": { "type": "array", "items": { "type": "string" } },
            "url": { "type": "string" }
          }
        }
      }
    }
  };

  const prompt = "Extract all real estate listings with their details including title, price, location, description, and features";

  console.log(`🌐 Target URL: ${url}`);
  console.log("🤖 AI Prompt: Extract real estate listings");
  console.log("📊 Crawl Depth: 1");
  console.log("📄 Max Pages: 10");
  console.log("🗺️ Use Sitemap: true (enabled for better page discovery)");
  console.log("🏠 Same Domain Only: true");
  console.log("💾 Cache Website: true");
  console.log("💡 Mode: AI extraction with sitemap discovery");
  console.log();

  // Start the sitemap-enabled crawl job
  console.log("🚀 Starting sitemap-enabled crawl job...");

  try {
    // Call crawl with sitemap=true for better page discovery
    const response = await crawl(apiKey, url, prompt, schema, {
      extractionMode: true, // AI extraction mode
      depth: 1,
      maxPages: 10,
      sameDomainOnly: true,
      cacheWebsite: true,
      sitemap: true, // Enable sitemap for better page discovery
    });

    const crawlId = response.id || response.task_id || response.crawl_id;

    if (!crawlId) {
      console.log("❌ Failed to start sitemap-enabled crawl job");
      return;
    }

    console.log(`📋 Crawl ID: ${crawlId}`);
    console.log("⏳ Polling for results...");
    console.log();

    // Poll for results with rate-limit protection
    const result = await pollForResult(crawlId, 20);

    console.log("✅ Sitemap-enabled crawl completed successfully!");
    console.log();

    const resultData = result.result || {};
    const llmResult = resultData.llm_result || {};
    const crawledUrls = resultData.crawled_urls || [];
    const creditsUsed = resultData.credits_used || 0;
    const pagesProcessed = resultData.pages_processed || 0;

    // Prepare JSON output
    const jsonOutput = {
      crawl_results: {
        pages_processed: pagesProcessed,
        credits_used: creditsUsed,
        cost_per_page: pagesProcessed > 0 ? creditsUsed / pagesProcessed : 0,
        crawled_urls: crawledUrls,
        sitemap_enabled: true
      },
      extracted_data: llmResult
    };

    // Print JSON output
    console.log("📊 RESULTS IN JSON FORMAT:");
    console.log("-".repeat(40));
    console.log(JSON.stringify(jsonOutput, null, 2));

    // Print summary
    console.log("\n" + "=".repeat(60));
    console.log("📈 CRAWL SUMMARY:");
    console.log("=".repeat(60));
    console.log(`✅ Pages processed: ${pagesProcessed}`);
    console.log(`💰 Credits used: ${creditsUsed}`);
    console.log(`🔗 URLs crawled: ${crawledUrls.length}`);
    console.log(`🗺️ Sitemap enabled: Yes`);
    console.log(`📊 Data extracted: ${llmResult.listings ? llmResult.listings.length : 0} listings found`);

  } catch (error) {
    console.log(`❌ Sitemap-enabled crawl failed: ${error.message}`);
  }
}

/**
 * Main function to run the sitemap crawling example.
 */
async function main() {
  console.log("🌐 ScrapeGraphAI Crawler - Sitemap Example");
  console.log("Comprehensive website crawling with sitemap discovery");
  console.log("=".repeat(60));

  // Check if API key is set
  if (!apiKey) {
    console.log("⚠️ Please set your API key in the environment variable SGAI_APIKEY");
    console.log("   Option 1: Create a .env file with: SGAI_APIKEY=your_api_key_here");
    console.log("   Option 2: Set environment variable: export SGAI_APIKEY=your_api_key_here");
    console.log();
    console.log("   You can get your API key from: https://dashboard.scrapegraphai.com");
    return;
  }

  console.log(`🔑 Using API key: ${apiKey.substring(0, 10)}...`);
  console.log();

  // Run the sitemap crawling example
  await sitemapCrawlingExample();

  console.log("\n" + "=".repeat(60));
  console.log("🎉 Example completed!");
  console.log("💡 This demonstrates sitemap-enabled crawling:");
  console.log("   • Better page discovery using sitemap.xml");
  console.log("   • More comprehensive website coverage");
  console.log("   • Efficient crawling of structured websites");
  console.log("   • Perfect for e-commerce, news sites, and content-heavy websites");
}

// Run the example
main().catch(console.error);

scrapegraph-js/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "name": "scrapegraph-js",
   "author": "ScrapeGraphAI",
-  "version": "0.1.5",
+  "version": "0.1.6",
   "description": "Scrape and extract structured data from a webpage using ScrapeGraphAI's APIs. Supports cookies for authentication, infinite scrolling, and pagination.",
   "repository": {
     "type": "git",

scrapegraph-js/src/crawl.js

Lines changed: 2 additions & 0 deletions
@@ -66,6 +66,7 @@ export async function crawl(
     depth = 2,
     maxPages = 2,
     sameDomainOnly = true,
+    sitemap = false,
     batchSize = 1,
   } = options;

@@ -77,6 +78,7 @@
     depth,
     max_pages: maxPages,
     same_domain_only: sameDomainOnly,
+    sitemap,
     batch_size: batchSize,
     render_heavy_js: renderHeavyJs,
   };
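
For reference, the mapping that this change extends can be sketched as a standalone function: `crawl()` destructures camelCase options with defaults and places them into the snake_case request body, with `sitemap` passed through under the same name. `buildCrawlPayload` is a hypothetical name used only for illustration, not an export of `src/crawl.js`.

```javascript
// Hypothetical helper (not in src/crawl.js) illustrating the option-to-payload
// mapping visible in the diff above; `sitemap` is the field added by this commit.
function buildCrawlPayload({
  depth = 2,
  maxPages = 2,
  sameDomainOnly = true,
  sitemap = false, // new option added in this commit
  batchSize = 1,
  renderHeavyJs = false,
} = {}) {
  return {
    depth,
    max_pages: maxPages,
    same_domain_only: sameDomainOnly,
    sitemap,
    batch_size: batchSize,
    render_heavy_js: renderHeavyJs,
  };
}

console.log(buildCrawlPayload({ sitemap: true }));
// -> { depth: 2, max_pages: 2, same_domain_only: true, sitemap: true, batch_size: 1, render_heavy_js: false }
```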

scrapegraph-py/README.md

Lines changed: 7 additions & 1 deletion
@@ -300,14 +300,20 @@ for page in result["result"]["pages"]:
 - **depth** (default: 2): Maximum crawl depth (1-10)
 - **max_pages** (default: 2): Maximum pages to crawl (1-100)
 - **same_domain_only** (default: True): Only crawl pages from the same domain
-- **sitemap** (default: False): Use sitemap for better page discovery
+- **sitemap** (default: False): Use sitemap.xml for better page discovery and more comprehensive crawling
 - **cache_website** (default: True): Cache website content
 - **batch_size** (optional): Batch size for processing pages (1-10)

 **Cost Comparison:**
 - AI Extraction Mode: ~10 credits per page
 - Markdown Conversion Mode: ~2 credits per page (80% savings!)

+**Sitemap Benefits:**
+- Better page discovery using sitemap.xml
+- More comprehensive website coverage
+- Efficient crawling of structured websites
+- Perfect for e-commerce, news sites, and content-heavy websites
+
 </details>

 ## ⚡ Async Support

scrapegraph-py/examples/crawl/async/async_crawl_example.py

Lines changed: 1 addition & 0 deletions
@@ -67,6 +67,7 @@ async def main():
         depth=2,
         max_pages=2,
         same_domain_only=True,
+        sitemap=True,  # Use sitemap for better page discovery
         # batch_size is optional and will be excluded if not provided
     )
     execution_time = time.time() - start_time

scrapegraph-py/examples/crawl/async/async_crawl_markdown_example.py

Lines changed: 2 additions & 2 deletions
@@ -105,7 +105,7 @@ async def markdown_crawling_example():
     print("🤖 AI Prompt: None (no AI processing)")
     print("📊 Crawl Depth: 2")
     print("📄 Max Pages: 2")
-    print("🗺️ Use Sitemap: False")
+    print("🗺️ Use Sitemap: True")
     print("💡 Mode: Pure HTML to markdown conversion")
     print()

@@ -119,7 +119,7 @@
         depth=2,
         max_pages=2,
         same_domain_only=True,
-        sitemap=False,  # Use sitemap for better coverage
+        sitemap=True,  # Use sitemap for better coverage
         # Note: No prompt or data_schema needed when extraction_mode=False
     )