Summary
Add a serverless web crawling engine that lets assistant creators point at a URL, set a crawl depth, and automatically ingest the site's content as markdown into the assistant's knowledge base. Built on AWS Step Functions + Lambda with Crawl4AI for extraction.
Motivation
Many assistants need to be grounded in web content — institutional websites, documentation sites, department pages, etc. Currently, users must manually download pages, convert to a supported format, and upload them one by one. A crawl-and-ingest pipeline removes that friction entirely and keeps knowledge bases up to date.
The target use case for boisestate.ai is crawling institutional sites (department pages, policy docs, course catalogs) so assistants can answer questions grounded in official web content.
Architecture Overview
Serverless Markdown Extraction Engine
┌──────────────────────┐
│ Assistant Edit UI │
│ "Add Website Source" │
└──────────┬───────────┘
│ POST /assistants/{id}/crawl
v
┌──────────────────────┐
│ API Gateway │
│ (trigger endpoint) │
└──────────┬───────────┘
│ StartExecution
v
┌──────────────────────────────────────────┐
│ AWS Step Functions (Standard Workflow) │
│ │
│ ┌─────────────┐ ┌────────────────┐ │
│ │ Scrape Page │───▶│ Store Markdown │ │
│ │ (Lambda) │ │ (S3) │ │
│ └──────┬──────┘ └────────────────┘ │
│ │ discovered links │
│ v │
│ ┌─────────────────┐ │
│ │ Distributed Map │ (fan-out per link) │
│ │ (next depth) │ │
│ └─────────────────┘ │
└──────────────────────────────────────────┘
│ on complete
v
┌──────────────────────┐
│ Ingest to KB │
│ (existing RAG │
│ ingestion pipeline)│
└──────────────────────┘
Components
1. Orchestration: AWS Step Functions
- Standard Workflow as the parent to handle overall job state and depth tracking
- Distributed Map state to fan out — process hundreds of pages concurrently without hitting the 15-minute Lambda timeout
- Each iteration passes current_depth and a list of discovered_links
- Built-in error handling and retry logic per page
2. Compute: AWS Lambda (Container Image)
- Runtime: Python 3.x packaged as a Docker image (hosted in ECR)
- Engine: Crawl4AI + Playwright for JavaScript-rendered pages
- Configuration:
- Memory: 3072 MB (Chromium overhead)
- Ephemeral storage: 512 MB+ for browser binaries
- Timeout: 5 minutes per page
- Output: Clean markdown + list of discovered internal links (see the handler sketch after this list)
3. Storage & Data Flow
- Results: Cleaned markdown streamed to S3 with key structure /{job-id}/{domain}/{page-title}.md
- Deduplication: DynamoDB table storing URL hashes with 24-hour TTL to prevent infinite loops and re-crawling within a job
- Final destination: Markdown files are fed into the existing RAG ingestion pipeline (Docling processor → S3 Vector Buckets)
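A minimal sketch of the per-page worker (items 2 and 3 above), assuming Crawl4AI's AsyncWebCrawler API; the event shape, the CRAWL_BUCKET environment variable, and the use of the URL path as a stand-in for the page title are illustrative, and the exact result fields may differ between Crawl4AI versions:

```python
import asyncio
import os
import re
from urllib.parse import urlparse

import boto3
from crawl4ai import AsyncWebCrawler  # packaged in the container image with Playwright

s3 = boto3.client("s3")
CRAWL_BUCKET = os.environ["CRAWL_BUCKET"]  # illustrative env var set by the CDK stack


async def scrape(url: str) -> tuple[str, list[str]]:
    """Fetch one page, return (markdown, internal_links)."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        internal = [link["href"] for link in result.links.get("internal", [])]
        return str(result.markdown), internal


def handler(event, context):
    """Invoked by Step Functions once per page.

    event: {"job_id": ..., "url": ..., "current_depth": ...}
    """
    url = event["url"]
    markdown, links = asyncio.run(scrape(url))

    # Key layout from the design: /{job-id}/{domain}/{page-title}.md
    # (URL path used here as a stand-in for the page title)
    domain = urlparse(url).netloc
    title = re.sub(r"[^a-zA-Z0-9_-]+", "-", urlparse(url).path).strip("-") or "index"
    key = f"{event['job_id']}/{domain}/{title}.md"
    s3.put_object(Bucket=CRAWL_BUCKET, Key=key, Body=markdown.encode("utf-8"))

    return {
        "s3_key": key,
        "current_depth": event["current_depth"],
        "discovered_links": links,
    }
```

The returned discovered_links and current_depth feed the Distributed Map state that fans out the next depth level.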
Process Flow
- Trigger: User enters a URL and max depth on the assistant edit page → POST /assistants/{id}/crawl starts the Step Function with root_url, max_depth, and assistant_id
- Scrape: Worker Lambda boots Crawl4AI, extracts content, converts to markdown
- Store: Lambda saves the .md file to S3
- Extract: Lambda returns a list of internal links found on the page
- Deduplicate: Step Functions filters out already-crawled URLs via DynamoDB (sketched after this list)
- Recurse: New Map iterations are triggered for the next depth level (up to max_depth)
- Ingest: On workflow completion, trigger the existing document ingestion pipeline for all crawled markdown files
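The Deduplicate step can be implemented as an atomic conditional write against the DynamoDB table: a URL is claimed only if its hash has not been seen for this job, and a TTL attribute lets DynamoDB expire entries after 24 hours. A sketch, with table and attribute names chosen for illustration:

```python
import hashlib
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
DEDUP_TABLE = "crawl-url-dedup"  # illustrative table name


def should_crawl(job_id: str, url: str) -> bool:
    """Return True if this URL has not yet been claimed for this job.

    The conditional put is atomic, so concurrent Map iterations cannot
    both claim the same URL.
    """
    url_hash = hashlib.sha256(f"{job_id}:{url}".encode()).hexdigest()
    try:
        dynamodb.put_item(
            TableName=DEDUP_TABLE,
            Item={
                "url_hash": {"S": url_hash},
                "url": {"S": url},
                "ttl": {"N": str(int(time.time()) + 24 * 3600)},  # 24-hour expiry
            },
            ConditionExpression="attribute_not_exists(url_hash)",
        )
        return True  # first time this job has seen the URL: crawl it
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already crawled or in flight
        raise
```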
API
Start Crawl
POST /assistants/{assistant_id}/crawl
Body: {
"url": "https://www.boisestate.edu/registrar",
"maxDepth": 3,
"maxPages": 100
}
Response: {
"crawlJobId": "job-abc123",
"status": "RUNNING",
"startedAt": "2026-04-01T20:00:00Z"
}
Get Crawl Status
GET /assistants/{assistant_id}/crawl/{crawl_job_id}
Response: {
"crawlJobId": "job-abc123",
"status": "RUNNING" | "COMPLETED" | "FAILED",
"pagesProcessed": 47,
"pagesTotal": 100,
"startedAt": "...",
"completedAt": "..."
}
List Crawl Jobs
GET /assistants/{assistant_id}/crawls
Response: { "crawls": [...] }
Frontend Changes
Assistant Edit Page
- New section: "Website Sources"
- Input field for URL + depth selector (1–5) + max pages cap (default 100)
- "Start Crawl" button
- Progress indicator showing pages processed / total
- List of completed crawl jobs with page count and timestamp
- Ability to delete a crawl job (removes the ingested documents from the knowledge base)
Guardrails
| Guardrail | Value | Rationale |
| --- | --- | --- |
| Max depth | 5 | Prevents exponential blowup |
| Max pages per job | 500 | Cost and time cap |
| Domain lock | Same domain as root URL | Prevents crawling the entire internet |
| Rate limiting | 2 requests/second per domain | Polite crawling, avoid getting blocked |
| robots.txt | Respected | Standard web etiquette |
| URL dedup TTL | 24 hours | Prevents re-crawling within a job window |
| Concurrent Lambdas | 50 | Account-level safety |
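The robots.txt and rate-limiting guardrails can be enforced in the worker with the standard library's robots parser plus a per-domain delay. A single-process sketch, assuming the 2 requests/second figure above (a limit shared across concurrent Lambdas would additionally need the Map state's concurrency setting or a shared store); the user agent string is a placeholder:

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "boisestate-ai-crawler"  # placeholder user agent
MIN_INTERVAL = 0.5                    # 2 requests/second per domain
_last_request: dict[str, float] = {}


def allowed_by_robots(url: str) -> bool:
    """Check the domain's robots.txt before fetching a page."""
    parsed = urlparse(url)
    parser = robotparser.RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(USER_AGENT, url)


def throttle(url: str) -> None:
    """Sleep just enough to keep each domain at or below 2 requests/second."""
    domain = urlparse(url).netloc
    elapsed = time.monotonic() - _last_request.get(domain, 0.0)
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_request[domain] = time.monotonic()
```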
Cost Estimate
Based on the Step Functions + Lambda + S3 + DynamoDB pricing model:
| Scale | Est. Cost |
| --- | --- |
| 50 pages (shallow crawl) | ~$0.15 |
| 200 pages (medium site) | ~$0.50 |
| 500 pages (large site) | ~$1.20 |
Near-zero cost when idle — all serverless, no always-on compute.
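As a rough sanity check on the 50-page figure: assuming about 20 seconds of Lambda time per page at 3 GB of memory and current published on-demand rates (roughly $0.0000167 per GB-second for Lambda and $0.025 per 1,000 Standard Workflow state transitions), 50 pages × 20 s × 3 GB ≈ 3,000 GB-seconds ≈ $0.05 of compute, plus a few hundred state transitions (under $0.01) and negligible S3/DynamoDB request costs, which lines up with the ~$0.15 estimate once headroom is included.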
Infrastructure (CDK)
New CDK stack or extension of the existing gateway stack (a minimal construct sketch follows this list):
- Lambda function (container image in ECR) with Crawl4AI + Playwright
- Step Functions state machine (Standard Workflow with Distributed Map)
- DynamoDB table for URL deduplication (with TTL)
- S3 bucket/prefix for crawl output (or reuse existing assistants bucket)
- IAM roles for Step Functions → Lambda → S3/DynamoDB
- API Gateway route or App API endpoint to trigger workflows
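A minimal CDK sketch, kept in Python for consistency with the worker examples above; it covers the container-image Lambda, the dedup table, the output bucket, and a skeletal state machine, while the Distributed Map fan-out, IAM wiring beyond the grants shown, and the API Gateway route are left out. Names, paths, and sizing are illustrative:

```python
from aws_cdk import (
    Duration,
    RemovalPolicy,
    Size,
    Stack,
    aws_dynamodb as dynamodb,
    aws_lambda as _lambda,
    aws_s3 as s3,
    aws_stepfunctions as sfn,
    aws_stepfunctions_tasks as tasks,
)
from constructs import Construct


class CrawlerStack(Stack):
    """Sketch of the crawl infrastructure; not a final implementation."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Container-image Lambda with Crawl4AI + Playwright baked in
        scrape_fn = _lambda.DockerImageFunction(
            self, "ScrapePage",
            code=_lambda.DockerImageCode.from_image_asset("lambda/crawler"),  # placeholder path
            memory_size=3072,                               # Chromium overhead
            timeout=Duration.minutes(5),                    # per-page timeout
            ephemeral_storage_size=Size.mebibytes(512),
        )

        # URL dedup table with 24-hour TTL
        dedup_table = dynamodb.Table(
            self, "UrlDedup",
            partition_key=dynamodb.Attribute(
                name="url_hash", type=dynamodb.AttributeType.STRING
            ),
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
            time_to_live_attribute="ttl",
            removal_policy=RemovalPolicy.DESTROY,
        )

        # Output bucket (or prefix in the existing assistants bucket)
        crawl_bucket = s3.Bucket(self, "CrawlOutput")

        scrape_fn.add_environment("CRAWL_BUCKET", crawl_bucket.bucket_name)
        crawl_bucket.grant_write(scrape_fn)
        dedup_table.grant_read_write_data(scrape_fn)

        # Skeletal workflow: one scrape task; the fan-out over discovered
        # links at the next depth uses a Distributed Map state (omitted here)
        scrape_task = tasks.LambdaInvoke(
            self, "ScrapeTask",
            lambda_function=scrape_fn,
            payload_response_only=True,
        )
        sfn.StateMachine(
            self, "CrawlWorkflow",
            definition_body=sfn.DefinitionBody.from_chainable(scrape_task),
            state_machine_type=sfn.StateMachineType.STANDARD,
        )
```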
Migration
- No changes to existing data or infrastructure
- Purely additive — new Step Functions workflow, new Lambda, new DynamoDB table
- Existing assistants are unaffected
Out of Scope (Phase 1)
- Scheduled/recurring crawls (re-crawl on a cadence to keep content fresh)
- Authentication-gated pages (pages behind login)
- PDF/document download from crawled pages (only HTML → markdown)
- Cross-domain crawling
- Crawl budget per user/role (quota integration)
- Sitemap.xml-based crawling (optimization for later)