Feature: Web Crawling Engine for Assistant Knowledge Bases #115

@DerrickF

Description

Summary

Add a serverless web crawling engine that lets assistant creators point at a URL, set a crawl depth, and automatically ingest the site's content as markdown into the assistant's knowledge base. Built on AWS Step Functions + Lambda with Crawl4AI for extraction.

Motivation

Many assistants need to be grounded in web content — institutional websites, documentation sites, department pages, etc. Currently, users must manually download pages, convert to a supported format, and upload them one by one. A crawl-and-ingest pipeline removes that friction entirely and keeps knowledge bases up to date.

The target use case for boisestate.ai is crawling institutional sites (department pages, policy docs, course catalogs) so assistants can answer questions grounded in official web content.

Architecture Overview

Serverless Markdown Extraction Engine

┌──────────────────────┐
│  Assistant Edit UI   │
│ "Add Website Source" │
└──────────┬───────────┘
           │ POST /assistants/{id}/crawl
           v
┌──────────────────────┐
│  API Gateway         │
│  (trigger endpoint)  │
└──────────┬───────────┘
           │ StartExecution
           v
┌──────────────────────────────────────────┐
│  AWS Step Functions (Standard Workflow)  │
│                                          │
│  ┌─────────────┐    ┌────────────────┐   │
│  │ Scrape Page │───▶│ Store Markdown │   │
│  │ (Lambda)    │    │ (S3)           │   │
│  └──────┬──────┘    └────────────────┘   │
│         │ discovered links               │
│         v                                │
│  ┌─────────────────┐                     │
│  │ Distributed Map │ (fan-out per link)  │
│  │ (next depth)    │                     │
│  └─────────────────┘                     │
└──────────────────────────────────────────┘
           │ on complete
           v
┌──────────────────────┐
│  Ingest to KB        │
│  (existing RAG       │
│   ingestion pipeline)│
└──────────────────────┘

Components

1. Orchestration: AWS Step Functions

  • Standard Workflow as the parent to handle overall job state and depth tracking
  • Distributed Map state to fan out — process hundreds of pages concurrently without hitting the 15-minute Lambda timeout
  • Each iteration passes current_depth and a list of discovered_links
  • Built-in error handling and retry logic per page
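As a sketch, the parent state machine could be expressed in Amazon States Language roughly as follows. This is illustrative only: the state names, the `${ScrapeLambdaArn}` placeholder, and the exact input paths are assumptions, not a final definition.

```json
{
  "StartAt": "ScrapePage",
  "States": {
    "ScrapePage": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "${ScrapeLambdaArn}",
        "Payload": { "url.$": "$.url", "depth.$": "$.current_depth" }
      },
      "Retry": [ { "ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2 } ],
      "Next": "FanOutLinks"
    },
    "FanOutLinks": {
      "Type": "Map",
      "ItemsPath": "$.discovered_links",
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "STANDARD" },
        "StartAt": "ScrapeChild",
        "States": {
          "ScrapeChild": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "End": true
          }
        }
      },
      "MaxConcurrency": 50,
      "End": true
    }
  }
}
```

`MaxConcurrency: 50` matches the concurrent-Lambda guardrail below; per-page retries live on the Task state so one failed page doesn't fail the whole job.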

2. Compute: AWS Lambda (Container Image)

  • Runtime: Python 3.x packaged as a Docker image (hosted in ECR)
  • Engine: Crawl4AI + Playwright for JavaScript-rendered pages
  • Configuration:
    • Memory: 3072 MB (Chromium overhead)
    • Ephemeral storage: 512 MB+ for browser binaries
    • Timeout: 5 minutes per page
  • Output: Clean markdown + list of discovered internal links
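Crawl4AI hands back the rendered page plus raw hrefs; before returning, the worker would normalize those into absolute, same-domain URLs. A minimal stdlib-only sketch of that normalization step (the `internal_links` name is hypothetical):

```python
from urllib.parse import urljoin, urlparse, urldefrag

def internal_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve raw hrefs against the page URL and keep only
    same-domain HTTP(S) links, fragment-stripped and deduplicated."""
    root = urlparse(page_url).netloc
    seen: set[str] = set()
    out: list[str] = []
    for href in hrefs:
        absolute, _frag = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == root:
            if absolute not in seen:
                seen.add(absolute)
                out.append(absolute)
    return out
```

Stripping fragments and mail/tel schemes here keeps the dedup table from filling with near-duplicate URLs.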

3. Storage & Data Flow

  • Results: Cleaned markdown streamed to S3 with key structure: /{job-id}/{domain}/{page-title}.md
  • Deduplication: DynamoDB table storing URL hashes with 24-hour TTL to prevent infinite loops and re-crawling within a job
  • Final destination: Markdown files are fed into the existing RAG ingestion pipeline (Docling processor → S3 Vector Buckets)
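A sketch of the key and dedup-item shapes described above (helper names are hypothetical; the real dedup write would be a boto3 conditional `PutItem` with `attribute_not_exists(url_hash)` so only the first writer per URL proceeds):

```python
import hashlib
import re
import time

def s3_key(job_id: str, domain: str, page_title: str) -> str:
    """Build the crawl-output key {job-id}/{domain}/{page-title}.md,
    slugifying the title so it is a safe S3 object key."""
    slug = re.sub(r"[^a-z0-9]+", "-", page_title.lower()).strip("-") or "untitled"
    return f"{job_id}/{domain}/{slug}.md"

def dedup_item(url: str, ttl_hours: int = 24) -> dict:
    """DynamoDB dedup item: URL hash as the partition key plus a TTL
    attribute so entries expire after the 24-hour crawl window."""
    return {
        "url_hash": hashlib.sha256(url.encode()).hexdigest(),
        "expires_at": int(time.time()) + ttl_hours * 3600,
    }
```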

Process Flow

  1. Trigger: User enters a URL and max depth on the assistant edit page → POST /assistants/{id}/crawl starts the Step Function with root_url, max_depth, and assistant_id
  2. Scrape: Worker Lambda boots Crawl4AI, extracts content, converts to markdown
  3. Store: Lambda saves the .md file to S3
  4. Extract: Lambda returns a list of internal links found on the page
  5. Deduplicate: Step Functions filters out already-crawled URLs via DynamoDB
  6. Recurse: New Map iterations triggered for the next depth level (up to max_depth)
  7. Ingest: On workflow completion, trigger the existing document ingestion pipeline for all crawled markdown files
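Steps 2–6 amount to a depth-limited breadth-first crawl with deduplication. A local sketch of that control flow, with the scrape step injected as a callable (in production each `fetch` is a Lambda invocation and `seen` is the DynamoDB table):

```python
from collections.abc import Callable

def crawl(root_url: str, max_depth: int, max_pages: int,
          fetch: Callable[[str], tuple[str, list[str]]]) -> dict[str, str]:
    """Depth-limited BFS with dedup. fetch(url) returns
    (markdown, discovered_links); returns {url: markdown}."""
    seen = {root_url}
    frontier = [root_url]
    pages: dict[str, str] = {}
    for depth in range(max_depth + 1):
        next_frontier: list[str] = []
        for url in frontier:
            if len(pages) >= max_pages:       # max-pages guardrail
                return pages
            markdown, links = fetch(url)
            pages[url] = markdown
            for link in links:
                if link not in seen:          # dedup: skip crawled URLs
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier              # fan out to the next depth
    return pages
```

Each pass of the outer loop corresponds to one Distributed Map fan-out at the next depth level.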

API

Start Crawl

POST /assistants/{assistant_id}/crawl
Body: {
  "url": "https://www.boisestate.edu/registrar",
  "maxDepth": 3,
  "maxPages": 100
}
Response: {
  "crawlJobId": "job-abc123",
  "status": "RUNNING",
  "startedAt": "2026-04-01T20:00:00Z"
}

Get Crawl Status

GET /assistants/{assistant_id}/crawl/{crawl_job_id}
Response: {
  "crawlJobId": "job-abc123",
  "status": "RUNNING" | "COMPLETED" | "FAILED",
  "pagesProcessed": 47,
  "pagesTotal": 100,
  "startedAt": "...",
  "completedAt": "..."
}

List Crawl Jobs

GET /assistants/{assistant_id}/crawls
Response: { "crawls": [...] }
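The trigger endpoint would validate the request body before starting an execution. A sketch of that validation, assuming the defaults from the frontend section (depth 1–5, default 100 pages) and the guardrail caps below; the function name and error messages are illustrative:

```python
from urllib.parse import urlparse

MAX_DEPTH, MAX_PAGES = 5, 500   # hard caps from the guardrails table

def validate_crawl_request(body: dict) -> dict:
    """Normalize the POST /assistants/{id}/crawl body into the
    Step Function input, applying defaults and guardrail caps."""
    url = body.get("url", "")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid url: {url!r}")
    depth = int(body.get("maxDepth", 3))
    if not 1 <= depth <= MAX_DEPTH:
        raise ValueError(f"maxDepth must be 1-{MAX_DEPTH}")
    pages = min(int(body.get("maxPages", 100)), MAX_PAGES)
    return {"root_url": url, "max_depth": depth, "max_pages": pages}
```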

Frontend Changes

Assistant Edit Page

  • New section: "Website Sources"
  • Input field for URL + depth selector (1–5) + max pages cap (default 100)
  • "Start Crawl" button
  • Progress indicator showing pages processed / total
  • List of completed crawl jobs with page count and timestamp
  • Ability to delete a crawl job (removes the ingested documents from the knowledge base)

Guardrails

| Guardrail          | Value                        | Rationale                                |
|--------------------|------------------------------|------------------------------------------|
| Max depth          | 5                            | Prevents exponential blowup              |
| Max pages per job  | 500                          | Cost and time cap                        |
| Domain lock        | Same domain as root URL      | Prevents crawling the entire internet    |
| Rate limiting      | 2 requests/second per domain | Polite crawling, avoid getting blocked   |
| robots.txt         | Respected                    | Standard web etiquette                   |
| URL dedup TTL      | 24 hours                     | Prevents re-crawling within a job window |
| Concurrent Lambdas | 50                           | Account-level safety                     |
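The domain-lock and robots.txt rows combine into a single pre-scrape check. A stdlib-only sketch (the `allowed` helper and user-agent string are assumptions; `robots_txt` is the already-fetched body of the site's `/robots.txt`):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(url: str, root_url: str, robots_txt: str,
            agent: str = "boisestate-ai-crawler") -> bool:
    """Return True only if url is on the root domain and
    permitted by the site's robots.txt."""
    if urlparse(url).netloc != urlparse(root_url).netloc:
        return False                    # domain lock: same host as root only
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())   # parse the pre-fetched robots body
    return rp.can_fetch(agent, url)
```

The 2 req/s rate limit is not shown here; it would be a separate per-domain throttle (e.g., a delay or token bucket) in the scrape Lambda.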

Cost Estimate

Based on the Step Functions + Lambda + S3 + DynamoDB pricing model:

| Scale                    | Est. Cost |
|--------------------------|-----------|
| 50 pages (shallow crawl) | ~$0.15    |
| 200 pages (medium site)  | ~$0.50    |
| 500 pages (large site)   | ~$1.20    |

Near-zero cost when idle — all serverless, no always-on compute.

Infrastructure (CDK)

New CDK stack or extension of the existing gateway stack:

  • Lambda function (container image in ECR) with Crawl4AI + Playwright
  • Step Functions state machine (Standard Workflow with Distributed Map)
  • DynamoDB table for URL deduplication (with TTL)
  • S3 bucket/prefix for crawl output (or reuse existing assistants bucket)
  • IAM roles for Step Functions → Lambda → S3/DynamoDB
  • API Gateway route or App API endpoint to trigger workflows
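Two of the core resources could be declared roughly as below (CDK v2, Python). Construct IDs, the image path, and the 1024 MB ephemeral-storage figure are placeholders; the state machine, bucket, and IAM wiring are omitted:

```python
from aws_cdk import Duration, Size, Stack
from aws_cdk import aws_dynamodb as dynamodb
from aws_cdk import aws_lambda as lambda_
from constructs import Construct

class CrawlStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Container-image Lambda: Crawl4AI + Playwright need the
        # 3072 MB memory and extra ephemeral storage noted above.
        scrape_fn = lambda_.DockerImageFunction(
            self, "ScrapeFn",
            code=lambda_.DockerImageCode.from_image_asset("lambda/crawler"),
            memory_size=3072,
            ephemeral_storage_size=Size.mebibytes(1024),
            timeout=Duration.minutes(5),
        )

        # URL dedup table; DynamoDB expires rows via the TTL attribute.
        dedup_table = dynamodb.Table(
            self, "CrawlDedup",
            partition_key=dynamodb.Attribute(
                name="url_hash", type=dynamodb.AttributeType.STRING),
            time_to_live_attribute="expires_at",
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
        )
```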

Migration

  • No changes to existing data or infrastructure
  • Purely additive — new Step Functions workflow, new Lambda, new DynamoDB table
  • Existing assistants are unaffected

Out of Scope (Phase 1)

  • Scheduled/recurring crawls (re-crawl on a cadence to keep content fresh)
  • Authentication-gated pages (pages behind login)
  • PDF/document download from crawled pages (only HTML → markdown)
  • Cross-domain crawling
  • Crawl budget per user/role (quota integration)
  • Sitemap.xml-based crawling (optimization for later)
