Feature: Web Crawling Engine for Assistant Knowledge Bases #115

@DerrickF

Description

Summary

Add a serverless web crawling engine that lets assistant creators point at a URL, set a crawl depth, and automatically ingest the site's content as markdown into the assistant's knowledge base. Built on AWS Step Functions + Lambda with Crawl4AI for extraction.

Motivation

Many assistants need to be grounded in web content — institutional websites, documentation sites, department pages, etc. Currently, users must manually download pages, convert to a supported format, and upload them one by one. A crawl-and-ingest pipeline removes that friction entirely and keeps knowledge bases up to date.

The target use case for boisestate.ai is crawling institutional sites (department pages, policy docs, course catalogs) so assistants can answer questions grounded in official web content.

Architecture Overview

Serverless Markdown Extraction Engine

┌──────────────────────┐
│  Assistant Edit UI   │
│ "Add Website Source" │
└──────────┬───────────┘
           │ POST /assistants/{id}/crawl
           v
┌──────────────────────┐
│  API Gateway         │
│  (trigger endpoint)  │
└──────────┬───────────┘
           │ StartExecution
           v
┌──────────────────────────────────────────┐
│  AWS Step Functions (Standard Workflow)  │
│                                          │
│  ┌─────────────┐    ┌────────────────┐   │
│  │ Scrape Page │───▶│ Store Markdown │   │
│  │ (Lambda)    │    │ (S3)           │   │
│  └──────┬──────┘    └────────────────┘   │
│         │ discovered links               │
│         v                                │
│  ┌─────────────────┐                     │
│  │ Distributed Map │ (fan-out per link)  │
│  │ (next depth)    │                     │
│  └─────────────────┘                     │
└──────────────────────────────────────────┘
           │ on complete
           v
┌──────────────────────┐
│  Ingest to KB        │
│  (existing RAG       │
│   ingestion pipeline)│
└──────────────────────┘

Components

1. Orchestration: AWS Step Functions

  • Standard Workflow as the parent to handle overall job state and depth tracking
  • Distributed Map state to fan out — process hundreds of pages concurrently without hitting the 15-minute Lambda timeout
  • Each iteration passes current_depth and a list of discovered_links
  • Built-in error handling and retry logic per page
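As a sketch, the parent state machine could be expressed in Amazon States Language roughly as follows. This is illustrative only: the state names, the `${ScrapeLambdaArn}` placeholder, and the exact input paths are assumptions, not a final definition.

```json
{
  "StartAt": "ScrapePage",
  "States": {
    "ScrapePage": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "${ScrapeLambdaArn}",
        "Payload": { "url.$": "$.url", "depth.$": "$.current_depth" }
      },
      "Retry": [ { "ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2 } ],
      "Next": "FanOutLinks"
    },
    "FanOutLinks": {
      "Type": "Map",
      "ItemsPath": "$.discovered_links",
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "STANDARD" },
        "StartAt": "ScrapeChild",
        "States": {
          "ScrapeChild": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "End": true
          }
        }
      },
      "MaxConcurrency": 50,
      "End": true
    }
  }
}
```

`MaxConcurrency: 50` matches the concurrent-Lambda guardrail below; per-page retries live on the Task state so one failed page doesn't fail the whole job.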

2. Compute: AWS Lambda (Container Image)

  • Runtime: Python 3.x packaged as a Docker image (hosted in ECR)
  • Engine: Crawl4AI + Playwright for JavaScript-rendered pages
  • Configuration:
    • Memory: 3072 MB (Chromium overhead)
    • Ephemeral storage: 512 MB+ for browser binaries
    • Timeout: 5 minutes per page
  • Output: Clean markdown + list of discovered internal links
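Crawl4AI hands back the rendered page plus raw hrefs; before returning, the worker would normalize those into absolute, same-domain URLs. A minimal stdlib-only sketch of that normalization step (the `internal_links` name is hypothetical):

```python
from urllib.parse import urljoin, urlparse, urldefrag

def internal_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve raw hrefs against the page URL and keep only
    same-domain HTTP(S) links, fragment-stripped and deduplicated."""
    root = urlparse(page_url).netloc
    seen: set[str] = set()
    out: list[str] = []
    for href in hrefs:
        absolute, _frag = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == root:
            if absolute not in seen:
                seen.add(absolute)
                out.append(absolute)
    return out
```

Stripping fragments and mail/tel schemes here keeps the dedup table from filling with near-duplicate URLs.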

3. Storage & Data Flow

  • Results: Cleaned markdown streamed to S3 with key structure: /{job-id}/{domain}/{page-title}.md
  • Deduplication: DynamoDB table storing URL hashes with 24-hour TTL to prevent infinite loops and re-crawling within a job
  • Final destination: Markdown files are fed into the existing RAG ingestion pipeline (Docling processor → S3 Vector Buckets)
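A sketch of the key and dedup-item shapes described above (helper names are hypothetical; the real dedup write would be a boto3 conditional `PutItem` with `attribute_not_exists(url_hash)` so only the first writer per URL proceeds):

```python
import hashlib
import re
import time

def s3_key(job_id: str, domain: str, page_title: str) -> str:
    """Build the crawl-output key {job-id}/{domain}/{page-title}.md,
    slugifying the title so it is a safe S3 object key."""
    slug = re.sub(r"[^a-z0-9]+", "-", page_title.lower()).strip("-") or "untitled"
    return f"{job_id}/{domain}/{slug}.md"

def dedup_item(url: str, ttl_hours: int = 24) -> dict:
    """DynamoDB dedup item: URL hash as the partition key plus a TTL
    attribute so entries expire after the 24-hour crawl window."""
    return {
        "url_hash": hashlib.sha256(url.encode()).hexdigest(),
        "expires_at": int(time.time()) + ttl_hours * 3600,
    }
```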

Process Flow

  1. Trigger: User enters a URL and max depth on the assistant edit page → POST /assistants/{id}/crawl starts the Step Function with root_url, max_depth, and assistant_id
  2. Scrape: Worker Lambda boots Crawl4AI, extracts content, converts to markdown
  3. Store: Lambda saves the .md file to S3
  4. Extract: Lambda returns a list of internal links found on the page
  5. Deduplicate: Step Functions filters out already-crawled URLs via DynamoDB
  6. Recurse: New Map iterations triggered for the next depth level (up to max_depth)
  7. Ingest: On workflow completion, trigger the existing document ingestion pipeline for all crawled markdown files
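Steps 2–6 amount to a depth-limited breadth-first crawl with deduplication. A local sketch of that control flow, with the scrape step injected as a callable (in production each `fetch` is a Lambda invocation and `seen` is the DynamoDB table):

```python
from collections.abc import Callable

def crawl(root_url: str, max_depth: int, max_pages: int,
          fetch: Callable[[str], tuple[str, list[str]]]) -> dict[str, str]:
    """Depth-limited BFS with dedup. fetch(url) returns
    (markdown, discovered_links); returns {url: markdown}."""
    seen = {root_url}
    frontier = [root_url]
    pages: dict[str, str] = {}
    for depth in range(max_depth + 1):
        next_frontier: list[str] = []
        for url in frontier:
            if len(pages) >= max_pages:       # max-pages guardrail
                return pages
            markdown, links = fetch(url)
            pages[url] = markdown
            for link in links:
                if link not in seen:          # dedup: skip crawled URLs
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier              # fan out to the next depth
    return pages
```

Each pass of the outer loop corresponds to one Distributed Map fan-out at the next depth level.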

API

Start Crawl

POST /assistants/{assistant_id}/crawl
Body: {
  "url": "https://www.boisestate.edu/registrar",
  "maxDepth": 3,
  "maxPages": 100
}
Response: {
  "crawlJobId": "job-abc123",
  "status": "RUNNING",
  "startedAt": "2026-04-01T20:00:00Z"
}

Get Crawl Status

GET /assistants/{assistant_id}/crawl/{crawl_job_id}
Response: {
  "crawlJobId": "job-abc123",
  "status": "RUNNING" | "COMPLETED" | "FAILED",
  "pagesProcessed": 47,
  "pagesTotal": 100,
  "startedAt": "...",
  "completedAt": "..."
}

List Crawl Jobs

GET /assistants/{assistant_id}/crawls
Response: { "crawls": [...] }
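The trigger endpoint would validate the request body before starting an execution. A sketch of that validation, assuming the defaults from the frontend section (depth 1–5, default 100 pages) and the guardrail caps below; the function name and error messages are illustrative:

```python
from urllib.parse import urlparse

MAX_DEPTH, MAX_PAGES = 5, 500   # hard caps from the guardrails table

def validate_crawl_request(body: dict) -> dict:
    """Normalize the POST /assistants/{id}/crawl body into the
    Step Function input, applying defaults and guardrail caps."""
    url = body.get("url", "")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid url: {url!r}")
    depth = int(body.get("maxDepth", 3))
    if not 1 <= depth <= MAX_DEPTH:
        raise ValueError(f"maxDepth must be 1-{MAX_DEPTH}")
    pages = min(int(body.get("maxPages", 100)), MAX_PAGES)
    return {"root_url": url, "max_depth": depth, "max_pages": pages}
```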

Frontend Changes

Assistant Edit Page

  • New section: "Website Sources"
  • Input field for URL + depth selector (1–5) + max pages cap (default 100)
  • "Start Crawl" button
  • Progress indicator showing pages processed / total
  • List of completed crawl jobs with page count and timestamp
  • Ability to delete a crawl job (removes the ingested documents from the knowledge base)

Guardrails

| Guardrail          | Value                        | Rationale                                |
|--------------------|------------------------------|------------------------------------------|
| Max depth          | 5                            | Prevents exponential blowup              |
| Max pages per job  | 500                          | Cost and time cap                        |
| Domain lock        | Same domain as root URL      | Prevents crawling the entire internet    |
| Rate limiting      | 2 requests/second per domain | Polite crawling, avoid getting blocked   |
| robots.txt         | Respected                    | Standard web etiquette                   |
| URL dedup TTL      | 24 hours                     | Prevents re-crawling within a job window |
| Concurrent Lambdas | 50                           | Account-level safety                     |
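The domain-lock and robots.txt rows combine into a single pre-scrape check. A stdlib-only sketch (the `allowed` helper and user-agent string are assumptions; `robots_txt` is the already-fetched body of the site's `/robots.txt`):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(url: str, root_url: str, robots_txt: str,
            agent: str = "boisestate-ai-crawler") -> bool:
    """Return True only if url is on the root domain and
    permitted by the site's robots.txt."""
    if urlparse(url).netloc != urlparse(root_url).netloc:
        return False                    # domain lock: same host as root only
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())   # parse the pre-fetched robots body
    return rp.can_fetch(agent, url)
```

The 2 req/s rate limit is not shown here; it would be a separate per-domain throttle (e.g., a delay or token bucket) in the scrape Lambda.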

Cost Estimate

Based on the Step Functions + Lambda + S3 + DynamoDB pricing model:

| Scale                    | Est. Cost |
|--------------------------|-----------|
| 50 pages (shallow crawl) | ~$0.15    |
| 200 pages (medium site)  | ~$0.50    |
| 500 pages (large site)   | ~$1.20    |

Near-zero cost when idle — all serverless, no always-on compute.

Infrastructure (CDK)

New CDK stack or extension of the existing gateway stack:

  • Lambda function (container image in ECR) with Crawl4AI + Playwright
  • Step Functions state machine (Standard Workflow with Distributed Map)
  • DynamoDB table for URL deduplication (with TTL)
  • S3 bucket/prefix for crawl output (or reuse existing assistants bucket)
  • IAM roles for Step Functions → Lambda → S3/DynamoDB
  • API Gateway route or App API endpoint to trigger workflows
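Two of the core resources could be declared roughly as below (CDK v2, Python). Construct IDs, the image path, and the 1024 MB ephemeral-storage figure are placeholders; the state machine, bucket, and IAM wiring are omitted:

```python
from aws_cdk import Duration, Size, Stack
from aws_cdk import aws_dynamodb as dynamodb
from aws_cdk import aws_lambda as lambda_
from constructs import Construct

class CrawlStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Container-image Lambda: Crawl4AI + Playwright need the
        # 3072 MB memory and extra ephemeral storage noted above.
        scrape_fn = lambda_.DockerImageFunction(
            self, "ScrapeFn",
            code=lambda_.DockerImageCode.from_image_asset("lambda/crawler"),
            memory_size=3072,
            ephemeral_storage_size=Size.mebibytes(1024),
            timeout=Duration.minutes(5),
        )

        # URL dedup table; DynamoDB expires rows via the TTL attribute.
        dedup_table = dynamodb.Table(
            self, "CrawlDedup",
            partition_key=dynamodb.Attribute(
                name="url_hash", type=dynamodb.AttributeType.STRING),
            time_to_live_attribute="expires_at",
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
        )
```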

Migration

  • No changes to existing data or infrastructure
  • Purely additive — new Step Functions workflow, new Lambda, new DynamoDB table
  • Existing assistants are unaffected

Out of Scope (Phase 1)

  • Scheduled/recurring crawls (re-crawl on a cadence to keep content fresh)
  • Authentication-gated pages (pages behind login)
  • PDF/document download from crawled pages (only HTML → markdown)
  • Cross-domain crawling
  • Crawl budget per user/role (quota integration)
  • Sitemap.xml-based crawling (optimization for later)
