first pass on web resource #168

Open
bmdavis419 wants to merge 5 commits into main from davis/web-resource

Collaborator

bmdavis419 commented Feb 5, 2026

first pass on web resource

better web resource setup for now...

Greptile Overview

Greptile Summary

This PR adds comprehensive website resource support to btca, enabling crawling and indexing of documentation sites. The implementation includes:

Core Features

  • Website crawler with robots.txt and sitemap.xml support
  • Markdown variant detection (.md and /.md URLs)
  • TTL-based caching with stale fallback
  • HTML-to-markdown conversion using cheerio
  • Scope limiting to prevent cross-origin crawling
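
The "TTL-based caching with stale fallback" item can be illustrated with a minimal sketch. This is not the code in website.ts: the names `CacheEntry`, `CacheResult`, and `readWithStaleFallback` are hypothetical, and the loader is a synchronous stand-in for what would be an asynchronous network fetch in a real crawler.

```typescript
// Hypothetical types and names for illustration; not taken from website.ts.
interface CacheEntry {
  body: string;
  fetchedAt: number; // epoch milliseconds
}

type CacheResult = { body: string; source: "fresh" | "refetched" | "stale" };

function readWithStaleFallback(
  cache: Map<string, CacheEntry>,
  url: string,
  ttlHours: number,
  now: number,
  load: (url: string) => string // synchronous stand-in for a fetch; may throw
): CacheResult {
  const ttlMs = ttlHours * 60 * 60 * 1000;
  const entry = cache.get(url);

  // Entry is still within its TTL: serve it without calling the loader.
  if (entry !== undefined && now - entry.fetchedAt < ttlMs) {
    return { body: entry.body, source: "fresh" };
  }

  try {
    // Missing or expired: try to refresh and record the new fetch time.
    const body = load(url);
    cache.set(url, { body, fetchedAt: now });
    return { body, source: "refetched" };
  } catch (error) {
    // Refresh failed: fall back to the expired copy if one exists.
    if (entry !== undefined) {
      return { body: entry.body, source: "stale" };
    }
    throw error;
  }
}
```

Passing `now` and `load` as parameters keeps the expiry logic testable without a real clock or network.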

Integration

  • CLI wizard for adding website resources (btca add)
  • Server API endpoint validation via WebsiteResourceSchema
  • VFS integration for tool access (Glob, Grep, Read, List)

Testing

  • Comprehensive unit tests (346 lines) covering crawling, caching, robots, markdown variants
  • E2E tests validating full API integration

Issues Found

  • CLI auto-detection routes non-GitHub URLs to git wizard instead of website wizard (breaks btca add https://docs.example.com without --type website)
  • Uses node:fs instead of Bun APIs in website.ts (style preference per AGENTS.md)

The core website resource implementation is solid and well-tested. The URL detection bug in the CLI should be fixed before merge to ensure good UX.

Confidence Score: 4/5

  • This PR is safe to merge with one logical issue that needs fixing
  • The website resource implementation is well-tested and comprehensive, but the CLI auto-detection logic has a bug that breaks btca add for website URLs without --type flag
  • apps/cli/src/commands/add.ts needs the URL auto-detection logic fixed

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/server/src/resources/impls/website.ts | New comprehensive website crawler implementation with robots.txt, sitemap, markdown variant support, and caching; uses Node.js fs instead of Bun APIs |
| apps/cli/src/commands/add.ts | Added website resource wizard and CLI support; auto-detection logic incorrectly routes non-GitHub URLs to git handler |
| apps/server/src/resources/schema.ts | Added WebsiteResourceSchema with proper validation for HTTPS URLs, maxPages, maxDepth, and ttlHours |
| apps/server/src/index.ts | Added AddWebsiteResourceRequestSchema to discriminated union for POST /config/resources endpoint |
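
For readers unfamiliar with what the schema checks, here is a rough hand-rolled sketch of the validation described for WebsiteResourceSchema. It does not reproduce the PR's schema: the field names follow the overview, but `WebsiteResourceInput`, `validateWebsiteResource`, the optionality of the numeric fields, and the "positive integer" bounds are all assumptions.

```typescript
// Hypothetical shape: field names follow the overview, bounds are assumed.
interface WebsiteResourceInput {
  url: string;
  maxPages?: number;
  maxDepth?: number;
  ttlHours?: number;
}

// Returns a list of human-readable problems; an empty list means valid.
function validateWebsiteResource(input: WebsiteResourceInput): string[] {
  const errors: string[] = [];

  let parsed: URL | null = null;
  try {
    parsed = new URL(input.url);
  } catch {
    errors.push("url must be a valid absolute URL");
  }
  // Only HTTPS URLs are accepted, per the overview.
  if (parsed !== null && parsed.protocol !== "https:") {
    errors.push("url must use https");
  }

  // Assumed rule: each optional limit, when present, is a positive integer.
  const checkPositiveInt = (name: string, value: number | undefined): void => {
    if (value !== undefined && (!Number.isInteger(value) || value < 1)) {
      errors.push(`${name} must be a positive integer`);
    }
  };
  checkPositiveInt("maxPages", input.maxPages);
  checkPositiveInt("maxDepth", input.maxDepth);
  checkPositiveInt("ttlHours", input.ttlHours);

  return errors;
}
```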

Context used:

  • Context from dashboard - AGENTS.md (source)

Collaborator Author

bmdavis419 commented Feb 5, 2026

This stack of pull requests is managed by Graphite.

Contributor

greptile-apps bot left a comment

6 files reviewed, 2 comments

Comment on lines 431 to 438
  } else {
    // Auto-detect: if it looks like a URL, it's git; otherwise local
-   const isUrl =
+   const looksLikeUrl =
      urlOrPath.startsWith('http://') ||
      urlOrPath.startsWith('https://') ||
      urlOrPath.startsWith('github.com/') ||
      urlOrPath.includes('github.com/');
-   resourceType = isUrl ? 'git' : 'local';
+   resourceType = looksLikeUrl ? 'git' : 'local';
  }
Contributor

Website resources mis-detected

When --type is omitted, resourceType is inferred as 'git' for anything that "looksLikeUrl" (any http(s)://), otherwise 'local' (apps/cli/src/commands/add.ts:431-438). That means passing a website URL like https://docs.example.com will route into the git wizard and then hard-exit with "Invalid GitHub URL". Website URLs should be detected (e.g., non-GitHub URLs) so btca add https://... works without requiring --type website.
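
One possible shape for the fix, as a sketch only: the helper name `detectResourceType` and the `ResourceType` union are hypothetical and not part of add.ts. The sketch assumes that any http(s) URL whose host is not GitHub should go to the website wizard, and it keeps the existing behavior for scheme-less inputs.

```typescript
// Hypothetical helper; names are illustrative, not from apps/cli/src/commands/add.ts.
type ResourceType = "git" | "website" | "local";

function detectResourceType(urlOrPath: string): ResourceType {
  if (urlOrPath.startsWith("http://") || urlOrPath.startsWith("https://")) {
    try {
      const { hostname } = new URL(urlOrPath);
      // Compare the parsed host instead of using includes('github.com/'),
      // so a docs URL that merely mentions GitHub in its path is not misrouted.
      const isGitHub = hostname === "github.com" || hostname === "www.github.com";
      return isGitHub ? "git" : "website";
    } catch {
      // Assumption: let the website wizard report a malformed URL.
      return "website";
    }
  }

  // Scheme-less inputs keep the original heuristic, e.g. "github.com/owner/repo".
  if (urlOrPath.includes("github.com/")) return "git";

  return "local";
}
```

With this in place, `detectResourceType('https://docs.example.com')` yields `'website'`, while `'https://github.com/owner/repo'` and `'github.com/owner/repo'` still yield `'git'`.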

Contributor

greptile-apps bot commented Feb 6, 2026

Additional Comments (1)

apps/server/src/config/index.ts
Tests failing: error messages

After this commit, Config.load() errors use messages like "Failed to parse config file - invalid JSON syntax" (apps/server/src/config/index.ts:273-282), but existing tests assert older substrings like "Failed to parse config file" / "Invalid config". On main, the server tests pass, but on this SHA bun test --cwd apps/server fails multiple assertions because of these message changes. Either update the tests or restore the expected error message substrings so CI stays green.

Contributor

greptile-apps bot left a comment

4 files reviewed, 1 comment

Comment on lines +432 to 438
+   const looksLikeUrl =
      urlOrPath.startsWith('http://') ||
      urlOrPath.startsWith('https://') ||
      urlOrPath.startsWith('github.com/') ||
      urlOrPath.includes('github.com/');
-   resourceType = isUrl ? 'git' : 'local';
+   resourceType = looksLikeUrl ? 'git' : 'local';
  }
Contributor

Auto-detection incorrectly routes website URLs to git handler

The looksLikeUrl check treats any http(s):// URL as a git resource. When users run btca add https://docs.example.com without --type website, the command routes to addGitResourceWizard, which then fails with "Invalid GitHub URL" (lines 143-145).

Website URLs should be distinguished from GitHub URLs so the wizard routes correctly.
