Investigate robots.txt Disallow filtering for sitemap-discovered URLs #38

@dacharyc

Description

Context

When afdocs discovers page URLs via sitemap (either as the primary source or as a fallback to supplement a thin llms.txt), it does not check whether those URLs are blocked by robots.txt Disallow directives. This matters for agent discoverability: if a page is disallowed in robots.txt, a crawler that honors robots.txt would never have fetched it during training-data collection, so an LLM agent has no way to discover the page from its base training data unless it is also listed in llms.txt.
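
For concreteness, here is a hypothetical illustration (domain and paths invented for this example). A robots.txt can simultaneously advertise a sitemap and block a documentation subtree that the sitemap still lists:

```
# robots.txt (hypothetical)
User-agent: *
Disallow: /docs/internal/
Sitemap: https://example.com/sitemap.xml
```

If sitemap.xml lists https://example.com/docs/internal/setup, afdocs would currently include that URL in the pool (assuming it passes the origin and path-prefix filters), even though a crawler honoring the Disallow rule would never have fetched it.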

Current behavior

discoverSitemapUrls() reads robots.txt only to find Sitemap: directives. It does not parse Disallow: rules. All sitemap URLs that pass origin and path-prefix filtering are included in the URL pool regardless of robots.txt crawl rules.

Questions to investigate

  1. How common is this? Do real documentation sites block doc pages via robots.txt Disallow while still listing them in sitemaps?
  2. Scope of parsing needed: robots.txt Disallow rules can include wildcards, user-agent matching, and other complexity. What's the minimum viable parser for this use case? (A rough sketch follows this list.)
  3. Interaction with llms.txt: If a page is in llms.txt but blocked by robots.txt, should it still be tested? (Likely yes, since the site owner explicitly offered it via llms.txt.)
  4. Behavior options: Should blocked URLs be excluded from the pool, or included but annotated in the report?
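
On question 2, a minimal sketch of what such a parser might look like, in TypeScript, with deliberate simplifications: only the `User-agent: *` group is consulted, `Allow:` rules and longest-match precedence are ignored, and patterns support only the common `*` wildcard and trailing `$` end anchor. None of these names exist in afdocs; this is an illustration, not a proposed implementation.

```typescript
// Illustrative sketch only; not afdocs code. Simplifications:
// - only the `User-agent: *` group is consulted
// - `Allow:` rules and longest-match precedence are ignored
// - patterns support `*` wildcards and a trailing `$` end anchor

function parseDisallowRules(robotsTxt: string): string[] {
  const rules: string[] = [];
  let inStarGroup = false;
  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "user-agent") {
      inStarGroup = value === "*";
    } else if (field === "disallow" && inStarGroup && value !== "") {
      rules.push(value); // an empty Disallow value means "allow everything"
    }
  }
  return rules;
}

function isDisallowed(pathname: string, rules: string[]): boolean {
  return rules.some((rule) => {
    const anchored = rule.endsWith("$");
    const body = (anchored ? rule.slice(0, -1) : rule)
      // escape regex metacharacters except `*`, then map `*` to `.*`
      .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
      .replace(/\*/g, ".*");
    return new RegExp(`^${body}${anchored ? "$" : ""}`).test(pathname);
  });
}

// Usage (hypothetical values):
// const rules = parseDisallowRules(robotsTxtBody);
// const pool = sitemapUrls.filter((u) => !isDisallowed(new URL(u).pathname, rules));
```

Even this much covers the common cases; whether the remaining complexity (Allow precedence, per-agent groups) is worth handling is part of what this issue should determine.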
