Context
When afdocs discovers page URLs via sitemap (either as the primary source or as a fallback to supplement a thin llms.txt), it does not check whether those URLs are blocked by robots.txt Disallow directives. This matters for agent discoverability: if a page is disallowed in robots.txt, an LLM that respected robots.txt during training would not have crawled it, so the agent has no way to discover the page from its base training data unless it's also listed in llms.txt.
Current behavior
discoverSitemapUrls() reads robots.txt only to find Sitemap: directives. It does not parse Disallow: rules. All sitemap URLs that pass origin and path-prefix filtering are included in the URL pool regardless of robots.txt crawl rules.
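The current behavior can be sketched roughly as follows. This is an illustrative reconstruction, not the actual afdocs code: it extracts only the Sitemap: directives from a robots.txt body and ignores Disallow: lines entirely, which is the gap described above.

```typescript
// Illustrative sketch (not the real implementation): pull Sitemap: URLs
// out of a robots.txt body while ignoring all Disallow: rules.
function extractSitemapUrls(robotsTxt: string): string[] {
  return robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line)) // only Sitemap: directives
    .map((line) => line.slice("sitemap:".length).trim())
    .filter((url) => url.length > 0);
}
```

Note that a `Disallow: /docs/internal/` line in the same file would pass through this unnoticed, so every sitemap URL it yields still enters the pool.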
Questions to investigate
- How common is this? Do real documentation sites block doc pages via robots.txt Disallow while still listing them in sitemaps?
- Scope of parsing needed: robots.txt Disallow rules can include wildcards, user-agent matching, and other complexity. What's the minimum viable parser for this use case?
- Interaction with llms.txt: If a page is in llms.txt but blocked by robots.txt, should it still be tested? (Likely yes, since the site owner explicitly offered it via llms.txt.)
- Behavior options: Should blocked URLs be excluded from the pool, or included but annotated in the report?
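On the "minimum viable parser" question, a plausible floor is: honor only the `User-agent: *` group, and support the two common wildcard forms (`*` matches any character sequence, `$` anchors the end of the path). The sketch below assumes exactly that scope; it is not a full RFC 9309 implementation (no per-agent group matching, no Allow: precedence, no longest-match tie-breaking), and the function names are hypothetical.

```typescript
// Minimal sketch: collect Disallow: patterns from the "User-agent: *"
// group only. Assumes the simplified scope described above.
function parseDisallowPatterns(robotsTxt: string): string[] {
  const patterns: string[] = [];
  let inStarGroup = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const match = line.match(/^([a-z-]+)\s*:\s*(.*)$/i);
    if (!match) continue;
    const [, field, value] = match;
    if (field.toLowerCase() === "user-agent") {
      inStarGroup = value.trim() === "*";
    } else if (inStarGroup && field.toLowerCase() === "disallow" && value) {
      patterns.push(value.trim()); // empty Disallow: means "allow all", skip
    }
  }
  return patterns;
}

// Check a URL path against the collected patterns. "*" becomes ".*" and a
// trailing "$" becomes an end-of-string anchor; everything else is literal.
function isDisallowed(path: string, patterns: string[]): boolean {
  return patterns.some((pattern) => {
    const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    const regex = escaped.replace(/\*/g, ".*").replace(/\\\$$/, "$");
    return new RegExp("^" + regex).test(path);
  });
}
```

Even this minimal form would be enough to either drop blocked URLs from the pool or flag them in the report, whichever behavior option is chosen.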
Related