Investigate robots.txt Disallow filtering for sitemap-discovered URLs #38

@dacharyc

Description

Context

When afdocs discovers page URLs via sitemap (either as the primary source or as a fallback to supplement a thin llms.txt), it does not check whether those URLs are blocked by robots.txt Disallow directives. This matters for agent discoverability: if a page is disallowed in robots.txt, a crawler that honors robots.txt would never have fetched it during training-data collection, so an LLM agent has no way to discover the page from its base training data unless it is also listed in llms.txt.
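
For concreteness, here is a hypothetical illustration (domain and paths invented for this example). A robots.txt can simultaneously advertise a sitemap and block a documentation subtree that the sitemap still lists:

```
# robots.txt (hypothetical)
User-agent: *
Disallow: /docs/internal/
Sitemap: https://example.com/sitemap.xml
```

If sitemap.xml lists https://example.com/docs/internal/setup, afdocs would currently include that URL in the pool (assuming it passes the origin and path-prefix filters), even though a crawler honoring the Disallow rule would never have fetched it.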

Current behavior

discoverSitemapUrls() reads robots.txt only to find Sitemap: directives. It does not parse Disallow: rules. All sitemap URLs that pass origin and path-prefix filtering are included in the URL pool regardless of robots.txt crawl rules.

Questions to investigate

  1. How common is this? Do real documentation sites block doc pages via robots.txt Disallow while still listing them in sitemaps?
  2. Scope of parsing needed: robots.txt Disallow rules can include wildcards, user-agent matching, and other complexity. What's the minimum viable parser for this use case? (A rough sketch follows this list.)
  3. Interaction with llms.txt: If a page is in llms.txt but blocked by robots.txt, should it still be tested? (Likely yes, since the site owner explicitly offered it via llms.txt.)
  4. Behavior options: Should blocked URLs be excluded from the pool, or included but annotated in the report?
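
On question 2, a minimal sketch of what such a parser might look like, in TypeScript, with deliberate simplifications: only the `User-agent: *` group is consulted, `Allow:` rules and longest-match precedence are ignored, and patterns support only the common `*` wildcard and trailing `$` end anchor. None of these names exist in afdocs; this is an illustration, not a proposed implementation.

```typescript
// Illustrative sketch only; not afdocs code. Simplifications:
// - only the `User-agent: *` group is consulted
// - `Allow:` rules and longest-match precedence are ignored
// - patterns support `*` wildcards and a trailing `$` end anchor

function parseDisallowRules(robotsTxt: string): string[] {
  const rules: string[] = [];
  let inStarGroup = false;
  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "user-agent") {
      inStarGroup = value === "*";
    } else if (field === "disallow" && inStarGroup && value !== "") {
      rules.push(value); // an empty Disallow value means "allow everything"
    }
  }
  return rules;
}

function isDisallowed(pathname: string, rules: string[]): boolean {
  return rules.some((rule) => {
    const anchored = rule.endsWith("$");
    const body = (anchored ? rule.slice(0, -1) : rule)
      // escape regex metacharacters except `*`, then map `*` to `.*`
      .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
      .replace(/\*/g, ".*");
    return new RegExp(`^${body}${anchored ? "$" : ""}`).test(pathname);
  });
}

// Usage (hypothetical values):
// const rules = parseDisallowRules(robotsTxtBody);
// const pool = sitemapUrls.filter((u) => !isDisallowed(new URL(u).pathname, rules));
```

Even this much covers the common cases; whether the remaining complexity (Allow precedence, per-agent groups) is worth handling is part of what this issue should determine.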
