Fix robots.txt parser for stacked user-agent groups (RFC 9309)#6

Open
federicobartoli wants to merge 2 commits into addyosmani:main from federicobartoli:fix/robots-txt-multi-agent-groups

Conversation

@federicobartoli

Summary

  • Parse consecutive User-agent lines as a single group per RFC 9309 (§2.1, §2.2) instead of keeping only the last agent
  • Treat any non-empty Allow rule (not just Allow: /) as an explicit allowance for scoring
  • Add regression fixture modeled on docs.nvidia.com and unit tests for parseRobotsTxt

Test plan

  • 45/45 tests pass (node --test)
  • Verified parser output on nvidia-style fixture: all 4 agents get both Allow rules
  • Verified checker scores: good-site 10/10, bad-site 2/10, nvidia-style 10/10
  • No regressions on existing fixtures

Fixes #5

parseRobotsTxt previously overwrote currentAgent on every User-agent
line, so robots.txt files that stack multiple user-agents before a
shared block of rules (e.g. docs.nvidia.com/robots.txt) attributed
those rules only to the last agent in the stack.

RFC 9309 §2.1 defines a group as "one or more user-agent lines
followed by one or more rules"; the ABNF grammar in §2.2 formalises
this with startgroupline repeating before rules.
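
Concretely, a stacked group looks like the following (the agent names and the `/docs/` path are illustrative; the `/*.llms.txt$` pattern is the one from the fixture). Under the §2.2 grammar, all four User-agent lines form one group, so both Allow rules apply to every listed agent:

```
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: PerplexityBot
Allow: /*.llms.txt$
Allow: /docs/
```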

Accumulate agents in currentAgents and reset the group only when a
User-agent line follows a rule line.
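
A minimal sketch of that grouping logic, assuming a simple agent-to-rules map as the return shape (this is not the repository's actual `parseRobotsTxt`; names and structure are illustrative):

```javascript
// Sketch: accumulate stacked User-agent lines into one group and reset
// only when a User-agent line follows a rule line (RFC 9309 §2.1, §2.2).
function parseRobotsTxt(text) {
  const groups = {};      // agent name -> array of { type, path } rules
  let currentAgents = []; // agents of the group currently being built
  let sawRule = false;    // true once the current group has a rule line

  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    if (!line) continue;
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === "user-agent") {
      // A User-agent line after a rule line starts a NEW group;
      // consecutive User-agent lines keep stacking into the same group.
      if (sawRule) {
        currentAgents = [];
        sawRule = false;
      }
      currentAgents.push(value.toLowerCase());
      groups[value.toLowerCase()] ??= [];
    } else if (field === "allow" || field === "disallow") {
      sawRule = true;
      // Attribute the rule to EVERY agent in the stack, not just the last.
      for (const agent of currentAgents) {
        groups[agent].push({ type: field, path: value });
      }
    }
  }
  return groups;
}

// Two stacked agents sharing two Allow rules, as in the nvidia-style case.
const fixture = [
  "User-agent: GPTBot",
  "User-agent: ClaudeBot",
  "Allow: /*.llms.txt$",
  "Allow: /docs/",
].join("\n");
const parsed = parseRobotsTxt(fixture);
// Both agents end up with both Allow rules, instead of only ClaudeBot.
```

The buggy behavior falls out of replacing the `currentAgents` array with a single `currentAgent` string: each User-agent line overwrites the previous one, so only the last agent in the stack receives the rules.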

The robots-txt checker only treated literal `Allow: /` as an explicit
allowance. Sites that welcome AI crawlers with more specific rules
(e.g. `Allow: /*.llms.txt$`) were still reported as having no
explicitly allowed crawlers.

Treat any non-empty Allow rule for a known AI crawler as an explicit
allowance. Combined with the preceding parser fix, this closes the
false positive reported in addyosmani#5.
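
The change can be sketched as follows; the `{ type, path }` rule shape is an assumption about the parser output, not the project's actual API:

```javascript
// Sketch of the scoring change for a known AI crawler's rule list.
function hasExplicitAllowance(rules) {
  // Before the fix: only a literal "Allow: /" counted as explicit.
  // After: any Allow with a non-empty path counts, so patterns like
  // "Allow: /*.llms.txt$" are recognized as welcoming the crawler.
  return rules.some((rule) => rule.type === "allow" && rule.path.length > 0);
}
```

With this predicate, a site that allows only `/*.llms.txt$` for GPTBot is scored as explicitly allowing that crawler rather than reported as allowing none.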

Adds a regression fixture and tests modeled on docs.nvidia.com.

Fixes addyosmani#5
@federicobartoli force-pushed the fix/robots-txt-multi-agent-groups branch from 4e090e3 to bddf6cb on April 16, 2026 at 07:27


Development

Successfully merging this pull request may close these issues.

Error: No AI crawlers are explicitly allowed in robots.txt