## Summary
The `llms-txt-freshness` check (proposed rename to `llms-txt-coverage` in #16) currently flags any sitemap URL missing from `llms.txt` as a gap. This works well for detecting drift, but produces false positives when site owners intentionally curate their `llms.txt` to include only a subset of pages.

For example, a site might exclude directory/index pages that just link to other pages already in `llms.txt`, or API reference pages that aren't useful in markdown form. The current thresholds (95% pass, 80% warn) penalize this intentional curation.
## Inciting issues
## Proposed approach: configurable thresholds + exclusion globs
The coverage check serves different use cases that need different behavior. Rather than a separate "mode" toggle, two configuration knobs cover all cases:
- Configurable thresholds — pass/warn percentages (defaults: 95/80)
- Exclusion globs — patterns subtracted from the sitemap before calculating coverage
These compose to handle three personas:
| Use case | Exclusions | Thresholds | Effect |
| --- | --- | --- | --- |
| Full parity (site wants `llms.txt` to mirror sitemap) | none (default) | 95/80 (default) | Current behavior, no config needed |
| Curated (site intentionally includes a subset, e.g. Stripe at ~16% coverage) | none | 0/0 | Check still runs and reports coverage %, but never fails |
| Hybrid (site wants strict coverage but with known exclusions, e.g. Cloudflare) | intentional gaps | 95/80 | Exclusions shrink the denominator; remaining pages held to strict standard |
Example config:

```yaml
options:
  coveragePassThreshold: 95 # default
  coverageWarnThreshold: 80 # default
  coverageExclude:
    - "/api/reference/**"
    - "/internal/**"
```
Setting thresholds to 0 effectively makes the check informational without needing a separate mode concept. The check still reports the percentage and lists what's missing, but never warns or fails.
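A minimal sketch of how the two knobs might compose. The function name, signature, and glob semantics here are illustrative assumptions, not the tool's actual API (note that Python's `fnmatch` treats `*` as matching across `/`, so `**`-style patterns behave loosely here):

```python
from fnmatch import fnmatch

def coverage_status(sitemap_urls, llms_txt_urls,
                    exclude=(), pass_threshold=95, warn_threshold=80):
    """Hypothetical coverage check: exclusions + configurable thresholds."""
    def excluded(path):
        return any(fnmatch(path, pattern) for pattern in exclude)

    # Exclusions shrink the denominator before coverage is calculated.
    denominator = [u for u in sitemap_urls if not excluded(u)]
    if not denominator:
        return 100.0, "pass"

    covered = sum(1 for u in denominator if u in llms_txt_urls)
    pct = 100.0 * covered / len(denominator)

    # Thresholds of 0/0 make the check purely informational: any pct passes.
    if pct >= pass_threshold:
        return pct, "pass"
    if pct >= warn_threshold:
        return pct, "warn"
    return pct, "fail"
```

With `exclude=["/api/reference/**"]`, an uncovered `/api/reference/x` never counts against the site; with `pass_threshold=0`, even 16% coverage reports as a pass while the percentage is still surfaced.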
## Transitive coverage (longer-term)
Transitive coverage (following links within `llms.txt`'s linked markdown files to count reachable pages) could make the check smarter, so that most sites wouldn't need exclusions. However:
- It's unclear how many hops agents realistically follow before giving up and trying other discovery methods. The value of transitive coverage as a metric depends on empirical data about agent behavior.
- Even with 1-hop link following, heavily curated sites like Stripe only reach ~53% of their sitemap (analysis).
This could be explored as a future enhancement, ideally backed by testing around agent link-following behavior, but shouldn't block the threshold + exclusion approach.
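If transitive coverage were explored, the metric could be computed as a bounded breadth-first traversal. This is a sketch under assumptions: `links_from` is a hypothetical callback that returns the URLs linked from a page's markdown, and `hops` caps how far an agent is assumed to follow links:

```python
def transitive_coverage(sitemap_urls, llms_txt_urls, links_from, hops=1):
    """Hypothetical metric: % of sitemap reachable within `hops` link
    follows starting from the pages listed in llms.txt."""
    reachable = set(llms_txt_urls)
    frontier = set(llms_txt_urls)
    for _ in range(hops):
        next_frontier = set()
        for page in frontier:
            # links_from is an assumed helper, not part of any real API.
            for link in links_from(page):
                if link not in reachable:
                    reachable.add(link)
                    next_frontier.add(link)
        frontier = next_frontier
    if not sitemap_urls:
        return 100.0
    in_sitemap = [u for u in sitemap_urls if u in reachable]
    return 100.0 * len(in_sitemap) / len(sitemap_urls)
```

The `hops` parameter is exactly the open empirical question above: the right default depends on how many hops agents realistically follow before falling back to other discovery methods.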
## Open questions
- Should the spec prescribe these specific configuration options, or just note that implementations should account for intentional curation and leave the mechanism to implementers?
- Are there other signals (beyond exclusion lists) that could help distinguish intentional curation from drift?
- What other considerations should inform this design?