Coverage check: distinguish intentional curation from accidental gaps #17

@dacharyc

Summary

The llms-txt-freshness check (proposed rename to llms-txt-coverage in #16) currently flags any sitemap URL missing from llms.txt as a gap. This works well for detecting drift, but produces false positives when site owners intentionally curate their llms.txt to include only a subset of pages.

For example, a site might exclude directory/index pages that just link to other pages already in llms.txt, or API reference pages that aren't useful in markdown form. The current thresholds (95% pass, 80% warn) penalize this intentional curation.

Inciting issues

Proposed approach: configurable thresholds + exclusion globs

The coverage check serves different use cases that need different behavior. Rather than a separate "mode" toggle, two configuration knobs cover all cases:

  1. Configurable thresholds — pass/warn percentages (defaults: 95/80)
  2. Exclusion globs — patterns subtracted from the sitemap before calculating coverage

These compose to handle three personas:

| Use case | Exclusions | Thresholds | Effect |
| --- | --- | --- | --- |
| Full parity (site wants llms.txt to mirror the sitemap) | none (default) | 95/80 (default) | Current behavior, no config needed |
| Curated (site intentionally includes a subset, e.g. Stripe at ~16% coverage) | none | 0/0 | Check still runs and reports the coverage %, but never fails |
| Hybrid (site wants strict coverage with known exclusions, e.g. Cloudflare) | intentional gaps | 95/80 | Exclusions shrink the denominator; remaining pages are held to the strict standard |

Example config:

```yaml
options:
  coveragePassThreshold: 95    # default
  coverageWarnThreshold: 80    # default
  coverageExclude:
    - "/api/reference/**"
    - "/internal/**"
```

Setting thresholds to 0 effectively makes the check informational without needing a separate mode concept. The check still reports the percentage and lists what's missing, but never warns or fails.
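A minimal sketch of how the two knobs could compose, assuming glob-style exclusion patterns. The function and result names here are hypothetical, not part of any existing implementation:

```python
from fnmatch import fnmatch

def check_coverage(sitemap_urls, llms_txt_urls, exclude=(),
                   pass_threshold=95, warn_threshold=80):
    """Hypothetical coverage check: exclusions shrink the denominator,
    thresholds decide pass/warn/fail. Thresholds of 0 make the check
    informational (always passes, still reports coverage and gaps)."""
    # Subtract exclusion globs from the sitemap before computing coverage.
    # (fnmatch's "*" crosses "/" boundaries, so "**" here is a loose
    # approximation of real glob semantics.)
    candidates = [u for u in sitemap_urls
                  if not any(fnmatch(u, pat) for pat in exclude)]
    if not candidates:
        return {"coverage": 100.0, "status": "pass", "missing": []}
    missing = [u for u in candidates if u not in llms_txt_urls]
    coverage = 100.0 * (len(candidates) - len(missing)) / len(candidates)
    if coverage >= pass_threshold:
        status = "pass"
    elif coverage >= warn_threshold:
        status = "warn"
    else:
        status = "fail"
    return {"coverage": coverage, "status": status, "missing": missing}
```

For example, a curated site would call this with `pass_threshold=0, warn_threshold=0` and always get `"pass"`, while the `"missing"` list and percentage remain available for reporting.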

Transitive coverage (longer-term)

Transitive coverage (following links within llms.txt's linked markdown files to count reachable pages) could make the check smarter so that most sites wouldn't need exclusions. However:

  • It's unclear how many hops agents realistically follow before giving up and trying other discovery methods. The value of transitive coverage as a metric depends on empirical data about agent behavior.
  • Even with 1-hop link following, heavily curated sites like Stripe only reach ~53% of their sitemap (analysis).

This could be explored as a future enhancement, ideally backed by testing around agent link-following behavior, but shouldn't block the threshold + exclusion approach.
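For concreteness, 1-hop transitive coverage could be sketched as follows. This is an illustration only; `fetch_markdown` is an assumed callable (an HTTP fetch in a real implementation), and the link regex is a deliberately naive stand-in for proper markdown parsing:

```python
import re

# Naive matcher for markdown inline links: [text](target)
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

def one_hop_reachable(llms_txt_urls, fetch_markdown):
    """Pages counted as covered: those listed in llms.txt, plus any page
    linked directly from one of those pages' markdown bodies (1 hop)."""
    reachable = set(llms_txt_urls)
    for url in llms_txt_urls:
        body = fetch_markdown(url)
        reachable.update(MD_LINK.findall(body))
    return reachable
```

The reachable set would then replace the raw llms.txt URL set in the coverage calculation, enlarging the numerator without touching the denominator.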

Open questions

  • Should the spec prescribe these specific configuration options, or just note that implementations should account for intentional curation and leave the mechanism to implementers?
  • Are there other signals (beyond exclusion lists) that could help distinguish intentional curation from drift?
  • What other considerations should inform this design?
