Coverage check: distinguish intentional curation from accidental gaps #17

@dacharyc

Summary

The llms-txt-freshness check (proposed rename to llms-txt-coverage in #16) currently flags any sitemap URL missing from llms.txt as a gap. This works well for detecting drift, but produces false positives when site owners intentionally curate their llms.txt to include only a subset of pages.

For example, a site might exclude directory/index pages that just link to other pages already in llms.txt, or API reference pages that aren't useful in markdown form. The current thresholds (95% pass, 80% warn) penalize this intentional curation.

Inciting issues

Proposed approach: configurable thresholds + exclusion globs

The coverage check serves different use cases that need different behavior. Rather than a separate "mode" toggle, two configuration knobs cover all cases:

  1. Configurable thresholds — pass/warn percentages (defaults: 95/80)
  2. Exclusion globs — patterns subtracted from the sitemap before calculating coverage

These compose to handle three personas:

| Use case | Exclusions | Thresholds | Effect |
| --- | --- | --- | --- |
| Full parity (site wants llms.txt to mirror the sitemap) | none (default) | 95/80 (default) | Current behavior, no config needed |
| Curated (site intentionally includes a subset, e.g. Stripe at ~16% coverage) | none | 0/0 | Check still runs and reports the coverage %, but never fails |
| Hybrid (site wants strict coverage with known exclusions, e.g. Cloudflare) | intentional gaps | 95/80 | Exclusions shrink the denominator; remaining pages are held to the strict standard |

Example config:

```yaml
options:
  coveragePassThreshold: 95    # default
  coverageWarnThreshold: 80    # default
  coverageExclude:
    - "/api/reference/**"
    - "/internal/**"
```

Setting thresholds to 0 effectively makes the check informational without needing a separate mode concept. The check still reports the percentage and lists what's missing, but never warns or fails.
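A minimal sketch of how the two knobs could compose, assuming glob-style exclusion patterns. The function and result names here are hypothetical, not part of any existing implementation:

```python
from fnmatch import fnmatch

def check_coverage(sitemap_urls, llms_txt_urls, exclude=(),
                   pass_threshold=95, warn_threshold=80):
    """Hypothetical coverage check: exclusions shrink the denominator,
    thresholds decide pass/warn/fail. Thresholds of 0 make the check
    informational (always passes, still reports coverage and gaps)."""
    # Subtract exclusion globs from the sitemap before computing coverage.
    # (fnmatch's "*" crosses "/" boundaries, so "**" here is a loose
    # approximation of real glob semantics.)
    candidates = [u for u in sitemap_urls
                  if not any(fnmatch(u, pat) for pat in exclude)]
    if not candidates:
        return {"coverage": 100.0, "status": "pass", "missing": []}
    missing = [u for u in candidates if u not in llms_txt_urls]
    coverage = 100.0 * (len(candidates) - len(missing)) / len(candidates)
    if coverage >= pass_threshold:
        status = "pass"
    elif coverage >= warn_threshold:
        status = "warn"
    else:
        status = "fail"
    return {"coverage": coverage, "status": status, "missing": missing}
```

For example, a curated site would call this with `pass_threshold=0, warn_threshold=0` and always get `"pass"`, while the `"missing"` list and percentage remain available for reporting.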

Transitive coverage (longer-term)

Transitive coverage (following links within llms.txt's linked markdown files to count reachable pages) could make the check smarter so that most sites wouldn't need exclusions. However:

  • It's unclear how many hops agents realistically follow before giving up and trying other discovery methods. The value of transitive coverage as a metric depends on empirical data about agent behavior.
  • Even with 1-hop link following, heavily curated sites like Stripe only reach ~53% of their sitemap (analysis).

This could be explored as a future enhancement, ideally backed by testing around agent link-following behavior, but shouldn't block the threshold + exclusion approach.
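For concreteness, 1-hop transitive coverage could be sketched as follows. This is an illustration only; `fetch_markdown` is an assumed callable (an HTTP fetch in a real implementation), and the link regex is a deliberately naive stand-in for proper markdown parsing:

```python
import re

# Naive matcher for markdown inline links: [text](target)
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

def one_hop_reachable(llms_txt_urls, fetch_markdown):
    """Pages counted as covered: those listed in llms.txt, plus any page
    linked directly from one of those pages' markdown bodies (1 hop)."""
    reachable = set(llms_txt_urls)
    for url in llms_txt_urls:
        body = fetch_markdown(url)
        reachable.update(MD_LINK.findall(body))
    return reachable
```

The reachable set would then replace the raw llms.txt URL set in the coverage calculation, enlarging the numerator without touching the denominator.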

Open questions

  • Should the spec prescribe these specific configuration options, or just note that implementations should account for intentional curation and leave the mechanism to implementers?
  • Are there other signals (beyond exclusion lists) that could help distinguish intentional curation from drift?
  • What other considerations should inform this design?
