Possible false fail/warn for sites with localized documentation #56

@ethanpalm

What happened?

The llms-txt-freshness check appears to produce false positives (incorrect fail/warn) for documentation sites that have localized versions of their content. The check compares llms.txt URLs against the sitemap to compute coverage, but if the sitemap includes multiple language variants and llms.txt only covers the primary language, the denominator is inflated and coverage appears low.

The gap appears when the default language has no locale prefix.

For example:

  • Sitemap contains: /docs/intro, /docs/de/intro, /docs/ja/intro
  • llms.txt contains: /docs/intro (no locale segment)

detectLocalePosition finds a locale position in the sitemap (for de, ja), but getDominantSegment on llms.txt returns null because no locale codes are present there. As a result the locale filter is never applied, the sitemap comparison includes all three locale variants, and coverage is computed as 33% (1 of 3 URLs) instead of 100%.
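
To make the arithmetic concrete, here is a minimal sketch of the inflated denominator. The `coverage` helper is hypothetical and only illustrates the ratio; it is not the afdocs implementation, and the hard-coded `de|ja` filter stands in for whatever locale filtering the check would apply.

```typescript
// Hypothetical helper (not afdocs code): fraction of sitemap paths listed in llms.txt.
function coverage(llmsPaths: string[], sitemapPaths: string[]): number {
  if (sitemapPaths.length === 0) return 0;
  const listed = new Set(llmsPaths);
  const covered = sitemapPaths.filter((p) => listed.has(p)).length;
  return covered / sitemapPaths.length;
}

const sitemap = ["/docs/intro", "/docs/de/intro", "/docs/ja/intro"];
const llms = ["/docs/intro"];

// No locale filter applied: the de and ja variants inflate the denominator.
const unfiltered = coverage(llms, sitemap);

// Locale variants removed first (filter hard-coded here purely for illustration).
const defaultOnly = coverage(
  llms,
  sitemap.filter((p) => !/^\/docs\/(de|ja)\//.test(p)),
);

console.log(unfiltered.toFixed(2), defaultOnly.toFixed(2));
```

With the example URLs above, the unfiltered ratio is 1/3 and the filtered ratio is 1/1.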

Additional edge cases

  1. Two-locale sites probably fail the detection threshold. detectLocalePosition requires the locale codes to cover >50% of URLs at a given path position. With exactly two locales, each covers exactly 50% so detection would fail. (I have not verified this since I don't have a site with exactly two locales to test on, but I think this would happen)

  2. Overly broad locale regex. The pattern /^[a-z]{2}(-[a-z]{2})?$/ matches common 2-letter path segments that aren't locales (/go/, /ai/, /us/, /uk/), which could trigger false locale detection.
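
Both edge cases can be checked in isolation with a small sketch. The regex is copied from the issue text; `localeShare` is a hypothetical stand-in for the share computation inside detectLocalePosition, and the two-URL site below is one possible reading of the "two locales" case (an unprefixed default language plus one prefixed locale). The real detection logic may differ.

```typescript
// Pattern quoted in the issue.
const LOCALE_RE = /^[a-z]{2}(-[a-z]{2})?$/;

// Edge case 2: ordinary two-letter path segments also match the pattern.
const nonLocaleSegments = ["go", "ai", "us", "uk"];
const falseMatches = nonLocaleSegments.filter((s) => LOCALE_RE.test(s));

// Hypothetical helper: share of paths whose segment at `position` looks like a locale.
function localeShare(paths: string[], position: number): number {
  if (paths.length === 0) return 0;
  const hits = paths.filter((p) => {
    const seg = p.split("/").filter(Boolean)[position];
    return seg !== undefined && LOCALE_RE.test(seg);
  }).length;
  return hits / paths.length;
}

// Edge case 1 (assumed reading): default language unprefixed, one prefixed locale.
const share = localeShare(["/docs/intro", "/docs/de/intro"], 1);
const detected = share > 0.5; // strict ">" rejects a share of exactly 0.5

console.log(falseMatches, share, detected);
```

Under these assumptions all four non-locale segments match, and a share of exactly 0.5 does not pass a strict `> 0.5` threshold.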

What did you expect?

When llms.txt covers the primary unprefixed language and the sitemap includes additional locale variants, the check should recognize this and filter the sitemap to only the unprefixed pages before computing coverage.

Suggested fix

After failing to find a dominant locale in llms.txt, check whether the llms.txt URLs match the sitemap URLs with the locale segment stripped. If >50% match, filter the sitemap to only non-locale-prefixed URLs before computing coverage.
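
A rough sketch of that fallback, assuming URLs have already been reduced to paths and that the locale position found in the sitemap is passed in. The names `stripLocale` and `filterSitemapForUnprefixedDefault` are hypothetical and do not exist in afdocs; the regex is the one quoted earlier in this issue.

```typescript
const LOCALE_PATTERN = /^[a-z]{2}(-[a-z]{2})?$/;

function pathSegments(path: string): string[] {
  return path.split("/").filter(Boolean);
}

// Remove the segment at `position` if it looks like a locale code.
function stripLocale(path: string, position: number): string {
  const segs = pathSegments(path);
  const seg = segs[position];
  if (seg !== undefined && LOCALE_PATTERN.test(seg)) {
    segs.splice(position, 1);
  }
  return "/" + segs.join("/");
}

// Intended to run only after no dominant locale was found in llms.txt.
function filterSitemapForUnprefixedDefault(
  sitemapPaths: string[],
  llmsPaths: string[],
  position: number,
): string[] {
  if (llmsPaths.length === 0) return sitemapPaths;
  const stripped = new Set(sitemapPaths.map((p) => stripLocale(p, position)));
  const matches = llmsPaths.filter((p) => stripped.has(p)).length;
  // Apply the filter only if more than half of llms.txt matches the stripped sitemap.
  if (matches / llmsPaths.length <= 0.5) return sitemapPaths;
  return sitemapPaths.filter((p) => {
    const seg = pathSegments(p)[position];
    return seg === undefined || !LOCALE_PATTERN.test(seg);
  });
}

const filtered = filterSitemapForUnprefixedDefault(
  ["/docs/intro", "/docs/de/intro", "/docs/ja/intro"],
  ["/docs/intro"],
  1,
);
console.log(filtered);
```

For the example site this leaves only /docs/intro in the sitemap set, so coverage would be computed against one URL rather than three. The sketch inherits the broad-regex problem: a real page such as /docs/go/... would be treated as a locale variant and dropped.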

How to reproduce

Environment

  • afdocs version: v0.3.0
  • Node.js version: v22.15.0
  • OS: macOS

Additional context

Happy to open a PR if this is valid! 😄
