What happened?
The llms-txt-freshness check appears to produce false positives (incorrect fail/warn) for documentation sites that have localized versions of their content. The check compares llms.txt URLs against the sitemap to compute coverage, but if the sitemap includes multiple language variants and llms.txt only covers the primary language, the denominator is inflated and coverage appears low.
It looks like the gap is when the default language has no locale prefix.
For example:
- Sitemap contains: /docs/intro, /docs/de/intro, /docs/ja/intro
- llms.txt contains: /docs/intro (no locale segment)
detectLocalePosition finds a locale position in the sitemap (for de and ja), but getDominantSegment on llms.txt returns null because no locale codes are present there. As a result, the locale filter is never applied, the comparison includes all three sitemap URLs, and coverage is computed as 33% (1 of 3).
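To make the arithmetic concrete, here is a minimal sketch of the coverage computation for the example above. The `coverage` helper, the hard-coded locale set, and the segment index are assumptions for illustration only, not afdocs internals.

```typescript
// Hypothetical helper (not from afdocs): share of sitemap paths listed in llms.txt.
function coverage(llmsPaths: string[], sitemapPaths: string[]): number {
  const listed = new Set(llmsPaths);
  const covered = sitemapPaths.filter((p) => listed.has(p)).length;
  return sitemapPaths.length === 0 ? 1 : covered / sitemapPaths.length;
}

const sitemap = ["/docs/intro", "/docs/de/intro", "/docs/ja/intro"];
const llms = ["/docs/intro"];

// No locale filter applied: the two translated pages inflate the denominator.
const unfiltered = coverage(llms, sitemap); // 1 of 3

// Assumption: locales sit at segment index 1 (right after "docs") and are "de" and "ja".
const locales = new Set(["de", "ja"]);
const primaryOnly = sitemap.filter(
  (p) => !locales.has(p.split("/").filter(Boolean)[1] ?? ""),
);
const filtered = coverage(llms, primaryOnly); // 1 of 1

console.log(Math.round(unfiltered * 100), Math.round(filtered * 100)); // prints "33 100"
```

With the locale variants filtered out, the same llms.txt covers the whole primary-language sitemap.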
Additional edge cases
- Two-locale sites probably fail the detection threshold. detectLocalePosition requires the locale codes to cover >50% of URLs at a given path position. With exactly two locales, each covers exactly 50% so detection would fail. (I have not verified this since I don't have a site with exactly two locales to test on, but I think this would happen)
- Overly broad locale regex. The pattern /^[a-z]{2}(-[a-z]{2})?$/ matches common 2-letter path segments that aren't locales (/go/, /ai/, /us/, /uk/), which could trigger false locale detection.
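Both edge cases can be illustrated in isolation. The regex below is the pattern quoted above. The `passesThreshold` function is a hypothetical stand-in that assumes a strict ">50%" comparison, which has not been checked against the actual detectLocalePosition code.

```typescript
// Pattern quoted in this report.
const LOCALE_RE = /^[a-z]{2}(-[a-z]{2})?$/;

// Path segments that are not meant as locales but still match the pattern.
const nonLocales = ["go", "ai", "us", "uk"];
const falseMatches = nonLocales.filter((s) => LOCALE_RE.test(s));
console.log(falseMatches.length); // prints 4: every one of them matches

// Longer segments are correctly rejected.
console.log(LOCALE_RE.test("intro")); // prints false

// Assumption: detection uses a strict "> 50%" comparison (unverified).
function passesThreshold(matching: number, total: number): boolean {
  return matching / total > 0.5;
}

// A share of exactly half does not pass a strict comparison.
const halfShare = passesThreshold(2, 4);
const majorityShare = passesThreshold(3, 4);
console.log(halfShare, majorityShare); // prints "false true"
```

If the threshold really is strict, switching to ">=" or validating segments against a list of known locale codes would be two possible directions, though either could have side effects worth checking.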
What did you expect?
When llms.txt covers the primary unprefixed language and the sitemap includes additional locale variants, the check should recognize this and filter the sitemap to only the unprefixed pages before computing coverage.
Suggested fix
After failing to find a dominant locale in llms.txt, check whether the llms.txt URLs match the sitemap URLs with the locale segment stripped. If >50% match, filter the sitemap to only non-locale-prefixed URLs before computing coverage.
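A rough sketch of what this fallback could look like follows. Everything here is hypothetical: the function name `filterToUnprefixed`, its parameters, and the idea that the sitemap detection step hands over both the locale position and the set of detected locale codes are assumptions, and the ">50%" share is interpreted as the share of llms.txt URLs that match.

```typescript
// Split a path into non-empty segments, e.g. "/docs/de/intro" -> ["docs", "de", "intro"].
function segments(path: string): string[] {
  return path.split("/").filter(Boolean);
}

// Hypothetical fallback, meant to run only after no dominant locale was found in llms.txt.
// localePos and locales are assumed to come from the sitemap detection step.
function filterToUnprefixed(
  llmsPaths: string[],
  sitemapPaths: string[],
  localePos: number,
  locales: ReadonlySet<string>,
): string[] {
  const isPrefixed = (p: string): boolean =>
    locales.has(segments(p)[localePos] ?? "");

  // Sitemap paths with the locale segment removed; unprefixed paths stay unchanged.
  const stripped = new Set(
    sitemapPaths.map((p) => {
      const segs = segments(p);
      if (locales.has(segs[localePos] ?? "")) segs.splice(localePos, 1);
      return "/" + segs.join("/");
    }),
  );

  // Share of llms.txt paths that match a stripped sitemap path.
  const matches = llmsPaths.filter((p) =>
    stripped.has("/" + segments(p).join("/")),
  ).length;
  const share = llmsPaths.length === 0 ? 0 : matches / llmsPaths.length;

  // Apply the filter only when more than half of llms.txt matches.
  return share > 0.5 ? sitemapPaths.filter((p) => !isPrefixed(p)) : sitemapPaths;
}

const sitemapPaths = ["/docs/intro", "/docs/de/intro", "/docs/ja/intro"];
const detected = new Set(["de", "ja"]);

const kept = filterToUnprefixed(["/docs/intro"], sitemapPaths, 1, detected);
console.log(kept); // prints [ '/docs/intro' ]

// When llms.txt does not match the stripped sitemap, nothing is filtered.
const untouched = filterToUnprefixed(["/other/page"], sitemapPaths, 1, detected);
console.log(untouched.length); // prints 3
```

Passing the detected locale codes in explicitly, rather than re-testing segments against the regex, avoids filtering out real pages such as /docs/go in this step.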
How to reproduce
Environment
- afdocs version: v0.3.0
- Node.js version: v22.15.0
- OS: macOS
Additional context
Happy to open a PR if this is valid! 😄