Validate llms.txt has the required H1 heading #4

Open
federicobartoli wants to merge 3 commits into addyosmani:main from federicobartoli:feat/validate-llms-txt-h1-requirement

The llms.txt spec requires a single H1 with the project or site name as the first element in the ordered structure. The checker didn't verify this, so a file without an H1 (or with multiple H1s, or with the H1 appearing after other content) was treated as well-formed.

This PR adds four findings, implemented as a small validateH1 helper so check() stays readable:

  • error when no H1 is present
  • warning when multiple H1s are present
  • warning when the H1 is not the first content in the file
  • info with the detected H1 on success

Fenced code blocks (``` or ~~~) are stripped before matching, so `# comment` lines inside bash blocks and the like aren't counted as H1s.
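
The fence stripping and H1 counting described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the names `stripFencedBlocks` and `validateH1Sketch`, the `{ level, message }` finding shape, and the message wording are all assumptions made for illustration, and the position check is left out to keep it short.

```javascript
// Hypothetical sketch; names and finding shapes are assumptions, not the PR's code.

// Drop fenced code blocks (backtick or tilde fences) so that lines such as
// '# comment' inside them are never seen by the H1 matcher.
function stripFencedBlocks(text) {
  const kept = [];
  let fence = null; // opening fence marker while inside a block, else null
  for (const line of text.split(/\r?\n/)) {
    const m = line.match(/^ {0,3}(`{3,}|~{3,})/);
    if (fence === null) {
      if (m) {
        fence = m[1]; // entering a fenced block
        continue;
      }
      kept.push(line);
    } else if (
      m &&
      m[1][0] === fence[0] &&
      m[1].length >= fence.length &&
      line.trim() === m[1]
    ) {
      fence = null; // closing fence: same character, at least as long, nothing after it
    }
  }
  return kept.join('\n');
}

// Count ATX H1 lines outside code blocks and report findings.
function validateH1Sketch(content) {
  const h1s = stripFencedBlocks(content)
    .split('\n')
    .map((line) => line.match(/^ {0,3}#[ \t]+(.+?)[ \t]*$/))
    .filter(Boolean)
    .map((m) => m[1]);

  if (h1s.length === 0) {
    return [{ level: 'error', message: 'No H1 heading found' }];
  }
  if (h1s.length > 1) {
    return [{ level: 'warning', message: `Found ${h1s.length} H1 headings, expected one` }];
  }
  return [{ level: 'info', message: `Detected H1: ${h1s[0]}` }];
}
```

Walking the file line by line instead of using one multi-line regex keeps the handling of unclosed fences explicit: everything after an unclosed opening fence is treated as code, which matches how CommonMark renders it.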

Scoring is unchanged — this PR is purely additive, so no existing assertion or user-facing score shifts. Follow-ups can validate the other spec rules (blockquote summary, "Optional" H2 section, file-list link format) the same way.

Test plan

  • npm test passes (25/25, including a new llms-no-h1 fixture that asserts the missing-H1 error finding)
  • Existing good-site fixture still scores the same (already has # ExampleDocs as its H1)

The llms.txt spec requires a single H1 with the project or site name
as the first element in the ordered structure. The checker didn't
verify this, so a file with no H1 (or multiple H1s) passed as
well-formed.

- error  when no H1 is present
- warning when multiple H1s are present
- warning when the H1 is not the first content in the file

Fenced code blocks are stripped before matching so '# comment' lines
inside bash etc. aren't counted as H1s.

Spec: https://llmstxt.org

- Check the H1 position against the original content, not the
  code-block-stripped version. Previously, a file starting with a
  fenced code block followed by an H1 was treated as "H1 first"
  because stripping hoisted the H1 to the top of the analyzed text.
- Allow up to 3 spaces of leading indentation before the '#' as
  CommonMark does.

Adds a regression fixture (code block before the H1) that now
correctly produces the "not the first content" warning.
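
A position check along these lines would avoid the hoisting problem, because it looks at the first non-blank line of the original text instead of the stripped text. This is only a sketch under assumptions: `H1_LINE` and `isH1FirstContent` are hypothetical names, and the real helper may structure the check differently.

```javascript
// Hypothetical sketch; names are illustrative, not taken from the PR.

// ATX H1: up to 3 leading spaces, one '#', then whitespace or end of line.
const H1_LINE = /^ {0,3}#(?:[ \t]+|$)/;

// True when the first non-blank line of the ORIGINAL content is an H1.
// Running this on code-block-stripped text would wrongly accept a file
// that opens with a fenced block followed by the H1.
function isH1FirstContent(original) {
  const first = original.split(/\r?\n/).find((line) => line.trim() !== '');
  return first !== undefined && H1_LINE.test(first);
}
```

Four or more leading spaces make the line an indented code block in CommonMark, which is why the pattern stops at three.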
- Strip a leading BOM before running the position check, so files
  saved by editors that add one are not incorrectly flagged as
  "H1 not first content".
- Normalize setext H1 syntax (Title\n=====) to ATX before matching,
  so a spec-compliant setext H1 is recognized.

Fixtures and tests added for both cases.
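
The two normalization steps could be sketched as below. Both function names are hypothetical, and the setext handling is deliberately simplified: it rewrites any single text line underlined with `=` and does not skip fenced code blocks or handle multi-line setext headings, which the actual change may treat differently.

```javascript
// Hypothetical sketch; function names are assumptions, not the PR's code.

// Remove a leading byte order mark (U+FEFF) so it cannot hide a leading '#'.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Rewrite a setext H1 (a text line followed by a line of '=' characters)
// as the equivalent ATX heading: "Title\n=====" becomes "# Title".
function normalizeSetextH1(text) {
  return text.replace(
    /^ {0,3}(\S[^\n]*?)[ \t]*\r?\n {0,3}=+[ \t]*$/gm,
    '# $1'
  );
}
```

Normalizing first means the later matching only has to understand one heading syntax.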