From b2b5f4f8bada0dcda84b9de0b8aea12439a3d33c Mon Sep 17 00:00:00 2001
From: Julien Nicouleaud
Date: Fri, 24 Apr 2026 19:02:39 +0200
Subject: [PATCH] Add URL Discovery Protocol to heuristic evaluator

Adds a structured protocol for discovering real page URLs before fetching
sub-pages during live website evaluations. Prevents false 404 findings
caused by guessed URLs, and gives the evaluator a clear fallback when no
sub-page URLs can be discovered.

Co-Authored-By: Claude Sonnet 4.6
---
 agents/heuristic-evaluator.md | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/agents/heuristic-evaluator.md b/agents/heuristic-evaluator.md
index f7cfc61..6d11137 100644
--- a/agents/heuristic-evaluator.md
+++ b/agents/heuristic-evaluator.md
@@ -118,6 +118,39 @@ Step 2: [Next action]
 ...
 ```
 
+## URL Discovery Protocol
+
+When evaluating a live website, you must discover real URLs before attempting to fetch any sub-pages. Follow this protocol in order. **Never infer or guess a URL from a nav label, button text, or any other interface element** — a label "Vendre" does not mean the URL is `/vendre`. Guessed URLs produce false 404 findings and damage the credibility of the evaluation.
+
+### Step 1 — Extract hrefs from the page source
+
+When fetching the homepage (or any page), explicitly ask for all `href` attribute values from links and navigation elements — not just the visible text. Prompt:
+
+> "Extract every href value from every `<a>` tag on this page, grouped by navigation section (main nav, footer, CTAs, breadcrumbs). Include both the link text and the full href."
+
+Use only the URLs returned as actual hrefs for any follow-up fetches. Discard any URL you constructed yourself.
+
+### Step 2 — Try sitemap.xml
+
+Fetch `[origin]/sitemap.xml`. If that returns 404, also try `[origin]/sitemap_index.xml`. If a sitemap is found, use it as the authoritative URL list for the site.
+
+### Step 3 — Check robots.txt
+
+Fetch `[origin]/robots.txt`. Look for any `Sitemap:` directives — these point to the canonical sitemap location even when the default `/sitemap.xml` path doesn't exist.
+
+### Step 4 — Accept the limit
+
+If all three steps fail to yield sub-page URLs, **stop trying to fetch sub-pages**. State explicitly in the evaluation: "Sub-page structure could not be verified — evaluation is based on homepage content only." This is an honest finding, not a failure. A site with no discoverable URL structure may itself be a usability or SEO issue worth noting (H1, H10).
+
+### Handling different href types
+
+- **Relative hrefs** (`/acheter`, `../contact`) — resolve against the origin before fetching.
+- **Hash hrefs** (`#section`, `#top`) — anchor links on the same page, not sub-pages. Note them but do not fetch.
+- **JavaScript hrefs** (`href="javascript:void(0)"`, `onclick` handlers, no `href`) — indicate JS-rendered navigation. Flag as a potential SEO and accessibility issue (content unreachable without JS). Do not attempt to fetch.
+- **External hrefs** — only fetch if directly relevant to the evaluation (e.g., a booking engine the site delegates to).
+
+---
+
 ## How You Work
 
 - **Test the actual build, not the spec** — evaluate what was built, not what was planned
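
Reviewer note: the href-handling rules and the `Sitemap:` lookup added by this patch are written as prose instructions for the agent, so a small executable sketch may help check that the rules are unambiguous. The Python below is only an illustration and not part of the patch. The function names `classify_href` and `sitemaps_from_robots`, the kind labels, and the `example.com` URLs are all hypothetical. It also assumes relative hrefs are resolved against the URL of the page they were found on (which matters for `../contact`), and it adds an `other` bucket for schemes such as `mailto:` that the patch does not mention.

```python
from urllib.parse import urljoin, urlsplit


def classify_href(href, page_url):
    """Return (kind, absolute_url_or_None) for one href found on page_url."""
    if href is None or not href.strip():
        # No href at all: navigation is probably driven by JavaScript.
        return ("javascript", None)
    href = href.strip()
    if href.lower().startswith("javascript:"):
        return ("javascript", None)  # flag it, never fetch it
    if href.startswith("#"):
        return ("hash", None)  # same-page anchor, not a sub-page
    absolute = urljoin(page_url, href)  # resolves /acheter and ../contact
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return ("other", None)  # mailto:, tel:, ... (assumed bucket)
    same_origin = (parts.scheme, parts.netloc) == urlsplit(page_url)[:2]
    return ("internal" if same_origin else "external", absolute)


def sitemaps_from_robots(robots_txt):
    """Collect the URLs named by Sitemap: lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # comment line
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

Under these assumptions, only `internal` results (and `external` ones judged relevant to the evaluation) would be candidates for a follow-up fetch. An empty list from `sitemaps_from_robots` corresponds to Step 3 yielding nothing.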