
[CAI-749] Local parser#2007

Open
anemone008 wants to merge 57 commits into main from CAI-749-parser-url-crawler

Conversation

@anemone008
Collaborator

List of Changes

Adds a parsing script to the parser app. Parsed content is saved locally; URLs are sanitized for filesystems and used as file names.

Motivation and Context

How Has This Been Tested?

Tested for errors associated with non-existent or unreachable URLs. Reproducible via npm test as described in the README.md.

Screenshots (if appropriate):

Types of changes

  • Chore (nothing changes from a user's perspective)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@changeset-bot

changeset-bot bot commented Feb 9, 2026

🦋 Changeset detected

Latest commit: 06a80fc

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:

  • parser: Major


@anemone008 anemone008 self-assigned this Feb 9, 2026
@anemone008 anemone008 requested a review from Copilot February 9, 2026 16:45
Contributor

Copilot AI left a comment

Pull request overview

Adds a local “parser” CLI to crawl a site, extract page metadata, and persist results to disk with URL/filename sanitization.

Changes:

  • Introduces a Puppeteer-based crawler/metadata extractor and local JSON output.
  • Adds URL normalization + filesystem-safe filename sanitization utilities.
  • Adds build/test tooling (TypeScript + Jest) and an error-handling integration test.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 10 comments.

File Description
apps/parser/src/parser.ts Adds the CLI entrypoint: reachability check, crawl orchestration, page parsing, persistence.
apps/parser/src/modules/crawler.ts Implements recursive crawl + scope filtering + link discovery.
apps/parser/src/modules/domActions.ts Expands interactive UI sections before scraping text/links.
apps/parser/src/modules/config.ts Resolves env-based configuration and output directory derivation.
apps/parser/src/modules/output.ts Creates output directory and writes JSON snapshots.
apps/parser/src/modules/errors.ts Centralizes fatal error handling/exit code.
apps/parser/src/modules/types.ts Adds typed metadata/node structures for crawl results.
apps/parser/src/utils/url.ts Adds URL normalization helpers and remote URL detection.
apps/parser/src/utils/sanitizeFilename.ts Adds filesystem-safe filename sanitization.
apps/parser/tests/parser.error-handling.test.ts Adds integration test for unreachable/nonexistent URL behavior.
apps/parser/package.json Adds build/parse/test scripts and required dependencies.
apps/parser/jest.config.ts Configures Jest + ts-jest for the parser app.
apps/parser/tsconfig.json Adds parser app TS config for dev/test typechecking.
apps/parser/tsconfig.build.json Adds build TS config emitting to dist/.
apps/parser/README.md Documents CLI usage, env vars, and tests.
.changeset/wide-hairs-fail.md Changeset entry for the new parser feature.


@anemone008 anemone008 marked this pull request as draft February 9, 2026 16:59
anemone008 and others added 2 commits February 10, 2026 12:04
@anemone008 anemone008 marked this pull request as ready for review February 10, 2026 13:56
@github-actions
Contributor

Branch is not up to date with base branch

@anemone008 it seems this Pull Request is not up to date with the base branch.
Please merge or rebase to resolve this.

@github-actions
Contributor

github-actions bot commented Feb 13, 2026

Jira Pull Request Link

This Pull Request refers to the following Jira issue CAI-749

@github-actions
Contributor

This PR exceeds the recommended size of 800 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@MarBert MarBert self-requested a review February 13, 2026 16:15
…COPE (e.g. if example.com is passed as URL but it redirects to www.example.com, the second is correctly stored as BASE_SCOPE)
… to avoid code repetition. Make hash logic more robust.
…c determination of the base url. Update related functions to accept base_scope as an argument.
Comment on lines 11 to 14
let BASE_SCOPE: string;

export function setBaseScope(scope: string): void {
  BASE_SCOPE = scope;
Collaborator

Remove this variable and add a parameter to sanitizeUrlAsFilename instead.
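The suggested refactor could look like the following minimal sketch. The body is illustrative only (not the PR's actual code); only the function name sanitizeUrlAsFilename comes from the review thread.

```typescript
// Sketch: pass the scope explicitly instead of relying on a mutable
// module-level BASE_SCOPE set via setBaseScope().
export function sanitizeUrlAsFilename(url: string, baseScope: string): string {
  // Strip the shared scope prefix so filenames stay short.
  const relative = url.startsWith(baseScope)
    ? url.slice(baseScope.length).replace(/^\/+/, "")
    : url;
  // Replace characters that are unsafe on common filesystems.
  return (relative || "index").replace(/[^a-zA-Z0-9._-]+/g, "_");
}
```

With an explicit parameter, the function is pure and trivially unit-testable, and call sites document where the scope comes from.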

@@ -0,0 +1,47 @@
const REQUEST_TIMEOUT_MS = 10_000;
Collaborator

Add this as an env var, defaulting to 10000 if the var is missing.

Collaborator Author

Added as environment variable and documented it in 6ccb9db
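The fallback pattern described above could look like this sketch. The variable name REQUEST_TIMEOUT_MS follows the constant in the diff; the helper name is hypothetical.

```typescript
const DEFAULT_REQUEST_TIMEOUT_MS = 10_000;

// Resolve the request timeout from an env-like record, falling back to the
// default when the variable is missing, empty, or not a positive number.
export function resolveRequestTimeoutMs(
  env: Record<string, string | undefined>,
): number {
  const raw = env.REQUEST_TIMEOUT_MS;
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_REQUEST_TIMEOUT_MS;
}
```

At startup this would be called as resolveRequestTimeoutMs(process.env); validating once keeps the rest of the crawler free of string parsing.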

browser,
child,
depth + 1,
parsedPages,
Collaborator

This may re-parse already-parsed pages when a recursion branch completes.

Collaborator

I suggest:

  • add a return type to the recursive function -> an array of strings, or the same type as parsedPages
  • when there are no children to explore, return the parsed pages plus the page explored in this recursion
  • doing this builds a parsedPages result containing all the pages explored
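The suggestion above could be sketched like this. The link graph is modeled as a plain map rather than live Puppeteer pages, purely as an assumption for the example; only the idea of threading the visited set through the recursion reflects the review comment.

```typescript
type LinkGraph = Record<string, string[]>;

// Recursive walk that shares one visited set across branches, so a page
// explored in one branch is never re-parsed by a sibling branch.
export function crawl(
  graph: LinkGraph,
  url: string,
  visited: Set<string> = new Set(),
): Set<string> {
  if (visited.has(url)) return visited; // already parsed elsewhere
  visited.add(url);
  for (const child of graph[url] ?? []) {
    // Pass the same set down so siblings see each other's progress.
    crawl(graph, child, visited);
  }
  return visited;
}
```

Returning the accumulated set also gives the caller the complete list of explored pages once the top-level call unwinds.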

Comment on lines 26 to 31
title?: string;
bodyText?: string;
lang?: string | null;
keywords?: string | null;
datePublished?: string | null;
lastModified?: string | null;
Collaborator

These fields are the same as ParsedMetadata; check whether they are really needed.

Collaborator Author

They are indeed not needed; they have been removed in commit ddfef8a.

  }
} catch (error) {
  console.warn(
    `Failed to detect redirect for base URL: ${env.baseUrl}`,
Collaborator

If the execution reaches this catch, the page might be unreachable. Print a warning with a message like "Failed to reach the url" and stop the execution.

Collaborator Author

Assessed in commit 9043c39
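The behavior requested above could be sketched as follows. fetchFinalUrl and onFatal are hypothetical stand-ins for the Puppeteer navigation and the PR's fatal-error handler; this is not the code from commit 9043c39.

```typescript
// Treat a failure while resolving the base URL as fatal: warn and stop
// instead of continuing the crawl against an unreachable site.
export async function resolveBaseOrAbort(
  baseUrl: string,
  fetchFinalUrl: (url: string) => Promise<string>,
  onFatal: (message: string) => void,
): Promise<string | undefined> {
  try {
    return await fetchFinalUrl(baseUrl);
  } catch {
    onFatal(`Failed to reach the url: ${baseUrl}`);
    return undefined;
  }
}
```

Injecting the fatal handler (rather than calling process.exit directly) keeps the function testable.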

Comment on lines 16 to 21
if (!url) {
  console.warn(
    `Missing input url, sanitizing as default "${DEFAULT_REPLACEMENT}"`,
  );
  return DEFAULT_REPLACEMENT;
}
Collaborator

Check if the URL is empty... also, this check should be made before calling sanitizeUrlAsFilename.

Collaborator Author

Indeed, the baseUrl check is performed in config.ts. All child URLs are discovered through parsing, so the url parameter of this function can never be empty. The check is simply removed in commit b501940.
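A hypothetical sketch of the startup validation the author describes: the base URL is checked once in config resolution, so downstream helpers such as the filename sanitizer can assume a non-empty, well-formed URL. The BASE_URL variable name and helper name are assumptions for this example, not the PR's actual config.ts.

```typescript
// Validate the base URL once, at config-resolution time.
export function resolveBaseUrl(
  env: Record<string, string | undefined>,
): string {
  const baseUrl = env.BASE_URL?.trim();
  if (!baseUrl) {
    throw new Error("BASE_URL is required and must not be empty");
  }
  // new URL() throws on malformed input, so bad config fails fast here.
  return new URL(baseUrl).toString();
}
```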

  filenameBase = new URL(filenameBase).hostname.replace(/^www\./, "");
} else {
  const pathAndSearch = url.replace(baseScope, "").replace(/^\/+/, "");
  if (!pathAndSearch || pathAndSearch === "/") {
Collaborator

I don't think replace can return null.

  );
  return DEFAULT_REPLACEMENT;
}
let filenameBase = url;
Collaborator

This let is not needed; return directly in the different if/else paths.

  url: string,
  options?: SanitizeOptions,
): string {
  let filenameBase = url;
Collaborator

same here!

const replacement = validReplacementOrDefault(
  options?.replacement ?? DEFAULT_REPLACEMENT,
);
let sanitized = input
Collaborator

make it a const

Comment on lines 117 to 138
export function deriveSubPath(
  targetUrl: string,
  baseUrl: string,
  sanitizedBaseUrl: string,
): string {
  const base = new URL(baseUrl);
  const target = new URL(targetUrl);
  let relPath = target.pathname;
  if (base.pathname !== "/" && relPath.startsWith(base.pathname)) {
    relPath = relPath.slice(base.pathname.length);
    if (!relPath.startsWith("/")) relPath = "/" + relPath;
  }
  if (
    RemoveAnchorsFromUrl(targetUrl) === sanitizedBaseUrl ||
    relPath === "/" ||
    relPath === ""
  ) {
    return "/";
  }
  return `${relPath}${target.search}${target.hash}` || "/";
}


Collaborator

This code is never called; delete it.

Collaborator Author

All comments regarding the file url-handling.ts have been addressed with commit 5a324d1

  );
} finally {
  if (page) await page.close();
}
Collaborator

Check if the original base URL has a different domain from the finalURL; in that case, stop the execution. If the domain (host) is the same, save the new base scope and continue.
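The check requested above could look like this sketch. The www-stripping mirrors the redirect example mentioned earlier in the thread (example.com redirecting to www.example.com); the helper name is hypothetical.

```typescript
// Compare the host of the original base URL with the host of the final
// (post-redirect) URL. Same host: adopt the final URL as the new base
// scope. Different host: return null so the caller can stop execution.
export function resolveScopeAfterRedirect(
  baseUrl: string,
  finalUrl: string,
): string | null {
  const baseHost = new URL(baseUrl).hostname.replace(/^www\./, "");
  const finalHost = new URL(finalUrl).hostname.replace(/^www\./, "");
  return baseHost === finalHost ? finalUrl : null;
}
```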

Collaborator Author

Added domain check after redirect in 950eae3

@anemone008 anemone008 requested a review from MarBert February 17, 2026 16:05