
[CAI-749] Local parser#2007

Open
anemone008 wants to merge 57 commits into main from CAI-749-parser-url-crawler

Conversation

@anemone008
Collaborator

List of Changes

Adds a parsing script to the parser app. Parsed content is saved locally; URLs are sanitized for filesystems and used as file names.

Motivation and Context

How Has This Been Tested?

Tested for errors associated with non-existent or unreachable URLs. Reproducible via npm test as described in the README.md.

Screenshots (if appropriate):

Types of changes

  • Chore (nothing changes from a user's perspective)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@changeset-bot

changeset-bot bot commented Feb 9, 2026

🦋 Changeset detected

Latest commit: 06a80fc

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:

  • parser: Major


@anemone008 anemone008 self-assigned this Feb 9, 2026
@anemone008 anemone008 requested a review from Copilot February 9, 2026 16:45
Contributor

Copilot AI left a comment

Pull request overview

Adds a local “parser” CLI to crawl a site, extract page metadata, and persist results to disk with URL/filename sanitization.

Changes:

  • Introduces a Puppeteer-based crawler/metadata extractor and local JSON output.
  • Adds URL normalization + filesystem-safe filename sanitization utilities.
  • Adds build/test tooling (TypeScript + Jest) and an error-handling integration test.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 10 comments.

File Description
apps/parser/src/parser.ts Adds the CLI entrypoint: reachability check, crawl orchestration, page parsing, persistence.
apps/parser/src/modules/crawler.ts Implements recursive crawl + scope filtering + link discovery.
apps/parser/src/modules/domActions.ts Expands interactive UI sections before scraping text/links.
apps/parser/src/modules/config.ts Resolves env-based configuration and output directory derivation.
apps/parser/src/modules/output.ts Creates output directory and writes JSON snapshots.
apps/parser/src/modules/errors.ts Centralizes fatal error handling/exit code.
apps/parser/src/modules/types.ts Adds typed metadata/node structures for crawl results.
apps/parser/src/utils/url.ts Adds URL normalization helpers and remote URL detection.
apps/parser/src/utils/sanitizeFilename.ts Adds filesystem-safe filename sanitization.
apps/parser/tests/parser.error-handling.test.ts Adds integration test for unreachable/nonexistent URL behavior.
apps/parser/package.json Adds build/parse/test scripts and required dependencies.
apps/parser/jest.config.ts Configures Jest + ts-jest for the parser app.
apps/parser/tsconfig.json Adds parser app TS config for dev/test typechecking.
apps/parser/tsconfig.build.json Adds build TS config emitting to dist/.
apps/parser/README.md Documents CLI usage, env vars, and tests.
.changeset/wide-hairs-fail.md Changeset entry for the new parser feature.


@anemone008 anemone008 marked this pull request as draft February 9, 2026 16:59
anemone008 and others added 2 commits February 10, 2026 12:04
@anemone008 anemone008 marked this pull request as ready for review February 10, 2026 13:56
@github-actions
Contributor

Branch is not up to date with base branch

@anemone008 it seems this Pull Request is not up to date with the base branch.
Please merge or rebase to resolve this.

@github-actions
Contributor

github-actions bot commented Feb 13, 2026

Jira Pull Request Link

This Pull Request refers to the following Jira issue CAI-749

@github-actions
Contributor

This PR exceeds the recommended size of 800 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@MarBert MarBert self-requested a review February 13, 2026 16:15
…COPE (e.g. if example.com is passed as URL but it redirects to www.example.com, the second is correctly stored as BASE_SCOPE)
… to avoid code repetition. Make hash logic more robust.
…c determination of the base url. Update related functions to accept base_scope as an argument.
Comment on lines 11 to 14
let BASE_SCOPE: string;

export function setBaseScope(scope: string): void {
  BASE_SCOPE = scope;
Collaborator

Remove this variable and add a parameter to sanitizeUrlAsFilename instead.
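The suggested refactor could look like the following minimal sketch. The body is illustrative only (not the PR's actual code); only the function name sanitizeUrlAsFilename comes from the review thread.

```typescript
// Sketch: pass the scope explicitly instead of relying on a mutable
// module-level BASE_SCOPE set via setBaseScope().
export function sanitizeUrlAsFilename(url: string, baseScope: string): string {
  // Strip the shared scope prefix so filenames stay short.
  const relative = url.startsWith(baseScope)
    ? url.slice(baseScope.length).replace(/^\/+/, "")
    : url;
  // Replace characters that are unsafe on common filesystems.
  return (relative || "index").replace(/[^a-zA-Z0-9._-]+/g, "_");
}
```

With an explicit parameter, the function is pure and trivially unit-testable, and call sites document where the scope comes from.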

@@ -0,0 +1,47 @@
const REQUEST_TIMEOUT_MS = 10_000;
Collaborator

Add this as an env var, defaulting to 10000 if the var is missing.

Collaborator Author

Added as environment variable and documented it in 6ccb9db
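The fallback pattern described above could look like this sketch. The variable name REQUEST_TIMEOUT_MS follows the constant in the diff; the helper name is hypothetical.

```typescript
const DEFAULT_REQUEST_TIMEOUT_MS = 10_000;

// Resolve the request timeout from an env-like record, falling back to the
// default when the variable is missing, empty, or not a positive number.
export function resolveRequestTimeoutMs(
  env: Record<string, string | undefined>,
): number {
  const raw = env.REQUEST_TIMEOUT_MS;
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_REQUEST_TIMEOUT_MS;
}
```

At startup this would be called as resolveRequestTimeoutMs(process.env); validating once keeps the rest of the crawler free of string parsing.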

browser,
child,
depth + 1,
parsedPages,
Collaborator

This may re-parse already-parsed pages when a recursion branch completes.

Collaborator

I suggest:

  • add a return type to the recursive function -> an array of strings, or the same type as parsedPages
  • when there are no children to explore, return the parsed pages plus the page explored in this recursion
  • doing this builds a parsedPages result containing all the pages explored
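The suggestion above could be sketched like this. The link graph is modeled as a plain map rather than live Puppeteer pages, purely as an assumption for the example; only the idea of threading the visited set through the recursion reflects the review comment.

```typescript
type LinkGraph = Record<string, string[]>;

// Recursive walk that shares one visited set across branches, so a page
// explored in one branch is never re-parsed by a sibling branch.
export function crawl(
  graph: LinkGraph,
  url: string,
  visited: Set<string> = new Set(),
): Set<string> {
  if (visited.has(url)) return visited; // already parsed elsewhere
  visited.add(url);
  for (const child of graph[url] ?? []) {
    // Pass the same set down so siblings see each other's progress.
    crawl(graph, child, visited);
  }
  return visited;
}
```

Returning the accumulated set also gives the caller the complete list of explored pages once the top-level call unwinds.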

Comment on lines 26 to 31
title?: string;
bodyText?: string;
lang?: string | null;
keywords?: string | null;
datePublished?: string | null;
lastModified?: string | null;
Collaborator

These fields are the same as ParsedMetadata; check whether they are really needed.

Collaborator Author

They are indeed not needed; they have been removed in commit ddfef8a.

  }
} catch (error) {
  console.warn(
    `Failed to detect redirect for base URL: ${env.baseUrl}`,
Collaborator

If the execution reaches this catch, the page might be unreachable. Print a warning with a message like "Failed to reach the url" and stop the execution.

Collaborator Author

Assessed in commit 9043c39
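The behavior requested above could be sketched as follows. fetchFinalUrl and onFatal are hypothetical stand-ins for the Puppeteer navigation and the PR's fatal-error handler; this is not the code from commit 9043c39.

```typescript
// Treat a failure while resolving the base URL as fatal: warn and stop
// instead of continuing the crawl against an unreachable site.
export async function resolveBaseOrAbort(
  baseUrl: string,
  fetchFinalUrl: (url: string) => Promise<string>,
  onFatal: (message: string) => void,
): Promise<string | undefined> {
  try {
    return await fetchFinalUrl(baseUrl);
  } catch {
    onFatal(`Failed to reach the url: ${baseUrl}`);
    return undefined;
  }
}
```

Injecting the fatal handler (rather than calling process.exit directly) keeps the function testable.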

Comment on lines 16 to 21
if (!url) {
  console.warn(
    `Missing input url, sanitizing as default "${DEFAULT_REPLACEMENT}"`,
  );
  return DEFAULT_REPLACEMENT;
}
Collaborator

Check if the URL is empty... also, this check should be made before calling sanitizeUrlAsFilename.

Collaborator Author

Indeed, the baseUrl check is performed in config.ts. All child URLs are discovered through parsing, so the url parameter of this function can never be empty. The check is simply removed in commit b501940.
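A hypothetical sketch of the startup validation the author describes: the base URL is checked once in config resolution, so downstream helpers such as the filename sanitizer can assume a non-empty, well-formed URL. The BASE_URL variable name and helper name are assumptions for this example, not the PR's actual config.ts.

```typescript
// Validate the base URL once, at config-resolution time.
export function resolveBaseUrl(
  env: Record<string, string | undefined>,
): string {
  const baseUrl = env.BASE_URL?.trim();
  if (!baseUrl) {
    throw new Error("BASE_URL is required and must not be empty");
  }
  // new URL() throws on malformed input, so bad config fails fast here.
  return new URL(baseUrl).toString();
}
```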

  filenameBase = new URL(filenameBase).hostname.replace(/^www\./, "");
} else {
  const pathAndSearch = url.replace(baseScope, "").replace(/^\/+/, "");
  if (!pathAndSearch || pathAndSearch === "/") {
Collaborator

I don't think replace can return null.

  );
  return DEFAULT_REPLACEMENT;
}
let filenameBase = url;
Collaborator

This let is not needed; return directly in the different if/else paths.

  url: string,
  options?: SanitizeOptions,
): string {
  let filenameBase = url;
Collaborator

same here!

const replacement = validReplacementOrDefault(
  options?.replacement ?? DEFAULT_REPLACEMENT,
);
let sanitized = input
Collaborator

make it a const

Comment on lines 117 to 138
export function deriveSubPath(
  targetUrl: string,
  baseUrl: string,
  sanitizedBaseUrl: string,
): string {
  const base = new URL(baseUrl);
  const target = new URL(targetUrl);
  let relPath = target.pathname;
  if (base.pathname !== "/" && relPath.startsWith(base.pathname)) {
    relPath = relPath.slice(base.pathname.length);
    if (!relPath.startsWith("/")) relPath = "/" + relPath;
  }
  if (
    RemoveAnchorsFromUrl(targetUrl) === sanitizedBaseUrl ||
    relPath === "/" ||
    relPath === ""
  ) {
    return "/";
  }
  return `${relPath}${target.search}${target.hash}` || "/";
}


Collaborator

This code is never called; delete it.

Collaborator Author

All comments regarding the file url-handling.ts have been addressed with commit 5a324d1

  );
} finally {
  if (page) await page.close();
}
Collaborator

Check if the original base URL has a different domain from the finalURL; in that case, stop the execution. If the domain (host) is the same, save the new base scope and continue.
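The check requested above could look like this sketch. The www-stripping mirrors the redirect example mentioned earlier in the thread (example.com redirecting to www.example.com); the helper name is hypothetical.

```typescript
// Compare the host of the original base URL with the host of the final
// (post-redirect) URL. Same host: adopt the final URL as the new base
// scope. Different host: return null so the caller can stop execution.
export function resolveScopeAfterRedirect(
  baseUrl: string,
  finalUrl: string,
): string | null {
  const baseHost = new URL(baseUrl).hostname.replace(/^www\./, "");
  const finalHost = new URL(finalUrl).hostname.replace(/^www\./, "");
  return baseHost === finalHost ? finalUrl : null;
}
```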

Collaborator Author

Added domain check after redirect in 950eae3

@anemone008 anemone008 requested a review from MarBert February 17, 2026 16:05