Conversation
🦋 Changeset detectedLatest commit: 06a80fc The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
There was a problem hiding this comment.
Pull request overview
Adds a local “parser” CLI to crawl a site, extract page metadata, and persist results to disk with URL/filename sanitization.
Changes:
- Introduces a Puppeteer-based crawler/metadata extractor and local JSON output.
- Adds URL normalization + filesystem-safe filename sanitization utilities.
- Adds build/test tooling (TypeScript + Jest) and an error-handling integration test.
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| apps/parser/src/parser.ts | Adds the CLI entrypoint: reachability check, crawl orchestration, page parsing, persistence. |
| apps/parser/src/modules/crawler.ts | Implements recursive crawl + scope filtering + link discovery. |
| apps/parser/src/modules/domActions.ts | Expands interactive UI sections before scraping text/links. |
| apps/parser/src/modules/config.ts | Resolves env-based configuration and output directory derivation. |
| apps/parser/src/modules/output.ts | Creates output directory and writes JSON snapshots. |
| apps/parser/src/modules/errors.ts | Centralizes fatal error handling/exit code. |
| apps/parser/src/modules/types.ts | Adds typed metadata/node structures for crawl results. |
| apps/parser/src/utils/url.ts | Adds URL normalization helpers and remote URL detection. |
| apps/parser/src/utils/sanitizeFilename.ts | Adds filesystem-safe filename sanitization. |
| apps/parser/tests/parser.error-handling.test.ts | Adds integration test for unreachable/nonexistent URL behavior. |
| apps/parser/package.json | Adds build/parse/test scripts and required dependencies. |
| apps/parser/jest.config.ts | Configures Jest + ts-jest for the parser app. |
| apps/parser/tsconfig.json | Adds parser app TS config for dev/test typechecking. |
| apps/parser/tsconfig.build.json | Adds build TS config emitting to dist/. |
| apps/parser/README.md | Documents CLI usage, env vars, and tests. |
| .changeset/wide-hairs-fail.md | Changeset entry for the new parser feature. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ng of the same document
…a/developer-portal into CAI-749-parser-url-crawler
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…le navigation_timeout parameter. Remove unused isRemoteUrl
Branch is not up to date with base branch@anemone008 it seems this Pull Request is not updated with base branch. |
…mited depth, implement base scope handling, and enhance URL sanitization functions for Directory names
Jira Pull Request LinkThis Pull Request refers to the following Jira issue CAI-749 |
|
This PR exceeds the recommended size of 800 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
…COPE (e.g. if example.com is passed as URL but it redirects to www.example.com, the second is correctly stored as BASE_SCOPE)
… to avoid code repetition. Make hash logic more robust.
…c determination of the base url. Update related functions to accept base_scope as an argument.
| let BASE_SCOPE: string; | ||
|
|
||
| export function setBaseScope(scope: string): void { | ||
| BASE_SCOPE = scope; |
There was a problem hiding this comment.
Remove this variable and add parameter in sanitizeUrlAsFilename
apps/parser/src/modules/network.ts
Outdated
| @@ -0,0 +1,47 @@ | |||
| const REQUEST_TIMEOUT_MS = 10_000; | |||
There was a problem hiding this comment.
add this as env var but set it to 10000 if the var is missing
There was a problem hiding this comment.
Added as environment variable and documented it in 6ccb9db
| browser, | ||
| child, | ||
| depth + 1, | ||
| parsedPages, |
There was a problem hiding this comment.
This may parse already parsed pages when a recursion branch is completed
There was a problem hiding this comment.
i suggest to:
- add a return type to the recursive function -> array of string or the same type of parsedPages
- when there are no children to explore, return parsed pages + the page explored in this recursion
- doing this should build a parsed page with all the pages explored
…al variable that allows parsing specific domain variants. Refactor base_scope to baseScope for consistency.
…ment variable VALID_DOMAIN_VARIANTS
apps/parser/src/modules/types.ts
Outdated
| title?: string; | ||
| bodyText?: string; | ||
| lang?: string | null; | ||
| keywords?: string | null; | ||
| datePublished?: string | null; | ||
| lastModified?: string | null; |
There was a problem hiding this comment.
theese fields are the same as ParsedMetadata, check if they are really needed
There was a problem hiding this comment.
As they are not needed in fact, they have been remove in commit ddfef8a
apps/parser/src/main.ts
Outdated
| } | ||
| } catch (error) { | ||
| console.warn( | ||
| `Failed to detect redirect for base URL: ${env.baseUrl}`, |
There was a problem hiding this comment.
If the execution reaches this catch, the page might be uncreachable. Print a warning with the message "Failed to reach the url" or something like that and stop the execution.
| if (!url) { | ||
| console.warn( | ||
| `Missing input url, sanitizing as default "${DEFAULT_REPLACEMENT}"`, | ||
| ); | ||
| return DEFAULT_REPLACEMENT; | ||
| } |
There was a problem hiding this comment.
check if URL is empty... also, this check should be made before the sanitizeUrlAsFilename
There was a problem hiding this comment.
Indeed, the check of baseUrl is performed in config.ts. All child URLs are discovered through parsing, therefore in no case the url parameter of this function happens to be empty. This check is simply removed in commit b501940
| filenameBase = new URL(filenameBase).hostname.replace(/^www\./, ""); | ||
| } else { | ||
| const pathAndSearch = url.replace(baseScope, "").replace(/^\/+/, ""); | ||
| if (!pathAndSearch || pathAndSearch === "/") { |
There was a problem hiding this comment.
i don't think a replace can return null
| ); | ||
| return DEFAULT_REPLACEMENT; | ||
| } | ||
| let filenameBase = url; |
There was a problem hiding this comment.
this let is not needed, return directly in the different if else paths
| url: string, | ||
| options?: SanitizeOptions, | ||
| ): string { | ||
| let filenameBase = url; |
| const replacement = validReplacementOrDefault( | ||
| options?.replacement ?? DEFAULT_REPLACEMENT, | ||
| ); | ||
| let sanitized = input |
| export function deriveSubPath( | ||
| targetUrl: string, | ||
| baseUrl: string, | ||
| sanitizedBaseUrl: string, | ||
| ): string { | ||
| const base = new URL(baseUrl); | ||
| const target = new URL(targetUrl); | ||
| let relPath = target.pathname; | ||
| if (base.pathname !== "/" && relPath.startsWith(base.pathname)) { | ||
| relPath = relPath.slice(base.pathname.length); | ||
| if (!relPath.startsWith("/")) relPath = "/" + relPath; | ||
| } | ||
| if ( | ||
| RemoveAnchorsFromUrl(targetUrl) === sanitizedBaseUrl || | ||
| relPath === "/" || | ||
| relPath === "" | ||
| ) { | ||
| return "/"; | ||
| } | ||
| return `${relPath}${target.search}${target.hash}` || "/"; | ||
| } | ||
|
|
There was a problem hiding this comment.
this code is never called, delete it
There was a problem hiding this comment.
All comments regarding the file url-handling.ts have been addressed with commit 5a324d1
| ); | ||
| } finally { | ||
| if (page) await page.close(); | ||
| } |
There was a problem hiding this comment.
check if the original base url has a different domain from the finalURL. in this case stop the exectution. If the domain (host) is the same, save the new base scope and continue
There was a problem hiding this comment.
Added domain check after redirect in 950eae3
| browser, | ||
| child, | ||
| depth + 1, | ||
| parsedPages, |
There was a problem hiding this comment.
i suggest to:
- add a return type to the recursive function -> array of string or the same type of parsedPages
- when there are no children to explore, return parsed pages + the page explored in this recursion
- doing this should build a parsed page with all the pages explored
…cordingly; remove unused fields in ParsedNode.
…efault of 10000. Update .env.default and README accordingly
…llow parsing of .html files. Remove unused deriveSubPath function.
…so the corresponding test case
List of Changes
Add script for parsing to parser app. Parsed content is saved locally. Urls are sanitized for filesystems and used as file names.
Motivation and Context
How Has This Been Tested?
Tested for errors associated to non-existent or unreachable urls. Reproducible through
npm testas described in the README.mdScreenshots (if appropriate):
Types of changes
Checklist: