Skip to content

RA-8607: PSL data parser updates rebased onto latest upstream#106

Closed
digilob wants to merge 4 commits intomasterfrom
RA-8607_psl_data_parser
Closed

RA-8607: PSL data parser updates rebased onto latest upstream#106
digilob wants to merge 4 commits intomasterfrom
RA-8607_psl_data_parser

Conversation

@digilob
Copy link
Copy Markdown
Collaborator

@digilob digilob commented Mar 25, 2026

Rebased RA-8607 changes from domain-control-validation-review onto latest digicert/domain-control-validation master.

Copilot AI review requested due to automatic review settings March 25, 2026 19:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the Public Suffix List (PSL) parsing to better handle internationalized (Unicode) entries by inserting punycode variants, and expands test coverage for Unicode scripts and PSL rule types.

Changes:

  • Update PslDataParser.addToTrie to use IDN.toASCII(..., IDN.ALLOW_UNASSIGNED) and avoid failing the full parse on conversion errors.
  • Add parameterized tests covering multiple Unicode scripts, wildcard/exception rules, comment/empty-line skipping, and private-domain section handling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
library/src/main/java/com/digicert/validation/psl/PslDataParser.java Adjusts IDN/punycode conversion behavior (ALLOW_UNASSIGNED + exception handling) during trie insertion.
library/src/test/java/com/digicert/validation/psl/PslDataParserTest.java Adds parameterized test suites for Unicode scripts and additional PSL parsing behaviors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +57 to +73
private static Stream<Arguments> provideUnicodeScriptTestData() {
return Stream.of(
Arguments.of("ᬩᬮᬶ", "xn--9tfky", "Balinese script (.bali TLD)"),
Arguments.of("संगठन", "xn--i1b6b1a6a2e", "Devanagari script (.org in Hindi)"),
Arguments.of("বাংলা", "xn--54b7fta0cc", "Bengali script"),
Arguments.of("ελ", "xn--qxam", "Greek (.el for Greece)"),
Arguments.of("рф", "xn--p1ai", "Cyrillic (.rf for Russia)"),
Arguments.of("한국", "xn--3e0b707e", "Korean (.kr)"),
Arguments.of("ไทย", "xn--o3cw4h", "Thai (.th)"),
Arguments.of("இலங்கை", "xn--xkc2al3hye2a", "Tamil (.lk for Sri Lanka)"),
Arguments.of("مصر", "xn--wgbh1c", "Arabic (.eg for Egypt)"),
Arguments.of("קום", "xn--9dbq2a", "Hebrew (.com in Hebrew)"),
Arguments.of("münchen", "xn--mnchen-3ya", "German umlaut"),
Arguments.of("日本", "xn--wgv71a", "Japanese"),
Arguments.of("中国", "xn--fiqs8s", "Chinese"),
Arguments.of("example.com", "example.com", "ASCII domain")
);
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Unicode-script parameterized test hardcodes expected punycode strings. These values are derived from java.net.IDN and can change across JDK updates, making the test brittle even when the parser behavior is correct. Consider deriving the expected punycode at runtime via IDN.toASCII(unicodeEntry, IDN.ALLOW_UNASSIGNED) (or generating the punycode in the method source) and asserting that value is present in the trie, rather than pinning specific punycode literals.

Copilot uses AI. Check for mistakes.
Comment on lines +112 to 122
// Using ALLOW_UNASSIGNED flag to handle edge cases like Balinese script (ᬩᬮᬶ)
try {
String punycode = IDN.toASCII(substring, IDN.ALLOW_UNASSIGNED);
if (!punycode.equals(substring)) {
trie.insert(punycode);
}
} catch (IllegalArgumentException e) {
// If conversion fails even with ALLOW_UNASSIGNED, the original substring
// has already been inserted above, so no entries are skipped
// This should be rare but we handle it gracefully
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addToTrie now swallows IllegalArgumentException from IDN.toASCII(...) without any logging or surfacing of the bad PSL entry. If conversion fails for an entry, the punycode form won’t be inserted, which can lead to incorrect suffix classification for punycode input while failing silently. Consider either (a) rethrowing with context (e.g., include the offending substring) to avoid loading a partially-correct PSL, or (b) logging at least once (warn/debug) so the issue can be diagnosed in production.

Copilot uses AI. Check for mistakes.
Comment on lines +99 to +103
* and validated correctly.
* <p>
* The method uses the {@code IDN.ALLOW_UNASSIGNED} flag to handle Unicode characters that are not yet assigned
* in the Unicode standard (such as Balinese script). This flag allows the conversion to succeed for a wider
* range of Unicode characters that may appear in the Public Suffix List.
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Javadoc says adding punycode “ensures that both Unicode and punycode versions ... are recognized”. With the new try/catch, punycode insertion is now best-effort (it can be skipped on IllegalArgumentException). Please update the wording to match the behavior, or make punycode insertion fail-fast so the guarantee holds.

Copilot uses AI. Check for mistakes.
@digilob digilob force-pushed the RA-8607_psl_data_parser branch from e881baf to ec95178 Compare March 25, 2026 19:16
@digilob digilob force-pushed the RA-8607_psl_data_parser branch from ec95178 to 9ea12bb Compare March 25, 2026 19:17
@digilob
Copy link
Copy Markdown
Collaborator Author

digilob commented Mar 25, 2026

Closing in favor of a squashed, neutral branch/PR without RA ticket references.

@digilob digilob closed this Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants