RA-8607: PSL data parser updates rebased onto latest upstream#106
RA-8607: PSL data parser updates rebased onto latest upstream#106
Conversation
library/src/test/java/com/digicert/validation/psl/PslDataParserTest.java
Fixed
Show fixed
Hide fixed
library/src/test/java/com/digicert/validation/psl/PslDataParserTest.java
Fixed
Show fixed
Hide fixed
library/src/test/java/com/digicert/validation/psl/PslDataParserTest.java
Fixed
Show fixed
Hide fixed
There was a problem hiding this comment.
Pull request overview
Updates the Public Suffix List (PSL) parsing to better handle internationalized (Unicode) entries by inserting punycode variants, and expands test coverage for Unicode scripts and PSL rule types.
Changes:
- Update
PslDataParser.addToTrieto useIDN.toASCII(..., IDN.ALLOW_UNASSIGNED)and avoid failing the full parse on conversion errors. - Add parameterized tests covering multiple Unicode scripts, wildcard/exception rules, comment/empty-line skipping, and private-domain section handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| library/src/main/java/com/digicert/validation/psl/PslDataParser.java | Adjusts IDN/punycode conversion behavior (ALLOW_UNASSIGNED + exception handling) during trie insertion. |
| library/src/test/java/com/digicert/validation/psl/PslDataParserTest.java | Adds parameterized test suites for Unicode scripts and additional PSL parsing behaviors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| private static Stream<Arguments> provideUnicodeScriptTestData() { | ||
| return Stream.of( | ||
| Arguments.of("ᬩᬮᬶ", "xn--9tfky", "Balinese script (.bali TLD)"), | ||
| Arguments.of("संगठन", "xn--i1b6b1a6a2e", "Devanagari script (.org in Hindi)"), | ||
| Arguments.of("বাংলা", "xn--54b7fta0cc", "Bengali script"), | ||
| Arguments.of("ελ", "xn--qxam", "Greek (.el for Greece)"), | ||
| Arguments.of("рф", "xn--p1ai", "Cyrillic (.rf for Russia)"), | ||
| Arguments.of("한국", "xn--3e0b707e", "Korean (.kr)"), | ||
| Arguments.of("ไทย", "xn--o3cw4h", "Thai (.th)"), | ||
| Arguments.of("இலங்கை", "xn--xkc2al3hye2a", "Tamil (.lk for Sri Lanka)"), | ||
| Arguments.of("مصر", "xn--wgbh1c", "Arabic (.eg for Egypt)"), | ||
| Arguments.of("קום", "xn--9dbq2a", "Hebrew (.com in Hebrew)"), | ||
| Arguments.of("münchen", "xn--mnchen-3ya", "German umlaut"), | ||
| Arguments.of("日本", "xn--wgv71a", "Japanese"), | ||
| Arguments.of("中国", "xn--fiqs8s", "Chinese"), | ||
| Arguments.of("example.com", "example.com", "ASCII domain") | ||
| ); |
There was a problem hiding this comment.
The Unicode-script parameterized test hardcodes expected punycode strings. These values are derived from java.net.IDN and can change across JDK updates, making the test brittle even when the parser behavior is correct. Consider deriving the expected punycode at runtime via IDN.toASCII(unicodeEntry, IDN.ALLOW_UNASSIGNED) (or generating the punycode in the method source) and asserting that value is present in the trie, rather than pinning specific punycode literals.
| // Using ALLOW_UNASSIGNED flag to handle edge cases like Balinese script (ᬩᬮᬶ) | ||
| try { | ||
| String punycode = IDN.toASCII(substring, IDN.ALLOW_UNASSIGNED); | ||
| if (!punycode.equals(substring)) { | ||
| trie.insert(punycode); | ||
| } | ||
| } catch (IllegalArgumentException e) { | ||
| // If conversion fails even with ALLOW_UNASSIGNED, the original substring | ||
| // has already been inserted above, so no entries are skipped | ||
| // This should be rare but we handle it gracefully | ||
| } |
There was a problem hiding this comment.
addToTrie now swallows IllegalArgumentException from IDN.toASCII(...) without any logging or surfacing of the bad PSL entry. If conversion fails for an entry, the punycode form won’t be inserted, which can lead to incorrect suffix classification for punycode input while failing silently. Consider either (a) rethrowing with context (e.g., include the offending substring) to avoid loading a partially-correct PSL, or (b) logging at least once (warn/debug) so the issue can be diagnosed in production.
| * and validated correctly. | ||
| * <p> | ||
| * The method uses the {@code IDN.ALLOW_UNASSIGNED} flag to handle Unicode characters that are not yet assigned | ||
| * in the Unicode standard (such as Balinese script). This flag allows the conversion to succeed for a wider | ||
| * range of Unicode characters that may appear in the Public Suffix List. |
There was a problem hiding this comment.
The Javadoc says adding punycode “ensures that both Unicode and punycode versions ... are recognized”. With the new try/catch, punycode insertion is now best-effort (it can be skipped on IllegalArgumentException). Please update the wording to match the behavior, or make punycode insertion fail-fast so the guarantee holds.
e881baf to
ec95178
Compare
ec95178 to
9ea12bb
Compare
|
Closing in favor of a squashed, neutral branch/PR without RA ticket references. |
Rebased RA-8607 changes from domain-control-validation-review onto latest digicert/domain-control-validation master.