fix: classify Hangul Compatibility Jamo (U+3130–U+318F) as CJK #141

Closed
mayrang wants to merge 3 commits into chenglou:main from mayrang:fix/hangul-compatibility-jamo-cjk-classification
@mayrang
Contributor

@mayrang mayrang commented Apr 17, 2026

Closes #142

Problem

isCJKCodePoint() in src/analysis.ts includes Hangul syllables (U+AC00–U+D7AF) but not Hangul Compatibility Jamo (U+3130–U+318F) — the standalone consonants and vowels Korean speakers type constantly in digital communication:

| Expression | Meaning | Characters |
| --- | --- | --- |
| ㅋㅋㅋ | laughter ("lol") | U+314B |
| ㅠㅠ | crying / sad | U+3160 |
| ㄹㅇ | "literally" / "for real" | U+3139 U+3147 |
| ㅇㅋ | okay | U+3147 U+314B |
| ㄴㄴ | no / nope | U+3134 |

These appear in virtually every Korean chat conversation. Without CJK classification, a run of these characters is treated as an atomic segment that the line-breaker cannot split at grapheme boundaries, so pretext reports one more line than Chrome and Safari do.
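To make the gap concrete, here is a minimal sketch that labels a character by the two ranges quoted above. The names `classify` and `RangeLabel` are hypothetical and are not part of pretext. It reads UTF-16 code units with `charCodeAt`, which is sufficient only because both ranges lie in the Basic Multilingual Plane.

```typescript
// Ranges quoted in the PR description (inclusive bounds).
const HANGUL_COMPAT_JAMO: readonly [number, number] = [0x3130, 0x318f];
const HANGUL_SYLLABLES: readonly [number, number] = [0xac00, 0xd7af];

type RangeLabel = "compat-jamo" | "syllable" | "other";

function inRange(value: number, range: readonly [number, number]): boolean {
  return value >= range[0] && value <= range[1];
}

// Hypothetical helper: labels the first UTF-16 code unit of `ch`.
// Both ranges are in the BMP, so one code unit equals one code point here.
function classify(ch: string): RangeLabel {
  const unit = ch.charCodeAt(0);
  if (inRange(unit, HANGUL_COMPAT_JAMO)) return "compat-jamo";
  if (inRange(unit, HANGUL_SYLLABLES)) return "syllable";
  return "other";
}

// Every character from the table falls in the Compatibility Jamo range,
// so a check that covers only U+AC00–U+D7AF misses all of them.
const tableChars = ["ㅋ", "ㅠ", "ㄹ", "ㅇ", "ㄴ"];
const labels = tableChars.map(classify);
```
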

Reproduction

Wrong line count (most severe)

font: "20px serif", width: 200px
text: "ㅋㅋㅋ 진짜 웃기다 ㅋㅋㅋ 진짜로"

pretext: 4 lines
browser: 3 lines  ← 1 line off

Wrong line count

font: "20px serif", width: 150px
text: "이거 ㄹㅇ임 ㄹㅇ 아니면 뭐야"

pretext: 5 lines
browser: 4 lines  ← 1 line off

Reproduced in: Chrome ✓ Safari ✓

The bug is width-sensitive — all Hangul Compatibility Jamo are affected, but failures appear at different widths depending on the character.

Reproduce without this PR's script (using the existing /probe page on main):

Run bun start, then open:

http://localhost:3210/probe?text=ㅋㅋㅋ%20진짜%20웃기다%20ㅋㅋㅋ%20진짜로&width=200&font=20px%20serif&lineHeight=34&lang=ko&method=span&requestId=debug1

The page shows pretext's predicted lines vs the browser's actual lines side by side. On main (before this fix), it will show a mismatch.
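For trying other strings and widths, a small helper can assemble the same query string. This is a sketch only: `buildProbeUrl` and `ProbeParams` are hypothetical names, and the parameter list is taken from the URL above rather than from the `/probe` page's source, so other accepted parameters (if any) are not covered.

```typescript
// Parameters as they appear in the example URL; anything beyond these is unknown.
interface ProbeParams {
  text: string;
  width: number;
  font: string;
  lineHeight: number;
  lang: string;
  method: string;
  requestId: string;
}

// Hypothetical helper: percent-encodes each value and joins the pairs
// in the same order as the example URL.
function buildProbeUrl(base: string, p: ProbeParams): string {
  const pairs: Array<[string, string]> = [
    ["text", p.text],
    ["width", String(p.width)],
    ["font", p.font],
    ["lineHeight", String(p.lineHeight)],
    ["lang", p.lang],
    ["method", p.method],
    ["requestId", p.requestId],
  ];
  const query = pairs
    .map(([key, value]) => key + "=" + encodeURIComponent(value))
    .join("&");
  return base + "/probe?" + query;
}

const probeUrl = buildProbeUrl("http://localhost:3210", {
  text: "ㅋㅋㅋ 진짜 웃기다 ㅋㅋㅋ 진짜로",
  width: 200,
  font: "20px serif",
  lineHeight: 34,
  lang: "ko",
  method: "span",
  requestId: "debug1",
});
```

`encodeURIComponent` also percent-encodes the Korean characters, which the example URL leaves raw; browsers decode both forms to the same text.
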

Fix

// src/analysis.ts — isCJKCodePoint()
+    (codePoint >= 0x3130 && codePoint <= 0x318F) ||  // Hangul Compatibility Jamo (ㄱ-ㅣ)
     (codePoint >= 0xAC00 && codePoint <= 0xD7AF) ||

Without this range, ㅋㅋㅋ is treated as a single word unit rather than being split into individual CJK grapheme units. The line-breaker tries to fit the entire run as one piece, while browsers allow a break between any two consecutive Compatibility Jamo, the same way they handle other CJK characters.
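The effect on line counts can be illustrated with a toy greedy breaker. This is not pretext's algorithm: the names `toUnits` and `countLines` are hypothetical, every character is assumed to be the same width, and the widths below are made-up numbers rather than real font metrics. The only thing that changes between the two runs is whether the Compatibility Jamo range counts as CJK.

```typescript
type IsCJK = (codeUnit: number) => boolean;

// A break unit: its width and whether a space precedes it.
interface Unit {
  width: number;
  spaceBefore: boolean;
}

// Before the fix: only Hangul syllables count as CJK.
const syllablesOnly: IsCJK = (c) => c >= 0xac00 && c <= 0xd7af;
// After the fix: Compatibility Jamo count as CJK too.
const withCompatJamo: IsCJK = (c) =>
  (c >= 0x3130 && c <= 0x318f) || syllablesOnly(c);

// Each CJK character becomes its own unit; other non-space runs stay atomic.
function toUnits(text: string, isCJK: IsCJK, charWidth: number): Unit[] {
  const units: Unit[] = [];
  let pendingSpace = false;
  let runLength = 0; // length of the current non-CJK run
  let runSpace = false; // whether a space preceded that run
  const flushRun = (): void => {
    if (runLength > 0) {
      units.push({ width: runLength * charWidth, spaceBefore: runSpace });
      runLength = 0;
    }
  };
  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i);
    if (c === 0x20) {
      flushRun();
      pendingSpace = true;
    } else if (isCJK(c)) {
      flushRun();
      units.push({ width: charWidth, spaceBefore: pendingSpace });
      pendingSpace = false;
    } else {
      if (runLength === 0) {
        runSpace = pendingSpace;
        pendingSpace = false;
      }
      runLength++;
    }
  }
  flushRun();
  return units;
}

// Greedy fit: start a new line whenever the next unit would overflow.
function countLines(units: Unit[], maxWidth: number, spaceWidth: number): number {
  if (units.length === 0) return 0;
  let lines = 1;
  let lineWidth = 0;
  for (const u of units) {
    const lead = lineWidth > 0 && u.spaceBefore ? spaceWidth : 0;
    if (lineWidth > 0 && lineWidth + lead + u.width > maxWidth) {
      lines++;
      lineWidth = u.width; // the space is dropped at the line break
    } else {
      lineWidth += lead + u.width;
    }
  }
  return lines;
}

// Toy metrics (assumptions): 20px per character, 10px per space, 100px wide.
const sample = "진짜 ㅋㅋㅋ 진짜";
const linesBefore = countLines(toUnits(sample, syllablesOnly, 20), 100, 10);
const linesAfter = countLines(toUnits(sample, withCompatJamo, 20), 100, 10);
```

Under these toy metrics the atomic ㅋㅋㅋ run cannot share the first line and pushes the text onto an extra line, while the per-character version lets part of the run stay on the first line.
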

Verification

All existing tests pass (bun test).

Added scripts/korean-check.ts — a dedicated oracle checker (19 cases) following the keep-all-check.ts pattern.
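The checker's source is not shown in this conversation. As a rough sketch of what an oracle check of this kind compares, the snippet below takes a predicted line count and a browser line count per case and prints a report in the shape of the output that follows. `OracleCase` and `formatReport` are hypothetical names, and the FAIL line format is an assumption, since only PASS lines appear here.

```typescript
// One oracle case: pretext's predicted line count vs the browser's actual count.
interface OracleCase {
  label: string;
  predictedLines: number;
  browserLines: number;
}

// Hypothetical report formatter; the FAIL layout is invented for illustration.
function formatReport(browser: string, cases: OracleCase[]): string[] {
  const out: string[] = [];
  let passed = 0;
  for (const c of cases) {
    const ok = c.predictedLines === c.browserLines;
    if (ok) passed++;
    out.push(
      ok
        ? "  ✓ PASS  " + c.label + "  [" + c.browserLines + " lines]"
        : "  ✗ FAIL  " + c.label +
          "  [predicted " + c.predictedLines + ", browser " + c.browserLines + "]"
    );
  }
  out.push("Summary: " + browser + " " + passed + "/" + cases.length + " pass");
  return out;
}

// The two pre-fix mismatches from the Reproduction section.
const preFixReport = formatReport("chrome", [
  { label: "ㅋㅋㅋ at 200px", predictedLines: 4, browserLines: 3 },
  { label: "ㄹㅇ at 150px", predictedLines: 5, browserLines: 4 },
]);
```
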

After fix — Chrome:

Korean Layout Check — Chrome
────────────────────────────────────────────────────────────
  ✓ PASS  B1: Hangul Jamo standalone (U+1100)             [3 lines]
  ✓ PASS  B2: Hangul Compatibility Jamo (U+3130)          [4 lines]
  ✓ PASS  B3: Korean+English mixed                        [3 lines]
  ✓ PASS  B4: Korean+numbers mixed                        [4 lines]
  ✓ PASS  B5: Korean+CJK punctuation                      [4 lines]
  ✓ PASS  B6: NBSP + Korean                               [3 lines]
  ✓ PASS  B2c-w160: ㅠㅠ crying expression (160px)        [4 lines]
  ✓ PASS  B2c-w140: ㅠㅠ crying expression (140px)        [5 lines]
  ✓ PASS  B2d-w150: ㄹㅇ literally slang (150px)          [4 lines]
  ✓ PASS  B2f-w150: ㅇㅋ/ㄴㄴ okay/nope slang (150px)     [5 lines]
  ✓ PASS  B2b: ㅋㅋ laughter slang mixed                  [3 lines]
  ✓ PASS  B2c: ㅠㅠ crying expression                     [3 lines]
  ✓ PASS  B2d: ㄹㅇ literally slang mid-sentence          [3 lines]
  ✓ PASS  B2e: consonants-only run                        [5 lines]
  ✓ PASS  B2f: ㅇㅋ/ㄴㄴ okay/nope internet slang         [4 lines]
  ✓ PASS  C1: keep-all + narrow width                     [9 lines]
  ✓ PASS  C2: keep-all + Korean+English mixed             [4 lines]
  ✓ PASS  C3: pre-wrap + Korean hard break                [2 lines]
  ✓ PASS  C4: pre-wrap + tab + Korean                     [1 lines]

Summary: chrome 19/19 pass

After fix — Safari:

Korean Layout Check — Safari
────────────────────────────────────────────────────────────
  ✓ PASS  B1: Hangul Jamo standalone (U+1100)             [3 lines]
  ✓ PASS  B2: Hangul Compatibility Jamo (U+3130)          [4 lines]
  ✓ PASS  B3: Korean+English mixed                        [3 lines]
  ✓ PASS  B4: Korean+numbers mixed                        [4 lines]
  ✓ PASS  B5: Korean+CJK punctuation                      [4 lines]
  ✓ PASS  B6: NBSP + Korean                               [3 lines]
  ✓ PASS  B2c-w160: ㅠㅠ crying expression (160px)        [4 lines]
  ✓ PASS  B2c-w140: ㅠㅠ crying expression (140px)        [5 lines]
  ✓ PASS  B2d-w150: ㄹㅇ literally slang (150px)          [4 lines]
  ✓ PASS  B2f-w150: ㅇㅋ/ㄴㄴ okay/nope slang (150px)     [5 lines]
  ✓ PASS  B2b: ㅋㅋ laughter slang mixed                  [3 lines]
  ✓ PASS  B2c: ㅠㅠ crying expression                     [3 lines]
  ✓ PASS  B2d: ㄹㅇ literally slang mid-sentence          [3 lines]
  ✓ PASS  B2e: consonants-only run                        [5 lines]
  ✓ PASS  B2f: ㅇㅋ/ㄴㄴ okay/nope internet slang         [4 lines]
  ✓ PASS  C1: keep-all + narrow width                     [9 lines]
  ✓ PASS  C2: keep-all + Korean+English mixed             [4 lines]
  ✓ PASS  C3: pre-wrap + Korean hard break                [2 lines]
  ✓ PASS  C4: pre-wrap + tab + Korean                     [1 lines]

Summary: safari 19/19 pass

mayrang added 2 commits April 17, 2026 18:27
isCJKCodePoint() included Hangul syllables (U+AC00–U+D7AF) but not
Hangul Compatibility Jamo (U+3130–U+318F), the standalone consonants
and vowels (ㄱ ㄴ ㄷ ㅋ ㅠ ...) used constantly in Korean digital text.

Without CJK classification these characters were treated as atomic
segments the line-breaker could not split at grapheme boundaries,
producing line counts 1 too high for common expressions like ㅋㅋ, ㄹㅇ.

Verified against Chrome and Safari.
@chenglou
Owner

Thanks! I've implemented this and credited you as co-author.

@chenglou chenglou closed this Apr 17, 2026
chenglou added a commit that referenced this pull request Apr 18, 2026
Normalize chunked batch line starts through the same segment-kind policy used by streaming so layoutWithLines(), walkLineRanges(), layoutNextLine(), and layout() stay aligned after zero-width break opportunities and collapsible spaces.

Classify Hangul Compatibility Jamo (U+3130..U+318F) as CJK so common Korean compatibility jamo runs break like browser text.

Refs #129, #135, #141.
Closes #121
Closes #142

Co-authored-by: mayrang <pkss0626@naver.com>
Co-authored-by: voidborne-d <voidborne-d@users.noreply.github.com>
Co-authored-by: lttlin <lttlin@gmail.com>
@chenglou
Owner

@mayrang got your message! The previous commit's formatting was screwed up. Fixed!

@mayrang
Contributor Author

mayrang commented Apr 18, 2026

Awesome, thanks for fixing it! Really appreciate your help. Have a great weekend.

nice-hang pushed a commit to nice-hang/pretext that referenced this pull request Apr 18, 2026
Normalize chunked batch line starts through the same segment-kind policy used by streaming so layoutWithLines(), walkLineRanges(), layoutNextLine(), and layout() stay aligned after zero-width break opportunities and collapsible spaces.

Classify Hangul Compatibility Jamo (U+3130..U+318F) as CJK so common Korean compatibility jamo runs break like browser text.

Refs chenglou#129, chenglou#135, chenglou#141.
Closes chenglou#121
Closes chenglou#142

Co-authored-by: mayrang <pkss0626@naver.com>
Co-authored-by: voidborne-d <voidborne-d@users.noreply.github.com>
Co-authored-by: lttlin <lttlin@gmail.com>
Development

Successfully merging this pull request may close these issues.

Bug: Hangul Compatibility Jamo (U+3130–U+318F) causes incorrect line breaks — affects common Korean expressions like ㅋㅋ, ㅠㅠ, ㄹㅇ