fix: classify Hangul Compatibility Jamo (U+3130–U+318F) as CJK #141
Closed
mayrang wants to merge 3 commits into chenglou:main from
Conversation
isCJKCodePoint() included Hangul syllables (U+AC00–U+D7AF) but not Hangul Compatibility Jamo (U+3130–U+318F), the standalone consonants and vowels (ㄱ ㄴ ㄷ ㅋ ㅠ ...) used constantly in Korean digital text. Without CJK classification these characters were treated as atomic segments the line-breaker could not split at grapheme boundaries, producing line counts 1 too high for common expressions like ㅋㅋ, ㄹㅇ. Verified against Chrome and Safari.
Owner
Thanks! I've implemented this and credited you as co-author.
chenglou added a commit that referenced this pull request on Apr 18, 2026
Normalize chunked batch line starts through the same segment-kind policy used by streaming so layoutWithLines(), walkLineRanges(), layoutNextLine(), and layout() stay aligned after zero-width break opportunities and collapsible spaces.

Classify Hangul Compatibility Jamo (U+3130..U+318F) as CJK so common Korean compatibility jamo runs break like browser text.

Refs #129, #135, #141.
Closes #121
Closes #142
Co-authored-by: mayrang <pkss0626@naver.com>
Co-authored-by: voidborne-d <voidborne-d@users.noreply.github.com>
Co-authored-by: lttlin <lttlin@gmail.com>
Owner
@mayrang got your message! Prev commits formatting screwed up. Fixed!
Contributor, Author
Awesome, thanks for fixing it! Really appreciate your help. Have a great weekend.
nice-hang pushed a commit to nice-hang/pretext that referenced this pull request on Apr 18, 2026
Closes #142
Problem
isCJKCodePoint() in src/analysis.ts includes Hangul syllables (U+AC00–U+D7AF) but not Hangul Compatibility Jamo (U+3130–U+318F), the standalone consonants and vowels (ㄱ ㄴ ㄷ ㅋ ㅠ ...) that Korean speakers type constantly in digital communication.
These appear in virtually every Korean chat conversation. Without CJK classification, these characters are treated as atomic segments the line-breaker cannot split at grapheme boundaries, causing line counts 1 too high compared to Chrome and Safari.
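To make the gap concrete, here is a minimal sketch of the two Hangul ranges involved. It is not the actual isCJKCodePoint() from src/analysis.ts (which covers more than Hangul); the helper names `inRange`, `isHangulCJKBefore`, and `isHangulCJKAfter` are hypothetical and exist only for illustration.

```typescript
// Illustrative sketch only: a simplified stand-in for the Hangul part of
// isCJKCodePoint(). All helper names here are hypothetical.
const HANGUL_COMPAT_JAMO: readonly [number, number] = [0x3130, 0x318f];
const HANGUL_SYLLABLES: readonly [number, number] = [0xac00, 0xd7af];

function inRange(codePoint: number, [lo, hi]: readonly [number, number]): boolean {
  return codePoint >= lo && codePoint <= hi;
}

// Behavior described for main before the fix: syllables only.
function isHangulCJKBefore(codePoint: number): boolean {
  return inRange(codePoint, HANGUL_SYLLABLES);
}

// Behavior after the fix: compatibility jamo are included as well.
function isHangulCJKAfter(codePoint: number): boolean {
  return inRange(codePoint, HANGUL_COMPAT_JAMO) || inRange(codePoint, HANGUL_SYLLABLES);
}

// Print the code point and both classifications for a few sample characters.
for (const ch of ["ㅋ", "ㄹ", "ㅠ", "한"]) {
  const cp = ch.codePointAt(0)!;
  const hex = "U+" + cp.toString(16).toUpperCase();
  console.log(ch, hex, "before:", isHangulCJKBefore(cp), "after:", isHangulCJKAfter(cp));
}
```

In this sketch the jamo ㅋ, ㄹ, and ㅠ are rejected by the syllables-only check and accepted once the U+3130–U+318F range is added, while the syllable 한 is accepted either way.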
Reproduction
Wrong line count (most severe)
Wrong line count
Reproduced in: Chrome ✓ Safari ✓
The bug is width-sensitive — all Hangul Compatibility Jamo are affected, but failures appear at different widths depending on the character.
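The width sensitivity can be illustrated with a toy greedy line breaker. This is a sketch under loud assumptions, not pretext's algorithm: every character is 1 unit wide, a space that would overflow a line is dropped at the break, and `segment` and `countLines` are hypothetical helpers. With `splitJamo` set to false, a jamo run stays one atomic segment (the behavior described for main); with true, each jamo is its own breakable unit.

```typescript
// Toy model only: hypothetical helpers, 1 unit of width per character.
type Segment = { text: string; isSpace: boolean };

function segment(text: string, splitJamo: boolean): Segment[] {
  const out: Segment[] = [];
  // Splitting on a captured space keeps the spaces as their own pieces.
  for (const piece of text.split(/( )/)) {
    if (piece === "") continue;
    if (piece === " ") {
      out.push({ text: piece, isSpace: true });
      continue;
    }
    const isJamoRun = /^[\u3130-\u318F]+$/.test(piece);
    if (splitJamo && isJamoRun) {
      // After the fix: every compatibility jamo is a separate unit.
      for (const ch of piece) out.push({ text: ch, isSpace: false });
    } else {
      // Before the fix: the whole run is one unbreakable segment.
      out.push({ text: piece, isSpace: false });
    }
  }
  return out;
}

function countLines(segments: Segment[], maxWidth: number): number {
  let lines = 1;
  let used = 0;
  for (const seg of segments) {
    const width = [...seg.text].length;
    if (used + width <= maxWidth) {
      used += width;
      continue;
    }
    if (seg.isSpace) continue; // an overflowing space is dropped at the break
    lines += 1;
    used = width;
  }
  return lines;
}

const text = "a ㅋㅋㅋ b";
for (const width of [3, 4, 5, 7]) {
  const atomic = countLines(segment(text, false), width);
  const split = countLines(segment(text, true), width);
  console.log(`width ${width}: atomic=${atomic} split=${split}`);
}
```

In this toy model the two counts agree at widths 3, 5, and 7, but at width 4 the atomic treatment yields 3 lines against 2 for the per-jamo treatment, which mirrors the "1 too high at some widths" pattern reported above.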
Reproduce without this PR's script (using the existing /probe page on main):
Run bun start, then open:
The page shows pretext's predicted lines vs the browser's actual lines side by side. On main (before this fix), it will show a mismatch.
Fix

    // src/analysis.ts: isCJKCodePoint()
    + (codePoint >= 0x3130 && codePoint <= 0x318F) || // Hangul Compatibility Jamo (ㄱ-ㅣ)
      (codePoint >= 0xAC00 && codePoint <= 0xD7AF) ||

Without this range, ㅋㅋㅋ is treated as a single word unit rather than being split into individual CJK grapheme units. The line-breaker tries to fit the entire run as one piece, while browsers allow breaks between consecutive Compatibility Jamo at word boundaries, the same way they handle other CJK characters.
Verification
All existing tests pass (bun test).
Added scripts/korean-check.ts, a dedicated oracle checker (19 cases) following the keep-all-check.ts pattern.
After fix (Chrome):
After fix (Safari):