fix: vocabulary replacement fails for English words adjacent to CJK characters #93

Open
sysalpha01 wants to merge 1 commit into amicalhq:main from sysalpha01:fix/cjk-vocabulary-replacement

sysalpha01 commented Feb 9, 2026

Summary

  • Fix vocabulary replacement regex that fails when English words appear adjacent to CJK (Japanese/Chinese/Korean) characters
  • Change \p{L} to \p{Script=Latin} in word boundary lookahead/lookbehind
  • Latin word boundary protection still works (e.g., "apple" in "pineapple" is not replaced)

Problem

The word boundary regex in applyTextReplacements() uses \p{L} (all Unicode letters) in negative lookahead/lookbehind. Since \p{L} matches CJK characters, English words adjacent to Japanese text (e.g., Xavix in Xavixの設定) are never replaced because the CJK character triggers the boundary check.
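
The root cause can be checked in isolation. The following is a minimal probe, not code from this PR: it tests single characters from the example against the two property classes. The constant names are illustrative only.

```typescript
// Probe which Unicode property classes match the characters from the example.
// The `u` flag is required for \p{...} escapes; the RegExp constructor is used
// so the pattern is not subject to compile-target checks on regex literals.
const anyLetter = new RegExp("^\\p{L}$", "u");
const latinLetter = new RegExp("^\\p{Script=Latin}$", "u");

for (const ch of ["x", "の", "設"]) {
  console.log(
    ch,
    "matches \\p{L}:", anyLetter.test(ch),
    "| matches \\p{Script=Latin}:", latinLetter.test(ch),
  );
}
```

Hiragana and Han characters are letters in the Unicode sense, so they satisfy \p{L} and make a negative lookahead built on it fail. They are not in the Latin script, so a check built on \p{Script=Latin} lets the match through.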

Example

With vocabulary entry Xavix → ZABBIX:

  • Input: Xavixの設定
  • Before fix: Xavixの設定 (no replacement: の matches \p{L}, so the negative lookahead rejects the match)
  • After fix: ZABBIXの設定 (correctly replaced: の does not match \p{Script=Latin})

Changes

apps/desktop/src/utils/text-replacement.ts:

  • Replace \p{L} with \p{Script=Latin} in the word boundary regex
  • This ensures only Latin script characters are considered for word boundaries
  • CJK characters adjacent to English words no longer block replacement
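
A sketch of the change described in the list above, under stated assumptions: `escapeRegExp` and `buildWordPattern` are hypothetical helpers written for illustration, not the contents of text-replacement.ts, and the numeric part of the boundary class is left out to keep the comparison focused on the letter class.

```typescript
// Hypothetical helper: escape regex metacharacters in a vocabulary term.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a boundary-guarded pattern for one term.
// `boundaryClass` is the class that must NOT appear directly before or after the match.
function buildWordPattern(term: string, boundaryClass: string): RegExp {
  return new RegExp(
    `(?<!${boundaryClass})${escapeRegExp(term)}(?!${boundaryClass})`,
    "gu",
  );
}

const oldPattern = buildWordPattern("Xavix", "\\p{L}");
const newPattern = buildWordPattern("Xavix", "\\p{Script=Latin}");

const input = "Xavixの設定";
const beforeResult = input.replace(oldPattern, "ZABBIX");
const afterResult = input.replace(newPattern, "ZABBIX");

console.log("old boundary:", beforeResult);
console.log("new boundary:", afterResult);
```

With the old class the input comes back unchanged; with the Latin-only class the term is replaced, which is the behaviour the Example section reports.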

Test Plan

  • Verified Xavixの設定 → ZABBIXの設定 works correctly
  • Verified pineapple does not have apple replaced (Latin boundary still works)
  • Verified pure CJK replacements (e.g., ザビックス → ZABBIX) still work
  • Tested with Amical Cloud transcription on Windows
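
The manual checks above could be written as a small table-driven test. This is only a sketch: `replaceTerm` is a hypothetical stand-in for applyTextReplacements(), the CJK branch is assumed to be a plain substring replacement (the thread does not show it), and the replacement word "orange" in the pineapple case is invented for the example.

```typescript
// Assumption: a term containing any of these scripts takes the CJK branch.
const CJK_CHAR = new RegExp(
  "[\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}]",
  "u",
);

// Hypothetical stand-in for the real replacement function.
function replaceTerm(text: string, from: string, to: string): string {
  if (CJK_CHAR.test(from)) {
    // Assumed CJK branch: no word boundaries, plain substring replacement.
    return text.split(from).join(to);
  }
  const escaped = from.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Non-CJK branch: only Latin-script neighbours block the match.
  const pattern = new RegExp(
    `(?<!\\p{Script=Latin})${escaped}(?!\\p{Script=Latin})`,
    "gu",
  );
  // A function replacer avoids special handling of "$" in `to`.
  return text.replace(pattern, () => to);
}

const cases: Array<[string, string, string, string]> = [
  // [input, from, to, expected]
  ["Xavixの設定", "Xavix", "ZABBIX", "ZABBIXの設定"],
  ["pineapple", "apple", "orange", "pineapple"],
  ["ザビックス", "ザビックス", "ZABBIX", "ZABBIX"],
];

const failures = cases.filter(
  ([input, from, to, expected]) => replaceTerm(input, from, to) !== expected,
);
console.log(failures.length === 0 ? "all cases pass" : failures);
```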

Summary by CodeRabbit

  • Bug Fixes
    • Text replacement now properly handles mixed-script text, including CJK characters alongside Latin characters.

…haracters

The word boundary regex used \p{L} (all Unicode letters) which includes
CJK characters, preventing replacement of English words when they appear
next to Japanese/Chinese/Korean text (e.g., "Xavixの設定" would not
replace "Xavix" because "の" matched \p{L} in the lookahead).

Changed \p{L} to \p{Script=Latin} so the word boundary check only
considers Latin script characters. This allows vocabulary replacements
to work correctly in CJK contexts while still preventing partial matches
within Latin words (e.g., "apple" in "pineapple" is still protected).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

The change modifies the word-boundary regex in the non-CJK branch of applyTextReplacements to use Latin-script boundaries instead of broader alphabetic/numeric boundaries. This allows CJK characters to be adjacent to matched text while still preventing partial matches inside Latin words.

Changes

Cohort / File(s) | Summary
Text Replacement Logic (apps/desktop/src/utils/text-replacement.ts) | Modified word-boundary regex from \p{L}\p{N} to [a-zA-Z0-9] in non-CJK branch, narrowing boundary checks to Latin script and allowing CJK character adjacency. Updated comments to reflect Latin-script focus.
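
The PR description names \p{Script=Latin}, while the summary above names [a-zA-Z0-9]; the thread does not show which one the diff contains. The two classes are not equivalent, and the probe below (illustrative names, not project code) shows where they differ.

```typescript
// Compare the two candidate boundary classes on single characters.
const asciiClass = new RegExp("^[a-zA-Z0-9]$", "u");
const latinScript = new RegExp("^\\p{Script=Latin}$", "u");

// "\u00e9" is the precomposed letter é.
const probes = ["a", "\u00e9", "5", "の"];

const rows = probes.map((ch) => ({
  ch,
  ascii: asciiClass.test(ch),
  latin: latinScript.test(ch),
}));

for (const row of rows) {
  console.log(row.ch, "[a-zA-Z0-9]:", row.ascii, "| \\p{Script=Latin}:", row.latin);
}
```

Accented Latin letters such as é belong to the Latin script but fall outside the ASCII class, and digits fall inside the ASCII class but have script Common rather than Latin. Whichever class is used, neither matches の, so both unblock the CJK case; they differ only in how accented letters and digits next to a term are treated.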

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • amicalhq/amical#81: Introduced the original applyTextReplacements logic that this PR directly modifies.

Suggested reviewers

  • haritabh-z01

Poem

🐰 A boundary drawn with Latin grace,
No CJK chars out of place,
"Xavixの設定" now sings true,
Word edges sharp, regex anew! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately describes the main fix: it addresses a specific bug where vocabulary replacement fails when English words are adjacent to CJK characters, which aligns with the primary change in the regex modification.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


No actionable comments were generated in the recent review. 🎉

