Description
When the text passed to generateEmbedding() contains an incomplete or malformed UTF-8 multibyte sequence (e.g. a Persian/Arabic character cut mid-character by an earlier byte-level substr() call), the current sanitization path calls iconv('UTF-8', 'UTF-8//TRANSLIT//IGNORE', $text). Despite the //IGNORE flag, PHP's iconv() still reports an incomplete trailing multibyte character. The message is logged as a PHP Notice/Warning and can be promoted to an exception if a strict error handler is registered.
Error
PHP Notice: iconv(): Detected an incomplete multibyte character in input string
in src/Ai/Infrastructure/Services/OpenAi/EmbeddingService.php:53
Root Cause
iconv() with //IGNORE skips invalid byte sequences but still triggers a warning for incomplete multibyte characters at the end of the string. This is a known PHP behavior.
Steps to Reproduce
- Provide a text string that contains multibyte UTF-8 characters (e.g. Persian or Arabic script).
- Truncate the string mid-character using a byte-level operation such as substr().
- Pass the resulting string to generateEmbedding().
- Observe the PHP warning in the error log.
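The steps above can be sketched as a standalone script that calls iconv() directly instead of going through generateEmbedding(). The sample word, the 5-byte cut, and the temporary error handler are illustrative assumptions, not taken from the project code. Whether iconv() reports anything, and at which severity, depends on the PHP version and the underlying iconv implementation, so the script only collects the messages rather than asserting on them.

```php
<?php
// Illustrative sample: 4 Persian/Arabic letters, 2 bytes each in UTF-8 (8 bytes total).
$text = 'سلام';

// Byte-level cut: keeps 2 complete letters plus the first byte of the third.
$truncated = substr($text, 0, 5);

// The truncated string is no longer valid UTF-8.
$isValid = mb_check_encoding($truncated, 'UTF-8');

// Collect anything iconv() reports instead of letting it reach the error log.
$messages = [];
set_error_handler(function (int $severity, string $message) use (&$messages): bool {
    $messages[] = $message;
    return true; // mark the error as handled
});

// Same call as the current sanitization path described in this issue.
$result = iconv('UTF-8', 'UTF-8//TRANSLIT//IGNORE', $truncated);

restore_error_handler();

var_dump($isValid, $messages, $result);
```

On an affected setup, $messages should contain the "Detected an incomplete multibyte character in input string" message quoted in the Error section.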
Fix
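Replace iconv() with mb_convert_encoding($text, 'UTF-8', 'UTF-8'), which handles invalid and incomplete byte sequences without emitting a notice or warning. Note that with the default mbstring configuration each invalid sequence is replaced by the substitute character (?) rather than dropped. If the bytes should be removed entirely, call mb_substitute_character('none') before converting (or set mbstring.substitute_character=none in php.ini).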
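A minimal sketch of the proposed replacement, assuming the mbstring extension is available. The helper name sanitizeUtf8() is hypothetical and does not exist in EmbeddingService.php. As far as I know, mbstring by default replaces invalid sequences with ? instead of removing them, so this sketch temporarily switches the substitute character to 'none' and restores the previous value afterwards.

```php
<?php
/**
 * Hypothetical helper: returns $text with invalid or incomplete
 * UTF-8 byte sequences removed, without triggering iconv() notices.
 */
function sanitizeUtf8(string $text): string
{
    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();

    // 'none' tells mbstring to drop invalid sequences instead of inserting '?'.
    mb_substitute_character('none');

    try {
        $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    } finally {
        // Restore the caller's setting even if the conversion throws.
        mb_substitute_character($previous);
    }

    // Guard against a false return on older PHP versions.
    return is_string($clean) ? $clean : '';
}

// Usage: a string cut mid-character by a byte-level substr().
$clean = sanitizeUtf8(substr('سلام', 0, 5));
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

Restoring the previous substitute character keeps the change local to this helper, so other mbstring calls in the same request are not affected.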
Affected File
src/Ai/Infrastructure/Services/OpenAi/EmbeddingService.php