PHP Warning: iconv() detects incomplete multibyte character in EmbeddingService input #7

@xinpehr

Description
When text passed to generateEmbedding() contains an incomplete or malformed UTF-8 multibyte sequence (e.g. a Persian/Arabic character cut partway through its byte sequence by a prior substr() call), the current sanitization path calls iconv('UTF-8', 'UTF-8//TRANSLIT//IGNORE', $text). Despite the //IGNORE flag, PHP's iconv() still raises a notice for an incomplete trailing multibyte character. The notice is written to the error log and can be promoted to an exception if a strict error handler is registered.

Error

PHP Notice: iconv(): Detected an incomplete multibyte character in input string
  in src/Ai/Infrastructure/Services/OpenAi/EmbeddingService.php:53

Root Cause
iconv() with //IGNORE skips invalid byte sequences but still raises a notice when the string ends in an incomplete multibyte character. This is a known PHP behavior.

Steps to Reproduce

  1. Provide a text string that contains multibyte UTF-8 characters (e.g. Persian or Arabic script).
  2. Truncate the string mid-character using a byte-level operation such as substr().
  3. Pass the resulting string to generateEmbedding().
  4. Observe the PHP notice in the error log.
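
A minimal reproduction sketch of the steps above. The sample string and variable names are illustrative and not taken from the project, and generateEmbedding() is stood in for by a direct call to the iconv() expression quoted in the description. Whether the notice actually fires may depend on the iconv implementation PHP was built against, so the script only collects whatever messages are raised.

```php
<?php
// Illustrative input: four Persian letters, each 2 bytes in UTF-8 (8 bytes total).
$text = 'سلام';

// Byte-level truncation: 3 bytes keeps the first letter plus half of the second.
$truncated = substr($text, 0, 3);

// Collect any notices or warnings instead of letting them reach the log.
$notices = [];
set_error_handler(function (int $errno, string $errstr) use (&$notices): bool {
    $notices[] = $errstr;
    return true; // mark the error as handled
});

// Same call shape as the sanitization path described in this issue.
$result = iconv('UTF-8', 'UTF-8//TRANSLIT//IGNORE', $truncated);

restore_error_handler();

// The truncated string is no longer valid UTF-8.
var_dump(mb_check_encoding($truncated, 'UTF-8'));

// On affected setups this contains the "incomplete multibyte character" message.
print_r($notices);
```

Using mb_substr() instead of substr() at the truncation site would avoid producing the broken sequence in the first place, but the service should still tolerate malformed input from other sources.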

Fix
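Replace iconv() with mb_convert_encoding($text, 'UTF-8', 'UTF-8'), which handles invalid and incomplete byte sequences without emitting any warning. Note that by default mbstring replaces each invalid sequence with the substitute character ('?') rather than dropping it. To drop such sequences instead, set the substitute character to none, either with mb_substitute_character('none') or through the mbstring.substitute_character ini setting.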
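
A minimal sketch of what the replacement could look like, assuming the goal is to drop broken sequences rather than substitute them. sanitizeUtf8() is a hypothetical helper name, not the project's actual method. It sets the substitute character to none only for the duration of the call and then restores the previous value.

```php
<?php
/**
 * Hypothetical helper (not the project's actual code): returns $text with
 * invalid or incomplete UTF-8 byte sequences removed, without raising notices.
 */
function sanitizeUtf8(string $text): string
{
    // Remember the current substitute character (an int code point or a
    // keyword such as "none") so the global setting can be restored.
    $previous = mb_substitute_character();

    // "none" tells mbstring to drop invalid sequences instead of emitting "?".
    mb_substitute_character('none');

    try {
        $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
        return is_string($clean) ? $clean : '';
    } finally {
        // Restore the caller's setting even if conversion throws.
        mb_substitute_character($previous);
    }
}

// Usage: the dangling lead byte left by substr() is dropped.
$broken = substr('سلام', 0, 3);
$clean = sanitizeUtf8($broken);
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

On PHP 7.2 and later, mb_scrub($text, 'UTF-8') is another option for the same-encoding case. It also honors the substitute character setting.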

Affected File
src/Ai/Infrastructure/Services/OpenAi/EmbeddingService.php
