Replacement character handling is inconsistent/incomplete #42

@se-mz

Description

The following encodings have decoders that can emit #\Replacement_Character even though their encoders don't accept it: :cp1251, :iso-8859-3, :iso-8859-6, :iso-8859-7, :iso-8859-8, :iso-8859-11. :ebcdic-international has a similar issue, but with #\U+FFFF instead. :ebcdic-us seems to substitute various Latin-1 code points such as the private use characters, but from what little I know about EBCDIC, that might actually be the correct behavior.

I would expect octets-to-string output to be valid input to string-to-octets, even if chaining the two need not result in the same bytes. It's not quite clear what the behavior should be because the only other encodings in babel that run into this edge case (:cp1252, :gbk, :eucjp, :cp932) lack error checks for it entirely. I actually have a patch more or less prepared for that already, but it should be consistent with the rest.

In my opinion, signalling an error is the right thing to do when errorp is set; otherwise, the ASCII substitution byte (which seems to be available in all supported encodings) could be used. decoding-error conveniently does this out of the box.
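To make the proposed policy concrete, here is a minimal sketch in Python rather than Lisp, using Python's standard codecs as a stand-in for babel. It does not show babel's implementation or API: `decode_lenient`, `encode_with_policy`, and the `errorp` parameter are hypothetical names that only mirror the terms used above, and `SUB` assumes that "the ASCII substitution byte" means 0x1A. Byte 0x98 is unassigned in Python's cp1251 codec, so it serves as the undecodable input.

```python
# Assumption: "ASCII substitution byte" is taken to mean SUB (0x1A).
SUB = b"\x1a"


def decode_lenient(octets: bytes, encoding: str) -> str:
    """Decode, turning undecodable bytes into U+FFFD (the replacement character)."""
    return octets.decode(encoding, errors="replace")


def encode_with_policy(text: str, encoding: str, errorp: bool = True) -> bytes:
    """Encode character by character (fine for stateless single-byte codecs).

    Unencodable characters, including U+FFFD, raise when errorp is true
    and are replaced by SUB otherwise.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            if errorp:
                raise
            out += SUB
    return bytes(out)


# 0xDF is a valid cp1251 byte; 0x98 is unassigned and decodes to U+FFFD.
decoded = decode_lenient(b"\xdf\x98", "cp1251")

# With errorp disabled, the decoder's output is accepted by the encoder,
# even though the resulting bytes differ from the original input.
reencoded = encode_with_policy(decoded, "cp1251", errorp=False)
```

With `errorp=True`, the same call raises `UnicodeEncodeError` instead of substituting. Either way, the decoder's output is handled by a defined code path in the encoder, which is the property asked for above.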

Note that this overlaps heavily with the first half of #41. Both have the same underlying issue.
