This repository was archived by the owner on Dec 31, 2024. It is now read-only.

Fixed national character encoding. #104

Closed
sp9usb wants to merge 1 commit into smiley22:master from sp9usb:master

sp9usb commented Jan 30, 2015

No description provided.

Contributor

NiKiZe commented Jan 30, 2015

This is not a real fix, just an "it works in more cases" change: if the encoding is an 8-bit one such as Shift-JIS, for example, this would still fail.
It is wrong to make an assumption about the encoding.
The real fix would be to handle everything internally as bytes, find the encoding of each part of the message itself, and then decode to strings with the correct encoding (decoding headers the same way as the main body if the encoding is defined there as an 8-bit variant).
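As a rough illustration of the bytes-first approach described above, here is a minimal Python sketch. It is not S22.Imap code; the function name `decode_text_parts` is made up for this example, and it relies only on Python's standard `email` package. The message stays as bytes until each text part is decoded with the charset that part declares.

```python
from email import message_from_bytes


def decode_text_parts(raw_message: bytes) -> list:
    """Return a (charset, text) pair for every text/* part of a raw message.

    Hypothetical helper: the name and return shape are assumptions
    made for this illustration.
    """
    message = message_from_bytes(raw_message)
    decoded = []
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue
        # Bytes of the part after undoing base64 / quoted-printable.
        payload = part.get_payload(decode=True) or b""
        # Charset declared by this part; plain ASCII if nothing is declared.
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: keep the data visible instead of failing.
            text = payload.decode("ascii", errors="replace")
        decoded.append((charset, text))
    return decoded
```

No single encoding is assumed for the whole message, so an ISO-8859-2 part and a Shift-JIS part in the same message are each decoded with their own charset.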

An intermediate solution would be to have the "global" S22Imap encoding as a setting that can be changed at runtime. An example of this is #96, but that PR has several other changes that make it "invalid".
I'm strongly against changing ASCII to some other hard-coded value.

Some of this is discussed in great detail in #47

@jstedfast

For what it's worth (and I don't mean to beat a dead horse), MimeKit handles undeclared 8-bit text in headers in what I would describe as probably the only sane way possible.

MimeKit's parser optionally takes a ParserOptions instance, which provides various configuration options for the parser. One of these is the CharsetEncoding option, which is used as a fallback charset when the parser encounters undeclared 8-bit text in headers.

The process goes something like this: The parser tries to convert the 8-bit header value[1] into a System.String using UTF-8. If that fails, then the parser tries the charset provided in the ParserOptions. If that also fails, then it falls back to ISO-8859-1.

The reason for this order of preference is that the latest email specifications allow for UTF-8 encoded email headers, and so going forward, it's probably reasonable to try UTF-8 first since that is an accepted standard. Even if it weren't, though, UTF-8 is still a good charset to try first since it is quite common (due largely to the fact that many systems these days use UTF-8 as their default locale charset). The user-supplied charset is tried second rather than first because the user may have selected ISO-8859-1 (perhaps because it is their locale charset, or because they live in a western country where Latin-1 is the most common). ISO-8859-1 will convert any sequence of 8-bit bytes cleanly, whether the text really is ISO-8859-1 or not, so it should ONLY be tried last (which is also why it is always the final fallback).
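The order of preference described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MimeKit's actual implementation; the function name `decode_undeclared` and its `user_charset` parameter are assumptions made for this example.

```python
from typing import Optional


def decode_undeclared(raw: bytes, user_charset: Optional[str] = None) -> str:
    """Decode undeclared 8-bit header bytes.

    Hypothetical helper: tries UTF-8, then the user-supplied charset,
    then ISO-8859-1 as the last resort.
    """
    candidates = ["utf-8"]
    if user_charset:
        candidates.append(user_charset)
    for charset in candidates:
        try:
            # Strict decoding, so an invalid byte sequence rejects the charset.
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    # ISO-8859-1 maps every byte to a character, so this step cannot fail.
    return raw.decode("iso-8859-1")
```

Because the first two attempts decode strictly, a charset is accepted only if the bytes are valid in it. If ISO-8859-1 were tried before UTF-8, it would always "succeed" and valid UTF-8 text would come out as mojibake.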

Now... even if none of those 3 charsets was the correct charset (e.g. maybe the actual charset is Big5 but the user set ParserOptions.CharsetEncoding to, say, koi8-r), the user still has the option of trying the conversion again after the parser has finished parsing the message: locate the header in the MimeMessage.Headers list and call Header.GetValue (Encoding encoding), passing in some other charset encoding. This can be repeated as many times as they want until the user is satisfied.
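What makes this retry possible is that the raw header bytes are kept around after parsing. A minimal Python sketch of that idea follows; the `RawHeader` class and its `get_value` method are hypothetical names that only mirror the concept, not MimeKit's API.

```python
class RawHeader:
    """Hypothetical header object that keeps the undecoded bytes."""

    def __init__(self, name: str, raw_value: bytes):
        self.name = name
        # Kept untouched so that decoding can be retried at any time.
        self.raw_value = raw_value

    def get_value(self, encoding: str) -> str:
        # Decode on demand; a wrong guess never destroys the original bytes.
        return self.raw_value.decode(encoding, errors="replace")


# A two-character Chinese subject stored as Big5 bytes.
subject = RawHeader("Subject", "\u4e2d\u6587".encode("big5"))
first_guess = subject.get_value("koi8-r")  # wrong charset, unreadable text
second_guess = subject.get_value("big5")   # right charset, original text
```

If the parser stored only the decoded string, a wrong first guess would be permanent; storing bytes and decoding on request keeps every later attempt lossless.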

  1. There's actually finer granularity than this since in an address header (such as To, Cc, etc) it's possible for the name string for each email address to be in some other charset (it's convoluted, I know, but I think we've all seen how convoluted email software out there can be).

4 participants