
htmlbody for Hebrew text (iso-8859-8-i encoding) is incorrect #117

@chkp-santoshg

iso-8859-8-i.LOG

The htmlbody of a TNEF object containing Hebrew text (iso-8859-8-i encoding) is decoded incorrectly. The attached example email, iso-8859-8-i.LOG, demonstrates the issue (it has a .LOG extension only because GitHub does not allow .eml attachments).

>>> fname = 'iso-8859-8-i.LOG'
>>> import email
>>> mime_msg = email.message_from_file(open(fname))
>>> tnef_parsed_content = mime_msg.get_payload()[-1].get_payload(decode=True)
>>> from tnefparse import TNEF
>>> tnefobj = TNEF(tnef_parsed_content)
>>> htmlbody = tnefobj.htmlbody
>>> htmlbody
u'<html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-8-i">\r\n<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>\r\n</head>\r\n<body dir="ltr">\r\n<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">\r\n<a href="https://www.walla.co.il/" id="LPlnk">https://www.walla.co.il/</a><br>\r\n</div>\r\n<div class="_Entity _EType_OWALinkPreview _EId_OWALinkPreview _EReadonly_1">\r\n<div id="LPBorder_GTaHR0cHM6Ly93d3cud2FsbGEuY28uaWwv" class="LPBorder510319" style="width: 100%; margin-top: 16px; margin-bottom: 16px; position: relative; max-width: 800px; min-width: 424px;">\r\n<table id="LPContainer510319" role="presentation" style="padding: 12px 36px 12px 12px; width: 100%; border-width: 1px; border-style: solid; border-color: rgb(200, 200, 200); border-radius: 2px;">\r\n<tbody>\r\n<tr valign="top" style="border-spacing: 0px;">\r\n<td>\r\n<div id="LPImageContainer510319" style="position: relative; margin-right: 12px; height: 135px; overflow: hidden; width: 240px;">\r\n<a target="_blank" id="LPImageAnchor510319" href="https://www.walla.co.il/"><img id="LPThumbnailImageId510319" alt="" height="135" width="240" style="display: block;" src="https://img.wcdn.co.il/f_auto,q_auto,w_1200,t_54/3/1/3/6/3136860-46.jpg"></a></div>\r\n</td>\r\n<td style="width: 100%;">\r\n<div id="LPTitle510319" style="font-size: 21px; font-weight: 300; margin-right: 8px; font-family: wf_segoe-ui_light, &quot;Segoe UI Light&quot;, &quot;Segoe WP Light&quot;, &quot;Segoe UI&quot;, &quot;Segoe WP&quot;, Tahoma, Arial, sans-serif; margin-bottom: 12px;">\r\n<a target="_blank" id="LPUrlAnchor510319" href="https://www.walla.co.il/" style="text-decoration: none; color: var(--themePrimary);">\xe5\xe5\xe0\xec\xe4! 
- \xe4\xe0\xfa\xf8 \xe4\xee\xe5\xe1\xe9\xec \xe1\xe9\xf9\xf8\xe0\xec - \xf2\xe3\xeb\xe5\xf0\xe9\xed \xee\xf1\xe1\xe9\xe1 \xec\xf9\xf2\xe5\xef</a></div>\r\n<div id="LPDescription510319" style="font-size: 14px; max-height: 100px; color: rgb(102, 102, 102); font-family: wf_segoe-ui_normal, &quot;Segoe UI&quot;, &quot;Segoe WP&quot;, Tahoma, Arial, sans-serif; margin-bottom: 12px; margin-right: 8px; overflow: hidden;">\r\n\xe5\xe5\xe0\xec\xe4!- \xe4\xe0\xfa\xf8 \xe4\xf4\xe5\xf4\xe5\xec\xf8\xe9 \xe1\xe9\xf9\xf8\xe0\xec. \xe7\xe3\xf9\xe5\xfa \xf2\xe3\xeb\xf0\xe9\xe5\xfa 24/7, \xf2\xf9\xf8\xe5\xfa \xf2\xf8\xe5\xf6\xe9 \xfa\xe5\xeb\xef \xe5\xee\xe9\xe3\xf2 \xee\xe5\xe1\xe9\xec\xe9\xed, \xf9\xe9\xf8\xe5\xfa \xe3\xe5\xe0\xf8 \xe0\xec\xf7\xe8\xf8\xe5\xf0\xe9 \xec\xec\xe0 \xe4\xe2\xe1\xec\xfa \xf0\xf4\xe7, \xf9\xe9\xf8\xe5\xfa\xe9 \xf7\xf0\xe9\xe5\xfa \xe5\xfa\xe9\xe9\xf8\xe5\xfa \xe1\xe0\xfa\xf8 walla!</div>\r\n<div id="LPMetadata510319" style="font-size: 14px; font-weight: 400; color: rgb(166, 166, 166); font-family: wf_segoe-ui_normal, &quot;Segoe UI&quot;, &quot;Segoe WP&quot;, Tahoma, Arial, sans-serif;">\r\nwww.walla.co.il</div>\r\n</td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n</div>\r\n</div>\r\n<br>\r\n</body>\r\n</html>\r\n'
>>> htmlbody[1795:1799]
u'\xe5\xe5\xe0\xec'
>>>

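The htmlbody appears to have been decoded with the wrong codec: the characters at 1795:1799 are Latin-1 code points (U+00E5, U+00E5, U+00E0, U+00EC) rather than the Hebrew letters those bytes represent in iso-8859-8 (U+05D5, U+05D5, U+05D0, U+05DC). Using this htmlbody to create a text/html part then fails when the payload is set with the content charset: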

>>> from email.charset import Charset, QP
>>> from email.mime.nonmultipart import MIMENonMultipart
>>> charset_name = 'iso-8859-8'  # text/plain content type is 'iso-8859-8-i' which is mapped to 'iso-8859-8'
>>> cs = Charset(charset_name)
>>> cs.body_encoding = QP
>>> text_body_part = MIMENonMultipart('text', 'html', charset=charset_name)
>>> text_body_part.set_payload(htmlbody, charset=cs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/email/message.py", line 226, in set_payload
    self.set_charset(charset)
  File "/usr/local/lib/python2.7/email/message.py", line 262, in set_charset
    self._payload = self._payload.encode(charset.output_charset)
  File "/usr/local/lib/python2.7/encodings/iso8859_8.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1795-1799: character maps to <undefined>
>>> htmlbody[1795:1799]
u'\xe5\xe5\xe0\xec'
>>>
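
A possible workaround, sketched under the assumption that the body was decoded byte-for-byte as Latin-1 (this matches the code points shown above but is not confirmed from the tnefparse source): re-encode the string as Latin-1 to recover the original bytes, then decode them as iso-8859-8. The helper name `repair_hebrew` is hypothetical and not part of tnefparse.

```python
def repair_hebrew(text, wrong="latin-1", right="iso-8859-8"):
    """Undo a wrong single-byte decode: text -> original bytes -> correct text.

    Assumption: `text` was produced by decoding iso-8859-8 bytes as Latin-1.
    """
    return text.encode(wrong).decode(right)


# Same value as htmlbody[1795:1799] in the session above.
sample = u'\xe5\xe5\xe0\xec'

# The mis-decoded text cannot be encoded to iso-8859-8; this is the step
# that fails inside set_charset() in the traceback.
try:
    sample.encode('iso-8859-8')
    broken_encodes = True
except UnicodeEncodeError:
    broken_encodes = False

fixed = repair_hebrew(sample)

# The repaired text round-trips to the original bytes.
roundtrip = fixed.encode('iso-8859-8')
print(broken_encodes, roundtrip == b'\xe5\xe5\xe0\xec')
```

Applying the same helper to the whole htmlbody should work for this sample because it contains only ASCII and bytes in the Hebrew letter range; it would not be a safe general fix if the wrong codec were something other than Latin-1.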

The snippets above were run with version 1.31 on an Ubuntu machine.

The issue seems to exist in versions 1.31 and 1.40 (I didn't check earlier ones, but I suspect they have the bug too).
It seems the bug is here: https://github.com/koodaamo/tnefparse/blob/master/tnefparse/mapi.py#L152-L155
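
For illustration only, here is a minimal sketch of the expected behavior: decode the raw body bytes with the charset the message declares instead of a fixed default. It does not reproduce what the linked lines actually do. The function names `normalize_charset` and `decode_body` are hypothetical, and the sketch assumes the `-i` / `-e` suffixes only describe bidirectional text handling, so the byte mapping is plain iso-8859-8.

```python
import codecs


def normalize_charset(label):
    """Map a MIME charset label to a codec name Python can look up.

    Assumption: 'iso-8859-8-i' and 'iso-8859-8-e' use the same byte-to-character
    mapping as 'iso-8859-8'; the suffix only concerns text direction.
    """
    name = label.strip().lower()
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Fall back to the base label when the suffixed one is unknown.
        if name.endswith(("-i", "-e")):
            return codecs.lookup(name[:-2]).name
        raise


def decode_body(raw, label):
    """Decode raw body bytes using the declared charset label."""
    return raw.decode(normalize_charset(label))


raw = b'\xe5\xe5\xe0\xec'  # the bytes behind htmlbody[1795:1799]
decoded = decode_body(raw, 'iso-8859-8-i')
print(decoded == u'\u05d5\u05d5\u05d0\u05dc')
```

With this approach the resulting string encodes back to iso-8859-8 without error, so setting the payload with that charset would no longer raise.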
