htmlbody for Hebrew text (iso-8859-8-i encoding) is incorrect #117

santoshghimire · 2021-06-01T11:50:11Z

The htmlbody of a TNEF object containing Hebrew text (iso-8859-8-i encoding) is incorrect. The attached email, iso-8859-8-i.LOG, demonstrates the issue (it has a .LOG extension only because GitHub does not accept .eml attachments).

>>> fname = 'iso-8859-8-i.LOG'
>>> import email
>>> mime_msg = email.message_from_file(open(fname))
>>> tnef_parsed_content = mime_msg.get_payload()[-1].get_payload(decode=True)
>>> from tnefparse import TNEF
>>> tnefobj = TNEF(tnef_parsed_content)
>>> htmlbody = tnefobj.htmlbody
>>> htmlbody
u'<html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-8-i">\r\n<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>\r\n</head>\r\n<body dir="ltr">\r\n<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">\r\n<a href="https://www.walla.co.il/" id="LPlnk">https://www.walla.co.il/</a><br>\r\n</div>\r\n<div class="_Entity _EType_OWALinkPreview _EId_OWALinkPreview _EReadonly_1">\r\n<div id="LPBorder_GTaHR0cHM6Ly93d3cud2FsbGEuY28uaWwv" class="LPBorder510319" style="width: 100%; margin-top: 16px; margin-bottom: 16px; position: relative; max-width: 800px; min-width: 424px;">\r\n<table id="LPContainer510319" role="presentation" style="padding: 12px 36px 12px 12px; width: 100%; border-width: 1px; border-style: solid; border-color: rgb(200, 200, 200); border-radius: 2px;">\r\n<tbody>\r\n<tr valign="top" style="border-spacing: 0px;">\r\n<td>\r\n<div id="LPImageContainer510319" style="position: relative; margin-right: 12px; height: 135px; overflow: hidden; width: 240px;">\r\n<a target="_blank" id="LPImageAnchor510319" href="https://www.walla.co.il/"><img id="LPThumbnailImageId510319" alt="" height="135" width="240" style="display: block;" src="https://img.wcdn.co.il/f_auto,q_auto,w_1200,t_54/3/1/3/6/3136860-46.jpg"></a></div>\r\n</td>\r\n<td style="width: 100%;">\r\n<div id="LPTitle510319" style="font-size: 21px; font-weight: 300; margin-right: 8px; font-family: wf_segoe-ui_light, &quot;Segoe UI Light&quot;, &quot;Segoe WP Light&quot;, &quot;Segoe UI&quot;, &quot;Segoe WP&quot;, Tahoma, Arial, sans-serif; margin-bottom: 12px;">\r\n<a target="_blank" id="LPUrlAnchor510319" href="https://www.walla.co.il/" style="text-decoration: none; color: var(--themePrimary);">\xe5\xe5\xe0\xec\xe4! 
- \xe4\xe0\xfa\xf8 \xe4\xee\xe5\xe1\xe9\xec \xe1\xe9\xf9\xf8\xe0\xec - \xf2\xe3\xeb\xe5\xf0\xe9\xed \xee\xf1\xe1\xe9\xe1 \xec\xf9\xf2\xe5\xef</a></div>\r\n<div id="LPDescription510319" style="font-size: 14px; max-height: 100px; color: rgb(102, 102, 102); font-family: wf_segoe-ui_normal, &quot;Segoe UI&quot;, &quot;Segoe WP&quot;, Tahoma, Arial, sans-serif; margin-bottom: 12px; margin-right: 8px; overflow: hidden;">\r\n\xe5\xe5\xe0\xec\xe4!- \xe4\xe0\xfa\xf8 \xe4\xf4\xe5\xf4\xe5\xec\xf8\xe9 \xe1\xe9\xf9\xf8\xe0\xec. \xe7\xe3\xf9\xe5\xfa \xf2\xe3\xeb\xf0\xe9\xe5\xfa 24/7, \xf2\xf9\xf8\xe5\xfa \xf2\xf8\xe5\xf6\xe9 \xfa\xe5\xeb\xef \xe5\xee\xe9\xe3\xf2 \xee\xe5\xe1\xe9\xec\xe9\xed, \xf9\xe9\xf8\xe5\xfa \xe3\xe5\xe0\xf8 \xe0\xec\xf7\xe8\xf8\xe5\xf0\xe9 \xec\xec\xe0 \xe4\xe2\xe1\xec\xfa \xf0\xf4\xe7, \xf9\xe9\xf8\xe5\xfa\xe9 \xf7\xf0\xe9\xe5\xfa \xe5\xfa\xe9\xe9\xf8\xe5\xfa \xe1\xe0\xfa\xf8 walla!</div>\r\n<div id="LPMetadata510319" style="font-size: 14px; font-weight: 400; color: rgb(166, 166, 166); font-family: wf_segoe-ui_normal, &quot;Segoe UI&quot;, &quot;Segoe WP&quot;, Tahoma, Arial, sans-serif;">\r\nwww.walla.co.il</div>\r\n</td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n</div>\r\n</div>\r\n<br>\r\n</body>\r\n</html>\r\n'
>>> htmlbody[1795:1799]
u'\xe5\xe5\xe0\xec'
>>>

The HTML body appears to be decoded with the wrong charset when htmlbody is constructed: the characters at 1795:1799 are Latin-1 code points (\xe5\xe5\xe0\xec) rather than Hebrew letters. Using this htmlbody to create a text/html part then fails when the payload is set with the content charset:

>>> from email.charset import Charset, QP
>>> from email.mime.nonmultipart import MIMENonMultipart
>>> charset_name = 'iso-8859-8'  # text/plain content type is 'iso-8859-8-i' which is mapped to 'iso-8859-8'
>>> cs = Charset(charset_name)
>>> cs.body_encoding = QP
>>> text_body_part = MIMENonMultipart('text', 'html', charset=charset_name)
>>> text_body_part.set_payload(htmlbody, charset=cs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/email/message.py", line 226, in set_payload
    self.set_charset(charset)
  File "/usr/local/lib/python2.7/email/message.py", line 262, in set_charset
    self._payload = self._payload.encode(charset.output_charset)
  File "/usr/local/lib/python2.7/encodings/iso8859_8.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1795-1799: character maps to <undefined>
>>> htmlbody[1795:1799]
u'\xe5\xe5\xe0\xec'
>>>

The snippets above were run with version 1.31 on an Ubuntu machine.

The issue seems to exist in versions 1.31 and 1.40 (I didn't check earlier ones, but I think they have the bug too).
The bug seems to be here: https://github.com/koodaamo/tnefparse/blob/master/tnefparse/mapi.py#L152-L155
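
To illustrate the diagnosis, here is a minimal Python 3 sketch (not part of the sessions above, which ran on Python 2.7). It assumes the raw bytes were decoded as Latin-1 somewhere along the way, so each character maps back to its original byte; `can_encode` is a hypothetical helper introduced only for this example. It does not touch tnefparse itself and is not a proposed fix.

```python
# The four characters from htmlbody[1795:1799] as reported in this issue.
mis_decoded = '\xe5\xe5\xe0\xec'

# Assumption: the bytes were decoded as Latin-1, so encoding with Latin-1
# recovers the original byte values unchanged.
raw_bytes = mis_decoded.encode('latin-1')

# Decode the same bytes with the Hebrew charset instead.
# 'iso-8859-8' is used here because the issue maps 'iso-8859-8-i' to it.
hebrew = raw_bytes.decode('iso-8859-8')


def can_encode(text, charset='iso-8859-8'):
    """Hypothetical helper: report whether text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


print(repr(hebrew))              # Hebrew letters in the U+05D0..U+05EA range
print(can_encode(mis_decoded))   # the string as returned by htmlbody
print(can_encode(hebrew))        # the re-decoded string
```

If the assumption holds, the re-decoded string encodes to iso-8859-8 cleanly, while the string returned by htmlbody triggers the same UnicodeEncodeError shown in the traceback.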

santoshghimire linked a pull request Jun 1, 2021 that will close this issue

Fix htmlbody encoding for hebrew text #118

Open
