htmlbody for Hebrew text (iso-8859-8-i encoding) is incorrect #117

Open
santoshghimire opened this issue Jun 1, 2021 · 0 comments · May be fixed by #118

iso-8859-8-i.LOG

The htmlbody of a TNEF object containing Hebrew text (iso-8859-8-i encoding) is incorrect. The attached example email, iso-8859-8-i.LOG, demonstrates the issue (it has a .LOG extension only because GitHub does not allow .eml attachments).

>>> fname = 'iso-8859-8-i.LOG'
>>> import email
>>> mime_msg = email.message_from_file(open(fname))
>>> tnef_parsed_content = mime_msg.get_payload()[-1].get_payload(decode=True)
>>> from tnefparse import TNEF
>>> tnefobj = TNEF(tnef_parsed_content)
>>> htmlbody = tnefobj.htmlbody
>>> htmlbody
u'<html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-8-i">\r\n<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>\r\n</head>\r\n<body dir="ltr">\r\n<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">\r\n<a href="https://www.walla.co.il/" id="LPlnk">https://www.walla.co.il/</a><br>\r\n</div>\r\n<div class="_Entity _EType_OWALinkPreview _EId_OWALinkPreview _EReadonly_1">\r\n<div id="LPBorder_GTaHR0cHM6Ly93d3cud2FsbGEuY28uaWwv" class="LPBorder510319" style="width: 100%; margin-top: 16px; margin-bottom: 16px; position: relative; max-width: 800px; min-width: 424px;">\r\n<table id="LPContainer510319" role="presentation" style="padding: 12px 36px 12px 12px; width: 100%; border-width: 1px; border-style: solid; border-color: rgb(200, 200, 200); border-radius: 2px;">\r\n<tbody>\r\n<tr valign="top" style="border-spacing: 0px;">\r\n<td>\r\n<div id="LPImageContainer510319" style="position: relative; margin-right: 12px; height: 135px; overflow: hidden; width: 240px;">\r\n<a target="_blank" id="LPImageAnchor510319" href="https://www.walla.co.il/"><img id="LPThumbnailImageId510319" alt="" height="135" width="240" style="display: block;" src="https://img.wcdn.co.il/f_auto,q_auto,w_1200,t_54/3/1/3/6/3136860-46.jpg"></a></div>\r\n</td>\r\n<td style="width: 100%;">\r\n<div id="LPTitle510319" style="font-size: 21px; font-weight: 300; margin-right: 8px; font-family: wf_segoe-ui_light, &quot;Segoe UI Light&quot;, &quot;Segoe WP Light&quot;, &quot;Segoe UI&quot;, &quot;Segoe WP&quot;, Tahoma, Arial, sans-serif; margin-bottom: 12px;">\r\n<a target="_blank" id="LPUrlAnchor510319" href="https://www.walla.co.il/" style="text-decoration: none; color: var(--themePrimary);">\xe5\xe5\xe0\xec\xe4! 
- \xe4\xe0\xfa\xf8 \xe4\xee\xe5\xe1\xe9\xec \xe1\xe9\xf9\xf8\xe0\xec - \xf2\xe3\xeb\xe5\xf0\xe9\xed \xee\xf1\xe1\xe9\xe1 \xec\xf9\xf2\xe5\xef</a></div>\r\n<div id="LPDescription510319" style="font-size: 14px; max-height: 100px; color: rgb(102, 102, 102); font-family: wf_segoe-ui_normal, &quot;Segoe UI&quot;, &quot;Segoe WP&quot;, Tahoma, Arial, sans-serif; margin-bottom: 12px; margin-right: 8px; overflow: hidden;">\r\n\xe5\xe5\xe0\xec\xe4!- \xe4\xe0\xfa\xf8 \xe4\xf4\xe5\xf4\xe5\xec\xf8\xe9 \xe1\xe9\xf9\xf8\xe0\xec. \xe7\xe3\xf9\xe5\xfa \xf2\xe3\xeb\xf0\xe9\xe5\xfa 24/7, \xf2\xf9\xf8\xe5\xfa \xf2\xf8\xe5\xf6\xe9 \xfa\xe5\xeb\xef \xe5\xee\xe9\xe3\xf2 \xee\xe5\xe1\xe9\xec\xe9\xed, \xf9\xe9\xf8\xe5\xfa \xe3\xe5\xe0\xf8 \xe0\xec\xf7\xe8\xf8\xe5\xf0\xe9 \xec\xec\xe0 \xe4\xe2\xe1\xec\xfa \xf0\xf4\xe7, \xf9\xe9\xf8\xe5\xfa\xe9 \xf7\xf0\xe9\xe5\xfa \xe5\xfa\xe9\xe9\xf8\xe5\xfa \xe1\xe0\xfa\xf8 walla!</div>\r\n<div id="LPMetadata510319" style="font-size: 14px; font-weight: 400; color: rgb(166, 166, 166); font-family: wf_segoe-ui_normal, &quot;Segoe UI&quot;, &quot;Segoe WP&quot;, Tahoma, Arial, sans-serif;">\r\nwww.walla.co.il</div>\r\n</td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n</div>\r\n</div>\r\n<br>\r\n</body>\r\n</html>\r\n'
>>> htmlbody[1795:1799]
u'\xe5\xe5\xe0\xec'
>>>

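The text seems to be decoded incorrectly while the htmlbody is constructed: the characters at 1795:1799 are unicode objects, but they hold the raw iso-8859-8 byte values (as if the bytes had been decoded as Latin-1) instead of the Hebrew code points. Using this htmlbody to create a text/html part fails when the payload is set with the content charset.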

>>> from email.charset import Charset, QP
>>> from email.mime.nonmultipart import MIMENonMultipart
>>> charset_name = 'iso-8859-8'  # text/plain content type is 'iso-8859-8-i' which is mapped to 'iso-8859-8'
>>> cs = Charset(charset_name)
>>> cs.body_encoding = QP
>>> text_body_part = MIMENonMultipart('text', 'html', charset=charset_name)
>>> text_body_part.set_payload(htmlbody, charset=cs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/email/message.py", line 226, in set_payload
    self.set_charset(charset)
  File "/usr/local/lib/python2.7/email/message.py", line 262, in set_charset
    self._payload = self._payload.encode(charset.output_charset)
  File "/usr/local/lib/python2.7/encodings/iso8859_8.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1795-1799: character maps to <undefined>
>>> htmlbody[1795:1799]
u'\xe5\xe5\xe0\xec'
>>>
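
The mismatch can be reproduced without the attachment. The following is a minimal sketch of a caller-side workaround, assuming the bytes were decoded one-to-one as Latin-1; `repair_mis_decoded` is a hypothetical helper name, not part of tnefparse.

```python
def repair_mis_decoded(text, charset='iso-8859-8'):
    """Undo a byte-for-byte (Latin-1 style) decode and redo it with `charset`.

    Assumption: every character in `text` is below U+0100, i.e. the text
    really is the original bytes widened to unicode one-to-one.
    """
    raw_bytes = text.encode('latin-1')   # recover the original bytes
    return raw_bytes.decode(charset)     # decode them with the intended charset


# The four characters reported at htmlbody[1795:1799].
mis_decoded = u'\xe5\xe5\xe0\xec'
repaired = repair_mis_decoded(mis_decoded)

# The mis-decoded text cannot be encoded to iso-8859-8, as in the traceback.
try:
    mis_decoded.encode('iso-8859-8')
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

Applied to the whole htmlbody, the repaired string should encode to iso-8859-8 without error, since the rest of the markup shown above is plain ASCII.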

The code snippets above were run with version 1.31 on an Ubuntu machine.

The issue seems to exist in versions 1.31 and 1.40 (I didn't check earlier ones, but I think they have the bug too).
It seems the bug is here: https://github.com/koodaamo/tnefparse/blob/master/tnefparse/mapi.py#L152-L155
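
The linked lines are not reproduced here, so the following is only a sketch of the kind of decoding that would yield correct text, not tnefparse's actual code. It assumes the raw HTML bytes are available and that the charset is declared in the document itself; `decode_html_bytes` and `CHARSET_ALIASES` are hypothetical names.

```python
import re

# Assumed alias table: labels that share a character table with a codec
# name Python is known to accept.
CHARSET_ALIASES = {
    'iso-8859-8-i': 'iso-8859-8',
    'iso-8859-8-e': 'iso-8859-8',
}

# Matches charset=NAME, with or without a quote before NAME.
META_CHARSET = re.compile(br'charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def decode_html_bytes(raw, default='utf-8'):
    """Decode raw HTML bytes using the charset declared in the markup."""
    match = META_CHARSET.search(raw)
    name = match.group(1).decode('ascii').lower() if match else default
    name = CHARSET_ALIASES.get(name, name)
    # An unknown charset name raises LookupError; callers may want a fallback.
    return raw.decode(name)


sample = (b'<meta http-equiv="Content-Type" '
          b'content="text/html; charset=iso-8859-8-i">\xe5\xe5\xe0\xec')
decoded = decode_html_bytes(sample)
```

With this approach the Hebrew characters come out as U+05D0 to U+05EA code points, which encode back to iso-8859-8 cleanly.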

@santoshghimire santoshghimire linked a pull request Jun 1, 2021 that will close this issue