Issue 4 mentions an issue on RT about how HTML containing Unicode will be corrupted. Last year I attempted to fix this, but the PR was ignored; yesterday Karen Etheridge looked at the PR and found some problems, which prodded me into having another look, and the result was a revised PR against the current master.
As part of fixing my broken tests, I tried to replace badly-encoded Unicode with HTML entities, but had to back that change out because whichever library HTML::Formatter was using to decode entities was generating Unicode character strings rather than UTF-8 bytes. While that would normally be an unalloyed benefit, HTML::FormatText (and very probably the other classes) expects to output Latin-1 text and, presumably because it was written before Unicode strings were really A Thing, doesn't expect to handle anything outside the ASCII range.
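For what it's worth, here's a minimal sketch of the character-string-versus-bytes distinction I ran into. I'm assuming the entity decoding is done by something like HTML::Entities here; I haven't confirmed which module HTML::Formatter actually pulls in:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use Encode qw(encode);

# decode_entities() returns a Perl *character* string: "&eacute;" becomes
# the single code point U+00E9 and "&#x20AC;" becomes U+20AC, not their
# UTF-8 byte sequences.
my $chars = decode_entities('caf&eacute; costs 3&#x20AC;');

# Anything downstream that expects bytes (or only Latin-1/ASCII) needs an
# explicit encode step, otherwise characters above U+00FF provoke
# "Wide character in print" warnings or get mangled.
my $bytes = encode('UTF-8', $chars);
print $bytes, "\n";
```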
I'm by no means a Unicode expert, but as I understand it, one of the fundamental principles of being Unicode-aware is that you decode as early as possible and encode as late as possible, so that you operate on Unicode strings pretty much all the time. Unfortunately, HTML::FormatText depends on HTML::Parser, which takes exactly the opposite tack and reads files in binary mode. HTML::Parser should be able to work just fine on Unicode strings if supplied with them, if I read this issue correctly, so maybe it's just a matter of fixing HTML::Formatter so that it reads files itself and decodes them as UTF-8 before handing them to the parser? But obviously that would only work on Perls that properly support Unicode strings.
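To make that suggestion concrete, here's a rough sketch of the caller-side version of the idea: decode the file yourself and hand the formatter an already-decoded string via format_string instead of a filename. The filename is made up, and whether the formatter's output then survives non-Latin-1 characters is exactly the open question above:

```perl
use strict;
use warnings;
use HTML::FormatText;

# Encode as late as possible: let STDOUT do the encoding on output.
binmode STDOUT, ':encoding(UTF-8)';

# Decode as early as possible: read and decode the file ourselves instead
# of letting HTML::Parser slurp it in binary mode, so $html is a Perl
# character string.
open my $fh, '<:encoding(UTF-8)', 'page.html'   # hypothetical input file
    or die "Can't open page.html: $!";
my $html = do { local $/; <$fh> };
close $fh;

# Hand the decoded string to the formatter rather than a filename.
my $text = HTML::FormatText->format_string(
    $html,
    leftmargin  => 0,
    rightmargin => 72,
);
print $text;
```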