Issue 4 mentions an issue on RT about how HTML containing Unicode will be corrupted. Last year I attempted to fix this, but the PR was ignored; yesterday Karen Etheridge looked at the PR and found some problems, which prodded me into having another look, and the result was a revised PR against the current master.
As part of fixing my broken tests, I tried to replace badly-encoded Unicode with HTML entities, but had to back that change out because whichever library HTML::Formatter was using to decode entities was generating Unicode character strings rather than UTF-8 bytes. While that would normally be an unalloyed benefit, HTML::FormatText (and very probably the other classes) expects to output Latin-1 text and, presumably because it was written before Unicode strings were really A Thing, doesn't expect to handle anything outside the ASCII range.
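For what it's worth, here's a minimal sketch of the character-string-versus-bytes distinction I ran into. I'm assuming the entity decoding is done by something like HTML::Entities here; I haven't confirmed which module HTML::Formatter actually pulls in:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use Encode qw(encode);

# decode_entities() returns a Perl *character* string: "&eacute;" becomes
# the single code point U+00E9 and "&#x20AC;" becomes U+20AC, not their
# UTF-8 byte sequences.
my $chars = decode_entities('caf&eacute; costs 3&#x20AC;');

# Anything downstream that expects bytes (or only Latin-1/ASCII) needs an
# explicit encode step, otherwise characters above U+00FF provoke
# "Wide character in print" warnings or get mangled.
my $bytes = encode('UTF-8', $chars);
print $bytes, "\n";
```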
I'm by no means a Unicode expert, but as I understand it, one of the fundamental principles of being Unicode-aware is that you decode as early as possible and encode as late as possible, so that you operate on Unicode strings pretty much all the time. Unfortunately, HTML::FormatText depends on HTML::Parser, which takes exactly the opposite tack and reads files in binary mode. HTML::Parser should be able to work just fine on Unicode strings if supplied with them, if I read this issue correctly, so maybe it's just a matter of fixing HTML::Formatter so that it reads files itself and decodes them as UTF-8 before handing them to the parser? But obviously that would only work on Perls that properly support Unicode strings.
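To make that suggestion concrete, here's a rough sketch of the caller-side version of the idea: decode the file yourself and hand the formatter an already-decoded string via format_string instead of a filename. The filename is made up, and whether the formatter's output then survives non-Latin-1 characters is exactly the open question above:

```perl
use strict;
use warnings;
use HTML::FormatText;

# Encode as late as possible: let STDOUT do the encoding on output.
binmode STDOUT, ':encoding(UTF-8)';

# Decode as early as possible: read and decode the file ourselves instead
# of letting HTML::Parser slurp it in binary mode, so $html is a Perl
# character string.
open my $fh, '<:encoding(UTF-8)', 'page.html'   # hypothetical input file
    or die "Can't open page.html: $!";
my $html = do { local $/; <$fh> };
close $fh;

# Hand the decoded string to the formatter rather than a filename.
my $text = HTML::FormatText->format_string(
    $html,
    leftmargin  => 0,
    rightmargin => 72,
);
print $text;
```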