
HTML::Formatter isn't Unicode-aware #10

@skington


Issue 4 mentions an RT ticket about HTML containing Unicode being corrupted. Last year I attempted to fix this, but the PR was ignored. Yesterday Karen Etheridge looked at the PR and found some problems, which prodded me into having another look; the result is a revised PR against the current master.

As part of fixing my broken tests, I tried to replace badly-encoded Unicode with HTML entities, but had to back out that change because whichever library HTML::Formatter uses to decode entities was generating Unicode character strings rather than UTF-8 bytes. While this would normally be an unalloyed benefit, HTML::FormatText (and very probably the other classes) expects to output Latin-1 text and, presumably because it was written before Unicode strings were really A Thing, doesn't expect to handle anything outside the ASCII range.

I'm by no means a Unicode expert, but as I understand it, one of the fundamental principles of being Unicode-aware is that you decode as early as possible and encode as late as possible, so you operate on Unicode strings pretty much all the time. Unfortunately, HTML::FormatText depends on HTML::Parser, which takes exactly the opposite tack and reads files in binary mode. If I read this issue correctly, HTML::Parser should work just fine on Unicode strings if supplied with them, so maybe it's just a matter of fixing HTML::Formatter so that it reads files itself and decodes them as UTF-8? But obviously that would only work on Perls that properly support Unicode strings.
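To make the "decode early, encode late" idea concrete, here is a minimal sketch using only the core Encode module. The helper names `slurp_utf8` and `to_output_bytes` are hypothetical and are not part of HTML::Formatter or HTML::Parser; the sketch also assumes the input file really is UTF-8, which a real fix could not take for granted.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: read a file as raw bytes and decode it straight away,
# so that everything downstream works on Perl character strings.
# Assumption: the file is UTF-8 encoded.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    # FB_CROAK makes decode() die on malformed UTF-8 instead of
    # silently substituting replacement characters.
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

# Hypothetical helper: encode only at the output boundary.
# The encoding defaults to UTF-8 when none is given.
sub to_output_bytes {
    my ($text, $encoding) = @_;
    return encode($encoding // 'UTF-8', $text);
}
```

The decoded string returned by `slurp_utf8` could then, in principle, be handed to the parser as a string (for example via HTML::TreeBuilder's `parse` and `eof` methods) rather than letting the parser open the file in binary mode, with the formatter's output encoded only when it is written out. Whether the formatter classes cope with non-ASCII characters at that point is exactly the open question here.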
