Allow specifying charset and/or improve charset detection #79

Open
@dhouck

Currently the only way to specify the charset is inside the document itself (with a BOM or a <meta charset> element); if the charset is known from another source but not declared in the document, there is no way to supply it.

Additionally, charset detection does not always work well, even with Heuristics.ALL; in particular, it fails to recognize UTF-8, at least when the first non-ASCII byte appears late in the document. In a non-normative note, the WHATWG spec recommends that systems be able to recognize UTF-8 even if they arenʼt good at detecting other charsets:

The UTF-8 encoding has a highly detectable bit pattern. Files from the local file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. When a user agent can examine the whole file, rather than just the preamble, detecting for UTF-8 specifically can be especially effective. [PPUTF8] [UTF8DET]

(This reproduces with multiple test documents; the smallest is below. Another one produced a warning message saying that the UTF-8 character was invalid in Windows-1252, meaning the detection fell back to the default, which was a particularly bad guess.)

<!DOCTYPE html>
<html lang="en">
    <head>
        <link rel="stylesheet" href="https://fred-wang.github.io/mathml.css/mathml.css">
        <title>Circle equation</title>
        <!-- <meta charset="utf-8" /> -->
    </head>
    <body>
        <p>
            The equation
            <math display=inline>
                <mi>y</mi><mo>=</mo><mo>±</mo>
                <msqrt>
                    <msup><mi>r</mi><mn>2</mn></msup>
                    <mo>-</mo>
                    <msup><mi>x</mi><mn>2</mn></msup>
                </msqrt>
            </math>
            produces a circle with radius <math display=inline><mi>r</mi></math>:
            </p>
        <svg width="10em" height="10em" viewBox="0 0 100 100">
            <desc>A circle</desc>
            <circle cx="50" cy="50" r="40" fill="none" stroke="blue" stroke-width="1" />
        </svg>
    </body>
</html>
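
For illustration, the whole-file UTF-8 check described in the spec note could look roughly like the sketch below. This is not the library's code: the function names `looks_like_utf8` and `guess_charset` are hypothetical, and the `windows-1252` fallback is only an assumption based on the warning mentioned above. The idea is to treat the input as UTF-8 when it contains at least one byte above 0x7F and the entire byte sequence decodes strictly as UTF-8, no matter how late the first non-ASCII byte appears.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data has a non-ASCII byte and is valid UTF-8 throughout."""
    if data.isascii():
        # Pure ASCII gives no evidence either way, so do not claim UTF-8.
        return False
    try:
        # Strict decoding rejects any byte sequence that breaks the UTF-8 pattern.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def guess_charset(data: bytes, fallback: str = "windows-1252") -> str:
    """Hypothetical helper: prefer UTF-8 when the whole input matches its pattern."""
    return "utf-8" if looks_like_utf8(data) else fallback


if __name__ == "__main__":
    # The first non-ASCII byte appears late, as in the failing case reported here.
    late = b"<!DOCTYPE html>" + b" " * 5000 + "<mo>\u00b1</mo>".encode("utf-8")
    print(guess_charset(late))
```

Because the check scans the whole input rather than a fixed-size preamble, the position of the ± character in the test document above would not matter.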
