Extending Unicode superscript/subscript substitution to all formats #10591

adunning · 2025-02-02T14:12:11Z

Pandoc has partial support for converting characters formatted as superscript or subscript to their Unicode equivalents, where possible:

pandoc/src/Text/Pandoc/Writers/Shared.hs, line 443 in 1470b3a:

-- | Tries to convert a character into a unicode superscript version of

This is applied to plain text only. It would be helpful if the list could include more characters (see http://unicode.org/reports/tr30/datafiles/SuperscriptFolding.txt and https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts) and if this functionality could be made available in all formats.

Using a native Unicode character better matches the weight and size of a typeface, while applied superscript formatting results in an overly light weight. You can see the difference here between added formatting (3^a, 4^o) and Unicode (3ª, 4º), and it would be very useful not to have to worry about encoding these differently. In addition, as noted in jgm/citeproc#147, Unicode superscripts are automatically converted to manual formatting, meaning they need to be replaced again if one cares about this.

This should probably be optional rather than modifying the default smart behaviour, since some fonts do not have a full set of Unicode superscripts.

jgm · 2025-02-02T16:54:34Z

My main worry is about the availability of the superscript glyphs in fonts. But I have no idea if this is a serious issue with modern fonts.
[Oh, I see the suggestion that it be made optional.]

jgm · 2025-02-02T17:07:37Z

Looking at this just a little, I find it extremely confusing. For example, there is a unicode code block for superscripts and subscripts.
https://unicode.org/charts/PDF/U2070.pdf
It contains subscripts for all the digits 0-9 but superscripts for only 4-9 (perhaps because 0-3 are elsewhere?) And then super/subscripts for a few random letters. There seems to be no rhyme or reason to it. By combing through a whole bunch of other code blocks, you can cobble together other letters, but using things called "spacing modifier letters" and IPA symbols for this purpose doesn't seem quite right.

I experimented with fonts and found that quite a few of the fonts I use don't have the glyphs for superscripted letters, though a few do.

bpj · 2025-02-03T10:24:02Z

Superscript digits 1-3 are in the Latin-1 Supplement block right after Basic Latin/ASCII.

While the superscript/subscript digits are meant for general use, most of the superscript letters and the few subscript letters are meant for phonetic transcription, as is evident from the many phonetic “special” letters among them. Also, AFAIK not all Basic Latin letters have superscript equivalents, not to speak of other scripts, nor do they form regular upper/lower case pairs.

It might possibly make sense to use superscript digits, which are well supported by many fonts, for footnote references in plain output, but note that you won’t find them if you search for regular digits, which IMO is a serious enough drawback to not do it.

Anyway, below is a (TSV) list of all the “Latin” superscript and subscript digits. Note that the first three are in another block and also out of order relative to the others and each other! (Note how random support is in the font GitHub uses for code blocks! I use Noto Sans Mono in my terminal/Vim so I can see them all and more besides.)

²   2   U+00B2  SUPERSCRIPT TWO
³   3   U+00B3  SUPERSCRIPT THREE
¹   1   U+00B9  SUPERSCRIPT ONE
⁰   0   U+2070  SUPERSCRIPT ZERO
⁴   4   U+2074  SUPERSCRIPT FOUR
⁵   5   U+2075  SUPERSCRIPT FIVE
⁶   6   U+2076  SUPERSCRIPT SIX
⁷   7   U+2077  SUPERSCRIPT SEVEN
⁸   8   U+2078  SUPERSCRIPT EIGHT
⁹   9   U+2079  SUPERSCRIPT NINE
₀   0   U+2080  SUBSCRIPT ZERO
₁   1   U+2081  SUBSCRIPT ONE
₂   2   U+2082  SUBSCRIPT TWO
₃   3   U+2083  SUBSCRIPT THREE
₄   4   U+2084  SUBSCRIPT FOUR
₅   5   U+2085  SUBSCRIPT FIVE
₆   6   U+2086  SUBSCRIPT SIX
₇   7   U+2087  SUBSCRIPT SEVEN
₈   8   U+2088  SUBSCRIPT EIGHT
₉   9   U+2089  SUBSCRIPT NINE
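
As a minimal sketch of how the list above could be turned into a lookup, the following Python builds two tables from those code points. The names `SUPERSCRIPT_DIGITS`, `SUBSCRIPT_DIGITS`, `to_script`, and `fold_for_search` are made up for illustration and are not pandoc's API, and returning `None` when any character lacks an equivalent is an assumed policy. The second function illustrates the search drawback mentioned above: a plain substring search misses the superscript digits unless the text is first folded with NFKC compatibility normalization.

```python
import unicodedata

# Code points taken from the list above; 1-3 live in Latin-1 Supplement,
# the rest in the Superscripts and Subscripts block.
SUPERSCRIPT_DIGITS = dict(zip("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹"))
SUBSCRIPT_DIGITS = dict(zip("0123456789", "₀₁₂₃₄₅₆₇₈₉"))


def to_script(text, table):
    """Map every character of text through table.

    Returns None if any character has no equivalent
    (an assumed all-or-nothing policy, chosen for this sketch).
    """
    converted = []
    for ch in text:
        if ch not in table:
            return None
        converted.append(table[ch])
    return "".join(converted)


def fold_for_search(text):
    """Fold compatibility characters, including super/subscript digits,
    back to their base characters so a plain search can find them."""
    return unicodedata.normalize("NFKC", text)
```

For example, `to_script("42", SUPERSCRIPT_DIGITS)` gives `"⁴²"`, while `to_script("4a", SUPERSCRIPT_DIGITS)` gives `None` because this table has no entry for `a`.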

iandol · 2025-02-03T14:29:44Z

Superscript minus U+207B ⁻ is also really useful for scientific notation, i.e. 4.3×10⁻⁵
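
A small sketch of that use case, assuming Python and a hypothetical helper name `sci_notation` (nothing here is pandoc code): it formats a finite number in scientific notation and writes the exponent with the Unicode superscript digits and U+207B.

```python
# ASCII digits and hyphen-minus mapped to their superscript forms;
# the last character is U+207B SUPERSCRIPT MINUS.
SUPERSCRIPT = str.maketrans("0123456789-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁻")


def sci_notation(value, digits=1):
    """Format a finite number as mantissa×10^exponent with a Unicode
    superscript exponent. Not meant for inf or nan."""
    mantissa, exponent = f"{value:.{digits}e}".split("e")
    # int() drops the "+" sign and zero padding that the "e" format adds.
    return f"{mantissa}×10{str(int(exponent)).translate(SUPERSCRIPT)}"
```

With this, `sci_notation(4.3e-5)` yields `4.3×10⁻⁵`.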

adunning · 2025-02-03T16:03:49Z

Yes, Unicode added superscript/subscript characters for specific purposes over time, hence the variable font support.

jgm · 2025-02-03T19:26:46Z

Note that we already use Unicode super/subscript digits (and superscript minus) in plain output.

silby · 2025-02-12T07:22:55Z

In #9437 I tried a related idea for HTML specifically. Apart from the questionable value of adding Pandoc's 900th command-line option, the font coverage of superscript numbers in web-safe fonts (on my computer anyway) was spotty.

fiapps · 2025-02-28T14:21:02Z

If the goal is to have superscripts and subscripts that match the weight and size of the typeface, another way to achieve this is with the OpenType features `sups` for superscripts and `subs` or `sinf` for subscripts. A font that has a set of glyphs for superscript or subscript characters ought to allow you to access them with OpenType features. Depending on the font, this may provide additional glyphs, such as lowercase superscript letters, that Unicode alone cannot specify.

For formats that allow you to activate font features, no modification to pandoc is necessary. For example, LaTeX has the realscripts package, which checks for the presence of these font features and uses them to implement superscript and subscript, falling back to faking superscript and subscript if necessary.

---
header-includes: | 
    ```{=latex}
    \usepackage{realscripts}
    ``` 
---

For other formats, you might need to use a filter to replace superscript/subscript with a class (defined in CSS) or custom style (defined in a reference document) that will activate the relevant OpenType feature.
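
One way such a filter might look, sketched in Python against pandoc's JSON AST (the output of `pandoc -t json`) using only the standard library. The class names `sups` and `subs` and the function names are chosen for illustration; the stylesheet that activates the OpenType feature has to be supplied separately, and this is an untested-in-production sketch rather than a recommended implementation.

```python
import json
import sys

# Hypothetical class names. A stylesheet would need to define them, e.g.
# .sups { font-feature-settings: "sups"; } and
# .subs { font-feature-settings: "subs"; }
SCRIPT_CLASSES = {"Superscript": "sups", "Subscript": "subs"}


def replace_scripts(node):
    """Recursively replace Superscript/Subscript inlines with classed Spans."""
    if isinstance(node, list):
        return [replace_scripts(item) for item in node]
    if isinstance(node, dict):
        tag = node.get("t")
        if isinstance(tag, str) and tag in SCRIPT_CLASSES:
            # A Span's content is [attr, inlines],
            # where attr is [identifier, classes, key-value pairs].
            attr = ["", [SCRIPT_CLASSES[tag]], []]
            return {"t": "Span", "c": [attr, replace_scripts(node["c"])]}
        return {key: replace_scripts(value) for key, value in node.items()}
    return node


def run_as_filter():
    """Read a pandoc JSON document from stdin and write the result to stdout."""
    json.dump(replace_scripts(json.load(sys.stdin)), sys.stdout)
```

Calling `run_as_filter()` at the bottom of the script would let it be passed to pandoc with `--filter`. For HTML output, each superscript then becomes a span carrying the `sups` class instead of a `<sup>` element, so the visual result depends entirely on the CSS and on the font actually providing the feature.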

adunning added the enhancement label Feb 2, 2025

adunning mentioned this issue Feb 2, 2025

Different processing of CSL JSON and YAML jgm/citeproc#147

Open

adunning changed the title ~~Extending Unicode superscript/subscript substitution too all formats~~ Extending Unicode superscript/subscript substitution to all formats Feb 2, 2025
