Extending Unicode superscript/subscript substitution to all formats #10591

Open
adunning opened this issue Feb 2, 2025 · 8 comments
Contributor

adunning commented Feb 2, 2025

Pandoc has partial support for converting characters formatted as superscript or subscript to their Unicode equivalents, where possible:

-- | Tries to convert a character into a unicode superscript version of

This is applied to plain text output only, but it would be helpful if the list could include more characters (see http://unicode.org/reports/tr30/datafiles/SuperscriptFolding.txt and https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts) and if this functionality could be made available in all formats.

Using a native Unicode character better matches the weight and size of a typeface, while applied superscript formatting results in an overly light weight. You can see the difference here between added formatting (3<sup>a</sup>, 4<sup>o</sup>) and Unicode (3ª, 4º), and it would be very useful not to have to worry about encoding these differently. In addition, as noted in jgm/citeproc#147, Unicode superscripts are automatically converted to manual formatting, meaning they need to be replaced again if one cares about this.

This should probably be optional rather than modifying the default smart behaviour, since some fonts do not have a full set of Unicode superscripts.
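
A minimal sketch of the opt-in behaviour described above, assuming an all-or-nothing rule: a superscripted run is replaced only when every character in it has a Unicode counterpart, otherwise the caller keeps the ordinary formatting. The table and the function name are hypothetical and are not taken from pandoc's source.

```python
# Hypothetical mapping for illustration; not pandoc's actual table.
SUPERSCRIPTS = {
    "0": "\u2070", "1": "\u00b9", "2": "\u00b2", "3": "\u00b3",
    "4": "\u2074", "5": "\u2075", "6": "\u2076", "7": "\u2077",
    "8": "\u2078", "9": "\u2079",
    "+": "\u207a", "-": "\u207b", "=": "\u207c", "(": "\u207d", ")": "\u207e",
    # Ordinal indicators and the two letters in the U+2070 block.
    "a": "\u00aa", "o": "\u00ba", "i": "\u2071", "n": "\u207f",
}


def to_unicode_superscript(text):
    """Return text in Unicode superscripts, or None if any character lacks one."""
    if not text:
        return None
    converted = [SUPERSCRIPTS.get(ch) for ch in text]
    if None in converted:
        # Leave the decision to the caller, e.g. fall back to markup.
        return None
    return "".join(converted)
```

With this rule, `to_unicode_superscript("a")` gives `ª`, while `to_unicode_superscript("th")` gives `None` because neither letter is in the table.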

adunning changed the title from "Extending Unicode superscript/subscript substitution too all formats" to "Extending Unicode superscript/subscript substitution to all formats" on Feb 2, 2025
Owner

jgm commented Feb 2, 2025

My main worry is about the availability of the superscript glyphs in fonts. But I have no idea if this is a serious issue with modern fonts.
[Oh, I see the suggestion that it be made optional.]

Owner

jgm commented Feb 2, 2025

Looking at this just a little, I find it extremely confusing. For example, there is a unicode code block for superscripts and subscripts.
https://unicode.org/charts/PDF/U2070.pdf
It contains subscripts for all the digits 0-9 but superscripts for only 4-9 (perhaps because 0-3 are elsewhere?) And then super/subscripts for a few random letters. There seems to be no rhyme or reason to it. By combing through a whole bunch of other code blocks, you can cobble together other letters, but using things called "spacing modifier letters" and IPA symbols for this purpose doesn't seem quite right.

I experimented with fonts and found that quite a few of the fonts I use don't have the glyphs for superscripted letters, though a few do.
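
One way to see how uneven the letter coverage is, sketched with Python's standard `unicodedata` module: scan the Basic Multilingual Plane for characters whose compatibility decomposition is `<super>` plus a single ASCII letter. The function name is hypothetical, and the result depends on the Unicode version bundled with the Python build, so no output is shown.

```python
import string
import unicodedata


def superscript_letters():
    """Map each ASCII letter to the BMP characters that decompose to it as <super>."""
    found = {}
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[0] == "<super>":
            base = chr(int(parts[1], 16))
            if base in string.ascii_letters:
                found.setdefault(base, []).append(chr(cp))
    return found


found = superscript_letters()
missing = [c for c in string.ascii_lowercase if c not in found]
print("lowercase letters without a <super> form in the BMP:", "".join(missing))
```

This says nothing about font support; it only lists what Unicode itself defines.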


bpj commented Feb 3, 2025

Superscript digits 1-3 are in the Latin-1 Supplement block right after Basic Latin/ASCII.

While the superscript/subscript digits are meant for general use, most of the superscript letters and the few subscript letters are meant for phonetic transcription, as is evident from the many phonetic “special” letters among them. Also, AFAIK not all Basic Latin letters have superscript equivalents, not to speak of other scripts, nor do they form regular upper/lower case pairs.

It might possibly make sense to use superscript digits, which are well supported by many fonts, for footnote references in plain output, but note that you won’t find them if you search for regular digits, which IMO is a serious enough drawback to not do it.
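
The search drawback can be shown in a few lines. The sketch below also shows one possible mitigation on the consumer side, assuming the searching tool is free to normalize text first: Unicode NFKC normalization folds superscript and subscript digits back to ordinary digits. The helper name is hypothetical.

```python
import unicodedata


def fold_scripts(text):
    """Fold compatibility characters (including super/subscripts) to plain forms."""
    return unicodedata.normalize("NFKC", text)


line = "See the appendix.\u00b9\u00b2"   # footnote reference 12 as superscript digits

print("12" in line)                # False: a plain-digit search misses it
print("12" in fold_scripts(line))  # True after NFKC folding
```

NFKC also folds other compatibility characters such as ligatures and full-width forms, so it is suited to building a search index rather than to rewriting the document itself.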

Anyway, below is a (TSV) list of all the “Latin” superscript and subscript digits. Note that the first three are in another block and also out of order relative to the others and each other! (Note how random support is in the font GitHub uses for code blocks! I use Noto Sans Mono in my terminal/Vim so I can see them all and more besides.)

²   2   U+00B2  SUPERSCRIPT TWO
³   3   U+00B3  SUPERSCRIPT THREE
¹   1   U+00B9  SUPERSCRIPT ONE
⁰   0   U+2070  SUPERSCRIPT ZERO
⁴   4   U+2074  SUPERSCRIPT FOUR
⁵   5   U+2075  SUPERSCRIPT FIVE
⁶   6   U+2076  SUPERSCRIPT SIX
⁷   7   U+2077  SUPERSCRIPT SEVEN
⁸   8   U+2078  SUPERSCRIPT EIGHT
⁹   9   U+2079  SUPERSCRIPT NINE
₀   0   U+2080  SUBSCRIPT ZERO
₁   1   U+2081  SUBSCRIPT ONE
₂   2   U+2082  SUBSCRIPT TWO
₃   3   U+2083  SUBSCRIPT THREE
₄   4   U+2084  SUBSCRIPT FOUR
₅   5   U+2085  SUBSCRIPT FIVE
₆   6   U+2086  SUBSCRIPT SIX
₇   7   U+2087  SUBSCRIPT SEVEN
₈   8   U+2088  SUBSCRIPT EIGHT
₉   9   U+2089  SUBSCRIPT NINE
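
For anyone who wants to regenerate or extend the list, here is a small sketch that rebuilds the same rows from Python's `unicodedata` module. The code point list is typed in by hand from the table above; the base digit is read from each character's compatibility decomposition.

```python
import unicodedata

# The three Latin-1 code points first, then the U+2070 block, in code point order.
CODE_POINTS = [0x00B2, 0x00B3, 0x00B9, 0x2070,
               *range(0x2074, 0x207A), *range(0x2080, 0x208A)]


def script_digit_rows():
    """Return (character, base digit, code point, name) for each script digit."""
    rows = []
    for cp in CODE_POINTS:
        ch = chr(cp)
        # Decomposition looks like "<super> 0032" or "<sub> 0032".
        _tag, base_hex = unicodedata.decomposition(ch).split()
        rows.append((ch, chr(int(base_hex, 16)), f"U+{cp:04X}", unicodedata.name(ch)))
    return rows


for row in script_digit_rows():
    print("\t".join(row))
```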

Contributor

iandol commented Feb 3, 2025

Superscript minus U+207B ⁻ is also really useful for scientific notation, e.g. 4.3×10⁻⁵
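
A short sketch of that use, assuming the exponent is an integer and the mantissa is already formatted the way you want it. The function and table names are hypothetical.

```python
# Map ASCII digits and hyphen-minus to superscript digits and U+207B.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁻")


def scientific(mantissa, exponent):
    """Format mantissa × 10^exponent with a Unicode superscript exponent."""
    return f"{mantissa}×10{str(exponent).translate(SUPERSCRIPT_DIGITS)}"


print(scientific(4.3, -5))  # 4.3×10⁻⁵
```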

Contributor Author

adunning commented Feb 3, 2025

Yes, Unicode added superscript/subscript characters for specific purposes over time, hence the variable font support.

Owner

jgm commented Feb 3, 2025

Note that we already do use unicode super/subscript digits in plain output.
(and superscript minus)

Contributor

silby commented Feb 12, 2025

In #9437 I tried a related idea in HTML specifically, and apart from the questionable value of adding Pandoc's 900th command line option, the font coverage of superscript numbers in web-safe fonts (on my computer anyway) was spotty.


fiapps commented Feb 28, 2025

If the goal is to have superscripts and subscripts that match the weight and size of the typeface, another way to achieve this is with the OpenType features `sups` for superscripts and `subs` or `sinf` for subscripts. A font that has a set of glyphs for superscript or subscript characters ought to allow you to access them with OpenType features. Depending on the font, this may provide additional glyphs, such as lowercase superscript letters, that Unicode alone cannot specify.

For formats that allow you to activate font features, no modification to pandoc is necessary. For example, LaTeX has the realscripts package, which checks for the presence of these font features and uses them to implement superscript and subscript, falling back to faking superscript and subscript if necessary.

---
header-includes: | 
    ```{=latex}
    \usepackage{realscripts}
    ``` 
---

For other formats, you might need to use a filter to replace superscript/subscript with a class (defined in CSS) or custom style (defined in a reference document) that will activate the relevant OpenType feature.
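
A rough sketch of such a filter, written against pandoc's JSON AST using only the Python standard library. It rewrites every `Superscript` and `Subscript` node into a `Span` carrying a class; the class names `sups` and `subs` are hypothetical and would need matching CSS rules (or a different mechanism for other formats). A Lua filter would do the same job; this is only one possible shape.

```python
import json
import sys

# Hypothetical class names; define them in your own stylesheet.
CLASS_FOR = {"Superscript": "sups", "Subscript": "subs"}


def retag(node):
    """Recursively replace Superscript/Subscript nodes with classed Spans."""
    if isinstance(node, list):
        return [retag(item) for item in node]
    if isinstance(node, dict):
        node = {key: retag(value) for key, value in node.items()}
        tag = node.get("t")
        if isinstance(tag, str) and tag in CLASS_FOR:
            # Span content is [Attr, [Inline]]; Attr is [id, classes, key-value pairs].
            return {"t": "Span", "c": [["", [CLASS_FOR[tag]], []], node["c"]]}
        return node
    return node


def main():
    """Read a pandoc JSON document on stdin and write the rewritten one to stdout."""
    json.dump(retag(json.load(sys.stdin)), sys.stdout)


fragment = {"t": "Para", "c": [{"t": "Str", "c": "x"},
                               {"t": "Superscript", "c": [{"t": "Str", "c": "2"}]}]}
print(json.dumps(retag(fragment)))
```

To use it as a filter, call `main()` at the bottom of the script and pass the script to pandoc with `--filter`. In CSS, a rule such as `.sups { font-variant-position: super; }` is one way to request the font's own superscript glyphs, where the browser and font support it.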
