Unicode troubles? #349

Open
sixtyfive opened this issue Dec 1, 2021 · 10 comments

@sixtyfive

sixtyfive commented Dec 1, 2021

$ ack '[ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω]' file.xml

74:         <String CONTENT="(2) AL-MUKHTAṢAR FĪ ʿILM AL-ISTIBDĀL, by AL-KĀ-"
91:         <String CONTENT="FIYĀJĪ."
108:        <String CONTENT="[Another tract on the same subject; foll. 9—16.]"
142:        <String CONTENT="Foll. 16. 17·7 × 13·3 cm. Clear scholar’s naskh."
159:        <String CONTENT="Copyist, Yaḥyā b. ʿAbd al-Ghanī b. ʿAlī al-Imām."
176:        <String CONTENT="Dated 13 Jumādā II 870 (30 January 1466)."
210:        <String CONTENT="(1) TAʾSĪS AL-NAẒĀʾIR, by Abū Zaid ʿAbd (ʿUbaid) Allāh b."
227:        <String CONTENT="ʿUmar b. ʿĪsā AL-DABŪSĪ (d. 430/1039)."
312:        <String CONTENT="(2) AL-IḤKĀM FĪ MAʿRIFAT AL-AIMĀN WAʾL-"
329:        <String CONTENT="AḤKĀM, by AL-KĀFIYĀJĪ (d. 879/1474)."
380:        <String CONTENT="Dated 4 Ramaḍān 866 (2 June 1462)."
414:        <String CONTENT="(3) IJĀRAT AL-IQṬĀʿ, by Zain al-Dīn Abu ʾl-Faḍl al-Qāsim b."
431:        <String CONTENT="ʿAbd Allāh B. QUṬLŪBUGHĀ al-Ḥanafī al-Sūdūnī (d. 879/1474)."
499:        <String CONTENT="Foll. 87. 17·7 × 13 cm. Clear scholar’s naskh."

I can't figure out what the output has to do with the regular expression here. The highlighted characters are all letters with diacritics; my intention was to search for all Greek letters (I also tried \p{Greek}, which returned no results at all). What am I missing here?

@petdance
Collaborator

petdance commented Dec 1, 2021

I'm sorry. ack does not handle Unicode very well at all. I'm tagging this issue as Unicode and maybe some day we can somehow address it.

@n1vux
Contributor

n1vux commented Dec 1, 2021

While Perl and most other modern programming languages allow subroutine and variable names to be Unicode, and thus in the natural language of the coder, actual usage is nearly uniformly Latin-1 or ASCII, matching the alphabet and often the language of the keywords.

Ack is unapologetically defined as a coder's search tool for searching collections of code files, even though some folks (including myself, one of Andy Petdance's associate devs) use it off-label to search data, both structured and unstructured. (I have a large collection of OCR text, indexed with swish-e but searched internally with ack; a paragrep mode would surely be useful.) This is why it has filetype shortcuts like --perl. As such, support for Unicode has not had the priority it would have were this a tool defined as being for searching data and only incidentally good for searching code.

If one has an "older" perl (5.28 or earlier), there is a workaround to trick Ack into using Perl's native Unicode support, which sort of mostly works:
#222 (comment)
(But alas, sysread on a Unicode filehandle was deprecated in Perl 5.24-5.28 and is fatal in 5.30.)

@n1vux
Contributor

n1vux commented Dec 1, 2021

The other issue with Greek letters in particular is that they appear at multiple Unicode codepoints with different semantics: there's Math Greek, with Bold etc. variants; there's Greek Greek (upper and lower); there are Cyrillic and Armenian letters that look the same as Greek letters; and maybe more. https://codepoints.net/U+03C0 lists 20 related characters for π, and more than thrice as many "confusables".

(And there's no guarantee (unless you have wonderful provenance!) that a document/file uses the Nu ν or Omicron Ο codepoint from the semantically correct sequence (Unicode Script, Category, & Block). I have no clue which ν my ⏹*n X-compose sequence inserted here, or if GH will swap it! Documents may even sloppily use Latin O where &omicron; should have been used.)

@sixtyfive
Author

there's no guarantee (unless you have wonderful provenance!) that a document/file uses the Nu ν or Omicron Ο codepoint from the semantically correct sequence (Unicode Script, Category, & Block)

Except for when your OCR engine takes a whitelist of allowed codepoints :-)

Ack is unapologetically defined as a coder's search tool for searching collections of code files

For what it's worth, even though the example was also from an OCR file, I do have collections of code files with non-ASCII characters in them, both in the comments and in the code itself. Such is the nature of working with natural language. I'm aware that digital humanities is somewhat of a niche phenomenon; we're still coders nonetheless.

But hey, this is your tool, I just happen to love using it, and have (yesterday for the first time, by the way) stumbled upon something unexpected.

@n1vux
Contributor

n1vux commented Dec 2, 2021

Except for when your OCR engine takes a whitelist of allowed codepoints

That could count as "wonderful provenance" 😄 .

digital humanities

Indeed.
(One of the committee members who spun XML off of SGML was a Digital Humanities academic. A dear friend.)

Comments outside the Latin-1 alphabet will be more common than non-ASCII identifiers, whether from digital humanities folks or "regular" developers just writing their comments in the language they're most expressive in when writing for themselves and not for far-away customers.
If only for comments, it would be good to support Unicode.

If you look at the linked Unicode tickets, you'll see that one of the requirements for doing Unicode right will be multiplying test cases and test data. Digital Humanities / Modern Languages talent might be useful when (I say when, not if, hopefully) we get to it. Since the command-line hack mostly worked (up to 5.28), I expect redoing the testing N times is most of the work, but there's some architectural choice in how to handle mix-and-match files.

@hftf

hftf commented Dec 3, 2021

$ ack '[ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω]' file.xml

I can't figure out what the output has to do with the regular expression here. The highlighted characters are all letters with diacritics; my intention was to search for all Greek letters (I also tried \p{Greek}, which returned no results at all). What am I missing here?

While I know this is not technically a support forum, I will try to reply directly to the filer's question with a simple explanation and an easy, practical workaround, since I've been in the exact same boat before, and for a while. I leave it to others to handle this thread as a bug report for Ack.

  1. The common way Unicode data is stored is via the UTF-8 encoding, in which most of the rarer characters are represented by a sequence of multiple bytes. For example, the character α (U+03B1 GREEK SMALL LETTER ALPHA) is represented by the two bytes CE B1 in UTF-8.
  2. Regular expression engines have very different implementations. For example, some engines support a mechanism like \p{Greek} for matching any character in a particular Unicode class, while other engines do not understand it at all, or even use \p to mean something completely different. Ack likely doesn't support \p unfortunately.
  3. Currently in Ack, a multibyte Unicode character inside of a character class seems to behave as a character class over the character's individual bytes. So think of [αβ] as [\xCE\xB1\xCE\xB2], a character class over four half-characters (three unique); then it's no wonder this pattern would match (the first byte/the first half of) γ!
  4. Therefore, an easy workaround is replacing the character class [αβ...] with a disjunction of sequences (α|β|...).

@n1vux
Contributor

n1vux commented Dec 3, 2021

FWIW, there is a support forum: the ack-users mailing list.

Minor correction: Ack uses the Perl RE engine, in which \p is supported.
(Ack's only differences from Perl RE are prohibitions on unsafe features, or failures in our input processing. One should be able to use the full documented RE features of whichever Perl you invoke Ack with, including (?xism: ), provided you manage the shell escapes.)
So \p isn't specifically unsupported in Ack.
But without Unicode input handling, \p{Greek} will not be useful; as noted in #222, to enable \p{Han} or \p{Greek} with Ack, one needs to force filehandles to UTF-8, which isn't yet available as an Ack command-line option. The following hack warns in Perl 5.24-5.28 and fails with 5.30+, so it is NOT a long-term workaround, but if you have perlbrew or an older Perl, you can use it:

$ perlbrew exec --with perl-5.24.2@class-std perl  -C '-Mopen IO=>":encoding(UTF-8)"' ~/bin/ack --noenv '\p{Han}'  bugs/han.txt
sysread() is deprecated on :utf8 handles at /home/wdr/bin/ack line 4894.
hello 世界

$ perlbrew exec --with perl-5.24.2@class-std perl  -C '-Mopen IO=>":encoding(UTF-8)"' ~/bin/ack --noenv '\p{Greek}'  bugs/greek.txt
sysread() is deprecated on :utf8 handles at /home/wdr/bin/ack line 4894.
Ἄλκηστις
Ἄδμηθ', ὁρᾷς γὰρ τἀμὰ πράγμαθ' ὡς ἔχει,
λέξαι θέλω σοι πρὶν θανεῖν ἃ βούλομαι.
...

You are correct that [αβ...] considered as non-Unicode is going to do the wrong thing. Were that pattern inline in a Perl program, use utf8; at the top of the file would have it understood properly as UTF-8. But we're reading it from the shell command line. So to use [αβ...] correctly, Ack would need to handle the command-line regex argument as Unicode (only when high bits marking extension bytes are present? or always?) as well as interpreting the input files as Unicode. That requires an additional code patch or workaround beyond the one that enables \p{Han}. Quite possibly utf8::upgrade($re); ... utf8::upgrade($buffer);.
Whether this can always be done, or automatically as needed, or whether it requires a --do-Unicode command flag, needs exploration.

(On a current Ubuntu, grep does correctly handle the RE [αβ] against an implicitly UTF-8 Greek test file; and since identifiers, string data, and comments containing UTF-8 Greek, Han, etc. are perfectly legal in many modern programming languages, this is presumed desirable behavior.)

@n1vux
Contributor

n1vux commented Dec 3, 2021

Additional note: the sysread incompatibility with the Unicode-inputs workaround is only in our pre-check optimization, which may be turned off with --passthru (which results in non-matching lines printing, unhighlighted).

@n1vux
Contributor

n1vux commented Dec 3, 2021

Additional aside re Perl RE engine and Unicode:

One cannot expect the RE [ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω], or the equivalent [Α-Ωα-ω], to match the "pre-composed" accented Unicode codepoints in a text such as ὑμῖν δέ, παῖδες, μητρὸς ἐκπεφυκέναι. (which is figuratively as well as literally Greek to me; found test data!). It will only match the unaccented characters, including base letters trailed by combining accents, but not the "pre-composed" codepoints with the accent built in; so it matters which form your OCR (or whatever) is generating.
(Same problem in Latin-1, actually: [A-Za-z] will match the base letter of an a followed by a combining accent, but not á as a single pre-composed codepoint. \w and \p{Letter} are your friends.)

Ref wikipedia Greek diacritics#Unicode

The simple RE will usually be enough to find lines containing at least one Greek letter, but if expanded to [Α-Ωα-ω]+ to find words (which might be desired with -o or --output), it won't match whole words containing accented characters; it will only match the runs of unaccented characters. In that case, the accented pre-composed codepoints (NFC) are recognized as words nicely by \p{Greek}+, but a combining accent in decomposed form (NFD) breaks the word, ugh; one has to move to (?x: \p{Greek} | \p{Diacritic} )+ or (?x: \p{Greek} | \p{gc:Mn} )+ to capture words of normalization-form-decomposed Greek.

(If the file is all Greek, one can just trust -o '\w+' to isolate the words, but that won't reject English or French words. A lookahead requiring the first word character to be Greek, (?=\p{Greek})(\w+), would heuristically make that mostly work, but would accept mixed-alphabet strings like ΦW.)

To handle these subtleties in an e.g. Perl program I would normalize the input to NFD or NFC, depending which behavior is desired.

Input on how Ack should handle Unicode is welcome.
(More such input may move it up the queue.)
( How soon we get to it will depend on having the right volunteer able to work the testing ... )

Should ack assume all files are NFD or NFC? I doubt it. Or trust that the input files selected are already in whichever of NFC or NFD makes sense for the given RE? Maybe. I do not expect Ack to ever guess correctly by detecting sequences in the RE pattern and input files. Routinely converting all inputs to NFD (or NFC) whether needed or not is a non-starter; that makes it slower for all users to benefit a few. A --unicode=NFD|NFC option to request a specific normalization (which digital humanities folks could put in .ackrc) might be possible, but at what cost?

I'm guessing we'll only ever support UTF-8. I've experimented a bit with 16- and 32-bit UTF BOMs, and while it's sometimes possible to detect a file's format if it properly starts with a BOM, BOMs are hardly universally provided; and while I even provided a workaround to allow a collection of UCS-2/UTF-16 files to be searched, it isn't always practical.

@DabeDotCom

Input on how Ack should handle Unicode is welcome. (More such input may move it up the queue.)

Sorry to bump a six-month-old thread, but I arrived here because I was astonished to discover that this didn't DWIM:

perl -CSA -E 'say "w\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}rd"' | ack 'w\Xrd'

To be fair, neither did:

perl -CSA -E 'say "w\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}rd"' | pcre2grep 'w\Xrd'

However, pcre2grep -u worked, both for NFC and NFD forms:

perl -CSA -E 'say "w\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}rd"' | pcre2grep -u 'w\Xrd'
wôrd

perl -CSA -E 'say "wo\N{COMBINING CIRCUMFLEX ACCENT}rd"' | pcre2grep -u 'w\Xrd'
wôrd

PS: As an honorable mention, ack 'w\X+rd' did manage to back into the right answer(s) also — although it would obviously return a lot of false positives, as well:

perl -CSA -E 'say "w\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}rd"' | ack 'w\X+rd'
wôrd

perl -CSA -E 'say "wo\N{COMBINING CIRCUMFLEX ACCENT}rd"' | ack 'w\X+rd'
wôrd

perl -CSA -E 'say "wayward"' | ack 'w\X+rd'
wayward

Vis-a-vis "how Ack should handle Unicode", I would point to the old axiom: "Good artists imitate; great artists steal!" 😎

pcre2grep's -u | --utf and/or -U | --utf-allow-invalid options seem like excellent candidates/precedent for plagiarism... er, I mean "inspiration!"
