unicode 787, combining comma above, sometimes occurs where there should be 8125 (koronis) or 8217 (apostrophe) #31

Open
bcrowell opened this issue Feb 4, 2022 · 0 comments

Certain XML files in version 2.1 of the treebank contain Unicode 787 (U+0313, combining comma above) as a marker for elision. I could be misunderstanding something, but this seems like a mistake. This character should be 8125 (U+1FBD, koronis) or 8217 (U+2019, apostrophe). The combining character is a non-spacing version of the koronis: it has no width of its own and is drawn over a neighboring character. (Unicode specifies that a combining mark attaches to the base character before it, but not every renderer displays it that way.)

The instances of this character can of course be located and cleaned up most easily with software, but as an example, there is tlg0012.tlg001.perseus-grc1.tb.xml, which has the following for Iliad 2.191:

  <word id="9" form="ἀλλ̓" lemma="ἀλλά" postag="c--------" head="0" relation="COORD" cite="urn:cts:greekLit:tlg0012.tlg001:2.191"/>

In the text editors and browser I'm using, what I see here is that the 787 character combines with the double quote after it and is displayed in a way that is clearly incorrect.
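As a rough illustration of locating these instances with software, here is a minimal Ruby sketch. The helper name `forms_with_combining_comma` is made up for this example, and it assumes the `form="..."` attribute layout shown above rather than doing real XML parsing.

```ruby
# Hypothetical helper, not part of the treebank tooling.
COMBINING_COMMA_ABOVE = [787].pack('U') # U+0313

# Returns [line_number, form] pairs for every form="..." attribute
# whose value contains U+0313.
def forms_with_combining_comma(xml_text)
  hits = []
  xml_text.each_line.with_index(1) do |line, lineno|
    line.scan(/form="([^"]*)"/).flatten.each do |form|
      hits << [lineno, form] if form.include?(COMBINING_COMMA_ABOVE)
    end
  end
  hits
end

# Two sample lines: the first uses U+0313, the second uses the koronis U+1FBD.
bad_form  = [7936, 955, 955, 787].pack('U*')
good_form = [7936, 955, 955, 8125].pack('U*')
sample = %(<word id="9" form="#{bad_form}"/>\n<word id="10" form="#{good_form}"/>\n)

forms_with_combining_comma(sample).each do |lineno, form|
  puts "line #{lineno}: #{form} #{form.chars.map(&:ord)}"
end
```

Only the first sample line is reported, since the second already uses the spacing koronis.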

A similar issue occurs in the preceding line, Iliad 2.190:

  <word id="1" form="δαιμόνἰ" lemma="δαιμόνιος" postag="a-s---mv-" head="4" relation="ExD" cite="urn:cts:greekLit:tlg0012.tlg001:2.190"/>

Here the iota+apostrophe in δαιμόνι' is encoded as an iota with smooth breathing. I think this should also be pretty simple to take care of with a computerized search. E.g., if you have a word that's three characters or more in length, and the final character has a breathing mark, then that has to be a mistake. Ditto if the word has a breathing mark but the first character isn't a vowel or ρ.
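The two checks just described could be sketched roughly as follows. The method names are invented for illustration. The sketch assumes that decomposing each character with NFD is an acceptable way to detect a breathing mark: the smooth and rough breathings then show up as U+0313 and U+0314, while the spacing koronis U+1FBD is left alone.

```ruby
# Hypothetical helpers for illustration; not part of any treebank tooling.
BREATHING_MARKS = [0x0313, 0x0314].pack('U*').chars # combining psili and dasia
VOWELS_AND_RHO  = %w[α ε η ι ο υ ω ρ]

# True if the grapheme cluster carries a smooth or rough breathing.
def breathing?(cluster)
  cluster.unicode_normalize(:nfd).chars.any? { |c| BREATHING_MARKS.include?(c) }
end

# First code point of the decomposed cluster, lowercased (the bare letter).
def base_letter(cluster)
  cluster.unicode_normalize(:nfd)[0].downcase
end

# Check 1: three or more characters and the last one has a breathing mark.
# Check 2: a breathing mark somewhere, but the word does not start with a vowel or rho.
def suspicious_breathing?(word)
  clusters = word.grapheme_clusters
  return false if clusters.empty?
  return true if clusters.length >= 3 && breathing?(clusters.last)
  clusters.any? { |c| breathing?(c) } && !VOWELS_AND_RHO.include?(base_letter(clusters.first))
end
```

A form that ends in λ plus U+0313 is flagged by the first check as well, so the same pass would catch both kinds of error. Hits still need a look by eye: crasis forms such as κἀγώ legitimately carry a smooth-breathing-like mark on a non-initial vowel, and the second check would flag them.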

Here is some Ruby code I wrote to postprocess the treebank XML files to deal with these issues and some others:

def clean_up_combining_characters(s)
  combining_comma_above = [787].pack('U') # U+0313
  greek_koronis = [8125].pack('U')        # U+1FBD
  s = s.gsub(combining_comma_above, greek_koronis)
  # ... mistaken use of combining comma above rather than the spacing version
  #     https://github.com/PerseusDL/treebank_data/issues/31
  s2 = s
  # seeming one-off errors in perseus: a spacing breathing+accent character followed by a
  # letter that already carries the same marks. Note pack('U*'): pack('U') packs only the
  # first code point, so the old version replaced just the spacing mark and left doubled
  # ἄἄ / ἥἥ behind, which then had to be collapsed in a separate step.
  [[8158, 7973],  # dasia and oxia, then eta with dasia and oxia
   [8142, 7940],  # psili and oxia, then alpha with psili and oxia
   [8142, 7988]   # psili and oxia, then iota with psili and oxia
  ].each do |mark, letter|
    s2 = s2.gsub([mark, letter].pack('U*'), [letter].pack('U'))
  end
  acute = [769].pack('U') # U+0301, combining acute accent
  s2 = s2.gsub(/#{acute}([μτ])/) { $1 } # accent on a mu or tau, obvious error
  s2 = s2.gsub(/#{acute}ε/) { 'έ' }
  s2 = s2.gsub(/#{[180].pack('U')}([κ])/) { $1 } # spacing acute (U+00B4) on a kappa, obvious error
  s2 = s2.gsub([834].pack('U'), '') # U+0342, combining Greek perispomeni; dropped here as in the original
  s2 = s2.gsub(/ʽ([ἁἑἱὁὑἡὡ])/) { $1 } # redundant rough breathing mark
  s2 = s2.gsub(/(?<=[[:alpha:]][[:alpha:]])([ἀἐἰὀὐἠὠ])(?![[:alpha:]])/) { $1.tr("ἀἐἰὀὐἠὠ","αειουηω")+"᾽" }
  # ... smooth breathing on the last character of a long word; this is a mistake in representation of elision
  #     https://github.com/PerseusDL/treebank_data/issues/31
  if s2 != s
    $stderr.print "cleaning up what appears to be an error in a combining character, #{s} -> #{s2}, unicode #{s.chars.map(&:ord)} -> #{s2.chars.map(&:ord)}\n"
  end
  s2
end