unicode 787, combining comma above, sometimes occurs where there should be 8125 (koronis) or 8217 (apostrophe)

Some XML files in version 2.1 of the treebank contain Unicode 787 (U+0313, combining comma above) as a marker for elision. I could be misunderstanding something, but this seems like a mistake: the character should be 8125 (U+1FBD, Greek koronis) or 8217 (U+2019, apostrophe). The combining character is a non-spacing counterpart of the koronis. Unicode defines it as attaching to the preceding base character, but in the software I'm using it gets drawn over whatever character follows it.

The instances of this character can of course be located and cleaned up most easily with software, but as an example, there is tlg0012.tlg001.perseus-grc1.tb.xml, which has the following for Iliad 2.191:

      <word id="9" form="ἀλλ̓" lemma="ἀλλά" postag="c--------" head="0" relation="COORD" cite="urn:cts:greekLit:tlg0012.tlg001:2.191"/>

In the text editors and browser I'm using, what I see here is that the 787 character combines with the double quote after it and is displayed in a way that is clearly incorrect.
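As a rough sketch of how such forms could be located, the following scans raw XML text for `form` attributes containing U+0313 (decimal 787) and prints their code points. The name `forms_with_combining_comma` is my own invention, and the regex assumes double-quoted attributes as in the example above; it is not a real XML parser.

```ruby
COMBINING_COMMA_ABOVE = [787].pack('U') # U+0313

# Return every form="..." value in the given XML text that contains U+0313.
# Assumes attribute values are double-quoted, as in the treebank example.
def forms_with_combining_comma(xml_text)
  xml_text.scan(/\bform="([^"]*)"/).flatten.select do |form|
    form.include?(COMBINING_COMMA_ABOVE)
  end
end

# Hypothetical sample line modeled on the Iliad 2.191 entry.
sample = %(<word id="9" form="ἀλλ#{COMBINING_COMMA_ABOVE}" lemma="ἀλλά"/>)
forms_with_combining_comma(sample).each do |form|
  puts "#{form}: #{form.codepoints.inspect}"
end
```

To check a whole file, the same function could be fed the result of `File.read(path, encoding: 'UTF-8')`.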

A similar issue occurs in the preceding line, Iliad 2.190:

      <word id="1" form="δαιμόνἰ" lemma="δαιμόνιος" postag="a-s---mv-" head="4" relation="ExD" cite="urn:cts:greekLit:tlg0012.tlg001:2.190"/>

Here the iota+apostrophe in δαιμόνι' is encoded as an iota with smooth breathing. I think this should also be pretty simple to take care of with a computerized search. E.g., if you have a word that's three characters or more in length, and the final character has a breathing mark, then that has to be a mistake. Ditto if the word has a breathing mark but the first character isn't a vowel or ρ.
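The two heuristics just described might be sketched like this. It is only an illustration: `suspicious_breathing?` is a name I made up, and the approach of decomposing to NFD so that every breathing becomes a separate combining mark is an assumption about how one might implement the check. Crasis forms such as κἀγώ would show up as false positives for the second rule.

```ruby
SMOOTH = [0x0313].pack('U') # combining comma above (smooth breathing)
ROUGH  = [0x0314].pack('U') # combining reversed comma above (rough breathing)
VOWELS_AND_RHO = "αεηιουωρ"

# True if the word trips either heuristic: a breathing on the last letter
# of a word of three or more letters, or a breathing anywhere in a word
# whose first letter is neither a vowel nor rho.
def suspicious_breathing?(word)
  # Decompose, lowercase, and split into base letters plus their marks.
  clusters = word.unicode_normalize(:nfd).downcase.scan(/\p{L}\p{M}*/)
  return false if clusters.empty?
  has_breathing = ->(c) { c.include?(SMOOTH) || c.include?(ROUGH) }
  return true if clusters.length >= 3 && has_breathing.call(clusters.last)
  clusters.any?(&has_breathing) && !VOWELS_AND_RHO.include?(clusters.first[0])
end

puts suspicious_breathing?("δαιμόνἰ") # true: breathing on the final iota
puts suspicious_breathing?("ἀλλά")    # false: breathing only on the initial vowel
```

Because the check works on the decomposed string, a raw U+0313 following a consonant (as in the Iliad 2.191 form) is caught by the same first rule.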

Here is some Ruby code I wrote to postprocess the treebank XML files and deal with these issues and some others:

```
def clean_up_combining_characters(s)
  combining_comma_above = [787].pack('U') # U+0313
  greek_koronis = [8125].pack('U') # U+1FBD
  combining_acute = [769].pack('U') # U+0301
  # All fixes are applied to a copy, s2, so that the comparison at the end reports every change.
  s2 = s.sub(/#{combining_comma_above}/,greek_koronis)
  # ... mistaken use of combining comma above rather than the spacing version
  #     https://github.com/PerseusDL/treebank_data/issues/31
  # seeming one-off errors in perseus:
  # (pack needs 'U*' to encode both code points; a bare 'U' encodes only the first,
  #  which left a doubled letter behind that then had to be collapsed separately)
  s2 = s2.sub(/#{[8158, 7973].pack('U*')}/,"ἥ") # spacing dasia and oxia followed by eta
  s2 = s2.sub(/#{[8142, 7940].pack('U*')}/,"ἄ") # spacing psili and oxia followed by alpha
  s2 = s2.sub(/#{[8142, 7988].pack('U*')}/,"ἴ") # spacing psili and oxia followed by iota
  s2 = s2.sub(/#{combining_acute}([μτ])/) {$1} # accent on a mu or tau, obvious error
  s2 = s2.sub(/#{combining_acute}ε/) {'έ'}
  s2 = s2.sub(/#{[180].pack('U')}([κ])/) {$1} # U+00B4 spacing acute accent on a kappa, obvious error
  s2 = s2.sub(/#{[834].pack('U')}/,'') # U+0342, combining Greek perispomeni; stripping it drops the circumflex
  s2 = s2.sub(/ʽ([ἁἑἱὁὑἡὡ])/) {$1} # redundant rough breathing mark
  s2 = s2.sub(/(?<=[[:alpha:]][[:alpha:]])([ἀἐἰὀὐἠὠ])(?![[:alpha:]])/) { $1.tr("ἀἐἰὀὐἠὠ","αειουηω")+greek_koronis }
  # ... smooth breathing on the last character of a long word; this is a mistake in representation of elision
  #     https://github.com/PerseusDL/treebank_data/issues/31
  if s2 != s
    $stderr.print "cleaning up what appears to be an error in a combining character, #{s} -> #{s2}, unicode #{s.chars.map { |x| x.ord}} -> #{s2.chars.map { |x| x.ord}}\n"
  end
  s2
end
```
