Japanese Dakuten separation leads to incorrect conversion #1871

Open
Rick-McCoy opened this issue Mar 7, 2024 · 3 comments

@Rick-McCoy

Consider the phrase るゔぃは, which corresponds to r'uviha. The character る is r'u, ゔぃ is vi, and は is ha.
The current mechanism for Japanese splits the ゔ (U+3094) character into two characters, う (U+3046) and U+3099 (the combining dakuten).
Alone, ゔぃ is converted into vi without a hitch. However, since there is a rule for るう, that sequence gets converted into r'u:, leaving behind a dangling U+3099 followed by ぃは. The leading two characters don't have corresponding rules and each default to "Japanese letter".
Thus, instead of r'uviha, the result becomes r'u: Japanese letter Japanese letter ha.
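
For reference, here is a minimal check of that decomposition outside espeak-ng, using Python's unicodedata (espeak-ng's internal splitting may not literally go through NFD, but the resulting character sequence is the same):

import unicodedata

decomposed = unicodedata.normalize("NFD", "るゔぃは")
print([hex(ord(c)) for c in decomposed])
# ['0x308b', '0x3046', '0x3099', '0x3043', '0x306f']
# i.e. る, う, U+3099 (combining dakuten), ぃ, は: the ゔ is now う followed by U+3099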

Disclaimer: I am not a native speaker of Japanese.

@jaacoppi
Collaborator

jaacoppi commented Mar 8, 2024 via email

@Rick-McCoy
Author

Huh, I think that's the true extent of this bug; the dakuten itself isn't the cause.
Since there are overlapping rules for both るう and うぃ, espeak-ng converts the former first and then fails upon encountering the dangling U+3099.

From what I can glean, espeak-ng works through the input sequentially, consuming the longest grapheme sequence that matches a rule at each step, i.e. a greedy algorithm (there is a small sketch of this after the list below).
If that is the case, we could handle these anomalies by specifying every possible corner case:

かあぁ -> ka a:
しいぃ -> s\\i i:
つうぅ -> t_su u:
ねえぇ -> ne e:
...
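
To make the failure mode concrete, here is a toy version of that greedy matching (the rule table below is made up for illustration; it is not espeak-ng's actual ja_rules content):

# Toy greedy longest-match translator; rule keys and phoneme strings are illustrative only.
RULES = {
    "るう": "r'u:",
    "う\u3099ぃ": "vi",   # the decomposed form of ゔぃ
    "る": "r'u",
    "う": "u",
    "は": "ha",
}

def greedy_translate(text):
    out, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):   # try the longest chunk first
            chunk = text[i:i + length]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += length
                break
        else:
            out.append("<Japanese letter>")          # no rule found: fall back, like espeak-ng does
            i += 1
    return " ".join(out)

print(greedy_translate("るう\u3099ぃは"))   # the decomposed るゔぃは
# r'u: <Japanese letter> <Japanese letter> ha

If the matcher took る on its own first, the う + U+3099 + ぃ rule would fire next and produce the expected r'u vi ha, which is why the overlap between るう and the decomposed ゔ is what triggers the bug.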

Or alternatively, we could just add rules for all the small kana and call it a day:

ぁ -> a
ぁー -> a:
ぃ -> i
...

Of course, this still leaves the problem of the dakuten (and handakuten), which by definition don't have a fixed sound of their own.

I propose a mixed strategy: remove the separation of dakuten/handakuten and treat graphemes such as ゔ as single graphemes.
Then we could add the smaller versions separately.
We would still need to rewrite most rules, but I think this would minimize the work necessary.
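
If it helps to see it concretely, undoing the split is just Unicode canonical composition; a quick sketch in Python (only to illustrate the idea, not a patch for the actual pre-processing code):

import unicodedata

decomposed = "る" + "う\u3099" + "ぃは"              # what the current splitting produces
recomposed = unicodedata.normalize("NFC", decomposed)
print([hex(ord(c)) for c in recomposed])
# ['0x308b', '0x3094', '0x3043', '0x306f']  i.e. る ゔ ぃ は

With ゔ kept as a single U+3094, a rule written against ゔぃ can match directly and no U+3099 is ever left dangling.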

@Rick-McCoy
Author

Hmm, this isn't limited to small kana, either. The long vowel indicator (chōonpu) causes this too:

$ espeak-ng -v ja とおー -X
Translate 'とおー'
 36     と      [to]
 57     とお   [to:]

Translate 'と'
 36     と      [to]

Translate 'お'
 36     お      [o]

Translate 'ー'
Found: '_ja' [dZ'ap@ni:z]  
t'o 'o _:(en)dZ'ap@ni:z(ja)l'et@

Unlike the above samples, which are admittedly pretty niche, this is a very common combination.
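
A quick check also suggests this case has nothing to do with decomposition: ー (U+30FC) has no canonical decomposition, so the failure here is purely the rule lookup leaving a character with no match (Python again, just as a sanity check):

import unicodedata

print(unicodedata.normalize("NFD", "とおー") == "とおー")   # True: nothing gets split
print(repr(unicodedata.decomposition("ー")))               # '': no canonical decomposition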
