Japanese Dakuten separation leads to incorrect conversion #1871

Open
Rick-McCoy opened this issue Mar 7, 2024 · 3 comments

@Rick-McCoy

Consider the phrase るゔぃは, which corresponds to r'uviha. The character る is r'u, ゔぃ is vi, and は is ha.
The current mechanism for Japanese splits the ゔ (U+3094) character into two characters, う (U+3046) and U+3099 (the combining dakuten).
Alone, ゔぃ is converted into vi without a hitch. However, since there is a rule for るう, that sequence gets converted into r'u:, leaving behind a dangling U+3099 followed by ぃは. The leading two characters don't have corresponding rules and each default to "Japanese letter".
Thus, instead of r'uviha, the result becomes r'u: Japanese letter Japanese letter ha.
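
For reference, here is a minimal check of that decomposition outside espeak-ng, using Python's unicodedata (espeak-ng's internal splitting may not literally go through NFD, but the resulting character sequence is the same):

import unicodedata

decomposed = unicodedata.normalize("NFD", "るゔぃは")
print([hex(ord(c)) for c in decomposed])
# ['0x308b', '0x3046', '0x3099', '0x3043', '0x306f']
# i.e. る, う, U+3099 (combining dakuten), ぃ, は: the ゔ is now う followed by U+3099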

Disclaimer: I am not a native speaker of Japanese.

@jaacoppi
Collaborator

jaacoppi commented Mar 8, 2024 via email

@Rick-McCoy
Author

Huh, I think that's the true extent of this bug; the dakuten itself isn't the cause.
Since there are overlapping rules for both るう and うぃ, espeak-ng converts the former first and then fails upon encountering the dangling U+3099.

From what I can glean, espeak-ng works through the input sequentially, consuming the longest grapheme sequence that matches a rule at each step, i.e. a greedy algorithm (there is a small sketch of this after the list below).
If that is the case, we could handle these anomalies by specifying every possible corner case:

かあぁ -> ka a:
しいぃ -> s\\i i:
つうぅ -> t_su u:
ねえぇ -> ne e:
...
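
To make the failure mode concrete, here is a toy version of that greedy matching (the rule table below is made up for illustration; it is not espeak-ng's actual ja_rules content):

# Toy greedy longest-match translator; rule keys and phoneme strings are illustrative only.
RULES = {
    "るう": "r'u:",
    "う\u3099ぃ": "vi",   # the decomposed form of ゔぃ
    "る": "r'u",
    "う": "u",
    "は": "ha",
}

def greedy_translate(text):
    out, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):   # try the longest chunk first
            chunk = text[i:i + length]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += length
                break
        else:
            out.append("<Japanese letter>")          # no rule found: fall back, like espeak-ng does
            i += 1
    return " ".join(out)

print(greedy_translate("るう\u3099ぃは"))   # the decomposed るゔぃは
# r'u: <Japanese letter> <Japanese letter> ha

If the matcher took る on its own first, the う + U+3099 + ぃ rule would fire next and produce the expected r'u vi ha, which is why the overlap between るう and the decomposed ゔ is what triggers the bug.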

Or alternatively, we could just add rules for all the small kana and call it a day:

ぁ -> a
ぁー -> a:
ぃ -> i
...

Of course, this still leaves the problem of the dakuten (and handakuten), which by definition don't have a fixed sound of their own.

I propose a mixed strategy: remove the separation of dakuten/handakuten and treat graphemes such as ゔ as single graphemes.
Then we could add the smaller versions separately.
We would still need to rewrite most rules, but I think this would minimize the work necessary.
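
If it helps to see it concretely, undoing the split is just Unicode canonical composition; a quick sketch in Python (only to illustrate the idea, not a patch for the actual pre-processing code):

import unicodedata

decomposed = "る" + "う\u3099" + "ぃは"              # what the current splitting produces
recomposed = unicodedata.normalize("NFC", decomposed)
print([hex(ord(c)) for c in recomposed])
# ['0x308b', '0x3094', '0x3043', '0x306f']  i.e. る ゔ ぃ は

With ゔ kept as a single U+3094, a rule written against ゔぃ can match directly and no U+3099 is ever left dangling.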

@Rick-McCoy
Author

Hmm, this isn't limited to small kana, either. The long vowel indicator (chōonpu) causes this too:

$ espeak-ng -v ja とおー -X
Translate 'とおー'
 36     と      [to]
 57     とお   [to:]

Translate 'と'
 36     と      [to]

Translate 'お'
 36     お      [o]

Translate 'ー'
Found: '_ja' [dZ'ap@ni:z]  
t'o 'o _:(en)dZ'ap@ni:z(ja)l'et@

Unlike the above samples, which are admittedly pretty niche, this is a very common combination.
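
A quick check also suggests this case has nothing to do with decomposition: ー (U+30FC) has no canonical decomposition, so the failure here is purely the rule lookup leaving a character with no match (Python again, just as a sanity check):

import unicodedata

print(unicodedata.normalize("NFD", "とおー") == "とおー")   # True: nothing gets split
print(repr(unicodedata.decomposition("ー")))               # '': no canonical decomposition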
