Add Japanese keyboard #104

Open
3 tasks done
henrikth93 opened this issue Mar 13, 2024 · 8 comments

Comments

@henrikth93
Member

Terms

Language support

I do not have a lot of knowledge about the Japanese language, but I thought it would be good to implement.

Contribution

Might need help from someone who knows Japanese.

@andrewtavis
Member

Do you want to start off making versions of the nouns and verbs SPARQL queries, @henrikth93? We might need to check the statements on the Wikidata items for Japanese nouns and verbs, but I'd be happy to help with this!

@andrewtavis andrewtavis added the data Relates to data or Wikidata label Mar 16, 2024
@andrewtavis
Member

Linked to this issue are #105 and #106. For this issue we'll be working on the formatting process.

@wkyoshida
Member

Sharing some pointers..


We'll have to think through how to store different written forms together for the same word. For example, using our classic 'book' example, the following are both ways to write the same word:

  • 本
    • This is the kanji version, which is logographic
    • This character represents 'book' (worth noting though that some words are written with more than one kanji)
  • ほん
    • This is the hiragana version, which is phonetic
    • These are two characters, ほ (ho) and ん (n), which make up ほん (hon)

Apart from the two scripts above, the third main one is katakana, which is also phonetic. Katakana is primarily for distinct cases/meanings, e.g. writing foreign words that have been incorporated into Japanese. Some words though can have variants in all three scripts, with the katakana version having a more specific meaning than the hiragana version. Worth noting as well though that katakana can also be used at times for what would be akin to bold and italics in English.
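
As a rough illustration of the three scripts, the sketch below classifies a string by the Unicode block each character falls in. The names `char_script` and `classify_script` are hypothetical and not part of any existing codebase. It only checks the main Hiragana, Katakana, and CJK Unified Ideographs blocks, so rarer characters come back as "other".

```python
def char_script(ch: str) -> str:
    """Return the script of a single character, based on its Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:  # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs (most kanji)
        return "kanji"
    return "other"


def classify_script(word: str) -> str:
    """Return the single script a word uses, or "mixed" if it uses several."""
    scripts = {char_script(ch) for ch in word}
    if not scripts:
        return "other"  # empty string: nothing to classify
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed"
```

A word like 食べる, which combines kanji and hiragana, would be reported as "mixed" by this sketch.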

@wkyoshida
Member

.. can also be used at times for what would be akin to bold and italics in English.

While this is true, we most likely do not have to store this. It's just something to be aware of.

@andrewtavis
Member

So we should plan on basically having ja and ja-hira versions of all of the queries? Each Japanese lexeme has versions of each of these, and then we'd have different interfaces for each?

@wkyoshida
Member

So we should plan on basically having ja and ja-hira versions of all of the queries?

Hmm.. I just checked, and perhaps not quite, I think.

Some words do not have a kanji form, so I wouldn't expect them to have both ja and ja-hira.
The verb いる (iru) for instance, which very roughly translates to 'to be' or 'to exist', only has a hiragana form, made of the two characters い (i) and る (ru).
However - the lexeme actually marks いる with ja and not ja-hira as might be expected. My guess would be then that ja is marking what would be considered the "full" or the "proper" written form:

  • For 'to be/to exist', it is simply いる, since it has no kanji or katakana form
  • For 'book', it is 本
    • It is worth noting that a version with kanji, if a word has one, is often the "full" form (not sure what to call it 😆)
  • For the verb 'to eat', it is 食べる (taberu), which actually is a combination of kanji AND hiragana. 食 is a kanji associated with eating and food; here it takes on the pronunciation た (ta). べ and る are hiragana, which respectively are for the sounds (be) and (ru)
    • Crucially, notice that there is also a ja-hira for 'to eat', which is the version written fully in hiragana, たべる, which is た (ta) and the same べ (be) and る (ru) used in 食べる
    • It is worth noting though that simply because 食 in the verb 'to eat' has the sound た (ta), it does not mean that it always has that sound. In the word 定食 (teishoku) for instance, which is a style of restaurant menu item, 食 does not have the sound た (ta) but しょく (shoku) instead
  • For 'person', it is the kanji 人 (hito), which actually has three forms, including:
    • the ja-hira form ひと, which is ひ (hi) and と (to)
    • the ja-kana form ヒト, which is ヒ (hi) and ト (to)
  • For 'America', it is the katakana アメリカ (amerika), with ア (a) メ (me) リ (ri) カ (ka)
    • Interestingly, it also has a ja-x-Q754018 form, which if I were to guess, is likely a spelling using kanji chosen for their sounds, so that together they spell the word out phonetically. So in 亜米利加, the characters 亜 (a) 米 (me) 利 (ri) 加 (ka) also sound out (amerika). This is more for proper nouns/names. The kanji that are used don't necessarily need to have a symbolic, associated meaning like in the other examples above. However, using kanji that have both the correct sounds AND a symbolic meaning is often a deliberate poetic/creative decision. This is often done when naming children. Surnames also get this. For instance, mine is spelled 吉田, which has the sounds 吉 (yoshi) 田 (da), but also has the meaning (lucky) (ricefield), perhaps alluding to some ancestors being farmers 🤷

In conclusion, I believe a lexeme should always have a ja form, but it may or may not also have ja-hira, ja-kana, and/or ja-x-Q754018 forms. Crucially, ja can be in any script, whatever the "proper" form is for the word. ja-x-Q754018 may show up (for words like names of places), but I would actually advocate for ignoring them.

@andrewtavis
Member

Thanks for the full explanation, @wkyoshida! Just checking, as there are a lot of situations above and I'm making a last-ditch effort for a simple-ish system: would we be able to query such that for the ja words we just get them based on their language identifier, and for ja-hira we take it if it's there, or otherwise fall back to the ja?
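
The fallback being asked about here could be sketched roughly as below. This assumes the forms of a lexeme have already been collected into a mapping from language tag to written form; that shape and the name `hiragana_or_default` are hypothetical, not an existing API.

```python
from typing import Mapping, Optional


def hiragana_or_default(forms: Mapping[str, str]) -> Optional[str]:
    """Prefer the ja-hira form; otherwise fall back to the ja form.

    `forms` is an assumed mapping such as {"ja": "食べる", "ja-hira": "たべる"}.
    Returns None if neither tag is present.
    """
    return forms.get("ja-hira") or forms.get("ja")
```

For a lexeme like いる, which only has a ja form, this returns the ja form unchanged.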

@wkyoshida
Member

I'm thinking what likely makes sense is:

  • ja: Always grab it, regardless of which script it is using. It is the "full"/"proper" form.
  • ja-x-Q754018: If this shows up, we can ignore it.
  • ja-hira: If this shows up, still always grab it in addition to the ja. This will be needed to record which pronunciation the kanji in the ja form take on.
  • ja-kana: If this shows up, still always grab it in addition to the ja and ja-hira. If it is present, it is likely indicative of a more specific meaning. For our 'person' example 人, the ja-kana form ヒト is actually more understood to mean 'human' as in the species, i.e. Homo sapiens (you'll see this listed in Wikidata under senses). It's really almost a different word at that point.
    • For ja-kana though, we may not necessarily need to store the character string. There is pretty much a direct conversion between hiragana and katakana, so simply using a boolean could perhaps suffice to record that the katakana version has a particular meaning (beyond simply meaning, for instance, that it is bold or italics)
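
The rules above can be sketched as a small formatting step. Everything named here (`JapaneseEntry`, `build_entry`, `hiragana_to_katakana`, and the tag-to-form mapping used as input) is hypothetical and only meant to illustrate the idea, not to describe how the project actually stores its data. The katakana conversion relies on the Unicode Hiragana and Katakana blocks being laid out in parallel, 0x60 code points apart.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Tags we skip entirely, per the list above (assumed set, easy to extend).
IGNORED_TAGS = {"ja-x-Q754018"}


@dataclass
class JapaneseEntry:
    ja: str                          # the "full"/"proper" form, in any script
    hiragana: Optional[str] = None   # ja-hira form, if the lexeme has one
    has_katakana_form: bool = False  # True if a ja-kana form exists


def hiragana_to_katakana(text: str) -> str:
    """Shift hiragana letters (U+3041..U+3096) to their katakana counterparts."""
    return "".join(
        chr(ord(ch) + 0x60) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in text
    )


def build_entry(forms: Mapping[str, str]) -> JapaneseEntry:
    """Apply the rules: always take ja, keep ja-hira, flag ja-kana, drop ignored tags."""
    usable = {tag: text for tag, text in forms.items() if tag not in IGNORED_TAGS}
    return JapaneseEntry(
        ja=usable["ja"],  # raises KeyError if a lexeme unexpectedly lacks a ja form
        hiragana=usable.get("ja-hira"),
        has_katakana_form="ja-kana" in usable,
    )
```

With this shape, the katakana string for 'person' does not need to be stored: if `has_katakana_form` is set, it can be rebuilt from the hiragana with `hiragana_to_katakana`.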

@andrewtavis andrewtavis moved this from Todo to In Progress in Scribe Board Mar 30, 2024