Integrate he Wikidata into Unicode Inflection #61

Open
grhoten opened this issue Jan 22, 2025 · 3 comments

grhoten (Member) commented Jan 22, 2025

The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.

The initial issues include:

  • The dictionary-parser output needs to be addressed.
  • The unit tests need to be fixed.

Tool output that needs to be addressed:

Line 23244: Q500726 is not a known grammeme for L184903(הלך)
Line 24914: Q462367 is not a known grammeme for L207795(היה)
Line 231920: Q44148 is not a known grammeme for L491804(אחת)
Line 352289: Q70798722 is not a known grammeme for L65603(ילד)
Line 694615: Q44148 is not a known grammeme for L63591(אב)
Line 1209032: Q115767254 is not a known grammeme for L68396(שם)
Line 1285665: Q6548647 is not a known part of speech grammeme for L707946(כינויי הקניין)
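
To triage messages like the ones above, it may help to group them by the unrecognized Wikidata item, so that each Q-id only needs to be looked up once. The following is a minimal sketch, assuming the two message shapes shown above are the only ones of interest. The function name `group_unknown_grammemes` is hypothetical and is not part of the dictionary-parser.

```python
import re
from collections import defaultdict

# Matches the two message shapes seen in the tool output:
#   "... is not a known grammeme for L123(lemma)"
#   "... is not a known part of speech grammeme for L123(lemma)"
LINE_RE = re.compile(
    r"Line (?P<line>\d+): (?P<qid>Q\d+) is not a known "
    r"(?P<kind>part of speech grammeme|grammeme) "
    r"for (?P<lexeme>L\d+)\((?P<lemma>.*)\)$"
)


def group_unknown_grammemes(log_text):
    """Map each unknown Q-id to the (lexeme id, lemma) pairs that reference it.

    Hypothetical helper for triage only; lines that do not match are skipped.
    """
    grouped = defaultdict(list)
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            grouped[match["qid"]].append((match["lexeme"], match["lemma"]))
    return dict(grouped)
```

Applied to the output above, Q44148 would be reported once with both L491804 and L63591 attached, which makes it clearer whether a single missing grammeme mapping accounts for several lines.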

Here are the currently generated lexical dictionary files for debugging the test failures.

he.zip

grhoten added this to the 0.1 milestone Jan 22, 2025

nciric (Contributor) commented Jan 24, 2025

There are 32194 Hebrew nouns in Wikidata. See this query.

grhoten (Member, Author) commented Jan 28, 2025

The current failures seem to be related to incomplete data. For example, סטונהנג׳ (Stonehenge) is missing from the data as a proper noun.

Here are the options used to generate the data:

--language he --inflection-types noun,adjective

grhoten (Member, Author) commented Jan 28, 2025

Please note that סטונהנג׳ is spelled with U+05F3 (HEBREW PUNCTUATION GERESH), not an apostrophe. I've been told that substituting an apostrophe is a common typo in Hebrew.
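
The geresh versus apostrophe distinction can be checked mechanically when comparing test strings against the generated data. The following is a rough sketch, assuming that an apostrophe immediately after a Hebrew letter is always meant to be a geresh. That assumption, and the helper name `normalize_geresh`, are illustrative and not taken from the Unicode Inflection code.

```python
import unicodedata

GERESH = "\u05f3"  # HEBREW PUNCTUATION GERESH
# Characters commonly typed in place of the geresh (assumed list).
LOOKALIKES = {"'", "\u2019"}  # ASCII apostrophe, right single quotation mark


def is_hebrew_letter(ch):
    """True for the letters alef (U+05D0) through tav (U+05EA)."""
    return "\u05d0" <= ch <= "\u05ea"


def normalize_geresh(word):
    """Replace an apostrophe that directly follows a Hebrew letter with U+05F3.

    Hypothetical helper; apostrophes in non-Hebrew text are left untouched.
    """
    chars = list(word)
    for i, ch in enumerate(chars):
        if ch in LOOKALIKES and i > 0 and is_hebrew_letter(chars[i - 1]):
            chars[i] = GERESH
    return "".join(chars)


# The official character name confirms which code point is in use.
print(unicodedata.name(GERESH))  # HEBREW PUNCTUATION GERESH
```

Whether such normalization belongs in the parser, in the tests, or nowhere at all is a separate decision; the sketch only shows how the two spellings differ at the code point level.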
