Integrate es Wikidata into Unicode Inflection #49

Open
grhoten opened this issue Jan 21, 2025 · 1 comment
grhoten commented Jan 21, 2025

The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.

The initial issues include:

  • The proper way to support or map Q112154 (apocope) into usable grammemes needs to be determined. The apocope marking might be trying to differentiate between usage in a sentence and usage in isolation.
  • The dictionary-parser output needs to be addressed.
  • The unit tests need to be fixed.

Tool output that needs to be addressed:

Line 173574: Q112154 is not a known grammeme for L11746(para)
Line 179292: Q112154 is not a known grammeme for L56581(todo)
Line 179455: Q112154 is not a known grammeme for L57816(tanto)
Line 179497: Q112154 is not a known grammeme for L58249(alguno)
Line 239575: Q100919075 is not a known grammeme for L562064(mar)
Line 351385: Q112154 is not a known grammeme for L58235(cualquiera)
Line 593759: Q2878755 is not a known grammeme for L656693(periódico)
Line 596685: Q112154 is not a known grammeme for L680083(tercero)
Line 693956: Q112154 is not a known grammeme for L58267(nada)
Line 764290: Q112154 is not a known grammeme for L646357(santo)
Line 823594: Q112154 is not a known grammeme for L1130674(vigesimoprimero)
Line 865212: Q112154 is not a known grammeme for L58251(ninguno)
Line 991320: Q100919075 is not a known grammeme for L1096119(terminal)
Line 995406: Q112154 is not a known grammeme for L1130559(decimotercero)
Line 1282259: Q112154 is not a known grammeme for L680087(primero)
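
To triage this output, it can help to group the warnings by grammeme. In the listing above, Q112154 accounts for 12 of the 15 warnings, with Q100919075 appearing twice and Q2878755 once. The following is a minimal sketch in Python, not part of the dictionary-parser; the function name `group_unknown_grammemes` and the regular expression are assumptions based only on the line format shown above.

```python
import re
from collections import defaultdict

# Pattern for one warning line, modeled on the tool output above.
# This format is inferred from the listing, not from the parser's source.
WARNING = re.compile(
    r"Line (?P<line>\d+): (?P<grammeme>Q\d+) is not a known grammeme "
    r"for (?P<lexeme>L\d+)\((?P<lemma>[^)]*)\)"
)


def group_unknown_grammemes(output: str) -> dict[str, list[str]]:
    """Map each unknown grammeme Q-ID to the lemmas that reference it."""
    groups = defaultdict(list)
    for match in WARNING.finditer(output):
        groups[match["grammeme"]].append(match["lemma"])
    return dict(groups)


# A few lines copied from the output above, used as sample input.
SAMPLE = """\
Line 173574: Q112154 is not a known grammeme for L11746(para)
Line 239575: Q100919075 is not a known grammeme for L562064(mar)
Line 593759: Q2878755 is not a known grammeme for L656693(periódico)
Line 991320: Q100919075 is not a known grammeme for L1096119(terminal)
"""

if __name__ == "__main__":
    for qid, lemmas in group_unknown_grammemes(SAMPLE).items():
        print(qid, len(lemmas), ", ".join(lemmas))
```

Grouping this way separates the apocope cases (Q112154) from the two rarer grammemes, which may need a different mapping decision.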

Here are the currently generated lexical dictionary files for debugging the test failures.

es.zip

@nciric nciric added this to the 0.1 milestone Jan 21, 2025
@grhoten grhoten self-assigned this Jan 23, 2025

grhoten commented Jan 28, 2025

Some of the issues seem to concern the gender of "gato". The data says that it can be feminine in some senses. This contradicts my basic understanding of the word, and it contradicts the Wiktionary entry for the same word.

This language also hit the 64-bit size limit by default, so some properties had to be ignored. This should be investigated further.

These are the options that were used:

--language es --add-extra-grammemes feminineNounStressData_es.lst --inflection-types noun,adjective,determiner,verb --ignore-unstructured-entries --ignore-property countable --ignore-property vocative
