Integrate en Wikidata into Unicode Inflection #47

Open
grhoten opened this issue Jan 21, 2025 · 2 comments

@grhoten
Member

grhoten commented Jan 21, 2025

The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.

The initial issues include:

  • The theater and theatre lemmas need separate inflection tables. The lemmas ax and axe need to follow the same model.
  • The unknown Q types need a mapping to a known grammeme. Either each Q type needs to be recognized in dictionary-parser, or the entry needs to be corrected to use a legitimate Q reference.
  • Some multi-word entries need to be categorized as phrases. Either the tool needs to recognize such an entry as a phrase and ignore it, or the data needs to be recategorized as a phrase. Examples of bad or unhelpful data include L1326119, L1377720, L1396532, L622264, L192082, and more. I can explain further if needed.
  • The yen lexemes (L1397926 & L15388) need to be differentiated. Either L1397926 needs to be sorted first (e.g. mark as rare), or L15388 needs to be ignored, if it's considered a legitimate word.
  • The this/that determiners need to be supported. The tool couldn't handle non-unique grammatical categories. It needs to be investigated further.
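
As a rough illustration of the "recognize it as a phrase and ignore it" option, here is a minimal Python sketch. It is not the dictionary-parser's code; the `is_phrase` helper and the `entries` dictionary are hypothetical, and the whitespace heuristic is an assumption. The lexeme IDs and lemmas are taken from the tool output in this issue.

```python
def is_phrase(lemma: str) -> bool:
    """Heuristic (assumption): a lemma with internal whitespace is a phrase."""
    return len(lemma.split()) > 1


# Hypothetical in-memory view of a few lexemes: Wikidata lexeme ID -> lemma.
entries = {
    "L1326119": "about to",
    "L1377720": "the other",
    "L1396532": "away from",
    "L3240": "ago",
}

# Lexemes that the parser could skip or report as phrases.
phrases = sorted(lid for lid, lemma in entries.items() if is_phrase(lemma))
print(phrases)
```

A heuristic like this would also flag multi-word proper nouns such as "Gogera Branch", so it probably needs to be combined with the part of speech before anything is ignored.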

Tool output that needs to be addressed:

Line 531: Q55965516 is not a known grammeme for L4315(nor)
Line 41317: Q122477358 is not a known grammeme for L342586(worse)
Line 172402: Q113076880 is not a known part of speech grammeme for L3240(ago)
Line 226524: Q96406487 is not a known grammeme for L450083(Muhammad)
Line 339410: Q2034977 is not a known part of speech grammeme for L1370697(up to)
Line 367850: Q4335462 is not a known part of speech grammeme for L201083(the same)
Line 514996: Q1941737 is not a known grammeme for L65(how)
Line 515372: Q10535365 is not a known part of speech grammeme for L2985(to)
Line 554767: Q65248385 is not a known grammeme for L333587(easy)
Line 687122: Q96406487 is not a known grammeme for L6982(district)
Line 857643: Q55965516 is not a known grammeme for L1386(or)
Line 858442: Q188224 is not a known grammeme for L7998(avocado)
Line 1029930: Q901711 is not a known grammeme for L7137(opposition)
Line 1052414: Q8102 is not a known grammeme for L191783(hella)
Line 1112081: Q96406487 is not a known grammeme for L691512(Gogera Branch)
Line 1189924: Q10535365 is not a known part of speech grammeme for L1326119(about to)
Line 1196171: Q4335462 is not a known part of speech grammeme for L1377720(the other)
Line 1199999: Q1522423 is not a known grammeme for L59(where)
Line 1200336: Q113198319 is not a known part of speech grammeme for L3038(no)
Line 1370172: Q2034977 is not a known part of speech grammeme for L1396532(away from)
Line 1370304: Q3517796 is not a known grammeme for L1397465(lobster)
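
To see which unknown Q types recur, the output above can be grouped by Q identifier. The following Python sketch is only a triage aid, not part of dictionary-parser; the regular expression is an assumption based on the message format shown above, and `group_by_qid` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed message format, inferred from the tool output in this issue.
LINE_RE = re.compile(
    r"Line (?P<line>\d+): (?P<qid>Q\d+) is not a known "
    r"(?P<kind>part of speech grammeme|grammeme) "
    r"for (?P<lid>L\d+)\((?P<lemma>.*)\)$"
)


def group_by_qid(output: str) -> dict:
    """Map each unknown Q identifier to the (lexeme ID, lemma) pairs using it."""
    groups = defaultdict(list)
    for raw in output.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            groups[match["qid"]].append((match["lid"], match["lemma"]))
    return dict(groups)


SAMPLE = """\
Line 531: Q55965516 is not a known grammeme for L4315(nor)
Line 857643: Q55965516 is not a known grammeme for L1386(or)
Line 339410: Q2034977 is not a known part of speech grammeme for L1370697(up to)
Line 1370172: Q2034977 is not a known part of speech grammeme for L1396532(away from)
Line 858442: Q188224 is not a known grammeme for L7998(avocado)
"""

groups = group_by_qid(SAMPLE)
for qid, lexemes in groups.items():
    print(qid, lexemes)
```

Grouping this way makes it easier to decide per Q type whether to add a mapping in the parser or to correct the individual entries.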

Here are the currently generated lexical dictionary files for debugging the test failures.
en.zip

@nciric nciric added this to the 0.1 milestone Jan 21, 2025
@grhoten grhoten self-assigned this Jan 23, 2025
@grhoten
Member Author

grhoten commented Jan 28, 2025

The remaining issue seems to be how the verbs are structured.

For example, the current inflection table looks like the following:

    <pattern name="3" words="26607">
        <pos>verb</pos>
        <suffix/>
        <inflections>
            <inflection number="singular" person="first" tense="present"><t><stem/></t></inflection>
            <inflection number="singular" person="second" tense="present"><t><stem/></t></inflection>
            <inflection number="singular" person="third" tense="present"><t><stem/>s</t></inflection>
            <inflection number="plural" person="first" tense="present"><t><stem/></t></inflection>
            <inflection number="plural" person="second" tense="present"><t><stem/></t></inflection>
            <inflection number="plural" person="third" tense="present"><t><stem/></t></inflection>
            <inflection number="singular" tense="past"><t><stem/>ed</t></inflection>
            <inflection number="plural" tense="past"><t><stem/>ed</t></inflection>
            <inflection tense="past" verb-type="participle"><t><stem/>ed</t></inflection>
            <inflection verb-type="infinitive"><t><stem/></t></inflection>
            <inflection verb-type="gerund"><t><stem/>ing</t></inflection>
        </inflections>
    </pattern>

The one extracted from Wikidata looks like the following:

    <pattern name="3" words="2665">
        <pos>verb</pos>
        <suffix/>
        <inflections>
            <inflection person="third" tense="present" number="singular" aspect="simple"><t><stem/>s</t></inflection>
            <inflection tense="past" aspect="simple"><t><stem/>ed</t></inflection>
            <inflection tense="present" aspect="simple"><t><stem/></t></inflection>
            <inflection tense="past" verb-type="participle"><t><stem/>ed</t></inflection>
            <inflection tense="present" verb-type="participle"><t><stem/>ing</t></inflection>
        </inflections>
    </pattern>

These are some possible solutions.

  1. Expand the inflection table in the dictionary-parser to be like the current table. This simplifies the code, but it requires more filesystem space.
  2. Change the order of the inflection table in the dictionary-parser so that the simple present verb comes before the third-person singular simple present verb. This might work.
  3. Change the code to handle this type of inflection table. This isn't trivial, and it may be hard to generalize.
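
Option 1 could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the dictionary-parser implementation: the `expand` function, the `EXPAND_OVER` table, and the tuple representation of rows are all hypothetical. It drops the `aspect` attribute and fills in the missing `person` and `number` values of the finite forms, with more specific rows overriding less specific ones.

```python
from itertools import product

NUMBERS = ("singular", "plural")
PERSONS = ("first", "second", "third")

# Assumption: the categories each finite tense is expanded over,
# chosen to mirror the shape of the current hand-written table.
EXPAND_OVER = {
    "present": {"number": NUMBERS, "person": PERSONS},
    "past": {"number": NUMBERS},
}

# The Wikidata-derived pattern from this issue as (attributes, suffix) pairs.
COMPACT = [
    ({"person": "third", "tense": "present", "number": "singular", "aspect": "simple"}, "s"),
    ({"tense": "past", "aspect": "simple"}, "ed"),
    ({"tense": "present", "aspect": "simple"}, ""),
    ({"tense": "past", "verb-type": "participle"}, "ed"),
    ({"tense": "present", "verb-type": "participle"}, "ing"),
]


def expand(compact):
    """Expand underspecified finite rows; pass verb-type rows through as-is."""
    finite = [row for row in compact if "verb-type" not in row[0]]
    non_finite = [row for row in compact if "verb-type" in row[0]]
    expanded = {}
    # Least specific rows first, so more specific rows overwrite them.
    for attrs, suffix in sorted(finite, key=lambda row: len(row[0])):
        base = {k: v for k, v in attrs.items() if k != "aspect"}
        axes = EXPAND_OVER.get(base.get("tense"), {})
        missing = [name for name in axes if name not in base]
        for combo in product(*(axes[name] for name in missing)):
            row = {**base, **dict(zip(missing, combo))}
            expanded[tuple(sorted(row.items()))] = suffix
    return [(dict(key), suffix) for key, suffix in expanded.items()] + non_finite


rows = expand(COMPACT)
for attrs, suffix in rows:
    print(attrs, repr(suffix))
```

This sketch leaves the participle rows untouched and does not derive the infinitive or gerund rows of the current table; how those should map is a separate decision.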

@grhoten
Member Author

grhoten commented Jan 28, 2025

These were the dictionary-parser options used.

--language en --add-sound consonant-start,vowel-start --add-extra-grammemes vowelConsonantStartData_en.lst --inflection-types noun,verb,determiner --ignore-entries-with-grammemes abbreviation --ignore-entries-with-grammemes genitive --ignore-entries-with-grammemes Q4335462 --ignore-property particle --ignore-property vocative --ignore-property oblique --ignore-property nominative --ignore-property countable --ignore-unstructured-entries --add-sound consonant-start,vowel-start
