Current State of Dolma
(As of writing)
Currently the mainstay among the multilingual (language identification) classifiers seems to be pycld2.
It is a wrapper around cld2, which itself has not been maintained since around 2015.
Active development of pycld2 seems to have stopped around 2019.
It supports around 160 languages.
There are signs of attempts to also include cld3, although unsuccessfully.
cld3 is an evolution of cld2 and includes its own Python bindings, but it has not been developed since around 2021.
It supports around 100 languages, with some duplicates because a single language can appear under multiple scripts (zh, zh_latn).
One issue is that it requires Chromium to build, since it was meant to run alongside or within a browser.
Fasttext is also included. Fasttext is a versatile text classification and embedding library.
It can do more than classification, but for language identification this codebase uses the officially available lid.176.bin model.
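As a rough illustration of how the lid.176.bin model is typically queried through the official fasttext Python package, the sketch below returns a language code and its confidence for the top prediction. The helper names (`label_to_lang`, `identify_language`, `load_lid_model`), the default model path, and the "und" placeholder for empty input are assumptions made for illustration; they are not taken from this codebase.

```python
from typing import Tuple

LABEL_PREFIX = "__label__"  # fastText prefixes every class label with this string


def label_to_lang(label: str) -> str:
    """Turn a fastText label such as '__label__en' into the bare code 'en'."""
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label


def identify_language(model, text: str) -> Tuple[str, float]:
    """Return (language code, confidence) for the top prediction.

    `model` is anything exposing fastText's predict(text, k=...) method.
    """
    # predict() processes one line at a time, so newlines are flattened first.
    single_line = " ".join(text.split())
    labels, probs = model.predict(single_line, k=1)
    if len(labels) == 0:
        # Nothing to classify; "und" (undetermined) is a placeholder chosen here.
        return "und", 0.0
    return label_to_lang(labels[0]), float(probs[0])


def load_lid_model(path: str = "lid.176.bin"):
    """Load the model from an assumed local path (download it beforehand)."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.load_model(path)
```

With the model file in place, `identify_language(load_lid_model(), "some text")` would give a `(code, confidence)` pair that can then be compared against a cutoff.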
To my knowledge, none of the above classifiers properly distinguish Chinese dialects and scripts (Simplified, Traditional, Yi, Cantonese, etc.).
Some have issues with non-Slavic languages written in Cyrillic (Central Asian languages).
Some have issues with Eastern European languages in either their Latin or Cyrillic representations.
Performance on closely related Slavic languages is also variable (e.g. Russian vs. Ukrainian classification).
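Observations like these are easier to act on once they are measured. The sketch below counts how often a classifier labels one language as another on a labelled sample. The `confusion_counts` and `confusion_rate` helpers and the `classify` callable are hypothetical, and the labelled data has to be supplied by whoever runs it.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def confusion_counts(
    samples: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
) -> Counter:
    """Count (gold language, predicted language) pairs over labelled samples."""
    counts: Counter = Counter()
    for text, gold in samples:
        counts[(gold, classify(text))] += 1
    return counts


def confusion_rate(counts: Counter, gold: str, predicted: str) -> float:
    """Fraction of `gold` samples that were labelled as `predicted`."""
    total = sum(n for (g, _), n in counts.items() if g == gold)
    return counts[(gold, predicted)] / total if total else 0.0
```

For example, `confusion_rate(counts, "uk", "ru")` would give the share of Ukrainian samples that a given classifier labelled as Russian.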
Potential Improvements
Move to Fasttext-first classification
We can employ Fasttext, arguably the best maintained and most versatile option at the moment, as the first-class approach.
Fasttext needs to be fed data differently from cld2 and cld3, and it works well with different cutoff values. This seems to be reflected in this codebase.
More development and testing would be required.
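One way to picture a Fasttext-first setup is a small cascade in which each backend carries its own cutoff, and a later backend is consulted only when an earlier one is not confident enough. This is only a sketch under stated assumptions: the `Backend` type, the `classify_with_fallback` helper, and any cutoff values passed to it are hypothetical and would need tuning against real data. It does not describe how this codebase is implemented.

```python
from typing import Callable, List, Optional, Tuple

# A backend maps text to (language code, confidence in [0, 1]).
Backend = Callable[[str], Tuple[str, float]]


def classify_with_fallback(
    text: str,
    backends: List[Tuple[str, Backend, float]],
) -> Optional[Tuple[str, str, float]]:
    """Try each (name, backend, cutoff) in order; return the first confident answer.

    Returns (backend name, language code, confidence), or None when no
    backend clears its own cutoff.
    """
    for name, backend, cutoff in backends:
        lang, confidence = backend(text)
        if confidence >= cutoff:
            return name, lang, confidence
    return None
```

Listing Fasttext first and cld2 second, each with its own cutoff, keeps the per-backend thresholds explicit instead of sharing one value across classifiers that score text differently.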
Improve Fasttext implementation
There are several faster Fasttext implementations, including kenpu's fastertext.
If Fasttext performance turns out to be a bottleneck, such implementations can be built and used.
Improve Fasttext model
There are multiple models that support 200+ languages and offer better dialect support (especially for Southeast Asian or Eastern European languages). NLLB has several, and the openlid-dataset-paper also has a reproducible model and dataset available.
The current Fasttext model is hard to beat in performance, but if overhead is a priority, a pruned version of the current model is available. Defaulting to it would also reduce reliance on outside bandwidth, and bundling the model itself might be possible as well (if the license permits).
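To make the bundling idea concrete, here is a minimal sketch that prefers a model file shipped alongside the code and only falls back to a download location when none is found. The directory layout and the `resolve_model` helper are hypothetical. The file names follow the official Fasttext language identification release (`lid.176.ftz` being the compressed variant), and the base URL is assumed from that release, so it should be verified before relying on it.

```python
from pathlib import Path
from typing import Sequence, Tuple

# Preference order: compressed model first (smaller), then the full model.
PREFERRED_MODELS = ("lid.176.ftz", "lid.176.bin")
# Assumed download location; check it against the Fasttext documentation.
BASE_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/"


def resolve_model(
    bundle_dir: Path,
    names: Sequence[str] = PREFERRED_MODELS,
) -> Tuple[str, bool]:
    """Return (location, is_local).

    A bundled file wins if present; otherwise the URL of the first preferred
    model is returned so the caller can decide whether to download it.
    """
    for name in names:
        candidate = bundle_dir / name
        if candidate.is_file():
            return str(candidate), True
    return BASE_URL + names[0], False
```

Returning the flag separately leaves the download decision to the caller, which matters in environments where outside bandwidth is restricted.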