Multilingual Classifiers #25

chris-ha458 · 2023-08-20T02:54:43Z

Current State of Dolma

(As of writing)
Currently the mainstay of multilingual classifiers seems to be pycld2.

pycld2 is a wrapper around cld2, which itself has not been maintained since around 2015.
Active development of pycld2 seems to have stopped around 2019.
It supports around 160 languages.
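
For reference, a minimal sketch of how a pycld2 result can be reduced to a single language code. `pycld2.detect` returns `(isReliable, textBytesFound, details)`; the helper names `top_cld2_language` and `detect_with_cld2` are hypothetical and are not taken from Dolma.

```python
from typing import Sequence, Tuple

# One entry of the `details` tuple returned by pycld2.detect():
# (language name, language code, percent of text, score)
Cld2Detail = Tuple[str, str, int, float]


def top_cld2_language(
    is_reliable: bool, details: Sequence[Cld2Detail]
) -> Tuple[str, float]:
    """Return (language code, share of text in [0, 1]) for the top guess.

    Falls back to ("un", 0.0), cld2's code for "unknown", when cld2 flags
    the result as unreliable or returns no candidates.
    """
    if not is_reliable or not details:
        return "un", 0.0
    _name, code, percent, _score = details[0]
    return code, percent / 100.0


def detect_with_cld2(text: str) -> Tuple[str, float]:
    """Run pycld2 on `text`; imported lazily so the parser has no dependency."""
    import pycld2  # third-party: pip install pycld2

    is_reliable, _bytes_found, details = pycld2.detect(text)
    return top_cld2_language(is_reliable, details)
```

Keeping the parsing separate from the pycld2 call makes it easy to unit test without the native library installed.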

There are signs of attempts to also include cld3, although unsuccessful ones.
cld3 is an evolution of cld2 and includes its own Python bindings, but it has not been developed since around 2021.
It supports around 100 languages, with some duplicates because a single language can be supported in multiple scripts (zh, zh_latn).
One issue is that it requires Chromium to build, since it was meant to run alongside or within a browser.

Fasttext is also included. Fasttext is a versatile text classifier and embedding library.
It can do more than classification, but for the purposes of multilingual classification this codebase uses the officially available lid.176.bin model.
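
A minimal sketch of querying the language-ID model through the fasttext Python bindings. `fasttext.load_model` and `model.predict` are the library's own calls; the helper names, the local model path, and the whitespace handling are assumptions made for illustration. Labels come back prefixed with `__label__`, and `predict` rejects input that contains newlines, so the text is flattened to one line first.

```python
from typing import Any, List, Tuple

LABEL_PREFIX = "__label__"


def load_lid_model(path: str = "lid.176.bin") -> Any:
    """Load a fastText language-ID model from a local file (path is an assumption)."""
    import fasttext  # third-party: pip install fasttext

    return fasttext.load_model(path)


def predict_languages(model: Any, text: str, k: int = 1) -> List[Tuple[str, float]]:
    """Return up to k (language code, probability) pairs for `text`."""
    # predict() handles one line at a time, so collapse all whitespace runs
    # (including newlines) into single spaces.
    single_line = " ".join(text.split())
    labels, probs = model.predict(single_line, k=k)
    return [
        (
            label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label,
            float(prob),
        )
        for label, prob in zip(labels, probs)
    ]
```

With the model file downloaded, usage would look like `predict_languages(load_lid_model(), "some text", k=3)`.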

To my knowledge, none of the above classifiers properly distinguishes Chinese varieties (Simplified, Traditional, Yi, Cantonese, etc.).
Some have issues with non-Slavic languages written in Cyrillic (Central Asian languages).
Some have issues with Eastern European languages in either their Latin or Cyrillic representations.
Performance on closely related Slavic languages is also variable (e.g. Russian vs Ukrainian classification).

Potential Improvements

Move to Fasttext-first classification
- We can employ fasttext, arguably the best maintained and most versatile option at the moment, as the first-class approach.
- Fasttext needs to be fed data differently from cld2 and cld3, and needs different cutoff values to work well. This seems to be reflected in this codebase already.
- More development and testing would be required.
Improve Fasttext implementation
- There are several faster Fasttext implementations, including kenpu's fastertext.
- If fasttext performance turns out to be a bottleneck, such implementations can be built and used.
Improve Fasttext model
- There are multiple models that support 200+ languages and have better dialect support (especially for Southeast Asian or Eastern European languages). NLLB has several, and the openlid-dataset-paper also has a reproducible model and dataset available.
- The current fasttext model is hard to beat in performance, but if overhead is a priority, a pruned version of the current model is available. Defaulting to that can also reduce reliance on outside bandwidth, and bundling the model itself may be possible as well (as the license permits).
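
The fasttext-first idea could be sketched as a small cascade: ask the primary classifier first, and only consult a fallback (for example cld2) when the primary score is under a cutoff. Everything here is hypothetical: the function name, the `(code, score)` classifier signature, and the default cutoffs of 0.5 are illustrative assumptions, not values from this codebase. Real cutoffs would have to come out of testing.

```python
from typing import Callable, Optional, Tuple

# A classifier maps text to (language code, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def identify_language(
    text: str,
    primary: Classifier,
    fallback: Optional[Classifier] = None,
    primary_cutoff: float = 0.5,   # assumed value, tune per classifier
    fallback_cutoff: float = 0.5,  # assumed value, tune per classifier
    unknown: str = "un",
) -> Tuple[str, float, str]:
    """Return (language code, score, source), trying the primary classifier first."""
    lang, score = primary(text)
    if score >= primary_cutoff:
        return lang, score, "primary"
    if fallback is not None:
        fb_lang, fb_score = fallback(text)
        if fb_score >= fallback_cutoff:
            return fb_lang, fb_score, "fallback"
    # Neither classifier was confident enough.
    return unknown, 0.0, "none"
```

The two cutoffs are kept separate because fasttext probabilities and cld2 percentages are not on the same scale, so a single shared threshold would likely be wrong for one of them.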
