Multi Language Tokenization Support #298

andrewdalpino · 2023-05-27T19:48:25Z

I'm hoping that we can get to the point where we fully support the following languages.

English
Spanish
German
French
Russian
Japanese
Hindi
Farsi
Chinese
Arabic

I started adding unit tests for these languages for a few tokenizers here https://github.com/RubixML/ML/tree/master/tests/Tokenizers - however, it doesn't look like we support all of the languages. I only speak English, so it's hard for me to tell. Could we get some help from the community to verify that our Tokenizers support all of these languages and, if not, contribute a fix?

https://github.com/RubixML/ML/tree/master/src/Tokenizers

Thank you!

taotecode · 2023-06-16T13:05:58Z

How can I join the development of multi-language support? I am fluent in Chinese and English.

andrewdalpino · 2023-06-20T17:52:40Z

Hi @taotecode, thanks for your interest in contributing to the project! Here are the unit tests for the Tokenizers implemented in the library.

https://github.com/RubixML/ML/tree/master/tests/Tokenizers

We need help from native language speakers to ensure that we have test coverage for different languages and that the current tests are correct.

tanmayk · 2024-05-22T15:23:39Z

@andrewdalpino I can help with Hindi, though I am not sure how it is going to work.

Here is the problem:

$text = "यदि कोई चीज़ काफ़ी महत्वपूर्ण है, तो आपको उसे आज़माना चाहिए। भले ही - संभावित परिणाम विफलता हो।";
$tokens = (new \Rubix\ML\Tokenizers\Word())->tokenize($text);

Expected array:

[
  'यदि', 'कोई', 'चीज़', 'काफ़ी', 'महत्वपूर्ण', 'है', 'तो', 'आपको', 'उसे',
  'आज़माना', 'चाहिए', 'भले', 'ही', '-', 'संभावित', 'परिणाम', 'विफलता', 'हो',
]

Actual array:

[
  'यद', 'क', 'ई', 'च', 'ज', 'क', 'फ', 'महत', 'वप', 'र', 'ण', 'ह', 'त', 'आपक',
  'उस', 'आज', 'म', 'न', 'च', 'ह', 'ए', 'भल', 'ह', '-', 'स', 'भ', 'व', 'त',
  'पर', 'ण', 'म', 'व', 'फलत', 'ह',
]

So far I have only tested \Rubix\ML\Tokenizers\Word.
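
One way to narrow this down is to look at which code points the tokens break on. In the actual array above, every split falls on a vowel sign, nukta, virama, or anusvara, all of which are Unicode combining marks rather than letters. The sketch below is purely diagnostic and not part of Rubix ML; the function name classifyCodePoints is made up for illustration.

```php
<?php

// Hypothetical diagnostic helper (not part of Rubix ML): label each code
// point of a word as a letter (\p{L}), a combining mark (\p{M}), or other.
function classifyCodePoints(string $word): array
{
    $labels = [];

    // Split the string into individual Unicode code points.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($chars as $char) {
        if (preg_match('/^\p{M}$/u', $char)) {
            $labels[] = 'mark';
        } elseif (preg_match('/^\p{L}$/u', $char)) {
            $labels[] = 'letter';
        } else {
            $labels[] = 'other';
        }
    }

    return $labels;
}

// The first word of the sample sentence: two consonants and a vowel sign.
echo implode(', ', classifyCodePoints('यदि')), PHP_EOL;
```

If the last code point is reported as a mark, a pattern that only accepts letters and digits would end the token just before it, which matches the 'यद' seen in the actual output.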

mxmp210 · 2024-06-18T17:31:21Z

This happens because Hindi, like many other languages, uses a complex script that relies on Complex Text Layout (CTL), so the tokenizer needs to account for partial words that only become full words once their parts are combined. I'm fairly sure there are several Python libraries for tokenizing these languages. PHP needs an equivalent implementation, something like hindi-tokenizer but covering other languages as well, to support further development.

AFAIK the tokenizers come from NLTK and its derivative works, so there needs to be an equivalent implementation in PHP, or an FFI wrapper, in order to make this work.
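
Short of a full port, a smaller stopgap may be possible for Hindi. This is a minimal sketch, assuming the Word tokenizer matches words with a `\w`-style pattern that does not accept combining marks on the PCRE build in use (that assumption is not verified here). The function tokenizeWords is hypothetical and not part of Rubix ML; it adds `\p{M}` to the character class so that marks stay attached to the preceding letter.

```php
<?php

// Hypothetical helper, not part of Rubix ML: match runs of letters,
// combining marks, numbers, underscores, apostrophes, and hyphens.
function tokenizeWords(string $text): array
{
    $matches = [];

    // \p{M} keeps vowel signs, the nukta, and the virama inside the token
    // instead of ending the token at the first combining mark.
    preg_match_all('/[\p{L}\p{M}\p{N}_\'-]+/u', $text, $matches);

    return $matches[0];
}

// Part of the sample sentence from the report above.
print_r(tokenizeWords('भले ही - संभावित परिणाम विफलता हो।'));
```

This only addresses scripts where the problem is combining marks. Farsi text often uses the zero-width non-joiner (U+200C) inside words, which this pattern would still split on, and Chinese and Japanese do not separate words with spaces, so a regex like this would return whole runs of characters rather than words. Those languages would still need the kind of dedicated tokenizer described above.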

andrewdalpino added the "enhancement" (New feature or request) and "help wanted" (Extra attention is needed) labels on May 27, 2023
