Non-English tokenizers #464
Labels: enhancement, help wanted, question
Describe the solution you'd like
For CJK languages such as Chinese, words are not separated by spaces, so a tokenizer is usually needed to split sentences into word stems, for example this one: https://github.com/yanyiwu/cppjieba (see the sketch below for one possible pre-tokenization workaround).
Is this currently doable in PISA? If not, are there any plans to add this feature in the future?
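One possible workaround, sketched below under the assumption that the collection can be pre-processed before indexing: segment the Chinese text with cppjieba and emit space-separated tokens, so that a downstream whitespace-based tokenizer sees ordinary "words". This follows cppjieba's documented `Jieba`/`Cut` API; the dictionary paths are placeholders for wherever cppjieba's data files live on your system.

```cpp
// Sketch: pre-segment Chinese text with cppjieba, one input line at a time,
// printing the tokens separated by single spaces.
#include <iostream>
#include <string>
#include <vector>

#include "cppjieba/Jieba.hpp"

int main() {
    // Dictionary files ship with cppjieba; adjust these paths to your installation.
    cppjieba::Jieba jieba("dict/jieba.dict.utf8",
                          "dict/hmm_model.utf8",
                          "dict/user.dict.utf8",
                          "dict/idf.utf8",
                          "dict/stop_words.utf8");

    std::string line;
    std::vector<std::string> words;
    while (std::getline(std::cin, line)) {
        words.clear();
        jieba.Cut(line, words, /*hmm=*/true);  // segment the line into tokens
        for (size_t i = 0; i < words.size(); ++i) {
            if (i > 0) std::cout << ' ';
            std::cout << words[i];
        }
        std::cout << '\n';
    }
    return 0;
}
```

This only covers pre-tokenization of the collection itself; queries would need the same segmentation step applied before they are issued so that query terms match the indexed tokens.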
Additional context