Newmm-safe is inconsistent #755

chameleonTK · 2022-11-02T05:27:29Z

I tried the newmm-safe engine, but it gave inconsistent results: the same phrase is sometimes tokenized correctly and sometimes not.

Description

Example:
"ในฐานข้อมูลกฎหมายของเว็บไซต์ ทส. ข้อมูลและทรัพยากร ข้อมูลกฎหมายว่าด้วยป่าชุมชน CSV downloads กฎหมายแม่บท และกฎหมายลำดับรอง ของพระราชบัญญัติป่าชุมชน พ.ศ. 2562..."

It can correctly tokenize "ข้อมูลกฎหมายว่าด้วยป่าชุมชน" into ['ข้อมูล', 'กฎหมาย', 'ว่าด้วย', 'ป่าชุมชน']

If I change the input by appending " สำรวจ" to the end:
"ในฐานข้อมูลกฎหมายของเว็บไซต์ ทส. ข้อมูลและทรัพยากร ข้อมูลกฎหมายว่าด้วยป่าชุมชน CSV downloads กฎหมายแม่บท และกฎหมายลำดับรอง ของพระราชบัญญัติป่าชุมชน พ.ศ. 2562... สำรวจ"

It tokenizes "ข้อมูลกฎหมายว่าด้วยป่าชุมชน" into ['ข้อมูล', 'กฎ', 'หม', 'าย', 'ว่าด้วย', 'ป่าชุมชน']

Expected results

It should produce the same results for both inputs.

Steps to reproduce

from pythainlp.tokenize import word_tokenize
docs = '''ในฐานข้อมูลกฎหมายของเว็บไซต์ ทส. ข้อมูลและทรัพยากร ข้อมูลกฎหมายว่าด้วยป่าชุมชน CSV downloads กฎหมายแม่บท และกฎหมายลำดับรอง ของพระราชบัญญัติป่าชุมชน พ.ศ. 2562... สำรวจ
'''
# Tokenize with the newmm-safe engine and drop whitespace tokens
words = word_tokenize(docs, engine="newmm-safe", keep_whitespace=False)

print(words)
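
To compare the two runs without reading the printed lists by eye, one option is a small helper that checks whether the expected tokens appear as a contiguous run in the output. This is only a minimal sketch: `contains_run` is a hypothetical helper name, not part of PyThaiNLP, and the two token lists are copied from the description above rather than produced by running the tokenizer.

```python
from typing import Sequence


def contains_run(tokens: Sequence[str], expected: Sequence[str]) -> bool:
    """Return True if `expected` occurs as a contiguous run inside `tokens`."""
    n = len(expected)
    if n == 0:
        return True
    # Slide a window of length n over `tokens` and compare each slice.
    return any(
        list(tokens[i:i + n]) == list(expected)
        for i in range(len(tokens) - n + 1)
    )


# Expected segmentation of the phrase, as reported for the first input.
expected = ["ข้อมูล", "กฎหมาย", "ว่าด้วย", "ป่าชุมชน"]

# Segmentation reported for the second input (extra word appended).
observed_bad = ["ข้อมูล", "กฎ", "หม", "าย", "ว่าด้วย", "ป่าชุมชน"]

print(contains_run(expected, expected))      # consistent case
print(contains_run(observed_bad, expected))  # inconsistent case
```

In the reproduction script, `contains_run(words, expected)` could be evaluated for both inputs; if the engine were consistent, both calls would be expected to agree.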

Your environment

PyThaiNLP version: 3.1.0
Python version: 3.9.7
Operating system and version: macOS

github-actions · 2022-11-02T05:28:06Z

Hello @chameleonTK, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

tongplw · 2023-08-16T10:08:09Z

https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/newmm.py#L193

Adding _TEXT_SCAN_BEGIN to cut_pos could help.
cut_pos = space_idx + 1 + _TEXT_SCAN_BEGIN
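
The reasoning behind this suggestion can be shown in isolation: an index returned by `rfind` on a slice is relative to the start of that slice, so it has to be shifted by the slice's start offset before it is used to cut the full text. The sketch below is a standalone illustration, not the code in newmm.py; the function names, the small window values, and the simplified fallback are assumptions chosen for readability, and whether this is the actual cause of the reported behaviour is not confirmed here.

```python
def cut_pos_relative(text: str, scan_begin: int, scan_end: int) -> int:
    """Cut position WITHOUT the window offset (the suspected problem)."""
    sample = text[scan_begin:scan_end]
    space_idx = sample.rfind(" ")  # index relative to `sample`, not `text`
    if space_idx >= 0:
        return space_idx + 1
    return scan_end  # simplified fallback for this sketch


def cut_pos_absolute(text: str, scan_begin: int, scan_end: int) -> int:
    """Cut position WITH the window offset added back (the proposed change)."""
    sample = text[scan_begin:scan_end]
    space_idx = sample.rfind(" ")
    if space_idx >= 0:
        return space_idx + 1 + scan_begin
    return scan_end  # simplified fallback for this sketch


# Hypothetical input and window, small enough to follow by hand.
text = "aaaa bbbbbbbb cccc dddd"
begin, end = 10, 20

bad = cut_pos_relative(text, begin, end)
good = cut_pos_absolute(text, begin, end)

print(repr(text[:bad]), repr(text[bad:]))    # splits inside a word
print(repr(text[:good]), repr(text[good:]))  # splits right after a space
```

With the offset missing, the cut lands in the middle of a word near the start of the text, which would be consistent with a valid word being broken into fragments once the input grows past the scan window.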

bact · 2023-08-18T16:16:51Z

Thanks @chameleonTK for reporting and @tongplw for pointing out a possible solution.
Let me take a closer look at this.

wannaphong added the "bug" (bugs in the library) label Nov 2, 2022

bact added this to the Future milestone Feb 22, 2023

bact added the "Hacktoberfest" (for Hacktoberfest event) label Oct 4, 2023

bact added this to To do in PyThaiNLP Nov 13, 2023
