Boundary-value is not tokenized properly #36

Open
saito400 opened this issue Jul 23, 2017 · 3 comments
Comments

@saito400
Contributor

When tests/text_large.txt is given as input, I think 「その他 名詞,代名詞,一般,*,*,*,その他,ソノタ,ソノタ」 should be returned, but the following tokens are returned instead:

そ 名詞,特殊,助動詞語幹,*,*,*,そ,ソ,ソ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
他 名詞,非自立,副詞可能,*,*,*,他,ホカ,ホカ

@mocobeta
Owner

😅

@mocobeta
Owner

This happens only when the input text is larger than the internal buffer size and contains no punctuation, as in this example:
https://github.com/mocobeta/janome/blob/master/tests/text_large.txt
For now, I can't come up with a good idea for fixing this. Does anyone have a smart solution or suggestion?
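
To make the failure mode concrete, here is a toy sketch of how a fixed-size buffer can cut a word in half when there is no punctuation to split on. It is not janome's actual code: the mini-dictionary, the greedy longest-match rule, and the names `split_into_chunks`, `longest_match`, and `tokenize` are all assumptions made up for illustration.

```python
# Toy illustration only; this is NOT janome's implementation.
PUNCTUATION = "、。"
# Hypothetical mini-lexicon standing in for a real dictionary.
DICTIONARY = {"その他", "そ", "の", "他", "東京", "都"}


def split_into_chunks(text, buffer_size):
    """Cut text into chunks of at most buffer_size characters.

    If more text remains after the window, prefer to cut just after the
    last punctuation mark inside it; otherwise cut blindly at buffer_size.
    """
    chunks = []
    pos = 0
    while pos < len(text):
        window = text[pos:pos + buffer_size]
        if pos + buffer_size < len(text):
            cut = max(window.rfind(p) for p in PUNCTUATION)
            if cut >= 0:
                window = window[:cut + 1]
        chunks.append(window)
        pos += len(window)
    return chunks


def longest_match(chunk):
    """Greedy longest-match lookup; unknown characters become 1-char tokens."""
    tokens = []
    i = 0
    while i < len(chunk):
        for length in range(len(chunk) - i, 0, -1):
            if chunk[i:i + length] in DICTIONARY:
                break
        tokens.append(chunk[i:i + length])
        i += length
    return tokens


def tokenize(text, buffer_size):
    """Tokenize each chunk independently, as a buffered tokenizer would."""
    return [tok
            for chunk in split_into_chunks(text, buffer_size)
            for tok in longest_match(chunk)]


# Text fits in the buffer: "その他" stays whole.
print(tokenize("東京都その他", buffer_size=100))
# No punctuation and a small buffer: the cut lands inside "その他".
print(tokenize("東京都その他", buffer_size=4))
# With punctuation, the chunk boundary falls on "、" and the word survives.
print(tokenize("東京都、その他", buffer_size=5))
```

In this sketch the second call splits 「その他」 into そ / の / 他 because the chunk boundary falls in the middle of the word, which matches the symptom reported above. A fix would need some way to carry the unfinished tail of one chunk into the next, but how to do that well in janome is exactly the open question.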

@saito400
Contributor Author

I wanted to fix this, but for now I have no idea how...
