Why SentencePieceTokenizer can't save vocab file #282

Codle · 2019-12-29T04:36:24Z

I want to use a vocab file in PairedDataloader, but the save_vocab function of SentencePieceTokenizer only saves the model file.

The model file can't be loaded by the Dataloader because of a decoding error.

In sentencepiece_tokenizer.py, I saw that you delete the vocab file.

gpengzhi · 2019-12-30T16:45:39Z

We deleted the sentencepiece vocab file because the sentencepiece model file is fully self-contained, and the vocab file is never used by the tokenizer. To the best of my knowledge, the vocab file itself is not very useful. Here is a sample vocab file:

<unk>	0
<s>	0
</s>	0
,	-3.39764
.	-3.53133
▁the	-3.56031
s	-3.70819
▁	-3.82609
▁I	-3.90308
▁to	-4.04041
▁a	-4.08637
ed	-4.16661
▁and	-4.26836
▁of	-4.27461
t	-4.31782
e	-4.43336
d	-4.44333
ing	-4.46929
a	-4.53839
▁in	-4.64852
o	-4.71318
▁was	-4.77909
▁"	-4.81017
i	-4.86229
...
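
If a plain-text vocabulary is still needed, it can in principle be regenerated from the model file. The following is a minimal sketch, not part of Texar: `export_vocab` is a hypothetical helper, and it assumes the object passed in behaves like a `sentencepiece.SentencePieceProcessor`, using only its `GetPieceSize`, `IdToPiece`, and `GetScore` methods. The output mirrors the tab-separated "piece, score" layout shown above.

```python
def export_vocab(processor, vocab_path, with_scores=True):
    """Write the pieces of a loaded SentencePiece model to a text file.

    `processor` is assumed to expose GetPieceSize(), IdToPiece(i) and
    GetScore(i), as sentencepiece.SentencePieceProcessor does.
    Pieces are written in id order, one per line.
    Returns the number of pieces written.
    """
    size = processor.GetPieceSize()
    with open(vocab_path, "w", encoding="utf-8") as f:
        for idx in range(size):
            piece = processor.IdToPiece(idx)
            if with_scores:
                # Same layout as the sample above: piece<TAB>score.
                f.write(f"{piece}\t{processor.GetScore(idx):g}\n")
            else:
                # One token per line, for loaders that expect a plain vocab.
                f.write(piece + "\n")
    return size


# Possible usage (requires the sentencepiece package; paths are placeholders):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor()
#   sp.Load("spm.model")
#   export_vocab(sp, "spm.vocab")
```

Whether a data loader accepts the resulting file depends on the vocab format it expects; passing `with_scores=False` drops the score column.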

Codle · 2019-12-31T08:48:05Z

@gpengzhi
But how can I use the model file in PairedTextData?
The model file seems to be usable only for restoring a tokenizer, so I created my own "PairedTextData" with two DataSource objects to use SentencePieceTokenizer in my project.
Is there a simpler way to do this?
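
For readers following along, the workaround described here might look roughly like the sketch below. It is library-agnostic and does not use the Texar API: `paired_examples` is a hypothetical helper, and `encode` stands in for whatever callable turns a sentence into ids (for example, a method of a restored SentencePieceTokenizer).

```python
from typing import Callable, Iterator, List, Tuple


def paired_examples(
    src_path: str,
    tgt_path: str,
    encode: Callable[[str], List[int]],
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (source_ids, target_ids) pairs from two line-aligned text files.

    `encode` is an assumed tokenizer callable; nothing here depends on Texar.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        # zip stops at the shorter file, so extra trailing lines are ignored.
        for src_line, tgt_line in zip(src, tgt):
            src_line, tgt_line = src_line.strip(), tgt_line.strip()
            if not src_line or not tgt_line:
                continue  # skip pairs where either side is empty
            yield encode(src_line), encode(tgt_line)
```

Batching and padding would still have to be handled by the surrounding data pipeline.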

gpengzhi · 2019-12-31T19:43:48Z

Could you describe how you integrated the tokenizer with PairedTextData? There is a related issue, #256. I think we should provide an interface for using a tokenizer instead of a vocab. Would you be able to contribute this feature enhancement? A pull request is welcome!

gpengzhi added the labels "question" (Further information is requested) and "topic: data" (Issue about data loader modules) on Dec 30, 2019.
