Why SentencePieceTokenizer can't save vocab file #282

Open
Codle opened this issue Dec 29, 2019 · 3 comments
Labels
question Further information is requested topic: data Issue about data loader modules

Comments

@Codle
Contributor

Codle commented Dec 29, 2019

I want to use a vocab file in PairedDataloader, but the save_vocab function of SentencePieceTokenizer only saves the model file.

The model file can't be loaded by the Dataloader because of a decoding error.

In sentencepiece_tokenizer.py, I saw that you delete the vocab file.

@gpengzhi gpengzhi added question Further information is requested topic: data Issue about data loader modules labels Dec 30, 2019
@gpengzhi
Collaborator

gpengzhi commented Dec 30, 2019

We delete the sentencepiece vocab file because the sentencepiece model file is fully self-contained, and the tokenizer never uses the vocab file. To the best of my knowledge, the vocab file itself is not very useful. Here is an example of a vocab file:

<unk>	0
<s>	0
</s>	0
,	-3.39764
.	-3.53133
▁the	-3.56031
s	-3.70819
▁	-3.82609
▁I	-3.90308
▁to	-4.04041
▁a	-4.08637
ed	-4.16661
▁and	-4.26836
▁of	-4.27461
t	-4.31782
e	-4.43336
d	-4.44333
ing	-4.46929
a	-4.53839
▁in	-4.64852
o	-4.71318
▁was	-4.77909
▁"	-4.81017
i	-4.86229
...
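
If a plain-text vocab file is still needed, the listing above (one `piece<TAB>score` pair per line) can be regenerated from a loaded model rather than kept on disk. The following is only a minimal sketch, not Texar's own code: `dump_vocab` and `load_vocab` are hypothetical helper names, and `processor` is assumed to be any object exposing `GetPieceSize()`, `IdToPiece()` and `GetScore()`, as `sentencepiece.SentencePieceProcessor` does.

```python
from typing import List, Tuple


def dump_vocab(processor, path: str) -> None:
    """Write one `piece<TAB>score` line per id, mirroring the listing above.

    `processor` is assumed to behave like sentencepiece.SentencePieceProcessor
    (GetPieceSize / IdToPiece / GetScore); this helper is hypothetical.
    """
    with open(path, "w", encoding="utf-8") as f:
        for idx in range(processor.GetPieceSize()):
            piece = processor.IdToPiece(idx)
            score = processor.GetScore(idx)
            # ":g" prints 0.0 as "0" and keeps six significant digits.
            f.write(f"{piece}\t{score:g}\n")


def load_vocab(path: str) -> List[Tuple[str, float]]:
    """Read the tab-separated file back as (piece, score) pairs."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last tab only, so the piece text is kept verbatim.
            piece, score = line.rsplit("\t", 1)
            entries.append((piece, float(score)))
    return entries
```

With the real library this would be called roughly as `sp = sentencepiece.SentencePieceProcessor(); sp.Load("spm.model"); dump_vocab(sp, "spm.vocab")`, where the file names are placeholders. Whether a data loader accepts the resulting file depends on the vocab format it expects, which this thread does not specify.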

@Codle
Contributor Author

Codle commented Dec 31, 2019

@gpengzhi
But how can I use the model file in PairedTextData?
The model file seems to be usable only for restoring a tokenizer, so I created my own "PairedTextData" with two DataSource objects in order to use SentencePieceTokenizer in my project.
Is there a simpler way to do this?

@gpengzhi
Collaborator

Could you describe how you integrated the tokenizer with PairedTextData? There is a related issue, #256. I think we should provide an interface that uses a tokenizer instead of a vocab. Would you be able to contribute this feature enhancement? A pull request is welcome!
