[WIP] Safe bpe dropout for LM + joiner disjoin dropout #2009

Open
funboarder13920 wants to merge 5 commits into master

Conversation

@funboarder13920 (Collaborator) commented Feb 17, 2021

  • On LM tasks: copy src to target in the tokenizer transforms, otherwise the result is hazardous when tokenizers have random behaviors like dropout
  • remove dropout in tokenizers during validation
  • implement a "disjoin joiner with dropout" transform to make inference possible at any point in the sentence (sketched below)
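
For illustration, here is a minimal, hypothetical sketch of what such a "disjoin joiner with dropout" step could do to a tokenized sequence; the function name and exact structure are assumptions for this sketch, not the PR's dropout_separate_joiner implementation:

import random

JOINER = "￭"  # OpenNMT joiner marker (SubwordMarker.JOINER)

def disjoin_joiner_dropout(tokens, dropout=0.1):
    # With probability `dropout`, detach a leading joiner from its token so it
    # becomes a standalone joiner, e.g. ["plan", "￭et"] -> ["plan", "￭", "et"].
    out = []
    for tok in tokens:
        if tok != JOINER and tok.startswith(JOINER) and random.random() < dropout:
            out.append(JOINER)
            out.append(tok[len(JOINER):])
        else:
            out.append(tok)
    return out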

else:
    src_out = self.dropout_separate_joiner(example["src"], "src")
    example["src"] = src_out
if self.opts.model_task == ModelTask.LANGUAGE_MODEL:

@funboarder13920 (Collaborator, Author)

Currently this won't work in build_vocab: model_task is a model parameter, not a dynamic corpus parameter.
Same issue in tokenizers.py.

Member

is_train should be false anyway in build_vocab, no?
EDIT: Never mind, I mixed it up with build_vocab_only.

if elem == SubwordMarker.JOINER:
    continue
if elem.startswith(SubwordMarker.JOINER):
    if random.random() < dropout:

@funboarder13920 (Collaborator, Author)

Not sure about the necessity of handling both the right and left token sides.
It might make it difficult to retrieve the initial token when detokenizing.
I could use a special token to distinguish left from right, or remove the left-side disjoin, which doesn't occur much (mainly with punctuation).
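
To make the ambiguity concrete, here is a small hypothetical check, assuming pyonmttok is installed and "￭" is the joiner marker; the expected outputs follow from the behavior described later in this thread (a standalone joiner still joins its neighbours):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

# Leading, trailing, and standalone joiners should all detokenize to the same
# surface form, so the original attachment side cannot be recovered afterwards.
print(tokenizer.detokenize(["plan", "￭et"]))      # expected: "planet"
print(tokenizer.detokenize(["plan￭", "et"]))      # expected: "planet"
print(tokenizer.detokenize(["plan", "￭", "et"]))  # expected: "planet"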

Comment on lines 351 to 354
    kwopts['bpe_dropout'] = subword_alpha if is_train else 0
elif subword_type == 'sentencepiece':
    kwopts['sp_model_path'] = subword_model
    kwopts['sp_nbest_size'] = subword_nbest

Contributor

You can directly reassign subword_alpha & subword_nbest to disable both bpe_dropout and SentencePiece sampling.
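
A minimal sketch of that suggestion, using the variable names from the hunk above (where exactly this would sit in the transform is an assumption):

# Disable subword regularization outside training once, instead of branching
# on is_train at each kwopts assignment.
if not is_train:
    subword_alpha = 0   # no BPE dropout / SentencePiece sampling smoothing
    subword_nbest = 1   # no SentencePiece nbest sampling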

Comment on lines -368 to -372
_diff_vocab = (
    src_subword_kwargs.get('vocabulary_path', '') !=
    tgt_subword_kwargs.get('vocabulary_path', '') or
    src_subword_kwargs.get('vocabulary_threshold', 0) !=
    tgt_subword_kwargs.get('vocabulary_threshold', 0))

Contributor

I personally prefer the current one, which seems more readable.

@Zenglinxiao (Contributor) left a comment

Could you elaborate a bit on the purpose of this joiner disjoin dropout mechanism?
If we drop independent joiners, we may have trouble recovering the original sentence when detokenizing some sequences. Also, randomly splitting the left/right joiner from tokens will increase the sequence length and cause more tokens to fall into <unk>.
Is there any good reason to apply this despite all those potential limitations/conflicts? What is the objective of this proposal?

@francoishernandez (Member)

The idea is to allow LM generation from incomplete words, without explicitly knowing that the word is incomplete.
E.g.
Prefix: "We can go to another planet"
generation 1: "We can go to another planet like Mars."
generation 2: "We can go to another planetarium since this one is closed."
In the case of generation 2, we need a joiner at some point. In most cases, joiners are at the end of a subword. But here we can't know beforehand that we will need a joiner. Hence, the model would learn by itself to place a standalone joiner when needed.
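
In token form, the second generation could look roughly like this (an illustrative, hypothetical tokenization, not output from a trained model):

# Prefix tokens provided to the LM:
prefix = ["We", "can", "go", "to", "another", "planet"]

# Continuation where the model emits a standalone joiner "￭" so that "arium"
# attaches to the previous word when detokenizing:
continuation = ["￭", "arium", "since", "this", "one", "is", "closed", "￭."]

# Detokenizing prefix + continuation should yield:
# "We can go to another planetarium since this one is closed."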

@funboarder13920 (Collaborator, Author) commented Feb 17, 2021

> Could you elaborate a bit on the purpose of this joiner disjoin dropout mechanism?
> If we drop independent joiners, we may have trouble recovering the original sentence when detokenizing some sequences. Also, randomly splitting the left/right joiner from tokens will increase the sequence length and cause more tokens to fall into <unk>.
> Is there any good reason to apply this despite all those potential limitations/conflicts? What is the objective of this proposal?

The goal is to make inference possible at any point in the sentence, even in the middle of a word, without having to handle that at translation/generation time.
Do you have an example of a sequence that cannot be decoded with this mechanism? I've tried a few and the joiner marker will join words even when it is not attached to any token.
Regarding the increase in sequence length, it's the same issue as with bpe dropout. We chose to introduce randomness to limit the increase in sequence length rather than using joiner_new.
Some tokens might fall into <unk>, but this will happen with rare tokens that might not be in the vocabulary anyway; the number of encountered <unk> increases with bpe_dropout as well as with joiner disjoin dropout. I chose to cut the bpe construction at fairly high-frequency merges, so there should not be a lot of <unk>: the most frequent tokens will be seen when building the vocabulary. For example, we go from a vocab size of 43704 with bpe only to 46621 with bpe dropout and joiner disjoin dropout. I think we got most of the tokens that can appear, but I can still check the number of <unk> seen. Moreover, the randomness in these <unk> and the amount of data used for the task might make <unk> a non-issue.

One issue might be that <unk> is considered in the loss function, which could increase its likelihood.

@Zenglinxiao (Contributor)

> Do you have an example of a sequence that cannot be decoded with this mechanism? I've tried a few and the joiner marker will join words even when it is not attached to any token.

Yes, the joiner mark will join words whether or not it is attached to a token. Actually, I'm talking about lines 85-86 in the method dropout_separate_joiner, where individual joiners are simply ignored.
Ex: "word ■ ," --detok--> "word,"; but "word ," --detok--> "word ,".

@funboarder13920 (Collaborator, Author) commented Feb 17, 2021

It is a mistake; the standalone joiner needs to be added to out_seq, not removed.
I don't expect to encounter individual joiners; it seems they are attached to the punctuation in this mode, but I handled this case in case someone wants to use another mode.
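
Presumably the fix amounts to something like the following change to the hunk quoted earlier (a hypothetical sketch, not the actual commit):

if elem == SubwordMarker.JOINER:
    # Keep standalone joiners instead of silently dropping them, so that
    # e.g. "word ￭ ," still detokenizes to "word," after the transform.
    out_seq.append(elem)
    continue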

@funboarder13920 changed the title from "[WIP] implement safe bpe dropout for LM + joiner disjoin dropout" to "Safe bpe dropout for LM + joiner disjoin dropout" Feb 19, 2021
@funboarder13920 changed the title from "Safe bpe dropout for LM + joiner disjoin dropout" to "[WIP] Safe bpe dropout for LM + joiner disjoin dropout" Feb 19, 2021