Correctly checking if token is in tokenizer #150

Open
VeritasJoker opened this issue Feb 26, 2023 · 3 comments
@VeritasJoker
Contributor

I was running encoding and found that the gpt2 base_df has some tokens marked as False in the in_gpt2-xl column, so I looked into it. It turns out something odd is going on in the tokenizer.

As seen in the example below, the word You're is tokenized into 'ĠYou' and "'re", and both tokens exist in the gpt2-xl tokenizer's vocabulary. However, when we tokenize "'re" again in tfsemb_LMBase.py, it is split further into "Ġ'" and 're', so the helper function here returns False when it should have returned True:

if len(tokenizer.tokenize(x)) == 1:

[Screenshot: tokenizer output, 2023-02-25]

Really not sure what's happening here. Is it the left padding, or the double quotes vs. single quotes? I will do some testing, but if @hvgazula and @zkokaja have any ideas, let me know. Thanks!
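For anyone digging into this, the split appears to come from GPT-2's pre-tokenization step, which is sensitive to whether a contraction still has its word attached. A minimal sketch below reproduces the behavior with a simplified, ASCII-only version of the GPT-2 pre-tokenization regex (the real tokenizer uses the third-party `regex` module with `\p{L}`/`\p{N}` classes, and a leading space is rendered as Ġ rather than " "):

```python
import re

# Simplified, ASCII-only approximation of GPT-2's pre-tokenization pattern.
# The real pattern lives in the HuggingFace GPT2 tokenizer and uses the
# `regex` module; this stdlib version is enough to show the contraction issue.
PAT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

def pre_tokenize(text):
    """Split text the way GPT-2's BPE does before any merges are applied."""
    return PAT.findall(text)

print(pre_tokenize("You're"))  # contraction split off the word: ['You', "'re"]
print(pre_tokenize("'re"))     # alone, it stays one piece: ["'re"]
print(pre_tokenize(" 're"))    # with a leading space: [" '", 're']
```

So "'re" on its own survives as a single piece, but once a space precedes the apostrophe, the contraction branch of the regex no longer matches and the apostrophe is grouped with the space instead, which would explain the "Ġ'" + 're' split reported above.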

@zkokaja
Contributor

zkokaja commented Feb 26, 2023

Why and where do we tokenize "'re" again? Should we be using another column instead?

Good catch

@zkokaja
Contributor

zkokaja commented Mar 2, 2023

option 1: remove the in_ model columns and align on word_idx instead
option 2: look at using token instead of token2word?
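To make option 1 concrete, here is a rough sketch of aligning the base words with a model's tokenized rows on a shared word_idx, instead of trusting per-model in_ columns. The row shapes and field names here are illustrative only, not the actual ones in the repo:

```python
# Hypothetical sketch of option 1: keep a word_idx on every row and join on it,
# so membership no longer depends on re-tokenizing token strings.

def align_on_word_idx(base_rows, model_rows):
    """Keep only base rows whose word_idx also appears in the model's rows."""
    model_idx = {row["word_idx"] for row in model_rows}
    return [row for row in base_rows if row["word_idx"] in model_idx]

base_rows = [
    {"word_idx": 0, "word": "You're"},
    {"word_idx": 1, "word": "right"},
    {"word_idx": 2, "word": "maybe"},  # not produced by this model's run
]
model_rows = [
    {"word_idx": 0, "token": "ĠYou"},
    {"word_idx": 0, "token": "'re"},
    {"word_idx": 1, "token": "Ġright"},
]
print(align_on_word_idx(base_rows, model_rows))
```

Because the join key is the word's position rather than its token string, this stays correct even if the tokenizer splits "'re" differently in different contexts.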

@zkokaja
Contributor

zkokaja commented Mar 9, 2023

Waiting on #153 to include word_idx; then this can be handled in encoding. Option 1 is also robust to changes in tokenization strategies.
