Correctly checking if token is in tokenizer #150

Open
VeritasJoker opened this issue Feb 26, 2023 · 3 comments
@VeritasJoker
Contributor

I was running encoding and found that the gpt2 base_df has some tokens marked as False in the in_gpt2-xl column, so I looked into it. It turns out something odd is going on in the tokenizer.

As seen in the example below, the word You're is tokenized into 'ĠYou' and "'re", and both tokens exist in the gpt2-xl tokenizer's vocabulary. However, when we tokenize "'re" again in tfsemb_LMBase.py, it is split further into "Ġ'" and 're', so the helper function here returns False when it should have returned True:

if len(tokenizer.tokenize(x)) == 1:

[Screenshot: tokenizer output, 2023-02-25]

Really not sure what's happening here. Is it the left padding, or the double quotes vs. single quotes? I will do some testing, but if @hvgazula and @zkokaja have any ideas, let me know. Thanks!
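For anyone digging into this, the split appears to come from GPT-2's pre-tokenization step, which is sensitive to whether a contraction still has its word attached. A minimal sketch below reproduces the behavior with a simplified, ASCII-only version of the GPT-2 pre-tokenization regex (the real tokenizer uses the third-party `regex` module with `\p{L}`/`\p{N}` classes, and a leading space is rendered as Ġ rather than " "):

```python
import re

# Simplified, ASCII-only approximation of GPT-2's pre-tokenization pattern.
# The real pattern lives in the HuggingFace GPT2 tokenizer and uses the
# `regex` module; this stdlib version is enough to show the contraction issue.
PAT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

def pre_tokenize(text):
    """Split text the way GPT-2's BPE does before any merges are applied."""
    return PAT.findall(text)

print(pre_tokenize("You're"))  # contraction split off the word: ['You', "'re"]
print(pre_tokenize("'re"))     # alone, it stays one piece: ["'re"]
print(pre_tokenize(" 're"))    # with a leading space: [" '", 're']
```

So "'re" on its own survives as a single piece, but once a space precedes the apostrophe, the contraction branch of the regex no longer matches and the apostrophe is grouped with the space instead, which would explain the "Ġ'" + 're' split reported above.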

@zkokaja
Contributor

zkokaja commented Feb 26, 2023

Why and where do we tokenize "'re" again? Should we be using another column instead?

Good catch

@zkokaja
Contributor

zkokaja commented Mar 2, 2023

option 1: remove the in_ model columns and align on word_idx instead
option 2: look at using token instead of token2word?
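To make option 1 concrete, here is a rough sketch of aligning the base words with a model's tokenized rows on a shared word_idx, instead of trusting per-model in_ columns. The row shapes and field names here are illustrative only, not the actual ones in the repo:

```python
# Hypothetical sketch of option 1: keep a word_idx on every row and join on it,
# so membership no longer depends on re-tokenizing token strings.

def align_on_word_idx(base_rows, model_rows):
    """Keep only base rows whose word_idx also appears in the model's rows."""
    model_idx = {row["word_idx"] for row in model_rows}
    return [row for row in base_rows if row["word_idx"] in model_idx]

base_rows = [
    {"word_idx": 0, "word": "You're"},
    {"word_idx": 1, "word": "right"},
    {"word_idx": 2, "word": "maybe"},  # not produced by this model's run
]
model_rows = [
    {"word_idx": 0, "token": "ĠYou"},
    {"word_idx": 0, "token": "'re"},
    {"word_idx": 1, "token": "Ġright"},
]
print(align_on_word_idx(base_rows, model_rows))
```

Because the join key is the word's position rather than its token string, this stays correct even if the tokenizer splits "'re" differently in different contexts.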

@zkokaja
Contributor

zkokaja commented Mar 9, 2023

Waiting on #153 to include word_idx; then this can be handled in encoding. Option 1 is also robust to changes in tokenization strategies.
