I was running encoding and found that the gpt2 `base_df` has some tokens marked `False` in the `in_gpt2-xl` column, so I looked into it. It turns out something odd is going on in the tokenizer.
As the example below shows, the word `You're` is tokenized into `'ĠYou'` and `"'re"`, and both tokens exist in the gpt2-xl tokenizer. However, when we tokenize `"'re"` again in `tfsemb_LMBase.py`, it is split further into `"Ġ'"` and `'re'`, causing the helper function to return `False` here when it should return `True`:
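The behavior above can be reproduced without the real model. Below is a minimal sketch (a toy greedy longest-match tokenizer over a hand-picked vocabulary, **not** the project's actual GPT-2 BPE): in context, `'re` matches a single vocabulary entry, but once a leading space is attached (GPT-2 marks it with `Ġ`), `Ġ're` has no single-token match and splits into `Ġ'` + `re`.

```python
# Toy vocabulary (illustrative only; real GPT-2 has ~50k merges/tokens).
VOCAB = {"ĠYou", "'re", "Ġ'", "re", "'"}

def toy_tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest match first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

# In context, " You're" keeps "'re" as one token:
print(toy_tokenize("ĠYou're"))  # ['ĠYou', "'re"]

# Re-tokenized on its own with a leading space, it splits:
print(toy_tokenize("Ġ're"))     # ["Ġ'", 're']
```

If the second pass over `"'re"` prepends a space (e.g. left padding or an `add_prefix_space`-style setting), this is exactly the `"Ġ'"` + `'re'` split described above.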
Really not sure what's happening here. Is it the left padding, or the double vs. single quotes? I'll do some testing, but if @hvgazula and @zkokaja have any ideas, let me know! Thanks!
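One possible workaround (a sketch, not the project's code): instead of re-tokenizing each token string, look it up directly in the target tokenizer's vocabulary, e.g. via `get_vocab()` on a Hugging Face tokenizer. A plain dict with made-up ids stands in for that vocabulary here, so nothing needs to be downloaded.

```python
# `xl_vocab` stands in for gpt2-xl's tokenizer.get_vocab() mapping
# (token string -> id); the ids below are placeholders, not real ones.
xl_vocab = {"ĠYou": 0, "'re": 1, "Ġ'": 2, "re": 3}

def in_target_vocab(token: str, vocab: dict[str, int]) -> bool:
    """Check membership directly, avoiding a second tokenization pass
    that can split tokens differently (e.g. "'re" -> "Ġ'" + "re")."""
    return token in vocab

print(in_target_vocab("'re", xl_vocab))   # True: the token survives intact
print(in_target_vocab("Ġ're", xl_vocab))  # False: no such single token
```

A direct lookup sidesteps whatever padding or prefix-space setting the second tokenization pass applies, which may be why the current helper disagrees with the vocabulary.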
Relevant code: `247-pickling/scripts/tfsemb_LMBase.py`, line 39 at commit `961fbec`.