
Presence of character Ġ before each token in output #87

Open
osmanii opened this issue Apr 18, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@osmanii
osmanii commented Apr 18, 2023

I was working through the "05- Neuron Factors.ipynb" notebook and noticed the character Ġ before each token in the output of nmf_1.explore(). I am not sure why this happens. Please see the screenshot below.

[screenshot: explore() output with Ġ prefixed to each token]

Your help is appreciated.

@cristianestojeda

This happened to me with GPT-2, and I solved the issue by adding the following lines:

if self.config['token_prefix'] is not None and token[0] == self.config['token_prefix']:
    token = token[1:]

right after the first loop line of the nmf.explore() method:
for idx, token in enumerate(self.tokens[input_sequence]):  # self.tokens[:-1]
    if self.config['token_prefix'] is not None and token[0] == self.config['token_prefix']:
        token = token[1:]
    type = "input" if idx < self.n_input_tokens else 'output'
    tokens.append({'token': token,
                   'token_id': int(self.token_ids[input_sequence][idx]),
                   # 'token_id': int(self.token_ids[idx]),
                   'type': type,
                   # 'value': str(components[0][comp_num][idx]),  # because json complains of floats
                   'position': idx
                   })
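The stripping logic in the patch above can be sketched standalone. This is a minimal, self-contained version; the sample token strings and the prefix value are illustrative, while ecco itself reads the real prefix from its model config ('token_prefix'):

```python
# Sketch of the prefix-stripping logic from the patch above.
# TOKEN_PREFIX is the word-boundary marker used by GPT-2's byte-level BPE.
TOKEN_PREFIX = '\u0120'  # renders as 'Ġ'

def strip_prefix(token, prefix=TOKEN_PREFIX):
    """Remove the leading word-boundary marker from a token, if present."""
    if prefix is not None and token and token[0] == prefix:
        return token[1:]
    return token

tokens = ['The', '\u0120quick', '\u0120brown', '\u0120fox']
print([strip_prefix(t) for t in tokens])  # ['The', 'quick', 'brown', 'fox']
```

Note this only cleans up the display; the underlying token IDs are unchanged, which matches the patch (it rewrites `token` but keeps `token_id` as-is).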

@jalammar jalammar added the bug Something isn't working label Jun 28, 2023
@jalammar
Owner

Yeah, that shouldn't happen. A number of tokenizers prepend a character like Ġ to a token to mark a word boundary; in GPT-2's byte-level BPE, Ġ encodes the space that precedes the token. That is why rendering the output needs to run in tandem with the tokenizer and its settings.
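For context on where the character comes from (assuming GPT-2's byte-level BPE, which remaps non-printable bytes to visible Unicode code points so every byte has a displayable form): the space byte 0x20 is mapped to U+0120, which renders as Ġ. A quick check:

```python
# GPT-2's byte-level BPE maps every byte to a printable Unicode character.
# Bytes that are not already printable, such as the space (0x20), are
# shifted into an unused range; the space byte lands on U+0120 ('Ġ').
space_byte = 0x20
marker = chr(space_byte + 0x100)
print(marker)  # Ġ
```

So a token rendered as Ġquick simply means " quick" (space plus "quick") in the original text, which is why stripping the marker for display is safe.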

KendallPark added a commit to KendallPark/ecco that referenced this issue Dec 17, 2023