
Presence of character Ġ before each token in output #87

Open
osmanii opened this issue Apr 18, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@osmanii
osmanii commented Apr 18, 2023

I was working through the "05- Neuron Factors.ipynb" notebook and noticed the character Ġ before each token in the output of nmf_1.explore(). I am not sure why this happens. Please see the screenshot below.

[screenshot: explore() output with Ġ prefixed to each token]

Your help is appreciated.

@cristianestojeda

This happened to me with GPT-2, and I solved the issue by adding the following lines:

if self.config['token_prefix'] is not None and token[0] == self.config['token_prefix']:
    token = token[1:]

right after the first loop line of the nmf.explore() method:
for idx, token in enumerate(self.tokens[input_sequence]):  # self.tokens[:-1]
    if self.config['token_prefix'] is not None and token[0] == self.config['token_prefix']:
        token = token[1:]
    type = "input" if idx < self.n_input_tokens else 'output'
    tokens.append({'token': token,
                   'token_id': int(self.token_ids[input_sequence][idx]),
                   # 'token_id': int(self.token_ids[idx]),
                   'type': type,
                   # 'value': str(components[0][comp_num][idx]),  # because json complains of floats
                   'position': idx
                   })
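The stripping logic in the patch above can be sketched standalone. This is a minimal, self-contained version; the sample token strings and the prefix value are illustrative, while ecco itself reads the real prefix from its model config ('token_prefix'):

```python
# Sketch of the prefix-stripping logic from the patch above.
# TOKEN_PREFIX is the word-boundary marker used by GPT-2's byte-level BPE.
TOKEN_PREFIX = '\u0120'  # renders as 'Ġ'

def strip_prefix(token, prefix=TOKEN_PREFIX):
    """Remove the leading word-boundary marker from a token, if present."""
    if prefix is not None and token and token[0] == prefix:
        return token[1:]
    return token

tokens = ['The', '\u0120quick', '\u0120brown', '\u0120fox']
print([strip_prefix(t) for t in tokens])  # ['The', 'quick', 'brown', 'fox']
```

Note this only cleans up the display; the underlying token IDs are unchanged, which matches the patch (it rewrites `token` but keeps `token_id` as-is).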

@jalammar jalammar added the bug Something isn't working label Jun 28, 2023
@jalammar
Owner

Yeah, that shouldn't happen. A number of tokenizers prepend a character like Ġ to a token to mark a word boundary; in GPT-2's byte-level BPE, Ġ encodes the space that precedes the token. That is why rendering the output needs to run in tandem with the tokenizer and its settings.
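For context on where the character comes from (assuming GPT-2's byte-level BPE, which remaps non-printable bytes to visible Unicode code points so every byte has a displayable form): the space byte 0x20 is mapped to U+0120, which renders as Ġ. A quick check:

```python
# GPT-2's byte-level BPE maps every byte to a printable Unicode character.
# Bytes that are not already printable, such as the space (0x20), are
# shifted into an unused range; the space byte lands on U+0120 ('Ġ').
space_byte = 0x20
marker = chr(space_byte + 0x100)
print(marker)  # Ġ
```

So a token rendered as Ġquick simply means " quick" (space plus "quick") in the original text, which is why stripping the marker for display is safe.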

KendallPark added a commit to KendallPark/ecco that referenced this issue Dec 17, 2023