Calculation of threshold in LLMLingua-2 #194

Open · cornzz opened this issue Nov 6, 2024 · 2 comments

cornzz commented Nov 6, 2024

I was wondering why the new_token_probs array is constructed for the calculation of the threshold. Why is word_probs not used directly?

new_token_probs = []
for word, word_prob in zip(words, word_probs):
    num_token = len(self.oai_tokenizer.encode(word))
    new_token_probs.extend([word_prob for _ in range(num_token)])
threshold = np.percentile(
    new_token_probs, int(100 * reduce_rate + 1)
)

This way, each token of the compressor model is re-tokenized with the OpenAI tokenizer, and the word probability is repeated once for every token the OpenAI tokenizer returns. The tokens of the compressor model look like this (after merging sub-word tokens):

['▁The', '▁report', '▁of', '▁the', '▁Civil', '▁Rights', ',', '▁Utilities', ',', '▁Economic', '▁Development', '▁and', '▁Arts', '▁Committee', ...]

Each word is prefixed with the special character ▁. The OpenAI tokenizer encodes ▁The into 3 ids because the ▁ character alone takes up 2 ids.

self.oai_tokenizer.encode('▁The')
[10634, 223, 791]
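
For reference, a minimal sketch of that behaviour, assuming the OpenAI tokenizer here is tiktoken's cl100k_base (the encoding actually configured in the compressor may differ, so the exact ids may not match the ones above); the point is only that the ▁ prefix costs extra ids while bare punctuation does not:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(len(enc.encode("▁")))     # the UTF-8 bytes of ▁ alone -> 2 ids here
print(len(enc.encode("The")))   # the bare word -> 1 id
print(len(enc.encode("▁The")))  # prefixed word -> 3 ids, as quoted above
print(len(enc.encode(",")))     # punctuation carries no ▁ prefix -> 1 id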

So effectively the word probability is repeated 3 times for each word.

Why is that?

I wonder how this affects the distribution and therefore the threshold, as the probability of every word is repeated an additional 2 times at minimum (longer words are split into more tokens, e.g. nondiscriminatory is 5 tokens), while the probabilities of punctuation characters are not repeated additionally, as they are not prefixed with ▁.
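
To make the concern concrete, here is a toy comparison with made-up probabilities and token counts (not taken from the repo) of the percentile threshold computed over word_probs versus the per-token repeated array:

import numpy as np

# hypothetical keep-probabilities for four compressor "words"
words      = ["▁The", "▁report", ",", "▁nondiscriminatory"]
word_probs = [0.9, 0.6, 0.3, 0.8]
# assumed OpenAI ids per item, including the 2 extra ids for the ▁ prefix
num_tokens = [3, 3, 1, 5]

# repeat each word's probability once per OpenAI id, as in the snippet above
token_probs = np.repeat(word_probs, num_tokens)

reduce_rate = 0.5
q = int(100 * reduce_rate + 1)
print(np.percentile(word_probs, q))   # threshold from word-level probabilities
print(np.percentile(token_probs, q))  # threshold from the repeated array (as in the repo)

With these numbers the two thresholds differ, since punctuation contributes a single entry to the distribution while every ▁-prefixed word contributes at least three.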

@pzs19

pzs19 (Contributor) commented Nov 11, 2024

Good question, and that is exactly what we are aiming for. The purpose of using the OpenAI tokenizer is to align the specified compression rate with the actual token consumption when using GPT. For example, take the words ["Learn", "about", "Tooooooooooooooookenizer"] and a compression rate of 66%. Without repeating the word probability, the compressor might drop the whole word "Tooooooooooooooookenizer", which leaves a token-level compression rate of only 2/9 (since "Tooooooooooooooookenizer" alone consists of 7 tokens), and that does not match the 66%. I hope this helps!
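
As a rough illustration of that alignment (again assuming tiktoken's cl100k_base for the OpenAI tokenizer, and assuming, as in the example above, that the compressor keeps "Learn" and "about" and drops the long word):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

words = ["Learn", "about", "Tooooooooooooooookenizer"]
kept  = ["Learn", "about"]  # suppose the compressor keeps 2 of the 3 words

total_tokens = sum(len(enc.encode(w)) for w in words)
kept_tokens  = sum(len(enc.encode(w)) for w in kept)

print(len(kept) / len(words))      # word-level rate: 2/3 ≈ 66%
print(kept_tokens / total_tokens)  # token-level rate, e.g. 2/9 if the long word is 7 ids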

cornzz (Author) commented Nov 11, 2024

@pzs19 Thank you for the response! I understand the need to repeat the probability for each OpenAI token. However, I am still not sure whether including the ▁ character for each word skews the distribution, as it adds 2 tokens for every word but not for punctuation characters.
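
For what it's worth, a quick way to quantify that (same cl100k_base assumption as above, items taken from the token list earlier) is to count how many extra ids the ▁ prefix adds per item:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

items = ['▁The', '▁report', '▁of', ',', '▁Committee']
for item in items:
    with_prefix    = len(enc.encode(item))
    without_prefix = len(enc.encode(item.lstrip('▁')))
    # words show +2 extra ids from the prefix, punctuation shows +0
    print(item, with_prefix, without_prefix, with_prefix - without_prefix)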
