Calculation of threshold in LLMLingua-2 #194

Open · cornzz opened this issue Nov 6, 2024 · 2 comments

cornzz commented Nov 6, 2024

I was wondering why the new_token_probs array is constructed for the calculation of the threshold. Why is word_probs not used directly?

new_token_probs = []
for word, word_prob in zip(words, word_probs):
    num_token = len(self.oai_tokenizer.encode(word))
    new_token_probs.extend([word_prob for _ in range(num_token)])
threshold = np.percentile(
    new_token_probs, int(100 * reduce_rate + 1)
)

This way, each token of the compressor model is re-tokenized with the OpenAI tokenizer, and the word probability is repeated once for every token the OpenAI tokenizer returns. The tokens of the compressor model look like this (after merging sub-word tokens):

['▁The', '▁report', '▁of', '▁the', '▁Civil', '▁Rights', ',', '▁Utilities', ',', '▁Economic', '▁Development', '▁and', '▁Arts', '▁Committee', ...]

Each word is prefixed with the special character ▁. The OpenAI tokenizer encodes ▁The into 3 ids because the ▁ character alone takes up 2 ids.

self.oai_tokenizer.encode('▁The')
[10634, 223, 791]
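
For reference, a minimal sketch of that behaviour, assuming the OpenAI tokenizer here is tiktoken's cl100k_base (the encoding actually configured in the compressor may differ, so the exact ids may not match the ones above); the point is only that the ▁ prefix costs extra ids while bare punctuation does not:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(len(enc.encode("▁")))     # the UTF-8 bytes of ▁ alone -> 2 ids here
print(len(enc.encode("The")))   # the bare word -> 1 id
print(len(enc.encode("▁The")))  # prefixed word -> 3 ids, as quoted above
print(len(enc.encode(",")))     # punctuation carries no ▁ prefix -> 1 id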

So effectively the word probability is repeated 3 times for each word.

Why is that?

I wonder how this affects the distribution and therefore the threshold, as the probability of every word is repeated an additional 2 times at minimum (longer words are split into more tokens, e.g. nondiscriminatory is 5 tokens), while the probabilities of punctuation characters are not repeated additionally, as they are not prefixed with ▁.
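
To make the concern concrete, here is a toy comparison with made-up probabilities and token counts (not taken from the repo) of the percentile threshold computed over word_probs versus the per-token repeated array:

import numpy as np

# hypothetical keep-probabilities for four compressor "words"
words      = ["▁The", "▁report", ",", "▁nondiscriminatory"]
word_probs = [0.9, 0.6, 0.3, 0.8]
# assumed OpenAI ids per item, including the 2 extra ids for the ▁ prefix
num_tokens = [3, 3, 1, 5]

# repeat each word's probability once per OpenAI id, as in the snippet above
token_probs = np.repeat(word_probs, num_tokens)

reduce_rate = 0.5
q = int(100 * reduce_rate + 1)
print(np.percentile(word_probs, q))   # threshold from word-level probabilities
print(np.percentile(token_probs, q))  # threshold from the repeated array (as in the repo)

With these numbers the two thresholds differ, since punctuation contributes a single entry to the distribution while every ▁-prefixed word contributes at least three.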

@pzs19

pzs19 (Contributor) commented Nov 11, 2024

Good question, and that is exactly what we are aiming for. The purpose of using the OpenAI tokenizer is to align the specified compression rate with the actual token consumption when using GPT. For example, take the words ["Learn", "about", "Tooooooooooooooookenizer"] and a compression rate of 66%. Without repeating the word probability, the compressor might drop the whole word "Tooooooooooooooookenizer", which leaves a token-level compression rate of only 2/9 (since "Tooooooooooooooookenizer" alone consists of 7 tokens), and that does not match the 66%. I hope this helps!
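
As a rough illustration of that alignment (again assuming tiktoken's cl100k_base for the OpenAI tokenizer, and assuming, as in the example above, that the compressor keeps "Learn" and "about" and drops the long word):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

words = ["Learn", "about", "Tooooooooooooooookenizer"]
kept  = ["Learn", "about"]  # suppose the compressor keeps 2 of the 3 words

total_tokens = sum(len(enc.encode(w)) for w in words)
kept_tokens  = sum(len(enc.encode(w)) for w in kept)

print(len(kept) / len(words))      # word-level rate: 2/3 ≈ 66%
print(kept_tokens / total_tokens)  # token-level rate, e.g. 2/9 if the long word is 7 ids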

cornzz (Author) commented Nov 11, 2024

@pzs19 Thank you for the response! I understand the need to repeat the probability for each OpenAI token. However, I am still not sure whether including the ▁ character for each word skews the distribution, as it adds 2 tokens for every word but not for punctuation characters.
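
For what it's worth, a quick way to quantify that (same cl100k_base assumption as above, items taken from the token list earlier) is to count how many extra ids the ▁ prefix adds per item:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

items = ['▁The', '▁report', '▁of', ',', '▁Committee']
for item in items:
    with_prefix    = len(enc.encode(item))
    without_prefix = len(enc.encode(item.lstrip('▁')))
    # words show +2 extra ids from the prefix, punctuation shows +0
    print(item, with_prefix, without_prefix, with_prefix - without_prefix)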
