[Bug]: Prompts smaller than iterative_size are not compressed #196

Open
cornzz opened this issue Nov 14, 2024 · 0 comments · May be fixed by #198
Labels
bug Something isn't working


cornzz commented Nov 14, 2024

Describe the bug

This concerns only LLMLingua / LongLLMLingua.

As part of #195 I noticed prompts smaller than `iterative_size` were not being compressed. The `iterative_size` parameter is 200 by default, causing the algorithm to ignore all tokens at certain prompt lengths below 200.

To be exact, the default settings effectively mean the following:

  • Prompts below 66 tokens are not compressed at all
  • For prompts between 66 and 98 tokens, compression increasingly takes effect
  • At 99 tokens, compression drops back to none, then increases again as the prompt length approaches 200

Token lengths here are those produced by the compression model tokenizer.

Not sure if this is a bug or intended behaviour, @iofu728?
The exact behaviour can be seen in the graph linked below, together with an explanation.

--
(From #61 I understand that `iterative_size` determines the length of the segments $s \in S$ from Eq. (5)?)

Why this is likely a bug:

Assume a 50-token prompt. First, `end` is set to the length of the prompt (`compressed_input_ids` is the original prompt here):

```python
end = min(iterative_size + start, compressed_input_ids.shape[1])
```

In the `get_compressed_input()` call, the `end` parameter is set to `end - iterative_size + delta_end` (`delta_end` being the prompt length + 2 here), resulting in a value of 50 - 200 + 52 = -98.

```python
) = self.get_compressed_input(
    loss,
    compressed_input_ids,
    compressed_attention_mask,
    end - iterative_size + delta_end,
    iterative_size=delta_end,
```

In `get_compressed_input()`, the `need_idx[end:] = 1` operation then causes all tokens to be kept (since `end` is -98), ignoring the result of the thresholding (`need_idx` signifies which tokens should be kept).

```python
else:
    need_idx = torch.concat([loss > threshold, loss[:1] > 0])
    need_idx[end:] = 1
    need_idx[: end - iterative_size] = 1
    loss = loss[need_idx[:-1]]
```

Normally, the operations in lines 1426 and 1427 limit compression to the segment currently being processed by the Iterative Token-level Prompt Compression algorithm. As demonstrated, this breaks the algorithm when the prompt is smaller than `iterative_size`.
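For illustration, here is a self-contained sketch of that failure mode, plugging in the values from the 50-token walkthrough above (`end = -98`, `iterative_size = delta_end = 52`); the `loss` tensor is random stand-in data, not real model output:

```python
import torch

loss = torch.rand(51)                  # stand-in per-token losses for a 50-token prompt
threshold = 0.5
end, iterative_size = -98, 52          # values as passed by iterative_compress_prompt()

need_idx = torch.concat([loss > threshold, loss[:1] > 0])
need_idx[end:] = 1                     # need_idx[-98:] on a 52-element tensor selects
                                       # *everything*, so every token is marked "keep"
need_idx[: end - iterative_size] = 1   # need_idx[:-150] selects nothing here
print(need_idx.all().item())           # True -> the thresholding result is discarded
```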

Graph explanation

The exact behaviour, i.e. how much of the prompt is considered for compression at different prompt lengths, can be seen here:
https://www.desmos.com/calculator/d7dbbqsdbv

The x-axis is prompt length, the y-axis signifies how many tokens of the prompt are considered for compression.

  • The $i$ variable is `iterative_size`
  • $s(x)$ is the initial value of `end` in `iterative_compress_prompt()`
  • $d$ is `delta_end` (and `iterative_size` inside `get_compressed_input()`)
  • $f(x)$ is the `end` parameter in `get_compressed_input()`
  • $g(x)$ yields the number of tokens considered for compression after `need_idx[end:] = 1`

Only $n(x)$ is displayed here.

Steps to reproduce

You can reproduce this using the official LLMLingua demo by trying to compress the following context of length 100 tokens, with `target_token` set to -1 and `ratio` to 0.5 (question and instruction left empty):

This report provides background information and issues for Congress regarding China's actions in the South China Sea (SCS) and East China Sea (ECS), with a focus on implications for U.S. strategic and policy interests. Other CRS reports focus on other aspects of maritime territorial disputes involving China. The issue for Congress is how the United States should respond to China's actions in the SCS and ECS—particularly China's island-building and base-construction activities in the Spratly 

The actual compression ratio will be 1.0x.
If you now remove the last word, "Spratly", compression suddenly works and the result is 2x compressed.
This is because the token count drops to 98, where, as mentioned above, compression fully works. If you further reduce the number of words, the compression ratio falls again, reaching 1.0x at a prompt length of around 66 tokens.
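For reference, a minimal reproduction sketch, assuming the `llmlingua` package as used by the demo (the compression keyword is `ratio` in older releases and `rate` in newer ones, and the result-dict keys may vary between versions):

```python
from llmlingua import PromptCompressor

# the 100-token passage quoted above
context = "This report provides background information and issues for Congress ..."

compressor = PromptCompressor()  # loads the default compression model
result = compressor.compress_prompt(
    context, instruction="", question="", ratio=0.5, target_token=-1
)
print(result["origin_tokens"], result["compressed_tokens"])
# expected: identical token counts (1.0x) until the context drops to 98 tokens
```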

How to fix:

I suppose a possible fix would be adding the following at the beginning of `get_compressed_input()`:

```python
if end < iterative_size:
    end = iterative_size
```

This way, prompts shorter than `iterative_size` are still compressed. I don't think this introduces side effects for other cases, as `end` shouldn't be smaller than `iterative_size` other than in this specific case.
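Re-running the earlier sketch with this clamp in place (again with random stand-in data) shows the slice boundaries becoming sane:

```python
import torch

loss = torch.rand(51)
threshold = 0.5
end, iterative_size = -98, 52

if end < iterative_size:               # proposed clamp
    end = iterative_size               # end is now 52, i.e. the whole short prompt

need_idx = torch.concat([loss > threshold, loss[:1] > 0])
need_idx[end:] = 1                     # need_idx[52:] selects nothing
need_idx[: end - iterative_size] = 1   # need_idx[:0] selects nothing
print(need_idx.all().item())           # almost certainly False -> compression happens
```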

This graph shows the behaviour with this fix, $n(x)$ being the new `end`:

https://www.desmos.com/calculator/69cm0iasqz


Semi-related (bug?)

This line

```python
while end <= compressed_input_ids.shape[1]:
```

means that, if the prompt length is not divisible by `iterative_size`, a remaining segment at the end of the prompt is ignored during compression. This is because `end` is incremented by `iterative_size` after each iteration, and if the remaining segment is smaller than `iterative_size`, it is never processed.

An example for a prompt of length 500 at rate 0.5 (at least 3 iterations would be needed to process all tokens):

  • First iteration: the first 200 tokens are compressed to 100 tokens, so the prompt length is now 400; the `get_compressed_input()` call sets `end` to 100, which is then incremented to 300 in line 1742.
  • Second iteration: the next 200 tokens are compressed to 100 tokens, so the prompt length is now 300; the `get_compressed_input()` call sets `end` to 200, which is then incremented to 400.
  • Third iteration: there is no third iteration, since the prompt length (300) is now smaller than `end` (400). The last 100 tokens are therefore never processed and remain uncompressed.

In this case, 20% of the prompt is ignored completely. This graph shows how the ignored percentage changes with prompt size: https://www.desmos.com/calculator/8kohofzyb5
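A quick simulation of the loop confirms this (a sketch: it assumes each full 200-token window is compressed exactly 2x, as in the walkthrough above):

```python
iterative_size, rate = 200, 0.5
length = 500                      # compressed_input_ids.shape[1]
end = min(iterative_size, length)
kept = 0                          # tokens already processed and kept before `end`
while end <= length:
    segment = end - kept              # size of the current window
    compressed = int(segment * rate)  # stand-in for the token-level filtering
    length -= segment - compressed    # the prompt shrinks in place
    kept += compressed                # `end` as set by get_compressed_input()
    end = kept + iterative_size       # the increment in line 1742
print(length - kept)              # -> 100 tokens never processed (20% of 500)
```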

The effect of this can also be seen in the results of #195, where the achieved compression ratio for prompts between 250 and 750 tokens deviates quite a bit from the expected ratio, presumably because significant portions of the original prompts were ignored during compression.

Of course this can be mitigated by setting a smaller `iterative_size`, but even with the default value there should be a way to process the remaining tokens at the end of the prompt?

I don't have a solution here, as the algorithm breaks if you simply run one more iteration, and I don't have time to look into a proper solution...


Expected Behavior

Prompts should be compressed even when smaller than `iterative_size`.

Logs

No response

Additional Information

No response

@cornzz cornzz added the bug Something isn't working label Nov 14, 2024
@cornzz cornzz changed the title [Bug]: Prompts smaller than iterative_size are not compressed at all [Bug]: Prompts smaller than iterative_size are not compressed Nov 14, 2024
cornzz added a commit to cornzz/LLMLingua that referenced this issue Nov 14, 2024
@cornzz cornzz linked a pull request Nov 14, 2024 that will close this issue