
[Question]: Reproduce end2end latency results of LLMLingua-2 #193

Open
cornzz opened this issue Oct 23, 2024 · 3 comments
Comments

cornzz commented Oct 23, 2024

Describe the issue

@pzs19
I would like to reproduce and extend the end-to-end latency benchmark results from the LLMLingua-2 paper, and was wondering whether you could provide more details on your experiment setup. Specifically:

  • Which target LLM was evaluated, and how was it deployed? Was vLLM or something similar used?
  • For the result in Table 5, which prompt length was used, and what was the prompt?
  • What is the definition of end2end latency: from the beginning of compression until the first token is generated, or until the full response is generated?
  • What was max_token set to, and did you enforce the generation of a minimum number of tokens?

Thanks a lot!

@cornzz cornzz added the question Further information is requested label Oct 23, 2024
@cornzz cornzz changed the title [Question]: Reproduce end2end benchmarking of LLMLingua-2 [Question]: Reproduce end2end latency results of LLMLingua-2 Oct 23, 2024
pzs19 (Contributor) commented Nov 11, 2024

Thank you for raising these questions. Here is a point-by-point response:

  • The target LLM is GPT-3.5-Turbo-0613, so vLLM is not used.
  • The latency experiment is conducted on the summarization task of MeetingBank; the prompt follows the main experiment.
  • End2end latency counts from the beginning of compression until the full response is generated.
  • We set "max_token" to 400, following the main experiment (a rough measurement sketch is shown below).
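For reference, here is a minimal sketch of how these pieces fit together when measuring end-to-end latency (compression plus full generation). It assumes the llmlingua Python package and the OpenAI chat completions client; the transcript, instruction, and compression rate are placeholders, not the paper's exact settings.

```python
# Sketch of the end-to-end latency measurement described above.
# Assumptions (not confirmed in this thread): llmlingua is used for compression,
# the OpenAI client for generation; transcript, instruction, and rate are placeholders.
import time

from llmlingua import PromptCompressor
from openai import OpenAI

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",  # LLMLingua-2 compressor
    use_llmlingua2=True,
)
client = OpenAI()

transcript = "..."  # a MeetingBank transcript (placeholder)
instruction = "Summarize the meeting transcript above."  # placeholder instruction

start = time.perf_counter()

# 1) Compress the prompt (rate is a placeholder value).
compressed = compressor.compress_prompt(transcript, rate=0.33)

# 2) Generate the full response with the target LLM named above.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": compressed["compressed_prompt"] + "\n\n" + instruction}],
    max_tokens=400,  # as stated above
)

# End-to-end latency: from the start of compression until the full
# (non-streamed) response has been returned.
latency = time.perf_counter() - start
print(f"end2end latency: {latency:.2f}s")
```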

cornzz (Author) commented Nov 11, 2024

Thank you very much! 🙂

@cornzz cornzz closed this as completed Nov 11, 2024
cornzz (Author) commented Nov 13, 2024

@pzs19 @iofu728 Sorry, a follow-up question: which LLM was used for compression in the end-to-end latency benchmark of the original LLMLingua paper? Under "Implementation Details" it says

In our experiments, we utilize either Alpaca-7B or GPT2-Alpaca as the small pre-trained language model Mₛ for compression.

However, as far as I can see, it is not specified which of these two models was used for the end-to-end latency benchmark.
It is also not specified which compressor was used for the other benchmarks (GSM8K etc.), so that is a second question.

@cornzz cornzz reopened this Nov 14, 2024