GitHub - naologic/LLMLingua: To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.

(Long)LLMLingua: Enhancing Large Language Model Inference via Prompt Compression

LLMLingua_demo.mp4

News

🤳 Talk slides are available in AI Time Jan, 24.
🖥 EMNLP'23 slides are available in Session 5 and BoF-6.
📚 Check out our new blog post discussing RAG benefits and cost savings through prompt compression. See the script example here.
🎈 Visit our project page for real-world case studies in RAG, Online Meetings, CoT, and Code.
👨‍🦯 Explore our './examples' directory for practical applications, including RAG, Online Meeting, CoT, Code, and RAG using LlamaIndex.
👾 LongLLMLingua is now part of the LlamaIndex pipeline, a widely-used RAG framework.

TL;DR

LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (EMNLP 2023)
Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang and Lili Qiu

LongLLMLingua mitigates the 'lost in the middle' issue in LLMs, enhancing long-context information processing. It reduces costs and boosts efficiency with prompt compression, improving RAG performance by up to 21.4% using only 1/4 of the tokens.

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression (Under Review)
Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang and Lili Qiu

🎥 Overview

Ever encountered the token limit when asking ChatGPT to summarize lengthy texts?
Frustrated with ChatGPT forgetting previous instructions after extensive fine-tuning?
Experienced high costs using GPT3.5/4 API for experiments despite excellent results?

While Large Language Models like ChatGPT and GPT-4 excel in generalization and reasoning, they often face challenges like prompt length limits and prompt-based pricing schemes.

Now you can use LLMLingua & LongLLMLingua!

These tools offer an efficient solution to compress prompts by up to 20x, enhancing the utility of LLMs.

💰 Cost Savings: Reduces both prompt and generation lengths.
📝 Extended Context Support: Enhances support for longer contexts, mitigates the "lost in the middle" issue, and boosts overall performance.
⚖️ Robustness: No additional training needed for LLMs.
🕵️ Knowledge Retention: Maintains original prompt information like ICL and reasoning.
📜 KV-Cache Compression: Accelerates inference process.
🪃 Comprehensive Recovery: GPT-4 can recover all key information from compressed prompts.

If you find this repo helpful, please cite the following papers:

@inproceedings{jiang-etal-2023-llmlingua,
    title = "{LLML}ingua: Compressing Prompts for Accelerated Inference of Large Language Models",
    author = "Huiqiang Jiang and Qianhui Wu and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.825",
    doi = "10.18653/v1/2023.emnlp-main.825",
    pages = "13358--13376",
}

@article{jiang-etal-2023-longllmlingua,
    title = "{L}ong{LLML}ingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression",
    author = "Huiqiang Jiang and Qianhui Wu and and Xufang Luo and Dongsheng Li and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
    url = "https://arxiv.org/abs/2310.06839",
    journal = "ArXiv preprint",
    volume = "abs/2310.06839",
    year = "2023",
}

🎯 Quick Start

1. Installing (Long)LLMLingua:

To get started with (Long)LLMLingua, simply install it using pip:

pip install llmlingua

2. Using (Long)LLMLingua for Prompt Compression:

With (Long)LLMLingua, you can easily compress your prompts. Here’s how you can do it:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)

# > {'compressed_prompt': 'Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He reanged five of boxes into packages of sixlters each and sold them $3 per. He sold the rest theters separately at the of three pens $2. How much did make in total, dollars?\nLets think step step\nSam bought 1 boxes x00 oflters.\nHe bought 12 * 300ters in total\nSam then took 5 boxes 6ters0ters.\nHe sold these boxes for 5 *5\nAfterelling these  boxes there were 3030 highlighters remaining.\nThese form 330 / 3 = 110 groups of three pens.\nHe sold each of these groups for $2 each, so made 110 * 2 = $220 from them.\nIn total, then, he earned $220 + $15 = $235.\nSince his original cost was $120, he earned $235 - $120 = $115 in profit.\nThe answer is 115',
#  'origin_tokens': 2365,
#  'compressed_tokens': 211,
#  'ratio': '11.2x',
#  'saving': ', Saving $0.1 in GPT-4.'}

## Or use the quantation model, like TheBloke/Llama-2-7b-Chat-GPTQ, only need <8GB GPU memory.
## Before that, you need to pip install optimum auto-gptq
llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})

3. Learning More:

To understand how to apply LLMLingua and LongLLMLingua in real-world scenarios like RAG, Online Meetings, CoT, and Code, please refer to our examples. For detailed guidance, the documentation provides extensive recommendations on effectively utilizing LLMLingua.

Frequently Asked Questions

For more insights and answers, visit our FAQ section.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
examples		examples
images		images
llmlingua		llmlingua
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
DOCUMENT.md		DOCUMENT.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
SUPPORT.md		SUPPORT.md
Transparency_FAQ.md		Transparency_FAQ.md
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

(Long)LLMLingua: Enhancing Large Language Model Inference via Prompt Compression

News

TL;DR

🎥 Overview

🎯 Quick Start

1. Installing (Long)LLMLingua:

2. Using (Long)LLMLingua for Prompt Compression:

3. Learning More:

Frequently Asked Questions

Contributing

Trademarks

About

Uh oh!

Releases

Packages

Languages

License

naologic/LLMLingua

Folders and files

Latest commit

History

Repository files navigation

(Long)LLMLingua: Enhancing Large Language Model Inference via Prompt Compression

News

TL;DR

🎥 Overview

🎯 Quick Start

1. Installing (Long)LLMLingua:

2. Using (Long)LLMLingua for Prompt Compression:

3. Learning More:

Frequently Asked Questions

Contributing

Trademarks

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages