Fix prompt caching on llama.cpp endpoints #920

reversebias · 2024-03-09T12:11:40Z

In versions of llama.cpp since 3677, the prompt cache is dropped by the server unless `cache_prompt: true` is included in the request.

This change reduces prompt processing time in long chat threads. Local inference with large models can spend tens of seconds processing the prompt for chats with thousands of context tokens, so reusing the cache greatly improves responsiveness.
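
For illustration, a minimal sketch of a request that opts in to caching. It is not the actual diff from this PR. The `cache_prompt`, `prompt`, `n_predict` and `stream` fields and the `/completion` route come from the llama.cpp server API; the helper names `buildCompletionBody` and `complete`, the default token count, and the base URL are assumptions made for the example.

```typescript
// Sketch only: helper names, defaults and the base URL are hypothetical.
interface CompletionBody {
  prompt: string;
  n_predict: number;
  stream: boolean;
  cache_prompt: boolean;
}

// Build the JSON body for a llama.cpp server /completion request.
function buildCompletionBody(prompt: string, maxNewTokens = 256): CompletionBody {
  return {
    prompt,
    n_predict: maxNewTokens,
    stream: false,
    // Opt in explicitly: without this flag the server drops its prompt cache,
    // so each turn reprocesses the whole conversation.
    cache_prompt: true,
  };
}

// Send the request and return the generated text (non-streaming).
async function complete(prompt: string, baseUrl = "http://127.0.0.1:8080"): Promise<string> {
  const res = await fetch(`${baseUrl}/completion`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildCompletionBody(prompt)),
  });
  if (!res.ok) throw new Error(`llama.cpp server returned ${res.status}`);
  const data = (await res.json()) as { content: string };
  return data.content;
}
```

Caching pays off when each request's prompt starts with the same text as the previous one, as it does in a growing chat thread, because the server can then skip the shared prefix.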

nsarrazin · 2024-03-11T08:20:19Z

Thanks for the contribution! 🚀

Explicitly enable prompt caching on llama.cpp endpoints

0b3e42a

nsarrazin approved these changes Mar 11, 2024

Merge branch 'main' into fix/llama_cpp_prompt_caching

7954923

nsarrazin merged commit eb071be into huggingface:main Mar 11, 2024
3 checks passed
