🚀 Feature Description and Motivation
In many large language model (LLM) scenarios, especially multi-turn conversations or sessions where the user interacts repeatedly with the same context (e.g. chatbots, agents, assistant-like use cases), it is critical to efficiently reuse past prompt/history information instead of resending the entire conversation to the model on every turn.
Several popular APIs already support explicit context caching or context handles:
- Anthropic Claude’s prompt caching uses cache identifiers to rehydrate previous contexts.
- Google Gemini context caching provides context_cache_id to continue conversations.
- Moonshot Kimi context caching allows explicit reuse of context handles.
- Volcengine also offers conversation_id for session reuse.
We’d like to introduce an optional context caching interface in AIBrix, so that:
- Clients can pass in a conversation/session ID or similar handle when making requests.
- AIBrix can reuse already-processed KV cache / embedding context for that session, reducing repeated computation.
- Expose:
  - A way to create a new context handle (first request)
  - A way to continue using an existing handle (subsequent requests)
  - A way to explicitly clear / expire handles (or auto-timeout); a rough request-level sketch follows this list
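As an illustration only, the request-level flow might look like the following. The field names context_id and clear_context match the ones proposed under "Use Case" below; the endpoint path, model name, and the assumption that the response carries the minted handle are placeholders for the sketch, not a settled design.

```python
# Hypothetical request shapes for the proposed fields. Only context_id and
# clear_context come from this issue; everything else is illustrative.
import requests

URL = "http://aibrix-gateway:8000/v1/chat/completions"  # placeholder endpoint

# First request: no context_id yet; assume the gateway mints a new handle
# and returns it in the response body.
resp = requests.post(URL, json={
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
}).json()
ctx = resp.get("context_id")  # hypothetical response field

# Subsequent request: pass the handle so already-processed KV state is reused.
resp = requests.post(URL, json={
    "model": "llama-3-8b",
    "context_id": ctx,
    "messages": [{"role": "user", "content": "And a follow-up question."}],
}).json()

# Explicitly release the handle when the session ends (whether this is a
# field on the final request or a separate call is an open design question).
requests.post(URL, json={
    "model": "llama-3-8b",
    "context_id": ctx,
    "clear_context": True,
})
```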
This would likely require:
- Storing partial KV cache (or references to it) indexed by conversation/session IDs (see the sketch after this list).
- Coordinating with AIBrix's current GPU memory management and eviction mechanisms.
- Ensuring multi-tenant isolation and cleanup on failures.
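A minimal sketch of the session-indexed store, assuming a TTL-based expiry policy; ContextStore and KVHandle are hypothetical names, and a real implementation would hold references into AIBrix's GPU memory manager and enforce per-tenant quotas rather than plain Python objects:

```python
import time
import threading
from dataclasses import dataclass, field

@dataclass
class KVHandle:
    tenant_id: str
    kv_block_ids: list[int]          # references to engine-owned KV blocks
    last_used: float = field(default_factory=time.monotonic)

class ContextStore:
    def __init__(self, ttl_seconds: float = 600.0):
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._entries: dict[str, KVHandle] = {}

    def put(self, context_id: str, handle: KVHandle) -> None:
        with self._lock:
            self._entries[context_id] = handle

    def get(self, context_id: str, tenant_id: str) -> KVHandle | None:
        # Tenant check gives basic multi-tenant isolation: a handle is
        # only visible to the tenant that created it.
        with self._lock:
            h = self._entries.get(context_id)
            if h is None or h.tenant_id != tenant_id:
                return None
            h.last_used = time.monotonic()
            return h

    def evict_expired(self) -> list[KVHandle]:
        # Returned handles should have their KV blocks freed by the engine,
        # covering both auto-timeout and cleanup after failures.
        now = time.monotonic()
        with self._lock:
            dead = [k for k, h in self._entries.items()
                    if now - h.last_used > self._ttl]
            return [self._entries.pop(k) for k in dead]
```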
Use Case
- New API fields (e.g. context_id, clear_context).
- Internal engine / scheduler support to associate a context ID with existing KV cache.
- Metrics to track cache hit/miss rate and memory usage of stored contexts (one possible shape is sketched below).
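If AIBrix standardizes on Prometheus for these metrics, the counters and gauge could look roughly like this (metric names are placeholders, not existing AIBrix metrics):

```python
from prometheus_client import Counter, Gauge

CONTEXT_CACHE_HITS = Counter(
    "aibrix_context_cache_hits_total",
    "Requests that reused an existing context handle",
)
CONTEXT_CACHE_MISSES = Counter(
    "aibrix_context_cache_misses_total",
    "Requests whose context_id was unknown or expired",
)
CONTEXT_CACHE_BYTES = Gauge(
    "aibrix_context_cache_bytes",
    "Approximate memory held by stored contexts",
)
```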
Proposed Solution
No response