🚀 Feature Description and Motivation
In many large language model (LLM) scenarios, especially multi-turn conversations or sessions where the user interacts repeatedly with the same context (e.g. chatbots, agents, assistant-like use cases), it is critical to efficiently reuse past prompt/history information instead of resending the entire conversation to the model on every turn.
Several popular APIs already support explicit context caching or context handles:
- Anthropic Claude’s prompt caching uses cache identifiers to rehydrate previous contexts.
- Google Gemini context caching provides context_cache_id to continue conversations.
- Moonshot Kimi context caching allows explicit reuse of context handles.
- Volcengine also offers conversation_id for session reuse.
We’d like to introduce an optional context caching interface in AIBrix, so that:
- Clients can pass in a conversation/session ID or similar handle when making requests.
- AIBrix can reuse already-processed KV cache / embedding context for that session, reducing repeated computation.
- Expose:
  - A way to create a new context handle (first request)
  - A way to continue using an existing handle (subsequent requests)
  - A way to explicitly clear / expire handles (or auto-timeout); a rough request-level sketch follows this list
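As an illustration only, the request-level flow might look like the following. The field names context_id and clear_context match the ones proposed under "Use Case" below; the endpoint path, model name, and the assumption that the response carries the minted handle are placeholders for the sketch, not a settled design.

```python
# Hypothetical request shapes for the proposed fields. Only context_id and
# clear_context come from this issue; everything else is illustrative.
import requests

URL = "http://aibrix-gateway:8000/v1/chat/completions"  # placeholder endpoint

# First request: no context_id yet; assume the gateway mints a new handle
# and returns it in the response body.
resp = requests.post(URL, json={
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
}).json()
ctx = resp.get("context_id")  # hypothetical response field

# Subsequent request: pass the handle so already-processed KV state is reused.
resp = requests.post(URL, json={
    "model": "llama-3-8b",
    "context_id": ctx,
    "messages": [{"role": "user", "content": "And a follow-up question."}],
}).json()

# Explicitly release the handle when the session ends (whether this is a
# field on the final request or a separate call is an open design question).
requests.post(URL, json={
    "model": "llama-3-8b",
    "context_id": ctx,
    "clear_context": True,
})
```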
This would likely require:
- Storing partial KV cache (or references to it) indexed by conversation/session IDs (see the sketch after this list).
- Coordinating with AIBrix's current GPU memory management and eviction mechanisms.
- Ensuring multi-tenant isolation and cleanup on failures.
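A minimal sketch of the session-indexed store, assuming a TTL-based expiry policy; ContextStore and KVHandle are hypothetical names, and a real implementation would hold references into AIBrix's GPU memory manager and enforce per-tenant quotas rather than plain Python objects:

```python
import time
import threading
from dataclasses import dataclass, field

@dataclass
class KVHandle:
    tenant_id: str
    kv_block_ids: list[int]          # references to engine-owned KV blocks
    last_used: float = field(default_factory=time.monotonic)

class ContextStore:
    def __init__(self, ttl_seconds: float = 600.0):
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._entries: dict[str, KVHandle] = {}

    def put(self, context_id: str, handle: KVHandle) -> None:
        with self._lock:
            self._entries[context_id] = handle

    def get(self, context_id: str, tenant_id: str) -> KVHandle | None:
        # Tenant check gives basic multi-tenant isolation: a handle is
        # only visible to the tenant that created it.
        with self._lock:
            h = self._entries.get(context_id)
            if h is None or h.tenant_id != tenant_id:
                return None
            h.last_used = time.monotonic()
            return h

    def evict_expired(self) -> list[KVHandle]:
        # Returned handles should have their KV blocks freed by the engine,
        # covering both auto-timeout and cleanup after failures.
        now = time.monotonic()
        with self._lock:
            dead = [k for k, h in self._entries.items()
                    if now - h.last_used > self._ttl]
            return [self._entries.pop(k) for k in dead]
```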
Use Case
- New API fields (e.g. context_id, clear_context).
- Internal engine / scheduler support to associate a context ID with existing KV cache.
- Metrics to track cache hit/miss rate and memory usage of stored contexts (one possible shape is sketched below).
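If AIBrix standardizes on Prometheus for these metrics, the counters and gauge could look roughly like this (metric names are placeholders, not existing AIBrix metrics):

```python
from prometheus_client import Counter, Gauge

CONTEXT_CACHE_HITS = Counter(
    "aibrix_context_cache_hits_total",
    "Requests that reused an existing context handle",
)
CONTEXT_CACHE_MISSES = Counter(
    "aibrix_context_cache_misses_total",
    "Requests whose context_id was unknown or expired",
)
CONTEXT_CACHE_BYTES = Gauge(
    "aibrix_context_cache_bytes",
    "Approximate memory held by stored contexts",
)
```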
Proposed Solution
No response