
Support Context Cache for Improved Conversation Efficiency #1248

@Jeffwan

Description


🚀 Feature Description and Motivation

In many large language model (LLM) scenarios, especially multi-turn conversations or sessions where the user interacts repeatedly with the same context (e.g. chatbots, agents, assistant-like use cases), it is critical to reuse past prompt/history information efficiently rather than resending the entire conversation to the model on every turn.

Several popular APIs already support explicit context caching or context handles (for example, Gemini's context caching and Anthropic's prompt caching).

We’d like to introduce an optional context caching interface in AIBrix, so that:

  • Clients can pass a conversation/session ID or similar handle when making requests.
  • AIBrix can reuse the already-processed KV cache / embedding context for that session, reducing repeated computation.
  • The API exposes (a rough client-side sketch follows this list):
    • a way to create a new context handle (first request),
    • a way to continue using an existing handle (subsequent requests),
    • a way to explicitly clear / expire handles (or auto-timeout).
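
To make the lifecycle concrete, here is a rough client-side sketch against a hypothetical OpenAI-compatible AIBrix endpoint. The gateway address, the `/contexts` route, and the response shape are all assumptions for illustration, not a proposed final API:

```python
import requests

BASE_URL = "http://aibrix-gateway:8000/v1"  # hypothetical gateway address

# First request: no handle yet, so the server creates one and returns it
# (returning context_id in the response body is an assumption here).
resp = requests.post(f"{BASE_URL}/chat/completions", json={
    "model": "llama-3-8b-instruct",
    "messages": [{"role": "user", "content": "Summarize this document: ..."}],
})
context_id = resp.json().get("context_id")

# Subsequent request: pass the handle back so the already-processed KV
# cache for this session can be reused instead of recomputed.
resp = requests.post(f"{BASE_URL}/chat/completions", json={
    "model": "llama-3-8b-instruct",
    "messages": [{"role": "user", "content": "Now list the key risks."}],
    "context_id": context_id,
})

# Explicitly release the handle when the session ends (a dedicated DELETE
# route is one option; auto-timeout would cover forgotten handles).
requests.delete(f"{BASE_URL}/contexts/{context_id}")
```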

This would likely require:

  • Storing partial KV cache (or references to it) indexed by conversation/session ID.
  • Coordinating with AIBrix's existing GPU memory management and eviction mechanisms.
  • Ensuring multi-tenant isolation and cleanup on failures (see the sketch after this list).
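
Internally, the bookkeeping might look something like the minimal sketch below: a table mapping context handles to KV-cache block references, with per-tenant scoping and TTL-based expiry. All names and the block-ID representation are assumptions for illustration, not AIBrix internals:

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class ContextEntry:
    tenant_id: str
    kv_block_ids: list[int]  # references to engine-managed KV blocks
    last_used: float = field(default_factory=time.monotonic)


class ContextStore:
    """Illustrative handle -> KV-cache-reference table with TTL eviction."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._entries: dict[str, ContextEntry] = {}

    def create(self, tenant_id: str, kv_block_ids: list[int]) -> str:
        handle = uuid.uuid4().hex
        self._entries[handle] = ContextEntry(tenant_id, kv_block_ids)
        return handle

    def lookup(self, tenant_id: str, handle: str) -> list[int] | None:
        entry = self._entries.get(handle)
        # Enforce tenant isolation: a handle is only valid for its owner.
        if entry is None or entry.tenant_id != tenant_id:
            return None
        entry.last_used = time.monotonic()
        return entry.kv_block_ids

    def evict_expired(self) -> None:
        now = time.monotonic()
        expired = [h for h, e in self._entries.items()
                   if now - e.last_used > self.ttl]
        for handle in expired:
            # Real code would also release the underlying KV blocks here.
            del self._entries[handle]
```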

Use Case

  • New API fields (e.g. context_id, clear_context).
  • Internal engine / scheduler support to associate a context ID with existing KV cache.
  • Metrics to track the cache hit/miss rate and the memory usage of stored contexts (a minimal metrics sketch follows this list).
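
For the metrics item, a minimal sketch using prometheus_client; the metric names are placeholders, since AIBrix would follow its own naming conventions:

```python
from prometheus_client import Counter, Gauge

# Illustrative metric names, not an existing AIBrix metrics schema.
context_cache_hits = Counter(
    "aibrix_context_cache_hits_total",
    "Requests served from a cached context")
context_cache_misses = Counter(
    "aibrix_context_cache_misses_total",
    "Requests with no usable cached context")
context_cache_bytes = Gauge(
    "aibrix_context_cache_memory_bytes",
    "Memory held by stored contexts")
```

The hit/miss counters would be incremented in the handle lookup path, and the gauge updated by whichever component owns the stored KV blocks.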

Proposed Solution

No response
