Commit c2d9a4c

Merge branch 'prefix-aware-docs' of https://github.com/eicherseiji/ray into prefix-aware-docs

2 parents 9dd63dd + 1dde313
1 file changed: doc/source/serve/llm/prefix-aware-request-router.md (+24 −24 lines)
(prefix-aware-request-router-guide)=
# `PrefixCacheAffinityRouter` for LLM inference optimization

:::{warning}
This API is in alpha and may change before becoming stable.
:::

LLM inference can benefit significantly from cache locality optimization. When one replica processes multiple prompts that share a prefix, the engine can reuse previously computed KV-cache entries, reducing computation overhead and improving response times. This technique is known as [Automatic Prefix Caching (APC)](https://docs.vllm.ai/en/stable/features/automatic_prefix_caching.html) in vLLM. The `PrefixCacheAffinityRouter` is designed specifically for this use case.

This guide covers:
- Understanding the prefix cache-aware routing algorithm
- Building the components of a prefix-aware router
- Configuration parameters and their impact

(prefix-aware-algorithm)=
## How Ray Serve LLM prefix cache-aware routing works

The `PrefixCacheAffinityRouter` implements a multi-tier routing strategy that balances cache locality with load distribution:

### 1. Load balance check
First, the router evaluates whether the current load is balanced across replicas by comparing queue lengths. If the difference between the highest and lowest queue lengths is below the `imbalanced_threshold`, it proceeds with prefix cache-aware routing.
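This check can be sketched in a few lines of plain Python. This is an illustrative sketch, not the actual Ray Serve implementation; the function name and signature are assumptions:

```python
def is_load_balanced(queue_lengths: list[int], imbalanced_threshold: int = 10) -> bool:
    """Return True when the spread of replica queue lengths is within the threshold."""
    if not queue_lengths:
        return True
    # Compare the busiest replica against the least busy one.
    return max(queue_lengths) - min(queue_lengths) < imbalanced_threshold
```

For example, queue lengths of `[3, 5, 4]` count as balanced under the default threshold of 10, while `[0, 25, 2]` do not, so the router would fall back to pure load balancing in the second case.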

### 2. Prefix matching strategy
When load is balanced, the router uses a prefix tree to find replicas that have previously processed similar input text:

- **High match rate (≥10%)**: Routes to replicas with the highest prefix match rate for better cache hit rates
- **Low match rate (<10%)**: Falls back to replicas with the lowest prefix cache utilization, spreading prefixes across replicas
- **No prefix data**: Uses the default Power of Two Choices selection
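The three-tier decision above can be sketched as follows. This is illustrative only; the tier labels and function are assumptions, not the Ray Serve API:

```python
def choose_tier(match_rates: dict[str, float], match_rate_threshold: float = 0.1) -> str:
    """Pick a routing tier from per-replica prefix match rates (0.0-1.0)."""
    if not match_rates:
        return "power_of_two_choices"   # no prefix data recorded yet
    if max(match_rates.values()) >= match_rate_threshold:
        return "highest_match_rate"     # favor cache locality
    return "lowest_cache_utilization"   # spread prefixes across replicas
```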

### 3. Imbalanced load fallback
When load is imbalanced (queue length difference exceeds threshold), the router prioritizes load balancing over cache locality and falls back to the standard Power of Two Choices algorithm.

### Prefix tree management
The router maintains a distributed prefix tree actor that:
- Tracks input text prefixes processed by each replica
- Supports automatic eviction of old entries to manage memory usage
- Persists across router instances using Ray's detached actor pattern

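As a rough illustration of what the tree tracks, the following stand-in records prompts per replica and computes a prefix match rate. The class and method names are assumptions; the real router uses a multi-tenant approximate radix tree shared through a Ray actor:

```python
import os


class PrefixTracker:
    """Toy per-replica prefix tracker (stand-in for the real prefix tree)."""

    def __init__(self):
        self._texts: dict[str, list[str]] = {}  # replica id -> seen prompts

    def insert(self, replica_id: str, text: str) -> None:
        self._texts.setdefault(replica_id, []).append(text)

    def match_rate(self, replica_id: str, text: str) -> float:
        """Longest shared prefix with any seen prompt, as a fraction of `text`."""
        best = 0
        for seen in self._texts.get(replica_id, []):
            # commonprefix works character-wise on a pair of strings.
            best = max(best, len(os.path.commonprefix([seen, text])))
        return best / len(text) if text else 0.0
```

A prompt that repeats the system preamble of an earlier request to the same replica scores a high match rate, which is exactly the signal the router uses to keep such requests together.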
(building-prefix-aware-components)=
## Building prefix-aware router components

This section breaks down the key components of `PrefixCacheAffinityRouter` and shows how they work together. For a more basic example, see {ref}`custom-request-router-guide`.

### Base RequestRouter foundation

Like all custom routers in Ray Serve, the `PrefixCacheAffinityRouter` extends the base [`RequestRouter`](../api/doc/ray.serve.request_router.RequestRouter.rst) class. The two core methods that define router behavior are:

@@ -50,7 +50,7 @@ Like all custom routers in Ray Serve, the `PrefixCacheAffinityRouter` extends th

For a detailed explanation of these methods and their parameters, see the {ref}`simple-uniform-request-router` example in the custom request router guide.

### 1. Load balance detection component

The first component evaluates whether the current load is balanced across replicas:

@@ -64,13 +64,13 @@ The first component evaluates whether the current load is balanced across replic
This component prioritizes load balancing over cache locality when replicas become too imbalanced.

### 2. Prefix tree management component

The prefix tree component is implemented as a detached Ray actor that manages prefix tracking across the Serve application. The actual tree structure uses a multi-tenant prefix tree (approximate radix tree).

This distributed architecture allows the prefix information to persist across router restarts and be shared among multiple router instances.
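The get-or-create semantics behind this pattern can be illustrated with a plain-Python registry. This is illustrative only; in Ray, the same role is played by a named detached actor (for example via `Actor.options(name=..., lifetime="detached", get_if_exists=True)`):

```python
# Module-level registry standing in for Ray's named-actor lookup.
_ACTOR_REGISTRY: dict[str, object] = {}


class PrefixTreeActor:
    """Toy stand-in for the shared prefix tree actor."""

    def __init__(self):
        self.prefixes: dict[str, list[str]] = {}


def get_or_create(name: str) -> PrefixTreeActor:
    """Return the existing named instance, creating it on first use."""
    if name not in _ACTOR_REGISTRY:
        _ACTOR_REGISTRY[name] = PrefixTreeActor()
    return _ACTOR_REGISTRY[name]
```

Because every router instance looks the tree up by name, they all attach to the same shared state rather than each building its own.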

### 3. Prefix matching logic component

The core prefix matching component implements the routing decision logic in the `_prefix_match_best_replicas` method. When load is balanced, it performs prefix matching to find the best replica:

@@ -86,7 +86,7 @@ This logic implements the three-tier strategy:
2. **Low match rate**: Falls back to the replicas with the smallest KV-cache usage when the match rate is below the threshold
3. **No match**: Falls back to the default Power of Two Choices selection when `_prefix_match_best_replicas` returns control to `choose_replicas`

### 4. Integration with Power of Two Choices

The prefix-aware router extends the proven Power of Two Choices algorithm, falling back to it when prefix-based routing would degenerate. `PrefixCacheAffinityRouter` integrates this component in the `choose_replicas` method:

@@ -98,7 +98,7 @@ The prefix-aware router extends the proven Power of Two Choices algorithm, falli
```
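The fallback algorithm itself is simple to sketch: sample two replicas at random and route to the one with the shorter queue. This is illustrative code, not the Ray Serve implementation; the function and parameter names are assumptions:

```python
import random


def power_of_two_choices(queue_lengths: dict[str, int], rng=random) -> str:
    """Sample up to two replicas and return the id with the shorter queue."""
    candidates = rng.sample(list(queue_lengths), k=min(2, len(queue_lengths)))
    return min(candidates, key=queue_lengths.get)
```

Sampling only two candidates keeps the decision cheap while still avoiding the most loaded replicas with high probability, which is why it makes a safe fallback when prefix data offers no signal.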

### 5. State management and callbacks

The router uses the `on_request_routed()` callback to update the prefix tree with routing decisions:

@@ -109,7 +109,7 @@ The router uses the `on_request_routed()` callback to update the prefix tree wit
:caption: prefix_aware_router.py
```

When a replica dies, the router uses the `on_replica_actor_died` callback to remove the replica's entries from the shared prefix tree:
```{literalinclude} ../../../../python/ray/llm/_internal/serve/request_router/prefix_aware/prefix_aware_router.py
:start-after: __begin_on_replica_actor_died__
:end-before: __end_on_replica_actor_died__
@@ -118,36 +118,36 @@ When a replica dies, the `on_replica_actor_died` callback is used to remove its
```

(mixin-components)=
## Mixin components

The `PrefixCacheAffinityRouter` inherits from two mixins. For more details about these and other available mixins, see {ref}`utility-mixin`. The router uses these mixins to optimize the list of candidate replicas against which it calculates prefix cache hit rate.

The [`LocalityMixin`](../api/doc/ray.serve.request_router.LocalityMixin.rst) provides locality-aware routing to optimize network latency by preferring replicas on the same node. The [`MultiplexMixin`](../api/doc/ray.serve.request_router.MultiplexMixin.rst) enables model multiplexing support by tracking which models are loaded on each replica and routing requests to replicas that already have the requested model in memory.

## Configuration parameters

The `PrefixCacheAffinityRouter` provides several configuration parameters to tune its behavior:

### Core routing parameters

- **`imbalanced_threshold`** (default: 10): Queue length difference threshold for considering load balanced. Lower values prioritize load balancing over cache locality.

- **`match_rate_threshold`** (default: 0.1): Minimum prefix match rate (0.0-1.0) required to use prefix cache-aware routing. Higher values require stronger prefix matches before routing for cache locality.

### Memory management parameters

- **`do_eviction`** (default: False): Enable automatic eviction of old prefix tree entries to approximate the LLM engine's eviction policy.

- **`eviction_threshold_chars`** (default: 400,000): Maximum number of characters in the prefix tree before the router triggers an eviction.

- **`eviction_target_chars`** (default: 360,000): Target number of characters to reduce the prefix tree to during eviction.

- **`eviction_interval_secs`** (default: 10): Interval in seconds between eviction checks when eviction is enabled.
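These parameters interact as follows: once the tracked characters reach `eviction_threshold_chars`, entries are dropped until the total falls to `eviction_target_chars`. A minimal sketch, assuming oldest-first eviction (the real policy may differ):

```python
from collections import deque


def evict(entries: deque, threshold: int, target: int) -> None:
    """Drop oldest tracked prefix strings until the character total hits target."""
    total = sum(len(e) for e in entries)
    if total < threshold:
        return  # below the threshold, nothing to do
    while entries and total > target:
        total -= len(entries.popleft())  # evict oldest-first
```

With a 450-character tree, a threshold of 400, and a target of 360, a single pass drops the oldest entries until at most 360 characters remain.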

(deploy-llm-with-prefix-aware-router)=
## Deploying LLM applications with prefix cache-aware routing

Deploy an LLM application using the prefix cache-aware request router as follows:

```{literalinclude} ../../llm/doc_code/serve/prefix_aware_router/prefix_aware_example.py
:start-after: __prefix_aware_example_start__