doc/source/serve/llm/prefix-aware-request-router.md
Lines changed: 24 additions & 24 deletions
@@ -1,47 +1,47 @@
1
1
(prefix-aware-request-router-guide)=
2
-
# `PrefixCacheAffinityRouter` for LLM Inference Optimization
2
+
# `PrefixCacheAffinityRouter` for LLM inference optimization
3
3
4
4
:::{warning}
5
5
This API is in alpha and may change before becoming stable.
6
6
:::
7
7
8
-
LLM inference can benefit significantly from cache locality optimization. When prompts that share a prefix are processed by the same replica, the engine can reuse previously computed KV-cache entries, reducing computation overhead and improving response times. This technique is known as [Automatic Prefix Caching (APC)](https://docs.vllm.ai/en/stable/features/automatic_prefix_caching.html) in vLLM. The `PrefixCacheAffinityRouter` is designed specifically for this use case.
8
+
LLM inference can benefit significantly from cache locality optimization. When one replica processes multiple prompts that share a prefix, the engine can reuse previously computed KV-cache entries, reducing computation overhead and improving response times. This technique is known as [Automatic Prefix Caching (APC)](https://docs.vllm.ai/en/stable/features/automatic_prefix_caching.html) in vLLM. The `PrefixCacheAffinityRouter` is designed specifically for this use case.
9
9
10
10
This guide covers:
11
11
- Understanding the prefix cache-aware routing algorithm
12
12
- Building the components of a prefix-aware router
13
13
- Configuration parameters and their impact
14
14
15
15
(prefix-aware-algorithm)=
16
-
## How Ray Serve LLM Prefix Cache-Aware Routing Works
16
+
## How Ray Serve LLM prefix cache-aware routing works
17
17
18
18
The `PrefixCacheAffinityRouter` implements a multi-tier routing strategy that balances cache locality with load distribution:
19
19
20
-
### 1. Load Balance Check
20
+
### 1. Load balance check
21
21
First, it evaluates whether the current load is balanced across replicas by comparing queue lengths. If the difference between the highest and lowest queue lengths is below the `imbalanced_threshold`, it proceeds with prefix cache-aware routing.
22
22
23
-
### 2. Prefix Matching Strategy
23
+
### 2. Prefix matching strategy
24
24
When load is balanced, the router uses a prefix tree to find replicas that have previously processed similar input text:
25
25
26
26
- **High Match Rate (≥10%)**: Routes to replicas with the highest prefix match rate for better cache hit rates
27
27
- **Low Match Rate (<10%)**: Falls back to replicas with the lowest prefix cache utilization to increase utilization
28
28
- **No Prefix Data**: Uses the default Power of Two Choices selection
29
29
30
-
### 3. Imbalanced Load Fallback
30
+
### 3. Imbalanced load fallback
31
31
When load is imbalanced (queue length difference exceeds threshold), the router prioritizes load balancing over cache locality and falls back to the standard Power of Two Choices algorithm.
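The three steps above can be sketched in plain Python. This is a minimal illustration, not the Ray Serve implementation: `choose_replica`, `power_of_two_choices`, and the dictionary inputs are hypothetical names, and two details are assumptions because the text doesn't specify them (a queue difference exactly equal to the threshold counts as balanced, and ties inside a tier go to the shortest queue).

```python
import random
from typing import Dict, Optional


def power_of_two_choices(queue_lens: Dict[str, int], rng: random.Random) -> str:
    """Sample up to two replicas at random and keep the less loaded one."""
    candidates = rng.sample(sorted(queue_lens), k=min(2, len(queue_lens)))
    return min(candidates, key=lambda r: queue_lens[r])


def choose_replica(
    queue_lens: Dict[str, int],
    match_rates: Optional[Dict[str, float]],
    cache_usage: Dict[str, int],
    imbalanced_threshold: int = 10,
    match_rate_threshold: float = 0.1,
    rng: Optional[random.Random] = None,
) -> str:
    """Hypothetical helper mirroring the documented routing tiers."""
    rng = rng or random.Random()

    # Steps 1 and 3: if queues differ too much, load balancing wins.
    # Assumption: a difference equal to the threshold still counts as balanced.
    spread = max(queue_lens.values()) - min(queue_lens.values())
    if spread > imbalanced_threshold:
        return power_of_two_choices(queue_lens, rng)

    # Step 2, no prefix data: use the default selection.
    if not match_rates:
        return power_of_two_choices(queue_lens, rng)

    best_rate = max(match_rates.values())
    if best_rate >= match_rate_threshold:
        # High match rate: keep the replicas with the best prefix match.
        pool = [r for r, rate in match_rates.items() if rate == best_rate]
    else:
        # Low match rate: prefer the replicas with the least cache usage.
        least_used = min(cache_usage.values())
        pool = [r for r, used in cache_usage.items() if used == least_used]

    # Assumption: break ties inside a tier by the shortest queue.
    return min(pool, key=lambda r: queue_lens[r])
```

With balanced queues and a 50% match on replica `a`, the sketch returns `a`. With a queue difference of 50 it ignores match rates and returns the less loaded replica.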
32
32
33
-
### Prefix Tree Management
33
+
### Prefix tree management
34
34
The router maintains a distributed prefix tree actor that:
35
35
- Tracks input text prefixes processed by each replica
36
36
- Supports automatic eviction of old entries to manage memory usage
37
37
- Persists across router instances using Ray's detached actor pattern
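To make the tracking idea concrete, here is a toy character-level trie that records which replicas have seen each prefix. It's a simplified stand-in under stated assumptions, not the prefix tree actor itself: `ToyPrefixTree` and its methods are hypothetical names, it runs in-process rather than as a detached actor, and it leaves out eviction.

```python
from typing import Dict, Set, Tuple


class _Node:
    def __init__(self) -> None:
        self.children: Dict[str, "_Node"] = {}
        self.replicas: Set[str] = set()  # replicas that processed this prefix


class ToyPrefixTree:
    """Hypothetical in-process sketch of per-replica prefix tracking."""

    def __init__(self) -> None:
        self._root = _Node()

    def insert(self, text: str, replica_id: str) -> None:
        """Record that `replica_id` processed `text` (and all its prefixes)."""
        node = self._root
        for ch in text:
            node = node.children.setdefault(ch, _Node())
            node.replicas.add(replica_id)

    def longest_match(self, text: str) -> Tuple[int, Set[str]]:
        """Return the matched prefix length and the replicas that share it."""
        node = self._root
        matched = 0
        replicas: Set[str] = set()
        for ch in text:
            child = node.children.get(ch)
            if child is None or not child.replicas:
                break
            node, matched, replicas = child, matched + 1, set(child.replicas)
        return matched, replicas

    def remove_replica(self, replica_id: str) -> None:
        """Drop a replica from every node, for example after it dies."""
        stack = [self._root]
        while stack:
            node = stack.pop()
            node.replicas.discard(replica_id)
            stack.extend(node.children.values())
```

A match rate in the sense used above would then be `matched / len(text)` for a non-empty prompt.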
38
38
39
39
(building-prefix-aware-components)=
40
-
## Building Prefix-Aware Router Components
40
+
## Building prefix-aware router components
41
41
42
42
This section breaks down the key components of `PrefixCacheAffinityRouter` and shows how they work together. For a more basic example, see {ref}`custom-request-router-guide`.
43
43
44
-
### Base RequestRouter Foundation
44
+
### Base RequestRouter foundation
45
45
46
46
Like all custom routers in Ray Serve, the `PrefixCacheAffinityRouter` extends the base [`RequestRouter`](../api/doc/ray.serve.request_router.RequestRouter.rst) class. The two core methods that define router behavior are:
47
47
@@ -50,7 +50,7 @@ Like all custom routers in Ray Serve, the `PrefixCacheAffinityRouter` extends th
50
50
51
51
For a detailed explanation of these methods and their parameters, see the {ref}`simple-uniform-request-router` example in the custom request router guide.
52
52
53
-
### 1. Load Balance Detection Component
53
+
### 1. Load balance detection component
54
54
55
55
The first component evaluates whether the current load is balanced across replicas:
56
56
@@ -64,13 +64,13 @@ The first component evaluates whether the current load is balanced across replic
64
64
This component prioritizes load balancing over cache locality when replicas become too imbalanced.
65
65
66
66
67
-
### 2. Prefix Tree Management Component
67
+
### 2. Prefix tree management component
68
68
69
69
The prefix tree component is implemented as a detached Ray actor that manages prefix tracking across the Serve application. The actual tree structure uses a multi-tenant prefix tree (approximate radix tree).
70
70
71
71
This distributed architecture allows the prefix information to persist across router restarts and be shared among multiple router instances.
72
72
73
-
### 3. Prefix Matching Logic Component
73
+
### 3. Prefix matching logic component
74
74
75
75
The core prefix matching component implements the routing decision logic in the `_prefix_match_best_replicas` method. When load is balanced, it performs prefix matching to find the best replica:
76
76
@@ -86,7 +86,7 @@ This logic implements the three-tier strategy:
86
86
2. **Low match rate**: Falls back to replicas with the smallest KV-cache usage when the match rate is below the threshold
87
87
3. **No match**: Falls back to the default Power of Two Choices selection when `_prefix_match_best_replicas` returns to `choose_replicas`.
88
88
89
-
### 4. Integration with Power of Two Choices
89
+
### 4. Integration with Power of Two Choices
90
90
91
91
The prefix-aware router extends the proven Power of Two Choices algorithm, falling back to it when prefix-based routing would degenerate. `PrefixCacheAffinityRouter` integrates this component in the `choose_replicas` method:
92
92
@@ -98,7 +98,7 @@ The prefix-aware router extends the proven Power of Two Choices algorithm, falli
98
98
```
99
99
100
100
101
-
### 5. State Management and Callbacks
101
+
### 5. State management and callbacks
102
102
103
103
The router uses the `on_request_routed()` callback to update the prefix tree with routing decisions:
104
104
@@ -109,7 +109,7 @@ The router uses the `on_request_routed()` callback to update the prefix tree wit
109
109
:caption: prefix_aware_router.py
110
110
```
111
111
112
-
When a replica dies, the `on_replica_actor_died` callback is used to remove its entries from the shared prefix tree:
112
+
When a replica dies, the router uses the `on_replica_actor_died` callback to remove the replica's entries from the shared prefix tree:
@@ -118,36 +118,36 @@ When a replica dies, the `on_replica_actor_died` callback is used to remove its
118
118
```
119
119
120
120
(mixin-components)=
121
-
## Mixin Components
121
+
## Mixin components
122
122
123
-
The `PrefixCacheAffinityRouter` inherits from two mixins. For more details about these and other available mixins, see {ref}`utility-mixin`. These mixins are used to optimize the list of candidate replicas against which to calculate prefix cache hit rate.
123
+
The `PrefixCacheAffinityRouter` inherits from two mixins. For more details about these and other available mixins, see {ref}`utility-mixin`. The router uses these mixins to optimize the list of candidate replicas against which it calculates prefix cache hit rate.
124
124
125
125
The [`LocalityMixin`](../api/doc/ray.serve.request_router.LocalityMixin.rst) provides locality-aware routing to optimize network latency by preferring replicas on the same node. The [`MultiplexMixin`](../api/doc/ray.serve.request_router.MultiplexMixin.rst) enables model multiplexing support by tracking which models are loaded on each replica and routing requests to replicas that already have the requested model in memory.
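As a rough illustration of how such mixins can shrink the candidate list before prefix matching, the sketch below filters replicas by loaded model and then by node. Everything here is hypothetical (`ReplicaInfo`, `narrow_candidates`, and the order in which the two filters apply); it doesn't reproduce the `LocalityMixin` or `MultiplexMixin` APIs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ReplicaInfo:
    """Hypothetical record of what a router knows about a replica."""

    replica_id: str
    node_id: str
    loaded_models: Set[str] = field(default_factory=set)


def narrow_candidates(
    replicas: List[ReplicaInfo],
    local_node_id: str,
    model_id: Optional[str] = None,
) -> List[ReplicaInfo]:
    """Prefer replicas with the model loaded, then replicas on the same node.

    Each filter is a preference only: if it would leave no candidates,
    the previous list is kept.
    """
    candidates = list(replicas)

    # Multiplexing-style filter: keep replicas that already hold the model.
    if model_id is not None:
        with_model = [r for r in candidates if model_id in r.loaded_models]
        if with_model:
            candidates = with_model

    # Locality-style filter: keep replicas on the caller's node.
    same_node = [r for r in candidates if r.node_id == local_node_id]
    return same_node or candidates
```

The prefix match rate would then be computed only for the replicas that survive both filters.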
126
126
127
-
## Configuration Parameters
127
+
## Configuration parameters
128
128
129
129
The `PrefixCacheAffinityRouter` provides several configuration parameters to tune its behavior:
130
130
131
-
### Core Routing Parameters
131
+
### Core routing parameters
132
132
133
133
- **`imbalanced_threshold`** (default: 10): Queue length difference threshold for considering load balanced. Lower values prioritize load balancing over cache locality.
134
134
135
135
- **`match_rate_threshold`** (default: 0.1): Minimum prefix match rate (0.0-1.0) required to use prefix cache-aware routing. Higher values require stronger prefix matches before routing for cache locality.
136
136
137
-
### Memory Management Parameters
137
+
### Memory management parameters
138
138
139
139
- **`do_eviction`** (default: False): Enable automatic eviction of old prefix tree entries to approximate the LLM engine's eviction policy.
140
140
141
-
- **`eviction_threshold_chars`** (default: 400,000): Maximum number of characters in the prefix tree before eviction is triggered.
141
+
- **`eviction_threshold_chars`** (default: 400,000): Maximum number of characters in the prefix tree before the router triggers eviction.
142
142
143
143
- **`eviction_target_chars`** (default: 360,000): Target number of characters to reduce the prefix tree to during eviction.
144
144
145
-
- **`eviction_interval_secs`** (default: 10): Interval in seconds between eviction checks when eviction is enabled.
145
+
- **`eviction_interval_secs`** (default: 10): Interval in seconds between eviction checks when eviction is enabled.
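The relationship between the threshold and the target can be sketched as follows. This is an assumption-laden illustration rather than the router's eviction code: `ToyEvictingStore` is a hypothetical class, it tracks whole strings instead of tree nodes, and it assumes least-recently-used ordering, which the text above only describes as evicting "old" entries.

```python
from collections import OrderedDict


class ToyEvictingStore:
    """Hypothetical sketch of threshold/target eviction over tracked text."""

    def __init__(
        self,
        eviction_threshold_chars: int = 400_000,
        eviction_target_chars: int = 360_000,
    ) -> None:
        if eviction_target_chars > eviction_threshold_chars:
            raise ValueError("target must not exceed threshold")
        self.threshold = eviction_threshold_chars
        self.target = eviction_target_chars
        self._entries: "OrderedDict[str, None]" = OrderedDict()  # oldest first
        self.total_chars = 0

    def touch(self, text: str) -> None:
        """Record `text`, or mark it as recently used if already tracked."""
        if text in self._entries:
            self._entries.move_to_end(text)
        else:
            self._entries[text] = None
            self.total_chars += len(text)

    def maybe_evict(self) -> int:
        """Evict oldest entries down to the target if over the threshold.

        A caller would invoke this periodically, for example once per
        eviction interval. Returns the number of characters evicted.
        """
        if self.total_chars <= self.threshold:
            return 0
        evicted = 0
        while self._entries and self.total_chars > self.target:
            text, _ = self._entries.popitem(last=False)
            self.total_chars -= len(text)
            evicted += len(text)
        return evicted
```

Keeping the target below the threshold means each eviction pass frees some headroom, so the next few insertions don't immediately trigger another pass.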
146
146
147
147
(deploy-llm-with-prefix-aware-router)=
148
-
## Deploying LLM Applications with Prefix Cache-Aware Routing
148
+
## Deploying LLM applications with prefix cache-aware routing
149
149
150
-
Here's how to deploy an LLM application using the prefix cache-aware request router:
150
+
Deploy an LLM application using the prefix cache-aware request router as follows: