['The theory of special relativity states 1. The speed of light is constant in all inertial reference']
```

Under the hood, `generate` will attempt to reuse the same cache object, removing the need for re-compilation at each call. However, if the batch size or the maximum output length increases between calls, the cache will have to be reinitialized, triggering a new compilation.
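For example, repeated calls with the same shapes reuse the compiled graph, while a longer maximum output length forces the cache to be rebuilt and the model to be recompiled. The snippet below is a rough sketch of that behavior, assuming `model` and `tokenizer` are already loaded as in the example above:

```py
import torch

# Assumption: `model` and `tokenizer` come from the example above.
model.generation_config.cache_implementation = "static"  # generate() allocates a StaticCache internally
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The theory of special relativity states", return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=20)   # first call: compiles the forward pass and builds the cache
out = model.generate(**inputs, max_new_tokens=20)   # same shapes: cache and compiled graph are reused
out = model.generate(**inputs, max_new_tokens=128)  # longer output: cache is reinitialized, triggering recompilation
```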
</hfoption>
<hfoption id="Static Cache">
A [`StaticCache`] object can be passed to the model's forward pass under the `past_key_values` argument, enabling the use of this object as a static kv-cache. Using this strategy, you can write your own function to decode the next token given the current token, position, and cache position of previously generated tokens. You can also pass the [`StaticCache`] object to [`~GenerationMixin.generate`] and use it across calls, like you would with a dynamic cache.

```py
from transformers import LlamaTokenizer, LlamaForCausalLM, StaticCache, logging
```
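The full example builds on these imports. As a rough sketch of that kind of setup, the snippet below loads a checkpoint, tokenizes a couple of prompts, and defines a helper that decodes a single token; the checkpoint name, prompts, and the `decode_one_token` helper are illustrative choices, not part of the Transformers API:

```py
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative checkpoint; any Llama-style model works the same way.
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", pad_token="</s>", padding_side="right")
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to(device)
model.eval()

prompts = [
    "Simply put, the theory of relativity states that ",
    "My favorite all time favorite condiment is ketchup.",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

# Hypothetical helper: run one forward pass for the current token, writing its
# keys/values into the static cache, and greedily pick the next token.
def decode_one_token(model, cur_token, input_pos, cache_position, past_key_values):
    logits = model(
        cur_token,
        position_ids=input_pos,
        cache_position=cache_position,
        past_key_values=past_key_values,
        return_dict=False,
        use_cache=True,
    )[0]
    new_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
    return new_token
```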
There are a few important things you must do to enable static kv-cache and torch.compile with the [`StaticCache`] class:

1. Initialize the [`StaticCache`] instance before using the model for inference. This is where you can configure parameters like the maximum batch size and sequence length.
2. Call `torch.compile` on the model to compile the forward pass with the static kv-cache.
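A decoding loop built on the setup sketched above might look like the following; the `decode_one_token` helper and tensor names are illustrative, and the exact [`StaticCache`] constructor arguments can vary between versions:

```py
batch_size, seq_length = inputs["input_ids"].shape

# 1. Initialize the StaticCache with the largest batch size and cache length you expect to use.
past_key_values = StaticCache(
    config=model.config, max_batch_size=batch_size, max_cache_len=512, device=device, dtype=torch.float16
)

# 2. Compile the per-token decoding step; the static cache keeps tensor shapes fixed across iterations.
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)

NUM_TOKENS_TO_GENERATE = 40
generated_ids = torch.zeros(batch_size, seq_length + NUM_TOKENS_TO_GENERATE, dtype=torch.int, device=device)
generated_ids[:, :seq_length] = inputs["input_ids"].to(torch.int)

with torch.no_grad():
    # Prefill: run the whole prompt once (uncompiled, since prompt lengths vary) to populate the cache.
    cache_position = torch.arange(seq_length, device=device)
    logits = model(
        **inputs, cache_position=cache_position, past_key_values=past_key_values,
        return_dict=False, use_cache=True,
    )[0]
    next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
    generated_ids[:, seq_length] = next_token[:, 0]

    # Decode the remaining tokens one at a time with the compiled step.
    cache_position = torch.tensor([seq_length], device=device)
    for _ in range(1, NUM_TOKENS_TO_GENERATE):
        next_token = decode_one_token(model, next_token.clone(), None, cache_position, past_key_values)
        generated_ids[:, cache_position + 1] = next_token.int()
        cache_position += 1
```

Decoding `generated_ids` afterwards recovers the text for each prompt: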
```py
text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
text
['Simply put, the theory of relativity states that 1) the speed of light is constant, 2) the speed of light is the same for all observers, and 3) the laws of physics are the same for all observers.',
 'My favorite all time favorite condiment is ketchup. I love it on everything. I love it on my eggs, my fries, my chicken, my burgers, my hot dogs, my sandwiches, my salads, my p']
```
> [!TIP]
> If you want to reuse the [`StaticCache`] object on a new prompt, be sure to reset its contents with the `.reset()` method.
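For example, assuming the cache and tokenizer from the sketch above:

```py
# Clear the cached key/value states so the same StaticCache can serve a new prompt.
past_key_values.reset()
new_inputs = tokenizer(["The capital of France is"], return_tensors="pt", padding=True).to(device)
# ...then run the same prefill and decoding loop as above with `new_inputs`.
```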
"Using `past_key_values` argument with `generate()` when using a static KV cache is not supported. Please open an issue in Transformers GitHub repository."
0 commit comments