Generate: fix SinkCache on Llama models
#30581
Merged
What does this PR do?
`SinkCache` has been broken on Llama and Llama-based models since we released the static cache update (v4.38). Now that we are happy with the state of the static cache (#30476), we can move on to fix what we broke along the way.

In a nutshell, the static cache rework changed the `sin` and `cos` tensors passed around: instead of the full set of values for all possible positions (up to `config.max_position_embeddings`), each forward pass now receives only the values it needs. This is a non-negotiable change to achieve top compiled performance.
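To make the shape change concrete, here is a simplified sketch of the difference (standard RoPE math with assumed `head_dim` and base values; it mirrors the idea rather than the exact transformers code):

```python
import torch

head_dim, max_position_embeddings = 128, 4096
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))

# Before the rework: one big table covering every possible position.
all_positions = torch.arange(max_position_embeddings).float()
freqs_full = torch.outer(all_positions, inv_freq)
sin_full, cos_full = freqs_full.sin(), freqs_full.cos()  # [4096, head_dim // 2]

# After the rework: only the positions in the current forward pass
# (e.g. three tokens being decoded at positions 10..12).
position_ids = torch.tensor([10.0, 11.0, 12.0])
freqs_step = torch.outer(position_ids, inv_freq)
sin_step, cos_step = freqs_step.sin(), freqs_step.cos()  # [3, head_dim // 2]
```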
However, `SinkCache` needs access to the whole `sin` and `cos` tensors, and they are not trivial to compute from scratch inside the cache instance (doing so would require access to the RoPE class, creating a cyclical dependency). The `SinkCache` instance was therefore changed to hold a cache (a meta cache) of the `sin` and `cos` values it sees, rebuilding the full tensors internally. With the full tensors rebuilt, it can operate as expected.
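A minimal sketch of that idea (not the actual `SinkCache` code; the class and method names here are hypothetical): the cache records the `sin`/`cos` slices it observes and concatenates them back into the full tensors whenever it needs them.

```python
import torch

class SinCosRecorder:
    """Accumulates the per-forward-pass sin/cos slices into full tensors."""

    def __init__(self):
        self._sin_slices: list[torch.Tensor] = []
        self._cos_slices: list[torch.Tensor] = []

    def record(self, sin: torch.Tensor, cos: torch.Tensor) -> None:
        # Each forward pass only sees the values for its own positions;
        # stash them so the full tensors can be rebuilt later.
        self._sin_slices.append(sin)
        self._cos_slices.append(cos)

    def full(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Rebuild the full [seen_positions, dim] tensors by concatenating
        # the recorded slices along the position axis.
        return torch.cat(self._sin_slices, dim=0), torch.cat(self._cos_slices, dim=0)
```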
`tests/test_cache_utils.py::CacheIntegrationTest::test_sink_cache_hard` is fixed as a result of this PR. All other sink cache tests were already passing (they were not using Llama).
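For reference, this is roughly the usage pattern the fixed test exercises (the checkpoint and generation settings below are illustrative, not the exact test values):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # any Llama-based checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The best color is", return_tensors="pt")
# Keep `num_sink_tokens` initial tokens plus a sliding window of recent tokens.
past_key_values = SinkCache(window_length=256, num_sink_tokens=4)
gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=32, past_key_values=past_key_values)
print(tokenizer.decode(gen_out[0], skip_special_tokens=True))
```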