
Generate: fix SinkCache on Llama models #30581

Merged: 2 commits merged into huggingface:main from fix_sink_cache on May 2, 2024

Conversation
Conversation

gante (Member) commented Apr 30, 2024

What does this PR do?

SinkCache has been broken on Llama and Llama-based models since we released the static cache update (v4.38). Now that we are happy with the state of the static cache (#30476), we can move on to fix what we broke along the way.

In a nutshell, the static cache rework changed the sin and cos tensors passed around: instead of the full set of values for all possible positions (up to config.max_position_embeddings), only the values used in a given forward pass are materialized. This is a non-negotiable change to achieve top compiled performance.

However, SinkCache needs access to the whole sin and cos tensors, and they are not trivial to compute from scratch inside the cache instance (that would require access to the RoPE class, creating a cyclical dependency). The SinkCache instance was therefore changed to hold a cache (a meta cache) of the sin and cos values it sees, rebuilding the full tensors internally. With the full tensors rebuilt, it can operate as expected.
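For illustration, here is a minimal sketch of that bookkeeping, with simplified names and shapes assumed for the example (this is not the actual SinkCache code, just the idea of stitching per-forward-pass slices back into full tensors):

import torch

class SinCosBookkeeping:
    """Toy version of the sin/cos accumulation described above."""

    def __init__(self, window_length: int):
        self.window_length = window_length
        self._cos_cache = None
        self._sin_cache = None

    def update(self, sin: torch.Tensor, cos: torch.Tensor) -> None:
        # sin/cos are assumed to have shape [batch, seq_len, head_dim]; only the
        # slice for the current forward pass is ever available, never the full
        # RoPE table.
        if self._cos_cache is None:
            self._cos_cache = cos[0, ...]
            self._sin_cache = sin[0, ...]
        elif self._cos_cache.shape[0] < self.window_length:
            # Append the newly seen positions until the window is covered.
            self._cos_cache = torch.cat([self._cos_cache, cos[0, ...]], dim=0)
            self._sin_cache = torch.cat([self._sin_cache, sin[0, ...]], dim=0)
        # The cache never needs more than window_length positions.
        self._cos_cache = self._cos_cache[: self.window_length]
        self._sin_cache = self._sin_cache[: self.window_length]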

tests/test_cache_utils.py::CacheIntegrationTest::test_sink_cache_hard is fixed as a result of this PR. All other sink cache tests were already passing (they do not use Llama).

gante requested a review from amyeroberts on April 30, 2024 18:15
gante changed the title from "Generate: fix sink cache" to "Generate: fix sink cache on Llama models" on Apr 30, 2024
gante changed the title from "Generate: fix sink cache on Llama models" to "Generate: fix SinkCache on Llama models" on Apr 30, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +308 to +310
elif self._cos_cache.shape[0] < self.window_length:
    self._cos_cache = torch.cat([self._cos_cache, cos[0, ...]], dim=0)
    self._sin_cache = torch.cat([self._sin_cache, sin[0, ...]], dim=0)
Collaborator
Just for my own understanding of how the cache is meant to work, I have two questions:

  1. Values passed in on update call
     When update is called with sin and cos, does the cache keep old values + new values (i.e. self._cos_cache[:self._cos_cache_prev.shape[0]] holds the old values and self._cos_cache[self._cos_cache_prev.shape[0]:] holds the new values), or is the passed-in cos just the new values to be appended?

  2. Window length
     Is the assumption here that the window length is constant once the cache is created?

gante (Member, Author) commented May 1, 2024

@amyeroberts

  1. The values passed in cos are the new values to be appended. In RoPE models, sin and cos are constants with shape [config.max_position_embeddings, rope_embedding_dims, config.hidden_size // config.num_attention_heads]. However, with the compile-optimized modeling code, we only materialize the parts of these matrices needed for the current forward pass, with shape[0] = input_ids.shape[1] = input sequence length. Since SinkCache needs access to all sin and cos values up to shape[0] = self.window_length once it goes beyond the window length, this internal cache was created.

     Alternatively, we could pass the model config to compute the full sin and cos, but that would be (IMO) an ugly interface (we would have to use the model config to instantiate a RoPE layer inside the cache, compute these values, and then discard the layer).


  2. Yes. SinkCache is a fixed-length cache -- its purpose is to be used with self.window_length < config.max_position_embeddings, while still producing coherent outputs once the full sequence length exceeds self.window_length. In other words, coherent long outputs with a relatively short cache :) Its limitation is that it can only recall content up to the window length back; it quickly forgets older context. (A toy illustration of both points follows below.)
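To make the two points concrete, here is a toy illustration (shapes and variable names are made up for the example; this is not the actual SinkCache API):

import torch

window_length = 8        # fixed when the cache is created (point 2)
head_dim = 64

# Prefill: the model forwards 6 prompt positions, so the cache only sees 6 rows.
cos_cache = torch.randn(6, head_dim)

# Decoding step: only the 1 newly computed position is passed in (point 1),
# and it is appended until the cache covers window_length rows.
new_cos = torch.randn(1, head_dim)
if cos_cache.shape[0] < window_length:
    cos_cache = torch.cat([cos_cache, new_cos], dim=0)

print(cos_cache.shape)   # torch.Size([7, 64])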

Collaborator
Got it - thanks for taking the time to write this up and explain!

amyeroberts (Collaborator) left a comment

Thanks for fixing!

gante merged commit 9719202 into huggingface:main on May 2, 2024
21 checks passed
gante deleted the fix_sink_cache branch on May 2, 2024 14:24
itazap pushed a commit that referenced this pull request May 14, 2024