ggml : support broadcast for ggml_soft_max_ext and ggml_flash_attn_ext #14435
Extract the broadcast changes from #14363 for `ggml_soft_max_ext()` and `ggml_flash_attn_ext()`:

- `llama.cpp/ggml/include/ggml.h`, lines 1435 to 1451 in 236682a
- `llama.cpp/ggml/include/ggml.h`, lines 1876 to 1896 in 236682a
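For reference, the two declarations being extended look roughly like this on `master` (paraphrased from `ggml.h`; the linked lines above are authoritative):

```c
// fused soft_max(a*scale + mask*(ALiBi slope)); mask is optional,
// max_bias = 0.0f disables ALiBi
GGML_API struct ggml_tensor * ggml_soft_max_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias);

GGML_API struct ggml_tensor * ggml_flash_attn_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias,
        float                 logit_softcap);
```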
Both changes should be quite simple. On `master` we assume that the mask is a 2D matrix and we always broadcast it across dim 2 (i.e. the heads) and dim 3. With this change we allow separate masks per head and/or batch, i.e. a generalized broadcast (see the sketch below). Currently, I've added tests and implemented CPU and Metal support; the rest of the backends will fall back to CPU until this gets implemented.
The fallback is okay for now since these extensions are not currently used by `llama.cpp`. The support will be needed later for the #14363 PR, although it is worth having either way.