Use native AVX512-FP16 instructions to speed up FP16 bulk similarity. #3181
mulugetam wants to merge 1 commit into opensearch-project:main
Conversation
Thank you @mulugetam. Do you mind spinning up a real cluster and ingesting some real traffic (preferably Cohere-10M) into it for recall validation?
Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
mulugetam force-pushed from dd343bf to 4ed17da
Rebased; please rebase this in your dev environment too.
I think the secret sauce for keeping the precision is periodically draining the accumulated FP16 results.
// Max FP16 accumulations before draining to FP32. Trades accuracy for speed.
// Lower values improve precision; higher values improve performance.
static constexpr int32_t FP16_DRAIN_INTERVAL = 4;
So technically, we cannot 100% avoid precision loss.
If every value were 128, then in the worst case a partial dot product can exceed the largest finite FP16 value (65504), since 4 * 128 * 128 = 65536:
v1[i] * v2[i]
+ v1[i + 1] * v2[i + 1]
+ v1[i + 2] * v2[i + 2]
+ v1[i + 3] * v2[i + 3]
where all values in v1 and v2 are 128.
Will deal with this issue.
but curious, is there a way to 100% avoid overflow with avx512_fp16?
Not that I’m aware of. The overflow isn’t just in the accumulation as in your example; it can also happen inside _mm512_fmadd_ph itself. We could add saturation or ±Inf checks on the result, but that pretty much kills the performance gains (I tested this).
Unless the input vectors are normalized to something like [-1, 1], there’s no way to guarantee we won’t hit overflow. So my recommendation is to stick with the existing AVX-512 path.
That said, this could still be useful for bulk similarity when doing cosine distance. FYI, I intend to open a new PR for adding BF-16 today or tomorrow. BF-16 does not suffer from these overflow issues.
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##               main    #3181     +/-   ##
============================================
- Coverage     82.60%   82.59%   -0.02%
+ Complexity     3950     3949       -1
============================================
  Files           426      426
  Lines         14678    14678
  Branches       1875     1875
============================================
- Hits          12125    12123       -2
- Misses         1793     1794       +1
- Partials        760      761       +1
Description
The current FP16 bulk implementation converts FP16 vectors to FP32 and uses AVX-512 instructions for the computations. We can speed this up further by instead converting the query vector from FP32 to FP16 and using native AVX512-FP16 instructions, processing 32 lanes at a time.
To keep precision loss from FP16 accumulation under control, we periodically drain the FP16 accumulators into FP32 accumulators. With this approach, we’re seeing up to a 215% speedup, with only a 0.1% difference in precision compared to the FP32 accumulation used in the AVX-512 implementation.
This change applies only to FAISS_OPT_LEVEL=avx512_spr builds.
Check List
Commits are signed per the DCO using --signoff. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.