
Use native AVX512-FP16 instructions to speed up FP16 bulk similarity. #3181

Open
mulugetam wants to merge 1 commit into opensearch-project:main from mulugetam:avx512-fp16

Conversation

@mulugetam (Contributor) commented Mar 17, 2026

Description

The current FP16 bulk implementation converts FP16 vectors to FP32 and uses AVX-512 instructions for the computation. We can speed this up further by instead converting the query vector from FP32 to FP16 and using native AVX512-FP16 instructions, processing 32 lanes at a time.

To keep precision loss from FP16 accumulation under control, we periodically drain the FP16 accumulators into FP32 accumulators. With this approach we see up to a 2.15x speedup, with only a 0.1% difference in precision compared to the FP32 accumulation used in the existing AVX-512 implementation.

This change applies only to FAISS_OPT_LEVEL=avx512_spr builds.
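The drain pattern can be sketched in scalar form. This is a hypothetical analogue, not the PR's actual intrinsics code: the real kernel accumulates products in FP16 vector lanes and periodically drains them into FP32 lanes; here `float` stands in for FP16 and `double` for FP32, and all names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Mirrors the role of FP16_DRAIN_INTERVAL in the PR: how many low-precision
// accumulations happen before the partial sum is flushed to the wide type.
constexpr int kDrainInterval = 4;

double innerProductDrained(const std::vector<float>& a,
                           const std::vector<float>& b) {
    double wideAcc = 0.0;   // high-precision accumulator (FP32 in the kernel)
    float narrowAcc = 0.0f; // low-precision accumulator (FP16 in the kernel)
    int pending = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        narrowAcc += a[i] * b[i];
        if (++pending == kDrainInterval) {
            wideAcc += narrowAcc; // drain before rounding error builds up
            narrowAcc = 0.0f;
            pending = 0;
        }
    }
    return wideAcc + narrowAcc; // drain any remaining tail
}
```

Lowering the interval drains more often (better precision, more overhead); raising it keeps more work in the narrow type (faster, more rounding error), which is the trade-off the PR's comment describes.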

--------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                     Time             CPU   Iterations        dim items_per_second  score_avg  score_max  score_med  score_min       vecs
--------------------------------------------------------------------------------------------------------------------------------------------------
BM_OldIP/384/64            1.10 us         1.10 us       636222        384       58.2089M/s  -0.628508     11.282  -0.884114   -12.7797         64
BM_OldIP/384/256           4.40 us         4.39 us       159123        384       58.2481M/s -0.0888764    12.5413  -0.177539    -19.926        256
BM_OldIP/768/64            2.11 us         2.10 us       329941        768       30.4368M/s  -0.159039    20.3568    0.97491   -23.0292         64
BM_OldIP/768/256           8.43 us         8.39 us        83408        768       30.4981M/s   0.194271    25.5505     1.1346   -29.4789        256
BM_OldIP/1536/64           4.11 us         4.10 us       170689     1.536k       15.6107M/s  -0.593526    23.9326  -0.721343   -25.9956         64
BM_OldIP/1536/256          16.4 us         16.4 us        42730     1.536k       15.6014M/s  -0.659997    32.6758  -0.470105   -34.0017        256
BM_NativeIP/384/64        0.613 us        0.612 us      1115958        384       104.602M/s  -0.627967    11.2872  -0.878643    -12.779         64
BM_NativeIP/384/256        2.45 us         2.45 us       286045        384       104.612M/s -0.0887998    12.5389  -0.176788   -19.9275        256
BM_NativeIP/768/64         1.06 us         1.06 us       659205        768       60.2367M/s  -0.158783    20.3637   0.978151   -23.0278         64
BM_NativeIP/768/256        4.24 us         4.23 us       165564        768       60.5313M/s   0.194494    25.5478    1.13046   -29.4756        256
BM_NativeIP/1536/64        1.97 us         1.97 us       356209     1.536k       32.5169M/s  -0.593326    23.9312  -0.712671   -25.9964         64
BM_NativeIP/1536/256       7.89 us         7.88 us        89028     1.536k       32.5067M/s  -0.659653     32.678  -0.471512   -34.0028        256
BM_OldL2/384/64            1.39 us         1.39 us       506127        384       46.0337M/s    257.072    288.435    256.742    226.701         64
BM_OldL2/384/256           5.55 us         5.54 us       126467        384       46.1857M/s    256.211    300.111    255.654    223.407        256
BM_OldL2/768/64            2.64 us         2.63 us       266576        768       24.3543M/s    514.804    566.245    512.102     465.45         64
BM_OldL2/768/256           10.5 us         10.5 us        66800        768       24.3502M/s     514.45    582.075    512.397     465.45        256
BM_OldL2/1536/64           5.15 us         5.14 us       136024     1.536k       12.4434M/s    1.0247k    1.0886k   1.02536k    957.367         64
BM_OldL2/1536/256          20.6 us         20.6 us        34054     1.536k       12.4511M/s   1.02595k   1.09963k   1.02505k    957.281        256
BM_NativeL2/384/64        0.728 us        0.727 us       962238        384       88.0255M/s    257.084    288.412    256.768    226.711         64
BM_NativeL2/384/256        2.92 us         2.91 us       240432        384       87.9104M/s     256.22    300.144    255.672    223.408        256
BM_NativeL2/768/64         1.29 us         1.29 us       544201        768       49.6204M/s    514.812    566.274    512.103    465.439         64
BM_NativeL2/768/256        5.13 us         5.12 us       135827        768       50.0017M/s    514.457     582.08    512.409    465.439        256
BM_NativeL2/1536/64        2.40 us         2.39 us       292366     1.536k       26.7372M/s   1.02471k   1.08862k   1.02537k     957.37         64
BM_NativeL2/1536/256       9.57 us         9.56 us        73264     1.536k       26.7891M/s   1.02595k   1.09962k   1.02506k    957.281        256

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@0ctopus13prime (Collaborator) commented Mar 17, 2026

Thank you @mulugetam
This is awesome.
But this somehow contradicts my past experience with AVX512 FP16, which at the time gave me large errors and made me go with the conversion path 🤔. Specifically, the score value obtained via FP16 -> FP32 conversion was quite different from the one computed with native AVX512 FP16.

Do you mind forking a real cluster and ingesting some real traffic (preferably Cohere-10M) into it for recall validation?
If you don't have bandwidth, let us do it.
Will target this for 3.6

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
@0ctopus13prime (Collaborator)

Rebased yours, please rebase this in your dev environment too.

@0ctopus13prime (Collaborator)

I think the secret sauce for keeping the precision is periodically draining the accumulated FP16 results.
Brilliant!
Will keep reading through the code.


// Max FP16 accumulations before draining to FP32. Trades accuracy for speed.
// Lower values improve precision; higher values improve performance.
static constexpr int32_t FP16_DRAIN_INTERVAL = 4;
@0ctopus13prime (Collaborator) commented Mar 17, 2026

So technically, we cannot 100% avoid precision loss.
If a value was 128, then in the worst case the dot product can exceed the FP16 maximum of 65504, since 4 * 128 * 128 = 65536:

v1[i] * v2[i]
+ v1[i + 1] * v2[i + 1]
+ v1[i + 2] * v2[i + 2]
+ v1[i + 3] * v2[i + 3]
where all values in v1 and v2 are 128.
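The worst case above can be checked with plain integer arithmetic. A standalone sketch (the constant and function names are illustrative; 65504 is the largest finite IEEE 754 binary16 value):

```cpp
#include <cstdint>

// Largest finite IEEE 754 binary16 (FP16) value: (2 - 2^-10) * 2^15.
constexpr int32_t kFp16MaxFinite = 65504;

// Worst-case partial sum of four lanes, each contributing v * v.
constexpr int32_t worstCaseFourLanes(int32_t v) {
    return 4 * v * v;
}
```

With v = 128, four lanes already yield 65536, which is past the representable FP16 range, so the FP16 accumulator would saturate or go to +Inf before any drain at interval 4 could rescue it.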

@mulugetam (Contributor, Author)

Will deal with this issue.

@0ctopus13prime (Collaborator)

But curious: is there a way to 100% avoid overflow with avx512_fp16?

@mulugetam (Contributor, Author)

Not that I’m aware of. The overflow isn’t just in the accumulation, as in your example; it can also happen inside _mm512_fmadd_ph itself. We could add saturation or +/-Inf checks on the result, but that pretty much kills the performance gains (I tested this).

Unless the input vectors are normalized to something like [-1, 1], there’s no way to guarantee we won’t hit overflow. So my recommendation is to stick with the existing AVX-512 path.

That said, this could still be useful for bulk similarity when computing cosine distance. FYI, I intend to open a new PR adding BF16 today or tomorrow; BF16 does not suffer from this overflow issue.
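For intuition on why BF16 sidesteps the overflow (a standalone sketch, not part of the PR): BF16 keeps FP32's 8-bit exponent, trading mantissa bits for range, so a value like 65536 that exceeds FP16's 65504 maximum remains representable. BF16 truncation can be emulated by keeping only the top 16 bits of a float (`toBf16` is a hypothetical helper):

```cpp
#include <cstdint>
#include <cstring>

// Emulate BF16 round-toward-zero: keep the sign, the full 8-bit FP32
// exponent, and the top 7 mantissa bits; zero the low 16 bits.
float toBf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u; // drop the low 16 mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}
```

The cost is precision, not range: a value like pi loses low mantissa bits under BF16 truncation, but magnitudes that would overflow FP16 survive unchanged.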

codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.59%. Comparing base (d66d79e) to head (4ed17da).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3181      +/-   ##
============================================
- Coverage     82.60%   82.59%   -0.02%     
+ Complexity     3950     3949       -1     
============================================
  Files           426      426              
  Lines         14678    14678              
  Branches       1875     1875              
============================================
- Hits          12125    12123       -2     
- Misses         1793     1794       +1     
- Partials        760      761       +1     


2 participants