Simd vector similarity #575

Closed · wants to merge 6 commits
Conversation

hermeGarcia (Contributor)

Description

SIMD cosine similarity + inlining of the similarity function.

How was this PR tested?

Local tests + performance analysis with 1m_stats.
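For context, the quantity this PR vectorizes can be sketched in plain scalar Rust. This is a hypothetical helper for illustration, not the PR's actual code; the PR replaces this hot loop with SIMD:

```rust
/// Cosine similarity between two equal-length f32 slices:
/// dot(x, y) / (|x| * |y|). Plain scalar baseline.
fn cosine_similarity(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len());
    let (mut dot, mut norm_x, mut norm_y) = (0.0f32, 0.0f32, 0.0f32);
    for (&a, &b) in x.iter().zip(y.iter()) {
        dot += a * b;
        norm_x += a * a;
        norm_y += b * b;
    }
    dot / (norm_x.sqrt() * norm_y.sqrt())
}
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which makes a quick sanity check for any SIMD rewrite.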

codecov bot commented Feb 8, 2023

Codecov Report

Base: 69.46% // Head: 69.46% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (2f6d4c7) compared to base (5a31924).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #575   +/-   ##
=======================================
  Coverage   69.46%   69.46%           
=======================================
  Files         225      225           
  Lines       15745    15751    +6     
=======================================
+ Hits        10937    10942    +5     
- Misses       4808     4809    +1     
Flag Coverage Δ
node-sidecar 75.32% <ø> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
nucliadb_node/nucliadb_node/servicer.py 74.07% <0.00%> (-1.40%) ⬇️
nucliadb_node/nucliadb_node/reader.py 94.87% <0.00%> (+0.75%) ⬆️


let partial_dot_product = partial_simd_x * partial_simd_y;
let partial_power_x = partial_simd_x * partial_simd_x;
let partial_power_y = partial_simd_y * partial_simd_y;
dem_x += partial_power_x.reduce_add();


you don't want to reduce here. I think it can be done out of the loop.

hermeGarcia (author):

Yes I had a version with that out of the loop, but the performance stayed the same, I will do that version again and update the code!
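The pattern being discussed, accumulating per lane inside the loop and reducing horizontally only once at the end, can be sketched on stable Rust with a plain `[f32; 8]` standing in for `f32x8` (hypothetical helper, assuming a length that is a multiple of 8):

```rust
/// Dot product with a lane-wide accumulator; the horizontal
/// reduction happens once, after the loop, not per iteration.
fn dot_lanewise(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len());
    assert_eq!(x.len() % 8, 0, "sketch assumes a multiple of 8 elements");
    let mut acc = [0.0f32; 8];
    for (cx, cy) in x.chunks_exact(8).zip(y.chunks_exact(8)) {
        for lane in 0..8 {
            acc[lane] += cx[lane] * cy[lane];
        }
    }
    // Single reduction, outside the hot loop.
    acc.iter().sum()
}
```

Keeping the reduction out of the loop matters because a horizontal add serializes the lanes, while the per-lane accumulation stays fully parallel.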

let mut free_in_buffer = 0;
while i < len {
i += 1;
x.read_exact(&mut buff_x).unwrap();


I'd avoid using read_exact here... The compiler may or may not be smart enough to convert that into a simd unaligned load.

hermeGarcia (author):

I'm not sure I follow you on this. Here I'm only reading the f32 value, which I have as bytes in x, and then I add the f32 to the SIMD buffer.
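For reference, one way to decode f32 values from a byte slice without a `read_exact` call per value is a sketch like this (hypothetical helper, not the PR's code); iterating the slice directly gives the optimizer a straight sequential scan it can turn into wide loads:

```rust
/// Decode little-endian f32 values directly from a byte slice.
/// Any trailing bytes that do not form a full f32 are ignored.
fn f32s_from_le_bytes(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```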

@fulmicoton commented Feb 9, 2023:

use std::simd::f32x8; // requires nightly and `#![feature(portable_simd)]`

/// `left` & `right` do not need to be aligned.
/// This requires nightly due to portable SIMD,
/// and a CPU with AVX2 or equivalent to run well.
/// f32 values are assumed to use native endianness
/// (most likely little endian in your case).
///
/// AVX-512 should be ~2x faster.
pub fn dot_product_simd(left: &[u8], right: &[u8]) -> f32 {
    const DIM: usize = 768;
    const NUM_LANES: usize = 8;
    assert_eq!(left.len(), DIM * 4);
    assert_eq!(right.len(), DIM * 4);
    let mut left_ptr = left.as_ptr() as *const f32x8;
    let mut right_ptr = right.as_ptr() as *const f32x8;
    let mut acc = f32x8::default();
    unsafe {
        // Ok thanks to the length asserts above.
        for _ in 0..DIM / NUM_LANES {
            let left: f32x8 = std::ptr::read_unaligned(left_ptr);
            let right: f32x8 = std::ptr::read_unaligned(right_ptr);
            acc += left * right;
            left_ptr = left_ptr.offset(1);
            right_ptr = right_ptr.offset(1);
        }
    }
    acc.as_array().iter().copied().sum()
}


(You need to add the sqrt stuff or normalize your vectors though.)

hermeGarcia (author):

I see, thank you so much! I am trying to adapt this to nucliadb_vectors where, apart from adding the sqrt and all that, the dimension of the vectors is user defined.


scalar_dot              time:   [564.91 ns 574.43 ns 585.44 ns]
                        change: [-19.913% -16.021% -12.075%] (p = 0.00 < 0.05)
                        Performance has improved.

simd_dot                time:   [74.888 ns 76.182 ns 77.759 ns]
                        change: [+1.3111% +3.2754% +5.5679%] (p = 0.00 < 0.05)
                        Performance has regressed.

scalar_cosine           time:   [601.11 ns 602.38 ns 604.07 ns]
                        change: [+0.3208% +0.7946% +1.2116%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

simd_cosine             time:   [87.091 ns 87.313 ns 87.598 ns]
                        change: [-34.935% -34.503% -34.119%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

hermeGarcia (author):

Thank you so much for the help! I am trying to figure out how this would behave in the case where a lane cannot be filled, for example if the dimension of the vectors is 5.
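The usual answer to the partial-lane question is to process full 8-wide chunks and finish with a scalar tail. A hedged sketch on stable Rust (hypothetical helper, not the PR's code), using `chunks_exact` and its `remainder()`:

```rust
/// Dot product for a user-defined dimension: full 8-wide chunks
/// feed a lane accumulator; leftover elements (e.g. all five of
/// them when the dimension is 5) go through a scalar tail.
fn dot_with_tail(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len());
    let xc = x.chunks_exact(8);
    let yc = y.chunks_exact(8);
    // Scalar tail over whatever does not fill a full lane group.
    let tail: f32 = xc
        .remainder()
        .iter()
        .zip(yc.remainder())
        .map(|(a, b)| a * b)
        .sum();
    let mut acc = [0.0f32; 8];
    for (cx, cy) in xc.zip(yc) {
        for lane in 0..8 {
            acc[lane] += cx[lane] * cy[lane];
        }
    }
    acc.iter().sum::<f32>() + tail
}
```

For a dimension like 5 the chunked loop body never runs and the whole product comes from the tail, so correctness does not depend on the dimension being a multiple of the lane count.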

hermeGarcia (author):

I added the following to the code: 398587e
Now I am seeing better performance!


can you share the bench? (and make sure you compile with target-cpu=native...)
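For reference, building with the native CPU target on nightly typically looks like this (a sketch of the build configuration, assuming a standard Cargo project with criterion benches):

```shell
# Let the compiler use every SIMD feature of the build machine's CPU.
RUSTFLAGS="-C target-cpu=native" cargo +nightly bench
```

Without `target-cpu=native` the compiler targets a baseline x86-64 feature set, so AVX2 paths may never be emitted and the SIMD benchmark numbers will understate the gain.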

hermeGarcia (author):

I use this binary for testing the performance.

let len_y = Len::from_le_bytes(buff_y);
assert_eq!(len_x, len_y);
let len = len_x;
let mut buff_x = [0; 4];


Is it 4 lanes or 8 lanes?

hermeGarcia (author):

In that case it is 4 bytes; it is the buffer used to read the f32 values.

javitonino (Contributor)

Superseded by #2179

@javitonino javitonino closed this May 24, 2024