-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simd vector similarity #575
Conversation
Codecov ReportBase: 69.46% // Head: 69.46% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #575 +/- ##
=======================================
Coverage 69.46% 69.46%
=======================================
Files 225 225
Lines 15745 15751 +6
=======================================
+ Hits 10937 10942 +5
- Misses 4808 4809 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
let partial_dot_product = partial_simd_x * partial_simd_y; | ||
let partial_power_x = partial_simd_x * partial_simd_x; | ||
let partial_power_y = partial_simd_y * partial_simd_y; | ||
dem_x += partial_power_x.reduce_add(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't want to reduce here. I think it can be done out of the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I had a version with that out of the loop, but the performance stayed the same, I will do that version again and update the code!
let mut free_in_buffer = 0; | ||
while i < len { | ||
i += 1; | ||
x.read_exact(&mut buff_x).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd avoid using read_exact here... The compiler may or may not be smart enough to convert that into a simd unaligned load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not sure I follow you on this.. Here im only reading the f32 value, that I have as bytes on x, then I add the f32 to the simd buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// `left` & `right` do not need to be aligned.
/// This requires nightly due to portable SIMD,
/// and a cpu with AVX2 or equivalent to run.
/// f32 are assumed to be using native endianess
/// (most likely little endian in your case.)
///
/// AVX 512 should be 2x faster.
pub fn dot_product_simd(left: &[u8], right: &[u8]) -> f32 {
const DIM: usize = 768;
const NUM_LANES: usize = 8;
assert_eq!(left.len(), DIM * 4);
assert_eq!(right.len(), DIM * 4);
let mut left_ptr = left.as_ptr() as *const f32x8;
let mut right_ptr = right.as_ptr() as *const f32x8;
let mut acc: f32x8 = Default::default();
unsafe { //< ok thanks to the assert above.
for i in 0..DIM / NUM_LANES {
let mut left: f32x8 = std::ptr::read_unaligned(left_ptr);
let right: f32x8 = std::ptr::read_unaligned(right_ptr);
acc.add_assign(left.mul(right));
left_ptr = left_ptr.offset(1 as isize);
right_ptr = right_ptr.offset(1 as isize);
}
}
acc.as_array().iter().copied().sum()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(You need to add the sqrt root stuff or normalized your vector though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thank you so much! I am trying to adapt this to nucliadb_vectors
where apart of adding the sqrt and all that the dimension of the vectors is user defined..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scalar_dot time: [564.91 ns 574.43 ns 585.44 ns]
change: [-19.913% -16.021% -12.075%] (p = 0.00 < 0.05)
Performance has improved.
simd_dot time: [74.888 ns 76.182 ns 77.759 ns]
change: [+1.3111% +3.2754% +5.5679%] (p = 0.00 < 0.05)
Performance has regressed.
scalar_cosine time: [601.11 ns 602.38 ns 604.07 ns]
change: [+0.3208% +0.7946% +1.2116%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
simd_cosine time: [87.091 ns 87.313 ns 87.598 ns]
change: [-34.935% -34.503% -34.119%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) high mild
4 (4.00%) high severe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for the help! I am trying to figure out how would this behave in the case where a lane can not be filled, for examen if the dimension of the vectors is 5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the following to the code: 398587e
Now I am seeing a better perfomance!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you share the bench? (and make sure you compile with target-cpu=native...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use this binary for testing the performance.
let len_y = Len::from_le_bytes(buff_y); | ||
assert_eq!(len_x, len_y); | ||
let len = len_x; | ||
let mut buff_x = [0; 4]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it 4 lane or 8 lane?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case is 4 bytes, is the buffer used to read the f32 values
Superseded by #2179 |
Description
Simd cosine similarity + inlining of function
How was this PR tested?
Local tests + performance analysis with
1m_stats
.