```go
func (z *Element) mulNaive(a, b *Element) *Element {
	z[0] = uint32((uint64(a[0]) * uint64(b[0])) % q)
	return z
}

func BenchmarkMulNaive(b *testing.B) {
	var a, z Element
	a.MustSetRandom()
	z.MustSetRandom()
	b.ResetTimer()
	for range b.N {
		z.mulNaive(&a, &z)
	}
}

func BenchmarkMul(b *testing.B) {
	var a, z Element
	a.MustSetRandom()
	z.MustSetRandom()
	b.ResetTimer()
	for range b.N {
		z.Mul(&a, &z)
	}
}
```

On an M1 (ARM64) MacBook Pro I get the following:
```
BenchmarkMulNaive-8    203830368    5.818 ns/op
BenchmarkMul-8         195042295    6.128 ns/op
```
That is a ~5% speedup in favor of the "naive" option, which, performance aside, is also attractive for its simplicity.
A major caveat here is that while AVX512 has shift instructions ("division by R" in Montgomery reduction), it doesn't seem to support generic integer division, which will have to be hacked around using floating point division. (i.e. divide as floats, truncate the quotient, multiply ad subtract). So it is possible that this change would speed up scalar ops but hurt vector ops. Barrett reduction on the other hand, which works on the naive representation, can be vectorized well.