Merge pull request #2 from mklarqvist/perf

Performance fixes and semantics
mklarqvist · Aug 19, 2019 · 944ced3
2 parents 277c4c2 + 03d13a5
commit 944ced3

Showing 4 changed files with 789 additions and 698 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -191,4 +191,9 @@ before_script:
 script:
  - cmake .
  - make
- - ./benchmark
+ - |
+   if [[ "${TRAVIS_OS_NAME}" == "linux" ]]; then
+     sudo ./benchmark -r 10
+   else
+     ./benchmark -r 10
+   fi
diff --git a/README.md b/README.md
@@ -29,7 +29,7 @@ The core algorithms are described in the papers:
 
 ### Speedup
 
-Sample performance metrics (practical upper limit) on AVX512BW machine. We simulate a single data array or pairs of data arrays and compute the same statistics many times. This reflect the fastest possible throughput if you never have to leave the destination cache-level.
+Sample performance metrics (practical upper limit) on an AVX512BW machine. We simulate a single data array or pairs of data arrays in an aligned memory location and compute the same statistics many times using the command `benchmark -p -r 10000` (requires the Linux `perf` subsystem). This reflects the fastest possible throughput if you never have to leave the destination cache-level.
 The host architecture used is a 10 nm Cannon Lake [Core i3-8121U](https://ark.intel.com/content/www/us/en/ark/products/136863/intel-core-i3-8121u-processor-4m-cache-up-to-3-20-ghz.html) with gcc (GCC) 8.2.1 20180905 (Red Hat 8.2.1-3).
 
 ### POSPOPCNT
@@ -49,24 +49,58 @@ CPUs compared to a naive unvectorized solution
 ### POPCNT
 
 Fold speedup compared to a naive unvectorized algorithm
-(`popcount_scalar_naive_nosimd`) for different array sizes.
-
-| Algorithm | 8 | 32 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
-|-----------|-----|-----|------|------|------|-------|-------|-------|-------|
-| popcnt | 3.3 | 6.5 | 20.1 | 25.4 | 62.4 | 116.7 | 175.1 | 226.4 | 247.8 |
+(`popcount_scalar_naive_nosimd`) for different array sizes. Each cell reports CPU cycles per 64-bit word, with instructions per 64-bit word in parentheses:
+
+| Words | libalgebra.h | Scalar | Speedup |
+|---------|--------------|---------------|---------|
+| 4 | 27.75 (37) | 26.75 (33.5) | 1 |
+| 8 | 16.38 (25.5) | 17.38 (30.25) | 1.1 |
+| 16 | 10.5 (19.94) | 12.75 (28.63) | 1.2 |
+| 32 | 7.72 (17.16) | 10.69 (27.81) | 1.4 |
+| 64 | 3.09 (4.36) | 9.61 (27.41) | 3.1 |
+| 128 | 2.53 (2.73) | 8.84 (27.2) | 3.5 |
+| 256 | 1.35 (1.7) | 8.5 (27.1) | 6.3 |
+| 512 | 0.67 (1.18) | 8.33 (27.05) | 12.4 |
+| 1024 | 0.5 (0.92) | 8.25 (27.03) | 16.4 |
+| 2048 | 0.41 (0.79) | 8.15 (27.01) | 20.1 |
+| 4096 | 0.46 (0.72) | 8.12 (27.01) | 17.8 |
+| 8192 | 0.39 (0.69) | 8.11 (27) | 21 |
+| 16384 | 0.39 (0.67) | 8.1 (27) | 20.6 |
+| 32768 | 0.89 (0.66) | 8.1 (27) | 9.1 |
+| 65536 | 0.84 (0.66) | 8.1 (27) | 9.6 |
+| 131072 | 0.68 (0.66) | 8.09 (27) | 11.9 |
+| 262144 | 1.11 (0.66) | 8.09 (27) | 7.3 |
+| 524288 | 1.84 (0.66) | 8.12 (27) | 4.4 |
+| 1048576 | 1.95 (0.66) | 8.15 (27) | 4.2 |
 
 ### Set algebra
 
 Fold speedup compared to naive unvectorized solution (`*_scalar_naive_nosimd`)
-for different array sizes (in number of _pairs_ of 64-bit values). These
+for different array sizes (in number of _pairs_ of 64-bit words, with results reported per _single_ 64-bit word). These
 functions are identical with the exception of the bitwise operator used (AND,
 OR, or XOR), which all have identical latency and throughput (CPI).
 
-| Algorithm | 8 | 32 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
-|-----------------|------|-------|-------|-------|-------|-------|-------|-------|-------|
-| intersect count | 4.73 | 10.8 | 17.58 | 24.82 | 31 | 35.78 | 37.75 | 23.08 | 20.81 |
-| union count | 4.64 | 10.96 | 17.19 | 24.88 | 31.09 | 35.74 | 37.95 | 22.92 | 21.11 |
-| diff count | 4.57 | 10.93 | 17.31 | 24.78 | 30.98 | 35.74 | 37.87 | 23.31 | 21.42 |
+| Words | libalgebra.h | Scalar | Speedup |
+|---------|--------------|---------------|---------|
+| 4 | 17.63 (8.63) | 14.63 (22.75) | 0.8 |
+| 8 | 8.13 (5.44) | 10 (20.88) | 1.2 |
+| 16 | 4.69 (3.84) | 7.91 (19.94) | 1.7 |
+| 32 | 2.38 (2.56) | 6.59 (19.47) | 2.8 |
+| 64 | 1.82 (2.06) | 5.87 (19.23) | 3.2 |
+| 128 | 0.88 (0.89) | 5.43 (19.12) | 6.2 |
+| 256 | 0.57 (0.64) | 5.18 (19.06) | 9.2 |
+| 512 | 0.41 (0.51) | 5.11 (19.03) | 12.4 |
+| 1024 | 0.33 (0.45) | 5.06 (19.02) | 15.3 |
+| 2048 | 0.39 (0.41) | 5.03 (19.01) | 13.1 |
+| 4096 | 0.36 (0.4) | 5.02 (19) | 13.9 |
+| 8192 | 0.37 (0.39) | 5.01 (19) | 13.7 |
+| 16384 | 0.55 (0.39) | 5.01 (19) | 9.1 |
+| 32768 | 0.55 (0.39) | 5 (19) | 9.2 |
+| 65536 | 0.52 (0.38) | 5 (19) | 9.7 |
+| 131072 | 0.56 (0.38) | 5.01 (19) | 9 |
+| 262144 | 1.25 (0.38) | 5.02 (19) | 4 |
+| 524288 | 1.76 (0.38) | 5.03 (19) | 2.9 |
+| 1048576 | 1.81 (0.38) | 5.07 (19) | 2.8 |
 
 ## C/C++ API
 
@@ -101,7 +135,7 @@ int STORM_pospopcnt_u16(const uint16_t* data, uint32_t size, uint32_t* flags);
  * Compute the intersection, union, or diff cardinality between pairs of bitmaps
  * @data1: A 64-bit array
  * @data2: A 64-bit array
- * @size: Size of data in bytes
+ * @size: Size of data in 64-bit words
  */
 // Intersect cardinality
 uint64_t STORM_intersect_count(const uint64_t* data1, const uint64_t* data2, const uint32_t size);
@@ -111,6 +145,25 @@ uint64_t STORM_union_count(const uint64_t* data1, const uint64_t* data2, const u
 uint64_t STORM_diff_count(const uint64_t* data1, const uint64_t* data2, const uint32_t size);
 ```
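Because `@size` now counts 64-bit words rather than bytes, a scalar reference can make the expected semantics concrete. The following is a minimal sketch, not the library's implementation: `ref_popcount64` and `ref_intersect_count` are hypothetical names introduced here for illustration only. The union and diff variants would differ only in the bitwise operator (OR or XOR instead of AND).

```C
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libalgebra.h: portable popcount of one word. */
static uint64_t ref_popcount64(uint64_t x) {
    uint64_t n = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Hypothetical scalar reference: number of bits set in both bitmaps,
 * where n_words is the number of 64-bit words in EACH array (not bytes). */
static uint64_t ref_intersect_count(const uint64_t* data1,
                                    const uint64_t* data2,
                                    uint32_t n_words) {
    uint64_t total = 0;
    for (uint32_t i = 0; i < n_words; ++i)
        total += ref_popcount64(data1[i] & data2[i]);
    return total;
}
```

For an array declared as `uint64_t a[4]`, the size argument under this convention is `4` (that is, `sizeof a / sizeof a[0]`), not `32`.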
 
+### Advanced use
+
+Retrieve a function pointer to the optimal function given the target length.
+
+```C
+STORM_compute_func STORM_get_intersection_count_func(const size_t n_bitmaps_vector);
+STORM_compute_func STORM_get_union_count_func(const size_t n_bitmaps_vector);
+STORM_compute_func STORM_get_diff_count_func(const size_t n_bitmaps_vector);
+```
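The point of retrieving a function pointer is to resolve the kernel once and reuse it for many vectors of the same length. The README does not show the definition of `STORM_compute_func`, so the sketch below uses hypothetical stand-ins throughout: the `count_func` typedef, both kernels, the getter, and the cutoff of 64 words are all assumptions made for illustration.

```C
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not the libalgebra.h API. */
typedef uint64_t (*count_func)(const uint64_t*, const uint64_t*, size_t);

static uint64_t popcount64(uint64_t x) {
    uint64_t n = 0;
    for (; x; x &= x - 1) ++n; /* one iteration per set bit */
    return n;
}

/* Simple loop: low setup cost, suited to short inputs. */
static uint64_t intersect_simple(const uint64_t* a, const uint64_t* b, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) total += popcount64(a[i] & b[i]);
    return total;
}

/* 4-way unrolled loop, standing in for a wide SIMD kernel. */
static uint64_t intersect_unrolled(const uint64_t* a, const uint64_t* b, size_t n) {
    uint64_t total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += popcount64(a[i]     & b[i])
               + popcount64(a[i + 1] & b[i + 1])
               + popcount64(a[i + 2] & b[i + 2])
               + popcount64(a[i + 3] & b[i + 3]);
    }
    for (; i < n; ++i) total += popcount64(a[i] & b[i]); /* tail words */
    return total;
}

/* Pick a kernel from the vector length; the cutoff of 64 words is arbitrary. */
static count_func get_intersect_func(size_t n_words) {
    return n_words < 64 ? intersect_simple : intersect_unrolled;
}

/* Compare row 0 against every other row of a row-major matrix,
 * resolving the kernel once outside the loop. */
static uint64_t count_against_first(const uint64_t* rows, size_t n_rows, size_t n_words) {
    count_func f = get_intersect_func(n_words);
    uint64_t total = 0;
    for (size_t r = 1; r < n_rows; ++r)
        total += f(rows, rows + r * n_words, n_words);
    return total;
}
```

Hoisting the lookup out of the loop avoids repeating the length check on every call, which matters most when the individual vectors are short.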
+
+Portable memory alignment.
+
+```C
+#include "libalgebra.h"
+
+void* STORM_aligned_malloc(size_t alignment, size_t size);
+void STORM_aligned_free(void* memblock);
+```
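For readers unfamiliar with why a wrapper is needed: aligned allocation has no single call that works on every toolchain. The sketch below shows one common way to write such a pair of functions. It is an assumption about the general technique, not the code in `libalgebra.h`, and `my_aligned_malloc` / `my_aligned_free` are hypothetical names. It assumes a C11 standard library outside MSVC and a power-of-two alignment.

```C
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(_MSC_VER)
#include <malloc.h>
#endif

/* Hypothetical wrapper, not the libalgebra.h implementation.
 * alignment is assumed to be a nonzero power of two. */
static void* my_aligned_malloc(size_t alignment, size_t size) {
#if defined(_MSC_VER)
    return _aligned_malloc(size, alignment); /* note the swapped argument order */
#else
    /* C11 aligned_alloc expects size to be a multiple of alignment,
     * so round the request up to the next multiple. */
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
#endif
}

/* Memory from _aligned_malloc must be released with _aligned_free;
 * memory from aligned_alloc is released with plain free. */
static void my_aligned_free(void* memblock) {
#if defined(_MSC_VER)
    _aligned_free(memblock);
#else
    free(memblock);
#endif
}
```

A 64-byte alignment matches the width of one AVX-512 register, which is presumably why the benchmark arrays are placed in aligned memory.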
+
 ## How it works
 
 On x86 CPUs ```libalgebra.h``` uses a combination of algorithms depending on the input vector size and what instruction set your CPU supports. These checks are performed during **run-time**.
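As an illustration of what a run-time instruction-set check can look like, the sketch below uses the GCC/Clang builtin `__builtin_cpu_supports` on x86. This is an assumption about one possible approach, not a description of how `libalgebra.h` performs its checks; the `simd_tier` enum and both function names are hypothetical.

```C
/* Hypothetical tiers, ordered from least to most capable. */
typedef enum {
    TIER_SCALAR = 0,
    TIER_SSE42,
    TIER_AVX2,
    TIER_AVX512BW
} simd_tier;

/* Query the CPU once at run time and report the widest usable tier.
 * On non-x86 targets or other compilers this falls back to scalar. */
static simd_tier detect_simd_tier(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512bw")) return TIER_AVX512BW;
    if (__builtin_cpu_supports("avx2"))     return TIER_AVX2;
    if (__builtin_cpu_supports("sse4.2"))   return TIER_SSE42;
#endif
    return TIER_SCALAR;
}

/* Human-readable name, e.g. for logging which path was selected. */
static const char* simd_tier_name(simd_tier t) {
    switch (t) {
        case TIER_AVX512BW: return "AVX512BW";
        case TIER_AVX2:     return "AVX2";
        case TIER_SSE42:    return "SSE4.2";
        default:            return "scalar";
    }
}
```

A dispatcher would typically combine this tier with the input length to choose a kernel, since a wider instruction set only pays off once the input is long enough to amortize its setup cost.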