Skip to content

Commit

Permalink
results in readme
Browse files Browse the repository at this point in the history
  • Loading branch information
mklarqvist committed Aug 18, 2019
1 parent c150197 commit 03d13a5
Showing 1 changed file with 42 additions and 42 deletions.
84 changes: 42 additions & 42 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ The core algorithms are described in the papers:

### Speedup

Sample performance metrics (practical upper limit) on AVX512BW machine. We simulate a single data array or pairs of data arrays in a aligned memory location and compute the same statistics many times. This reflect the fastest possible throughput if you never have to leave the destination cache-level.
Sample performance metrics (practical upper limit) on AVX512BW machine. We simulate a single data array or pairs of data arrays in a aligned memory location and compute the same statistics many times using the command `benchmark -p -r 10000` (required Linux `perf` subsystem). This reflect the fastest possible throughput if you never have to leave the destination cache-level.
The host architecture used is a 10 nm Cannon Lake [Core i3-8121U](https://ark.intel.com/content/www/us/en/ark/products/136863/intel-core-i3-8121u-processor-4m-cache-up-to-3-20-ghz.html) with gcc (GCC) 8.2.1 20180905 (Red Hat 8.2.1-3).

### POSPOPCNT
Expand All @@ -51,27 +51,27 @@ CPUs compared to a naive unvectorized solution
Fold speedup compared to a naive unvectorized algorithm
(`popcount_scalar_naive_nosimd`) for different array sizes as (CPU cycles/64-bit word, Instructions/64-bit word):

| Words | libalgebra.h | Scalar | Speedup |
|---------|---------------|---------------|---------|
| 4 | 28.5 (36.75) | 27.25 (34) | 1 |
| 8 | 18.38 (25.5) | 15.63 (30.5) | 0.9 |
| 16 | 10.81 (19.94) | 13.75 (28.75) | 1.3 |
| 32 | 7.91 (17.16) | 11.25 (27.88) | 1.4 |
| 64 | 3.47 (3.92) | 9.7 (27.44) | 2.8 |
| 128 | 2.23 (1.93) | 8.97 (27.22) | 4 |
| 256 | 1.09 (1.29) | 8.6 (27.11) | 7.9 |
| 512 | 0.7 (0.98) | 8.35 (27.06) | 11.9 |
| 1024 | 0.48 (0.82) | 8.22 (27.03) | 17.2 |
| 2048 | 0.35 (0.74) | 8.12 (27.01) | 23.5 |
| 4096 | 0.29 (0.7) | 8.09 (27.01) | 28.1 |
| 8192 | 0.24 (0.68) | 8.04 (27) | 33.2 |
| 16384 | 0.22 (0.67) | 8.02 (27) | 36.6 |
| 32768 | **0.21 (0.66)** | 8.02 (27) | **38.7** |
| 65536 | 0.83 (0.66) | 8.01 (27) | 9.7 |
| 131072 | 0.47 (0.66) | 8.02 (27) | 17 |
| 262144 | 0.49 (0.66) | 8.05 (27) | 16.5 |
| 524288 | 0.48 (0.66) | 8.07 (27) | 16.9 |
| 1048576 | 0.44 (0.66) | 8.07 (27) | 18.5 |
| Words | libalgebra.h | Scalar | Speedup |
|---------|--------------|---------------|---------|
| 4 | 27.75 (37) | 26.75 (33.5) | 1 |
| 8 | 16.38 (25.5) | 17.38 (30.25) | 1.1 |
| 16 | 10.5 (19.94) | 12.75 (28.63) | 1.2 |
| 32 | 7.72 (17.16) | 10.69 (27.81) | 1.4 |
| 64 | 3.09 (4.36) | 9.61 (27.41) | 3.1 |
| 128 | 2.53 (2.73) | 8.84 (27.2) | 3.5 |
| 256 | 1.35 (1.7) | 8.5 (27.1) | 6.3 |
| 512 | 0.67 (1.18) | 8.33 (27.05) | 12.4 |
| 1024 | 0.5 (0.92) | 8.25 (27.03) | 16.4 |
| 2048 | 0.41 (0.79) | 8.15 (27.01) | 20.1 |
| 4096 | 0.46 (0.72) | 8.12 (27.01) | 17.8 |
| 8192 | 0.39 (0.69) | 8.11 (27) | 21 |
| 16384 | 0.39 (0.67) | 8.1 (27) | 20.6 |
| 32768 | 0.89 (0.66) | 8.1 (27) | 9.1 |
| 65536 | 0.84 (0.66) | 8.1 (27) | 9.6 |
| 131072 | 0.68 (0.66) | 8.09 (27) | 11.9 |
| 262144 | 1.11 (0.66) | 8.09 (27) | 7.3 |
| 524288 | 1.84 (0.66) | 8.12 (27) | 4.4 |
| 1048576 | 1.95 (0.66) | 8.15 (27) | 4.2 |

### Set algebra

Expand All @@ -82,25 +82,25 @@ OR, or XOR) which all have identical latency and throughput (CPI).

| Words | libalgebra.h | Scalar | Speedup |
|---------|--------------|---------------|---------|
| 4 | 18.25 (8.63) | 14.63 (22.75) | 0.8 |
| 8 | 9.5 (5.44) | 10.25 (20.88) | 1.1 |
| 16 | 5.28 (3.84) | 7.91 (19.94) | 1.5 |
| 32 | 2.88 (2.56) | 6.88 (19.47) | 2.4 |
| 64 | 1.98 (2.06) | 5.94 (19.23) | 3 |
| 128 | 1.05 (0.89) | 5.49 (19.12) | 5.2 |
| 256 | 0.61 (0.64) | 5.24 (19.06) | 8.6 |
| 512 | 0.42 (0.51) | 5.13 (19.03) | 12.3 |
| 1024 | 0.29 (0.45) | 5.06 (19.02) | 17.4 |
| 2048 | 0.21 (0.41) | 5.04 (19.01) | 24.2 |
| 4096 | 0.17 (0.4) | 5.02 (19) | 29 |
| 8192 | 0.15 (0.39) | 5.01 (19) | 34.3 |
| 16384 | **0.13 (0.39)** | 5.01 (19) | **37.6** |
| 32768 | 0.54 (0.39) | 5 (19) | 9.3 |
| 65536 | 0.43 (0.38) | 5 (19) | 11.7 |
| 131072 | 0.28 (0.38) | 5 (19) | 17.8 |
| 262144 | 0.25 (0.38) | 5 (19) | 20.4 |
| 524288 | 0.24 (0.38) | 5 (19) | 20.7 |
| 1048576 | 0.22 (0.38) | 5 (19) | 22.9 |
| 4 | 17.63 (8.63) | 14.63 (22.75) | 0.8 |
| 8 | 8.13 (5.44) | 10 (20.88) | 1.2 |
| 16 | 4.69 (3.84) | 7.91 (19.94) | 1.7 |
| 32 | 2.38 (2.56) | 6.59 (19.47) | 2.8 |
| 64 | 1.82 (2.06) | 5.87 (19.23) | 3.2 |
| 128 | 0.88 (0.89) | 5.43 (19.12) | 6.2 |
| 256 | 0.57 (0.64) | 5.18 (19.06) | 9.2 |
| 512 | 0.41 (0.51) | 5.11 (19.03) | 12.4 |
| 1024 | 0.33 (0.45) | 5.06 (19.02) | 15.3 |
| 2048 | 0.39 (0.41) | 5.03 (19.01) | 13.1 |
| 4096 | 0.36 (0.4) | 5.02 (19) | 13.9 |
| 8192 | 0.37 (0.39) | 5.01 (19) | 13.7 |
| 16384 | 0.55 (0.39) | 5.01 (19) | 9.1 |
| 32768 | 0.55 (0.39) | 5 (19) | 9.2 |
| 65536 | 0.52 (0.38) | 5 (19) | 9.7 |
| 131072 | 0.56 (0.38) | 5.01 (19) | 9 |
| 262144 | 1.25 (0.38) | 5.02 (19) | 4 |
| 524288 | 1.76 (0.38) | 5.03 (19) | 2.9 |
| 1048576 | 1.81 (0.38) | 5.07 (19) | 2.8 |

## C/C++ API

Expand Down Expand Up @@ -135,7 +135,7 @@ int STORM_pospopcnt_u16(const uint16_t* data, uint32_t size, uint32_t* flags);
* Compute the intersection, union, or diff cardinality between pairs of bitmaps
* @data1: A 64-bit array
* @data2: A 64-bit array
* @size: Size of data in bytes
* @size: Size of data in 64-bit words
*/
// Intersect cardinality
uint64_t STORM_intersect_count(const uint64_t* data1, const uint64_t* data2, const uint32_t size);
Expand Down

0 comments on commit 03d13a5

Please sign in to comment.