Optimize (vectorize?) integer deserialization #25

k0ekk0ek · 2023-02-20T11:25:56Z

Currently, converting numbers to binary is done the idiomatic way. Since integers (int8, int16 and int32) occur quite a lot in zone files, e.g. TTL, optimization may result in a nice peformance boost.

To tackle this, it's good to know that for some integer values, use of a symbolic value is allowed to. e.g. algorithm field in DS records. Also, TTL values are 32-bit integer fields, but the MSB must not be set.

The text was updated successfully, but these errors were encountered:

aqrit · 2023-06-17T13:29:56Z

My thoughts on parsing integers start here: https://stackoverflow.com/a/74453028
I don't know if it actually makes anything faster.

k0ekk0ek · 2023-06-19T08:40:00Z

Thanks @aqrit! I'll be sure to give this a try down the road. Most RRs don't contain integer data (people tend to specify the TTL once), but while wrinting #68, I did see some types where this may prove very useful.

k0ekk0ek · 2023-11-22T09:26:50Z

@aqrit, looked at your posts (and mentioned posts/articles) quite a bit. I've changed the function to detect invalid characters too. This is a first stab at the problem, I've not benchmarked it. For simdzone, we know what type of integer (8 bit, 16 bit or 32 bit) we're going to parse, so chances are we can do much better.

static int parse_integer(const char *buffer, size_t length, uint64_t *number)
{
  uint64_t digits, bad_chars;
  static const uint8_t shift[] =
    { 0,  56, 48, 40, 32, 24, 16, 8,  0,  0,  0,  0,  0,  0,  0,  0 };

  length &= 0xfull;

  memcpy(&digits, buffer, sizeof(digits)); // assumes little-endian
  // flip 0x30 (ASCII 0 (0x30) - 9 (0x39)) bits, detect non-digits
  digits ^= 0x3030303030303030ull;
  // shift off non-digits
  digits <<= shift[length];

  bad_chars = (digits | (digits + 0x0606060606060606ull)) & 0xf0f0f0f0f0f0f0f0ull;

  digits = (digits & 0x0f0f0f0f0f0f0f0fllu) * 2561 >> 8;
  digits = (digits & 0x00ff00ff00ff00ffllu) * 6553601 >> 16;
  digits = (digits & 0x0000ffff0000ffffllu) * 42949672960001 >> 32;

  *number = digits;

  return (bad_chars == 0);
}

int main(int argc, char *argv[])
{
  char buffer[16] = { 0 }; // padding guaranteed in simdzone
  size_t length;
  if (argc == 2)
    length = strlen(argv[1]);
  else
    length = 0;

  memcpy(buffer, argv[1], length&0xfull);

  uint64_t number;

  if (parse_integer(buffer, length, &number)) {
    fprintf(stdout, "number: %llu\n", number);
    return 0;
  } else {
    fprintf(stderr, "Invalid number\n");
    return -1;
  }
}

k0ekk0ek · 2023-11-28T10:04:26Z

I found that with validation required, it's incredibly hard to beat the naive implementation. I focused on int8 (maximum of 255) first. What I found consistently faster is:

static int parse_int8_swar(const char *str, size_t len, uint8_t *num)
{
  union { uint8_t as_str[4]; uint32_t as_int; } digits;

  memcpy(&digits.as_int, str, sizeof(digits));
  // flip 0x30, detect non-digits
  digits.as_int ^= 0x30303030lu;
  // shift off trash bytes
  digits.as_int <<= (4 - (len & 0x3)) * 8;

  const uint32_t x = digits.as_str[1] * 10 + digits.as_str[2];
  const uint32_t y = x * 10 + digits.as_str[3];
  *num = (uint8_t)y;
  return (digits.as_int & 0xf0f0f0f0) == 0 && y < 256 && len && len < 4;
}

Where naive is:

__attribute__((noinline))
static int parse_int8_naive(const char *str, size_t len, uint8_t *num)
{
  uint32_t n = 0;

  for (size_t i=0, r=len&0x3; i < r; i++) {
    uint8_t d = str[i] - '0';
    if (d > 9)
      return 0;
    n = n * 10 + d;
  }

  *num = (uint8_t)n;
  return n < 256 && len && len < 4;
}

Don't know if it's worth the trouble. @aqrit, @lemire, any pointers?

lemire · 2023-11-28T20:21:54Z

Here is my attempt:

https://lemire.me/blog/2023/11/28/parsing-8-bit-integers-quickly/

lemire · 2023-11-28T20:22:31Z

(My tests suggest that you can definitively do much better than naive with SWAR.)

k0ekk0ek · 2023-11-29T08:50:47Z

Great stuff @lemire. Thanks!

lemire · 2023-11-29T13:13:03Z

@k0ekk0ek I am tweaking it a bit since the initial version did not fully validate the input.

k0ekk0ek · 2023-11-29T13:33:08Z

I forgot to add 0x06060606 after the initial XOR back to check for character greater than '9'. That may not cover all scenarios though, i.e. when the initial value is 11001111, then it'd overflow.

lemire · 2023-11-29T14:33:48Z

I recommend the following...

int parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) {
  if(len == 0 || len > 3) { return 0; }
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x30303030lu;
  digits.as_int <<= ((4 - len) * 8);
  uint32_t all_digits = ((digits.as_int & 0xf0f0f0f0) |
                         ((0x76767676 + digits.as_int) & 0x80808080)) == 0;
  *num = (uint8_t)((UINT64_C(0x640a0100) * digits.as_int) >> 32);
  return all_digits & ((__builtin_bswap32(digits.as_int) <= 0x020505));
}

I think it covers everything.

lemire · 2023-11-29T19:48:47Z

Slightly more efficient...

int parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) {
  if(len == 0 || len > 3) { return 0; }
  union {
    uint8_t as_str[4];
    uint32_t as_int;
  } digits;

  memcpy(&digits.as_int, str, sizeof(digits));
  // flip 0x30, detect non-digits
  digits.as_int ^= 0x30303030lu;
  // shift off trash bytes, technically undefined behaviour when ((4 - len) * 8) is not
  // in [0,32) prior to C17 / C++14.
  digits.as_int <<= ((4 - len) * 8);
  // check all digits
  uint32_t all_digits = ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 24);
  return all_digits & ((__builtin_bswap32(digits.as_int) <= 0x020505));
}

This is now squarely 2x faster than naive on random inputs, and still measurably faster on predictable inputs.

It is likely practical.

k0ekk0ek · 2023-11-30T10:50:24Z

Thanks @lemire! It's amazing how much time one can spend on this stuff.

I have been trying to workaround the shift (note that parse_uint8_fastswar_bob suffers from the same problem in case len is greater). e.g. ((len & 3) ^ 3) + 1) * 8, (0x0d << (len & 3)) & 0x18) and a shift table. However, they are all slower than what's currently there (dependency chain?). Maybe there's an alternative to get the bytes in the "right" location. For simdzone, a len of 0 (zero) is not really a problem as in that case there won't be a token.

~~There's one case we haven't covered though. Leading zeroes. i.e. 0 is allowed, but 01 is not.~~ Leading zeroes actually are allowed. At least, BIND and Knot (and the current parser in NSD), do.

I'll give this some more thought at a later stage.

k0ekk0ek · 2023-11-30T11:28:27Z

Or, maybe it's just better solved in SIMD? Trying to get things to go "fast" for the fallback parser may not be the right choice(?) From a practical perspective, users won't notice as no-one is using CPU's that don't support SIMD for real work. Anyway, I'll have to polish the current code for the first release, I'll work on this more afterwards. (I have some vacation coming up 🙂)

lemire · 2023-11-30T13:38:26Z

There's one case we haven't covered though. Leading zeroes. i.e. 0 is allowed, but 01 is not.

Note that this is true of the naïve version as well.

k0ekk0ek · 2024-02-20T13:37:12Z

Random blog post I came across. May be worth investigating at some point.

k0ekk0ek added enhancement New feature or request good first issue Good for newcomers labels Feb 20, 2023

k0ekk0ek added this to the Release 0.2.0 milestone Feb 20, 2023

k0ekk0ek added the help wanted Extra attention is needed label Feb 20, 2023

k0ekk0ek mentioned this issue Nov 22, 2023

Epoch timestamps in RRSIG not supported #110

Closed

k0ekk0ek mentioned this issue Apr 5, 2024

Make NSD work #176

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize (vectorize?) integer deserialization #25

Optimize (vectorize?) integer deserialization #25

k0ekk0ek commented Feb 20, 2023

aqrit commented Jun 17, 2023

k0ekk0ek commented Jun 19, 2023

k0ekk0ek commented Nov 22, 2023

k0ekk0ek commented Nov 28, 2023

lemire commented Nov 28, 2023

lemire commented Nov 28, 2023

k0ekk0ek commented Nov 29, 2023

lemire commented Nov 29, 2023

k0ekk0ek commented Nov 29, 2023

lemire commented Nov 29, 2023

lemire commented Nov 29, 2023

k0ekk0ek commented Nov 30, 2023 •

edited

k0ekk0ek commented Nov 30, 2023

lemire commented Nov 30, 2023

k0ekk0ek commented Feb 20, 2024

Optimize (vectorize?) integer deserialization #25

Optimize (vectorize?) integer deserialization #25

Comments

k0ekk0ek commented Feb 20, 2023

aqrit commented Jun 17, 2023

k0ekk0ek commented Jun 19, 2023

k0ekk0ek commented Nov 22, 2023

k0ekk0ek commented Nov 28, 2023

lemire commented Nov 28, 2023

lemire commented Nov 28, 2023

k0ekk0ek commented Nov 29, 2023

lemire commented Nov 29, 2023

k0ekk0ek commented Nov 29, 2023

lemire commented Nov 29, 2023

lemire commented Nov 29, 2023

k0ekk0ek commented Nov 30, 2023 • edited

k0ekk0ek commented Nov 30, 2023

lemire commented Nov 30, 2023

k0ekk0ek commented Feb 20, 2024

k0ekk0ek commented Nov 30, 2023 •

edited