Skip to content
/ swar Public

Low level, branch-free functions for number and string conversion and other utils

License

Notifications You must be signed in to change notification settings

yb303/swar

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SWAR

Low level, branch-free functions for number and string conversion and other utils.

SWAR stands for SIMD within a register. This means treating parts of a uint64_t as individual uint8_t's, uint16_t's, or uint32_t's.

This library is header-only. There is nothing to build.

Most functions have different variants optimized for limited lengths (8 and 4), and printable input (ascii < 128), and more

Language and build

I'm used to C++17 so used if constexpr, but the rest of the code is C++03 copatible, and can easily be converted to C.
Include swar.h and build with -std=c++17
For forward declarations only, include swar_fwd.h instead.

Test and benchmark

The test dir includes:

  • swar_test.cpp that is using google-test for unit testing. (TODO create a build: passing badge)
  • swar_bench.cpp that produces the numbers for the graph below.

Performance

Comparison of various atoi implementations (actually a-to-ull) on my home machine:
Machine and env spec: i5-3470, Cygwin on Windows 10 64 bit, g++ 9.3.0

The not-perfectly-straight lines, in SWAR's performance, are just measurement artifacts. In swar8, for example, it's the same instructions executed for every input, so there should be no difference. You can see how SWAR is faster than the naive impl and is fixed cost per word. The stock implementation is surprisingly slow. I don't know why as I didn't read its code yet.

Functions with 8, or 4, suffix are branchless and faster (see swar8 vs swar). Functions with longer input must have a branch per word.
An SSE implementation can follow the same ideas as here for longer inputs. However, using SSE instruction may switch some processors to a different P-state, if the BIOS allows, and the switching itself can take a few hundred cycles.

Branchless code is not always faster than branched code.
Benchmarks are typically less impacted by branch miss-predictions, then real world applications. This applies also in my benchmark. I did not take special care to litter the BP caches before each function call as this would make each call harder to measure.
Loops may be fully predicted, especially if BP caches are all working for the benchmark. However, in a real world app, using branchless low level code means that BP caches have more room for the application logic so the app as a whole may become faster. The only way to know for sure is to test within the app.

These performance characteristics are the same for strlen and similar functions.

Functions

All functions come in a few variants:

  • memchr and memrchr
  • strlen
  • atoi, htoi (hex string to int), atod
  • itoa
  • hasbyte - does word include a certain byte?

Supported operating systems

  • Linux
  • Cygwin

Supported programming languages

  • C++

Supported compilers

  • g++

Supported architectures

  • x86_64

About

Low level, branch-free functions for number and string conversion and other utils

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published