SWAR

Low level, branch-free functions for number and string conversion and other utils.

SWAR stands for SIMD within a register. This means treating parts of a uint64_t as individual uint8_t's, uint16_t's, or uint32_t's.

This library is header-only. There is nothing to build.

Most functions have different variants optimized for limited lengths (8 and 4), and printable input (ascii < 128), and more

Language and build

I'm used to C++17 so used if constexpr, but the rest of the code is C++03 copatible, and can easily be converted to C.
Include swar.h and build with -std=c++17
For forward declarations only, include swar_fwd.h instead.

Test and benchmark

The test dir includes:

swar_test.cpp that is using google-test for unit testing. (TODO create a build: passing badge)
swar_bench.cpp that produces the numbers for the graph below.

Performance

Comparison of various atoi implementations (actually a-to-ull) on my home machine:
Machine and env spec: i5-3470, Cygwin on Windows 10 64 bit, g++ 9.3.0

The not-perfectly-straight lines, in SWAR's performance, are just measurement artifacts. In swar8, for example, it's the same instructions executed for every input, so there should be no difference. You can see how SWAR is faster than the naive impl and is fixed cost per word. The stock implementation is surprisingly slow. I don't know why as I didn't read its code yet.

Functions with 8, or 4, suffix are branchless and faster (see swar8 vs swar). Functions with longer input must have a branch per word.
An SSE implementation can follow the same ideas as here for longer inputs. However, using SSE instruction may switch some processors to a different P-state, if the BIOS allows, and the switching itself can take a few hundred cycles.

Branchless code is not always faster than branched code.
Benchmarks are typically less impacted by branch miss-predictions, then real world applications. This applies also in my benchmark. I did not take special care to litter the BP caches before each function call as this would make each call harder to measure.
Loops may be fully predicted, especially if BP caches are all working for the benchmark. However, in a real world app, using branchless low level code means that BP caches have more room for the application logic so the app as a whole may become faster. The only way to know for sure is to test within the app.

These performance characteristics are the same for strlen and similar functions.

Functions

All functions come in a few variants:

memchr and memrchr
strlen
atoi, htoi (hex string to int), atod
itoa
hasbyte - does word include a certain byte?

Supported operating systems

Linux
Cygwin

Supported programming languages

C++

Supported compilers

g++

Supported architectures

x86_64

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
test		test
LICENSE		LICENSE
README.md		README.md
compiler.h		compiler.h
swar.h		swar.h
swar_fwd.h		swar_fwd.h
swar_inl.h		swar_inl.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SWAR

Language and build

Test and benchmark

Performance

Functions

Supported operating systems

Supported programming languages

Supported compilers

Supported architectures

About

Releases

Packages

Languages

License

yb303/swar

Folders and files

Latest commit

History

Repository files navigation

SWAR

Language and build

Test and benchmark

Performance

Functions

Supported operating systems

Supported programming languages

Supported compilers

Supported architectures

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages