Name		Name	Last commit message	Last commit date
parent directory ..
test		test
.gitignore		.gitignore
.travis.yml		.travis.yml
CMakeLists.txt		CMakeLists.txt
ChangeLog		ChangeLog
LICENSE		LICENSE
README.md		README.md
appveyor.yml		appveyor.yml
benchmark.cpp		benchmark.cpp
libpopcnt.h		libpopcnt.h

README.md

libpopcnt

libpopcnt.h is a header-only C/C++ library for counting the number of 1 bits (bit population count) in an array as quickly as possible using specialized CPU instructions i.e. POPCNT, AVX2, AVX512, NEON. libpopcnt.h has been tested successfully using the GCC, Clang and MSVC compilers.

The algorithms used in libpopcnt.h are described in the paper Faster Population Counts using AVX2 Instructions by Daniel Lemire, Nathan Kurz and Wojciech Mula (23 Nov 2016).

How it works

On x86 CPUs libpopcnt.h uses a combination of 4 different bit population count algorithms:

For array sizes < 512 bytes an unrolled POPCNT algorithm is used.
For array sizes ≥ 512 bytes an AVX2 algorithm is used.
For array sizes ≥ 1024 bytes an AVX512 algorithm is used.
For CPUs without POPCNT instruction a portable integer algorithm is used.

Note that libpopcnt.h works on all CPUs, it checks at run-time whether your CPU supports POPCNT, AVX2, AVX512 before using it and it is also thread-safe.

C/C++ API

#include "libpopcnt.h"

/*
 * Count the number of 1 bits in the data array
 * @data: An array
 * @size: Size of data in bytes
 */
uint64_t popcnt(const void* data, uint64_t size);

Speedup

This benchmark shows the speedup of the 4 popcount algorithms used on x86 CPUs compared to the basic lookup-8 popcount algorithm for different array sizes (in bytes).

Algorithm	32 B	64 B	128 B	256 B	512 B	1024 B	2048 B	4096 B
lookup-8	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
bit-parallel-mul	1.41	1.54	1.63	1.78	1.60	1.62	1.63	1.64
builtin-popcnt-unrolled	4.75	6.36	8.58	8.55	6.72	7.60	7.88	7.94
avx2-harley-seal	1.15	1.85	3.22	4.17	8.46	10.74	12.52	13.66
avx512-harley-seal	0.35	1.49	2.54	3.83	5.63	15.12	22.18	25.60

libpopcnt.h automatically picks the fastest algorithm for the given array size. This benchmark was run on an Intel Xeon Platinum 8168 CPU with GCC 5.4.

CPU architectures

libpopcnt.h has hardware accelerated popcount algorithms for the following CPU architectures:

x86	`POPCNT`, `AVX2`, `AVX512`
x86-64	`POPCNT`, `AVX2`, `AVX512`
ARM	`NEON`
PPC64	`POPCNTD`

For other CPU architectures a fast integer popcount algorithm is used.

How to compile

For GCC and Clang compilation does not require any special compiler flags (like -mavx2)!

cc  -O3 program.c
c++ -O3 program.cpp

Microsoft Visual C++

Using the MSVC compiler libpopcnt.h will only use POPCNT by default. You can enable AVX2 or AVX512 to get the best performance but then your program requires a CPU with AVX2 or AVX512 support.

# Enable AVX2
cl /O2 /arch:AVX2 program.cpp

# Enable AVX512
cl /O2 /arch:AVX512 program.cpp

Development

cmake .
make -j
make test

The above commands also build the benchmark program which is useful for benchmarking libpopcnt.h. Below is a usage example run on an Intel Xeon Platinum 8168 CPU from 2017:

# Usage: ./benchmark [array bytes] [iters]
./benchmark
Iters: 10000000
Array size: 16.00 KB
Algorithm: AVX512
Status: 100%
Seconds: 1.59
103.4 GB/s

Acknowledgments

The vectorized popcount algorithms used in libpopcnt.h have originally been written by Wojciech Muła, I just made a convenient and portable C/C++ library using these algorithms.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libpopcnt

libpopcnt

README.md

libpopcnt

How it works

C/C++ API

Speedup

CPU architectures

How to compile

Microsoft Visual C++

Development

Acknowledgments

Files

libpopcnt

Directory actions

More options

Directory actions

More options

Latest commit

History

libpopcnt

Folders and files

parent directory

README.md

libpopcnt

How it works

C/C++ API

Speedup

CPU architectures

How to compile

Microsoft Visual C++

Development

Acknowledgments