POPCNT for counting bits

Counting the number of bits set in a bit vector is needed for operations such as measuring the Hamming distance between two bit vectors. This operation is called population count and a nice description is in the book Beautiful Code. I’ve used it to count the nucleotide mismatches between strings of two bit encoded DNA.

A few earlier processors have specialized instructions to perform this operation, but up until recently desktop computers lacked specialized instruction. Intel’s Nehalem adds the instruction POPCNT to perform the operation in hardware (or at least microcode)*. Program 1 below implements three methods of performing population count: 1 – the straightforward method, 2 – a well known bit twiddler method, and 3 – the POPCNT method. The POPCNT instruction can be easily be used through an intrinsic call. To stress the methods, the bits of 2^30 quad-word values are counted***.

Here are the timings as collected on an Intel X5570:

Method Time (s) Speed up
1 86.330 1
2 19.584 4.4
3 0.672 128.5

Unfortunately, if run the POPCNT method on a machine without the POPCNT instruction, the program fails with an ‘illegal instruction’ error. Instead of making two separate executables, one for machines with POPCNT and another for machines without, we can use runtime dispatchers, as in program 2, to control which implementation is used. Here only methods two and three are implemented as they are sufficient to cover all machines with the best runtime performance***. Unfortunately there is some overhead to the use of dispatchers. On the same machine, method three now runs in 3.023s instead of 0.672s, but we still achieve a gain and the program will run on processors earlier than Nehalem. In practice, one would want to balance pushing the dispatcher to as high a level as possible and minimization of code duplication. In this case, the dispatcher is just above a very simple, short duration, and highly called function so the overhead of the dispatcher is magnified.

To quickly check if your Linux machine has the POPCNT instruction, check the contents of the “flags” line in /proc/cpuinfo. If the line contains “popcnt” you are set, at least works on kernel 2.6.18.

It is also worth mentioning that Nvidia’s CUDA includes the __popc() and __popcll() equivalents of POPCNT. I don’t know when these were introduced, but I can say they work and provide better performance than methods one and two on GTX285 cards.

_____Program 1_____

#include <stdio.h>
#include <stdlib.h>
#include <smmintrin.h>

int c; // prevent optimizer from removing c from loops

int main() {
size_t n = 1<<30;

for(size_t i=0; i<n; i++) {
// Enabled desired method below

#if 0
// Straightforward method
c = 0;
for(int j=0; j<sizeof(i)*8; j++) {
if(i & (1ul << j)) c++;
}
#endif

#if 0
// bit twiddling method
c = 0;
unsigned long z = i;
for(; z; c++) {
z &= z - 1;
}
#endif

#if 0
// POPCNT method
c = _mm_popcnt_u64(i);
#endif
}

return 0;
}

_____Program 1_____

_____Program 2_____

#include <stdio.h>
#include <stdlib.h>
#include <smmintrin.h>

int c; // prevent optimizer from removing c from loops

__declspec(cpu_specific(core_i7_sse4_2))
int popcnt(unsigned long i) {
return _mm_popcnt_u64(i);
}

__declspec(cpu_specific(generic))
int popcnt(unsigned long i) {
int c = 0;
for(; i; c++) {
i &= i - 1;
}
return c;
}

__declspec(cpu_dispatch(generic, core_i7_sse4_2))
int popcnt(unsigned long i) {
}

int main() {
size_t n = 1<<30;

for(size_t i=0; i<n; i++) {
c = popcnt(i);
}

return 0;
}

_____Program 2_____

* The first “desktop” processor to add POPCNT is probably’s AMD’s Barcelona. It has been suggested the instruction appeared to secure an NSA contract. While a neat story, it could just as well be just that.
** All programs compiled with Intel C++ Compiler 11.1.046 using flags “-xSSE4.2 -O3”
*** Method two requires time proportional to the number of bits set. Its runtime could be as bad as method one, but not in the average case.

(Also see Majek’s article on AMD’s POPCNT and a couple other methods.)

This Post Has 3 Comments

  1. Scott

    There is a log-n way of doing this. Take an 8-bit integer:
    unsigned char value=123;

    unsigned char c1 = (value & 0x55) + ((value >> 1) & 0x55);
    unsigned char c2 = (c1 & 0x33) + ((c1 >> 2) & 0x33);
    unsigned char c3 = (c2 & 0xf) + ((c2 >> 4) & 0xf);

    return c3;

    Since its log-n with the # bits in the operand, it scales well on 64-bit platforms (I'm just too lazy to write it out). The 32-bit version is 17.6x faster than method #1. Depending on how much work you need to do surrounding POPCNT, having a more portable version may be worth the extra cost.

  2. -tom!

    If you extend Scott's implementation with N bytes (extend 0x5555…, 0x3333…, and 0x0F0F… to fill your word size) then actually you can accumulate the value in the high byte with one subsequent multiply, then shift that down.

    int c4 = (c3 * 0x0101…) >> (wordsize-8)
    return c4;

    This works with word sizes up to 128 bits (as a 256 bit register with all bits set would overflow the byte sized accumulator).

  3. Anonymous

    maybe this is just a MSVC thing but _mm_popcnt_u64 seems to require rather than

Comments are closed.