The Infinite Loop

Tales from a lean programmer.



Test-and-set spinlocks

In my previous blog post I wrote about the most important properties of spinlocks and why they matter. This time I'll present the first of three concrete spinlock implementations in C++11 for x86 processors. I'm assuming that you've read the first post. You can find all source code in this GitHub repository.

Introduction

The Test-And-Set (TAS) Lock is the simplest possible spinlock implementation. It uses a single shared memory location for synchronization, which indicates if the lock is taken or not. The memory location is updated using a test-and-set (TAS) operation. TAS atomically writes to the memory location and returns its old value in a single indivisible step. Actually, the name test-and-set is a little misleading, because the caller is responsible for testing if the operation has succeeded or not. The TAS Lock is not fair as it doesn’t guarantee FIFO ordering amongst the threads competing for the lock.

In C++11 std::atomic_bool::exchange() can be used to perform a TAS operation on an std::atomic_bool synchronization variable. On x86 CPUs std::atomic::exchange() is turned into the LOCK XCHG instruction (side note: the LOCK prefix is implicit and not strictly required, because there’s no non-atomic version of XCHG). The following code implements the described TAS Lock.

class TasSpinLock
{
public:
    ALWAYS_INLINE void Enter()
    {
        // atomic_bool::exchange() returns previous value of Locked
        while (Locked.exchange(true, std::memory_order_acquire) == true);
    }

    ALWAYS_INLINE void Leave()
    {
        Locked.store(false, std::memory_order_release);
    }

private:
    alignas(CACHELINE_SIZE) std::atomic_bool Locked = {false};
};

static_assert(sizeof(TasSpinLock) == CACHELINE_SIZE, "");

The TasSpinLock::Locked flag is padded to the size of a cache line using alignas to prevent false sharing. The padding can be removed, e.g. in memory-limited environments, when false sharing is not an issue.

The test-and-test-and-set optimization

While the TAS Lock is extremely easy to implement, it scales very badly. With just a few threads competing for the lock, the number of cache line invalidations required to acquire/release the lock quickly degrades performance. The problem is two-fold.

  1. std::atomic_bool::exchange() always invalidates the cache line Locked resides in, regardless of whether it succeeds in updating Locked or not.
  2. When the spinlock is released, all waiting threads simultaneously try to acquire it. This is sometimes referred to as the Thundering Herd problem. Acquiring the lock means first invalidating the cache line copy of all threads waiting for the lock and then reading the valid cache line copy from the core which released the lock. With t threads this results in O(t) memory bus transactions.

The latter issue is inherent to TAS Locks, because just a single synchronization variable is used and shared between all threads. However, the first issue can be fixed by testing if the lock is free before calling exchange(). First testing if the lock is free only loads the synchronization variable and therefore doesn’t cause a cache line invalidation. This spinlock variant is called Test-And-Test-And-Set (TTAS) Lock. The improved Enter() method looks like this:

ALWAYS_INLINE void TTasSpinLock::Enter()
{
    do
        WaitUntilLockIsFree();
    while (Locked.exchange(true, std::memory_order_acquire) == true);
}

ALWAYS_INLINE void TTasSpinLock::WaitUntilLockIsFree() const
{
    while (Locked.load(std::memory_order_relaxed) == true);
}

Exponential back-off

The TTAS Lock successfully reduces the number of cache line invalidations while threads are waiting for the lock. However, as soon as the lock is released, all threads again try to update the Locked flag. If a thread sees that the lock is free but then fails to acquire it because some other thread was faster, the lock is most probably under high contention. In this situation it can help to wait for a short time and let other threads finish before trying to enter the critical section (CS) again.

Waiting a short time without trying to acquire the lock reduces the number of threads simultaneously competing for the lock. The larger the number of unsuccessful retries, the higher the lock contention and the longer the thread should back off. Experiments have shown that a good strategy is to back off exponentially, similar to the congestion avoidance mechanism used in the Ethernet protocol. To avoid threads running in lock-step, the wait time is randomized.

ALWAYS_INLINE static void BackoffExp(size_t &curMaxIters)
{
    assert(curMaxIters > 0);

    thread_local std::uniform_int_distribution<size_t> dist;
    thread_local std::minstd_rand gen(std::random_device{}());
    const size_t spinIters = dist(gen, decltype(dist)::param_type{0, curMaxIters});

    curMaxIters = std::min(2*curMaxIters, MAX_BACKOFF_ITERS);
    for (volatile size_t i=0; i<spinIters; i++); // Avoid being optimized out!
}

ALWAYS_INLINE void ExpBoTTasSpinLock::Enter()
{
    size_t curMaxIters = MIN_BACKOFF_ITERS;

    while (true)
    {
        // Not strictly required but doesn't hurt
        WaitUntilLockIsFree();

        if (Locked.exchange(true, std::memory_order_acquire) == true)
            BackoffExp(curMaxIters); // Couldn't acquire lock => back-off
        else
            break; // Acquired lock => done
    }
}

The downside of backing off is that it renders the lock even less fair than it already is. The threads that have been attempting to acquire the lock the longest also back off the longest. Consequently, newly arriving threads have a higher chance of acquiring the lock than older threads. On top of that, threads might back off for too long, causing the CS to be underutilized. Furthermore, there are unfortunately no single best values for MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS. Ideally, they're determined experimentally depending on the workload (contention, size of the CS).

Pausing and sleeping

There are two more improvements we can make. First, according to the Intel Architectures Optimization Reference Manual, adding a PAUSE instruction to the body of all spin loops improves performance on Pentium 4 CPUs and later. The PAUSE instruction provides a hint to the CPU that the code being executed is a spin-wait loop. PAUSE is backwards compatible and turns into a NOP instruction on older CPUs. There are three reasons why using PAUSE improves performance:

  • It avoids memory order violations which result in expensive pipeline flushes, because the CPU doesn’t go ahead to fill its pipeline with speculatively executed load and compare instructions.
  • It gives the other hardware thread time to execute on simultaneous multi-threading (SMT) processors (e.g. Intel’s Hyper Threading).
  • It adds a slight delay to the spin loop, which synchronizes it with the system’s memory bus frequency. This frequency is typically much lower than the frequency of the CPU, which in turn reduces power consumption considerably.

The second improvement is to stop spinning, put the thread to sleep and reschedule if the lock doesn't become available after some time. The important bit here is not to yield (sched_yield() on Linux, SwitchToThread() on Windows, or more generally std::this_thread::yield() in C++11), but to explicitly put the thread to sleep for a short period of time. This ensures that the thread is really paused for some time and isn't immediately run again by the scheduler if there's no other thread available in the scheduler's run queue. This is especially important on SMT processors, where every two logical cores share most of the execution units. Only when the thread is really sleeping can the other logical core fully make use of the shared execution units. The following TTAS Lock implementation uses the PAUSE instruction and sleeps for 500us if the lock isn't released within MAX_WAIT_ITERS iterations. There's no single best value for MAX_WAIT_ITERS. Ideally, it's chosen experimentally like the MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS constants from before.

ALWAYS_INLINE static void CpuRelax()
{
#if (COMPILER == MVCC)
    _mm_pause();
#elif (COMPILER == GCC || COMPILER == LLVM)
    asm("pause");
#endif
}

ALWAYS_INLINE static void YieldSleep()
{
    // Don't yield but sleep to ensure that the thread is not
    // immediately run again in case scheduler's run queue is empty
    using namespace std::chrono;
    std::this_thread::sleep_for(500us);
}

ALWAYS_INLINE static void BackoffExp(size_t &curMaxIters)
{
    ... Unchanged code not repeated, look above ...

    for (size_t i=0; i<spinIters; i++)
        CpuRelax();
}

ALWAYS_INLINE void ExpBoRelaxTTasSpinLock::WaitUntilLockIsFree() const
{
    size_t numIters = 0;

    while (Locked.load(std::memory_order_relaxed))
    {
        if (numIters < MAX_WAIT_ITERS)
        {
            CpuRelax();
            numIters++;
        }
        else
            YieldSleep();
    }
}

Performance comparison

The benchmark used to compare the different TAS Lock variants is extremely simple and should be taken with a grain of salt. It launches t threads, each competing for the lock and incrementing a global atomic counter inside the CS. This causes high lock contention, because all threads are trying to acquire the lock at the same time. However, it's important to keep in mind that due to the lock's lack of fairness, it's entirely possible that the same thread acquires the lock multiple times in a row. I ran the benchmark on a dual socket NUMA system with two Intel Xeon E5-2630 v2 CPUs at 2.60 GHz. The CPUs support Hyper Threading, which gives a total of 12 physical and 24 logical cores. A minimal sketch of the benchmark's core loop and the benchmark results are shown below.
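The following sketch only illustrates the structure of such a benchmark; it's not the exact code from the repository, and RunBenchmark() as well as its parameters numThreads and numOpsPerThread are hypothetical names chosen for illustration.

// requires <atomic>, <cstdint>, <thread> and <vector>
template <typename LockType>
void RunBenchmark(size_t numThreads, size_t numOpsPerThread)
{
    LockType lock;
    std::atomic<uint64_t> counter{0}; // the global atomic counter from the description
    std::vector<std::thread> threads;

    for (size_t t=0; t<numThreads; t++)
        threads.emplace_back([&]
        {
            for (size_t i=0; i<numOpsPerThread; i++)
            {
                lock.Enter();
                counter.fetch_add(1, std::memory_order_relaxed); // the entire CS
                lock.Leave();
            }
        });

    for (auto &thread : threads)
        thread.join();
}

A run with e.g. 24 threads could then be invoked as RunBenchmark<ExpBoRelaxTTasSpinLock>(24, 1000000);.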

TAS Lock benchmark results

The most interesting finding is that the number of lock reacquisitions is a lot higher than I anticipated. This can be inferred from the fact that all TTAS-based locks scale almost linearly, whereas the drop in performance typically observed for TAS-based locks is exponential. A large number of lock reacquisitions by the same thread reduces the number of cache line invalidations and increases throughput. At the same time the latency of all other threads goes up, because they're granted the lock later.

The TasSpinLock scales so much worse, even though lock reacquisition happens, because while spinning the TAS operation causes cache line invalidations even when it doesn't succeed in acquiring the lock. The difference between the TasSpinLock and the TTAS-based versions makes it very clear how big the influence of cache line invalidations is on a lock's performance.

Wrapup

The exponential back-off TTAS Lock tends to perform fairly well as long as the number of competing threads is low, or as long as lock reacquisitions boost throughput at the cost of latency. It's especially useful when it's not known a priori that the number of competing threads is bounded by the number of cores in the system (in that case better scaling spinlock variants perform really badly, as we'll see in the next post). Beyond that, without padding the TTAS Lock needs just a single byte. This compact representation is useful in extremely memory-constrained environments or in applications that require vast amounts of spinlocks; e.g. for fine-grained locking of lots of small pieces of data.



Advanced Octrees 2: node representations

Introduction

Welcome to the second installment of the Advanced Octrees series. Make sure you've also read part one. In this second part I discuss different Octree data structure layouts. Essentially, the literature distinguishes between two different layouts: the traditional layout using a pointer-based node representation and the implicit layout using an index-based node representation. Both layouts have their strong and weak points. Pointer-based node representations are advantageous if the Octree needs to be updated frequently and if memory consumption is not an issue. Implicit node representations pay off in memory-limited applications.

Regardless of the chosen representation, the node struct always contains a pointer to the list of objects it encloses. Additionally, almost always the cell’s axis-aligned bounding box (AABB) is stored inside the node. The AABB can be stored in two different ways: either as two vectors AabbMin and AabbMax containing the AABB’s minimum and maximum corners, or as two vectors Center and HalfSize containing the AABB’s center and extent. If the Octree is cubic, all AABB sides have the same length and the storage size of the latter AABB representation can be reduced by storing HalfSize as a single float. Speedwise the center-extent representation is advantageous in most calculations (e.g. for view-frustum culling). Instead of storing the AABB inside the node, it can be recomputed while traversing the Octree. This is a memory consumption vs. compute trade-off.

All struct sizes given in the remainder of this post assume 64-bit wide pointers and the Vector3 class consisting of three 32-bit float variables. Let’s start with looking at pointer-based node representations first.

Pointer-based node representations

Standard representation

The most intuitive, pointer-based node representation consists of eight pointers to each of the eight child nodes. This representation supports on-demand allocation of nodes. On-demand allocation only allocates memory for child nodes, once an object is encountered which falls into the respective sub-cell. Some Octree implementations add pointers to the parent node for bottom-up traversals.

// standard representation (104 bytes)
struct OctreeNode
{
    OctreeNode * Children[8];
    OctreeNode * Parent; // optional
    Object *     Objects;
    Vector3      Center;
    Vector3      HalfSize;
};

As leaf nodes have no children, they don't need to store child pointers. Therefore, two different node types, one for inner nodes and one for leaf nodes, can be used. To distinguish between inner nodes and leaf nodes, a flag must be stored additionally. The flag can either be stored as an additional bool variable IsLeaf, or encoded in the least significant bit of one of the pointers if the nodes are allocated with appropriate alignment (C++'s new operator usually aligns object types to their alignment requirement, which is that of their largest member).

// inner node (105 bytes)                   // leaf node (41 bytes)
struct OctreeInnerNode                      struct OctreeLeafNode
{                                           {
    OctreeInnerNode * Children[8];              // leaf has no children
    OctreeInnerNode * Parent;                   OctreeInnerNode * Parent; // optional
    Object *          FirstObj;                 Object *          Objects;
    Vector3           Center;                   Vector3           Center;
    Vector3           HalfSize;                 Vector3           HalfSize;
    bool              IsLeaf;                   bool              IsLeaf;
};                                          };
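As a side note, here's a minimal sketch of the pointer-tagging alternative mentioned above, i.e. encoding the leaf flag in the least significant bit of the Objects pointer instead of storing a separate IsLeaf bool. The helper names are mine, and the sketch assumes that Object instances are at least 2-byte aligned so that this bit is always zero in a real pointer.

// requires <cstdint> for uintptr_t
inline Object * TagObjectsPtr(Object *objects, bool isLeaf)
{
    // store the leaf flag in the otherwise unused least significant bit
    return (Object *)((uintptr_t)objects|(isLeaf ? 1u : 0u));
}

inline bool IsLeaf(const Object *taggedObjects)
{
    return ((uintptr_t)taggedObjects&1) != 0;
}

inline Object * GetObjects(const Object *taggedObjects)
{
    // mask out the flag bit to recover the real pointer
    return (Object *)((uintptr_t)taggedObjects&~(uintptr_t)1);
}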

Using two different node types, one for inner nodes and one for leaf nodes, can be applied to the following two representations as well.

Block representation

A significant amount of memory can be saved by storing just one pointer to a block of eight children, instead of eight pointers to eight children. That way the storage size of an inner node can be reduced from 105 bytes down to 49 bytes, which is only 47% of the original size. However, when a leaf node is subdivided, all eight children must always be allocated at once. It's no longer possible to allocate child nodes on-demand, once the first object falling into the octant in question is encountered. Look at the following figure for an illustration of the block representation.

The corresponding code for the node struct is:

// block representation (49 bytes)
struct OctreeNode
{
    OctreeNode * Children;
    OctreeNode * Parent; // optional
    Object *     FirstObj;
    Vector3      Center;
    Vector3      HalfSize;
    bool         IsLeaf;
};

Sibling-child representation

On-demand allocation can reduce the amount of memory required for nodes significantly if the world is sparsely populated and many octants therefore contain no objects. A trade-off between the standard representation and the block representation is the so-called sibling-child representation. This representation allows on-demand allocation while storing only two node pointers per node instead of eight. The first pointer is NextSibling, which points to the next child node of the node's parent. The second pointer is FirstChild, which points to the node's first child node. Look at the following figure for an illustration of the sibling-child representation. Compare the number of required pointers per node to the standard representation.

In comparison to the standard representation, the sibling-child representation needs only 25% of the memory for pointers (two instead of eight pointers). As long as the child nodes are accessed sequentially the two representations perform equally well. However, accessing nodes randomly requires dereferencing on average four times more pointers. The code for the node struct is given below.

// sibling-child representation (56 bytes)
struct OctreeNode
{
    OctreeNode * NextSibling;
    OctreeNode * FirstChild;
    OctreeNode * Parent; // optional
    Object *     FirstObj;
    Vector3      Center;
    Vector3      HalfSize;
};
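To make the cost of the extra indirections more concrete, here's a small sketch (not from the original post) of how all children of a node are visited in the sibling-child representation: the first child is reached via FirstChild and all further children via the NextSibling chain.

template <typename Visitor>
void VisitChildren(const OctreeNode *node, Visitor visit)
{
    // walk the sibling chain instead of indexing a Children[8] array
    for (const OctreeNode *child=node->FirstChild; child != nullptr; child=child->NextSibling)
        visit(child);
}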

Comparison

The choice of the right pointer-based representation depends mainly on the importance of memory usage vs. traversal speed. Explicitly storing all eight child pointers wastes memory but makes traversing and modifying the Octree easy to implement and fast. In contrast, the sibling-child representation saves 50% memory, as a single node (without the optional parent pointer) is only 48 bytes instead of 96 bytes. However, the additional pointer indirections might complicate the traversal code and make it slower. It can be a good trade-off to store just a single pointer to a block of eight sub-cells. This representation needs only 40 bytes of memory per node and the traversal code is as easy as in the representation with eight child pointers. However, always allocating all eight sub-cells can waste memory in sparsely populated worlds with many empty sub-cells.

Implicit node representations

Linear (hashed) Octrees

Linear Octrees [Gargantini, 1982]1, originally proposed for Quadtrees, combine the advantages of pointer-based and pointer-less representations. Linear Octrees provide easy and efficient access to parent and child nodes, even though no explicit tree structure information needs to be stored per node.

Overview

Instead of child and parent pointers, Linear Octrees store a unique index called the locational code in each node. Additionally, all Octree nodes are stored in a hash map which allows directly accessing any node based on its locational code. The locational code is constructed in such a way that deriving the locational codes of any node's parent and children from its own locational code is feasible and fast. To avoid unnecessary hash map look-ups for children which don't exist, the node struct can be extended by a bit-mask indicating which children have been allocated and which haven't.

struct OctreeNode // 13 bytes
{
    Object *    Objects;
    uint32_t    LocCode;     // or 64-bit, depends on max. required tree depth
    uint8_t     ChildExists; // optional
};

The locational code

In order to create the locational code, each octant gets a 3-bit number between 0 and 7 assigned, depending on the node's position relative to its parent's center. The possible relative positions are: bottom-left-front (000), bottom-right-front (001), bottom-left-back (010), bottom-right-back (011), top-left-front (100), top-right-front (101), top-left-back (110), top-right-back (111). The locational code of any node in the tree can be computed recursively by concatenating the octant numbers of all the nodes from the root down to the node in question. The octant numbers are illustrated in the figure below.

The AABB of the node can be stored explicitly as before, or it can be computed from the node’s tree depth stored implicitly inside the locational code. To derive the tree depth at a node from its locational code a flag bit is required to indicate the end of the locational code. Without such a flag it wouldn’t be possible to distinguish e.g. between 001 and 000 001. By using a 1 bit to mark the end of the sequence 1 001 can be easily distinguished from 1 000 001. Using such a flag is equivalent to setting the locational code of the Octree root to 1. With one bit for the flag and three bits per Octree level, a 32-bit locational code can represent at maximum a tree depth of 10, a 64-bit locational code a tree depth of 21. Given the locational code c of a node, its depth in the Octree can be computed as \lfloor\log_2(c)/3\rfloor. An efficient implementation using bit scanning intrinsics is given below for GCC and Visual C++.

size_t Octree::GetNodeTreeDepth(const OctreeNode *node)
{
    assert(node->LocCode); // at least flag bit must be set
    // for (uint32_t lc=node->LocCode, depth=0; lc!=1; lc>>=3, depth++);
    // return depth;

#if defined(__GNUC__)
    return (31-__builtin_clz(node->LocCode))/3;
#elif defined(_MSC_VER)
    unsigned long msb;
    _BitScanReverse(&msb, node->LocCode);
    return msb/3;
#endif
}

When sorting the nodes by locational code the resulting order is the same as the pre-order traversal of the Octree, which in turn is equivalent to the Morton Code (also known as Z-Order Curve). The Morton Code linearly indexes multi-dimensional data, preserving data locality on multiple levels.

Tree traversal

Given the locational code, moving further down or up the Octree is a simple two-step operation consisting of (1) deriving the locational code of the next node and (2) looking-up the node in the hash-map.

For traversing up the Octree, first, the locational code of the parent node must be determined. This is done by removing the least significant three bits of the locational code of the current node. Now, the parent node can be retrieved by doing a hash map look-up with the previously computed locational code. An exemplary implementation is given below.

class Octree
{
public:
    OctreeNode * GetParentNode(OctreeNode *node)
    {
        const uint32_t locCodeParent = node->LocCode>>3;
        return LookupNode(locCodeParent);
    }

private:
    OctreeNode * LookupNode(uint32_t locCode)
    {
        const auto iter = Nodes.find(locCode);
        return (iter == Nodes.end() ? nullptr : &(iter->second));
    }

private:
    std::unordered_map<uint32_t, OctreeNode> Nodes;
};

For traversing down the Octree, first, the locational code of the child in question must be computed. This is done by appending the octant number of the child to the current node's locational code. After that the child node can be retrieved by doing a hash map look-up with the previously computed locational code. The following code visits all nodes of an Octree from the root down to the leaves.

void Octree::VisitAll(OctreeNode *node)
{
    for (int i=0; i<8; i++)
    {
        if (node->ChildExists&(1<<i))
        {
            const uint32_t locCodeChild = (node->LocCode<<3)|i;
            auto *child = LookupNode(locCodeChild);
            VisitAll(child);
        }
    }
}

Full Octrees

In a full or complete Octree, every internal node has eight children and all leaf nodes have exactly the same tree depth D which is fixed a priori. A full Octree has N_L=8^D leaf nodes. Thus, it’s equal to a regular 3D grid with a resolution of 2^D\times 2^D\times 2^D. The total number of tree nodes can be computed as N_T=\sum_{i=0}^{D}8^i=\frac{8^{D+1}-1}{7}. Full Octrees of four successive subdivision levels are depicted in the figure below.

Thanks to the regularity of a full Octree, it can be implemented without explicitly storing any tree structure and cell size information in the nodes. Hence, a single node consists solely of the pointer to its objects, which is eight bytes on a 64-bit machine. Similar to binary trees, full Octrees can be stored pointer-less in an array FullOctreeNode Nodes[K] (zero-based). The children of any node Nodes[i] can be found at Nodes[8*i+1] to Nodes[8*i+8], and the parent of node Nodes[i] can be found at Nodes[floor((i-1)/8)] if i is not the root node (\Rightarrow i>0).
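For illustration, the index arithmetic described above can be wrapped in two small helpers. This is just a sketch; the function names are mine.

// requires <cassert>
inline size_t GetChildIndex(size_t nodeIdx, size_t octant) // octant in [0, 7]
{
    return 8*nodeIdx+1+octant;
}

inline size_t GetParentIndex(size_t nodeIdx)
{
    assert(nodeIdx > 0); // the root node has no parent
    return (nodeIdx-1)/8; // integer division
}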

The most common application of full Octrees is non-sparse, static scenes with very evenly distributed geometry. Unless most of the nodes contain objects, the memory savings due to the small node structs are quickly lost to the huge number of nodes that need to be allocated. A full Octree of depth D=10 consists of N_T=1227133513 (1.2 billion) nodes, which consume around 9.14 GiB of memory.

Wrap up

Which node representation to choose for an Octree implementation depends mainly on the application. There are three major aspects that can help in deciding on a representation.

  1. How much data will supposedly be stored in the Octree? Is the reduced node size of an implicit node representation crucial for keeping the memory usage in check?
  2. Will the Octree be subject to frequent changes? Pointer-based node representations are much more suitable for modifying Octrees than implicit node representations.
  3. Will the Octree be sparsely or densely populated? How much memory can be saved by supporting on-demand allocation of nodes? Is maybe a full Octree suitable?

  1. I. Gargantini. An Effective Way to Represent Quadtrees. Communications of the ACM, Volume 25, Issue 12, Dec. 1982, Pages 905-910.



Optimizing binary search

Introduction

Binary search finds the index of a specified value (also called search key) in a pre-sorted array. Binary search is a very efficient and elegant algorithm which is used in a large number of data structures and algorithms. It has a runtime complexity of O(log N), where N is the number of elements to be searched through, because in each step the remaining number of array elements is halved.

Even though the binary search algorithm is generally extremely fast, I recently designed a data structure where the binary search was the major bottleneck. Said data structure performs a vast number of successive binary searches on a very large, static array of integers. The array can have a size of up to 100 billion unsigned 32-bit integers, which corresponds to around 372.5 GiB of raw data. To search for the first occurrence of an arbitrary key in an array of 100 billion elements, a binary search touches floor(log2(N))+1 = 37 elements, with N = 100000000000.

This is not an article about how to implement the binary search algorithm. More information on that can be found in plenty of resources online. The implementation of the optimization technique described in the rest of this post can be found in the accompanying github repository.

Memory hierarchy

Considering the size of the data set, the first few array accesses are guaranteed to cause cache misses, or, even worse, page misses, depending on the amount of available main memory. On machines with little main memory a page miss requires the operating system to fetch the corresponding block containing the required array element from disk (obviously this would require some additional code logic). Accessing main memory is roughly 200x slower than accessing the first level cache. The random access time of a modern SSD is typically under 0.1 ms, which is roughly 1000x slower than accessing main memory. The random access times of spinning disks vary considerably and are usually in the order of a few milliseconds. Assuming a random access time of 5 ms, accessing a spinning disk is roughly 50000x slower than accessing main memory. Take these numbers with a grain of salt. They are by no means accurate and will differ considerably when measured on concrete systems. However, the overall orders of magnitude are correct and give an intuitive feeling for how costly array accesses can be.

Look-up tables

The previous paragraph should have cleared up why array accesses can be costly, especially when data is stored "far away" from the CPU. As a consequence it seems reasonable to try reducing the runtime costs of a binary search by reducing the number of array accesses needed for a search. This is a widely used optimization technique: an increased memory consumption is traded for increased performance.

The number of array accesses can be reduced by limiting the initial search range from left = 0, right = N-1 to anything smaller which is guaranteed to still contain the search key. The simple binary structure of unsigned integers allows us to use some of the high bits as an index into a look-up table (LUT) containing the reduced search range for the corresponding key. The more bits are used, the bigger the resulting speed-up, but also the more memory is needed. Using 16 bits requires only a 512 KB LUT; using 24 bits already requires a 128 MB LUT. As an example, let's consider a 16-bit LUT which maps the highest 16 bits of the search key. Each LUT entry maps to a sub-range in the value array that can contain up to 65536 different values.

0 [0x00000..0x0ffff] -> left = index of first element >= 0, right = index of last element < 65536
1 [0x10000..0x1ffff] -> left = index of first element >= 65536, right = index of last element < 131072
2 [0x20000..0x2ffff] -> left = index of first element >= 131072, right = index of last element < 196608
...

Let's consider a search for the key 114910, which is 0x0001c0de in hex. In order to obtain the LUT index for the key, a right-shift by 16 bits is performed: lutIdx = key>>16, which results in lutIdx = 0x0001 = 1. The sub-range stored in the LUT at element 1 is guaranteed to contain all values in the range 65536 <= key < 131072. Subsequently, the binary search, which takes the search key as well as the interval start and end as arguments, can be invoked by calling:

BinarySearch(key, Lut[lutIdx].left, Lut[lutIdx].right);
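For completeness, here's a minimal sketch of a range-restricted binary search based on std::lower_bound(). The signature of BinarySearch() in the repository may differ; this version simply returns the index of the first element >= key within the inclusive interval [left, right].

// requires <algorithm>, <cstdint> and <vector>
size_t BinarySearch(const std::vector<uint32_t> &vals,
                    uint32_t key, size_t left, size_t right)
{
    const auto first = vals.begin()+left;
    const auto last  = vals.begin()+right+1; // right is inclusive
    return (size_t)(std::lower_bound(first, last, key)-vals.begin());
}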

This is very little, simple and efficient code, but how is the LUT populated? It turns out that populating the LUT can be done very efficiently in O(N) with a single, sequential pass over the input array. Even though all elements must be touched once to populate the LUT, the sequential nature of the algorithm allows prefetching required data in advance and exploits the much better sequential read performance of spinning disks. The algorithm to populate the LUT is depicted in C++ below.

std::vector<std::pair<size_t, size_t>> Lut;

void InitLut(const std::vector<uint32_t> &vals, size_t lutBits)
{
    // pre-allocate memory for LUT
    const size_t lutSize = 1<<lutBits;
    const size_t shiftBits = 32-lutBits; // for 32-bit integers!
    Lut.resize(lutSize);

    // populate look-up-table
    size_t thresh = 0, last = 0;

    for (ssize_t i=0; i<(ssize_t)vals.size()-1; i++)
    {
        const size_t nextThresh = vals[i+1]>>shiftBits;
        Lut[thresh] = {last, i};

        if (nextThresh > thresh) // more than one LUT entry to fill?
        {
            last = i+1;
            for (size_t j=thresh+1; j<=nextThresh; j++)
                Lut[j] = {last, i+1};
        }

        thresh = nextThresh;
    }

    // set remaining thresholds that couldn't be found
    for (size_t i=thresh; i<Lut.size(); i++)
        Lut[i] = {last, vals.size()-1};
}

The LUT is implemented as an array of pairs. Each pair represents the interval the corresponding LUT entry maps to. To populate the LUT, the outer for-loop iterates over the pre-sorted array and calculates the number of LUT entries to fill between two successive array elements. The inner for-loop fills the LUT accordingly. As the start of each LUT entry can be derived from the end of the previous one, the last variable keeps track of the end of the last interval. The intervals are non-overlapping and the end of an interval is the beginning of the next interval minus one: Lut[i].second = Lut[i+1].first-1. As a consequence, the memory size of the LUT can be halved by storing only the interval starts and computing the interval ends as described. For simplicity, the depicted code omits this improvement, in contrast to the code in the accompanying github repository.
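For illustration, a look-up with the halved, starts-only LUT could look like the following sketch. LutStart and LookupRange() are hypothetical names; the sketch assumes the LUT holds one extra sentinel entry equal to vals.size(), so that each interval end can be derived from the start of the next entry.

// requires <cstdint>, <utility> and <vector>
std::vector<size_t> LutStart; // lutSize+1 entries, the last one is the sentinel

std::pair<size_t, size_t> LookupRange(uint32_t key, size_t shiftBits)
{
    const size_t lutIdx = key>>shiftBits;
    // end of the interval = start of the next interval minus one
    return {LutStart[lutIdx], LutStart[lutIdx+1]-1};
}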

Other data types

The presented LUT-based optimization only works for data types that are compatible with a simple bit-wise comparison. This means that more significant bits must have a bigger influence on the value's magnitude than less significant bits. So do signed integers and floats work as well? As long as they're positive there are no problems, because for any two positive signed integers or floats x and y, x <= y implies UIR(x) <= UIR(y), where UIR() denotes the binary unsigned integer representation. However, negative values cause problems. Hence, when creating the LUT and when doing a range look-up, the array elements and the search key must be mapped to an ordered range using a bijective mapping as described below.

Signed integers

Signed integers are stored in two's complement, in which the most significant bit (MSB) represents the sign: if the MSB is set the value is negative, otherwise it's positive. Therefore, when compared bit-wise, negative integers compare greater than positive ones. It turns out that simply flipping the sign bit fixes the ordering. It's important to note that flipping the sign bit is only sufficient if two's complement is used. If a signed magnitude representation is used, negative values are not automatically ordered correctly. In that case not only the sign bit but also all the remaining bits have to be flipped when the sign bit is set. For clarity, look at the following example of two's complement integers, where FSB() denotes the function that flips the sign bit: FSB(-6 = 11111010) = 01111010 = 122 < FSB(-5 = 11111011) = 01111011 = 123 < FSB(2 = 00000010) = 10000010 = 130. In contrast, if a signed magnitude representation is used, flipping the sign bit is not sufficient anymore, as the following example illustrates: -6 < -5, however FSB(-6 = 10000110) = 00000110 = 6 > FSB(-5 = 10000101) = 00000101 = 5.
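Translated into code, the mapping for two's complement signed integers is a one-liner. MapInt() is a hypothetical helper name; it would be applied to the array elements when building the LUT and to the search key when computing the LUT index.

// requires <cstdint>
uint32_t MapInt(int32_t val)
{
    // flip the sign bit so that negative values order below positive ones
    return (uint32_t)val^0x80000000u;
}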

Floating point

Floating point values (I'm talking about IEEE 754) are a bit nastier than signed integers. Their signed magnitude representation has exactly the same problem as signed magnitude represented integers, as outlined in the previous section. To overcome this problem, all the remaining bits have to be flipped as well if the sign bit is set. For further information read Michael Herf's article about radix sorting floats on Stereopsis. One possible implementation in C++ is:

// taken from Michael Herf's article on Stereopsis
uint32_t MapFloat(float val)
{
    // 1) flip sign bit
    // 2) if sign bit was set flip all other bits as well
    const uint32_t cv = (uint32_t &)val;
    const uint32_t mask = (-(int32_t)(cv>>31))|0x80000000;
    return cv^mask;
}

Results

I analyzed the performance gain of the LUT optimization by searching 10 million times through an array of 1 billion 32-bit integers. The integers were uniformly distributed in the interval [0, 0xffffffff]. In each search a randomly chosen key from the data set was used. The performance was measured on a laptop with a Core(TM) i7-3612QM CPU (Ivy Bridge) at 2.1 GHz with 6 MB L3 cache and 8 GB RAM. The results of the standard binary search and the LUT-optimized binary search for different LUT sizes are depicted in the table below. The speed-up is relative to the C++ standard binary search function std::lower_bound(), not to my own implementation used for the range-reduced binary search.

Algorithm            LUT size          Time needed  Searches/second  Speed-up
std::lower_bound()   -                 9.373 secs   1.07 M           1.00x
LUT binary search    8-bit (2 KB)      8.592 secs   1.16 M           1.09x
LUT binary search    16-bit (512 KB)   3.873 secs   2.58 M           2.42x
LUT binary search    24-bit (128 MB)   1.990 secs   5.03 M           4.71x

The 16-bit LUT binary search is almost 2.5x faster, and the 24-bit LUT binary search is even almost 5x faster than C++'s standard binary search implementation. The amount of additional code required is negligible, and the LUT memory size can be adjusted granularly via the number of bits used to index the LUT. Thanks to the O(N) complexity of the LUT population algorithm, the discussed optimization is even feasible in situations where the underlying value array changes from time to time. Especially in situations where parts of the data set are resident only on hard disk, the reduced number of array accesses saves a considerable amount of work. By using simple mappings the LUT optimization can be applied to signed integer and even floating point values as well. For a sample implementation check out the accompanying github repository.