Advanced Octrees 4: finding neighbor nodes

Introduction

Welcome to part four of the the Advanced Octrees series. Here you can find part one, two and three. In this part I show you how to access neighbor nodes in Quadtrees. The algorithm can be naturally extended to Octrees, but it’s easier to understand and code in just two dimensions. Neighbors are such nodes that either share an edge in horizontal (west, east) or vertical (north, south) direction, or that share a vertex in one of the four diagonal directions (north-west, north-east, south-west, south-east). See the figure below for an illustration. The node of which we want to find the neighbors is colored green. In the following this node is called source node. The neighbor nodes are colored red and labeled with the corresponding direction.

Different neighbor directions

In the literature Quadtree neighbor finding algorithms are distinguished by two properties:

Does the algorithm operate on pointer-based Quadtrees or on linear Quadtrees? Read up on different Quadtree representations in part two of this series. The algorithm discussed in this post is based on a pointer representation.
Can the algorithm only find neighbors of equal or greater size, or can it also find neighbors of smaller size. We’ll see later that finding neighbors of smaller size is somewhat harder. Greater / equal / smaller refers to the difference in tree level between the source node and the neighbor nodes.

The figure below shows examples of neighbor relationships in north direction between nodes on different tree levels. Note, that a node can have more than one neighbor per direction if the sub-tree on the other side of the north edge is deeper than the sub-tree of the source node.

Examples for neighbor relationship

Algorithm

The input to the algorithm is a direction and an arbitrary Quadtree node (inner of leaf). The output is a list of neighbor nodes in the given direction. The neighbors can be of greater size, equal size or smaller size (see previous figure). The algorithm is symmetric with respect to the direction. That’s why I only show an implementation for the north direction. It should be easy to extend the implementation to the other seven directions.

The neighbor finding happens in two steps. In step 1 we search for a neighbor node of greater or equal size (yellow). If such a node can be found we perform the second step, in which we traverse the sub-tree of the neighbor node found in step 1 to find all neighbors of smaller size. The two steps are illustrated in the figure below.

Two steps of algorithm

So how to find the neighbor of greater or equal size? Starting from the source node the algorithm traverses up the tree as long as the current node is not a south child of the parent, or the root is reached. After that the algorithm unrolls the recursion descending downwards always picking the south node. The node the recursion stops at is the north neighbor of the source node. This node is always on the same tree level as the source node, or further up the tree. Because the source node can be in the west or in the east, there’s a case distinction necessary to handle both possibilities.

Below is an implementation in Python. The code uses a pointer based Quadtree data structure, in which every node consists of a parent and four children. The children array is indexed using the constants NW, NE, SW and SE. The full code can be found on Github.

def get_neighbor_of_greater_or_equal_size(self, direction):   
  if direction == self.Direction.N:       
    if self.parent is None: # Reached root?
      return None
    if self.parent.children[self.Child.SW] == self: # Is 'self' SW child?
      return self.parent.children[self.Child.NW]
    if self.parent.children[self.Child.SE] == self: # Is 'self' SE child?
      return self.parent.children[self.Child.NE]

    node = self.parent.get_neighbor_of_greater_or_same_size(direction)
    if node is None or node.is_leaf():
      return node

    # 'self' is guaranteed to be a north child
    return (node.children[self.Child.SW]
            if self.parent.children[self.Child.NW] == self # Is 'self' NW child?
            else node.children[self.Child.SE])
  else:
    # TODO: implement symmetric to NORTH case

In step 2 we use the neighbor node obtained in step 1 to find neighbors of smaller size by recursively descending the tree in south direction until we reach the leaf nodes. On the way we always keep both, the south west and the south east node, because any node lower in the tree than the neighbor node from step 2 is smaller than the source node and therefore a neighbor candidate.

def find_neighbors_of_smaller_size(self, neighbor, direction):   
  candidates = [] if neighbor is None else [neighbor]
  neighbors = []

  if direction == self.Direction.N:
    while len(candidates) > 0:
      if candidates[0].is_leaf():
        neighbors.append(candidates[0])
      else:
        candidates.append(candidates[0].children[self.Child.SW])
        candidates.append(candidates[0].children[self.Child.SE])

      candidates.remove(candidates[0])

    return neighbors
  else:
    # TODO: implement other directions symmetric to NORTH case

Finally, we combine the functions from step 1 and 2 to get the complete neighbor finding function.

def get_neighbors(self, direction):   
  neighbor = self.get_neighbor_of_greater_or_equal_size(direction)
  neighbors = self.find_neighbors_of_smaller_size(neighbor, direction)
  return neighbors

Complexity

The complexity of this algorithm is bounded by the depth of the Quadtree. In the worst-case the source node and its neighbors are on the last tree level but their first common ancestor is only the root node. This is the case for any node that lies directly below/above or left/right of the line splitting the root node horizontally/vertically. See the following picture for an illustration.

Scenario with Worst-case runtime

October 29, 2017
by David Leave a comment

Custom deleters for smart pointers in modern C++

Introduction

Many C++ programmers are not aware of a big difference in how the custom deleters of std::unique_ptr and std::shared_ptr are implemented in the C++ standard library. std::unique_ptr carries the custom deleter as part of its type (template<class T, class Deleter> std::unique_ptr). In contrast, the custom deleter of std::shared_ptr is not part of the type (template<class T> std::shared_ptr) but part of the constructor’s template argument list (template<class Y, class Deleter> shared_ptr(Y *ptr, Deleter d)). Mostly, this difference doesn’t matter much. Though, there are use-cases, like e.g. factories returning std::unique_ptr with a custom deleter, were the difference does matter.

Design choices

`std::unique_ptr`

The advantage of making the custom deleter part of std::unqiue_ptr‘s type is that, as long as the deleter is stateless (e.g. a lambda that doesn’t capture anything or a function with no member variables), storing it doesn’t take up any additional memory thanks to the empty base optimization. This makes std::unique_ptr a zero-overhead abstraction, which means that:

Its size is identical to the size of a raw pointer on the underlying architecture.
All calls to the deleter can be inlined.

One possible implemention which makes use of the empty base optimization is to store the wrapped pointer together with the deleter in a compressed pair. The obvious disadvantage of making the custom deleter part of the type is that two std::unique_ptrs with different custom deleters are of different type, even if they wrap the same pointer type.

`std::shared_ptr`

In contrast to std::unique_ptr, std::shared_ptr provides the convinience of a type erased deleter. Type erased means that the type of the custom deleter is not dragged into std::shared_ptr‘s type. Hence, one cannot know by just looking at the type if two std::shared_ptr instances have different custom deleters or not.
The type erasure makes std::shared_ptr more flexible. For example changing the allocation strategy of a factory, and with it the custom deleter of the returned std::shared_ptrs, doesn’t break source/binary compatibility and thereby, doesn’t require any recompilation of client software.
The drawback is that storing the custom deleter takes up additional memory, because some wrapper (e.g. std::function or a raw function pointer) is needed to store the custom deleter. The rationale behind this design choice is that std::shared_ptr must anyways heap allocate memory for its shared control block, containing the wrapped pointer and the reference counter. Additionally including the custom deleter didn’t seem like a big cost, taking the increased flexiblity into account.

Type erased custom deleters with `std::unique_ptr`

Imagine you’re building an object factory which returns std::unique_ptrs. The return type of the factory’s Create() function must allow casting instances of different derived classes to the same std::unique_ptr type. One way to do that is to use a std::unique_ptr to the base class. This, however, requires the base class’ destructor to be virtual. What if the destructor cannot be virtual for some reason or the implications for source and binary compatibility are limiting?

An alternative is to create a type erased custom deleter for std::unique_ptr by wrapping the deleter e.g. in an std::function. The wrapped function is then responsible for casting the void * to the correct type when deleting it. This construction works for virtual and non-virtual classes as well as for multiple inheritance, because the deleter casts the void * argument containing the address to the most derived class simply back to the type of the most derived class.

template<typename Type>
void MyDelete(void *ptr) // Casts 'ptr' to real type and deletes it
{
    delete static_cast<Type *>(ptr);
}

auto Factory::Create(int typeId)
{
    // Unique pointer with type erased custom deleter
    using UniquePtr = std::unique_ptr<Base, std::function<void(void *)>>;

    switch (typeId)
    {
    case 0: return UniquePtr(new Derived0, MyDelete<Derived0>); 
    case 1: return UniquePtr(new Derived1, MyDelete<Derived1>);
    // ...
   }
}

The applied type erasure doesn’t come for free. There are two penalties to pay:

Destroying the pointer cannot be inlined anymore and therefore always requires an additional function call.
Additional memory is required to store the deleter.

It turns out that the std::function wrapper increases the memory footprint of the std::unique_ptr type considerably (32 bytes with GCC 5.3.1’s libc++ and 64 bytes with Visual C++ 2015, both 64 bit). Luckily, we can use a simple function pointer to reduce the total size of the final std::unique_ptr to 16 bytes.

using UniquePtr = std::unique_ptr<Base, void(*)(void *)>;

December 3, 2016
by David Leave a comment

Scalable spinlocks 1: array-based

Last time we saw that spinlock implementations which only use a single synchronization variable (Test-And-Set Lock, Ticket Lock) don’t scale with growing numbers of threads. Today, I want to talk about two spinlock variants that scale. Namely the Graunke and Thakkar Lock¹ (1989) and the Anderson Lock² (1990). Their underlying key idea is to use one synchronization variable per thread instead of one for all threads, to reduce the amount of cache line invalidations when acquiring/releasing the lock. Both spinlock variants store the synchronization variables in an array. This means that there’s an upper bound on the maximum number of thread’s that can compete for the lock concurrently, which must be known a priori. In upcoming blog posts I’m going to show spinlock variants (the MCS Lock and the CLH Lock) that improve upon array-based spinlocks by removing this hard upper limit.

Anderson’s lock

Anderson’s spinlock is basically a FIFO queue of threads. It’s implemented as an array, which is accessed via an atomic ring-buffer-style index counter. Hence, it requires an atomic fetch-and-add (std::atomic::fetch_add in C++11) operation to advance the ring-buffer index. The basic idea is that each thread spins on its own synchronization variable while it’s = true. The synchronization variable of the next thread in line is eventually set to false when the previous thread leaves the critical section (CS).

Below is an implementation of Anderson’s spinlock in C++11. Note, that this implementation is for illustrative purpose only. It’s not optimized for performance. The array of synchronization variables LockedFlags is initially set to [false, true, true, ..., true]. The first element is set to false, so that the very first thread in line can enter the critical section: for the first thread there’s no previous thread which can enable the first thread to enter the CS. After successfully entering the CS the synchronization variable is false and needs to be reset to true for future threads. It must be taken care for that NextFreeIdx doesn’t overflow. If the maximum number of threads is a power-of-2 integer arithmetic will just work fine, because the overflow matches up with the modulo computation. Otherwise, NextFreeIdx must be kept in check to avoid overflows.

Anderson’s lock requires $O(L*T)$ memory to store $L$ locks with a maximum thread count of $T$ .

class AndersonSpinLock
{
public:
    AndersonSpinLock(size_t maxThreads=std::thread::hardware_concurrency()) :
        LockedFlags(maxThreads)
    {
        for (auto &flag : LockedFlags)
            flag.first = true;

        LockedFlags[0].first = false;
    }

    ALWAYS_INLINE void Enter()
    {
        const size_t index = NextFreeIdx.fetch_add(1)%LockedFlags.size();
        auto &flag = LockedFlags[index].first;

        // Ensure overflow never happens
        if (index == 0)
            NextFreeIdx -= LockedFlags.size();

        while (flag)
            CpuRelax();

        flag = true;
    }

    ALWAYS_INLINE void Leave()
    {
        const size_t idx = NextServingIdx.fetch_add(1);
        LockedFlags[idx%LockedFlags.size()].first = false;
    }

private:
    using PaddedFlag = std::pair<std::atomic_bool,
                                 uint8_t[CACHELINE_SIZE-sizeof(std::atomic_bool)]>;
    static_assert(sizeof(PaddedFlag) == CACHELINE_SIZE, "");

    alignas(CACHELINE_SIZE) std::vector<PaddedFlag> LockedFlags;
    alignas(CACHELINE_SIZE) std::atomic_size_t      NextFreeIdx = {0};
    alignas(CACHELINE_SIZE) std::atomic_size_t      NextServingIdx = {1};
};

Graunke and Thakkar’s lock

In contrast to Anderson’s ring-buffer-based lock, Graunke and Thakkar use a linked list to establish a FIFO queue of threads attempting to acquire the lock. Hence, an atomic fetch-and-store operation (std::atomic::exchange in C++11) is needed to atomically update the tail of the queue. Like Anderson’s lock, every thread spins on a different synchronization variable to reduce cache traffic due to cache line invalidations. A thread trying to enter the CS spins on the synchronization variable of the thread in line before it-self by looking at the tail pointer of the wait queue. Every thread is responsible for flipping its synchronization variable when it leaves the CS, which in turn enables the next thread in line to enter.

Below is an implementation of Graunke and Thakkar’s spinlock in C++11. Note, that this implementation is for illustrative purpose only. It’s neither optimized for performance, nor does it work if threads are recreated because of how thread indices are assigned. In their original implementation – which is reflected below – the synchronization flag is not reset after successfully entering the CS. That’s why additionally the old value of the flag must be stored along with the flag pointer in the queue. The spin loop compares the current flag value with the old flag value to see if the previous thread in line has already unlocked the CS by flipping its synchronization variable. The approach used in Anderson’s lock would simplify the code significantly: the old flag value wouldn’t have to be stored anymore in the tail, which means we could get rid of all the binary arithmetic to crunch in and extract the old flag value from the tail.

Graunke and Thakkar’s lock requires $O(L*T)$ memory to store $L$ locks with a maximum thread count of $T$ .

class GraunkeAndThakkarSpinLock
{
public:
    GraunkeAndThakkarSpinLock(size_t maxThreads=std::thread::hardware_concurrency()) :
        LockedFlags(maxThreads)
    {
        for (auto &flag : LockedFlags)
            flag.first = 1;

        assert(Tail.is_lock_free());
        Tail = reinterpret_cast(&LockedFlags[0].first);
        // Make sure there's space to store the old flag value in the LSB
        assert((Tail&1) == 0);
    }

    ALWAYS_INLINE void Enter()
    {
        // Create new tail by chaining my synchronization variable into the list
        const auto &newFlag = LockedFlags[GetThreadIndex()].first;
        const auto newTail = reinterpret_cast(&newFlag)|static_cast(newFlag);
        const auto ahead = Tail.exchange(newTail);

        // Extract flag and old value of previous thread in line,
        // so that we can wait for its completion
        const auto *aheadFlag = reinterpret_cast(ahead&(~static_cast(1)));
        const auto aheadValue = static_cast(ahead&1);

        // Wait for previous thread in line to flip my synchronization variable
        while (aheadFlag->load() == aheadValue)
            CpuRelax();
    }

    ALWAYS_INLINE void Leave()
    {
        // Flipping synchronization variable enables next thread in line to enter CS
        auto &flag = LockedFlags[GetThreadIndex()].first;
        flag = !flag;
    }

private:
    ALWAYS_INLINE size_t GetThreadIndex() const
    {
        static std::atomic_size_t threadCounter = {0};
        thread_local size_t threadIdx = threadCounter++;
        assert(threadIdx < LockedFlags.size());
        return threadIdx;
    }

private:
    using PaddedFlag = std::pair<std::atomic_uint16_t,
                                 uint8_t[CACHELINE_SIZE-sizeof(std::atomic_uint16_t)]>;
    static_assert(sizeof(PaddedFlag) == CACHELINE_SIZE, "");

    // In the LSB the old value of the flag is stored
    alignas(CACHELINE_SIZE) std::atomic<uintptr_t>  Tail;
    alignas(CACHELINE_SIZE) std::vector<PaddedFlag> LockedFlags;

    static_assert(sizeof(decltype(LockedFlags)::value_type) > 1,
                  "Flag size > 1 required: thanks to alignment, "\
                  "old flag value can be stored in LSB");
};

Comparison

The major difference between Anderson’s and Graunke and Thakkar’s spinlock implementations is how synchronization variables are assigned to threads. In Graunke and Thakkar’s lock there’s a 1:1 mapping from synchronization variables to threads. Always the same synchronization variable is assigned to the same thread. As a consequence the lock can be implemented in such a way that synchronization variables are stored in memory locations local to the spinning thread. This allows the lock to perform well on even on cache-less NUMA machines.

In contrast, Anderson’s lock has no control over the memory location in which a thread’s synchronization variable is stored. The synchronization variable assigned to a thread only depends on the order in which threads attempt to acquire the lock. Therefore, the Anderson lock performs much worse on cache-less machines, because it’s entirely possible that a thread spins on a synchronization variable in a remote memory location. On all modern mainstream CPUs this difference has marginal effect, because the CPUs have caches into which the synchronization variables can migrate during spinning.

G. Graunke, S. Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. IEEE Computer, Vol. 23, No. 6, (1990), pp. 60-69 ↩
T. E. Anderson. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 1. (1990), pp. 6-16 ↩

April 9, 2016
by David 6 Comments

The ticket spinlock

Last time we saw how a spinlock can be implemented using an atomic test-and-set (TAS) operation on a single shared synchronization variable. Today, I want to talk about another spinlock variant called Ticket Lock. The Ticket Lock is very similar to TAS-based spinlocks in terms of scalability, but supports first-in-first-out (FIFO) fairness.

As the name suggests, the Ticket Lock employs the same concept as e.g. hair dressers do to serve their customers in the order of arrival. On arrival customers draw a ticket from a ticket dispenser which hands out tickets with increasing numbers. A screen displays the ticket number served next. The customer holding the ticket with the number currently displayed on the screen is served next.

Implementation

The C++11 implementation below uses two std::atomic_size_t variables as counters for the ticket number currently served (ServingTicketNo) and the ticket number handed out to the next arriving thread (NextTicketNo). The implementation is optimized for x86 CPUs. The PAUSE instruction (called from CpuRelax()) is used when spin-waiting and both counters are cache line padded to prevent false sharing. Read the previous article on TAS-based locks for more information.

class TicketSpinLock
{
public:
    ALWAYS_INLINE void Enter()
    {
        const auto myTicketNo = NextTicketNo.fetch_add(1, std::memory_order_relaxed);

        while (ServingTicketNo.load(std::memory_order_acquire) != myTicketNo)
            CpuRelax();
    }

    ALWAYS_INLINE void Leave()
    {
        // We can get around a more expensive read-modify-write operation
        // (std::atomic_size_t::fetch_add()), because no one can modify
        // ServingTicketNo while we're in the critical section.
        const auto newNo = ServingTicketNo.load(std::memory_order_relaxed)+1;
        ServingTicketNo.store(newNo, std::memory_order_release);
    }

private:
    alignas(CACHELINE_SIZE) std::atomic_size_t ServingTicketNo = {0};
    alignas(CACHELINE_SIZE) std::atomic_size_t NextTicketNo = {0};
};

static_assert(sizeof(TicketSpinLock) == 2*CACHELINE_SIZE, "");

Overflow is pretty much impossible when 64-bit counters are used. But what happens when the counters are of smaller bit width and eventually overflow? It turns out that overflow is safe, as long as the number of threads using the lock is less than or equal to the value range representable by the counter’s underlying integer type (e.g. 256 for 8-bit counters). Let’s consider a 3-bit integer, which can represent values from 0 to 7. Overflow is safe as long as the there are never more than 8 threads competing for the lock, because the condition ServingTicketNo != myTicketNo is guaranteed to be always only false for the next thread in line. If there were 9 or more threads, the NextTicketNo counter could reach the same value ServingTicketNo has and accordingly two threads could enter the critical section (CS) at the same time. The figure below illustrates the case where 8 threads are competing for the lock. Just one more competing thread could cause multiple threads entering the CS at the same time.

   ServingTicketNo   NextTicketNo
          |             |
          V             V
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 ...
          \_____________/
          Max. 8 threads

No second thread may grab another NextTicketNo=5!

The above means that we can crunch down our implementation in terms of memory footprint by removing the padding and using 8- or 16-bit counters, depending on the expected maximum number of threads. Removing the padding comes at the cost of potential false sharing issues.

Proportional back-off

Under high contention the Test-And-Test-And-Set Lock backs-off exponentially. The reduced amount of memory bus traffic improves performance. What about using the same backing-off strategy with Ticket Locks? It turns out that this is a very bad idea, because the FIFO order causes the back-off delays to accumulate. For example, the thread with myTicketNo == ServingTicketNo+3 must always wait as least as long as the thread with myTicketNo == ServingTicketNo+2 and the thread with myTicketNo == ServingTicketNo+2 must always wait as least as long as the thread with myTicketNo == ServingTicketNo+1.

Is there something else we can do? It turns out we can thanks to the FIFO order in which threads are granted access to the CS. Every thread can calculate how many other threads are going to be granted access to the CS before it-self as numBeforeMe = myTicketNo-ServingTicketNo. With this knowledge and under the assumption that every thread holds the lock approximately for the same duration, we can back-off for the number of threads in line before us times some constant BACKOFF_BASE. BACKOFF_BASE is the expected average time that every thread spends inside the CS. This technique is called proportional back-off.

ALWAYS_INLINE void PropBoTicketSpinLock::Enter()
{
    const auto myTicketNo = NextTicketNo.fetch_add(1, std::memory_order_relaxed);

    while (true)
    {
        const auto servingTicketNo = ServingTicketNo.load(std::memory_order_acquire);
        if (servingTicketNo == myTicketNo)
            break;

        const size_t numBeforeMe = myTicketNo-servingTicketNo;
        const size_t waitIters = BACKOFF_BASE*numBeforeMe;

        for (size_t i=0; i<waitIters; i++)
            CpuRelax();
    }
}

Wrap up

How does the Ticket Lock compare against TAS-based locks and in which situations which of the two locks variants is preferably used? The Ticket Lock has the following advantages over TAS-based locks:

The Ticket Lock is fair, because the threads are granted access to the CS in FIFO order. This prevents the same thread from reacquiring the lock multiple times in a row, which – at least in theory – could starve other threads.
In contrast to all TAS-based locks the Ticket Lock avoids the Thundering Herd problem, because waiting for the lock and acquiring it doesn’t require any read-modify-write or store operation.
In contrast to the Test-And-Set Lock only a single atomic load operation is repeatedly executed while waiting for the lock to become available. Though, this problem is solved by the TTAS Lock.

The biggest disadvantage of the Ticket Lock is that the fairness property backfires once there are more threads competing for the lock than there are CPU cores in the system. The problem is that in that case the thread which can enter the CS next might be sleeping. This means that all other threads must wait, because of the strict fairness guarantee. This property is sometimes referred to as preemption intolerance.

Furthermore, the Ticket Lock doesn’t solve the scalability issue of TAS-based spinlocks. Both spinlock variants don’t scale well, because the number of cache line invalidations triggered when acquiring/releasing the lock is O(#threads). There are scalable lock implementations where all threads spin on different memory locations. These spinlock variants only trigger O(1) many cache line invalidations. Next time we’ll look into scalable spinlock variants.

March 23, 2016
by David 2 Comments

Test-and-set spinlocks

In my previous blog post I wrote about the most important properties of spinlocks and why they matter. This time I’ll present the first out of three concrete spinlock implementations in C++11 for x86 processors. I’m assuming that you’ve read the first post. You can find all source code in this Github repository.

Introduction

The Test-And-Set (TAS) Lock is the simplest possible spinlock implementation. It uses a single shared memory location for synchronization, which indicates if the lock is taken or not. The memory location is updated using a test-and-set (TAS) operation. TAS atomically writes to the memory location and returns its old value in a single indivisible step. Actually, the name test-and-set is a little misleading, because the caller is responsible for testing if the operation has succeeded or not. The TAS Lock is not fair as it doesn’t guarantee FIFO ordering amongst the threads competing for the lock.

In C++11 std::atomic_bool::exchange() can be used to perform a TAS operation on an std::atomic_bool synchronization variable. On x86 CPUs std::atomic::exchange() is turned into the LOCK XCHG instruction (side note: the LOCK prefix is implicit and not strictly required, because there’s no non-atomic version of XCHG). The following code implements the described TAS Lock.

class TasSpinLock
{
public:
    ALWAYS_INLINE void Enter()
    {
        // atomic_bool::exchange() returns previous value of Locked
        while (Locked.exchange(true, std::memory_order_acquire) == true);
    }

    ALWAYS_INLINE void Leave()
    {
        Locked.store(false, std::memory_order_release);
    }

private:
    alignas(CACHELINE_SIZE) std::atomic_bool Locked = {false};
};

static_assert(sizeof(TasSpinLock) == CACHELINE_SIZE, "");

The TasSpinLock::Locked flag is cache line size padded using alignas to prevent false sharing. You can remove the padding, e.g. in memory-limited environments, when false sharing is not an issue.

The test-and-test-and-set optimization

While the TAS Lock is extremely easy to implement, its scalability is very bad. Already with just a few threads competing for the lock, the amount of required cache line invalidations to acquire/release the lock quickly degrades performance. The problem is two-fold.

std::atomic_bool::exchange() always invalidates the cache line Locked resides in, regardless of whether it succeeded in updating Locked or not.
When the spinlock is released, all waiting threads simultaneously try to acquire it. This is sometimes referred to as Thundering Herd problem. Acquiring the lock means first invalidating the cache line copy of all threads waiting for the lock and then reading the valid cache line copy from the core which released the lock. With t threads this results in O(t) memory bus transactions.

The latter issue is inherent to TAS Locks, because just a single synchronization variable is used and shared between all threads. However, the first issue can be fixed by testing if the lock is free before calling exchange(). First testing if the lock is free only loads the synchronization variable and therefore doesn’t cause a cache line invalidation. This spinlock variant is called Test-And-Test-And-Set (TTAS) Lock. The improved Enter() method looks like this:

ALWAYS_INLINE void TTasSpinLock::Enter()
{
    do
        WaitUntilLockIsFree();
    while (Locked.exchange(true, std::memory_order_acquire) == true);
}

ALWAYS_INLINE void TTasSpinLock::WaitUntilLockIsFree() const
{
    while (Locked.load(std::memory_order_relaxed) == true);
}

Exponential back-off

The TTAS Lock successfully reduces the amount of cache line invalidations occurring while the lock is trying to be acquired. However, as soon as the lock is released, again all threads try to update the Locked flag. If a thread sees that the lock is free, but fails acquiring it subsequently because some other thread was faster, the lock is most probably under high contention. In this situation it can help to wait for a short time to let other threads finish before trying to enter the critical section (CS) again.

Waiting a short time without trying to acquire the lock reduces the number of threads simultaneously competing for the lock. The larger the number of unsuccessful retries, the higher the lock contention and the longer the thread should back-off. Experiments have shown that a good strategy is to back-off exponentially; similar to the congestion avoidance mechanism used in the Ethernet protocol. To avoid threads running in lock step the initial wait time is randomized.

ALWAYS_INLINE static void BackoffExp(size_t &curMaxIters)
{
    assert(curMaxIters > 0);

    thread_local std::uniform_int_distribution<size_t> dist;
    thread_local std::minstd_rand gen(std::random_device{}());
    const size_t spinIters = dist(gen, decltype(dist)::param_type{0, curMaxIters});

    curMaxIters = std::min(2*curMaxIters, MAX_BACKOFF_ITERS);
    for (volatile size_t i=0; i<spinIters; i++); // Avoid being optimized out!
}

ALWAYS_INLINE void ExpBoTTasSpinLock::Enter()
{
    size_t curMaxIters = MIN_BACKOFF_ITERS;

    while (true)
    {
        // Not strictly required but doesn't hurt
        WaitUntilLockIsFree();

        if (Locked.exchange(true, std::memory_order_acquire) == true)
            BackoffExp(curMaxIters); // Couldn't acquire lock => back-off
        else
            break; // Acquired lock => done
    }
}

The downside of waiting is that it renders the lock even unfairer than it is already. Threads that have been attempting to acquire the lock longest are also backing-off longest. Consequently, newly arriving threads have a higher chance to acquire the lock than older threads. On top of that threads might back-off for too long, causing the CS to be underutilized. Furthermore, there are unfortunately no single best values for MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS. Optimally, they’re determined experimentally depending on the workload (contention, size of CS).

Pausing and sleeping

There are two more improvements we can do. First, according to the Intel Architectures Optimization Reference Manual, adding a PAUSE instruction to the body of all spin loops improves performance on Pentium 4 CPUs and later. The PAUSE instruction provides a hint to the CPU that the code being executed is a spin-wait loop. PAUSE is backwards compatible and turns into a NOP instruction on older CPUs. There are three reasons why using PAUSE improves performance:

It avoids memory order violations which result in expensive pipeline flushes, because the CPU doesn’t go ahead to fill its pipeline with speculatively executed load and compare instructions.
It gives the other hardware thread time to execute on simultaneous multi-threading (SMT) processors (e.g. Intel’s Hyper Threading).
It adds a slight delay to the spin loop, which synchronizes it with the system’s memory bus frequency. This frequency is typically much lower than the frequency of the CPU, which in turn reduces power consumption considerably.

The second improvement is to stop spinning, put the thread to sleep and reschedule if the the lock doesn’t become available after some time. The important bit here is to not yield (sched_yield() on Linux, SwitchToThread() on Windows, or more generally std::this_thread::yield() in C++11), but to explicitly put the thread to sleep for a short period of time. This ensures that the thread is really paused for some time and isn’t immediately run again by the scheduler if there’s no other thread available in the scheduler’s run queue. This is especially important on SMT processors where every two logical cores share most of the execution units. Only when the thread is really sleeping the other logical core can fully make use of the shared execution units. The following TTAS Lock implementation uses the PAUSE instruction and sleeps for 500us if the lock isn’t released within MAX_WAIT_ITERS iterations. There’s no single best value for MAX_WAIT_ITERS. Ideally, it’s chosen experimentally like the MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS constants from before.

ALWAYS_INLINE static void CpuRelax()
{
#if (COMPILER == MVCC)
    _mm_pause();
#elif (COMPILER == GCC || COMPILER == LLVM)
    asm("pause");
#endif
}

ALWAYS_INLINE static void YieldSleep()
{
    // Don't yield but sleep to ensure that the thread is not
    // immediately run again in case scheduler's run queue is empty
    using namespace std::chrono;
    std::this_thread::sleep_for(500us);
}

ALWAYS_INLINE static void BackoffExp(size_t &curMaxIters)
{
    ... Unchanged code not repeated, look above ...

    for (size_t i=0; i<spinIters; i++)
        CpuRelax();
}

ALWAYS_INLINE void ExpBoRelaxTTasSpinLock::WaitUntilLockIsFree() const
{
    size_t numIters = 0;

    while (Locked.load(std::memory_order_relaxed))
    {
        if (numIters < MAX_WAIT_ITERS)
        {
            CpuRelax();
            numIters++;
        }
        else
            YieldSleep();
    }
}

Performance comparison

The benchmark used to compare the different TAS Lock variants is extremely simple and should be taken with a grain of salt. It launches t threads, each competing for the lock and incrementing a global atomic counter inside the CS. This causes high lock contention, because all threads are trying to enter the lock at the same time. Though, it’s important to keep in mind that due to the lock’s lack of fairness, it’s entirely possible that the same thread acquires the lock multiple times in a row. I ran the benchmark on a dual socket NUMA system with two Intel Xeon E5-2630 v2 CPUs with 2.60 GHz. The CPUs support Hyper Threading which gives a total of 12 physical and 24 logical cores. The benchmark results are shown below.

TAS Lock benchmark results

The most interesting finding is that the amount of lock reacquisitions is a lot higher than I anticipated. This can be derived from the fact that all TTAS-based locks scale almost linearly, where typically the observed drop in performance for TAS-based locks is exponential. Large numbers of lock reacquisitions by the same thread reduces the amount of cache line invalidations and increases throughput. At the same time the latency of all other threads goes up, because they’re granted the lock later.

The TasSpinLock scales so much worse, even though lock reacquisiton happens, because while spinning the TAS operation causes cache line invalidations also if it doesn’t succeed in acquiring the lock. By looking at the difference between the TasSpinLock and the other TTAS-based versions, it’s highly visible how big the influence of cache line invalidations is on the lock’s performance.

Wrapup

The exponential back-off TTAS Lock tends to perform fairly well as long as the number of competing threads is low or lock reacquisitions boost throughput while increasing latency. It’s especially useful when it’s not known a priori that the number of competing threads is bounded by the number of cores in the system (in that case better scaling spinlock variants perform really badly as we’ll see in the next post). Beyond that, without padding the TTAS Lock just needs a single byte. The compact representation is useful in extremely memory-constrained environments or applications that require vast amounts of spinlocks; e.g. for fine-grained locking of lots of small pieces of data.

March 12, 2016
by David 1 Comment

Important properties of spinlocks

Typically, when a thread tries to acquire an already locked mutex it will go to sleep and give up its time slice, allowing another thread to run immediately instead. If the duration for which the mutex is held is extremely short, the time for putting the waiting thread to sleep and waking it up again to retry can easily be longer. In such cases spinlocks can be used to increase performance, by continuously trying to acquire the lock without sleeping. Side note: Obviously, it makes no sense to use spinlocks in systems with a single CPU core. Polling for the spinlock to become available would only prevent the single available CPU core from running the other thread to release the lock.

To deep dive into the multitude of spinlock implementations, it’s crucial to understand what their key characteristics are. Generally, spinlocks can be grouped into fair and unfair as well as scalable and unscalable implementations. In addition, we look at the spinlock’s memory footprint. The three properties are discussed in more detail in the rest of this post.

For all the nitty-gritty architectural details of concrete C++11 spinlock implementations read the upcoming blog posts about Test-And-Set (TAS) Locks, Ticket Locks and MCS Locks. I’ll upload the source code to this Github repository.

Scalability

Existing spinlock variants have very different performance characteristics under contention. A lock is said to be contented when there are multiple threads concurrently trying to acquire the lock while it’s taken. The performance of unscalable spinlock implementations degrades as the number of threads trying to acquire the lock concurrently goes up. The performance of simple Ticket or TAS Locks for example drops exponentially with increasing numbers of threads. In contrast, scalable spinlock implementations – like e.g. the MCS Lock – exhibit no degradation in performance even for large numbers of threads. The scalability graph of the optimal spinlock vs the Ticket Lock is depicted in the graph below.

Scalability of optimal spinlock

Crucial for the scalability of spinlocks is the amount of cache line invalidations which take place when the lock is acquired and released. Cache line invalidations take place when e.g. a thread modifies a synchronization variable while leaving a critical section (CS) on which other threads are polling at the same time. Suddenly, cache line copies stored in caches of other CPU cores are not valid anymore. Accordingly, these cache lines must be invalidated and reobtained from the CPU core’s cache which modified it. Thus, the number of cache line invalidations which occur when the spinlock is acquired/released is often used as a metric to compare spinlock implementations with each other. Usually, scalable spinlocks require O(1) many cache line invalidations to acquire/release the lock, while unscalable spinlocks require O(#threads) many.

Advanced scalable spinlock implementations may not only take the total number of CPU cores in a system into account, but also the details of the underlying cache coherency mechanism. For example in systems with multiple CPUs and non-uniform memory access times (NUMA), inter-core communication costs may vary greatly. For example in a NUMA system with multiple multi-core CPUs it’s entirely possible to pass a lock to another core within in the same CPU in less time than to a core of another CPU.

Fairness

Fair spinlocks guarantee that the same thread may never reacquire the lock if there are other threads trying to acquire the same lock at the same time. Fair spinlocks maintain a first-in-first-out (FIFO) order among the threads attempting to acquire the lock.

Often, unfair spinlocks trade latency for throughput, while fair spinlocks trade throughput for latency. Consider the following example. Let’s say we’re running t threads each performing n/t loop iterations in which they try to enter a CS. When using an unfair spinlock, in the worst case (at least in theory) every thread could enter the CS n/t times in a row instead of alternating with lock acquisition between threads. In certain scenarios this can boost throughput considerably, because the number of expensive cache line invalidations is reduced. However, the time until any other thread may enter the CS rises. In theory this can lead to starvation. In practice the cache coherency protocol should ensure that no starvation occurs.

Typically, the performance of fair spinlocks drops extremely when the number of threads competing for the same lock is greater than the number of CPU cores in the system. The reason for this is that the thread which is next in the sequence to acquire the lock might be sleeping. In contrast to unfair spinlocks, during the time the next thread is sleeping no other thread can acquire the lock because of the strict acquisition order guaranteed by the fair lock. This property is sometimes referred to as preemption intolerance.

Memory footprint

Generally, the required memory to store l spinlocks can depend on the number of threads t which are using the lock. While some spinlock implementations require only constant memory for t threads, other’s memory footprint scales linearly with the number of threads. The amount of needed memory to store a spinlock mostly doesn’t matter, because the memory footprint for the majority of locks is in the order of a few dozen bytes per thread. Mostly it’s a multiple of the CPU’s cache line size to account for false sharing. However, there are cases where the memory footprint plays an important role. Examples are extremely memory-constrained environments or applications that require vast amounts of spinlocks e.g. for fine-grained locking of lots of small pieces of data.

March 21, 2015
by David 13 Comments

Splitting an arbitrary polygon by a line

In this post I present an algorithm for splitting an arbitrary, simple polygon by a line. My implementation is based on ideas from the article Spatial Partitioning of a Polygon by a Plane by George Vanecek, published in the book Graphics Gems V. The intention behind this post is covering all the details which the original article omitted. All polygons are assumed to be oriented counter clockwise. You can find the full source code of my implementation in this github repository.

Introduction

Splitting a polygon by a line is very closely related to clipping a polygon against a clip region, like a rectangle or another polygon. Splitting a polygon is easier and at the same time harder than clipping a polygon. It’s easier because you only clip against a single line and it’s harder because the resulting polygons cannot simply be discarded. It turns out, that splitting convex polygons is a lot easier than splitting concave polygons. The reason is that splitting convex polygons results in at most two new polygons, whereas splitting concave polygons can result in arbitrary many new polygons. Look at the following figure for an illustration.

Convex vs concave polygon

Coming up with a correct implementation for concave polygons is tricky, because there are many corner cases that must be considered. In the following, first, an algorithm for splitting convex polygons is outlined. Afterwards, the algorithm is extended to concave polygons.

Convex polygons

Splitting convex polygons is pretty simple. Essentially, one iterates over the polygon’s vertices, checks on which side of the split line they lie and adds them accordingly either to the left or the right result polygon. When an edge crosses the split line, additionally, the intersection point must be added to both result polygons. Special care must be taken when multiple successive vertices lie on the split line. In that case, due to the convexity property, the entire polygon lies either on the left-hand or on the right-hand side of the polygon. When for the second time a polygon edge crosses the split line, we know that the first half of the polygon is already completely split off.

Concave polygons

In contrast to splitting convex polygons it requires some more algorithmic effort to split concave polygons. When the split line crosses a concave polygon for the second time, it doesn’t necessarily mean that the first half of the polygon is now completely split off. On the contrary, it might be that another new polygon to be split off just opened. To account for that, first, I tried to iterate over the polygon edges while carrying along a stack of open polygons. Whenever, I came across a vertex which belonged to the open polygon on the top of the stack I knew it just closed. This approach worked out well until I tried accounting for all the corner cases. Finally, I decided to use the method from the article mentioned in the introduction.

Idea of the algorithm

The intersection points between the split line and the polygon are key to the splitting algorithm. By pairing up 2-tuples of intersection points along the split line, the edges closing the result polygons can be determined. Conceptually, every two intersection points along the split line form a pair of edges, each closing one side of two neighboring result polygons. In a perfect world, the intersection points could be easily paired up after sorting them along the split line, relative to the split line’s starting point (cf. the left figure below). Unfortunately, in the real world special care must be taken when the split line doesn’t cross a polygon edge in middle, but only touches the start and/or end vertex. In such cases the pairing up of source points (red) and destination points (green) along the split line gets trickier (cf. the figure in the middle blow). As you can see, only certain intersection points qualify as sources and destinations and some intersection points even qualify as both (yellow, cf. the right figure below). Luckily, it’s possible to account for all of these cases as we’ll see later.

Edge classification cases

Computing the intersection points

To ease the splitting process it’s helpful to represent the polygon as a doubly linked list. We compute the intersection points between the split line and the polygon by iterating over each polygon edge e and determining on which side of the split line the edge’s start vertex e.Start and end vertex e.End lies. If e.Start lies on the split line e is added to the array EdgesOnLine, which contains all edges starting on the split line. If e.Start and e.End lie on opposite sides of the split line, a new edge starting at the intersection point is inserted into the polygon and added to EdgesOnLine. The edges in this array are of interest, because they start at the intersection points with the split line. The algorithm is outlined below.

for (auto &e : poly)
{
    const LineSide sideStart = GetLineSide(SplitLine, e.Start);
    const LineSide sideEnd = GetLineSide(SplitLine, e.End);

    if (sideStart == LineSide::On)
    {
        EdgesOnLine.push_back(e);
    }
    else if (sideStart != sideEnd && sideEnd != LineSide::On)
    {
        Point2 ip = IntersectLines(SplitLine, e);
        Edge *newEdge = poly.InsertEdge(e, ip);
        EdgesOnLine.push_back(newEdge);
    }
}

Sorting the intersection points

After all polygon edges have been intersected with the split line, the edges in the EdgesOnLine array are sorted along the split line by the distance between their starting points and the starting point of the split line. Sorting the EdgesOnLine array is crucial for pairing up the source and the destination edges by simply iterating over the array as we see in the following section.

std::sort(EdgesOnLine.begin(), EdgesOnLine.end(), [&](PolyEdge *e0, PolyEdge *e1)
{
    return (CalcSignedDistance(line, e0->StartPos) <
            CalcSignedDistance(line, e1->StartPos));
});

As Glenn Burkhardt pointed out in a comment, the Euclidean distance cannot be used as distance metric because it’s unsigned. The Euclidean distance can only be used as long as the split line’s starting point lies outside the polygon. In that case all points are on the same side of the split line’s starting point and the Euclidean distance provides a strict order along the split line. However, if the split line’s starting point is inside the polygon, points on two different sides of the split line’s starting point will both have positive distances, which can result in wrong sort orders. A formula for computing the signed Euclidean distance between two points can be derived from the formula for scalar projections. In case of co-linear lines (angle between the two vectors is |1|) the result equals the signed distance along the split line.

static double CalcSignedDistance(const QLineF &line, const QPointF &p)
{
    return (p.x()-line.p1().x())*(line.p2().x()-line.p1().x())+
           (p.y()-line.p1().y())*(line.p2().y()-line.p1().y());
}

Pairing up edges

As we’ve seen before, all edges in the EdgesOnLine array are either source edges, destination edges or both. The edges must be classified accordingly, in order to be able to determine between which pairs of edges a bridge must be created to split the polygon. The source/destination classification is the most important part of the algorithm and the hardest to get right due to plenty of corner cases.

The starting point e.Start of any edge e in the EdgesOnLine array lies on the split line. Each edge e has a predecessor edge e.Prev and a successor edge e.Next. They can lie either on the left-hand side (L), on the right-hand side (R) or on the split line (O). Hence, there’s a total of nine configuration possibilities that must be evaluated in order to find out if an edge is a source, a destination or both. All nine configurations are depicted in the figure below. The vertex order is shown in the caption of each configuration.

Edge configurations

I simply examined examples to classify each configuration as source, destination or both. The configurations’ winding orders are key for the classification. Imagine to move along the split line from its starting point to its endpoint. The winding order at the current intersection point indicates if we’re currently inside the polygon or not. Let’s take exemplary the LOR and ROL cases. The LOR case is only encountered when transitioning from the outside to the inside of a polygon. Hence, it’s a source configuration. In contrast, the ROL case is only encountered when transitioning from the inside of the polygon to the outside. Hence, it’s a destination configuration. Below are sample polygons depicted for the classification of each source and each destination configuration.

Classification cases

It follows that LOR, ROR, LOL, OOR and LOO are source configurations and ROL, ROR, LOL, ROO and OOL are destination configurations. Note, that these two sets are not disjoint: ROR and LOL are source and destination configurations. However, these two configurations can only be sources if they have served as destinations before!
Another corner case occurs when two successive edge starting points lie on the split line. Look at the example depicted in the figure below. There are two source edges (red) of which the second one must be used as source, even though it comes second in the EdgesOnLine array.

Source/destination selection problems

The EdgesOnLine array contains three edges: two source edges (red) and one destination edge (green). When splitting the polygon it’s crucial to pair up the the second and the third edge, not the first and the third. So how do we figure out which source edge to select when iterating over the EdgesOnLine array? It turns out that the distance between the starting points of the two successive source edges in question and the starting point of the split line can be used. An OOR and LOO source edge is only selected if the following candidate edge isn’t further away from the start of the split line and therefore closer to the next destination.

Rewiring the polygon

After we’ve figured out how the classification of source and destination edges works, we can finally formulate the algorithm to split concave polygons. All we need to do is to iterate over the EdgesOnLine array while pairing up source and destination edges. Each pair denotes one side of the polygon that has to be be split off. At this point it pays off that we use a linked list representation for the polygon. The current polygon side is split off by inserting two new edges between the source and the destination edge. Therefore, we iterate over the EdgesOnLine array, obtain the next source and the next destination edge and insert two new, oppositely oriented edges between them. This bridging operation is depicted in the following figure.

Edge configurations

The main loop of the rewiring process is depicted below. For reasons of brevity only the overall structure is shown. For more details look into the full implementation in the github repository.

for (size_t i=0; i<EdgesOnLine.size(); i++)
{
    // find source edge
    PolyEdge *srcEdge = useSrc;

    for (; !srcEdge && i<EdgesOnLine.size(); i++)
    {
        // test conditions if current edge is a source
    }

    // find destination edge
    PolyEdge *dstEdge = nullptr;

    for (; !dstEdge && i<EdgesOnLine.size(); )
    {
        // test conditions if current edge is a destination
    }

    // bridge source and destination edges
    if (srcEdge && dstEdge)
        CreateBridge(srcEdge, dstEdge);
}

Collecting the result polygons

The final step, after all bridge edges have been inserted, is collecting the resulting polygons. For that purpose I use the Visited attribute in the edge data structure to indicate if an edge has been visited already and therefore, has been assigned already to one of the result polygons.

November 18, 2014
by David 6 Comments

Advanced Octrees 3: non-static Octrees

Welcome to the third part of the Advanced Octrees series. Make sure you’ve also read part one and two. Typically, Octrees are constructed once for geometry which is known a priori and doesn’t change anymore (e.g. the level in a computer game). However, there are applications where objects are moving through the world, or objects located outside the Octree’s root cell must be inserted after the Octree has been constructed already. One approach would be to simply reconstruct the entire Octree from scratch each time it’s modified. While working, obviously, this turns very inefficient as soon as the Octree contains more than just a handful objects. In this post I present a way to relocate moving objects in already constructed Octrees, as well as a way to expand/shrink Octrees.

Relocate moving objects

An Octree which contains moving objects must be updated whenever a moving object starts straddling its parent cell’s bounding box. Usually, the number of moving objects is considerably smaller than the number of static objects. Thus, it can be advantageous to maintain two Octrees: one containing all static and one containing all moving objects. Instead of reconstructing the Octree from scratch, it’s much faster to relocate only the objects that have moved out of their parent cell.

Duplicating objects straddling cell boundaries works great for static scenes. However, in dynamic scenes keeping track of multiple references to the same object contained in multiple different nodes adds unnecessary complexity. Therefore, it’s better to place objects in the lowest Octree cell which completely encloses the object (see part one). That way a minimum amount of pointer information must be updated when a moving object transitions from one cell into another.

To update an Octree with moving objects, first, a list of all objects that have moved out of their parent cell is obtained. Each object in this list is pushed up the Octree until it ends up in a node which completely encloses it. The object must remain in this node, as long as it straddles any child cell’s bounding box of its new parent. However, when the object keeps on moving into the same direction, there will be the moment it’s again completely enclosed by one of its parent’s child cells. Therefore, finally, all previously pushed up objects are tried to be moved down the Octree again, in order to place them in the smallest enclosing cell possible.
It can happen that objects move out of the Octree’s root cell. In that case the Octree must be expanded as described in the following section. It can also happen that after pushing a moving object up the Octree, the former parent node and all its child nodes remain empty. In that case the former parent node and its children can be safely removed.

Expanding and shrinking

The final extent of the world isn’t always known at the time the Octree is constructed. Consider a space game in which world entities can spawn at arbitrary locations, or where space ships can move around freely until they leave the Octree’s root cell. To handle such situations an Octree must be expanded and shrunk dynamically as game entities spawn, disappear or move.

Octrees can be expanded by allocating a new root node with seven new child nodes and the 8th child node being the old root node. It’s crucial to expand the Octree into the direction of the outlying object. Therefore, the center of the new root node must be chosen in such a way, that the outlying object falls into it, or that at least the distance between the outlying object and the new Octree root node decreases. This operation is repeated recursively until the outlying object finally falls into the Octree’s root cell. As the Octree’s extent grows exponentially (it doubles each tree level) any reasonably far away object will be enclosed after a few expansion steps.
To shrink an Octree the reverse operation can be applied. If seven out of eight root node children are empty, all seven children can be removed from the Octree and the remaining child becomes the Octree’s new root node.

Creating and deleting nodes at the top of hashed Octrees is very costly, because the locational code of all nodes below the new root node gets 3 bits longer and must be updated. Consequently, the hash map must be updated as well. If expanding/shrinking are rare operations it might be still worth using hashed Octrees. Though, usually, pointer-based implementations perform much better. For more information read part two.

August 18, 2014
by David 14 Comments

Advanced Octrees 2: node representations

Introduction

Welcome to the second installment of the Advanced Octrees series. Make sure you’ve also read part one. In the second part I discuss different Octree data structure layouts. Essentially, literature distinguishes between two different layouts: the traditional layout using a pointer-based node representation and the implicit layout using an index-based node representation. Both layouts have their strong points and their weak points. Pointer-based node representations are advantageous if the Octree needs to be updated frequently and if memory consumption is not an issue. Implicit node representations pay off in memory limited applications.

Regardless of the chosen representation, the node struct always contains a pointer to the list of objects it encloses. Additionally, almost always the cell’s axis-aligned bounding box (AABB) is stored inside the node. The AABB can be stored in two different ways: either as two vectors AabbMin and AabbMax containing the AABB’s minimum and maximum corners, or as two vectors Center and HalfSize containing the AABB’s center and extent. If the Octree is cubic, all AABB sides have the same length and the storage size of the latter AABB representation can be reduced by storing HalfSize as a single float. Speedwise the center-extent representation is advantageous in most calculations (e.g. for view-frustum culling). Instead of storing the AABB inside the node, it can be recomputed while traversing the Octree. This is a memory consumption vs. compute trade-off.

All struct sizes given in the remainder of this post assume 64-bit wide pointers and the Vector3 class consisting of three 32-bit float variables. Let’s start with looking at pointer-based node representations first.

Pointer-based node representations

Standard representation

The most intuitive, pointer-based node representation consists of eight pointers to each of the eight child nodes. This representation supports on-demand allocation of nodes. On-demand allocation only allocates memory for child nodes, once an object is encountered which falls into the respective sub-cell. Some Octree implementations add pointers to the parent node for bottom-up traversals.

// standard representation (104 bytes)
struct OctreeNode
{
    OctreeNode * Children[8];
    OctreeNode * Parent; // optional
    Object *     Objects;
    Vector3      Center;
    Vector3      HalfSize;
};

As leaf nodes have no children, they don’t need to store child pointers. Therefore, two different node types, one for inner nodes and one for leaf nodes, can be used. To distinguish between inner nodes and leaf nodes, a flag must be stored additionally. The flag can be either stored as an additional bool variable IsLeaf, or encoded in the least significant bit of one of the pointers if the nodes are allocated with appropriate alignment (C++’s new operator usually aligns object types to the size of their largest member).

// inner node (105 bytes)                   // leaf node (41 bytes)
struct OctreeInnerNode                      struct OctreeLeafNode
{                                           {
    OctreeInnerNode * Children[8];              // leaf has no children
    OctreeInnerNode * Parent;                   OctreeInnerNode * Parent; // optional
    Object *          FirstObj;                 Object *          Objects;
    Vector3           Center;                   Vector3           Center;
    Vector3           HalfSize;                 Vector3           HalfSize;
    bool              IsLeaf;                   bool              IsLeaf;
};                                          };

Using two different node types, one for inner nodes and one for leaf nodes, can be applied as well to the following two representations.

Block representation

A significant amount of memory can be saved by storing just one pointer to a block of eight children, instead of eight pointers to eight children. That way the storage size of an inner node can be reduced from 105 bytes down to 49 bytes, which is only 47% of the original size. However, when a leaf node is subdivided always all eight children must be allocated. It’s not possible anymore to allocate child nodes on-demand, once the first object falling into the octant in question is encountered. Look at the following figure for an illustration of the block representation.

The corresponding code for the node struct is:

// block representation (49 bytes)
struct OctreeNode
{
    OctreeNode * Children;
    OctreeNode * Parent; // optional
    Object *     FirstObj;
    Vector3      Center;
    Vector3      HalfSize;
    bool         IsLeaf;
};

Sibling-child representation

On-demand allocation can reduce the amount of required memory for nodes significantly if the world is sparsely populated and thereby, many octants contain no objects. A trade-off between the standard representation and the block representation is the so called sibling-child representation. This representation allows on-demand allocation while storing only two node pointers per node instead of eight. The first pointer is NextSibling, which points to the next child node of the node’s parent. The second pointer is FirstChild, which points to the node’s first child node. Look at the following figure for an illustration of the sibling-child representation. Compare the number of required pointers per node to the standard representation.

In comparison to the standard representation, the sibling-child representation needs only 25% of the memory for pointers (two instead of eight pointers). As long as the child nodes are accessed sequentially the two representations perform equally well. However, accessing nodes randomly requires dereferencing on average four times more pointers. The code for the node struct is given below.

// sibling-child representation (56 bytes)
struct OctreeNode
{
    OctreeNode * NextSibling;
    OctreeNode * FirstChild;
    OctreeNode * Parent; // optional
    Object *     FirstObj;
    Vector3      Center;
    Vector3      HalfSize;
};

Comparison

The choice of the right pointer-based representation depends mainly on the importance of memory usage vs. traversal speed. Explicitly storing all eight child pointers wastes memory but makes traversing and modifying the Octree easy to implement and fast. In contrast, the sibling-child representation saves 50% memory as a single node is only 48 bytes instead of 92 bytes. However, the additional pointer indirections might complicate the traversal code and make it slower. It can be a good trade-off to store just a single pointer to a block of eight sub-cells. This representation needs only 40 bytes of memory per node and the traversal code is as easy as in the representation with eight child pointers. However, always allocating all eight sub-cells can waste memory in sparsely populated worlds with many empty sub-cells.

Implicit node representations

Linear (hashed) Octrees

Linear Octrees [Gargantini, 1982]¹, originally proposed for Quadtrees, combine the advantages of pointer-based and pointer-less representations. Linear Octrees provide easy and efficient access to parent and child nodes, even though no explicit tree structure information must be stored per node.

Overview

Instead of child and parent pointers, Linear Octrees store a unique index called locational code in each node. Additionally, all Octree nodes are stored in a hash map which allows directly accessing any node based on its locational code. The locational code is constructed in such a way, that deriving the locational codes for any node’s parent and children based on its own locational code is feasible and fast. To avoid unnecessary hash map look-ups for children which don’t exist, the node struct can be extended by a bit-mask indicating which children have been allocated and which haven’t.

struct OctreeNode // 13 bytes
{
    Object *    Objects;
    uint32_t    LocCode;     // or 64-bit, depends on max. required tree depth
    uint8_t     ChildExists; // optional
};

The locational code

In order to create the locational code each octant gets a 3-bit number between 0 and 7 assigned, depending on the node’s relative position to it’s parent’s center. The possible relative positions are: bottom-left-front (000), bottom-right-front (001), bottom-left-back (010), bottom-right-back (011), top-left-front (100), top-right-front (101), top-left-back (110), top-right-back (111). The locational code of any child node in the tree can be computed recursively by concatenating the octant numbers of all the nodes from the root down to the node in question. The octant numbers are illustrated in the figure below.

The AABB of the node can be stored explicitly as before, or it can be computed from the node’s tree depth stored implicitly inside the locational code. To derive the tree depth at a node from its locational code a flag bit is required to indicate the end of the locational code. Without such a flag it wouldn’t be possible to distinguish e.g. between 001 and 000 001. By using a 1 bit to mark the end of the sequence 1 001 can be easily distinguished from 1 000 001. Using such a flag is equivalent to setting the locational code of the Octree root to 1. With one bit for the flag and three bits per Octree level, a 32-bit locational code can represent at maximum a tree depth of 10, a 64-bit locational code a tree depth of 21. Given the locational code $c$ of a node, its depth in the Octree can be computed as $\lfloor\log_2(c)/3\rfloor$ . An efficient implementation using bit scanning intrinsics is given below for GCC and Visual C++.

size_t Octree::GetNodeTreeDepth(const OctreeNode *node)
{
    assert(node->LocCode); // at least flag bit must be set
    // for (uint32_t lc=node->LocCode, depth=0; lc!=1; lc>>=3, depth++);
    // return depth;

#if defined(__GNUC__)
    return (31-__builtin_clz(node->LocCode))/3;
#elif defined(_MSC_VER)
    long msb;
    _BitScanReverse(&msb, node->LocCode);
    return msb/3;
#endif
}

When sorting the nodes by locational code the resulting order is the same as the pre-order traversal of the Octree, which in turn is equivalent to the Morton Code (also known as Z-Order Curve). The Morton Code linearly indexes multi-dimensional data, preserving data locality on multiple levels.

Tree traversal

Given the locational code, moving further down or up the Octree is a simple two-step operation consisting of (1) deriving the locational code of the next node and (2) looking-up the node in the hash-map.

For traversing up the Octree, first, the locational code of the parent node must be determined. This is done by removing the least significant three bits of the locational code of the current node. Now, the parent node can be retrieved by doing a hash map look-up with the previously computed locational code. An exemplary implementation is given below.

class Octree
{
public:
    OctreeNode * GetParentNode(OctreeNode *node)
    {
        const uint32_t locCodeParent = node->LocCode>>3;
        return LookupNode(locCodeParent);
    }

private:
    OctreeNode * LookupNode(uint32_t locCode)
    {
        const auto iter = Nodes.find(locCode);
        return (iter == Nodes.end() ? nullptr : &(*iter));
    }

private:
    std::unordered_map<uint32_t, OctreeNode> Nodes;
};

For traversing down the Octree, first, the locational code of the child in question must be computed. This is done by appending the octant number of the child to the current node’s locational code. After that the child node can be retrieved by doing a hash map look-up with the previously computed locational code. The following code visits all nodes of an Octree from the root down to the leafs.

void Octree::VisitAll(OctreeNode *node)
{
    for (int i=0; i<8; i++)
    {
        if (node->ChildExists&(1<<i))
        {
            const uint32_t locCodeChild = (node->LocCode<<3)|i;
            const auto *child = LookupNode(locCodeChild);
            VisitAll(child);
        }
    }
}

Full Octrees

In a full or complete Octree, every internal node has eight children and all leaf nodes have exactly the same tree depth $D$ which is fixed a priori. A full Octree has $N_L=8^D$ leaf nodes. Thus, it’s equal to a regular 3D grid with a resolution of $2^D\times 2^D\times 2^D$ . The total number of tree nodes can be computed as $N_T=\sum_{i=0}^{D}8^i=\frac{8^{D+1}-1}{7}$ . Full Octrees of four successive subdivision levels are depicted in the figure below.

Thanks to the regularity of a full Octree it can be implemented without explicitly storing any tree structure and cell size information in the nodes. Hence, a single node consists solely of the pointer to the objects; which is eight bytes on a 64-bit machine. Similar to binary trees, full Octrees can be stored pointer-less in an array FullOctreeNode Nodes[K] (zero-based). The children of any node Nodes[i] can be found at Nodes[8*i+1] to Nodes[8*i+8], the parent of node Nodes[i] can be found at Nodes[floor((i-1)/8)] if i is not the root node ( $\Rightarrow i>0$ ).

The most common application of full Octrees are non-sparse, static scenes with very evenly distributed geometry. If not most of the nodes contain objects, memory savings due to small node structs are quickly lost by the huge amount of nodes that need to be allocated. A full Octree of depth $D=10$ consists of $N_T=1227133513$ (1.2 billion) nodes which consume around 9.14 GiB of memory.

Wrap up

Which node representation to choose for an Octree implementation depends mainly on the application. There are three major aspects that can help deciding on the representation.

How much data will supposedly be stored in the Octree? Is the reduced node size of an implicit node representation crucial for keeping the memory usage in check?
Will the Octree be subject to frequent changes? Pointer-based node representations are much more suitable for modifying Octrees than implicit node representations.
Will the Octree be sparsely or densely populated? How much memory can be saved by supporting on-demand allocation of nodes? Is maybe a full Octree suitable?

I. Gargantini. An Effective Way to Represent Octrees. Communications of the ACM, Volume 25 Issue 12, Dec. 1982, Pages 905-910 ↩

July 18, 2014
by David 4 Comments

Advanced Octrees 1: preliminaries, insertion strategies and maximum tree depth

Introduction

An Octree is a recursive, axis-aligned, spatial partitioning data structure commonly used in computer graphics to optimize collision detection, nearest neighbor search, frustum culling and more. Conceptually, an Octree is a simple data structure. However, when digging deeper into the literature one will find many interesting, not very well-known techniques for optimizing and extending Octrees. This is why I decided to write a series of blog posts about Octree techniques not widely covered on the Internet. This series will consist of five posts covering the following topics:

Preliminaries, insertion strategies and maximum tree depth
Different node representations for memory footprint reduction
Non-static Octrees to support moving objects and expanding/shrinking Octrees
Loose Octrees for optimizing insertion and culling hotspots
Accessing a node’s neighbors

I’ll publish the posts of this series one by one during the next few weeks, starting today with the first one on preliminaries, insertion strategies and an upper bound for the maximum Octree depth. Thanks for reading!

Preliminaries

An Octree hierarchically subdivides a finite 3D volume into eight disjoint octants. In the following, octants are also called nodes in the context of tree data structures and cells in the context of space. An Octree’s root cell encloses the entire world. Sometimes, Octrees are introduced as subdividing space into cubes instead of arbitrarily sized boxes. Generally, arbitrarily sized boxes work equally well and there’s no need to adjust the root cell’s size to have the shape of a cube. However, cube-shaped cells slightly speed-up cell subdivision computations and the cell size can be stored as just one float per node instead of three. The general structure of an Octree is illustrated in the figure below.

Octree structure

An Octree is constructed by recursively subdividing space into eight cells until the remaining number of objects in each cell is below a pre-defined threshold, or a maximum tree depth is reached. Every cell is subdivided by three axis-aligned planes, which are usually placed in the middle of the parent node. Thus, each node can have up to eight children. The possibility not to allocate certain child nodes allows, in contrast to regular grids, to store sparsely populated worlds in Octrees.

Insertion strategies

Points are dimensionless and thereby have no spatial extent. Thus, if points are stored in an Octree, they can be unambiguously assigned to exactly one node. However, if objects with spatial extent like polygons are stored in an Octree, their midpoint can fall into a cell while the object it-self straddles the cell boundary. In that case there are basically three options.

The object in question is split along the boundaries of the straddling cells and each part is inserted into its corresponding cell. This approach has two disadvantages. First, the splitting operation might imply costly computations. Second, the data structures are more complicated, because the split-off objects need to be stored somewhere.
The object in question is added to each cell it straddles. This option is disadvantageous when the Octree needs to be updated, because it can contain duplicate references to the same object. Furthermore, the culling performance gets worse, because the same object might be found in more than one visible cell. Additionally, special care must be taken when subdividing cells. If after subdividing a cell all objects straddle the very same newly created sub-cell(s), all objects will be inserted again into the same sub-cell(s) causing yet another subdivision step. This results in an infinite loop, only stopped when the maximum tree depth is reached.
The object in question is stored in the smallest Octree cell it’s completely enclosed by. This option results in many unnecessary tests, because objects are stored in inner nodes instead of leaf nodes in case they straddle any child cell further down the tree.

In some of the following posts of this series I’ll come back to the different insertion strategies and show in which situation which of the strategies is advantageous. Especially, Loose Octrees are a particularly nice way of overcoming most of the downsides discussed above.

Maximum tree depth

Let’s assume an Octree contains $M$ points. As described in the previous section, each of the points can only fall exactly into one node. Is it possible to establish a relation between the number of points $M$ and the maximum Octree depth $D_\text{max}$ ? It turns out, that the number of Octree nodes (=> the Octree depth) is not limited by the number of points. The reason is, that if the points are distributed closely enough in multiple widespread clusters, the number of Octree nodes can grow arbitrarily large. Look at the following figure for an illustration.

Octree point clusters

In order to split any of the two point clusters, the Octree must be subdivided a few times first. As the points inside the clusters can be arbitrarily close and the clusters can be arbitrarily far away from each other, the number of subdivision steps, and therefore the number of Octree nodes is not limited by the size of the point set. That shows that generally, Octrees cannot be balanced as we are used to from traditional tree data structures.

Nevertheless, we can come up with another upper bound for the maximum tree depth. Let’s assume a cubic Octree for simplicity. Given the minimum distance $d_\text{min}$ between any two points in the point set and the side length of the root cell $s$ , it can be shown that the maximum Octree depth is limited by $\log\frac{s}{d_\text{min}}+\log\sqrt{3}\geq D_\text{max}$ . The following proof for this upper bound is rather simple.
The maximum distance between any two points in a cell at depth $k$ is given by $\sqrt{3(s/2^k)^2}=\sqrt{3}\frac{s}{2^k}$ . Any inner node encloses at least two points, otherwise it would be a leaf node. Hence, the maximum distance between any two points in this cell is guaranteed to be bigger than the minimum distance $d_\text{min}$ between any two points of the point set. Therefore, it holds that $d_\text{min}\leq\sqrt{3}\frac{s}{2^k}\Leftrightarrow k\leq\log\sqrt{3}+\log\frac{s}{d_\text{min}}$ .

That’s it for today. Stay tuned for the next article!

Introduction

Algorithm

Complexity

Introduction

Design choices

std::unique_ptr

std::shared_ptr

Type erased custom deleters with std::unique_ptr

Anderson’s lock

Graunke and Thakkar’s lock

Comparison

Implementation

Proportional back-off

Wrap up

Introduction

The test-and-test-and-set optimization

Exponential back-off

Pausing and sleeping

Performance comparison

Wrapup

Scalability

Fairness

Memory footprint

Introduction

Convex polygons

Concave polygons

Idea of the algorithm

Computing the intersection points

Sorting the intersection points

Pairing up edges

Rewiring the polygon

Collecting the result polygons

Relocate moving objects

Expanding and shrinking

Introduction

Pointer-based node representations

Standard representation

Block representation

Sibling-child representation

Comparison

Implicit node representations

Linear (hashed) Octrees

Overview

The locational code

Tree traversal

Full Octrees

Wrap up

Introduction

Preliminaries

Insertion strategies

Maximum tree depth

`std::unique_ptr`

`std::shared_ptr`

Type erased custom deleters with `std::unique_ptr`