Advanced Octrees 2: node representations

Introduction

Welcome to the second installment of the Advanced Octrees series. Make sure you’ve also read part one. In the second part I discuss different Octree data structure layouts. Essentially, literature distinguishes between two different layouts: the traditional layout using a pointer-based node representation and the implicit layout using an index-based node representation. Both layouts have their strong points and their weak points. Pointer-based node representations are advantageous if the Octree needs to be updated frequently and if memory consumption is not an issue. Implicit node representations pay off in memory limited applications.

Regardless of the chosen representation, the node struct always contains a pointer to the list of objects it encloses. Additionally, almost always the cell’s axis-aligned bounding box (AABB) is stored inside the node. The AABB can be stored in two different ways: either as two vectors AabbMin and AabbMax containing the AABB’s minimum and maximum corners, or as two vectors Center and HalfSize containing the AABB’s center and extent. If the Octree is cubic, all AABB sides have the same length and the storage size of the latter AABB representation can be reduced by storing HalfSize as a single float. Speedwise the center-extent representation is advantageous in most calculations (e.g. for view-frustum culling). Instead of storing the AABB inside the node, it can be recomputed while traversing the Octree. This is a memory consumption vs. compute trade-off.

All struct sizes given in the remainder of this post assume 64-bit wide pointers and the Vector3 class consisting of three 32-bit float variables. Let’s start with looking at pointer-based node representations first.

Pointer-based node representations

Standard representation

The most intuitive, pointer-based node representation consists of eight pointers to each of the eight child nodes. This representation supports on-demand allocation of nodes. On-demand allocation only allocates memory for child nodes, once an object is encountered which falls into the respective sub-cell. Some Octree implementations add pointers to the parent node for bottom-up traversals.

// standard representation (104 bytes)
struct OctreeNode
{
    OctreeNode * Children[8];
    OctreeNode * Parent; // optional
    Object *     Objects;
    Vector3      Center;
    Vector3      HalfSize;
};

As leaf nodes have no children, they don’t need to store child pointers. Therefore, two different node types, one for inner nodes and one for leaf nodes, can be used. To distinguish between inner nodes and leaf nodes, a flag must be stored additionally. The flag can be either stored as an additional bool variable IsLeaf, or encoded in the least significant bit of one of the pointers if the nodes are allocated with appropriate alignment (C++’s new operator usually aligns object types to the size of their largest member).

// inner node (105 bytes)                   // leaf node (41 bytes)
struct OctreeInnerNode                      struct OctreeLeafNode
{                                           {
    OctreeInnerNode * Children[8];              // leaf has no children
    OctreeInnerNode * Parent;                   OctreeInnerNode * Parent; // optional
    Object *          FirstObj;                 Object *          Objects;
    Vector3           Center;                   Vector3           Center;
    Vector3           HalfSize;                 Vector3           HalfSize;
    bool              IsLeaf;                   bool              IsLeaf;
};                                          };

Using two different node types, one for inner nodes and one for leaf nodes, can be applied as well to the following two representations.

Block representation

A significant amount of memory can be saved by storing just one pointer to a block of eight children, instead of eight pointers to eight children. That way the storage size of an inner node can be reduced from 105 bytes down to 49 bytes, which is only 47% of the original size. However, when a leaf node is subdivided always all eight children must be allocated. It’s not possible anymore to allocate child nodes on-demand, once the first object falling into the octant in question is encountered. Look at the following figure for an illustration of the block representation.

The corresponding code for the node struct is:

// block representation (49 bytes)
struct OctreeNode
{
    OctreeNode * Children;
    OctreeNode * Parent; // optional
    Object *     FirstObj;
    Vector3      Center;
    Vector3      HalfSize;
    bool         IsLeaf;
};

Sibling-child representation

On-demand allocation can reduce the amount of required memory for nodes significantly if the world is sparsely populated and thereby, many octants contain no objects. A trade-off between the standard representation and the block representation is the so called sibling-child representation. This representation allows on-demand allocation while storing only two node pointers per node instead of eight. The first pointer is NextSibling, which points to the next child node of the node’s parent. The second pointer is FirstChild, which points to the node’s first child node. Look at the following figure for an illustration of the sibling-child representation. Compare the number of required pointers per node to the standard representation.

In comparison to the standard representation, the sibling-child representation needs only 25% of the memory for pointers (two instead of eight pointers). As long as the child nodes are accessed sequentially the two representations perform equally well. However, accessing nodes randomly requires dereferencing on average four times more pointers. The code for the node struct is given below.

// sibling-child representation (56 bytes)
struct OctreeNode
{
    OctreeNode * NextSibling;
    OctreeNode * FirstChild;
    OctreeNode * Parent; // optional
    Object *     FirstObj;
    Vector3      Center;
    Vector3      HalfSize;
};

Comparison

The choice of the right pointer-based representation depends mainly on the importance of memory usage vs. traversal speed. Explicitly storing all eight child pointers wastes memory but makes traversing and modifying the Octree easy to implement and fast. In contrast, the sibling-child representation saves 50% memory as a single node is only 48 bytes instead of 92 bytes. However, the additional pointer indirections might complicate the traversal code and make it slower. It can be a good trade-off to store just a single pointer to a block of eight sub-cells. This representation needs only 40 bytes of memory per node and the traversal code is as easy as in the representation with eight child pointers. However, always allocating all eight sub-cells can waste memory in sparsely populated worlds with many empty sub-cells.

Implicit node representations

Linear (hashed) Octrees

Linear Octrees [Gargantini, 1982]1, originally proposed for Quadtrees, combine the advantages of pointer-based and pointer-less representations. Linear Octrees provide easy and efficient access to parent and child nodes, even though no explicit tree structure information must be stored per node.

Overview

Instead of child and parent pointers, Linear Octrees store a unique index called locational code in each node. Additionally, all Octree nodes are stored in a hash map which allows directly accessing any node based on its locational code. The locational code is constructed in such a way, that deriving the locational codes for any node’s parent and children based on its own locational code is feasible and fast. To avoid unnecessary hash map look-ups for children which don’t exist, the node struct can be extended by a bit-mask indicating which children have been allocated and which haven’t.

struct OctreeNode // 13 bytes
{
    Object *    Objects;
    uint32_t    LocCode;     // or 64-bit, depends on max. required tree depth
    uint8_t     ChildExists; // optional
};

The locational code

In order to create the locational code each octant gets a 3-bit number between 0 and 7 assigned, depending on the node’s relative position to it’s parent’s center. The possible relative positions are: bottom-left-front (000), bottom-right-front (001), bottom-left-back (010), bottom-right-back (011), top-left-front (100), top-right-front (101), top-left-back (110), top-right-back (111). The locational code of any child node in the tree can be computed recursively by concatenating the octant numbers of all the nodes from the root down to the node in question. The octant numbers are illustrated in the figure below.

The AABB of the node can be stored explicitly as before, or it can be computed from the node’s tree depth stored implicitly inside the locational code. To derive the tree depth at a node from its locational code a flag bit is required to indicate the end of the locational code. Without such a flag it wouldn’t be possible to distinguish e.g. between 001 and 000 001. By using a 1 bit to mark the end of the sequence 1 001 can be easily distinguished from 1 000 001. Using such a flag is equivalent to setting the locational code of the Octree root to 1. With one bit for the flag and three bits per Octree level, a 32-bit locational code can represent at maximum a tree depth of 10, a 64-bit locational code a tree depth of 21. Given the locational code c of a node, its depth in the Octree can be computed as \lfloor\log_2(c)/3\rfloor. An efficient implementation using bit scanning intrinsics is given below for GCC and Visual C++.

size_t Octree::GetNodeTreeDepth(const OctreeNode *node)
{
    assert(node->LocCode); // at least flag bit must be set
    // for (uint32_t lc=node->LocCode, depth=0; lc!=1; lc>>=3, depth++);
    // return depth;

#if defined(__GNUC__)
    return (31-__builtin_clz(node->LocCode))/3;
#elif defined(_MSC_VER)
    long msb;
    _BitScanReverse(&msb, node->LocCode);
    return msb/3;
#endif
}

When sorting the nodes by locational code the resulting order is the same as the pre-order traversal of the Octree, which in turn is equivalent to the Morton Code (also known as Z-Order Curve). The Morton Code linearly indexes multi-dimensional data, preserving data locality on multiple levels.

Tree traversal

Given the locational code, moving further down or up the Octree is a simple two-step operation consisting of (1) deriving the locational code of the next node and (2) looking-up the node in the hash-map.

For traversing up the Octree, first, the locational code of the parent node must be determined. This is done by removing the least significant three bits of the locational code of the current node. Now, the parent node can be retrieved by doing a hash map look-up with the previously computed locational code. An exemplary implementation is given below.

class Octree
{
public:
    OctreeNode * GetParentNode(OctreeNode *node)
    {
        const uint32_t locCodeParent = node->LocCode>>3;
        return LookupNode(locCodeParent);
    }

private:
    OctreeNode * LookupNode(uint32_t locCode)
    {
        const auto iter = Nodes.find(locCode);
        return (iter == Nodes.end() ? nullptr : &(*iter));
    }

private:
    std::unordered_map<uint32_t, OctreeNode> Nodes;
};

For traversing down the Octree, first, the locational code of the child in question must be computed. This is done by appending the octant number of the child to the current node’s locational code. After that the child node can be retrieved by doing a hash map look-up with the previously computed locational code. The following code visits all nodes of an Octree from the root down to the leafs.

void Octree::VisitAll(OctreeNode *node)
{
    for (int i=0; i<8; i++)
    {
        if (node->ChildExists&(1<<i))
        {
            const uint32_t locCodeChild = (node->LocCode<<3)|i;
            const auto *child = LookupNode(locCodeChild);
            VisitAll(child);
        }
    }
}

Full Octrees

In a full or complete Octree, every internal node has eight children and all leaf nodes have exactly the same tree depth D which is fixed a priori. A full Octree has N_L=8^D leaf nodes. Thus, it’s equal to a regular 3D grid with a resolution of 2^D\times 2^D\times 2^D. The total number of tree nodes can be computed as N_T=\sum_{i=0}^{D}8^i=\frac{8^{D+1}-1}{7}. Full Octrees of four successive subdivision levels are depicted in the figure below.

Thanks to the regularity of a full Octree it can be implemented without explicitly storing any tree structure and cell size information in the nodes. Hence, a single node consists solely of the pointer to the objects; which is eight bytes on a 64-bit machine. Similar to binary trees, full Octrees can be stored pointer-less in an array FullOctreeNode Nodes[K] (zero-based). The children of any node Nodes[i] can be found at Nodes[8*i+1] to Nodes[8*i+8], the parent of node Nodes[i] can be found at Nodes[floor((i-1)/8)] if i is not the root node (\Rightarrow i>0).

The most common application of full Octrees are non-sparse, static scenes with very evenly distributed geometry. If not most of the nodes contain objects, memory savings due to small node structs are quickly lost by the huge amount of nodes that need to be allocated. A full Octree of depth D=10 consists of N_T=1227133513 (1.2 billion) nodes which consume around 9.14 GiB of memory.

Wrap up

Which node representation to choose for an Octree implementation depends mainly on the application. There are three major aspects that can help deciding on the representation.

  1. How much data will supposedly be stored in the Octree? Is the reduced node size of an implicit node representation crucial for keeping the memory usage in check?
  2. Will the Octree be subject to frequent changes? Pointer-based node representations are much more suitable for modifying Octrees than implicit node representations.
  3. Will the Octree be sparsely or densely populated? How much memory can be saved by supporting on-demand allocation of nodes? Is maybe a full Octree suitable?

  1. I. Gargantini. An Effective Way to Represent Octrees. Communications of the ACM, Volume 25 Issue 12, Dec. 1982, Pages 905-910 

Advanced Octrees 1: preliminaries, insertion strategies and maximum tree depth

Introduction

An Octree is a recursive, axis-aligned, spatial partitioning data structure commonly used in computer graphics to optimize collision detection, nearest neighbor search, frustum culling and more. Conceptually, an Octree is a simple data structure. However, when digging deeper into the literature one will find many interesting, not very well-known techniques for optimizing and extending Octrees. This is why I decided to write a series of blog posts about Octree techniques not widely covered on the Internet. This series will consist of five posts covering the following topics:

  1. Preliminaries, insertion strategies and maximum tree depth
  2. Different node representations for memory footprint reduction
  3. Non-static Octrees to support moving objects and inserting objects not falling into the root cell
  4. Loose Octrees for optimizing insertion and culling hotspots
  5. Accessing a node’s neighbors

I’ll publish the posts of this series one by one during the next few weeks, starting today with the first one on preliminaries, insertion strategies and an upper bound for the maximum Octree depth. Thanks for reading!

Preliminaries

An Octree hierarchically subdivides a finite 3D volume into eight disjoint octants. In the following, octants are also called nodes in the context of tree data structures and cells in the context of space. An Octree’s root cell encloses the entire world. Sometimes, Octrees are introduced as subdividing space into cubes instead of arbitrarily sized boxes. Generally, arbitrarily sized boxes work equally well and there’s no need to adjust the root cell’s size to have the shape of a cube. However, cube-shaped cells slightly speed-up cell subdivision computations and the cell size can be stored as just one float per node instead of three. The general structure of an Octree is illustrated in the figure below.

Octree structure

An Octree is constructed by recursively subdividing space into eight cells until the remaining number of objects in each cell is below a pre-defined threshold, or a maximum tree depth is reached. Every cell is subdivided by three axis-aligned planes, which are usually placed in the middle of the parent node. Thus, each node can have up to eight children. The possibility not to allocate certain child nodes allows, in contrast to regular grids, to store sparsely populated worlds in Octrees.

Insertion strategies

Points are dimensionless and thereby have no spatial extent. Thus, if points are stored in an Octree, they can be unambiguously assigned to exactly one node. However, if objects with spatial extent like polygons are stored in an Octree, their midpoint can fall into a cell while the object it-self straddles the cell boundary. In that case there are basically three options.

  1. The object in question is split along the boundaries of the straddling cells and each part is inserted into its corresponding cell. This approach has two disadvantages. First, the splitting operation might imply costly computations. Second, the data structures are more complicated, because the split-off objects need to be stored somewhere.

  2. The object in question is added to each cell it straddles. This option is disadvantageous when the Octree needs to be updated, because it can contain duplicate references to the same object. Furthermore, the culling performance gets worse, because the same object might be found in more than one visible cell. Additionally, special care must be taken when subdividing cells. If after subdividing a cell all objects straddle the very same newly created sub-cell(s), all objects will be inserted again into the same sub-cell(s) causing yet another subdivision step. This results in an infinite loop, only stopped when the maximum tree depth is reached.

  3. The object in question is stored in the smallest Octree cell it’s completely enclosed by. This option results in many unnecessary tests, because objects are stored in inner nodes instead of leaf nodes in case they straddle any child cell further down the tree.

In some of the following posts of this series I’ll come back to the different insertion strategies and show in which situation which of the strategies is advantageous. Especially, Loose Octrees are a particularly nice way of overcoming most of the downsides discussed above.

Maximum tree depth

Let’s assume an Octree contains M points. As described in the previous section, each of the points can only fall exactly into one node. Is it possible to establish a relation between the number of points M and the maximum Octree depth D_\text{max}? It turns out, that the number of Octree nodes (=> the Octree depth) is not limited by the number of points. The reason is, that if the points are distributed closely enough in multiple widespread clusters, the number of Octree nodes can grow arbitrarily large. Look at the following figure for an illustration.

Octree point clusters

In order to split any of the two point clusters, the Octree must be subdivided a few times first. As the points inside the clusters can be arbitrarily close and the clusters can be arbitrarily far away from each other, the number of subdivision steps, and therefore the number of Octree nodes is not limited by the size of the point set. That shows that generally, Octrees cannot be balanced as we are used to from traditional tree data structures.

Nevertheless, we can come up with another upper bound for the maximum tree depth. Let’s assume a cubic Octree for simplicity. Given the minimum distance d_\text{min} between any two points in the point set and the side length of the root cell s, it can be shown that the maximum Octree depth is limited by \log\frac{s}{d_\text{min}}+\log\sqrt{3}\geq D_\text{max}. The following proof for this upper bound is rather simple.
The maximum distance between any two points in a cell at depth k is given by \sqrt{3(s/2^k)^2}=\sqrt{3}\frac{s}{2^k}. Any inner node encloses at least two points, otherwise it would be a leaf node. Hence, the maximum distance between any two points in this cell is guaranteed to be bigger than the minimum distance d_\text{min} between any two points of the point set. Therefore, it holds that d_\text{min}\leq\sqrt{3}\frac{s}{2^k}\Leftrightarrow k\leq\log\sqrt{3}+\log\frac{s}{d_\text{min}}.

That’s it for today. Stay tuned for the next article!

Slides for “Beyond the Limits – The Usage of C++ in the Demoscene”

At the C++ User Group Berlin meeting this May, Eivind Liland and I gave a talk about the usage of C++ in the demoscene. Below is the abstract of the talk. The slides can be found in the downloads section.

Eivind and David are demosceners since decades. In this talk they’re going to show you two of their award-winning, real-time graphics demos, both highly optimized for different limitations and platforms, and both written in C++.

Turtles all the Way Down by Brain Control (2013) is a 64k-intro for the PC. It’s an almost 5 minutes long audio-visual journey using cutting edge algorithms in the areas of computer graphics, generative art and music synthesis. Being a 64k-intro, all textures, 3D objects and music fit into a single executable of merely 65.536 bytes.
Matt Current by Shitfaced Clowns (2007) is a demo for the Gameboy Advance. It features at that time never-seen-before graphics effects and a software-rendered 3d engine that pushes the device’s hardware to their limits. One prevailing opinion is that only by coding in 100% assembly one can push such platforms beyond their limits. Eivind will explain how they used C++ to carefully squeeze the maximum out of every cycle of the GBA’s 16 MHz CPU.

Though seemingly esoteric, all the techniques employed to realize these demos have their application in professional software development nowadays. In times of GHz multi-core processors, GPUs and terabyte hard-drives, performance critical code and compact code for embedded and mobile platforms still plays an important role. Eivind and David are going to guide you through the process of creating these graphics demos. They talk about the used algorithms and tools keeping the focus of how C++ was used to do the job.

An overview of direct memory access

Introduction

Direct memory access (DMA) is conceptually easy, but without experience in hardware design or driver development it can be cumbersome to understand. In this blog post I will explain what DMA is and how it evolved during the last decades. My goal is to make it comprehensible especially for people without experience in hardware design or driver development.

DMA allows computer devices of certain hardware sub-systems to directly access system memory and other device’s memory independently of the CPU. This enables the CPU to keep on working concurrently on other task while long lasting memory operations take place; considerably boosting overall system performance. DMA is used by different hardware like graphics cards, sound cards, network cards and disk drive controllers. DMA is rather a concept than a specific technology. There is no specification which describes in detail how DMA transfers work. Even on the contrary, the concept of directly accessing memory without CPU interaction is employed in many different hardware sub-systems in today’s computers. The most typical application is communicating with peripheral devices plugged into a bus system like ATA, SATA, PCI or PCI Express. Beyond that, DMA transfers are used for intra-core communication in micro processors and even to copy data from the memory of one computer into the memory of another computer over the network via remote DMA (don’t mix up this technology with NVIDIA’s new GPUDirect RDMA feature).

To give a concrete example, imagine you’re playing an open world computer game which loads new game assets on demand from your hard disk. Large amounts of game data must be copied over from hard disk into system RAM. Without DMA the CPU would be actively involved in each and every memory transfer operation. Consequently, less computing time would be left for other game play related tasks like AI or physics. In times of multi-core processors this seems less like a problem. However, as data volumes and work load sizes are ever growing, off-loading large memory transfer operations from the CPU is also today absolutely essential in order to achieve high system performance.

How DMA evolved over time

In my experience many software people think that DMA nowadays still works as it did in the old days. I guess this is because it’s the more intuitive way to think about DMA. Back then, extension devices did not actively take part in DMA transfers, but there was a DMA controller (e.g. the Intel 8237, first used in the IBM PC in 1981) which enabled DMA transfers between system memory and device I/O over the good old Industrial Standard Architecture (ISA) bus. The DMA controller could be programmed by the CPU to perform a number of memory transfers on behalf of the CPU. This way of accomplishing DMA transfers is also known as third party DMA. At that time the system bus was identical to the ISA expansion bus. To account for reduced bus performance in situations where CPU and DMA controller needed to access the bus simultaneously, different DMA modes (cycle stealing, transparent and burst) could be used. When the first IBM AT clones came out, the expansion bus got physically separated from the system bus using an ISA bridge. This was necessary because the AT clones had CPUs running at higher frequencies than the expansion bus. In the figure below the single bus and the separated bus architectures are depicted.

Single vs separate bus architecture

With the introduction of the conventional Peripheral Component Interface (PCI) bus architecture in 1992, the DMA controller became obsolete because of a technique called bus mastering, or first party DMA. PCI DMA transfers were implemented by allowing only one device at a time to access the bus. This device is called the bus master. While the bus master holds the bus it can perform memory transfers without CPU interaction. The fundamental difference between bus mastering and the use of a DMA controller is that DMA compatible devices must contain a DMA engine driving the memory transfers. As multiple PCI devices can master the bus, an arbitration scheme is required to avoid that more than one device drives the bus simultaneously. The advantage of bus mastering is a significant latency reduction because communication with the third party DMA controller is avoided. Additionally, each device’s DMA engine can be specifically optimized for the sort of DMA transfers it performs.

PCI architecture

Today’s computers don’t contain DMA controllers anymore. If they do so, it’s only to support legacy buses like e.g. ISA, often by simulating an ISA interface using a Low Pin Count (LPC) bus bridge. In 2004 the PCI successor and latest peripheral computer bus system PCI Express (PCIe) was introduced. PCIe turned the conventional PCI bus from a true bus architecture, with several devices physically sharing the same bus, into a serial, packet-switched, point-to-point architecture; very similar to how packet-switched networks function. PCIe connects each device with a dedicated, bi-directional link to a PCIe switch. As a result, PCIe supports full duplex DMA transfers of multiple devices at the same time. All arbitration logic is replaced by the packet routing logic implemented in the PCIe switches. While PCIe is entirely different to PCI on the hardware level, PCIe preserves backwards compatibility with PCI on the driver level. Newer PCIe devices can be detected and used by PCI drivers without explicit support for the PCIe standard. Though, the new PCIe features cannot be used of course.

PCIe architecture

DMA from a driver developer’s perspective

Now you know what DMA is and how it fits into a computer’s hardware architecture. So let’s see how DMA can be used in practice to speed up data heavy tasks. Since the dawn of DMA the driver (software) must prepare any peripheral DMA transfers, because only the operating system (OS) has full control over the memory system (we see later why this is important), the file system and the user-space processes. In the first step, the driver determines the source and destination memory addresses for the transfer. Next, the driver programs the hardware to perform the DMA transfer. The major difference between PCI/PCIe DMA and legacy ISA DMA is the way a DMA transfer is initiated. For PCI/PCIe no uniform, device independent way to initiate DMA transfers exists anymore, because each device contains its own, proprietary DMA engine. In contrast, the legacy DMA controller is always the same.

First, the peripheral device’s DMA engine is programmed with the source and destination addresses of the memory ranges to copy. Second, the device is signaled to begin the DMA transfer. Fair enough, but how can the driver know when the DMA transfer has finished? Usually, the device raises interrupts to inform the CPU about transfers that have finished. For each interrupt an interrupt handler, previously installed by the driver, is called and the finished transfer can be acknowledged accordingly by the OS (e.g. signaling the block I/O layer that a block has been read from disk and control can be handed back to the user-space process which requested this block). Back in the times of high latency spinning disks an slow network interfaces this was sufficient. Today, however, we’ve got solid state disks (SSD) and gigabit, low-latency network interfaces. To avoid completely maxing out the system by a vast number of interrupts, a common technique is to hold back and queue up multiple interrupts on the device until e.g. a timeout triggers, a certain number of interrupts are pending or any other condition suiting the application is met. This technique is known as interrupt coalescing. Obviously, the condition is always a trade-off between low latency and high throughput. The more frequently new interrupts are raised, the quicker the OS and its waiting processes are informed about finished memory transfers. However, if the OS is interrupted less often it can spend more time on other jobs.

DMA seems to be a nice feature in theory, but how does transferring large continuous memory regions play together with virtual memory? Virtual memory is usually organized in chunks of 4 KiB, called pages. Virtual memory is continuous as seen from a process’ point-of-view thanks to page tables and the memory management unit (MMU). However, it’s non-continuous as seen from the device point-of-view, because there is no MMU between the PCIe bus and the memory controller (well, some CPUs have an IO-MMU but let’s keep things simple). Hence, in a single DMA transfer only one page could be copied at a time. To overcome this limitation OS usually provide a scatter/gather API. Such an API chains together multiple page-sized memory transfers by creating a list of addresses of pages to be transferred.

Take home message

DMA is an indispensable technique for memory-heavy, high-performance computing. Over the last decades, the entire bus system and DMA controller concept was superseded by moving the DMA controller into the devices and using a point-to-point bus architecture. This reduced latency, made concurrent DMA transfers possible and allowed for device specific DMA engine optimizations. For the drivers less has changed. They are still responsible for initiating the DMA transfers. Though, today, instead of programming a DMA controller in a device independent way, drivers must program device specific DMA engines. Therefore, programming DMA transfers and processing DMA status information can look very different depending on the device.

On finding 1-bit sequences

Principles

Given is an arbitrary integer variable. How to find the index of the least significant bit (LSB) of the first 1-bit sequence of length >= n? Assuming n=4, let’s consider the following example of a random 32-bit integer value. The index we’re looking for is 10 in this case (indicated by the V character).

    31      24 | 23      16 | 15    V 8 | 7      0
MSB  01000111  |  11111101  |  10111100 | 01101001  LSB

Using a series of bit-wise and and shift-right operations the index of the LSB of the first 1111 sequence in the integer x can be found with the following trick.

x &= x>>1;
x &= x>>2;
index = __builtin_ffs(x)-1; // use _BitScanForward in Visual C++

After the first statement every 1 in x indicates the start of a 11 sequence. After the second statement every 1 in x indicates the start of a 1111 sequence. In the last statement the GCC intrinsic __builtin_ffs() (use _BitScanForward() if you’re on Visual C++) returns the bit position of the first set bit, starting from the LSB.
Note, that it doesn’t work to shift by four bits at once because it’s necessary to combine neighboring 1-bits to make sure that there are no 0-bits in-between. The following example illustrates how shifting by 3 bits wrongly yields two isolated 1-bits. In contrast, shifting by 2 bits correctly yields a sequence of 2 bits which can be further reduced into a single 1-bit indicating the start of the 1111 sequence.

 shift by 2    shift by 3
  01111010      01111010
& 00011110    & 00001111
= 00011010    = 00001010
    ok           wrong

Arbitrary sequence lengths

By cleverly choosing the number of bits to shift, it’s even possible to extend this construction to find bit sequences which length is not a power of two. As the order of the and-shift-right operations has no relevance, the following algorithm can be used to compute the number of bits to shift in order to find the n-bit sequence index. The sum of shifted bits must be equal to n-1 and the number of bits to shift is halved in each iteration. Therefore, the total number of executed iterations is ceil(log2(n)).

int FindBitSeqIndexLsb(int x, int n)
{
    assert(n >= 0 && n <= 32);

    while (n > 1)
    {
        const int shiftBits = n>>1;
        x &= (unsigned)x>>shiftBits; // shift in zeros from left
        n -= shiftBits;
    }

    return __builtin_ffs(x)-1; // use _BitScanForward in Visual C++
}

Exact sequence length

The described method finds bit sequences of length >= n. In case you’re looking for a bit sequence of exactly n bits, the following statement can be inserted right before the LSB scan is performed. This statement masks out any 1-bit which has a 1 on its left or right side. All the remaining 1-bits are isolated and indicate the start of a sequence of exactly n bits.

mask = (~(x<<1))&(~((unsigned)x>>1)); // shift in zeros from left and right
x &= mask;

Sequence alignment

To account for aligned bit sequences, unaligned 1-bits can be simply masked out from x before the LSB scan is performed. For example, to regard only bit sequences starting at nibble boundaries x can be modified with the operation x &= 0x11111111 (it is 0x...11 = 0b...00010001) to clear all bits not starting at an index which is a multiple of four.

Computing oriented minimum bounding boxes in 2D

Introduction

Oriented bounding boxes are an important tool for visibility testing and collision detection in computer graphics. In this post I want to talk about how to compute the oriented minimum bounding box (OMBB) of an arbitrary polygon in two dimensions. As a polygon just enforces an ordering on a set of points (vertices), everything described in the following equally applies to simple point sets. Minimum in this context refers to the area of the bounding box. A minimum oriented bounding box is also known as smallest-area enclosing rectangle. However, I will stick to the former term throughout this article as it is more frequently used in the computer graphics world.

The easiest way of computing a bounding box for a polygon is to determine the minimum and maximum x- and y- coordinates of its vertices. Such an axis aligned bounding box (AABB) can be computed trivially but it’s in most cases significantly bigger than the polygon’s OMBB. Finding the OMBB requires some more work as the bounding box’ area must be minimized, constrained by the location of the polygon’s vertices. Look at the following figure for an illustration (AABB in blue, OMBB in red).

AABB vs OMBB

The technique for computing OMBBs presented in the following consists of two detached steps. In the first step the convex hull of the input polygon is computed. If the polygon is convex this step can be omitted because a convex polygon is equal to its convex hull. In the second step the Rotating Calipers method is employed on the convex hull to compute the resulting OMBB. I will focus on the Rotating Calipers method because it’s not very widely known in comparison to the numerous ways of computing convex hulls.

Convex hulls

In less mathematical but more illustrative terms the convex hull of a set of n points can be described as the closed polygonal chain of all outer points of the set, which entirely encloses all set elements. You can picture it as the shape of a rubber band stretched around all set elements. The convex hull of a set of two-dimensional points can be efficiently computed in O(n\log n). In the figure below the convex hull of the vertices of a concave polygon is depicted.

Convex hull of vertices of concave polygon

There are numerous algorithms for computing convex hulls: Quick Hull, Gift Wrapping (also known as Jarvis March), Graham’s Algorithm and some more. I’ve chosen the Gift Wrapping algorithm for my implementation because it’s easy to implement and provides good performance in case n is small or the polygon’s convex hull contains only a few vertices. The runtime complexity is O(nh), where h is the number of vertices in the convex hull. In the general case Gift Wrapping is outperformed by other algorithms. Especially, when all points are part of the convex hull. In that case the complexity degrades to O(n^2).

As there are many good articles on the Gift Wrapping algorithm available online, I won’t describe it another time here. Instead I want to focus on the lesser-known Rotating Calipers method for computing OMBBs. However, take care that your convex hull algorithm correctly handles collinear points. If multiple points lie on a convex hull edge, only the spanning points should end up in the convex hull.

Rotating Calipers

Rotating Calipers is a versatile method for solving a number of problems from the field of computational geometry. It resembles the idea of rotating a dynamically adjustable caliper around the outside of a polygon’s convex hull. Originally, this method was invented to compute the diameter of convex polygons. Beyond that, it can be used to compute OMBBs, the minimum and maximum distance between two convex polygons, the intersection of convex polygons and many things more.

The idea of using the Rotating Calipers method for computing OMBBs is based on the following theorem, establishing a connection between the input polygon’s convex hull and the orientation of the resulting OMBB. The theorem was proven in 1975 by Freeman and Shapira1:

The smallest-area enclosing rectangle of a polygon has a side collinear with one of the edges of its convex hull.

Thanks to this theorem the number of OMBB candidates is dramatically reduced to the number of convex hull edges. Thus, the complexity of the Rotating Calipers method is linear if the convex hull is already available. If it isn’t available the overall complexity is bound by the cost of computing the convex hull. An example of a set of OMBB candidates (red) for a convex hull (green) is depicted in the figure below. Note, that there are as many OMBB candidates as convex hull edges and each OMBB candidate has one side flush with one edge of the convex hull.

OMBB candidates

To determine the OMBB of a polygon, first, two orthogonally aligned pairs of parallel supporting lines through the convex hull’s extreme points are created. The intersection of the four lines forms a rectangle. Next, the lines are simultaneously rotated about their supporting points until one line coincides with an edge of the convex hull. Each time an edge coincides, the four lines form another rectangle / OMBB candidate. This process is repeated until each convex hull edge once coincided with one of the four caliper lines. The resulting OMBB is the OMBB candidate with the smallest area. The entire algorithm is outlined step by step below.

  1. Compute the convex hull of the input polygon.
  2. Find the the extreme points p_\text{min}=(x_\text{min},y_\text{min})^T and p_\text{max}=(x_\text{max},y_\text{max})^T of the convex hull.
  3. Construct two vertical supporting lines at x_\text{min} and x_\text{max} and two horizontal ones at y_\text{min} and y_\text{max}.
  4. Initialize the current minimum rectangle area A_\text{min}=\infty.
  5. Rotate the supporting lines until one coincides with an edge of the convex hull.
    1. Compute the area A of the current rectangle.
    2. Update the minimum area and store the current rectangle if A<A_\text{min}.
  6. Repeat step 5 until all edges of the convex hull coincided once with one of the supporting lines.
  7. Output the minimum area rectangle stored in step 5.2.

In practice, in every iteration the smallest angle \phi_\text{min} between each caliper line and its associated, following convex hull edge is determined. Then, all caliper lines are rotated at once by \phi_\text{min} and the associated convex hull edge of the caliper line enclosing the smallest angle is advanced to the next convex hull edge.

Wrap up

Rotating Calipers is a very elegant method for computing OMBBs in two dimensions. O’Rourke generalized it to three dimensions, yielding an algorithm of cubic runtime complexity. However, in practice approximation algorithms are used for three dimensional data because they’re usually faster. Beyond that, it's worth knowing the Rotating Calipers technique as it can be employed with minor changes to numerous other geometric problems. Depending on the programming language the implementation of the entire algorithm, including the convex hull computation, requires merely 150-200 lines of code. My sample implementation in Javascript can be found in my github repository.


  1. H. Freeman, R. Shapira. Determining the minimum-area encasing rectangle for an arbitrary closed curve. Communications of the ACM, 18 Issue 7, July 1975, Pages 409-413 

Mutex lock guards in C++11

The new concurrency library of C++11 comes with two different classes for managing mutex locks: namely std::lock_guard and std::unique_lock. How do they compare with each other and in which situation which of the two classes should be preferably used?

The std::lock_guard class keeps its associated mutex locked during the entire life time by acquiring the lock on construction and releasing the lock on destruction. This makes it impossible to forget unlocking a critical section and it guarantees exception safety because any critical section is automatically unlocked when the stack is unwound after an exception was thrown. The std::lock_guard class should be used when a limited scope, like a class method, is to be locked.

void Foo::Bar()
{
    std::lock_guard<std::mutex> guard(this->Mutex);
    // mutex is locked now
}   // mutex is unlocked when lock guard goes out of scope

In contrast, the std::unique_lock class is a lot more flexible when dealing with mutex locks. It has the same interface as std::lock_guard but provides additional methods for explicitly locking and unlocking mutexes and deferring locking on construction. By passing std::defer_lock instead of std::adopt_lock the mutex remains unlocked when a std::unique_lock instance is constructed. The lock can then be obtained later by calling lock() on the std::unique_lock instance or alternatively, by passing it to the std::lock() function. To check if a std::unique_lock currently owns its associated mutex the owns_lock() method can be used. Hence, the mutex associated with a std::unique_lock doesn’t have to be locked (sometimes also referred to as owned) during the lock guard’s entire life time. As a consequence, the ownership of a std::unqiue_lock can be transferred between instances. This is why std::unique_lock is movable whereas std::lock_guard is not. Thus, more flexible locking schemes can be implemented by passing around locks between scopes.

For example a std::unique_lock can be returned from a function, or instances of a class containing a std::unique_lock attribute can be stored in containers. Consider the following example in which a mutex is locked in the function Foo(), returned to the function Bar() and only then unlocked on destruction.

std::mutex Mutex;

std::unique_lock<std::mutex> Foo()
{
    std::unique_lock<std::mutex> lock(Mutex);
    return lock;
    // mutex isn't unlocked here!
}

void Bar()
{
    auto lock = Foo();
}   // mutex is unlocked when lock goes out of scope

Keeping std::unique_lock’s additional lock status up-to-date induces some additional, minimal space and speed overhead in comparison to std::lock_guard. Hence, as a general rule, std::lock_guard should be preferably used when the additional features of std::unique_lock are not needed.