# The Infinite Loop

## Slides for “Beyond the Limits – The Usage of C++ in the Demoscene”

At the C++ User Group Berlin meeting this May, Eivind Liland and I gave a talk about the usage of C++ in the demoscene. Below is the abstract of the talk. The slides can be found in the downloads section.

Eivind and David have been demosceners for decades. In this talk they’re going to show you two of their award-winning, real-time graphics demos, both highly optimized for different limitations and platforms, and both written in C++.

Turtles all the Way Down by Brain Control (2013) is a 64k-intro for the PC. It’s an almost five minutes long audio-visual journey using cutting-edge algorithms in the areas of computer graphics, generative art and music synthesis. As befits a 64k-intro, all textures, 3D objects and music fit into a single executable of merely 65,536 bytes.
Matt Current by Shitfaced Clowns (2007) is a demo for the Gameboy Advance. It features graphics effects never seen before on the platform at that time and a software-rendered 3D engine that pushes the device’s hardware to its limits. One prevailing opinion is that such platforms can only be pushed to their limits by coding 100% in assembly. Eivind will explain how they used C++ to carefully squeeze the maximum out of every cycle of the GBA’s 16 MHz CPU.

Though seemingly esoteric, all the techniques employed to realize these demos have their place in professional software development today. In times of GHz multi-core processors, GPUs and terabyte hard-drives, performance-critical code and compact code for embedded and mobile platforms still play an important role. Eivind and David are going to guide you through the process of creating these graphics demos. They talk about the algorithms and tools they used, keeping the focus on how C++ was used to do the job.

# Introduction

Direct memory access (DMA) is conceptually simple, but without experience in hardware design or driver development it can be cumbersome to understand. In this blog post I will explain what DMA is and how it evolved over the last decades. My goal is to make it comprehensible especially for readers without that background.

DMA allows computer devices of certain hardware sub-systems to directly access system memory and other devices’ memory independently of the CPU. This enables the CPU to keep working concurrently on other tasks while long-lasting memory operations take place, considerably boosting overall system performance. DMA is used by many kinds of hardware like graphics cards, sound cards, network cards and disk drive controllers. DMA is a concept rather than a specific technology. There is no specification which describes in detail how DMA transfers work. On the contrary, the concept of directly accessing memory without CPU interaction is employed in many different hardware sub-systems in today’s computers. The most typical application is communicating with peripheral devices plugged into a bus system like ATA, SATA, PCI or PCI Express. Beyond that, DMA transfers are used for intra-core communication in microprocessors and even to copy data from the memory of one computer into the memory of another computer over the network via remote DMA (not to be confused with NVIDIA’s GPUDirect RDMA feature).

To give a concrete example, imagine you’re playing an open world computer game which loads new game assets on demand from your hard disk. Large amounts of game data must be copied over from hard disk into system RAM. Without DMA the CPU would be actively involved in each and every memory transfer operation. Consequently, less computing time would be left for other gameplay-related tasks like AI or physics. In times of multi-core processors this seems like less of a problem. However, as data volumes and workload sizes are ever growing, off-loading large memory transfer operations from the CPU is still absolutely essential today in order to achieve high system performance.

# How DMA evolved over time

In my experience many software people think that DMA nowadays still works as it did in the old days. I guess this is because it’s the more intuitive way to think about DMA. Back then, extension devices did not actively take part in DMA transfers, but there was a DMA controller (e.g. the Intel 8237, first used in the IBM PC in 1981) which enabled DMA transfers between system memory and device I/O over the good old Industry Standard Architecture (ISA) bus. The DMA controller could be programmed by the CPU to perform a number of memory transfers on behalf of the CPU. This way of accomplishing DMA transfers is also known as third party DMA. At that time the system bus was identical to the ISA expansion bus. To account for reduced bus performance in situations where CPU and DMA controller needed to access the bus simultaneously, different DMA modes (cycle stealing, transparent and burst) could be used. When the first IBM AT clones came out, the expansion bus got physically separated from the system bus using an ISA bridge. This was necessary because the AT clones had CPUs running at higher frequencies than the expansion bus. In the figure below the single bus and the separated bus architectures are depicted.

With the introduction of the conventional Peripheral Component Interconnect (PCI) bus architecture in 1992, the DMA controller became obsolete thanks to a technique called bus mastering, or first party DMA. PCI DMA transfers were implemented by allowing only one device at a time to access the bus. This device is called the bus master. While the bus master holds the bus it can perform memory transfers without CPU interaction. The fundamental difference between bus mastering and the use of a DMA controller is that DMA-compatible devices must now contain their own DMA engine driving the memory transfers. As multiple PCI devices can master the bus, an arbitration scheme is required to ensure that no more than one device drives the bus simultaneously. The advantage of bus mastering is a significant latency reduction because communication with the third party DMA controller is avoided. Additionally, each device’s DMA engine can be specifically optimized for the sort of DMA transfers it performs.

Today’s computers don’t contain DMA controllers anymore. If they do, it’s only to support legacy buses like ISA, often by simulating an ISA interface using a Low Pin Count (LPC) bus bridge. In 2004 the PCI successor and latest peripheral computer bus system, PCI Express (PCIe), was introduced. PCIe turned the conventional PCI bus from a true bus architecture, with several devices physically sharing the same bus, into a serial, packet-switched, point-to-point architecture, very similar to how packet-switched networks function. PCIe connects each device with a dedicated, bi-directional link to a PCIe switch. As a result, PCIe supports full-duplex DMA transfers of multiple devices at the same time. All arbitration logic is replaced by the packet routing logic implemented in the PCIe switches. While PCIe is entirely different from PCI on the hardware level, it preserves backwards compatibility with PCI on the driver level: newer PCIe devices can be detected and used by PCI drivers without explicit support for the PCIe standard, though such drivers cannot use the new PCIe features, of course.

# DMA from a driver developer’s perspective

Now you know what DMA is and how it fits into a computer’s hardware architecture. So let’s see how DMA can be used in practice to speed up data-heavy tasks. Since the dawn of DMA, the driver (software) must prepare any peripheral DMA transfers, because only the operating system (OS) has full control over the memory system (we’ll see later why this is important), the file system and the user-space processes. In the first step, the driver determines the source and destination memory addresses for the transfer. Next, the driver programs the hardware to perform the DMA transfer. The major difference between PCI/PCIe DMA and legacy ISA DMA is the way a DMA transfer is initiated. For PCI/PCIe no uniform, device-independent way to initiate DMA transfers exists anymore, because each device contains its own, proprietary DMA engine. In contrast, the legacy DMA controller is always the same.

First, the peripheral device’s DMA engine is programmed with the source and destination addresses of the memory ranges to copy. Second, the device is signaled to begin the DMA transfer. Fair enough, but how can the driver know when the DMA transfer has finished? Usually, the device raises interrupts to inform the CPU about transfers that have finished. For each interrupt an interrupt handler, previously installed by the driver, is called and the finished transfer can be acknowledged accordingly by the OS (e.g. signaling the block I/O layer that a block has been read from disk and control can be handed back to the user-space process which requested this block). Back in the times of high-latency spinning disks and slow network interfaces this was sufficient. Today, however, we’ve got solid state disks (SSDs) and low-latency gigabit network interfaces. To avoid overwhelming the system with a vast number of interrupts, a common technique is to hold back and queue up multiple interrupts on the device until e.g. a timeout triggers, a certain number of interrupts are pending or any other condition suiting the application is met. This technique is known as interrupt coalescing. Obviously, the chosen condition is always a trade-off between low latency and high throughput: the more frequently new interrupts are raised, the quicker the OS and its waiting processes are informed about finished memory transfers; the less often the OS is interrupted, the more time it can spend on other jobs.

DMA seems to be a nice feature in theory, but how does transferring large continuous memory regions play together with virtual memory? Virtual memory is usually organized in chunks of 4 KiB, called pages. Virtual memory is continuous as seen from a process’ point of view, thanks to page tables and the memory management unit (MMU). However, it’s non-continuous as seen from the device’s point of view, because there is no MMU between the PCIe bus and the memory controller (well, some CPUs have an IO-MMU, but let’s keep things simple). Hence, in a single DMA transfer only one page can be copied at a time. To overcome this limitation operating systems usually provide a scatter/gather API. Such an API chains together multiple page-sized memory transfers by creating a list of addresses of pages to be transferred.
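Conceptually, a scatter/gather list is just a chain of (address, length) descriptors, one per physically contiguous chunk. The following sketch builds such a list in plain C++; the `SgEntry` struct and `BuildSgList` function are made-up illustrations and don’t correspond to any real kernel API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical scatter/gather entry: describes one physically contiguous
// chunk (here at most one 4 KiB page) of an otherwise non-contiguous buffer.
// A real DMA engine would walk such a list in hardware.
struct SgEntry
{
    uint64_t physAddr; // (pseudo) physical address of the chunk
    uint32_t length;   // chunk length in bytes
};

// Chains together page-sized transfers for a buffer of `size` bytes whose
// pages live at the (pseudo) physical addresses given in `pages`.
std::vector<SgEntry> BuildSgList(const std::vector<uint64_t> &pages, uint32_t size)
{
    const uint32_t pageSize = 4096;
    std::vector<SgEntry> list;
    for (uint64_t pa : pages)
    {
        const uint32_t chunk = (size < pageSize) ? size : pageSize;
        list.push_back({pa, chunk});
        size -= chunk;
        if (size == 0) // buffer fully described
            break;
    }
    return list;
}
```

A 9000-byte buffer, for instance, ends up as two full pages plus one 808-byte tail chunk.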

# Take home message

DMA is an indispensable technique for memory-heavy, high-performance computing. Over the last decades, the bus system and DMA controller concept was superseded by moving the DMA engine into the devices and using a point-to-point bus architecture. This reduced latency, made concurrent DMA transfers possible and allowed for device-specific DMA engine optimizations. For drivers, less has changed. They are still responsible for initiating the DMA transfers. Though today, instead of programming a DMA controller in a device-independent way, drivers must program device-specific DMA engines. Therefore, programming DMA transfers and processing DMA status information can look very different depending on the device.

# Principles

Given an arbitrary integer variable, how do we find the index of the least significant bit (LSB) of the first 1-bit sequence of length >= n? Assuming n=4, let’s consider the following example of a random 32-bit integer value. The index we’re looking for is 10 in this case (indicated by the V character).

         31      24 | 23      16 | 15   V  8 | 7      0
    MSB  01000111   | 11111101   | 10111100  | 01101001  LSB

Using a series of bit-wise AND and shift-right operations, the index of the LSB of the first 1111 sequence in the integer x can be found with the following trick.

    x &= x>>1;
    x &= x>>2;
    index = __builtin_ffs(x)-1; // use _BitScanForward in Visual C++

After the first statement every 1 in x indicates the start of a 11 sequence. After the second statement every 1 in x indicates the start of a 1111 sequence. In the last statement the GCC intrinsic __builtin_ffs() (use _BitScanForward() if you’re on Visual C++) returns the 1-based position of the first set bit, counting from the LSB; hence the -1.
Note that it doesn’t work to shift by four bits at once, because it’s necessary to combine neighboring 1-bits to make sure that there are no 0-bits in-between. The following example illustrates how shifting by 3 bits wrongly yields two isolated 1-bits. In contrast, shifting by 2 bits correctly yields a sequence of 2 bits which can be further reduced into a single 1-bit indicating the start of the 1111 sequence.

      shift by 2      shift by 3
      01111010        01111010
    & 00011110      & 00001111
    = 00011010      = 00001010
      ok              wrong

# Arbitrary sequence lengths

By cleverly choosing the number of bits to shift, it’s even possible to extend this construction to find bit sequences whose length is not a power of two. As the order of the AND-shift-right operations is irrelevant, the following algorithm can be used to compute the number of bits to shift in order to find the n-bit sequence index. The sum of shifted bits must be equal to n-1 and the number of bits to shift is halved in each iteration. Therefore, the total number of executed iterations is ceil(log2(n)).

    int FindBitSeqIndexLsb(int x, int n)
    {
        assert(n >= 0 && n <= 32);

        while (n > 1)
        {
            const int shiftBits = n>>1;
            x &= (unsigned)x>>shiftBits; // shift in zeros from left
            n -= shiftBits;
        }

        return __builtin_ffs(x)-1; // use _BitScanForward in Visual C++
    }
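Applied to the 32-bit example value from the introduction (0x47fdbc69), the function indeed finds index 10 for n=4. The function is repeated in the snippet below so it stands alone:

```cpp
#include <cassert>

// FindBitSeqIndexLsb from above, repeated so this snippet is self-contained
int FindBitSeqIndexLsb(int x, int n)
{
    assert(n >= 0 && n <= 32);

    while (n > 1)
    {
        const int shiftBits = n>>1;
        x &= (unsigned)x>>shiftBits; // shift in zeros from left
        n -= shiftBits;
    }

    return __builtin_ffs(x)-1; // use _BitScanForward in Visual C++
}

// the example value from the introduction:
// 01000111 11111101 10111100 01101001 = 0x47fdbc69
```

Here FindBitSeqIndexLsb(0x47fdbc69, 4) yields 10, matching the position marked with V, and FindBitSeqIndexLsb(0x47fdbc69, 2) yields 5, the start of the first 11 run.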

# Exact sequence length

The described method finds bit sequences of length >= n. In case you’re looking for a bit sequence of exactly n bits, the following statement can be inserted right before the LSB scan is performed. This statement masks out any 1-bit which has a 1 on its left or right side. All the remaining 1-bits are isolated and indicate the start of a sequence of exactly n bits.

    mask = (~(x<<1))&(~((unsigned)x>>1)); // shift in zeros from left and right
    x &= mask;
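Putting both pieces together, an exact-length variant of the search could look as follows (a sketch; the function name is made up):

```cpp
#include <cassert>

// Finds the LSB index of the first run of exactly n 1-bits by applying the
// isolation mask after the and-shift reduction; returns -1 if there is none.
int FindExactBitSeqIndexLsb(unsigned x, int n)
{
    assert(n >= 1 && n <= 32);

    int m = n;
    while (m > 1) // same and-shift-right reduction as before
    {
        const int shiftBits = m>>1;
        x &= x>>shiftBits;
        m -= shiftBits;
    }

    // mask out every 1-bit that has a 1-bit neighbor; the remaining
    // isolated 1-bits each mark the start of a run of exactly n bits
    x &= (~(x<<1)) & (~(x>>1));

    return __builtin_ffs((int)x)-1;
}
```

For example, in 0x1f1e (a 4-bit run starting at bit 1, a 5-bit run starting at bit 8) the first exactly-4 run starts at index 1 and the first exactly-5 run at index 8.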

# Sequence alignment

To account for aligned bit sequences, unaligned 1-bits can simply be masked out from x before the LSB scan is performed. For example, to regard only bit sequences starting at nibble boundaries, x can be modified with the operation x &= 0x11111111 (binary 0b...00010001) to clear all bits not starting at an index which is a multiple of four.
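As a sketch (with a hypothetical helper name), a search for 1111 runs starting at nibble boundaries then looks like this:

```cpp
// Finds the first 1111 run that starts at a nibble boundary;
// returns -1 if there is none.
int FindAlignedNibbleSeqIndexLsb(unsigned x)
{
    x &= x>>1;
    x &= x>>2;       // every remaining 1 marks the start of a 1111 run
    x &= 0x11111111; // keep only starts at bit index 0, 4, 8, ...
    return __builtin_ffs((int)x)-1;
}
```

In 0x0f78 the unaligned run starting at bit 3 is skipped and the aligned run starting at bit 8 is found.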

# Introduction

Oriented bounding boxes are an important tool for visibility testing and collision detection in computer graphics. In this post I want to talk about how to compute the oriented minimum bounding box (OMBB) of an arbitrary polygon in two dimensions. As a polygon just enforces an ordering on a set of points (vertices), everything described in the following equally applies to simple point sets. Minimum in this context refers to the area of the bounding box. A minimum oriented bounding box is also known as smallest-area enclosing rectangle. However, I will stick to the former term throughout this article as it is more frequently used in the computer graphics world.

The easiest way of computing a bounding box for a polygon is to determine the minimum and maximum $x$- and $y$-coordinates of its vertices. Such an axis-aligned bounding box (AABB) can be computed trivially, but in most cases it’s significantly bigger than the polygon’s OMBB. Finding the OMBB requires some more work, as the bounding box’s area must be minimized, constrained by the location of the polygon’s vertices. Look at the following figure for an illustration (AABB in blue, OMBB in red).

The technique for computing OMBBs presented in the following consists of two separate steps. In the first step the convex hull of the input polygon is computed. If the polygon is convex this step can be omitted, because a convex polygon is equal to its convex hull. In the second step the Rotating Calipers method is employed on the convex hull to compute the resulting OMBB. I will focus on the Rotating Calipers method because it’s not very widely known in comparison to the numerous ways of computing convex hulls.

# Convex hulls

In less mathematical but more illustrative terms the convex hull of a set of $n$ points can be described as the closed polygonal chain of all outer points of the set, which entirely encloses all set elements. You can picture it as the shape of a rubber band stretched around all set elements. The convex hull of a set of two-dimensional points can be efficiently computed in $O(n\log n)$. In the figure below the convex hull of the vertices of a concave polygon is depicted.

There are numerous algorithms for computing convex hulls: Quick Hull, Gift Wrapping (also known as Jarvis March), Graham’s Algorithm and some more. I’ve chosen the Gift Wrapping algorithm for my implementation because it’s easy to implement and provides good performance in case $n$ is small or the polygon’s convex hull contains only a few vertices. The runtime complexity is $O(nh)$, where $h$ is the number of vertices in the convex hull. In the general case Gift Wrapping is outperformed by other algorithms, especially when all points are part of the convex hull, in which case the complexity degrades to $O(n^2)$.

As there are many good articles on the Gift Wrapping algorithm available online, I won’t describe it again here. Instead I want to focus on the lesser-known Rotating Calipers method for computing OMBBs. However, take care that your convex hull algorithm correctly handles collinear points. If multiple points lie on a convex hull edge, only the spanning points should end up in the convex hull.

# Rotating Calipers

Rotating Calipers is a versatile method for solving a number of problems from the field of computational geometry. It resembles the idea of rotating a dynamically adjustable caliper around the outside of a polygon’s convex hull. Originally, this method was invented to compute the diameter of convex polygons. Beyond that, it can be used to compute OMBBs, the minimum and maximum distance between two convex polygons, the intersection of convex polygons and much more.

The idea of using the Rotating Calipers method for computing OMBBs is based on the following theorem, establishing a connection between the input polygon’s convex hull and the orientation of the resulting OMBB. The theorem was proven in 1975 by Freeman and Shapira [1]:

> The smallest-area enclosing rectangle of a polygon has a side collinear with one of the edges of its convex hull.

Thanks to this theorem the number of OMBB candidates is dramatically reduced to the number of convex hull edges. Thus, the complexity of the Rotating Calipers method is linear if the convex hull is already available. If it isn’t, the overall complexity is bound by the cost of computing the convex hull. An example of a set of OMBB candidates (red) for a convex hull (green) is depicted in the figure below. Note that there are as many OMBB candidates as convex hull edges, and each OMBB candidate has one side flush with one edge of the convex hull.

To determine the OMBB of a polygon, first, two orthogonally aligned pairs of parallel supporting lines through the convex hull’s extreme points are created. The intersection of the four lines forms a rectangle. Next, the lines are simultaneously rotated about their supporting points until one line coincides with an edge of the convex hull. Each time an edge coincides, the four lines form another rectangle / OMBB candidate. This process is repeated until each convex hull edge has coincided once with one of the four caliper lines. The resulting OMBB is the candidate with the smallest area. The entire algorithm is outlined step by step below.

1. Compute the convex hull of the input polygon.
2. Find the extreme points $p_\text{min}=(x_\text{min},y_\text{min})^T$ and $p_\text{max}=(x_\text{max},y_\text{max})^T$ of the convex hull.
3. Construct two vertical supporting lines at $x_\text{min}$ and $x_\text{max}$ and two horizontal ones at $y_\text{min}$ and $y_\text{max}$.
4. Initialize the current minimum rectangle area $A_\text{min}=\infty$.
5. Rotate the supporting lines until one coincides with an edge of the convex hull.
    1. Compute the area $A$ of the current rectangle.
    2. If $A<A_\text{min}$, update the minimum area and store the current rectangle.
6. Repeat step 5 until each edge of the convex hull has coincided once with one of the supporting lines.
7. Output the minimum area rectangle stored in step 5.2.

In practice, in every iteration the smallest angle $\phi_\text{min}$ between each caliper line and its associated, following convex hull edge is determined. Then, all caliper lines are rotated at once by $\phi_\text{min}$ and the associated convex hull edge of the caliper line enclosing the smallest angle is advanced to the next convex hull edge.
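The theorem can also be exploited more directly: for every hull edge, project all hull vertices onto the edge direction and its normal and take the extents. This $O(h^2)$ sketch (types and names are my own) trades the linear calipers sweep for simplicity, but computes the same minimum area:

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <vector>

struct Vec2 { double x, y; };

// For every convex hull edge, project all hull vertices onto the edge
// direction and its normal; the extents span one OMBB candidate. Returns
// the minimum candidate area. O(h^2), unlike the O(h) calipers sweep.
double MinAreaRect(const std::vector<Vec2> &hull)
{
    double minArea = DBL_MAX;
    for (size_t i=0; i<hull.size(); i++)
    {
        const Vec2 &p0 = hull[i], &p1 = hull[(i+1)%hull.size()];
        const double len = std::hypot(p1.x-p0.x, p1.y-p0.y);
        const Vec2 dir = {(p1.x-p0.x)/len, (p1.y-p0.y)/len}; // edge direction

        double minU = DBL_MAX, maxU = -DBL_MAX;
        double minV = DBL_MAX, maxV = -DBL_MAX;
        for (const Vec2 &p : hull)
        {
            const double u = p.x*dir.x + p.y*dir.y;  // along the edge
            const double v = -p.x*dir.y + p.y*dir.x; // along the normal
            minU = std::min(minU, u); maxU = std::max(maxU, u);
            minV = std::min(minV, v); maxV = std::max(maxV, v);
        }

        minArea = std::min(minArea, (maxU-minU)*(maxV-minV));
    }
    return minArea;
}
```

For the 45°-rotated unit-ish square (0,1), (1,0), (2,1), (1,2) the minimum area is 2, whereas its AABB has area 4.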

# Wrap up

Rotating Calipers is a very elegant method for computing OMBBs in two dimensions. O’Rourke generalized it to three dimensions, yielding an algorithm of cubic runtime complexity. However, in practice approximation algorithms are used for three-dimensional data because they’re usually faster. Beyond that, it’s worth knowing the Rotating Calipers technique as it can be employed with minor changes to numerous other geometric problems. Depending on the programming language, the implementation of the entire algorithm, including the convex hull computation, requires merely 150-200 lines of code. My sample implementation in JavaScript can be found in my GitHub repository.

1. H. Freeman, R. Shapira: Determining the minimum-area encasing rectangle for an arbitrary closed curve. Communications of the ACM, Vol. 18, Issue 7, July 1975, pp. 409-413

## Mutex lock guards in C++11

The new concurrency library of C++11 comes with two different classes for managing mutex locks: namely std::lock_guard and std::unique_lock. How do they compare with each other and in which situation which of the two classes should be preferably used?

The std::lock_guard class keeps its associated mutex locked during its entire lifetime by acquiring the lock on construction and releasing it on destruction. This makes it impossible to forget unlocking a critical section, and it guarantees exception safety because any critical section is automatically unlocked when the stack is unwound after an exception was thrown. The std::lock_guard class should be used when a limited scope, like a class method, is to be locked.

    void Foo::Bar()
    {
        std::lock_guard<std::mutex> guard(this->Mutex);
        // mutex is locked now
    }   // mutex is unlocked when lock guard goes out of scope


In contrast, the std::unique_lock class is a lot more flexible when dealing with mutex locks. It has the same interface as std::lock_guard but provides additional methods for explicitly locking and unlocking mutexes and for deferring locking on construction. By passing std::defer_lock to the constructor, the mutex remains unlocked when a std::unique_lock instance is constructed. The lock can then be obtained later by calling lock() on the std::unique_lock instance or, alternatively, by passing the instance to the std::lock() function. To check if a std::unique_lock currently owns its associated mutex, the owns_lock() method can be used. Hence, the mutex associated with a std::unique_lock doesn’t have to be locked (sometimes also referred to as owned) during the lock guard’s entire lifetime. As a consequence, the ownership of the mutex can be transferred between std::unique_lock instances. This is why std::unique_lock is movable whereas std::lock_guard is not. Thus, more flexible locking schemes can be implemented by passing around locks between scopes.
For example, a std::unique_lock can be returned from a function, or instances of a class containing a std::unique_lock attribute can be stored in containers. Consider the following example, in which a mutex is locked in the function Foo(), returned to the function Bar() and only unlocked there on destruction.

    std::mutex Mutex;

    std::unique_lock<std::mutex> Foo()
    {
        std::unique_lock<std::mutex> lock(Mutex);
        return lock;
        // mutex isn't unlocked here!
    }

    void Bar()
    {
        auto lock = Foo();
    }   // mutex is unlocked when lock goes out of scope


Keeping std::unique_lock's additional lock status up-to-date induces some minimal additional space and speed overhead in comparison to std::lock_guard. Hence, as a general rule, std::lock_guard should be used whenever the additional features of std::unique_lock are not needed.
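A short sketch of deferred locking in action: two mutexes are wrapped in deferred std::unique_locks and then locked atomically with std::lock(), a scheme std::lock_guard cannot express (the mutex names and the LockBoth function are made up for illustration):

```cpp
#include <cassert>
#include <mutex>

std::mutex MutexA, MutexB;

// Both mutexes are wrapped in deferred unique_locks, then locked atomically
// with std::lock(), which avoids deadlock independently of the order in
// which other threads acquire them.
void LockBoth()
{
    std::unique_lock<std::mutex> lockA(MutexA, std::defer_lock);
    std::unique_lock<std::mutex> lockB(MutexB, std::defer_lock);
    assert(!lockA.owns_lock() && !lockB.owns_lock()); // nothing locked yet

    std::lock(lockA, lockB); // lock both without risking deadlock
    assert(lockA.owns_lock() && lockB.owns_lock());
}   // both mutexes are unlocked when the locks go out of scope
```

After LockBoth() returns, both mutexes are free again thanks to the RAII semantics of std::unique_lock.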

# Introduction

Binary search finds the index of a specified value (also called search key) in a pre-sorted array. Binary search is a very efficient and elegant algorithm which is used in a large number of data structures and algorithms. It has a runtime complexity of O(log N), where N is the number of elements to be searched through, because in each step the remaining number of array elements is halved.

Even though the binary search algorithm is generally extremely fast, I recently designed a data structure where the binary search was the major bottleneck. The data structure successively performs a vast number of binary searches on a very large, static array of integers. The array can have a size of up to 100 billion unsigned 32-bit integers, which corresponds to around 372.5 GB of raw data. To search for the first occurrence of an arbitrary key in an array of N = 100000000000 elements, a binary search touches up to floor(log2(N))+1 = 37 elements.

This is not an article about how to implement the binary search algorithm. More information on that can be found in plenty of resources online. The implementation of the optimization technique described in the rest of this post can be found in the accompanying github repository.

# Memory hierarchy

Considering the size of the data set, the first few array accesses are guaranteed to cause cache misses, or, even worse, page misses, depending on the amount of available main memory. On machines with little main memory a page miss requires the operating system to fetch the corresponding block containing the required array element from disk (obviously this requires some additional code logic). Accessing main memory is roughly 200x slower than accessing the first-level cache. The random access time of a modern SSD is typically under 0.1 ms, which is roughly 1000x slower than accessing main memory. The random access times of spinning disks vary considerably and are usually in the order of a few milliseconds. Assuming a random access time of 5 ms, accessing a spinning disk is roughly 50000x slower than accessing main memory. Take these numbers with a grain of salt. They are by no means accurate and will differ considerably when measured on concrete systems. However, the overall orders of magnitude are correct and give an intuitive feeling of how costly array accesses can be.

# Look-up tables

The previous paragraph should have cleared up why array accesses can be costly, especially when data is stored "far away" from the CPU. As a consequence it seems reasonable to try reducing the runtime costs of a binary search by reducing the number of array accesses needed for a search. This is a widely used optimization technique: an increased memory consumption is traded for increased performance.

The number of array accesses can be reduced by limiting the initial search range from left = 0, right = N-1 to anything smaller which is guaranteed to still contain the search key. The simple binary structure of unsigned integers allows us to use some of the high bits as an index into a look-up table (LUT) containing the reduced search range for the corresponding key. The more bits are used, the bigger the resulting speed-up, but also the more memory is needed. Using 16 bits requires only a 512 KB LUT; using 24 bits already requires a 128 MB LUT. As an example, let's consider a 16-bit LUT which maps the highest 16 bits of the search key. Each LUT entry maps to a sub-range in the value array that can contain up to 65536 different values.

    0 [0x00000..0x0ffff] -> left = index of first element >= 0,      right = index of last element < 65536
    1 [0x10000..0x1ffff] -> left = index of first element >= 65536,  right = index of last element < 131072
    2 [0x20000..0x2ffff] -> left = index of first element >= 131072, right = index of last element < 196608
    ...

Let's consider a search for the key 114910, which is 0x0001c0de in hex. In order to obtain the LUT index for the key, a right-shift by 16 bits is performed: lutIdx = key>>16, which results in lutIdx = 0x0001 = 1. The sub-range stored in the LUT at index 1 is guaranteed to contain all values in the range 65536 <= key < 131072. The binary search, taking the search key as well as the interval start and end as arguments, can then be invoked by calling:

    BinarySearch(key, Lut[lutIdx].left, Lut[lutIdx].right);

This is very little, simple and efficient code, but how is the LUT populated? It turns out that populating the LUT can be done very efficiently in O(N) with a single, sequential pass over the input array. Even though all elements must be touched once to populate the LUT, the sequential nature of the algorithm allows prefetching required data in advance and exploits the much higher sequential read performance of spinning disks. The algorithm to populate the LUT is depicted in C++ below.

    std::vector<std::pair<size_t, size_t>> Lut;

    void InitLut(const std::vector<uint32_t> &vals, size_t lutBits)
    {
        // pre-allocate memory for LUT
        const size_t lutSize = (size_t)1<<lutBits;
        const size_t shiftBits = 32-lutBits; // for 32-bit integers!
        Lut.resize(lutSize);

        // populate look-up-table
        size_t thresh = 0, last = 0;

        for (size_t i=0; i+1<vals.size(); i++)
        {
            const size_t nextThresh = vals[i+1]>>shiftBits;
            Lut[thresh] = {last, i};

            if (nextThresh > thresh) // more than one LUT entry to fill?
            {
                last = i+1;
                for (size_t j=thresh+1; j<=nextThresh; j++)
                    Lut[j] = {last, i+1};
            }

            thresh = nextThresh;
        }

        // set remaining thresholds that couldn't be found
        for (size_t i=thresh; i<Lut.size(); i++)
            Lut[i] = {last, vals.size()-1};
    }

The LUT is implemented as an array of pairs. Each pair represents the interval the corresponding LUT entry maps to. To populate the LUT, the outer for-loop iterates over the pre-sorted array and calculates the number of LUT entries to fill between two successive array elements. The inner for-loop fills the LUT accordingly. As the start of each LUT entry can be derived from the end of the previous one, the last variable keeps track of the end of the last interval. The intervals are non-overlapping and the end of an interval is the beginning of the next interval minus one: Lut[i].second = Lut[i+1].first-1. As a consequence, the memory size of the LUT can be halved by storing only the interval starts and computing the interval ends as described. For simplicity, the depicted code omits this improvement, in contrast to the code in the accompanying GitHub repository.
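To sketch the complete look-up path, the following (slightly compacted) version combines InitLut with a range-reduced std::lower_bound. The LutSearch helper is my own naming, not taken from the repository:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

static std::vector<std::pair<size_t, size_t>> Lut;

// same algorithm as above, compacted so this snippet is self-contained
static void InitLut(const std::vector<uint32_t> &vals, size_t lutBits)
{
    const size_t shiftBits = 32-lutBits;
    Lut.assign((size_t)1<<lutBits, {0, 0});

    size_t thresh = 0, last = 0;
    for (size_t i=0; i+1<vals.size(); i++)
    {
        const size_t nextThresh = vals[i+1]>>shiftBits;
        Lut[thresh] = {last, i};
        if (nextThresh > thresh) // more than one LUT entry to fill?
        {
            last = i+1;
            for (size_t j=thresh+1; j<=nextThresh; j++)
                Lut[j] = {last, i+1};
        }
        thresh = nextThresh;
    }
    for (size_t i=thresh; i<Lut.size(); i++)
        Lut[i] = {last, vals.size()-1};
}

// returns the index of the first element >= key, searching only the
// sub-range the LUT maps the key's high bits to
static size_t LutSearch(const std::vector<uint32_t> &vals, uint32_t key,
                        size_t lutBits)
{
    const auto &range = Lut[key>>(32-lutBits)];
    const auto it = std::lower_bound(vals.begin()+range.first,
                                     vals.begin()+range.second+1, key);
    return (size_t)(it-vals.begin());
}
```

For the example key 114910 from above, LutSearch narrows the search down to the few elements whose high 16 bits equal 1 before std::lower_bound takes over.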

# Other data types

The presented LUT-based optimization only works for data types that are compatible with a simple bit-wise comparison. This means that more significant bits must have a bigger influence on the value's magnitude than less significant bits. So do signed integers and floats work as well? As long as they're positive there are no problems, because for any two positive signed integers or floats x and y it holds that if x <= y it follows that UIR(x) <= UIR(y), where UIR() denotes the binary unsigned integer representation. However, negative values cause problems. Hence, when creating the LUT and when doing a range look-up the array elements and the search key must be mapped to an ordered range using a bijective mapping as described below.

## Signed integers

Signed integers are stored in two's complement, in which the most significant bit (MSB) represents the sign: if the MSB is set the value is negative, otherwise it's positive. Therefore, under bit-wise comparison negative signed integers compare greater than positive ones. It turns out that by simply flipping the sign bit the ordering can be fixed. It's important to note that flipping the sign bit is only enough if two's complement is used. If a signed magnitude representation is used, the order of negative values is not automatically corrected. In that case, when the sign bit is set, not only the sign bit but also all the remaining bits have to be flipped. For clarity, look at the following example of two's complement integers, where FSB() denotes the function that flips the sign bit: FSB(-6 = 11111010) = 01111010 = 122 < FSB(-5 = 11111011) = 01111011 = 123 < FSB(2 = 00000010) = 10000010 = 130. In contrast, if a signed magnitude representation is used, flipping the sign bit is not sufficient anymore, as the following example illustrates: -6 < -5, however FSB(-6 = 10000110) = 00000110 = 6 > FSB(-5 = 10000101) = 00000101 = 5.
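In code, the mapping for two's complement integers is a single XOR (a sketch; MapSigned is a made-up name):

```cpp
#include <cstdint>

// Flips the sign bit, mapping two's complement integers to unsigned values
// that preserve the original signed ordering.
uint32_t MapSigned(int32_t val)
{
    return (uint32_t)val ^ 0x80000000u;
}
```

With this mapping, INT32_MIN maps to 0 and the ordering MapSigned(-6) < MapSigned(-5) < MapSigned(2) holds, matching the FSB() example above.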

## Floating point

Floating point values (I'm talking about IEEE 754) are a bit nastier than signed integers. Their signed magnitude representation has exactly the same problem as signed-magnitude integers, as outlined in the previous section. To overcome this problem, all the remaining bits have to be flipped as well if the sign bit was set. For further information read Michael Herf's article about radix sorting floats on Stereopsis. One possible implementation in C++ is:

// adapted from Michael Herf's article on Stereopsis
uint32_t MapFloat(float val)
{
    // 1) flip sign bit
    // 2) if sign bit was set flip all other bits as well
    const uint32_t cv = (uint32_t &)val;
    const uint32_t mask = uint32_t(-int32_t(cv>>31)) | 0x80000000;
    return cv^mask;
}

# Results

I analyzed the performance gain of the LUT optimization by searching 10 million times through an array of 1 billion 32-bit integers. The integers were uniformly distributed in the interval [0, 0xffffffff]. In each search a randomly determined key from the data set was used. The performance was measured on a laptop with a Core(TM) i7-3612QM CPU (Ivy Bridge) at 2.1 GHz with 6 MB L3 cache and 8 GB RAM. The results of the standard binary search and the LUT optimized binary search for different LUT sizes are depicted in the table below. The speed-up is relative to the C++ standard binary search function std::lower_bound(), not to my own implementation used for the range-reduced binary search.

| Algorithm          | LUT size        | Time needed | Searches/second | Speed-up |
|--------------------|-----------------|-------------|-----------------|----------|
| std::lower_bound() |                 | 9.373 secs  | 1.07 m          |          |
| LUT binary search  | 8-bit (2 KB)    | 8.592 secs  | 1.16 m          | 1.09x    |
| LUT binary search  | 16-bit (512 KB) | 3.873 secs  | 2.58 m          | 2.42x    |
| LUT binary search  | 24-bit (128 MB) | 1.99 secs   | 5.03 m          | 4.71x    |

The 16-bit LUT binary search is more than 2.5x, the 24-bit LUT binary search even almost 5x faster than C++'s standard binary search implementation. The amount of required additional code is negligible and the LUT memory size can be configured granularly by adjusting the number of bits used to index the LUT. Thanks to the O(N) complexity of the LUT population algorithm, the discussed optimization is even feasible in situations where the underlying value array changes from time to time. Especially in situations where parts of the data set are only resident on hard disk, the reduced number of array accesses saves a considerable amount of work. By using simple mappings the LUT optimization can be applied as well to signed integer and even floating point values. For a sample implementation check out the accompanying github repository.

# Introduction

This post documents a lot of nasty details I had to learn the hard way when I wrote a MySQL storage engine for the first time. The explanations are kept rather short and are intended for people writing a storage engine themselves. Most of this knowledge comes from reading the MySQL source code and online forums. I will add new points in the future.

## Lessons learned

### Return values of row retrieval functions

MySQL calls a couple of handler methods of the storage engine in order to retrieve rows from a table. The methods in question are rnd_next(), index_read(), index_read_last(), index_first(), index_last(), index_next(), index_prev() and index_read_idx(). All these methods return int as result. Obviously, 0 must be returned in case a row could be successfully read, and HA_ERR_END_OF_FILE if no matching row could be found. It’s less obvious that, additionally, the table->status attribute must be updated accordingly: to 0 on success and to STATUS_NOT_FOUND on failure. If you forget that you’ll encounter unexpected behavior.
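
The protocol can be sketched in a self-contained way. The table structure and the constant values below are mock stand-ins for MySQL’s actual definitions, used purely to illustrate the pattern:

```cpp
// Mock stand-ins for MySQL's definitions (values illustrative only)
static const int HA_ERR_END_OF_FILE = 137;
static const int STATUS_NOT_FOUND   = 1;

struct MockTable { int status; };

// The pattern every row retrieval method must follow: the return value
// AND table->status have to be set consistently.
static int FetchRow(MockTable *table, bool rowFound)
{
    if (!rowFound)
    {
        table->status = STATUS_NOT_FOUND; // easy to forget!
        return HA_ERR_END_OF_FILE;
    }

    table->status = 0;
    return 0; // row successfully read
}
```
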

### Using extended MySQL classes

In order to use extended MySQL classes (like e.g. the THD class) in a MySQL plugin, MYSQL_SERVER needs to be defined as 1 before including any MySQL header. A typical include block looks like this:

#define MYSQL_SERVER 1 // required for THD class
#include <sql_table.h>
#include <sql_class.h>
#include <probes_mysql.h>

### Row-level locking

For a long time, table-level locking was the standard means of concurrency control in DBMSs. However, modern storage engines tend to implement finer-grained row-level locking to speed up concurrent accesses to the same table. In MySQL, row-level locking can be implemented in the handler::store_lock() method by cleverly adjusting the locking flags. When a pending, concurrent write operation is signaled to the storage engine, the store_lock() method returns TL_WRITE_ALLOW_WRITE as long as the table isn’t locked and no table space operation is currently executed. In queries like INSERT INTO table0 SELECT ... FROM table1 ..., MySQL by default protects table1 from write operations by passing the TL_READ_NO_INSERT flag. However, this flag conflicts with the TL_WRITE_ALLOW_WRITE flag, blocking all write operations on table1. The conflict can be resolved by converting the lock to a normal read lock TL_READ.

THR_LOCK_DATA ** Handler::store_lock(THD *thd, THR_LOCK_DATA **to, enum thr_lock_type lockType)
{
    if (lockType != TL_IGNORE && lock.type == TL_UNLOCK)
    {
        // allow concurrent write operations on table
        if (lockType >= TL_WRITE_CONCURRENT_INSERT && lockType <= TL_WRITE)
        {
            if (!thd->in_lock_tables && !thd->tablespace_op)
                lockType = TL_WRITE_ALLOW_WRITE;
        }
        // allow write operations on tables currently read from
        else if (lockType == TL_READ_NO_INSERT && !thd->in_lock_tables)
            lockType = TL_READ;

        lock.type = lockType;
    }

    *to++ = &lock;
    return to;
}

### Index read modes

In the handler::index_read() method the next index value has to be read as indicated by the ha_rkey_function argument. As the MySQL flag constants are not very intuitive to grasp, the five different index read modes are explained with the help of an example. Let’s assume a table column contains the following set of values: 1, 1, 1, 5, 5, 5, 7, 7, 7. In the table below the results of all possible index reads are depicted. The result of each index read is marked bold.

| SQL    | Key | Flag                | Found                   |
|--------|-----|---------------------|-------------------------|
| a = 5  | 5   | HA_READ_KEY_EXACT   | 1 1 1 **5** 5 5 7 7 7   |
| a = 6  | 6   | HA_READ_KEY_EXACT   | not found               |
| a >= 5 | 5   | HA_READ_KEY_OR_NEXT | 1 1 1 **5** 5 5 7 7 7   |
| a >= 6 | 6   | HA_READ_KEY_OR_NEXT | 1 1 1 5 5 5 **7** 7 7   |
| a <= 5 | 5   | HA_READ_KEY_OR_PREV | 1 1 1 5 5 **5** 7 7 7   |
| a <= 6 | 6   | HA_READ_KEY_OR_PREV | 1 1 1 5 5 **5** 7 7 7   |
| a > 5  | 5   | HA_READ_AFTER_KEY   | 1 1 1 5 5 5 **7** 7 7   |
| a > 6  | 6   | HA_READ_AFTER_KEY   | 1 1 1 5 5 5 **7** 7 7   |
| a < 5  | 5   | HA_READ_BEFORE_KEY  | 1 1 **1** 5 5 5 7 7 7   |
| a < 6  | 6   | HA_READ_BEFORE_KEY  | 1 1 1 5 5 **5** 7 7 7   |
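
The semantics of the five modes can be reproduced with standard algorithms on a sorted array. The following is a self-contained illustration, not MySQL code; it returns the index of the found element, or -1 for “not found”:

```cpp
#include <algorithm>
#include <vector>

// Stand-ins for the five MySQL read modes, expressed via
// std::lower_bound (first element >= key) and std::upper_bound
// (first element > key) on a sorted vector.
enum ReadMode { KEY_EXACT, KEY_OR_NEXT, KEY_OR_PREV, AFTER_KEY, BEFORE_KEY };

static int IndexRead(const std::vector<int> &v, int key, ReadMode mode)
{
    auto lo = std::lower_bound(v.begin(), v.end(), key);
    auto hi = std::upper_bound(v.begin(), v.end(), key);

    switch (mode)
    {
    case KEY_EXACT:   return (lo != v.end() && *lo == key) ? int(lo-v.begin()) : -1;
    case KEY_OR_NEXT: return (lo != v.end())   ? int(lo-v.begin())   : -1;
    case KEY_OR_PREV: return (hi != v.begin()) ? int(hi-v.begin())-1 : -1;
    case AFTER_KEY:   return (hi != v.end())   ? int(hi-v.begin())   : -1;
    case BEFORE_KEY:  return (lo != v.begin()) ? int(lo-v.begin())-1 : -1;
    }
    return -1;
}
```
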

### Single vs. multi-statement transactions

MySQL distinguishes between single and multi-statement transactions. Single statement transactions are transactions that are executed in auto-commit mode. In auto-commit mode MySQL performs a commit operation after each statement. Multi-statement transactions consist of a BEGIN TRANSACTION and a COMMIT command. All statements that are executed in between those two commands are not auto-committed. They are only committed when COMMIT is invoked. When a new transaction is started via the trans_register_ha() method, the second argument indicates if the transaction to be started is a single or multi-statement transaction. A simple way to figure this out is to use the THD::in_multi_stmt_transaction_mode() method. Hence, a new transaction is properly started by calling:

trans_register_ha(thd, thd->in_multi_stmt_transaction_mode(), ht);

### Detecting the start/end of a statement/transaction

The MySQL source code is already relatively old, which is also reflected by the storage engine API. Especially the way storage engines detect the start and end of a statement or a transaction is rather obscure, and a lot of people ask how to do it.
There are two methods that are usually used to detect whether a statement or a transaction has started: handler::external_lock() and handler::start_stmt(). MySQL calls external_lock() once at the beginning of a statement for each table used in that statement. In case a table was locked previously by a LOCK TABLES command, external_lock() won’t be called anymore. In that case the start_stmt() method is invoked once per table instance during LOCK TABLES. If auto-committing is enabled (see the previous point) each statement must be committed individually. Otherwise, all changes performed by the statements of one transaction must be committed at once. To determine which commit mode is active, the in_multi_stmt_transaction_mode() method is used.

int Handler::external_lock(THD *thd, int lockType)
{
ha_statistic_increment(&SSV::ha_external_lock_count);

if (lockType == F_UNLCK)
{
auto *tsx = (Transaction *)thd_get_ha_data(thd, ht);
tsx->NumUsedTables--;

if (!tsx->NumUsedTables)
{
// end of statement

if (!thd->in_multi_stmt_transaction_mode())
{
// auto-commit statement
tsx->Commit();
thd_set_ha_data(thd, ht, nullptr);
}
}
}
else
{
auto *tsx = (Transaction *)thd_get_ha_data(thd, ht);
if (!tsx)
{
// beginning of new transaction
tsx = TsxMgr->NextTsx();
thd_set_ha_data(thd, ht, tsx);
trans_register_ha(thd, thd->in_multi_stmt_transaction_mode(), ht);
}

if (!tsx->NumUsedTables)
{
// beginning of new statement
}

tsx->NumUsedTables++;
}

return 0;
}

// handlerton callback function
int Commit(handlerton *hton, THD *thd, bool all)
{
if (thd->in_multi_stmt_transaction_mode() || all)
{
// end of transaction =>
// commit transaction (non auto-commit)
status_var_increment(thd->status_var.ha_commit_count);
auto *tsx = (Transaction *)thd_get_ha_data(thd, hton);
tsx->Commit();
thd_set_ha_data(thd, hton, nullptr);
return 0;
}
else
{
// end of statement
}

return 0;
}

### Make sure to implement the records-in-range approximation

The method records_in_range() of the handler class is supposed to return an approximation of the number of rows between two keys in an index. It’s used by the query optimizer to determine which table index to use for a query (e.g. for a JOIN). It’s easy to overlook this hugely important function, because there’s a default implementation, which makes implementing it seem unnecessary. Unfortunately, the default implementation simply returns 10, effectively rendering any index selection pointless.
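
What the approximation should deliver is easy to see for a sorted in-memory index. The sketch below is self-contained and not the MySQL API: a real records_in_range() implementation decodes MySQL’s key_range arguments and may return a rough estimate instead of an exact count.

```cpp
#include <algorithm>
#include <vector>

// Number of index entries with key in [minKey, maxKey] for a sorted
// in-memory index of integer keys.
static size_t RecordsInRange(const std::vector<int> &index, int minKey, int maxKey)
{
    auto lo = std::lower_bound(index.begin(), index.end(), minKey);
    auto hi = std::upper_bound(index.begin(), index.end(), maxKey);
    return hi > lo ? size_t(hi - lo) : 0;
}
```
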

### Supporting the keyread optimization

Some indexes can fully reconstruct the column data from an index key. Hence, query statements that only refer to the indexed columns don’t need to read the corresponding table rows from disk. This is known as the keyread optimization. To find out whether the whole row needs to be read or if all required row columns can be fully reconstructed from the index key, the HA_EXTRA_KEYREAD and HA_EXTRA_NO_KEYREAD flags passed to the handler::extra() method can be checked. Additionally, the storage engine must signal to MySQL that it supports the keyread optimization by returning the HA_KEYREAD_ONLY flag from the handler::index_flags() method.

### Mapping row buffers to index keys

When a new row is inserted into a table or when an existing row is replaced, the table indexes must be updated accordingly. The new index keys are derived from the new row data. For efficiency reasons only those parts of the row that correspond to table columns indexed by the current index should end up in the key. The key format can be chosen arbitrarily. However, MySQL provides the key_copy() function, which comes in very handy when creating index keys. The following method KeyFromRowBuffer() maps a row buffer to a key in MySQL’s default key format. The advantage of using MySQL’s default key format is that the key compare function can be implemented in merely a few lines of code (see the next point).

size_t Handler::KeyFromRowBuffer(uint8_t *rowData, size_t keyIndex, uint8_t *keyData) const
{
auto &ki = table->key_info[keyIndex];
key_copy(keyData, rowData, &ki, 0);
return table->key_info[keyIndex].key_length;
}

### Comparing index keys

Every database index needs to compare index keys to store its data in sorted order and to retrieve data from the index. Usually, for this purpose a key compare function is passed to the index data structure. The key compare function has three arguments: the two keys to be compared key0 and key1 and a user defined parameter param. The user defined parameter can be used to pass in information about the underlying key structure from MySQL. The key compare function returns 0 if both keys are equal, -1 if key0 < key1 and 1 if key0 > key1. If you stick to the way of creating index keys described in the previous point, you can use the following generic key compare function.

static int CmpKeys(uint8_t *key0, uint8_t *key1, const void *param)
{
const KEY *key = (KEY *)param; // obtain pointer to KEY structure
int res = 0;

for (size_t i=0; i<key->user_defined_key_parts && !res; i++)
{
const auto &keyPart = key->key_part[i];
const int off = (keyPart.null_bit ? 1 : 0); // to step over null-byte

if (keyPart.null_bit) // does the key part have a null-byte?
if (*key0 != *key1)
return (int)((*key1)-(*key0));

res = keyPart.field->key_cmp(key0+off, key1+off); // compare key parts
key0 += keyPart.store_length; // go to next key part
key1 += keyPart.store_length;
}

return res;
}

### Testing keys for uniqueness on row insert/update

MySQL distinguishes between unique and non-unique indexes. In unique indexes each table row can be uniquely identified by exactly one index key. A unique index can be created by invoking CREATE UNIQUE INDEX index ON table (column0, column1, ...) or directly when creating the table. A primary key is always implicitly unique because it’s used together with foreign keys to establish links between different tables.
When new rows are inserted into a table via write_row() or when existing rows are updated via update_row(), the storage engine must guarantee that the unique key constraint isn’t violated for any unique index. This can be achieved by testing if the key to be inserted already exists in any of the unique indexes. An index is a unique index if the HA_NOSAME bit in the index flags is set or if the index is the primary key index. The index as such and the row-to-key mapping are implementation specific. For one simple way to map row buffers to keys see the point on mapping row buffers to index keys above.

bool Handler::IsKeyUnique(uint8_t *rowData)
{
    // optimizes inserts in some cases
    if (!thd_test_options(ha_thd(), OPTION_RELAXED_UNIQUE_CHECKS))
    {
        for (size_t i=0; i<table->s->keys; i++)
        {
            // is it a unique key?
            if ((table->key_info[i].flags&HA_NOSAME) || i == table->s->primary_key)
            {
                // create the key and check if it already exists
                uint8_t keyData[MAX_KEY_LENGTH];
                KeyFromRowBuffer(rowData, i, keyData);

                if (Index[i]->ExistsKey(keyData))
                    return false;
            }
        }
    }

    return true;
}

int Handler::write_row(uint8_t *rowData) // same for update_row()
{
if (!IsKeyUnique(rowData))
return HA_ERR_FOUND_DUP_KEY;

// insert row into table and into index
return 0;
}

# Introduction

In Enigma Studio we have a number of operators. Each operator performs a certain operation on a data structure (e.g. mesh, bitmap, …) based on a set of parameters. The parameters can be of different types; e.g. 1-, 2-, 3- and 4-dimensional int or float vector, flag, enumeration or string. When an operator is executed to perform its operation, the operator's execute handler gets called. In the implementation of the execute handler there has to be a way for the programmer to access the operator's set of parameters. Coming up with a programmer-friendly way to access the parameters is not that straightforward, because all obvious approaches require writing code where the individual parameters have to be accessed explicitly by calling a getter function or by accessing an array. Look at the following code for an illustration. Note that all code presented in the following is purely for educational purposes. It is by no means intended to be a feature-complete operator stacking system.

enum OpParamType
{
OPT_INT,
OPT_FLT,
OPT_STR
};

struct OpParam
{
OpParamType type;
int         chans;

union
{
float   flts[4];
int     ints[4];
};

std::string str; // can't be in union
};

struct Op
{
void execute()
{
execFunc(this);
}

void (* execFunc)(Op *op); // pointer to execute handler
OpParam params[16];
size_t  numParams;
};

// execute handler
// param[0]: width
// param[1]: height
// param[2]: UV coordinates
void OpExec(Op *op)
{
int size = op->params[0].ints[0]*op->params[1].ints[0];
Vector2 uv(op->params[2].flts[0], op->params[2].flts[1]);
}

This approach has some issues. First of all, it is very error-prone because a parameter is solely identified by its index (we could use strings, but it doesn't make it much better). Furthermore, whenever the order of parameters or the type of a parameter changes, each access to the params array of the respective operator has to be changed as well. This sucks and this is something no programmer wants to do. Finally, the syntax used above is not very readable; especially when user-defined data types like Vector2 are used. Let's do something about it!

# Concept

In the following I'll present the approach I've come up with. It's based on directly passing the parameter data to the execute handler. This way the only thing that has to be done is to declare/define the execute handler according to its parameter definition. Meaning that the order and the types of the arguments have to match the definition of the operator's set of parameters. The execute handler from above would then simply look like:

void OpExec(Op *op, int width, int height, const Vector2 &uv)
{
int size = width*height;
}

This is exactly what you are used to when programming. The code is not cluttered with accesses to the params array anymore, nor is it as error-prone. Parameters are identified by their names and there is only one source of error (the function's declaration/definition) when it comes to the parameters' types and order.

# Implementation

In order to implement such a system we have to go low-level. This means crafting handwritten assembly code to implement the underlying calling convention's argument passing and function calling ourselves. Consequently, you need profound knowledge of C++, x86/x64 assembly and calling conventions to understand what follows. Furthermore, the following code is neither platform- nor assembler-independent, because operating systems differ in how they implement certain calling conventions and assemblers differ in syntax. I used Windows 8, Visual C++ 2012 and MASM.

## x86 Implementation

On x86 I use Visual C++'s standard calling convention CDECL. CDECL as well as STDCALL (which differ only in who, the caller or the callee, cleans the stack) are very easy to implement by hand, because all arguments are passed on the stack. To do so, all parameters that should be passed to an execute handler are copied into an array first. This array is passed to the callFunc() function, which creates a stack frame, copies the data to the stack and calls the execute handler. The following code is used to copy each parameter to the array. To account for call-by-value and call-by-reference arguments, either the data's address or the data itself is copied, depending on the parameter type.

// implemented in an .asm file
extern "C" void callFunc(const void *funcPtr, const size_t *stack, size_t numArgs);

class Op
{
public:
Op(const void *execFunc, const OpParam *params, size_t numParams) :
m_execFunc(execFunc),
m_numParams(numParams)
{
// copy over parameters
std::copy(params, params+numParams, m_params);
}

void execute()
{
size_t args[16+1] = {(size_t)this}; // pass pointer to operator

for (uint32_t i=0; i<m_numParams; i++)
{
const OpParam &p = m_params[i];
switch (p.type)
{
case OPT_STR:
args[i+1] = (size_t)&p.str;
break;
default: // float and integer types (it's a union!)
args[i+1] = (p.chans > 1 ? (size_t)p.ints : (size_t)p.ints[0]);
break;
}
}

callFunc(m_execFunc, args, m_numParams+1);
}

private:
const void * m_execFunc;
size_t       m_numParams;
OpParam      m_params[16];
};

The implementation of callFunc() is given below. I used the MASM assembler because it's shipped with Visual C++. I don't use inline assembly, because Visual C++ doesn't support x64 inline assembly and I wanted to have both callFunc() implementations in one file.

.686
.model flat, c             ; calling convention = cdecl
.code
callFunc proc funcPtr:ptr, stack:ptr, numArgs:dword
push    esi            ; save callee saved regs
push    edi
mov     eax, [funcPtr] ; store arguments on stack before new
mov     ecx, [numArgs] ; stack frame is installed (they
mov     esi, [stack]   ; won't be accessible anymore)
push    ebp
mov     ebp, esp
sub     esp, ecx       ; reserve stack space for parameters
sub     esp, ecx       ; (4 * number of arguments)
sub     esp, ecx
sub     esp, ecx
mov     edi, esp       ; copy parameters on stack
rep     movsd
call    eax            ; call execute handler
mov     esp, ebp       ; remove stack frame
pop     ebp
pop     edi            ; restore callee saved regs
pop     esi
ret
callFunc endp
end

## x64 Implementation

The x64 code is slightly more complex. The reason is that on x64 there is only one calling convention, called FASTCALL. In this calling convention not all arguments just go onto the stack; some of them have to be passed in registers. Furthermore, there are some differences in how call-by-value works for structs. Structures whose size does not exceed 64 bits are passed in registers or on the stack. For bigger structures a pointer is passed. If such a structure is passed by value, its data is first copied to a temporary location and the passed pointer points there. As I only needed call-by-reference for user-defined data types bigger than 64 bits, I didn't have to care about this. Furthermore, the stack pointer RSP has to be 16-byte aligned when calling the execute handler. You might be wondering why 16 and not 8 bytes. The reason is that this simplifies the alignment of SSE data types for the compiler (sizeof(__m128) = 16 bytes). Describing the calling convention in more detail is beyond the scope of this post, but there are plenty of good articles online.
As mentioned before, Visual C++ does not support x64 inline assembly. Therefore, I had to go for an extra .asm file which gets assembled by MASM64. The steps undertaken by the callFunc() implementation are:

1. Allocate stack space for the parameters that won't be passed in registers.
2. Align the stack pointer RSP on 16 byte boundary.
3. Copy the 5th, 6th, … arguments to the stack.
4. Copy the first four arguments to the registers RCX, RDX, R8 and R9.
5. Call the execute handler.

In the following the source code of the x64 callFunc() function is depicted. You should note that MASM64 does not support accessing the arguments declared in the proc statement by name. This is why I access the arguments directly via the registers they are passed in. To simplify the manual register allocation I copy them to memory at the beginning.

.data
numArgs dq 0
stack   dq 0
funcPtr dq 0

.code
callFunc proc funcPtrP:ptr, stackP:ptr, numArgsP:dword
push    rdi             ; save callee saved regs
push    rsi
push    rbp
mov     rbp, rsp        ; remember stack pointer for the epilogue

mov     [funcPtr], rcx  ; simplifies register allocation
mov     [stack], rdx    ; don't use passed variable names!
mov     [numArgs], r8   ; MASM doesn't support this for x64

; ----- allocate stack space and copy parameters -----

mov     rcx, [numArgs]  ; number of passed arguments
sub     rcx, 4          ; 4 of them will be passed in regs
cmp     rcx, 0          ; some left for the stack?
jng     noParamsOnStack
lea     r10, [rcx*8]    ; calculate required stack space
sub     rsp, r10        ; reserve stack space

mov     r11, rsp        ; align stack pointer to 16 bytes
and     r11, 15         ; mod 16
jz      dontAlign       ; is already 16 bytes aligned?
sub     rsp, r11        ; perform RSP alignment
dontAlign:

mov     rdi, rsp        ; copy parameters to stack
mov     rsi, [stack]
add     rsi, 4*8        ; first 4 parameters are in regs
rep     movsq

noParamsOnStack:

; ----- store first 4 arguments in RCX, RDX, R8, R9 -----

mov     rsi, [stack]
mov     r10, [numArgs]  ; switch (numArgs)
cmp     r10, 4
jge     fourParams
cmp     r10, 3
je      threeParams
cmp     r10, 2
je      twoParams
cmp     r10, 1
je      oneParam
jmp     noParams

fourParams:                 ; fall through used
mov     r9, [rsi+24]
movss   xmm3, dword ptr [rsi+24]
threeParams:
mov     r8, [rsi+16]
movss   xmm2, dword ptr [rsi+16]
twoParams:
mov     rdx, [rsi+8]
movss   xmm1, dword ptr [rsi+8]
oneParam:
mov     rcx, [rsi]
movss   xmm0, dword ptr [rsi]
noParams:

; ----- call execute handler for operator -----

sub     rsp, 20h        ; reserve 32 byte home space
call    [funcPtr]       ; call function pointer

; ----- restore stack pointer and non-volatile registers -----

mov     rsp, rbp        ; undo stack allocation and alignment
pop     rbp
pop     rsi
pop     rdi
ret
callFunc endp
end

# Conclusion

The presented solution for implementing execute handlers works very well for me. Nevertheless, it is important to get the handler definitions/declarations right in order to avoid annoying crashes, as there are no validity checks at compile-time. Apart from the argument order, there's another pitfall: parameters cannot be passed by value (at least not on x86; on x64, call-by-value silently turns into call-by-reference because the parameter data is not copied to a temporary location, which is not much better). You can download the full source code from my github repository.

## A pitfall with initialization lists in C++

Welcome to this quick post on constructor initialization lists in C++. If you are not familiar with initialization lists, here is a very short introduction. When a class is instantiated in C++, first all base classes and then all class attributes are constructed. Initialization lists are used to control that process: in an initialization list the constructors of the base classes and the class attributes can be called explicitly. Otherwise, they are initialized by calling their default constructors. For efficiency reasons it is important to use initialization lists, because all member initialization takes place before the body of the constructor is entered. So much for the basics of initialization lists. Now a little riddle. The following piece of code will crash. Can you find the problem? Give it a try yourself before you continue reading.

struct Foo
{
Foo(int newId) : id(newId) {}
int id;
};

struct Bar
{
Bar(const Foo &newFoo) : foo(newFoo), fooId(foo.id) {}

const int   fooId;
const Foo & foo;
};

int main(int argc, char **argv)
{
Foo foo(303);
Bar bar(foo);
return 0;
}


If you don’t know the nasty details of how class attributes are initialized via initialization lists, you most probably couldn’t figure out what causes the crash in those few lines of code. The issue with the code above is that the order of attribute initialization is not determined by the order of appearance in the initialization list, but by the order of declaration within the class. Note that foo appears in the initialization list before fooId, but fooId is declared before foo. Hence, not foo but fooId is initialized first, accessing the still uninitialized attribute foo.

So, why is that, you might ask? This behavior, strange at first glance, actually makes a lot of sense. When an object is destroyed, its destructor calls the destructors of all class attributes in the reverse order of their initialization. As there can be more than one constructor, potentially with different orders of attribute initialization, the order of destruction wouldn’t be well-defined. To resolve this ambiguity, simply the order of declaration is used.
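
The rule is easy to verify with a small self-contained example that records the construction order:

```cpp
#include <vector>

// Each Tracker logs its id when constructed, so the construction order
// of Widget's members becomes observable.
static std::vector<int> order;

struct Tracker
{
    explicit Tracker(int id) { order.push_back(id); }
};

struct Widget
{
    // the initialization list mentions b first, but a is declared first,
    // so a is constructed first regardless
    Widget() : b(2), a(1) {}
    Tracker a;
    Tracker b;
};
```

Constructing a Widget pushes 1 (for a) before 2 (for b), despite b appearing first in the initialization list.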

# Beginnings

A few weeks after Revision 2012 payne came up with the idea of making an intro zooming from microcosm all the way out to macrocosm. Inspiration came from the famous Powers of ten video (1977) and the opening scene of the movie Contact (1997). It did not take long to convince the other team members, namely hunta, pap, strepto and skyrunner, of this idea and to start working on the intro.

All content creation, animation, sequencing, and post-processing was done solely with our demo-tool Enigma Studio 4 (depicted in the figure below). You can download the tool, the project files and the full source code from our website. The current version counts around 100.000 lines of code, is written in C++ and uses Direct3D 11 and Qt 5. We have actively developed Enigma Studio since 2005. Thus, during the last year we could mainly concentrate on creating the intro content, because most of the user-interface and fundamental engine code had already been written. Still, we had to face a lot of new challenges, such as memory management, size-optimizing shaders, level-of-detail (LOD) techniques, L-systems for plants and animals, realistic skin generation for shells and finally the planet rendering. Besides the graphics, payne once more rewrote our software synthesizer Tunefish, but more on that later.

The next step was to distribute the work to be done. Skyrunner is the main musician of our team. Hence, he was in charge of doing the soundtrack. Pap undertook the job of bringing life to the underwater scene. This task meant rewriting the L-system code of Transplant so that even animals (nautilus and shells) could be generated. Additionally, a swarm behavior simulation and a skin texture generator had to be implemented. Strepto decided to work on the terrain and the planet rendering on the basis of ray-marching. Payne was responsible for the intro content and all special effects shaders. Last but not least, hunta cared about all the remaining engine and tool code, as well as porting the texture generator from the CPU to the GPU on the basis of compute shaders.

# One year of work

Payne began working on the scenes taking place in microcosm: atoms, molecules, DNA and chromosomes. For that purpose hunta implemented a dynamic LOD system, which allowed showing a lot of geometry distributed over a wide depth range. Payne implemented a microscope shader, which gave all the beginning scenes a bluish look. After a while we realized that this look was too boring to be used for all beginning scenes. Therefore, payne created a different style for the first two scenes (atom and molecules). Some screenshots of the first attempts are depicted below.


The more payne progressed with the beginning scenes, the more the additional effects for the underwater and the planet scenes were needed. Pap progressed fast with his code. Early on it was already possible to generate undersea animals with very realistic skin textures and to simulate swarm behavior for the fish. The skin texture generation was based on the book The Algorithmic Beauty of Sea Shells. As pap knew his code and operators best, he undertook the creation of the underwater scene geometry: placing plants, shells and nautilus, as well as simulating the fish swarms. All post-processing was added by payne. The caustics were faked by simply projecting an animated sum of two sine waves onto the ground and the water surface. The effect of the sun shining diffusely through the water was faked with a halo post-processing shader, which added a shine over the final image. No real light source was used here. The same code was later used to create the sun glare in the space scenes. Below you can see some work-in-progress screenshots of the L-system and the underwater scene.


We knew early on that we would need every byte we could get to fit the whole code and data into 64 KB. Thus, it was clear that we also needed to come up with new ideas of how to synthesize audio.
Our old synthesizer Tunefish 3 was still relatively new and produced very satisfying sound. However, its range of different oscillators and the iFFT-based pad synth were simply too big. As throwing some of them out would have considerably reduced the final audio quality, payne sought a way to reduce the code size without sacrificing audio quality. What if we found a way to produce all these nice sounds with just one type of oscillator instead of many? Classic oscillators can produce the cool analog sounds we wanted to have. The iFFT-based pad synth algorithm can produce fat and rich sounds we did not want to miss either. However, it cannot produce the simple analog ones. Furthermore, it is not a real-time algorithm, since it requires performing an iFFT on a table of 65536 frequencies. This was just not suitable for every kind of sound we needed, especially not for percussion. Consequently, the idea for the new generator of Tunefish 4 was born. What if we used an iFFT size that is

1. small enough to be evaluated in real-time,
2. still produces fat sounds, and
3. can also produce the simple square, sine and saw-tooth sounds that we all love?
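
The principle can be sketched roughly like this in C++. For brevity this uses a naive inverse DFT instead of a real iFFT, and a plain saw-tooth spectrum — both are simplifications for this illustration, not the actual Tunefish 4 code:

```cpp
#include <cmath>
#include <vector>

// Illustrative sketch: build a band-limited saw-tooth wavetable from a small
// spectrum via a naive inverse DFT. A real implementation would use an iFFT,
// but the idea is the same: a small table is cheap enough to (re)generate in
// real-time, yet an arbitrary spectrum still allows fat, pad-like sounds as
// well as the classic analog waveforms.
std::vector<float> sawWavetable(int size, int harmonics)
{
    const float pi = 3.14159265358979f;
    std::vector<float> table(size, 0.0f);
    for (int n = 0; n < size; ++n)
        for (int k = 1; k <= harmonics; ++k)
            // Saw-tooth spectrum: k-th harmonic with amplitude 1/k.
            table[n] += std::sin(2.0f * pi * k * n / size) / k;
    return table;
}
```

Swapping in a different amplitude-per-harmonic function yields square- or pad-like spectra from the same generator — which is exactly why a single oscillator type could replace several.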

After some days of experimenting it became clear that this was possible. The first attempts sounded more like alien noises than anything we could possibly use to produce music, but it was only a matter of time until we got used to the new generator. What you hear in Turtles all the way down is by no means all that Tunefish 4 is capable of; there is still a lot of potential left in this approach. As the full source code is now public, everybody can try out the VSTi and play with the knobs. Below is a screenshot of the Tunefish 4 VSTi.

Let’s go back to the graphics. The terrain and the planet were both rendered via ray-marching, without using a single triangle. The difficulty with this approach was that we needed to change from a planar noise function for the terrain surface to a spherical one for the planet. Noise derivatives were used to slightly modify the traditional fractional Brownian motion (fBm) construction in order to generate a realistic surface. For performance reasons it was essential to ray-march as few steps as possible, which is why we used sphere tracing to step along the ray, starting only from the top of the atmosphere. Shadows, water reflections and water refractions were ray-marched using the same algorithm. The planet shader was combined with atmospheric scattering and clouds to increase the realism. Finally, a flock of birds and a space station, rendered using traditional rasterization, were combined with the planet by performing a per-pixel depth test. For that purpose the camera used in the terrain/planet shader had to match the camera used for rasterization exactly. Big thanks to IQ for his articles on noise derivatives, terrain marching and the Elevated making-of. Below are some work-in-progress screenshots.
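
The core of sphere tracing fits in a few lines. The following C++ sketch uses a simple sphere as a stand-in distance field (the intro's terrain and planet distance estimates were of course far more involved); names and constants are illustrative only:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 mul(Vec3 v, float s) { return {v.x * s, v.y * s, v.z * s}; }

// Stand-in distance estimate: signed distance to a unit sphere at the origin.
float sceneDistance(Vec3 p)
{
    return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z) - 1.0f;
}

// Minimal sphere-tracing loop: step along the ray by the distance-field
// value, which is always a safe step because no surface is closer than that.
float sphereTrace(Vec3 origin, Vec3 dir, float maxDist)
{
    float t = 0.0f;
    for (int i = 0; i < 128 && t < maxDist; ++i) {
        float d = sceneDistance(add(origin, mul(dir, t)));
        if (d < 1e-3f)
            return t;   // hit: distance along the ray
        t += d;
    }
    return -1.0f;       // miss
}
```

Because the step size adapts to the distance field, rays skip quickly through empty space — which is what made starting the march at the top of the atmosphere affordable.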

After the terrain and the planet scenes, the outer space part begins. As we were not yet sure about the ending of the intro, payne started freely creating new artwork. There were plenty of ideas for the end of the intro:

• simply looping the intro,
• ending with a galaxy cluster, morphing into the shape of the Brain Control logo,
• letting the camera fly out of a computer screen.

In the end, mainly due to size problems, we opted for a simple fade-to-black after the credits. To cope with the large amount of geometry, especially for the asteroids, hunta’s LOD system was used again. Below are screenshots of some space scenes that didn’t make it into the final intro.

When payne began working on the space scenes we started running into some bigger issues with the engine sub-systems. First of all, we exhausted the available virtual memory, and sometimes even the video memory, more and more often. The problem was that, while working on the intro, all scenes and all intermediate operator results got loaded into memory over time. Therefore, hunta implemented an operator memory manager. It assigned a budget size to every operator class that consumed a significant amount of memory (currently texture, mesh and scene-graph operators). It automatically freed the memory of operators that were not part of any active operator stack whenever their class’s budget size was exceeded. The operators to be freed were chosen in a least-recently-used (LRU) fashion.
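
The eviction scheme can be sketched as follows. This is a simplified, hypothetical C++ illustration of one per-class budget (the real manager additionally exempted operators on the active stack, which is omitted here); all names are invented for this example:

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

// One budget bucket, e.g. for all texture operators: cached operator results
// are tracked in LRU order; when the budget is exceeded, the least-recently-
// used entries are freed first.
class OperatorBudget
{
public:
    explicit OperatorBudget(std::size_t budgetBytes) : budget(budgetBytes) {}

    // Record that an operator's cached result of `bytes` was (re)computed
    // or reused; it becomes the most recently used entry.
    void touch(int opId, std::size_t bytes)
    {
        auto it = index.find(opId);
        if (it != index.end()) {
            used -= it->second->second;
            lru.erase(it->second);
        }
        lru.push_front({opId, bytes});
        index[opId] = lru.begin();
        used += bytes;
        evict();
    }

    bool isCached(int opId) const { return index.count(opId) != 0; }
    std::size_t usedBytes() const { return used; }

private:
    void evict()
    {
        // Free LRU entries until we are back within budget.
        while (used > budget && !lru.empty()) {
            used -= lru.back().second;
            index.erase(lru.back().first);
            lru.pop_back();
        }
    }

    std::size_t budget;
    std::size_t used = 0;
    std::list<std::pair<int, std::size_t>> lru;   // front = most recent
    std::unordered_map<int,
        std::list<std::pair<int, std::size_t>>::iterator> index;
};
```

The `std::list` plus iterator map gives O(1) touch and eviction, which matters when every stack update touches many operators.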

The second problem was massive turn-around times between successive stack updates when working on large (e.g. 4096×4096) textures. For such texture dimensions the CPU texture generator, especially the Perlin-noise implementation, was too slow. Thus, hunta finally did what he had wanted to do for a long time: he ported all texture operators to the GPU as compute shaders. This boosted the performance dramatically and consequently shortened the turn-around times. As a result we obtained much nicer nebula textures.

Around two months before Revision 2013, nero/Still began to help us. He offered to do what he called “color consulting”, though it turned out to be much more than that. Besides creating plenty of overpaints that gave us new ideas of how to tweak colors and light to increase the overall scene quality, he provided tons of valuable feedback and tips on how to improve the flow and the style of the intro. Below you can see some of the original scenes and the corrections nero proposed as overpaints.

# At the party place

Usually, we hold a one-week coding meeting directly before a demo party if we plan to release something there. These meetings are of great value, as they allow us to fully concentrate on the production. This year we did it again and progressed so well that we believed we would have plenty of time for boozing at Revision. It turned out we were wrong. After arriving at the party place we immediately tested our intro on the compo machine. Unfortunately, there was still a hell of a lot of work to do if we wanted to make the deadline. We faced horrible stuttering issues in the underwater scene, discontinuities in the terrain scene and geometry-popping issues in the chromosome, DNA and molecule scenes. After three days of non-stop coding we luckily managed to fix most of the issues. Unfortunately, we were not able to fix the stuttering, because this problem only appeared on the compo machine. After some blind attempts to fix it we simply ran out of time.

While hunta worked on fixing the issues mentioned above, pap and payne continued crunching even more content into the 64 KB. Pap worked hard on further size-optimizing the intro, this time focusing on the shaders. The first thing we had done after Revision 2012 was changing from binary shaders to text shaders, size-optimized by Shader Minifier. This approach saved us around 4 KB of compressed size in the player executable, but pap believed that more could be done. Therefore, he crafted a custom shader cruncher that applied a couple more optimizations than Shader Minifier: for example, it replaced all HLSL keywords with short macros and added a #define for each replacement to the shader, and it minified the variables declared in constant buffers. Our own shader cruncher saved another 2–3 KB just two days before the deadline, right at the party place. This allowed us to add the space station and the space ships. Overall, 58 shaders were used in the intro, with a total uncompressed size of 170 KB (comments included). Below are some work-in-progress screenshots of the space station.
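
The keyword-to-macro pass can be illustrated with a toy C++ snippet. This is not the original tool: a real cruncher has to tokenize the source so it does not rewrite substrings inside identifiers, which this naive find-and-replace skips for brevity:

```cpp
#include <string>

// Toy sketch of one shader-cruncher pass: replace every occurrence of an
// HLSL keyword with a short macro and prepend the matching #define. The
// shader text gets shorter and more repetitive, which also compresses better.
std::string crunchKeyword(std::string shader,
                          const std::string& keyword,
                          const std::string& macro)
{
    std::string::size_type pos = 0;
    while ((pos = shader.find(keyword, pos)) != std::string::npos) {
        shader.replace(pos, keyword.size(), macro);
        pos += macro.size();
    }
    return "#define " + macro + " " + keyword + "\n" + shader;
}
```

Running such a pass over all frequent keywords (`float4`, `return`, constant-buffer variable names, and so on) is where the extra 2–3 KB of compressed size came from.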

All in all we are very happy about the first place at Revision 2013. For us, though, it was even more important that Brain Control as a demo group took the next step when it comes to teamwork, technology, design and direction. We are already looking forward to the next 64k-intro competition and hope to see you at the next demoparty!