The Infinite Loop

Tales from a lean programmer.


Leave a comment

Slides for “Beyond the Limits – The Usage of C++ in the Demoscene”

At the C++ User Group Berlin meeting this May, Eivind Liland and I gave a talk about the usage of C++ in the demoscene. Below is the abstract of the talk. The slides can be found in the downloads section.

Eivind and David are demosceners since decades. In this talk they’re going to show you two of their award-winning, real-time graphics demos, both highly optimized for different limitations and platforms, and both written in C++.

Turtles all the Way Down by Brain Control (2013) is a 64k-intro for the PC. It’s an almost 5 minutes long audio-visual journey using cutting edge algorithms in the areas of computer graphics, generative art and music synthesis. Being a 64k-intro, all textures, 3D objects and music fit into a single executable of merely 65.536 bytes.
Matt Current by Shitfaced Clowns (2007) is a demo for the Gameboy Advance. It features at that time never-seen-before graphics effects and a software-rendered 3d engine that pushes the device’s hardware to their limits. One prevailing opinion is that only by coding in 100% assembly one can push such platforms beyond their limits. Eivind will explain how they used C++ to carefully squeeze the maximum out of every cycle of the GBA’s 16 MHz CPU.

Though seemingly esoteric, all the techniques employed to realize these demos have their application in professional software development nowadays. In times of GHz multi-core processors, GPUs and terabyte hard-drives, performance critical code and compact code for embedded and mobile platforms still plays an important role. Eivind and David are going to guide you through the process of creating these graphics demos. They talk about the used algorithms and tools keeping the focus of how C++ was used to do the job.


1 Comment

Operator systems: accessing parameters in execute handlers

Introduction

In Enigma Studio we have a number of operators. Each operator performs a certain operation on a data structure (e.g. mesh, bitmap, …) based on a set of parameters. The parameters can be of different type; e.g. 1-, 2-, 3- and 4-dimensional int or float vector, flag, enumeration or string. When an operator is executed to perform its operation the operator's execute handler gets called. In the implementation of the execute handler there has to be a way for the programmer to access the operator's set of parameters. Coming up with a programmer-friendly way to access the parameters is not that straightforward, because all obvious approaches require writing code where the individual parameters have to be accessed explicitly by calling a getter function or by accessing an array. Look at the following code for an illustration. Note, that all code presented in the following is purely for educational purpose. It is by no mean intended to be a feature complete operator stacking system.

enum OpParamType
{
    OPT_INT,
    OPT_FLT,
    OPT_STR
};

struct OpParam
{
    OpParamType type;
    int         chans;

    union
    {
        float   flts[4];
        int     ints[4];
    };

    std::string str; // can't be in union
};

struct Op
{
    void execute()
    {
        execFunc(this);
    }

    void (* execFunc)(Op *op); // pointer to execute handler
    OpParam params[16];
    size_t  numParams;
};

// execute handler
// param[0]: width
// param[1]: height
// param[2]: UV coordinates
void OpExec(Op *op)
{
    int size = op->params[0].ints[0]*op->params[1].ints[0];
    Vector2 uv(op->params[2].flts[0], op->params[2].flts[1]);
}

This approach has some issues. First of all, it is very error-prone because a parameter is solely identified by its index (we could use strings, but it doesn't make it much better). Furthermore, whenever the order of parameters or the type of a parameter changes, each access to the params array of the respective operator has to be changed as well. This sucks and this is something no programmer wants to do. Finally, the syntax used above is not very readable; especially when user-defined data types like Vector2 are used. Let's do something about it!

Concept

In the following I'll present the approach I've come up with. It's based on directly passing the parameter data to the execute handler. This way the only thing that has to be done is to declare/define the execute handler accordingly to its parameter definition. Meaning, that the order of arguments and the type of arguments has to match with the definition of the operator's set of parameters. The execute handler from above e.g. would simply look like:

void OpExec(Op *op, int width, int height, const Vector2 &uv)
{
    int size = width*height;
}

This is exactly what you are used to when programming. The code is not cluttered with accesses to the params array anymore, nor is it as error-prone. Parameters are identified by their names and there is only one source of error (the function's declaration/definition) when it comes to the parameters' types and order.

Implementation

In order to implement such a system we have to go low-level. This means crafting handwritten assembly code to implement the underlying calling convention's argument passing and function calling ourselves. Consequently, you need profound knowledge of C++, x86/x64 assembly and calling conventions to understand what follows. Furthermore, the following code is mostly neither platform nor assembler independent, because of differences in how the operating systems implement certain calling conventions and the assembly syntax. I used Windows 8, Visual C++ 2012 and MASM.

x86 Implementation

On x86 I use Visual C++'s standard calling convention CDECL. CDECL as well as STDCALL (which differ only in who, the caller or the callee, cleans the stack) are very easy to implement by hand, because all arguments are passed on the stack. To do so, all parameters that should be passed to an execute handler are copied into an array first. This array is passed to the callFunc() function, which creates a stack frame, copies the data to the stack and calls the execute handler. The following code is used to copy each parameter to the array. To account for call-by-value and call-by-reference arguments the data's address or the data it-self is copied, depending on the parameter type.

// implemented in an .asm file
extern "C" void callFunc(const void *funcPtr, const size_t *stack, size_t numArgs);

class Op
{
public:
    Op(const void *execFunc, const OpParam *params, size_t numParams) : 
       m_execFunc(execFunc),
       m_numParams(numParams)
    {
        // copy over parameters
        std::copy(params, params+numParams, m_params);
    }

    void execute()
    {
        size_t args[16+1] = {(size_t)this}; // pass pointer to operator

        for (uint32_t i=0; i<m_numParams; i++)
        {
            const OpParam &p = m_params[i];
            switch (p.type)
            {
            case OPT_STR:
                args[i+1] = (size_t)&p.str;
                break;
            default: // float and integer types (it's a union!)
                args[i+1] = (p.chans > 1 ? (size_t)p.ints : (size_t)p.ints[0]);
                break;
            }
        }

        callFunc(m_execFunc, args, m_numParams+1);
    }

private:
    const void * m_execFunc;
    size_t       m_numParams;
    OpParam      m_params[16];
};

The implementation of callFunc() is given below. I used the MASM assembler because its shipped with Visual C++. I don't use inline assembly, because Visual C++ doesn't support x64 inline assembly and I wanted to have both callFunc() implementations in one file.

.686
.model flat, c             ; calling convention = cdecl
.code
callFunc proc funcPtr:ptr, stack:ptr, numArgs:dword
    push    esi            ; save callee saved regs
    push    edi
    mov     eax, [funcPtr] ; store arguments on stack before new
    mov     ecx, [numArgs] ; stack frame is installed (they
    mov     esi, [stack]   ; won't be accessible anymore)
    push    ebp
    mov     ebp, esp
    sub     esp, ecx       ; reserve stack space for parameters
    sub     esp, ecx       ; (4 * number of arguments)
    sub     esp, ecx
    sub     esp, ecx
    mov     edi, esp       ; copy parameters on stack
    rep     movsd
    call    eax            ; call execute handler
    mov     esp, ebp       ; remove stack frame
    pop     ebp
    pop     edi            ; restore callee saved regs
    pop     esi
    ret
callFunc endp
end

x64 Implementation

The x64 code is slightly more complex. The reason for this is that on x64 there is only one calling convention called FASTCALL. In this calling convention not all arguments just go onto the stack, but some of them have to be passed in registers. Furthermore, there are some differences in how call-by-value for structs works. Structures which's size does not exceed 64 bytes are passed in registers or on the stack. For bigger structures a pointer is passed. If the structure is passed by value, its data is first copied to the home space and the passed pointer points there. As I only needed call-by-reference for user-defined data types bigger than 64 bytes, I didn't have to care about this. Furthermore, the stack pointer RSP has to be 16 byte aligned when calling the execute handler. You might be wondering why 16 and not 8 byte. The reason for this is that it simplifies the alignment of SSE data types for the compiler (sizeof(__m128) = 16 bytes). Describing the calling convention in more detail is beyond the scope of this post, but there are plenty of good articles online.
As I mentioned before already, Visual C++ does not support x64 inline assembly. Therefore, I had to go for an extra .asm file which gets assembled by MASM64. The steps undertaken by the callFunc() implementation are:

  1. Allocate stack space for the parameters that won't be passed in registers.
  2. Align the stack pointer RSP on 16 byte boundary.
  3. Copy the 5th, 6th, … arguments to the stack.
  4. Copy the first four arguments to the registers RAX, RDX, R8 and R9.
  5. Call the execute handler.

In the following the source code of the x64 callFunc() function is depicted. You should note that MASM64 does not support using the arguments declared in the proc statement. This is why I directly access the arguments via the registers. For the ease of much simpler manual register allocation I copy them in the beginning to memory.

.data
numArgs dq 0
stack   dq 0
funcPtr dq 0

.code
callFunc proc funcPtrP:ptr, stackP:ptr, numArgsP:dword
    push    rdi
    push    rsi

    mov     [funcPtr], rcx  ; simplifies register allocation
    mov     [stack], rdx    ; don't use passed variable names!
    mov     [numArgs], r8   ; MASM doesn't support this for x64

; ----- allocate stack space and copy parameters -----

    mov     rcx, [numArgs]  ; number of passed arguments
    sub     rcx, 4          ; 4 of them will be passed in regs
    cmp     rcx, 0          ; some left for the stack?
    jng     noParamsOnStack
    lea     r10, [rcx*8]    ; calculate required stack space
    sub     rsp, r10        ; reserve stack space

    mov     r11, rsp        ; align stack pointer to 16 bytes
    and     r11, 15         ; mod 16
    jz      dontAlign       ; is already 16 bytes aligned?
    sub     rsp, r11        ; perform RSP alignment
dontAlign:

    mov     rdi, rsp        ; copy parameters to stack
    mov     rsi, [stack]
    add     rsi, 4*8        ; first 4 parameters are in regs
    rep     movsq

noParamsOnStack:

; ----- store first 4 arguments in RCX, RDX, R8, R9 -----

    mov     rsi, [stack]
    mov     r10, [numArgs]  ; switch (numArgs)
    cmp     r10, 4
    jge     fourParams
    cmp     r10, 3
    je      threeParams
    cmp     r10, 2
    je      twoParams
    cmp     r10, 1
    je      oneParam
    jmp     noParams

fourParams:                 ; fall through used
    mov     r9, [rsi+24]
    movss   xmm3, dword ptr [rsi+24]
threeParams:
    mov     r8, [rsi+16]
    movss   xmm2, dword ptr [rsi+16]
twoParams:
    mov     rdx, [rsi+8]
    movss   xmm1, dword ptr [rsi+8]
oneParam:
    mov     rcx, [rsi]
    movss   xmm0, dword ptr [rsi]
noParams:

; ----- call execute handler for operator -----

    sub     rsp, 20h        ; reserve 32 byte home space
    call    [funcPtr]       ; call function pointer

; ----- restore non-volatile registers -----

    mov     rsp, rbp
    pop     rsi
    pop     rdi
    ret
callFunc endp
end

Conclusion

The presented solution for implementing execute handlers works very well for me. Nevertheless, it is important to get the handler definitions/declarations right in order to avoid annoying crashes. There are no validity checks at compile-time. Apart from argument order, parameters cannot be passed call-by-value (at least in x86, in x64 call-by-value silently turns into call-by-reference because the parameter data is not copied to home space; which is not much better). You can download the full source code from my github repository.


24 Comments

Making-of “Turtles all the way down”

Introduction

At Revision 2013 we (Brain Control) released our newest production Turtles all the way down, which won the 64k-intro competition. A 64k-intro is an audio-visual presentation computed in real-time, where all code and data has to be contained in a single executable which size may not exceed 64 KB. Five guys spent a year of most of their spare-time on creating this 65.536 bytes counting piece of binary data. In this making-of you can read about the process we went through creating this intro and about the used technology. The party version of the intro can be watched either as a video (embedded below) or by downloading the binary from here. Note that you need a fast graphics card for good experience. We tested it on NVIDIA Geforce GTX 570, 670, AMD Radeon HD7870 and some mobile graphics cards. On the compo machine (Geforce GTX 680) we experienced serious stuttering in the underwater scene. If you have similar problems please contact us.

Beginnings

A few weeks after Revision 2012 payne came up with the idea of making an intro zooming from microcosm all the way out to macrocosm. Inspiration came from the famous Powers of ten video (1977) and the opening scene of the movie Contact (1997). It did not take long to convince the other team members, namely hunta, pap, strepto and skyrunner, of this idea and to start working on the intro.

All content creation, animation, sequencing, and post-processing was done solely with our demo-tool Enigma Studio 4 (depicted in the figure below). You can download the tool, the project files and the full source code from our website. The current version counts around 100.000 lines of code, is written in C++ and uses Direct3D 11 and Qt 5. We have actively developed Enigma Studio since 2005. Thus, during the last year we could mainly concentrate on creating the intro content because most of user-interface and fundamental engine code had been written already. Though, we had to face a lot of new challenges such as memory management, size optimizing shaders, level-of-detail (LOD) techniques, L-systems for plants and animals, realistic skin generation for shells and finally the planet rendering. Besides the graphics, payne rewrote once more our software synthesizer Tunefish, but more on that later.

Enigma Studio 4 demo tool used for creating all intro content

The next step was to distribute the work to be done. Skyrunner is the main musician of our team. Hence, he was in charge of doing the soundtrack. Pap undertook the job of bringing life to the underwater scene. This task meant rewriting the L-system code of Transplant so that even animals (nautilus and shells) could be generated. Additionally, a swarm behavior simulation and a skin texture generator had to be implemented. Strepto decided to work on the terrain and the planet rendering on the basis of ray-marching. Payne was responsible for the intro content and all special effects shaders. Last but not least, hunta cared about all the remaining engine and tool code, as well as porting the texture generator from the CPU to the GPU on the basis of compute shaders.

One year of work

Payne began working on the scenes taking place in microcosm: atoms, molecules, DNA and chromosomes. For that purpose hunta implemented a dynamic LOD system. It allowed showing a lot of geometry distributed over a wide depth range. Payne implemented a microscope shader, which gave all the beginning scenes a bluish look. After a while we realized that this look is too boring to be used for all beginning scenes. Therefore, payne created a different style for the first two scenes (atom and molecules). Below are some screenshots of the first attempts depicted.

This slideshow requires JavaScript.

The more payne progressed with the beginning scenes, the more the additional effects for the underwater and the planet scenes were needed. Pap progressed fast with his code. It was already early possible to generate undersea animals with very realistic skin textures and to simulate swarm behavior for the fish. The skin texture generation was based on the book Algorithmic beauty of sea shells. As pap knew his code and operators the best, he undertook the creation of the underwater scene geometry: placing plants, shells and nautilus, as well as simulating the fish swarms. All post-processing was added by payne. The caustics were faked by simply projecting an animated sum of two sine waves onto the ground and the water surface. The effect of sun shining diffusely through the water was faked by creating a halo post-processing shader, which added a shine over the final image. No real light source was used here. The same code was later used to create the sun glare in the space scenes. Below you can see some work-in-progress screenshots of the L-system and the underwater scene.

This slideshow requires JavaScript.

We knew early that every bit we could get would be needed to fit the whole code and data into 64 KB. Thus, it was clear that we also needed to come up with new ideas of how to synthesize audio.
Our old synthesizer Tunefish 3 was still relatively new and produced very satisfying sound. However, a range of different oscillators and the iFFT-based pad synth were simply too big. As throwing some of them out would have considerably reduced the final audio quality, payne sought a way to reduce the code size but not the audio quality. What if we found a way to produce all these nice sounds by coming up with just one type of oscillator instead of many? Classic oscillators can produce cool analog sounds, which we wanted to have. The iFFT-based pad synth algorithm can produce fat and rich sounds we did not want to miss neither. Though, it cannot produce the simple analog ones. Further more, it is not a real-time algorithm since it requires performing an iFFT on a table of 65536 frequencies. This was just not suitable for every kind of sound we needed, especially not for percussions. Consequently, the idea for the new generator of Tunefish 4 was born. What if we used an iFFT size that is

  1. small enough to be evaluated in real-time,
  2. still produces fat sounds and,
  3. can also produce the simple square, sine and saw-tooth sounds that we all love?

After some days of experimenting it got clear that this was possible. The first attempts sounded more like alien sounds than anything we could possibly use to produce music. Though, it was only a question of time to get used to the new generator. What you hear in Turtles all the way down is not all Tunefish 4 is capable of by any means. There is still a lot of potential left in this approach. As the full source-code is public now, everybody can try out the VSTi and play with the knobs. Below is a screenshot of the Tunefish 4 VSTi depicted.

Tunefish 4 VSTi

Let’s go back to the graphics. The terrain and the planet were both rendered via ray-marching, without using a single triangle. The difficulty with this approach was that we needed to change from a planar noise function for the terrain surface to a spherical one for the planet. Noise derivatives were used to slightly modify the traditional fractional Brownian motion (fBm) construction in order to generate a realistic surface. For performance reasons it was essential to ray-march as few steps as possible. This is why we used sphere tracing to step along the ray, starting only from the top of the atmosphere. Shadows, water reflections and water refractions were ray-marched using the same algorithm. The planet shader was combined with atmospheric scattering and clouds to increase the realism. Finally, some flock of birds and a space station rendered using traditional rasterization were combined with the planet by performing a per-pixel depth test. For that purpose the camera used in the terrain/planet shader had to match 1:1 the camera used for rasterization. Big thanks to IQ for his articles on noise derivatives, terrain marching and the Elevated making-of. Below are some work-in-progress screenshots depicted.

This slideshow requires JavaScript.

After the terrain and the planet scene the outer space begins. As we were not sure yet about the ending of the intro, payne started freely creating new artwork. There were plenty of ideas for the end of the intro:

  • simply looping the intro,
  • ending with a galaxy cluster, morphing into the shape of the Brain Control logo,
  • the camera flies out of a computer screen.

In the end, mainly due to size problems, we opted for a simple fade-to-black after the credits. To cope with the large amount of geometry, especially for the asteroids, again hunta’s LOD system was used. Below there are screenshots of some space scenes depicted which didn’t make it into the final intro.

This slideshow requires JavaScript.

When payne began working on the space scenes we started having some bigger issues with the engine sub-systems. First of all, we exhausted more and more often the amount of available virtual memory and sometimes even video memory. The problem was that while working on the intro, all scenes and all intermediate operator results got loaded into memory over time. Therefore, hunta implemented an operator memory manager. The memory manager assigned every operator class which consumed a significant amount of memory (currently texture, mesh and scene-graph operators) a budget size. It automatically freed memory of those operators that were not part of any active operator stack, as long as their class’ budget size was exceeded. The operators to be freed were chosen in a least-recently-used (LRU) fashion.

The second problem was massive turn-around times between successive stack updates, when working on large (e.g. 4096×4096) textures. For such texture dimensions the CPU texture generator, especially the Perlin noise implementation, was too slow. Thus, hunta finally did what he wanted to do already for a long time. He ported all texture operators as compute shaders to the GPU. This boosted the performance dramatically and consequently the turn-around times shortened. As a result we obtained much nicer nebula textures.

Around two month before Revision 2013 nero/Still began to help us. He offered to do what he called “color consulting”. Though, it turned out to be much more than this. Besides creating plenty of overpaints giving us new ideas of how to tweak colors and light to increase the overall scene quality, he provided tons of valuable feedback and tips on how to improve the flow and the style of the intro. Below you can see some of the original scenes and the corrections as overpaints as proposed by nero.

This slideshow requires JavaScript.

At the party place

Usually, we do a one week coding meeting directly before a demo party if we plan to release something at it. These meetings are of great value as they allow us to fully concentrate on the production. This year we did it again and progressed so well that we believed we would have plenty of time for boozing at Revision. It turned out we were wrong. After arriving at the party place we directly tested our intro on the compo machine. Unfortunately, it turned out that there was still a hell lot of work to do, if we wanted to make it for the deadline. We faced horrible stuttering issues in the underwater scene, discontinuities in the terrain scene and geometry popping issues in the chromosomes, DNA and molecules scenes. After three days of non-stop coding, we luckily managed to fix most of the issues. Unfortunately, we were not able to fix the stuttering, because this problem only appeared on the compo machine. After some blind attempts to fix it we simply ran out of time.

While hunta worked on fixing the issues mentioned above, pap and payne continued crunching even more content into the 64 KB. Pap worked hard on further size-optimizing the intro. This time he focused on the shaders. The first thing we did after Revision 2012 was changing from binary shaders to text shaders; size optimized by Shader Minifier. This approach saved us around 4 KB of compressed size in the player executable, but pap believed that there could be done more. Therefore, he crafted our own custom shader cruncher that applied a couple of more optimizations than Shader Minifier. It replaced e.g. all HLSL keywords by short macros and added a #define for that replacement to the shader, or it minified the variables declared in constant buffers. Our own shader cruncher saved another 2-3 KB just two days before the deadline right at the party place. This allowed us to add the space station and the space ships. Overall, 58 shaders were used in the intro with a total, uncompressed size of 170 KB (comments counted). Below there are some work-in-progress screenshots of the space station depicted.

This slideshow requires JavaScript.

All in all we are very happy about the first place at Revision 2013. Though, for us it was even more important that Brain Control as a demo group did the next step when it comes to teamwork, technology, design and direction. We are already looking forward to the next 64k-intro competition and hope to see you at the next demoparty!


Leave a comment

Revision 2013: “Turtles all the way down” 64k-intro

At Revision 2013 my demo group Brain Control won the the first place in the 64k-intro competition with their newest production Turtles all the way down. You can download the demo tool which was used to create this intro with full source-code. In the following days I am going to publish a making-of, giving a lot of insights of the creation process.