The Infinite Loop

Tales from a lean programmer.


Leave a comment

Custom deleters for smart pointers in modern C++

Introduction

Many C++ programmers are not aware of a big difference in how the custom deleters of std::unique_ptr and std::shared_ptr are implemented in the C++ standard library. std::unique_ptr carries the custom deleter as part of its type (template<class T, class Deleter> std::unique_ptr). In contrast, the custom deleter of std::shared_ptr is not part of the type (template<class T> std::shared_ptr) but part of the constructor’s template argument list (template<class Y, class Deleter> shared_ptr(Y *ptr, Deleter d)). Mostly, this difference doesn’t matter much. Though, there are use-cases, like e.g. factories returning std::unique_ptr with a custom deleter, were the difference does matter.

Design choices

std::unique_ptr

The advantage of making the custom deleter part of std::unqiue_ptr‘s type is that, as long as the deleter is stateless (e.g. a lambda that doesn’t capture anything or a function with no member variables), storing it doesn’t take up any additional memory thanks to the empty base optimization. This makes std::unique_ptr a zero-overhead abstraction, which means that:

  1. Its size is identical to the size of a raw pointer on the underlying architecture.
  2. All calls to the deleter can be inlined.

One possible implemention which makes use of the empty base optimization is to store the wrapped pointer together with the deleter in a compressed pair. The obvious disadvantage of making the custom deleter part of the type is that two std::unique_ptrs with different custom deleters are of different type, even if they wrap the same pointer type.

std::shared_ptr

In contrast to std::unique_ptr, std::shared_ptr provides the convinience of a type erased deleter. Type erased means that the type of the custom deleter is not dragged into std::shared_ptr‘s type. Hence, one cannot know by just looking at the type if two std::shared_ptr instances have different custom deleters or not.
The type erasure makes std::shared_ptr more flexible. For example changing the allocation strategy of a factory, and with it the custom deleter of the returned std::shared_ptrs, doesn’t break source/binary compatibility and thereby, doesn’t require any recompilation of client software.
The drawback is that storing the custom deleter takes up additional memory, because some wrapper (e.g. std::function or a raw function pointer) is needed to store the custom deleter. The rationale behind this design choice is that std::shared_ptr must anyways heap allocate memory for its shared control block, containing the wrapped pointer and the reference counter. Additionally including the custom deleter didn’t seem like a big cost, taking the increased flexiblity into account.

Type erased custom deleters with std::unique_ptr

Imagine you’re building an object factory which returns std::unique_ptrs. The return type of the factory’s Create() function must allow casting instances of different derived classes to the same std::unique_ptr type. One way to do that is to use a std::unique_ptr to the base class. This, however, requires the base class’ destructor to be virtual. What if the destructor cannot be virtual for some reason or the implications for source and binary compatibility are limiting?

An alternative is to create a type erased custom deleter for std::unique_ptr by wrapping the deleter e.g. in an std::function. The wrapped function is then responsible for casting the void * to the correct type when deleting it. This construction works for virtual and non-virtual classes as well as for multiple inheritance, because the deleter casts the void * argument containing the address to the most derived class simply back to the type of the most derived class.

template<typename Type>
void MyDelete(void *ptr) // Casts 'ptr' to real type and deletes it
{
    delete static_cast<Type *>(ptr);
}

auto Factory::Create(int typeId)
{
    // Unique pointer with type erased custom deleter
    using UniquePtr = std::unique_ptr<Base, std::function<void(void *)>>;

    switch (typeId)
    {
    case 0: return UniquePtr(new Derived0, MyDelete<Derived0>); 
    case 1: return UniquePtr(new Derived1, MyDelete<Derived1>);
    // ...
   }
}

The applied type erasure doesn’t come for free. There are two penalties to pay:

  1. Destroying the pointer cannot be inlined anymore and therefore always requires an additional function call.
  2. Additional memory is required to store the deleter.

It turns out that the std::function wrapper increases the memory footprint of the std::unique_ptr type considerably (32 bytes with GCC 5.3.1’s libc++ and 64 bytes with Visual C++ 2015, both 64 bit). Luckily, we can use a simple function pointer to reduce the total size of the final std::unique_ptr to 16 bytes.

using UniquePtr = std::unique_ptr<Base, void(*)(void *)>;


Leave a comment

Scalable spinlocks 1: array-based

Last time we saw that spinlock implementations which only use a single synchronization variable (Test-And-Set Lock, Ticket Lock) don’t scale with growing numbers of threads. Today, I want to talk about two spinlock variants that scale. Namely the Graunke and Thakkar Lock1 (1989) and the Anderson Lock2 (1990). Their underlying key idea is to use one synchronization variable per thread instead of one for all threads, to reduce the amount of cache line invalidations when acquiring/releasing the lock. Both spinlock variants store the synchronization variables in an array. This means that there’s an upper bound on the maximum number of thread’s that can compete for the lock concurrently, which must be known a priori. In upcoming blog posts I’m going to show spinlock variants (the MCS Lock and the CLH Lock) that improve upon array-based spinlocks by removing this hard upper limit.

Anderson’s lock

Anderson’s spinlock is basically a FIFO queue of threads. It’s implemented as an array, which is accessed via an atomic ring-buffer-style index counter. Hence, it requires an atomic fetch-and-add (std::atomic::fetch_add in C++11) operation to advance the ring-buffer index. The basic idea is that each thread spins on its own synchronization variable while it’s = true. The synchronization variable of the next thread in line is eventually set to false when the previous thread leaves the critical section (CS).

Below is an implementation of Anderson’s spinlock in C++11. Note, that this implementation is for illustrative purpose only. It’s not optimized for performance. The array of synchronization variables LockedFlags is initially set to [false, true, true, ..., true]. The first element is set to false, so that the very first thread in line can enter the critical section: for the first thread there’s no previous thread which can enable the first thread to enter the CS. After successfully entering the CS the synchronization variable is false and needs to be reset to true for future threads. It must be taken care for that NextFreeIdx doesn’t overflow. If the maximum number of threads is a power-of-2 integer arithmetic will just work fine, because the overflow matches up with the modulo computation. Otherwise, NextFreeIdx must be kept in check to avoid overflows.

Anderson’s lock requires O(L*T) memory to store L locks with a maximum thread count of T.

class AndersonSpinLock
{
public:
    AndersonSpinLock(size_t maxThreads=std::thread::hardware_concurrency()) :
        LockedFlags(maxThreads)
    {
        for (auto &flag : LockedFlags)
            flag.first = true;

        LockedFlags[0].first = false;
    }

    ALWAYS_INLINE void Enter()
    {
        const size_t index = NextFreeIdx.fetch_add(1)%LockedFlags.size();
        auto &flag = LockedFlags[index].first;

        // Ensure overflow never happens
        if (index == 0)
            NextFreeIdx -= LockedFlags.size();

        while (flag)
            CpuRelax();

        flag = true;
    }

    ALWAYS_INLINE void Leave()
    {
        const size_t idx = NextServingIdx.fetch_add(1);
        LockedFlags[idx%LockedFlags.size()].first = false;
    }

private:
    using PaddedFlag = std::pair<std::atomic_bool,
                                 uint8_t[CACHELINE_SIZE-sizeof(std::atomic_bool)]>;
    static_assert(sizeof(PaddedFlag) == CACHELINE_SIZE, "");

    alignas(CACHELINE_SIZE) std::vector<PaddedFlag> LockedFlags;
    alignas(CACHELINE_SIZE) std::atomic_size_t      NextFreeIdx = {0};
    alignas(CACHELINE_SIZE) std::atomic_size_t      NextServingIdx = {1};
};

Graunke and Thakkar’s lock

In contrast to Anderson’s ring-buffer-based lock, Graunke and Thakkar use a linked list to establish a FIFO queue of threads attempting to acquire the lock. Hence, an atomic fetch-and-store operation (std::atomic::exchange in C++11) is needed to atomically update the tail of the queue. Like Anderson’s lock, every thread spins on a different synchronization variable to reduce cache traffic due to cache line invalidations. A thread trying to enter the CS spins on the synchronization variable of the thread in line before it-self by looking at the tail pointer of the wait queue. Every thread is responsible for flipping its synchronization variable when it leaves the CS, which in turn enables the next thread in line to enter.

Below is an implementation of Graunke and Thakkar’s spinlock in C++11. Note, that this implementation is for illustrative purpose only. It’s neither optimized for performance, nor does it work if threads are recreated because of how thread indices are assigned. In their original implementation – which is reflected below – the synchronization flag is not reset after successfully entering the CS. That’s why additionally the old value of the flag must be stored along with the flag pointer in the queue. The spin loop compares the current flag value with the old flag value to see if the previous thread in line has already unlocked the CS by flipping its synchronization variable. The approach used in Anderson’s lock would simplify the code significantly: the old flag value wouldn’t have to be stored anymore in the tail, which means we could get rid of all the binary arithmetic to crunch in and extract the old flag value from the tail.

Graunke and Thakkar’s lock requires O(L*T) memory to store L locks with a maximum thread count of T.

class GraunkeAndThakkarSpinLock
{
public:
    GraunkeAndThakkarSpinLock(size_t maxThreads=std::thread::hardware_concurrency()) :
        LockedFlags(maxThreads)
    {
        for (auto &flag : LockedFlags)
            flag.first = 1;

        assert(Tail.is_lock_free());
        Tail = reinterpret_cast(&LockedFlags[0].first);
        // Make sure there's space to store the old flag value in the LSB
        assert((Tail&1) == 0);
    }

    ALWAYS_INLINE void Enter()
    {
        // Create new tail by chaining my synchronization variable into the list
        const auto &newFlag = LockedFlags[GetThreadIndex()].first;
        const auto newTail = reinterpret_cast(&newFlag)|static_cast(newFlag);
        const auto ahead = Tail.exchange(newTail);

        // Extract flag and old value of previous thread in line,
        // so that we can wait for its completion
        const auto *aheadFlag = reinterpret_cast(ahead&(~static_cast(1)));
        const auto aheadValue = static_cast(ahead&1);

        // Wait for previous thread in line to flip my synchronization variable
        while (aheadFlag->load() == aheadValue)
            CpuRelax();
    }

    ALWAYS_INLINE void Leave()
    {
        // Flipping synchronization variable enables next thread in line to enter CS
        auto &flag = LockedFlags[GetThreadIndex()].first;
        flag = !flag;
    }

private:
    ALWAYS_INLINE size_t GetThreadIndex() const
    {
        static std::atomic_size_t threadCounter = {0};
        thread_local size_t threadIdx = threadCounter++;
        assert(threadIdx < LockedFlags.size());
        return threadIdx;
    }

private:
    using PaddedFlag = std::pair<std::atomic_uint16_t,
                                 uint8_t[CACHELINE_SIZE-sizeof(std::atomic_uint16_t)]>;
    static_assert(sizeof(PaddedFlag) == CACHELINE_SIZE, "");

    // In the LSB the old value of the flag is stored
    alignas(CACHELINE_SIZE) std::atomic<uintptr_t>  Tail;
    alignas(CACHELINE_SIZE) std::vector<PaddedFlag> LockedFlags;

    static_assert(sizeof(decltype(LockedFlags)::value_type) > 1,
                  "Flag size > 1 required: thanks to alignment, "\
                  "old flag value can be stored in LSB");
};

Comparison

The major difference between Anderson’s and Graunke and Thakkar’s spinlock implementations is how synchronization variables are assigned to threads. In Graunke and Thakkar’s lock there’s a 1:1 mapping from synchronization variables to threads. Always the same synchronization variable is assigned to the same thread. As a consequence the lock can be implemented in such a way that synchronization variables are stored in memory locations local to the spinning thread. This allows the lock to perform well on even on cache-less NUMA machines.

In contrast, Anderson’s lock has no control over the memory location in which a thread’s synchronization variable is stored. The synchronization variable assigned to a thread only depends on the order in which threads attempt to acquire the lock. Therefore, the Anderson lock performs much worse on cache-less machines, because it’s entirely possible that a thread spins on a synchronization variable in a remote memory location. On all modern mainstream CPUs this difference has marginal effect, because the CPUs have caches into which the synchronization variables can migrate during spinning.


  1. G. Graunke, S. Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. IEEE Computer, Vol. 23, No. 6, (1990), pp. 60-69 
  2. T. E. Anderson. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 1. (1990), pp. 6-16 


6 Comments

The ticket spinlock

Last time we saw how a spinlock can be implemented using an atomic test-and-set (TAS) operation on a single shared synchronization variable. Today, I want to talk about another spinlock variant called Ticket Lock. The Ticket Lock is very similar to TAS-based spinlocks in terms of scalability, but supports first-in-first-out (FIFO) fairness.

As the name suggests, the Ticket Lock employs the same concept as e.g. hair dressers do to serve their customers in the order of arrival. On arrival customers draw a ticket from a ticket dispenser which hands out tickets with increasing numbers. A screen displays the ticket number served next. The customer holding the ticket with the number currently displayed on the screen is served next.

Implementation

The C++11 implementation below uses two std::atomic_size_t variables as counters for the ticket number currently served (ServingTicketNo) and the ticket number handed out to the next arriving thread (NextTicketNo). The implementation is optimized for x86 CPUs. The PAUSE instruction (called from CpuRelax()) is used when spin-waiting and both counters are cache line padded to prevent false sharing. Read the previous article on TAS-based locks for more information.

class TicketSpinLock
{
public:
    ALWAYS_INLINE void Enter()
    {
        const auto myTicketNo = NextTicketNo.fetch_add(1, std::memory_order_relaxed);

        while (ServingTicketNo.load(std::memory_order_acquire) != myTicketNo)
            CpuRelax();
    }

    ALWAYS_INLINE void Leave()
    {
        // We can get around a more expensive read-modify-write operation
        // (std::atomic_size_t::fetch_add()), because no one can modify
        // ServingTicketNo while we're in the critical section.
        const auto newNo = ServingTicketNo.load(std::memory_order_relaxed)+1;
        ServingTicketNo.store(newNo, std::memory_order_release);
    }

private:
    alignas(CACHELINE_SIZE) std::atomic_size_t ServingTicketNo = {0};
    alignas(CACHELINE_SIZE) std::atomic_size_t NextTicketNo = {0};
};

static_assert(sizeof(TicketSpinLock) == 2*CACHELINE_SIZE, "");

Overflow is pretty much impossible when 64-bit counters are used. But what happens when the counters are of smaller bit width and eventually overflow? It turns out that overflow is safe, as long as the number of threads using the lock is less than or equal to the value range representable by the counter’s underlying integer type (e.g. 256 for 8-bit counters). Let’s consider a 3-bit integer, which can represent values from 0 to 7. Overflow is safe as long as the there are never more than 8 threads competing for the lock, because the condition ServingTicketNo != myTicketNo is guaranteed to be always only false for the next thread in line. If there were 9 or more threads, the NextTicketNo counter could reach the same value ServingTicketNo has and accordingly two threads could enter the critical section (CS) at the same time. The figure below illustrates the case where 8 threads are competing for the lock. Just one more competing thread could cause multiple threads entering the CS at the same time.

   ServingTicketNo   NextTicketNo
          |             |
          V             V
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 ...
          \_____________/
          Max. 8 threads

No second thread may grab another NextTicketNo=5!

The above means that we can crunch down our implementation in terms of memory footprint by removing the padding and using 8- or 16-bit counters, depending on the expected maximum number of threads. Removing the padding comes at the cost of potential false sharing issues.

Proportional back-off

Under high contention the Test-And-Test-And-Set Lock backs-off exponentially. The reduced amount of memory bus traffic improves performance. What about using the same backing-off strategy with Ticket Locks? It turns out that this is a very bad idea, because the FIFO order causes the back-off delays to accumulate. For example, the thread with myTicketNo == ServingTicketNo+3 must always wait as least as long as the thread with myTicketNo == ServingTicketNo+2 and the thread with myTicketNo == ServingTicketNo+2 must always wait as least as long as the thread with myTicketNo == ServingTicketNo+1.

Is there something else we can do? It turns out we can thanks to the FIFO order in which threads are granted access to the CS. Every thread can calculate how many other threads are going to be granted access to the CS before it-self as numBeforeMe = myTicketNo-ServingTicketNo. With this knowledge and under the assumption that every thread holds the lock approximately for the same duration, we can back-off for the number of threads in line before us times some constant BACKOFF_BASE. BACKOFF_BASE is the expected average time that every thread spends inside the CS. This technique is called proportional back-off.

ALWAYS_INLINE void PropBoTicketSpinLock::Enter()
{
    const auto myTicketNo = NextTicketNo.fetch_add(1, std::memory_order_relaxed);

    while (true)
    {
        const auto servingTicketNo = ServingTicketNo.load(std::memory_order_acquire);
        if (servingTicketNo == myTicketNo)
            break;

        const size_t numBeforeMe = myTicketNo-servingTicketNo;
        const size_t waitIters = BACKOFF_BASE*numBeforeMe;

        for (size_t i=0; i<waitIters; i++)
            CpuRelax();
    }
}

Wrap up

How does the Ticket Lock compare against TAS-based locks and in which situations which of the two locks variants is preferably used? The Ticket Lock has the following advantages over TAS-based locks:

  • The Ticket Lock is fair, because the threads are granted access to the CS in FIFO order. This prevents the same thread from reacquiring the lock multiple times in a row, which – at least in theory – could starve other threads.
  • In contrast to all TAS-based locks the Ticket Lock avoids the Thundering Herd problem, because waiting for the lock and acquiring it doesn’t require any read-modify-write or store operation.
  • In contrast to the Test-And-Set Lock only a single atomic load operation is repeatedly executed while waiting for the lock to become available. Though, this problem is solved by the TTAS Lock.

The biggest disadvantage of the Ticket Lock is that the fairness property backfires once there are more threads competing for the lock than there are CPU cores in the system. The problem is that in that case the thread which can enter the CS next might be sleeping. This means that all other threads must wait, because of the strict fairness guarantee. This property is sometimes referred to as preemption intolerance.

Furthermore, the Ticket Lock doesn’t solve the scalability issue of TAS-based spinlocks. Both spinlock variants don’t scale well, because the number of cache line invalidations triggered when acquiring/releasing the lock is O(#threads). There are scalable lock implementations where all threads spin on different memory locations. These spinlock variants only trigger O(1) many cache line invalidations. Next time we’ll look into scalable spinlock variants.


2 Comments

Test-and-set spinlocks

In my previous blog post I wrote about the most important properties of spinlocks and why they matter. This time I’ll present the first out of three concrete spinlock implementations in C++11 for x86 processors. I’m assuming that you’ve read the first post. You can find all source code in this Github repository.

Introduction

The Test-And-Set (TAS) Lock is the simplest possible spinlock implementation. It uses a single shared memory location for synchronization, which indicates if the lock is taken or not. The memory location is updated using a test-and-set (TAS) operation. TAS atomically writes to the memory location and returns its old value in a single indivisible step. Actually, the name test-and-set is a little misleading, because the caller is responsible for testing if the operation has succeeded or not. The TAS Lock is not fair as it doesn’t guarantee FIFO ordering amongst the threads competing for the lock.

In C++11 std::atomic_bool::exchange() can be used to perform a TAS operation on an std::atomic_bool synchronization variable. On x86 CPUs std::atomic::exchange() is turned into the LOCK XCHG instruction (side note: the LOCK prefix is implicit and not strictly required, because there’s no non-atomic version of XCHG). The following code implements the described TAS Lock.

class TasSpinLock
{
public:
    ALWAYS_INLINE void Enter()
    {
        // atomic_bool::exchange() returns previous value of Locked
        while (Locked.exchange(true, std::memory_order_acquire) == true);
    }

    ALWAYS_INLINE void Leave()
    {
        Locked.store(false, std::memory_order_release);
    }

private:
    alignas(CACHELINE_SIZE) std::atomic_bool Locked = {false};
};

static_assert(sizeof(TasSpinLock) == CACHELINE_SIZE, "");

The TasSpinLock::Locked flag is cache line size padded using alignas to prevent false sharing. You can remove the padding, e.g. in memory-limited environments, when false sharing is not an issue.

The test-and-test-and-set optimization

While the TAS Lock is extremely easy to implement, its scalability is very bad. Already with just a few threads competing for the lock, the amount of required cache line invalidations to acquire/release the lock quickly degrades performance. The problem is two-fold.

  1. std::atomic_bool::exchange() always invalidates the cache line Locked resides in, regardless of whether it succeeded in updating Locked or not.
  2. When the spinlock is released, all waiting threads simultaneously try to acquire it. This is sometimes referred to as Thundering Herd problem. Acquiring the lock means first invalidating the cache line copy of all threads waiting for the lock and then reading the valid cache line copy from the core which released the lock. With t threads this results in O(t) memory bus transactions.

The latter issue is inherent to TAS Locks, because just a single synchronization variable is used and shared between all threads. However, the first issue can be fixed by testing if the lock is free before calling exchange(). First testing if the lock is free only loads the synchronization variable and therefore doesn’t cause a cache line invalidation. This spinlock variant is called Test-And-Test-And-Set (TTAS) Lock. The improved Enter() method looks like this:

ALWAYS_INLINE void TTasSpinLock::Enter()
{
    do
        WaitUntilLockIsFree();
    while (Locked.exchange(true, std::memory_order_acquire) == true);
}

ALWAYS_INLINE void TTasSpinLock::WaitUntilLockIsFree() const
{
    while (Locked.load(std::memory_order_relaxed) == true);
}

Exponential back-off

The TTAS Lock successfully reduces the amount of cache line invalidations occurring while the lock is trying to be acquired. However, as soon as the lock is released, again all threads try to update the Locked flag. If a thread sees that the lock is free, but fails acquiring it subsequently because some other thread was faster, the lock is most probably under high contention. In this situation it can help to wait for a short time to let other threads finish before trying to enter the critical section (CS) again.

Waiting a short time without trying to acquire the lock reduces the number of threads simultaneously competing for the lock. The larger the number of unsuccessful retries, the higher the lock contention and the longer the thread should back-off. Experiments have shown that a good strategy is to back-off exponentially; similar to the congestion avoidance mechanism used in the Ethernet protocol. To avoid threads running in lock step the initial wait time is randomized.

ALWAYS_INLINE static void BackoffExp(size_t &curMaxIters)
{
    assert(curMaxIters > 0);

    thread_local std::uniform_int_distribution<size_t> dist;
    thread_local std::minstd_rand gen(std::random_device{}());
    const size_t spinIters = dist(gen, decltype(dist)::param_type{0, curMaxIters});

    curMaxIters = std::min(2*curMaxIters, MAX_BACKOFF_ITERS);
    for (volatile size_t i=0; i<spinIters; i++); // Avoid being optimized out!
}

ALWAYS_INLINE void ExpBoTTasSpinLock::Enter()
{
    size_t curMaxIters = MIN_BACKOFF_ITERS;

    while (true)
    {
        // Not strictly required but doesn't hurt
        WaitUntilLockIsFree();

        if (Locked.exchange(true, std::memory_order_acquire) == true)
            BackoffExp(curMaxIters); // Couldn't acquire lock => back-off
        else
            break; // Acquired lock => done
    }
}

The downside of waiting is that it renders the lock even unfairer than it is already. Threads that have been attempting to acquire the lock longest are also backing-off longest. Consequently, newly arriving threads have a higher chance to acquire the lock than older threads. On top of that threads might back-off for too long, causing the CS to be underutilized. Furthermore, there are unfortunately no single best values for MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS. Optimally, they’re determined experimentally depending on the workload (contention, size of CS).

Pausing and sleeping

There are two more improvements we can do. First, according to the Intel Architectures Optimization Reference Manual, adding a PAUSE instruction to the body of all spin loops improves performance on Pentium 4 CPUs and later. The PAUSE instruction provides a hint to the CPU that the code being executed is a spin-wait loop. PAUSE is backwards compatible and turns into a NOP instruction on older CPUs. There are three reasons why using PAUSE improves performance:

  • It avoids memory order violations which result in expensive pipeline flushes, because the CPU doesn’t go ahead to fill its pipeline with speculatively executed load and compare instructions.
  • It gives the other hardware thread time to execute on simultaneous multi-threading (SMT) processors (e.g. Intel’s Hyper Threading).
  • It adds a slight delay to the spin loop, which synchronizes it with the system’s memory bus frequency. This frequency is typically much lower than the frequency of the CPU, which in turn reduces power consumption considerably.

The second improvement is to stop spinning, put the thread to sleep and reschedule if the the lock doesn’t become available after some time. The important bit here is to not yield (sched_yield() on Linux, SwitchToThread() on Windows, or more generally std::this_thread::yield() in C++11), but to explicitly put the thread to sleep for a short period of time. This ensures that the thread is really paused for some time and isn’t immediately run again by the scheduler if there’s no other thread available in the scheduler’s run queue. This is especially important on SMT processors where every two logical cores share most of the execution units. Only when the thread is really sleeping the other logical core can fully make use of the shared execution units. The following TTAS Lock implementation uses the PAUSE instruction and sleeps for 500us if the lock isn’t released within MAX_WAIT_ITERS iterations. There’s no single best value for MAX_WAIT_ITERS. Ideally, it’s chosen experimentally like the MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS constants from before.

ALWAYS_INLINE static void CpuRelax()
{
#if (COMPILER == MVCC)
    _mm_pause();
#elif (COMPILER == GCC || COMPILER == LLVM)
    asm("pause");
#endif
}

ALWAYS_INLINE static void YieldSleep()
{
    // Don't yield but sleep to ensure that the thread is not
    // immediately run again in case scheduler's run queue is empty
    using namespace std::chrono;
    std::this_thread::sleep_for(500us);
}

ALWAYS_INLINE static void BackoffExp(size_t &curMaxIters)
{
    ... Unchanged code not repeated, look above ...

    for (size_t i=0; i<spinIters; i++)
        CpuRelax();
}

ALWAYS_INLINE void ExpBoRelaxTTasSpinLock::WaitUntilLockIsFree() const
{
    size_t numIters = 0;

    while (Locked.load(std::memory_order_relaxed))
    {
        if (numIters < MAX_WAIT_ITERS)
        {
            CpuRelax();
            numIters++;
        }
        else
            YieldSleep();
    }
}

Performance comparison

The benchmark used to compare the different TAS Lock variants is extremely simple and should be taken with a grain of salt. It launches t threads, each competing for the lock and incrementing a global atomic counter inside the CS. This causes high lock contention, because all threads are trying to enter the lock at the same time. Though, it’s important to keep in mind that due to the lock’s lack of fairness, it’s entirely possible that the same thread acquires the lock multiple times in a row. I ran the benchmark on a dual socket NUMA system with two Intel Xeon E5-2630 v2 CPUs with 2.60 GHz. The CPUs support Hyper Threading which gives a total of 12 physical and 24 logical cores. The benchmark results are shown below.

TAS Lock benchmark results

The most interesting finding is that the amount of lock reacquisitions is a lot higher than I anticipated. This can be derived from the fact that all TTAS-based locks scale almost linearly, where typically the observed drop in performance for TAS-based locks is exponential. Large numbers of lock reacquisitions by the same thread reduces the amount of cache line invalidations and increases throughput. At the same time the latency of all other threads goes up, because they’re granted the lock later.

The TasSpinLock scales so much worse, even though lock reacquisiton happens, because while spinning the TAS operation causes cache line invalidations also if it doesn’t succeed in acquiring the lock. By looking at the difference between the TasSpinLock and the other TTAS-based versions, it’s highly visible how big the influence of cache line invalidations is on the lock’s performance.

Wrapup

The exponential back-off TTAS Lock tends to perform fairly well as long as the number of competing threads is low or lock reacquisitions boost throughput while increasing latency. It’s especially useful when it’s not known a priori that the number of competing threads is bounded by the number of cores in the system (in that case better scaling spinlock variants perform really badly as we’ll see in the next post). Beyond that, without padding the TTAS Lock just needs a single byte. The compact representation is useful in extremely memory-constrained environments or applications that require vast amounts of spinlocks; e.g. for fine-grained locking of lots of small pieces of data.


2 Comments

Mutex lock guards in C++11

The new concurrency library of C++11 comes with two different classes for managing mutex locks: namely std::lock_guard and std::unique_lock. How do they compare with each other and in which situation which of the two classes should be preferably used?

The std::lock_guard class keeps its associated mutex locked during the entire life time by acquiring the lock on construction and releasing the lock on destruction. This makes it impossible to forget unlocking a critical section and it guarantees exception safety because any critical section is automatically unlocked when the stack is unwound after an exception was thrown. The std::lock_guard class should be used when a limited scope, like a class method, is to be locked.

void Foo::Bar()
{
    std::lock_guard<std::mutex> guard(this->Mutex);
    // mutex is locked now
}   // mutex is unlocked when lock guard goes out of scope

In contrast, the std::unique_lock class is a lot more flexible when dealing with mutex locks. It has the same interface as std::lock_guard but provides additional methods for explicitly locking and unlocking mutexes and deferring locking on construction. By passing std::defer_lock instead of std::adopt_lock the mutex remains unlocked when a std::unique_lock instance is constructed. The lock can then be obtained later by calling lock() on the std::unique_lock instance or alternatively, by passing it to the std::lock() function. To check if a std::unique_lock currently owns its associated mutex the owns_lock() method can be used. Hence, the mutex associated with a std::unique_lock doesn’t have to be locked (sometimes also referred to as owned) during the lock guard’s entire life time. As a consequence, the ownership of a std::unqiue_lock can be transferred between instances. This is why std::unique_lock is movable whereas std::lock_guard is not. Thus, more flexible locking schemes can be implemented by passing around locks between scopes.
For example a std::unique_lock can be returned from a function, or instances of a class containing a std::unique_lock attribute can be stored in containers. Consider the following example in which a mutex is locked in the function Foo(), returned to the function Bar() and only then unlocked on destruction.

std::mutex Mutex;

std::unique_lock<std::mutex> Foo()
{
    std::unique_lock<std::mutex> lock(Mutex);
    return lock;
    // mutex isn't unlocked here!
}

void Bar()
{
    auto lock = Foo();
}   // mutex is unlocked when lock goes out of scope

Keeping std::unique_lock‘s additional lock status up-to-date induces some additional, minimal space and speed overhead in comparison to std::lock_guard. Hence, as a general rule, std::lock_guard should be preferably used when the additional features of std::unique_lock are not needed.


1 Comment

Operator systems: accessing parameters in execute handlers

Introduction

In Enigma Studio we have a number of operators. Each operator performs a certain operation on a data structure (e.g. mesh, bitmap, …) based on a set of parameters. The parameters can be of different type; e.g. 1-, 2-, 3- and 4-dimensional int or float vector, flag, enumeration or string. When an operator is executed to perform its operation the operator's execute handler gets called. In the implementation of the execute handler there has to be a way for the programmer to access the operator's set of parameters. Coming up with a programmer-friendly way to access the parameters is not that straightforward, because all obvious approaches require writing code where the individual parameters have to be accessed explicitly by calling a getter function or by accessing an array. Look at the following code for an illustration. Note, that all code presented in the following is purely for educational purpose. It is by no mean intended to be a feature complete operator stacking system.

enum OpParamType
{
    OPT_INT,
    OPT_FLT,
    OPT_STR
};

struct OpParam
{
    OpParamType type;
    int         chans;

    union
    {
        float   flts[4];
        int     ints[4];
    };

    std::string str; // can't be in union
};

struct Op
{
    void execute()
    {
        execFunc(this);
    }

    void (* execFunc)(Op *op); // pointer to execute handler
    OpParam params[16];
    size_t  numParams;
};

// execute handler
// param[0]: width
// param[1]: height
// param[2]: UV coordinates
void OpExec(Op *op)
{
    int size = op->params[0].ints[0]*op->params[1].ints[0];
    Vector2 uv(op->params[2].flts[0], op->params[2].flts[1]);
}

This approach has some issues. First of all, it is very error-prone because a parameter is solely identified by its index (we could use strings, but it doesn't make it much better). Furthermore, whenever the order of parameters or the type of a parameter changes, each access to the params array of the respective operator has to be changed as well. This sucks and this is something no programmer wants to do. Finally, the syntax used above is not very readable; especially when user-defined data types like Vector2 are used. Let's do something about it!

Concept

In the following I'll present the approach I've come up with. It's based on directly passing the parameter data to the execute handler. This way the only thing that has to be done is to declare/define the execute handler accordingly to its parameter definition. Meaning, that the order of arguments and the type of arguments has to match with the definition of the operator's set of parameters. The execute handler from above e.g. would simply look like:

void OpExec(Op *op, int width, int height, const Vector2 &uv)
{
    int size = width*height;
}

This is exactly what you are used to when programming. The code is not cluttered with accesses to the params array anymore, nor is it as error-prone. Parameters are identified by their names and there is only one source of error (the function's declaration/definition) when it comes to the parameters' types and order.

Implementation

In order to implement such a system we have to go low-level. This means crafting handwritten assembly code to implement the underlying calling convention's argument passing and function calling ourselves. Consequently, you need profound knowledge of C++, x86/x64 assembly and calling conventions to understand what follows. Furthermore, the following code is mostly neither platform nor assembler independent, because of differences in how the operating systems implement certain calling conventions and the assembly syntax. I used Windows 8, Visual C++ 2012 and MASM.

x86 Implementation

On x86 I use Visual C++'s standard calling convention CDECL. CDECL as well as STDCALL (which differ only in who, the caller or the callee, cleans the stack) are very easy to implement by hand, because all arguments are passed on the stack. To do so, all parameters that should be passed to an execute handler are copied into an array first. This array is passed to the callFunc() function, which creates a stack frame, copies the data to the stack and calls the execute handler. The following code is used to copy each parameter to the array. To account for call-by-value and call-by-reference arguments the data's address or the data it-self is copied, depending on the parameter type.

// implemented in an .asm file
extern "C" void callFunc(const void *funcPtr, const size_t *stack, size_t numArgs);

class Op
{
public:
    Op(const void *execFunc, const OpParam *params, size_t numParams) : 
       m_execFunc(execFunc),
       m_numParams(numParams)
    {
        // copy over parameters
        std::copy(params, params+numParams, m_params);
    }

    void execute()
    {
        size_t args[16+1] = {(size_t)this}; // pass pointer to operator

        for (uint32_t i=0; i<m_numParams; i++)
        {
            const OpParam &p = m_params[i];
            switch (p.type)
            {
            case OPT_STR:
                args[i+1] = (size_t)&p.str;
                break;
            default: // float and integer types (it's a union!)
                args[i+1] = (p.chans > 1 ? (size_t)p.ints : (size_t)p.ints[0]);
                break;
            }
        }

        callFunc(m_execFunc, args, m_numParams+1);
    }

private:
    const void * m_execFunc;
    size_t       m_numParams;
    OpParam      m_params[16];
};

The implementation of callFunc() is given below. I used the MASM assembler because its shipped with Visual C++. I don't use inline assembly, because Visual C++ doesn't support x64 inline assembly and I wanted to have both callFunc() implementations in one file.

.686
.model flat, c             ; calling convention = cdecl
.code
callFunc proc funcPtr:ptr, stack:ptr, numArgs:dword
    push    esi            ; save callee saved regs
    push    edi
    mov     eax, [funcPtr] ; store arguments on stack before new
    mov     ecx, [numArgs] ; stack frame is installed (they
    mov     esi, [stack]   ; won't be accessible anymore)
    push    ebp
    mov     ebp, esp
    sub     esp, ecx       ; reserve stack space for parameters
    sub     esp, ecx       ; (4 * number of arguments)
    sub     esp, ecx
    sub     esp, ecx
    mov     edi, esp       ; copy parameters on stack
    rep     movsd
    call    eax            ; call execute handler
    mov     esp, ebp       ; remove stack frame
    pop     ebp
    pop     edi            ; restore callee saved regs
    pop     esi
    ret
callFunc endp
end

x64 Implementation

The x64 code is slightly more complex. The reason for this is that on x64 there is only one calling convention called FASTCALL. In this calling convention not all arguments just go onto the stack, but some of them have to be passed in registers. Furthermore, there are some differences in how call-by-value for structs works. Structures which's size does not exceed 64 bytes are passed in registers or on the stack. For bigger structures a pointer is passed. If the structure is passed by value, its data is first copied to the home space and the passed pointer points there. As I only needed call-by-reference for user-defined data types bigger than 64 bytes, I didn't have to care about this. Furthermore, the stack pointer RSP has to be 16 byte aligned when calling the execute handler. You might be wondering why 16 and not 8 byte. The reason for this is that it simplifies the alignment of SSE data types for the compiler (sizeof(__m128) = 16 bytes). Describing the calling convention in more detail is beyond the scope of this post, but there are plenty of good articles online.
As I mentioned before already, Visual C++ does not support x64 inline assembly. Therefore, I had to go for an extra .asm file which gets assembled by MASM64. The steps undertaken by the callFunc() implementation are:

  1. Allocate stack space for the parameters that won't be passed in registers.
  2. Align the stack pointer RSP on 16 byte boundary.
  3. Copy the 5th, 6th, … arguments to the stack.
  4. Copy the first four arguments to the registers RAX, RDX, R8 and R9.
  5. Call the execute handler.

In the following the source code of the x64 callFunc() function is depicted. You should note that MASM64 does not support using the arguments declared in the proc statement. This is why I directly access the arguments via the registers. For the ease of much simpler manual register allocation I copy them in the beginning to memory.

.data
numArgs dq 0
stack   dq 0
funcPtr dq 0

.code
callFunc proc funcPtrP:ptr, stackP:ptr, numArgsP:dword
    push    rdi
    push    rsi

    mov     [funcPtr], rcx  ; simplifies register allocation
    mov     [stack], rdx    ; don't use passed variable names!
    mov     [numArgs], r8   ; MASM doesn't support this for x64

; ----- allocate stack space and copy parameters -----

    mov     rcx, [numArgs]  ; number of passed arguments
    sub     rcx, 4          ; 4 of them will be passed in regs
    cmp     rcx, 0          ; some left for the stack?
    jng     noParamsOnStack
    lea     r10, [rcx*8]    ; calculate required stack space
    sub     rsp, r10        ; reserve stack space

    mov     r11, rsp        ; align stack pointer to 16 bytes
    and     r11, 15         ; mod 16
    jz      dontAlign       ; is already 16 bytes aligned?
    sub     rsp, r11        ; perform RSP alignment
dontAlign:

    mov     rdi, rsp        ; copy parameters to stack
    mov     rsi, [stack]
    add     rsi, 4*8        ; first 4 parameters are in regs
    rep     movsq

noParamsOnStack:

; ----- store first 4 arguments in RCX, RDX, R8, R9 -----

    mov     rsi, [stack]
    mov     r10, [numArgs]  ; switch (numArgs)
    cmp     r10, 4
    jge     fourParams
    cmp     r10, 3
    je      threeParams
    cmp     r10, 2
    je      twoParams
    cmp     r10, 1
    je      oneParam
    jmp     noParams

fourParams:                 ; fall through used
    mov     r9, [rsi+24]
    movss   xmm3, dword ptr [rsi+24]
threeParams:
    mov     r8, [rsi+16]
    movss   xmm2, dword ptr [rsi+16]
twoParams:
    mov     rdx, [rsi+8]
    movss   xmm1, dword ptr [rsi+8]
oneParam:
    mov     rcx, [rsi]
    movss   xmm0, dword ptr [rsi]
noParams:

; ----- call execute handler for operator -----

    sub     rsp, 20h        ; reserve 32 byte home space
    call    [funcPtr]       ; call function pointer

; ----- restore non-volatile registers -----

    mov     rsp, rbp
    pop     rsi
    pop     rdi
    ret
callFunc endp
end

Conclusion

The presented solution for implementing execute handlers works very well for me. Nevertheless, it is important to get the handler definitions/declarations right in order to avoid annoying crashes. There are no validity checks at compile-time. Apart from argument order, parameters cannot be passed call-by-value (at least in x86, in x64 call-by-value silently turns into call-by-reference because the parameter data is not copied to home space; which is not much better). You can download the full source code from my github repository.


2 Comments

A pitfall with initialization lists in C++

Welcome to this quick post on constructor initialization lists in C++. If you are not familiar with initialization lists here is a very short introduction. When in C++ a class is instantiated, first, all base classes and second, all class attributes are constructed. Initialization lists are used to control that process. In initialization lists the constructors of the base class and the class attributes can be explicitly called. Otherwise, they are initialized by calling their default constructor. For efficiency reasons it is important to use initialization lists, because all member initialization takes place before the body of the constructor is entered. So much about the basics of initialization lists. Now a little riddle. The following piece of code will crash. Can you find the problem? Give it a try your-self before you continue reading.

struct Foo
{
    Foo(int newId) : id(newId) {}
    int id;
};

struct Bar
{
    Bar(const Foo &newFoo) : foo(newFoo), fooId(foo.id) {}

    const int   fooId;
    const Foo & foo;
};

int main(int argc, char **argv)
{
    Foo foo(303);
    Bar bar(foo);
    return 0;
}

If you don’t know the nasty details of how class attributes are initialized via initialization lists, you most probably couldn’t figure out what causes the crash in those few lines of code. The issue with the code above is that the order of attribute initialization is not determined by the order of appearance in the initialization list, but by the order of declaration within the class. Note, that foo appears in the initialization list before fooId, but fooId is declared before foo. Hence, not foo but fooId is initialized first, accessing the still uninitialized attribute foo.

So, why is that you might ask? This, on the first glance, strange behavior actually makes a lot of sense. When an object is destroyed, its destructor calls the destructors of every class attribute in the reverse order they were initialized. As there can be potentially more than one constructor with different orders of attribute initialization the order of destruction wouldn’t be defined. To solve the ambiguity simply the order of declaration is used.