The Infinite Loop

Tales from a lean programmer.



Scalable spinlocks 1: array-based

Last time we saw that spinlock implementations which only use a single synchronization variable (Test-And-Set Lock, Ticket Lock) don’t scale with growing numbers of threads. Today, I want to talk about two spinlock variants that do scale: the Graunke and Thakkar Lock [1] (1990) and the Anderson Lock [2] (1990). Their key idea is to use one synchronization variable per thread instead of one for all threads, which reduces the number of cache line invalidations when acquiring/releasing the lock. Both spinlock variants store the synchronization variables in an array. This means that there’s an upper bound on the maximum number of threads that can compete for the lock concurrently, which must be known a priori. In upcoming blog posts I’m going to show spinlock variants (the MCS Lock and the CLH Lock) that improve upon array-based spinlocks by removing this hard upper limit.

Anderson’s lock

Anderson’s spinlock is basically a FIFO queue of threads. It’s implemented as an array which is accessed via an atomic ring-buffer-style index counter. Hence, it requires an atomic fetch-and-add operation (std::atomic::fetch_add in C++11) to advance the ring-buffer index. The basic idea is that each thread spins on its own synchronization variable while it is true. The synchronization variable of the next thread in line is eventually set to false when the previous thread leaves the critical section (CS).

Below is an implementation of Anderson’s spinlock in C++11. Note that this implementation is for illustrative purposes only; it’s not optimized for performance. The array of synchronization variables LockedFlags is initially set to [false, true, true, ..., true]. The first element is set to false so that the very first thread in line can enter the critical section: for the first thread there’s no previous thread which could enable it to enter the CS. After a thread has successfully entered the CS its synchronization variable is false and must be reset to true for future threads. Care must be taken that NextFreeIdx doesn’t overflow. If the maximum number of threads is a power of two, plain integer arithmetic works just fine, because the wrap-around on overflow matches up with the modulo computation. Otherwise, NextFreeIdx must be kept in check to avoid overflows.
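
To make the power-of-two argument concrete, here’s a tiny standalone sketch (mine, not part of the lock) that emulates the index computation with a deliberately small 8-bit counter, so the overflow happens quickly:

#include <cstdint>
#include <cstdio>

int main()
{
    // Indices produced around the counter overflow for a power-of-two
    // slot count (4) and a non-power-of-two slot count (3)
    for (int size : {4, 3})
    {
        std::printf("size = %d:", size);
        for (int c = 254; c <= 257; c++)                      // 256 and 257 wrap to 0 and 1
            std::printf(" %d", static_cast<uint8_t>(c)%size); // 8-bit counter wraps at 256
        std::printf("\n");
    }
    // size = 4: 2 3 0 1  -> the FIFO order continues seamlessly across the wrap
    // size = 3: 2 0 0 1  -> slot 0 is handed out twice in a row, the FIFO order breaks
}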

Anderson’s lock requires O(L*T) memory to store L locks with a maximum thread count of T.
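
The listings below use a few helpers that aren’t shown in the original code: CACHELINE_SIZE, ALWAYS_INLINE and CpuRelax. Here is one possible set of definitions to make the examples self-contained; the concrete values and intrinsics are assumptions on my part, not part of the original implementation:

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

#include <emmintrin.h> // _mm_pause(), x86/x64 only

// Assumed cache line size; 64 bytes on most current x86/x64 CPUs
#define CACHELINE_SIZE 64

// Could also expand to __attribute__((always_inline)) or __forceinline
#define ALWAYS_INLINE inline

// Tells the CPU we're in a spin-wait loop; std::this_thread::yield() is a portable fallback
static inline void CpuRelax() { _mm_pause(); }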

class AndersonSpinLock
{
public:
    AndersonSpinLock(size_t maxThreads=std::thread::hardware_concurrency()) :
        LockedFlags(maxThreads)
    {
        for (auto &flag : LockedFlags)
            flag.first = true;

        LockedFlags[0].first = false;
    }

    ALWAYS_INLINE void Enter()
    {
        const size_t index = NextFreeIdx.fetch_add(1)%LockedFlags.size();
        auto &flag = LockedFlags[index].first;

        // Ensure overflow never happens
        if (index == 0)
            NextFreeIdx -= LockedFlags.size();

        while (flag)
            CpuRelax();

        flag = true;
    }

    ALWAYS_INLINE void Leave()
    {
        const size_t idx = NextServingIdx.fetch_add(1);
        LockedFlags[idx%LockedFlags.size()].first = false;
    }

private:
    using PaddedFlag = std::pair<std::atomic_bool,
                                 uint8_t[CACHELINE_SIZE-sizeof(std::atomic_bool)]>;
    static_assert(sizeof(PaddedFlag) == CACHELINE_SIZE, "");

    alignas(CACHELINE_SIZE) std::vector<PaddedFlag> LockedFlags;
    alignas(CACHELINE_SIZE) std::atomic_size_t      NextFreeIdx = {0};
    alignas(CACHELINE_SIZE) std::atomic_size_t      NextServingIdx = {1};
};
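
A minimal usage sketch (my own, not from the original article), assuming the class and helper definitions above are in scope: a handful of threads increment a shared counter, with each increment protected by the lock.

#include <cassert>
#include <thread>
#include <vector>

int main()
{
    const size_t numThreads = 4;
    AndersonSpinLock lock(numThreads); // at most numThreads threads may compete for the lock
    size_t counter = 0;                // shared data protected by the lock

    std::vector<std::thread> threads;
    for (size_t i = 0; i < numThreads; i++)
        threads.emplace_back([&] {
            for (int j = 0; j < 100000; j++)
            {
                lock.Enter();
                counter++;             // critical section
                lock.Leave();
            }
        });

    for (auto &t : threads)
        t.join();

    assert(counter == numThreads*100000);
}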

Graunke and Thakkar’s lock

In contrast to Anderson’s ring-buffer-based lock, Graunke and Thakkar use a linked list to establish a FIFO queue of the threads attempting to acquire the lock. Hence, an atomic fetch-and-store operation (std::atomic::exchange in C++11) is needed to atomically update the tail of the queue. As in Anderson’s lock, every thread spins on a different synchronization variable to reduce cache traffic due to cache line invalidations. A thread trying to enter the CS spins on the synchronization variable of the thread ahead of it in line, which it obtains from the tail pointer of the wait queue. Every thread is responsible for flipping its own synchronization variable when it leaves the CS, which in turn enables the next thread in line to enter.

Below is an implementation of Graunke and Thakkar’s spinlock in C++11. Note that this implementation is for illustrative purposes only. It’s neither optimized for performance, nor does it work if threads are recreated, because of how thread indices are assigned. In their original implementation – which is reflected below – the synchronization flag is not reset after successfully entering the CS. That’s why the old value of the flag must additionally be stored along with the flag pointer in the queue’s tail. The spin loop compares the current flag value with the old flag value to see if the previous thread in line has already unlocked the CS by flipping its synchronization variable. The approach used in Anderson’s lock would simplify the code significantly: the old flag value wouldn’t have to be stored in the tail anymore, which means we could get rid of all the bit manipulation needed to pack the old flag value into the tail and extract it again.

Graunke and Thakkar’s lock requires O(L*T) memory to store L locks with a maximum thread count of T.

class GraunkeAndThakkarSpinLock
{
public:
    GraunkeAndThakkarSpinLock(size_t maxThreads=std::thread::hardware_concurrency()) :
        LockedFlags(maxThreads)
    {
        for (auto &flag : LockedFlags)
            flag.first = 1;

        assert(Tail.is_lock_free());
        Tail = reinterpret_cast<uintptr_t>(&LockedFlags[0].first);
        // Make sure there's space to store the old flag value in the LSB
        assert((Tail&1) == 0);
    }

    ALWAYS_INLINE void Enter()
    {
        // Create new tail by chaining my synchronization variable into the list
        const auto &newFlag = LockedFlags[GetThreadIndex()].first;
        const auto newTail = reinterpret_cast<uintptr_t>(&newFlag)|static_cast<uintptr_t>(newFlag);
        const auto ahead = Tail.exchange(newTail);

        // Extract flag and old value of previous thread in line,
        // so that we can wait for its completion
        const auto *aheadFlag = reinterpret_cast<const std::atomic_uint16_t *>(ahead&(~static_cast<uintptr_t>(1)));
        const auto aheadValue = static_cast<uint16_t>(ahead&1);

        // Wait for previous thread in line to flip my synchronization variable
        while (aheadFlag->load() == aheadValue)
            CpuRelax();
    }

    ALWAYS_INLINE void Leave()
    {
        // Flipping synchronization variable enables next thread in line to enter CS
        auto &flag = LockedFlags[GetThreadIndex()].first;
        flag = !flag;
    }

private:
    ALWAYS_INLINE size_t GetThreadIndex() const
    {
        static std::atomic_size_t threadCounter = {0};
        thread_local size_t threadIdx = threadCounter++;
        assert(threadIdx < LockedFlags.size());
        return threadIdx;
    }

private:
    using PaddedFlag = std::pair<std::atomic_uint16_t,
                                 uint8_t[CACHELINE_SIZE-sizeof(std::atomic_uint16_t)]>;
    static_assert(sizeof(PaddedFlag) == CACHELINE_SIZE, "");

    // In the LSB the old value of the flag is stored
    alignas(CACHELINE_SIZE) std::atomic<uintptr_t>  Tail;
    alignas(CACHELINE_SIZE) std::vector<PaddedFlag> LockedFlags;

    static_assert(sizeof(decltype(LockedFlags)::value_type) > 1,
                  "Flag size > 1 required: thanks to alignment, "\
                  "old flag value can be stored in LSB");
};
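
Both locks expose the same Enter()/Leave() interface, so client code can wrap them in a small RAII guard analogous to std::lock_guard. The name ScopedSpinLock below is my own; the guard is not part of the original article:

// Minimal RAII guard usable with either lock type; it calls Enter() on
// construction and Leave() on destruction, even if the CS throws.
template <typename Lock>
class ScopedSpinLock
{
public:
    explicit ScopedSpinLock(Lock &lock) : Lock_(lock) { Lock_.Enter(); }
    ~ScopedSpinLock() { Lock_.Leave(); }

    ScopedSpinLock(const ScopedSpinLock &) = delete;
    ScopedSpinLock &operator=(const ScopedSpinLock &) = delete;

private:
    Lock &Lock_;
};

// Usage:
// GraunkeAndThakkarSpinLock lock;
// {
//     ScopedSpinLock<GraunkeAndThakkarSpinLock> guard(lock);
//     // ... critical section ...
// } // Leave() is called automatically here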

Comparison

The major difference between Anderson’s and Graunke and Thakkar’s spinlock implementations is how synchronization variables are assigned to threads. In Graunke and Thakkar’s lock there’s a 1:1 mapping from synchronization variables to threads: the same synchronization variable is always assigned to the same thread. As a consequence, the lock can be implemented in such a way that each synchronization variable is stored in a memory location local to the spinning thread. This allows the lock to perform well even on cache-less NUMA machines.

In contrast, Anderson’s lock has no control over the memory location in which a thread’s synchronization variable is stored. The synchronization variable assigned to a thread depends only on the order in which threads attempt to acquire the lock. Therefore, Anderson’s lock performs much worse on cache-less machines, because it’s entirely possible that a thread spins on a synchronization variable in a remote memory location. On modern mainstream CPUs this difference has only a marginal effect, because the CPUs have caches into which the synchronization variables can migrate during spinning.


  1. G. Graunke, S. Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. IEEE Computer, Vol. 23, No. 6 (1990), pp. 60-69
  2. T. E. Anderson. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 1 (1990), pp. 6-16



Important properties of spinlocks

Typically, when a thread tries to acquire an already locked mutex, it goes to sleep and gives up its time slice, allowing another thread to run immediately instead. If the duration for which the mutex is held is extremely short, the time needed to put the waiting thread to sleep and wake it up again to retry can easily exceed it. In such cases spinlocks can be used to increase performance by continuously trying to acquire the lock without sleeping. Side note: obviously, it makes no sense to use spinlocks on systems with a single CPU core. Polling for the spinlock to become available would only prevent the single available CPU core from running the thread that could release the lock.

Before diving into the multitude of spinlock implementations, it’s crucial to understand their key characteristics. Generally, spinlocks can be grouped into fair and unfair as well as scalable and unscalable implementations. In addition, we look at the spinlock’s memory footprint. These three properties are discussed in more detail in the rest of this post.

For all the nitty-gritty architectural details of concrete C++11 spinlock implementations read the upcoming blog posts about Test-And-Set (TAS) Locks, Ticket Locks and MCS Locks. I’ll upload the source code to this Github repository.

Scalability

Existing spinlock variants have very different performance characteristics under contention. A lock is said to be contended when multiple threads concurrently try to acquire it while it’s taken. The performance of unscalable spinlock implementations degrades as the number of threads trying to acquire the lock concurrently goes up. The performance of simple Ticket or TAS Locks, for example, drops exponentially with increasing numbers of threads. In contrast, scalable spinlock implementations – like e.g. the MCS Lock – exhibit no degradation in performance even for large numbers of threads. The scalability of an optimal spinlock vs. the Ticket Lock is depicted in the graph below.

[Figure: scalability of an optimal spinlock vs. the Ticket Lock]

Crucial for the scalability of spinlocks is the number of cache line invalidations that take place when the lock is acquired and released. A cache line invalidation occurs when, for example, a thread leaving the critical section (CS) modifies a synchronization variable on which other threads are polling at the same time. Suddenly, the copies of that cache line stored in the caches of other CPU cores are no longer valid. Accordingly, these cache lines must be invalidated and re-obtained from the cache of the CPU core that modified them. Thus, the number of cache line invalidations occurring when the spinlock is acquired/released is often used as a metric to compare spinlock implementations with each other. Usually, scalable spinlocks require O(1) cache line invalidations to acquire/release the lock, while unscalable spinlocks require O(#threads) many.
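
For illustration, here’s a minimal sketch of an unscalable Test-And-Set lock in which all waiting threads spin on the very same variable, so every acquisition and release invalidates the cache line in every other spinning core’s cache (a simplified example of mine, not the exact implementation from the TAS Lock post):

#include <atomic>

class TasSpinLock
{
public:
    void Enter()
    {
        // All threads read-modify-write the same flag, so each exchange
        // invalidates the cache line copies held by all other cores
        while (Locked.exchange(true, std::memory_order_acquire))
            ; // spin
    }

    void Leave()
    {
        Locked.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> Locked = {false};
};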

Advanced scalable spinlock implementations may not only take the total number of CPU cores in a system into account, but also the details of the underlying cache coherency mechanism. In systems with multiple CPUs and non-uniform memory access times (NUMA), inter-core communication costs may vary greatly. For example, in a NUMA system with multiple multi-core CPUs it’s entirely possible to pass a lock to another core within the same CPU in less time than to a core of another CPU.

Fairness

Fair spinlocks guarantee that a thread never reacquires the lock while other threads are already waiting to acquire it. Fair spinlocks maintain a first-in-first-out (FIFO) order among the threads attempting to acquire the lock.

Often, unfair spinlocks trade latency for throughput, while fair spinlocks trade throughput for latency. Consider the following example. Let’s say we’re running t threads each performing n/t loop iterations in which they try to enter a CS. When using an unfair spinlock, in the worst case (at least in theory) every thread could enter the CS n/t times in a row instead of alternating with lock acquisition between threads. In certain scenarios this can boost throughput considerably, because the number of expensive cache line invalidations is reduced. However, the time until any other thread may enter the CS rises. In theory this can lead to starvation. In practice the cache coherency protocol should ensure that no starvation occurs.
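
One way to make this trade-off visible is to count how often each thread acquires the lock within a fixed time window; with an unfair lock (such as the TAS sketch above) the counts can be heavily skewed, while a fair lock yields roughly equal counts. A hypothetical measurement sketch:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    TasSpinLock lock;                       // unfair lock from the sketch above
    const size_t numThreads = 4;
    std::vector<size_t> acquisitions(numThreads, 0);
    std::atomic<bool> stop = {false};

    std::vector<std::thread> threads;
    for (size_t i = 0; i < numThreads; i++)
        threads.emplace_back([&, i] {
            while (!stop)
            {
                lock.Enter();
                acquisitions[i]++;          // critical section
                lock.Leave();
            }
        });

    std::this_thread::sleep_for(std::chrono::seconds(1));
    stop = true;

    for (auto &t : threads)
        t.join();

    for (size_t i = 0; i < numThreads; i++)
        std::printf("thread %zu: %zu acquisitions\n", i, acquisitions[i]);
}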

Typically, the performance of fair spinlocks drops dramatically when the number of threads competing for the same lock is greater than the number of CPU cores in the system. The reason is that the thread which is next in line to acquire the lock might be asleep. In contrast to unfair spinlocks, no other thread can acquire the lock while the next thread in line is sleeping, because of the strict acquisition order guaranteed by the fair lock. This property is sometimes referred to as preemption intolerance.

Memory footprint

Generally, the memory required to store l spinlocks can depend on the number of threads t which use the locks. While some spinlock implementations require only constant memory regardless of the number of threads, the memory footprint of others scales linearly with the thread count. The amount of memory needed to store a spinlock mostly doesn’t matter, because for the majority of locks it’s in the order of a few dozen bytes per thread; mostly it’s a multiple of the CPU’s cache line size to avoid false sharing. However, there are cases where the memory footprint plays an important role: extremely memory-constrained environments, or applications that require vast numbers of spinlocks, e.g. for fine-grained locking of lots of small pieces of data.
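
As a rough, hypothetical back-of-the-envelope calculation (assuming 64-byte cache lines and one cache-line-padded synchronization variable per thread, as in the array-based locks shown earlier on this page):

#include <cstddef>
#include <cstdio>

int main()
{
    const size_t cacheLineSize = 64;   // assumed cache line size
    const size_t numThreads    = 64;   // t: threads using the locks
    const size_t numLocks      = 1024; // l: number of locks

    // One cache-line-padded flag per thread => roughly t cache lines per lock,
    // i.e. O(l*t) bytes in total
    const size_t perLock = numThreads*cacheLineSize;
    std::printf("per lock:  %zu KiB\n", perLock/1024);                 // 4 KiB
    std::printf("all locks: %zu MiB\n", numLocks*perLock/(1024*1024)); // 4 MiB
}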