December 3, 2016
by David Leave a comment

Scalable spinlocks 1: array-based

Last time we saw that spinlock implementations which only use a single synchronization variable (Test-And-Set Lock, Ticket Lock) don’t scale with growing numbers of threads. Today, I want to talk about two spinlock variants that scale. Namely the Graunke and Thakkar Lock¹ (1989) and the Anderson Lock² (1990). Their underlying key idea is to use one synchronization variable per thread instead of one for all threads, to reduce the amount of cache line invalidations when acquiring/releasing the lock. Both spinlock variants store the synchronization variables in an array. This means that there’s an upper bound on the maximum number of thread’s that can compete for the lock concurrently, which must be known a priori. In upcoming blog posts I’m going to show spinlock variants (the MCS Lock and the CLH Lock) that improve upon array-based spinlocks by removing this hard upper limit.

Anderson’s lock

Anderson’s spinlock is basically a FIFO queue of threads. It’s implemented as an array, which is accessed via an atomic ring-buffer-style index counter. Hence, it requires an atomic fetch-and-add (std::atomic::fetch_add in C++11) operation to advance the ring-buffer index. The basic idea is that each thread spins on its own synchronization variable while it’s = true. The synchronization variable of the next thread in line is eventually set to false when the previous thread leaves the critical section (CS).

Below is an implementation of Anderson’s spinlock in C++11. Note, that this implementation is for illustrative purpose only. It’s not optimized for performance. The array of synchronization variables LockedFlags is initially set to [false, true, true, ..., true]. The first element is set to false, so that the very first thread in line can enter the critical section: for the first thread there’s no previous thread which can enable the first thread to enter the CS. After successfully entering the CS the synchronization variable is false and needs to be reset to true for future threads. It must be taken care for that NextFreeIdx doesn’t overflow. If the maximum number of threads is a power-of-2 integer arithmetic will just work fine, because the overflow matches up with the modulo computation. Otherwise, NextFreeIdx must be kept in check to avoid overflows.

Anderson’s lock requires $O(L*T)$ memory to store $L$ locks with a maximum thread count of $T$ .

class AndersonSpinLock
{
public:
    AndersonSpinLock(size_t maxThreads=std::thread::hardware_concurrency()) :
        LockedFlags(maxThreads)
    {
        for (auto &flag : LockedFlags)
            flag.first = true;

        LockedFlags[0].first = false;
    }

    ALWAYS_INLINE void Enter()
    {
        const size_t index = NextFreeIdx.fetch_add(1)%LockedFlags.size();
        auto &flag = LockedFlags[index].first;

        // Ensure overflow never happens
        if (index == 0)
            NextFreeIdx -= LockedFlags.size();

        while (flag)
            CpuRelax();

        flag = true;
    }

    ALWAYS_INLINE void Leave()
    {
        const size_t idx = NextServingIdx.fetch_add(1);
        LockedFlags[idx%LockedFlags.size()].first = false;
    }

private:
    using PaddedFlag = std::pair<std::atomic_bool,
                                 uint8_t[CACHELINE_SIZE-sizeof(std::atomic_bool)]>;
    static_assert(sizeof(PaddedFlag) == CACHELINE_SIZE, "");

    alignas(CACHELINE_SIZE) std::vector<PaddedFlag> LockedFlags;
    alignas(CACHELINE_SIZE) std::atomic_size_t      NextFreeIdx = {0};
    alignas(CACHELINE_SIZE) std::atomic_size_t      NextServingIdx = {1};
};

Graunke and Thakkar’s lock

In contrast to Anderson’s ring-buffer-based lock, Graunke and Thakkar use a linked list to establish a FIFO queue of threads attempting to acquire the lock. Hence, an atomic fetch-and-store operation (std::atomic::exchange in C++11) is needed to atomically update the tail of the queue. Like Anderson’s lock, every thread spins on a different synchronization variable to reduce cache traffic due to cache line invalidations. A thread trying to enter the CS spins on the synchronization variable of the thread in line before it-self by looking at the tail pointer of the wait queue. Every thread is responsible for flipping its synchronization variable when it leaves the CS, which in turn enables the next thread in line to enter.

Below is an implementation of Graunke and Thakkar’s spinlock in C++11. Note, that this implementation is for illustrative purpose only. It’s neither optimized for performance, nor does it work if threads are recreated because of how thread indices are assigned. In their original implementation – which is reflected below – the synchronization flag is not reset after successfully entering the CS. That’s why additionally the old value of the flag must be stored along with the flag pointer in the queue. The spin loop compares the current flag value with the old flag value to see if the previous thread in line has already unlocked the CS by flipping its synchronization variable. The approach used in Anderson’s lock would simplify the code significantly: the old flag value wouldn’t have to be stored anymore in the tail, which means we could get rid of all the binary arithmetic to crunch in and extract the old flag value from the tail.

Graunke and Thakkar’s lock requires $O(L*T)$ memory to store $L$ locks with a maximum thread count of $T$ .

class GraunkeAndThakkarSpinLock
{
public:
    GraunkeAndThakkarSpinLock(size_t maxThreads=std::thread::hardware_concurrency()) :
        LockedFlags(maxThreads)
    {
        for (auto &flag : LockedFlags)
            flag.first = 1;

        assert(Tail.is_lock_free());
        Tail = reinterpret_cast(&LockedFlags[0].first);
        // Make sure there's space to store the old flag value in the LSB
        assert((Tail&1) == 0);
    }

    ALWAYS_INLINE void Enter()
    {
        // Create new tail by chaining my synchronization variable into the list
        const auto &newFlag = LockedFlags[GetThreadIndex()].first;
        const auto newTail = reinterpret_cast(&newFlag)|static_cast(newFlag);
        const auto ahead = Tail.exchange(newTail);

        // Extract flag and old value of previous thread in line,
        // so that we can wait for its completion
        const auto *aheadFlag = reinterpret_cast(ahead&(~static_cast(1)));
        const auto aheadValue = static_cast(ahead&1);

        // Wait for previous thread in line to flip my synchronization variable
        while (aheadFlag->load() == aheadValue)
            CpuRelax();
    }

    ALWAYS_INLINE void Leave()
    {
        // Flipping synchronization variable enables next thread in line to enter CS
        auto &flag = LockedFlags[GetThreadIndex()].first;
        flag = !flag;
    }

private:
    ALWAYS_INLINE size_t GetThreadIndex() const
    {
        static std::atomic_size_t threadCounter = {0};
        thread_local size_t threadIdx = threadCounter++;
        assert(threadIdx < LockedFlags.size());
        return threadIdx;
    }

private:
    using PaddedFlag = std::pair<std::atomic_uint16_t,
                                 uint8_t[CACHELINE_SIZE-sizeof(std::atomic_uint16_t)]>;
    static_assert(sizeof(PaddedFlag) == CACHELINE_SIZE, "");

    // In the LSB the old value of the flag is stored
    alignas(CACHELINE_SIZE) std::atomic<uintptr_t>  Tail;
    alignas(CACHELINE_SIZE) std::vector<PaddedFlag> LockedFlags;

    static_assert(sizeof(decltype(LockedFlags)::value_type) > 1,
                  "Flag size > 1 required: thanks to alignment, "\
                  "old flag value can be stored in LSB");
};

Comparison

The major difference between Anderson’s and Graunke and Thakkar’s spinlock implementations is how synchronization variables are assigned to threads. In Graunke and Thakkar’s lock there’s a 1:1 mapping from synchronization variables to threads. Always the same synchronization variable is assigned to the same thread. As a consequence the lock can be implemented in such a way that synchronization variables are stored in memory locations local to the spinning thread. This allows the lock to perform well on even on cache-less NUMA machines.

In contrast, Anderson’s lock has no control over the memory location in which a thread’s synchronization variable is stored. The synchronization variable assigned to a thread only depends on the order in which threads attempt to acquire the lock. Therefore, the Anderson lock performs much worse on cache-less machines, because it’s entirely possible that a thread spins on a synchronization variable in a remote memory location. On all modern mainstream CPUs this difference has marginal effect, because the CPUs have caches into which the synchronization variables can migrate during spinning.

G. Graunke, S. Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. IEEE Computer, Vol. 23, No. 6, (1990), pp. 60-69 ↩
T. E. Anderson. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 1. (1990), pp. 6-16 ↩

April 9, 2016
by David 6 Comments

The ticket spinlock

Last time we saw how a spinlock can be implemented using an atomic test-and-set (TAS) operation on a single shared synchronization variable. Today, I want to talk about another spinlock variant called Ticket Lock. The Ticket Lock is very similar to TAS-based spinlocks in terms of scalability, but supports first-in-first-out (FIFO) fairness.

As the name suggests, the Ticket Lock employs the same concept as e.g. hair dressers do to serve their customers in the order of arrival. On arrival customers draw a ticket from a ticket dispenser which hands out tickets with increasing numbers. A screen displays the ticket number served next. The customer holding the ticket with the number currently displayed on the screen is served next.

Implementation

The C++11 implementation below uses two std::atomic_size_t variables as counters for the ticket number currently served (ServingTicketNo) and the ticket number handed out to the next arriving thread (NextTicketNo). The implementation is optimized for x86 CPUs. The PAUSE instruction (called from CpuRelax()) is used when spin-waiting and both counters are cache line padded to prevent false sharing. Read the previous article on TAS-based locks for more information.

class TicketSpinLock
{
public:
    ALWAYS_INLINE void Enter()
    {
        const auto myTicketNo = NextTicketNo.fetch_add(1, std::memory_order_relaxed);

        while (ServingTicketNo.load(std::memory_order_acquire) != myTicketNo)
            CpuRelax();
    }

    ALWAYS_INLINE void Leave()
    {
        // We can get around a more expensive read-modify-write operation
        // (std::atomic_size_t::fetch_add()), because no one can modify
        // ServingTicketNo while we're in the critical section.
        const auto newNo = ServingTicketNo.load(std::memory_order_relaxed)+1;
        ServingTicketNo.store(newNo, std::memory_order_release);
    }

private:
    alignas(CACHELINE_SIZE) std::atomic_size_t ServingTicketNo = {0};
    alignas(CACHELINE_SIZE) std::atomic_size_t NextTicketNo = {0};
};

static_assert(sizeof(TicketSpinLock) == 2*CACHELINE_SIZE, "");

Overflow is pretty much impossible when 64-bit counters are used. But what happens when the counters are of smaller bit width and eventually overflow? It turns out that overflow is safe, as long as the number of threads using the lock is less than or equal to the value range representable by the counter’s underlying integer type (e.g. 256 for 8-bit counters). Let’s consider a 3-bit integer, which can represent values from 0 to 7. Overflow is safe as long as the there are never more than 8 threads competing for the lock, because the condition ServingTicketNo != myTicketNo is guaranteed to be always only false for the next thread in line. If there were 9 or more threads, the NextTicketNo counter could reach the same value ServingTicketNo has and accordingly two threads could enter the critical section (CS) at the same time. The figure below illustrates the case where 8 threads are competing for the lock. Just one more competing thread could cause multiple threads entering the CS at the same time.

   ServingTicketNo   NextTicketNo
          |             |
          V             V
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 ...
          \_____________/
          Max. 8 threads

No second thread may grab another NextTicketNo=5!

The above means that we can crunch down our implementation in terms of memory footprint by removing the padding and using 8- or 16-bit counters, depending on the expected maximum number of threads. Removing the padding comes at the cost of potential false sharing issues.

Proportional back-off

Under high contention the Test-And-Test-And-Set Lock backs-off exponentially. The reduced amount of memory bus traffic improves performance. What about using the same backing-off strategy with Ticket Locks? It turns out that this is a very bad idea, because the FIFO order causes the back-off delays to accumulate. For example, the thread with myTicketNo == ServingTicketNo+3 must always wait as least as long as the thread with myTicketNo == ServingTicketNo+2 and the thread with myTicketNo == ServingTicketNo+2 must always wait as least as long as the thread with myTicketNo == ServingTicketNo+1.

Is there something else we can do? It turns out we can thanks to the FIFO order in which threads are granted access to the CS. Every thread can calculate how many other threads are going to be granted access to the CS before it-self as numBeforeMe = myTicketNo-ServingTicketNo. With this knowledge and under the assumption that every thread holds the lock approximately for the same duration, we can back-off for the number of threads in line before us times some constant BACKOFF_BASE. BACKOFF_BASE is the expected average time that every thread spends inside the CS. This technique is called proportional back-off.

ALWAYS_INLINE void PropBoTicketSpinLock::Enter()
{
    const auto myTicketNo = NextTicketNo.fetch_add(1, std::memory_order_relaxed);

    while (true)
    {
        const auto servingTicketNo = ServingTicketNo.load(std::memory_order_acquire);
        if (servingTicketNo == myTicketNo)
            break;

        const size_t numBeforeMe = myTicketNo-servingTicketNo;
        const size_t waitIters = BACKOFF_BASE*numBeforeMe;

        for (size_t i=0; i<waitIters; i++)
            CpuRelax();
    }
}

Wrap up

How does the Ticket Lock compare against TAS-based locks and in which situations which of the two locks variants is preferably used? The Ticket Lock has the following advantages over TAS-based locks:

The Ticket Lock is fair, because the threads are granted access to the CS in FIFO order. This prevents the same thread from reacquiring the lock multiple times in a row, which – at least in theory – could starve other threads.
In contrast to all TAS-based locks the Ticket Lock avoids the Thundering Herd problem, because waiting for the lock and acquiring it doesn’t require any read-modify-write or store operation.
In contrast to the Test-And-Set Lock only a single atomic load operation is repeatedly executed while waiting for the lock to become available. Though, this problem is solved by the TTAS Lock.

The biggest disadvantage of the Ticket Lock is that the fairness property backfires once there are more threads competing for the lock than there are CPU cores in the system. The problem is that in that case the thread which can enter the CS next might be sleeping. This means that all other threads must wait, because of the strict fairness guarantee. This property is sometimes referred to as preemption intolerance.

Furthermore, the Ticket Lock doesn’t solve the scalability issue of TAS-based spinlocks. Both spinlock variants don’t scale well, because the number of cache line invalidations triggered when acquiring/releasing the lock is O(#threads). There are scalable lock implementations where all threads spin on different memory locations. These spinlock variants only trigger O(1) many cache line invalidations. Next time we’ll look into scalable spinlock variants.

March 23, 2016
by David 2 Comments

Test-and-set spinlocks

In my previous blog post I wrote about the most important properties of spinlocks and why they matter. This time I’ll present the first out of three concrete spinlock implementations in C++11 for x86 processors. I’m assuming that you’ve read the first post. You can find all source code in this Github repository.

Introduction

The Test-And-Set (TAS) Lock is the simplest possible spinlock implementation. It uses a single shared memory location for synchronization, which indicates if the lock is taken or not. The memory location is updated using a test-and-set (TAS) operation. TAS atomically writes to the memory location and returns its old value in a single indivisible step. Actually, the name test-and-set is a little misleading, because the caller is responsible for testing if the operation has succeeded or not. The TAS Lock is not fair as it doesn’t guarantee FIFO ordering amongst the threads competing for the lock.

In C++11 std::atomic_bool::exchange() can be used to perform a TAS operation on an std::atomic_bool synchronization variable. On x86 CPUs std::atomic::exchange() is turned into the LOCK XCHG instruction (side note: the LOCK prefix is implicit and not strictly required, because there’s no non-atomic version of XCHG). The following code implements the described TAS Lock.

class TasSpinLock
{
public:
    ALWAYS_INLINE void Enter()
    {
        // atomic_bool::exchange() returns previous value of Locked
        while (Locked.exchange(true, std::memory_order_acquire) == true);
    }

    ALWAYS_INLINE void Leave()
    {
        Locked.store(false, std::memory_order_release);
    }

private:
    alignas(CACHELINE_SIZE) std::atomic_bool Locked = {false};
};

static_assert(sizeof(TasSpinLock) == CACHELINE_SIZE, "");

The TasSpinLock::Locked flag is cache line size padded using alignas to prevent false sharing. You can remove the padding, e.g. in memory-limited environments, when false sharing is not an issue.

The test-and-test-and-set optimization

While the TAS Lock is extremely easy to implement, its scalability is very bad. Already with just a few threads competing for the lock, the amount of required cache line invalidations to acquire/release the lock quickly degrades performance. The problem is two-fold.

std::atomic_bool::exchange() always invalidates the cache line Locked resides in, regardless of whether it succeeded in updating Locked or not.
When the spinlock is released, all waiting threads simultaneously try to acquire it. This is sometimes referred to as Thundering Herd problem. Acquiring the lock means first invalidating the cache line copy of all threads waiting for the lock and then reading the valid cache line copy from the core which released the lock. With t threads this results in O(t) memory bus transactions.

The latter issue is inherent to TAS Locks, because just a single synchronization variable is used and shared between all threads. However, the first issue can be fixed by testing if the lock is free before calling exchange(). First testing if the lock is free only loads the synchronization variable and therefore doesn’t cause a cache line invalidation. This spinlock variant is called Test-And-Test-And-Set (TTAS) Lock. The improved Enter() method looks like this:

ALWAYS_INLINE void TTasSpinLock::Enter()
{
    do
        WaitUntilLockIsFree();
    while (Locked.exchange(true, std::memory_order_acquire) == true);
}

ALWAYS_INLINE void TTasSpinLock::WaitUntilLockIsFree() const
{
    while (Locked.load(std::memory_order_relaxed) == true);
}

Exponential back-off

The TTAS Lock successfully reduces the amount of cache line invalidations occurring while the lock is trying to be acquired. However, as soon as the lock is released, again all threads try to update the Locked flag. If a thread sees that the lock is free, but fails acquiring it subsequently because some other thread was faster, the lock is most probably under high contention. In this situation it can help to wait for a short time to let other threads finish before trying to enter the critical section (CS) again.

Waiting a short time without trying to acquire the lock reduces the number of threads simultaneously competing for the lock. The larger the number of unsuccessful retries, the higher the lock contention and the longer the thread should back-off. Experiments have shown that a good strategy is to back-off exponentially; similar to the congestion avoidance mechanism used in the Ethernet protocol. To avoid threads running in lock step the initial wait time is randomized.

ALWAYS_INLINE static void BackoffExp(size_t &curMaxIters)
{
    assert(curMaxIters > 0);

    thread_local std::uniform_int_distribution<size_t> dist;
    thread_local std::minstd_rand gen(std::random_device{}());
    const size_t spinIters = dist(gen, decltype(dist)::param_type{0, curMaxIters});

    curMaxIters = std::min(2*curMaxIters, MAX_BACKOFF_ITERS);
    for (volatile size_t i=0; i<spinIters; i++); // Avoid being optimized out!
}

ALWAYS_INLINE void ExpBoTTasSpinLock::Enter()
{
    size_t curMaxIters = MIN_BACKOFF_ITERS;

    while (true)
    {
        // Not strictly required but doesn't hurt
        WaitUntilLockIsFree();

        if (Locked.exchange(true, std::memory_order_acquire) == true)
            BackoffExp(curMaxIters); // Couldn't acquire lock => back-off
        else
            break; // Acquired lock => done
    }
}

The downside of waiting is that it renders the lock even unfairer than it is already. Threads that have been attempting to acquire the lock longest are also backing-off longest. Consequently, newly arriving threads have a higher chance to acquire the lock than older threads. On top of that threads might back-off for too long, causing the CS to be underutilized. Furthermore, there are unfortunately no single best values for MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS. Optimally, they’re determined experimentally depending on the workload (contention, size of CS).

Pausing and sleeping

There are two more improvements we can do. First, according to the Intel Architectures Optimization Reference Manual, adding a PAUSE instruction to the body of all spin loops improves performance on Pentium 4 CPUs and later. The PAUSE instruction provides a hint to the CPU that the code being executed is a spin-wait loop. PAUSE is backwards compatible and turns into a NOP instruction on older CPUs. There are three reasons why using PAUSE improves performance:

It avoids memory order violations which result in expensive pipeline flushes, because the CPU doesn’t go ahead to fill its pipeline with speculatively executed load and compare instructions.
It gives the other hardware thread time to execute on simultaneous multi-threading (SMT) processors (e.g. Intel’s Hyper Threading).
It adds a slight delay to the spin loop, which synchronizes it with the system’s memory bus frequency. This frequency is typically much lower than the frequency of the CPU, which in turn reduces power consumption considerably.

The second improvement is to stop spinning, put the thread to sleep and reschedule if the the lock doesn’t become available after some time. The important bit here is to not yield (sched_yield() on Linux, SwitchToThread() on Windows, or more generally std::this_thread::yield() in C++11), but to explicitly put the thread to sleep for a short period of time. This ensures that the thread is really paused for some time and isn’t immediately run again by the scheduler if there’s no other thread available in the scheduler’s run queue. This is especially important on SMT processors where every two logical cores share most of the execution units. Only when the thread is really sleeping the other logical core can fully make use of the shared execution units. The following TTAS Lock implementation uses the PAUSE instruction and sleeps for 500us if the lock isn’t released within MAX_WAIT_ITERS iterations. There’s no single best value for MAX_WAIT_ITERS. Ideally, it’s chosen experimentally like the MIN_BACKOFF_ITERS and MAX_BACKOFF_ITERS constants from before.

ALWAYS_INLINE static void CpuRelax()
{
#if (COMPILER == MVCC)
    _mm_pause();
#elif (COMPILER == GCC || COMPILER == LLVM)
    asm("pause");
#endif
}

ALWAYS_INLINE static void YieldSleep()
{
    // Don't yield but sleep to ensure that the thread is not
    // immediately run again in case scheduler's run queue is empty
    using namespace std::chrono;
    std::this_thread::sleep_for(500us);
}

ALWAYS_INLINE static void BackoffExp(size_t &curMaxIters)
{
    ... Unchanged code not repeated, look above ...

    for (size_t i=0; i<spinIters; i++)
        CpuRelax();
}

ALWAYS_INLINE void ExpBoRelaxTTasSpinLock::WaitUntilLockIsFree() const
{
    size_t numIters = 0;

    while (Locked.load(std::memory_order_relaxed))
    {
        if (numIters < MAX_WAIT_ITERS)
        {
            CpuRelax();
            numIters++;
        }
        else
            YieldSleep();
    }
}

Performance comparison

The benchmark used to compare the different TAS Lock variants is extremely simple and should be taken with a grain of salt. It launches t threads, each competing for the lock and incrementing a global atomic counter inside the CS. This causes high lock contention, because all threads are trying to enter the lock at the same time. Though, it’s important to keep in mind that due to the lock’s lack of fairness, it’s entirely possible that the same thread acquires the lock multiple times in a row. I ran the benchmark on a dual socket NUMA system with two Intel Xeon E5-2630 v2 CPUs with 2.60 GHz. The CPUs support Hyper Threading which gives a total of 12 physical and 24 logical cores. The benchmark results are shown below.

TAS Lock benchmark results

The most interesting finding is that the amount of lock reacquisitions is a lot higher than I anticipated. This can be derived from the fact that all TTAS-based locks scale almost linearly, where typically the observed drop in performance for TAS-based locks is exponential. Large numbers of lock reacquisitions by the same thread reduces the amount of cache line invalidations and increases throughput. At the same time the latency of all other threads goes up, because they’re granted the lock later.

The TasSpinLock scales so much worse, even though lock reacquisiton happens, because while spinning the TAS operation causes cache line invalidations also if it doesn’t succeed in acquiring the lock. By looking at the difference between the TasSpinLock and the other TTAS-based versions, it’s highly visible how big the influence of cache line invalidations is on the lock’s performance.

Wrapup

The exponential back-off TTAS Lock tends to perform fairly well as long as the number of competing threads is low or lock reacquisitions boost throughput while increasing latency. It’s especially useful when it’s not known a priori that the number of competing threads is bounded by the number of cores in the system (in that case better scaling spinlock variants perform really badly as we’ll see in the next post). Beyond that, without padding the TTAS Lock just needs a single byte. The compact representation is useful in extremely memory-constrained environments or applications that require vast amounts of spinlocks; e.g. for fine-grained locking of lots of small pieces of data.

March 12, 2016
by David 1 Comment

Important properties of spinlocks

Typically, when a thread tries to acquire an already locked mutex it will go to sleep and give up its time slice, allowing another thread to run immediately instead. If the duration for which the mutex is held is extremely short, the time for putting the waiting thread to sleep and waking it up again to retry can easily be longer. In such cases spinlocks can be used to increase performance, by continuously trying to acquire the lock without sleeping. Side note: Obviously, it makes no sense to use spinlocks in systems with a single CPU core. Polling for the spinlock to become available would only prevent the single available CPU core from running the other thread to release the lock.

To deep dive into the multitude of spinlock implementations, it’s crucial to understand what their key characteristics are. Generally, spinlocks can be grouped into fair and unfair as well as scalable and unscalable implementations. In addition, we look at the spinlock’s memory footprint. The three properties are discussed in more detail in the rest of this post.

For all the nitty-gritty architectural details of concrete C++11 spinlock implementations read the upcoming blog posts about Test-And-Set (TAS) Locks, Ticket Locks and MCS Locks. I’ll upload the source code to this Github repository.

Scalability

Existing spinlock variants have very different performance characteristics under contention. A lock is said to be contented when there are multiple threads concurrently trying to acquire the lock while it’s taken. The performance of unscalable spinlock implementations degrades as the number of threads trying to acquire the lock concurrently goes up. The performance of simple Ticket or TAS Locks for example drops exponentially with increasing numbers of threads. In contrast, scalable spinlock implementations – like e.g. the MCS Lock – exhibit no degradation in performance even for large numbers of threads. The scalability graph of the optimal spinlock vs the Ticket Lock is depicted in the graph below.

Scalability of optimal spinlock

Crucial for the scalability of spinlocks is the amount of cache line invalidations which take place when the lock is acquired and released. Cache line invalidations take place when e.g. a thread modifies a synchronization variable while leaving a critical section (CS) on which other threads are polling at the same time. Suddenly, cache line copies stored in caches of other CPU cores are not valid anymore. Accordingly, these cache lines must be invalidated and reobtained from the CPU core’s cache which modified it. Thus, the number of cache line invalidations which occur when the spinlock is acquired/released is often used as a metric to compare spinlock implementations with each other. Usually, scalable spinlocks require O(1) many cache line invalidations to acquire/release the lock, while unscalable spinlocks require O(#threads) many.

Advanced scalable spinlock implementations may not only take the total number of CPU cores in a system into account, but also the details of the underlying cache coherency mechanism. For example in systems with multiple CPUs and non-uniform memory access times (NUMA), inter-core communication costs may vary greatly. For example in a NUMA system with multiple multi-core CPUs it’s entirely possible to pass a lock to another core within in the same CPU in less time than to a core of another CPU.

Fairness

Fair spinlocks guarantee that the same thread may never reacquire the lock if there are other threads trying to acquire the same lock at the same time. Fair spinlocks maintain a first-in-first-out (FIFO) order among the threads attempting to acquire the lock.

Often, unfair spinlocks trade latency for throughput, while fair spinlocks trade throughput for latency. Consider the following example. Let’s say we’re running t threads each performing n/t loop iterations in which they try to enter a CS. When using an unfair spinlock, in the worst case (at least in theory) every thread could enter the CS n/t times in a row instead of alternating with lock acquisition between threads. In certain scenarios this can boost throughput considerably, because the number of expensive cache line invalidations is reduced. However, the time until any other thread may enter the CS rises. In theory this can lead to starvation. In practice the cache coherency protocol should ensure that no starvation occurs.

Typically, the performance of fair spinlocks drops extremely when the number of threads competing for the same lock is greater than the number of CPU cores in the system. The reason for this is that the thread which is next in the sequence to acquire the lock might be sleeping. In contrast to unfair spinlocks, during the time the next thread is sleeping no other thread can acquire the lock because of the strict acquisition order guaranteed by the fair lock. This property is sometimes referred to as preemption intolerance.

Memory footprint

Generally, the required memory to store l spinlocks can depend on the number of threads t which are using the lock. While some spinlock implementations require only constant memory for t threads, other’s memory footprint scales linearly with the number of threads. The amount of needed memory to store a spinlock mostly doesn’t matter, because the memory footprint for the majority of locks is in the order of a few dozen bytes per thread. Mostly it’s a multiple of the CPU’s cache line size to account for false sharing. However, there are cases where the memory footprint plays an important role. Examples are extremely memory-constrained environments or applications that require vast amounts of spinlocks e.g. for fine-grained locking of lots of small pieces of data.

April 27, 2014
by David 16 Comments

An overview of direct memory access

Introduction

Direct memory access (DMA) is conceptually easy, but without experience in hardware design or driver development it can be cumbersome to understand. In this blog post I will explain what DMA is and how it evolved during the last decades. My goal is to make it comprehensible especially for people without experience in hardware design or driver development.

DMA allows computer devices of certain hardware sub-systems to directly access system memory and other device’s memory independently of the CPU. This enables the CPU to keep on working concurrently on other task while long lasting memory operations take place; considerably boosting overall system performance. DMA is used by different hardware like graphics cards, sound cards, network cards and disk drive controllers. DMA is rather a concept than a specific technology. There is no specification which describes in detail how DMA transfers work. Even on the contrary, the concept of directly accessing memory without CPU interaction is employed in many different hardware sub-systems in today’s computers. The most typical application is communicating with peripheral devices plugged into a bus system like ATA, SATA, PCI or PCI Express. Beyond that, DMA transfers are used for intra-core communication in micro processors and even to copy data from the memory of one computer into the memory of another computer over the network via remote DMA (don’t mix up this technology with NVIDIA’s new GPUDirect RDMA feature).

To give a concrete example, imagine you’re playing an open world computer game which loads new game assets on demand from your hard disk. Large amounts of game data must be copied over from hard disk into system RAM. Without DMA the CPU would be actively involved in each and every memory transfer operation. Consequently, less computing time would be left for other game play related tasks like AI or physics. In times of multi-core processors this seems less like a problem. However, as data volumes and work load sizes are ever growing, off-loading large memory transfer operations from the CPU is also today absolutely essential in order to achieve high system performance.

How DMA evolved over time

In my experience many software people think that DMA nowadays still works as it did in the old days. I guess this is because it’s the more intuitive way to think about DMA. Back then, extension devices did not actively take part in DMA transfers, but there was a DMA controller (e.g. the Intel 8237, first used in the IBM PC in 1981) which enabled DMA transfers between system memory and device I/O over the good old Industrial Standard Architecture (ISA) bus. The DMA controller could be programmed by the CPU to perform a number of memory transfers on behalf of the CPU. This way of accomplishing DMA transfers is also known as third party DMA. At that time the system bus was identical to the ISA expansion bus. To account for reduced bus performance in situations where CPU and DMA controller needed to access the bus simultaneously, different DMA modes (cycle stealing, transparent and burst) could be used. When the first IBM AT clones came out, the expansion bus got physically separated from the system bus using an ISA bridge. This was necessary because the AT clones had CPUs running at higher frequencies than the expansion bus. In the figure below the single bus and the separated bus architectures are depicted.

Single vs separate bus architecture

With the introduction of the conventional Peripheral Component Interface (PCI) bus architecture in 1992, the DMA controller became obsolete because of a technique called bus mastering, or first party DMA. PCI DMA transfers were implemented by allowing only one device at a time to access the bus. This device is called the bus master. While the bus master holds the bus it can perform memory transfers without CPU interaction. The fundamental difference between bus mastering and the use of a DMA controller is that DMA compatible devices must contain a DMA engine driving the memory transfers. As multiple PCI devices can master the bus, an arbitration scheme is required to avoid that more than one device drives the bus simultaneously. The advantage of bus mastering is a significant latency reduction because communication with the third party DMA controller is avoided. Additionally, each device’s DMA engine can be specifically optimized for the sort of DMA transfers it performs.

PCI architecture

Today’s computers don’t contain DMA controllers anymore. If they do so, it’s only to support legacy buses like e.g. ISA, often by simulating an ISA interface using a Low Pin Count (LPC) bus bridge. In 2004 the PCI successor and latest peripheral computer bus system PCI Express (PCIe) was introduced. PCIe turned the conventional PCI bus from a true bus architecture, with several devices physically sharing the same bus, into a serial, packet-switched, point-to-point architecture; very similar to how packet-switched networks function. PCIe connects each device with a dedicated, bi-directional link to a PCIe switch. As a result, PCIe supports full duplex DMA transfers of multiple devices at the same time. All arbitration logic is replaced by the packet routing logic implemented in the PCIe switches. While PCIe is entirely different to PCI on the hardware level, PCIe preserves backwards compatibility with PCI on the driver level. Newer PCIe devices can be detected and used by PCI drivers without explicit support for the PCIe standard. Though, the new PCIe features cannot be used of course.

PCIe architecture

DMA from a driver developer’s perspective

Now you know what DMA is and how it fits into a computer’s hardware architecture. So let’s see how DMA can be used in practice to speed up data heavy tasks. Since the dawn of DMA the driver (software) must prepare any peripheral DMA transfers, because only the operating system (OS) has full control over the memory system (we see later why this is important), the file system and the user-space processes. In the first step, the driver determines the source and destination memory addresses for the transfer. Next, the driver programs the hardware to perform the DMA transfer. The major difference between PCI/PCIe DMA and legacy ISA DMA is the way a DMA transfer is initiated. For PCI/PCIe no uniform, device independent way to initiate DMA transfers exists anymore, because each device contains its own, proprietary DMA engine. In contrast, the legacy DMA controller is always the same.

First, the peripheral device’s DMA engine is programmed with the source and destination addresses of the memory ranges to copy. Second, the device is signaled to begin the DMA transfer. Fair enough, but how can the driver know when the DMA transfer has finished? Usually, the device raises interrupts to inform the CPU about transfers that have finished. For each interrupt an interrupt handler, previously installed by the driver, is called and the finished transfer can be acknowledged accordingly by the OS (e.g. signaling the block I/O layer that a block has been read from disk and control can be handed back to the user-space process which requested this block). Back in the times of high latency spinning disks an slow network interfaces this was sufficient. Today, however, we’ve got solid state disks (SSD) and gigabit, low-latency network interfaces. To avoid completely maxing out the system by a vast number of interrupts, a common technique is to hold back and queue up multiple interrupts on the device until e.g. a timeout triggers, a certain number of interrupts are pending or any other condition suiting the application is met. This technique is known as interrupt coalescing. Obviously, the condition is always a trade-off between low latency and high throughput. The more frequently new interrupts are raised, the quicker the OS and its waiting processes are informed about finished memory transfers. However, if the OS is interrupted less often it can spend more time on other jobs.

DMA seems to be a nice feature in theory, but how does transferring large continuous memory regions play together with virtual memory? Virtual memory is usually organized in chunks of 4 KiB, called pages. Virtual memory is continuous as seen from a process’ point-of-view thanks to page tables and the memory management unit (MMU). However, it’s non-continuous as seen from the device point-of-view, because there is no MMU between the PCIe bus and the memory controller (well, some CPUs have an IO-MMU but let’s keep things simple). Hence, in a single DMA transfer only one page could be copied at a time. To overcome this limitation OS usually provide a scatter/gather API. Such an API chains together multiple page-sized memory transfers by creating a list of addresses of pages to be transferred.

Take home message

DMA is an indispensable technique for memory-heavy, high-performance computing. Over the last decades, the entire bus system and DMA controller concept was superseded by moving the DMA controller into the devices and using a point-to-point bus architecture. This reduced latency, made concurrent DMA transfers possible and allowed for device specific DMA engine optimizations. For the drivers less has changed. They are still responsible for initiating the DMA transfers. Though, today, instead of programming a DMA controller in a device independent way, drivers must program device specific DMA engines. Therefore, programming DMA transfers and processing DMA status information can look very different depending on the device.

The Infinite Loop

Tales from a lean programmer.

Category Archives: Computer architecture

Scalable spinlocks 1: array-based

Anderson’s lock

Graunke and Thakkar’s lock

Comparison

The ticket spinlock

Implementation

Proportional back-off

Wrap up

Test-and-set spinlocks

Introduction

The test-and-test-and-set optimization

Exponential back-off

Pausing and sleeping

Performance comparison

Wrapup

Important properties of spinlocks

Scalability

Fairness

Memory footprint

An overview of direct memory access

Introduction

How DMA evolved over time

DMA from a driver developer’s perspective

Take home message