CAS ABA

https://en.wikipedia.org/wiki/Compare-and-swap

Compare-and-swap

From Wikipedia, the free encyclopedia
 
 
 

In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response (this variant is often called compare-and-set), or by returning the value read from the memory location (not the value written to it).

Overview

A compare-and-swap operation is an atomic version of the following pseudocode, where * denotes access through a pointer:[1]

function cas(p: pointer to int, old: int, new: int) is
    if *p ≠ old
        return false

    *p ← new

    return true

This operation is used to implement synchronization primitives like semaphores and mutexes,[1] as well as more sophisticated lock-free and wait-free algorithms. Maurice Herlihy (1991) proved that CAS can implement more of these algorithms than atomic read, write, or fetch-and-add, and, assuming a fairly large amount of memory, that it can implement all of them.[2] CAS is equivalent to load-link/store-conditional, in the sense that a constant number of invocations of either primitive can be used to implement the other one in a wait-free manner.[3]

Algorithms built around CAS typically read some key memory location and remember the old value. Based on that old value, they compute some new value. Then they try to swap in the new value using CAS, where the comparison checks for the location still being equal to the old value. If CAS indicates that the attempt has failed, it has to be repeated from the beginning: the location is re-read, a new value is re-computed and the CAS is tried again. Instead of immediately retrying after a CAS operation fails, researchers have found that total system performance can be improved in multiprocessor systems—where many threads constantly update some particular shared variable—if threads that see their CAS fail use exponential backoff—in other words, wait a little before retrying the CAS.[4]
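
As a rough illustration of that retry loop, here is a sketch using the C11 <stdatomic.h> and <threads.h> headers; the function name and delay constants are invented for the example, and real code would tune the backoff to the platform:

#include <stdatomic.h>
#include <time.h>     /* struct timespec */
#include <threads.h>  /* thrd_sleep */

void increment_with_backoff(atomic_int *p)
{
    long delay_ns = 100;  /* illustrative starting delay */
    int old = atomic_load(p);
    /* On failure, atomic_compare_exchange_weak reloads the current value
       of *p into `old`, so each retry recomputes from up-to-date data. */
    while (!atomic_compare_exchange_weak(p, &old, old + 1)) {
        thrd_sleep(&(struct timespec){ .tv_nsec = delay_ns }, NULL);
        if (delay_ns < 1000000)  /* cap the exponential backoff at 1 ms */
            delay_ns *= 2;
    }
}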

Example application: atomic adder

As an example use case of compare-and-swap, here is an algorithm for atomically incrementing or decrementing an integer. This is useful in a variety of applications that use counters. The function add performs the action *p ← *p + a, atomically (again denoting pointer indirection by *, as in C) and returns the final value stored in the counter. Unlike in the cas pseudocode above, there is no requirement that any sequence of operations is atomic except for cas.

function add(p: pointer to int, a: int) returns int
    done ← false

    while not done
        value ← *p  // Even this operation doesn't need to be atomic.
        done ← cas(p, value, value + a)

    return value + a

In this algorithm, if the value of *p changes after (or while!) it is fetched and before the CAS does the store, CAS will notice and report this fact, causing the algorithm to retry.[5]
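
For concreteness, here is a rough C11 rendering of this pseudocode (a sketch; production code would normally just call atomic_fetch_add, which the standard library provides for exactly this purpose):

#include <stdatomic.h>

int add(atomic_int *p, int a)
{
    int value = atomic_load(p);  /* a plain read; only the CAS must be atomic */
    /* On failure, atomic_compare_exchange_weak stores the current value
       of *p into `value`, so each retry recomputes from fresh data. */
    while (!atomic_compare_exchange_weak(p, &value, value + a))
        ;
    return value + a;
}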

ABA problem

Some CAS-based algorithms are affected by and must handle the problem of a false positive match, or the ABA problem. It is possible that between the time the old value is read and the time CAS is attempted, some other processors or threads change the memory location two or more times such that it acquires a bit pattern which matches the old value. The problem arises if this new bit pattern, which looks exactly like the old value, has a different meaning: for instance, it could be a recycled address, or a wrapped version counter.

A general solution to this is to use a double-width CAS: e.g., on a 32-bit system, a 64-bit CAS can be used. The second half holds a counter. The compare part of the operation compares the previously read value of the pointer and the counter with the current pointer and counter. If they match, the swap occurs: the new value is written, together with an incremented counter. This means that if ABA has occurred, although the pointer value will be the same, the counter is exceedingly unlikely to be the same: for a 32-bit counter, a multiple of 2^32 operations would have to have occurred, causing the counter to wrap, and at that moment the pointer value would also have to be the same by chance.

An alternative form of this (useful on CPUs which lack DCAS) is to use an index into a freelist, rather than a full pointer, e.g. with a 32-bit CAS, use a 16-bit index and a 16-bit counter. However, the reduced counter lengths begin to make ABA possible at modern CPU speeds.
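
A sketch of this index-plus-counter packing using C11 atomics follows; the 32/32 field split, type, and function names are illustrative assumptions, not a fixed layout:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack a freelist index and a generation counter into one CAS-able word. */
typedef _Atomic uint64_t tagged_head_t;

static uint64_t pack(uint32_t index, uint32_t counter)
{
    return ((uint64_t)counter << 32) | index;
}

/* Try to move the head to new_index, bumping the counter. If another
   thread recycled the same index in the meantime (ABA), its CAS will
   have advanced the counter, so this CAS fails and the caller retries. */
static bool try_retarget(tagged_head_t *head, uint64_t observed, uint32_t new_index)
{
    uint64_t desired = pack(new_index, (uint32_t)(observed >> 32) + 1);
    return atomic_compare_exchange_strong(head, &observed, desired);
}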

One simple technique which helps alleviate this problem is to store an ABA counter in each data structure element, rather than using a single ABA counter for the whole data structure.

A more complicated but more effective solution is to implement safe memory reclamation (SMR). This is in effect lock-free garbage collection. The advantage of using SMR is the assurance that a given pointer will exist only once at any one time in the data structure; the ABA problem is thus completely solved. (Without SMR, something like a freelist must be in use so that all data elements can be accessed safely, with no memory access violations, even when they are no longer present in the data structure. With SMR, only elements actually currently in the data structure will be accessed.)

Costs and benefits

CAS, and other atomic instructions, are sometimes thought to be unnecessary in uniprocessor systems, because the atomicity of any sequence of instructions can be achieved by disabling interrupts while executing it. However, disabling interrupts has numerous downsides. For example, code that is allowed to do so must be trusted not to be malicious and monopolize the CPU, as well as to be correct and not accidentally hang the machine in an infinite loop or page fault. Further, disabling interrupts is often deemed too expensive to be practical. Thus, even programs only intended to run on uniprocessor machines will benefit from atomic instructions, as in the case of Linux's futexes.

In multiprocessor systems, it is usually impossible to disable interrupts on all processors at the same time. Even if it were possible, two or more processors could be attempting to access the same semaphore's memory at the same time, and thus atomicity would not be achieved. The compare-and-swap instruction allows any processor to atomically test and modify a memory location, preventing such multiple-processor collisions.

On server-grade multi-processor architectures of the 2010s, compare-and-swap is cheap relative to a simple load that is not served from cache. A 2013 paper points out that a CAS is only 1.15 times more expensive than a non-cached load on Intel Xeon (Westmere-EX) and 1.35 times on AMD Opteron (Magny-Cours).[6]

Implementations

Compare-and-swap (and compare-and-swap-double) has been an integral part of the IBM 370 (and all successor) architectures since 1970. The operating systems that run on these architectures make extensive use of this instruction to facilitate process (i.e., system and user tasks) and processor (i.e., central processors) parallelism while eliminating, to the greatest degree possible, the "disabled spinlocks" which had been employed in earlier IBM operating systems. Similarly, the use of test-and-set was also eliminated. In these operating systems, new units of work may be instantiated "globally", into the global service priority list, or "locally", into the local service priority list, by the execution of a single compare-and-swap instruction. This substantially improved the responsiveness of these operating systems.

In the x86 (since 80486) and Itanium architectures this is implemented as the compare and exchange (CMPXCHG) instruction (on a multiprocessor the LOCK prefix must be used).

As of 2013, most multiprocessor architectures support CAS in hardware, and the compare-and-swap operation is the most popular synchronization primitive for implementing both lock-based and non-blocking concurrent data structures.[4]

The atomic counter and atomic bitmask operations in the Linux kernel typically use a compare-and-swap instruction in their implementation. The SPARC-V8 and PA-RISC architectures are two of the very few recent architectures that do not support CAS in hardware; the Linux port to these architectures uses a spinlock.[7]

Implementation in C

Many C compilers support using compare-and-swap either with the C11 <stdatomic.h> functions,[8] or some non-standard C extension of that particular C compiler,[9] or by calling a function written directly in assembly language using the compare-and-swap instruction.

The following C function shows the basic behavior of a compare-and-swap variant that returns the old value of the specified memory location; however, this version does not provide the crucial guarantees of atomicity that a real compare-and-swap operation would:

/* Illustrative only: ATOMIC() and END_ATOMIC() are placeholders for
   whatever mechanism would make this region atomic; they are not real
   functions, which is why this version is not actually atomic. */
int compare_and_swap(int* reg, int oldval, int newval)
{
    ATOMIC();
    int old_reg_val = *reg;
    if (old_reg_val == oldval)
        *reg = newval;
    END_ATOMIC();
    return old_reg_val;
}

old_reg_val is always returned, so the caller can test it after the compare_and_swap operation to see whether it matched oldval. If it differs, another process succeeded in a competing compare_and_swap and changed the value of reg away from oldval.

For example, an election protocol can be implemented such that every process checks the result of compare_and_swap against its own PID (= newval). The winning process finds the compare_and_swap returning the initial non-PID value (e.g., zero). For the losers it will return the winning PID.
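
A minimal sketch of such an election, reusing the compare_and_swap function above (the names leader and elect are invented for illustration, and 0 is assumed not to be a valid PID):

int leader = 0;  /* shared; 0 means "no winner yet" */

int elect(int my_pid)
{
    int prev = compare_and_swap(&leader, 0, my_pid);
    /* prev == 0 means we installed our PID and won; otherwise prev
       already holds the winning process's PID. */
    return prev == 0 ? my_pid : prev;
}

A different CAS signature, shown below, instead writes the observed value of the destination back through its first argument when the comparison fails: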

#include <stdbool.h>

/* On failure, this variant loads the current value of *dest into *accum,
   mirroring how CMPXCHG updates the accumulator register. */
bool compare_and_swap(int *accum, int *dest, int newval)
{
    if (*accum == *dest) {
        *dest = newval;
        return true;
    } else {
        *accum = *dest;
        return false;
    }
}

This mirrors the logic of the CMPXCHG instruction as described in the Intel Software Developer's Manual, Volume 2A: when the comparison fails, the current value of the destination is loaded into the accumulator.

Extensions

Since CAS operates on a single pointer-sized memory location, while most lock-free and wait-free algorithms need to modify multiple locations, several extensions have been implemented.

Double compare-and-swap (DCAS)
Compares two unrelated memory locations with two expected values, and if they're equal, sets both locations to new values. The generalization of DCAS to multiple (non-adjacent) words is called MCAS or CASN. DCAS and MCAS are of practical interest in the convenient (concurrent) implementation of some data structures like deques or binary search trees.[10][11] DCAS and MCAS may, however, be implemented using the more expressive hardware transactional memory[12] present in some recent processors such as IBM POWER8 or in Intel processors supporting Transactional Synchronization Extensions (TSX); a sketch of this approach appears after this list.
Double-wide compare-and-swap
Operates on two adjacent pointer-sized locations (or, equivalently, one location twice as big as a pointer). On later x86 processors, the CMPXCHG8B and CMPXCHG16B instructions[13] serve this role, although early 64-bit AMD CPUs did not support CMPXCHG16B (modern AMD CPUs do). Some Intel motherboards from the Core 2 era also hamper its use, even though the processors support it. These issues came into the spotlight at the launch of Windows 8.1 because it required hardware support for CMPXCHG16B.[14]
Single compare, double swap
Compares one pointer but writes two. The Itanium's cmp8xchg16 instruction implements this,[15] where the two written pointers are adjacent.
Multi-word compare-and-swap
Is a generalisation of normal compare-and-swap. It can be used to atomically swap an arbitrary number of arbitrarily located memory locations. Usually, multi-word compare-and-swap is implemented in software using normal double-wide compare-and-swap operations.[16] The drawback of this approach is a lack of scalability.
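
As a rough sketch of how hardware transactional memory can express DCAS, here is a version using Intel's RTM intrinsics. This assumes a TSX-capable CPU and compilation with -mrtm; TSX is disabled on many recent processors, and a real implementation would also need a non-transactional fallback path for aborted transactions:

#include <immintrin.h>
#include <stdbool.h>

/* DCAS on two unrelated words via a hardware transaction. Returns false
   both when the expected values do not match and when the transaction
   aborts for any other reason. */
bool dcas(int *a, int expected_a, int new_a,
          int *b, int expected_b, int new_b)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (*a == expected_a && *b == expected_b) {
            *a = new_a;
            *b = new_b;
            _xend();
            return true;
        }
        _xabort(0xff);  /* roll back: the expected values did not match */
    }
    return false;
}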


https://zh.wikipedia.org/wiki/比较并交换

Compare-and-swap (CAS) is a kind of atomic operation. In multithreaded programming it can be used to perform an uninterruptible data-exchange operation, avoiding the data inconsistencies that arise when several threads rewrite the same data concurrently under nondeterministic scheduling and unpredictable interrupts. The operation compares a value in memory with a specified value and, only when they are equal, replaces the contents of memory with a new value.

Overview

The process of a CAS operation can be expressed by the following C code:[1]

int cas(long *addr, long old, long new)
{
    /* Executes atomically. */
    if(*addr != old)
        return 0;
    *addr = new;
    return 1;
}

In typical use, the old value of some memory location is recorded, a new value is computed through a series of operations on that old value, and CAS is then used to swap the new value in for the old. If the memory was not modified in the meantime, the old value matches the contents of memory, so the CAS succeeds and the memory is updated to the new value. If the memory was modified in the meantime, the old value will generally[2] differ from the contents of memory, so the CAS fails and the new value is not written.

Applications

In applications, CAS can be used to implement lock-free data structures; common examples are the lock-free queue (first in, first out)[3] and the lock-free stack (last in, first out). For linked lists and doubly linked lists that allow insertion at arbitrary positions, lock-free operation is considerably harder to achieve.[4]

The ABA problem

The ABA problem is a common problem in lock-free implementations. It can be stated as follows:

  1. Process P1 reads a value A.
  2. P1 is suspended (its time slice runs out, an interrupt arrives, etc.), and process P2 starts to run.
  3. P2 changes the value A to B and then changes it back to A.
  4. P1 resumes, finds by comparison that the value A has not changed, and continues executing.

From P1's point of view, the value A never changed, yet in reality it has been modified, and continuing to use it may cause problems. In CAS operations, where the compared values are mostly pointers, this problem becomes more serious. Consider the following situation:

    top
     |
     V
  0x0014
| Node A | --> | Node X | --> ...

There is a stack (last in, first out) with a top pointer and a node A; node A is currently at the top of the stack, and top points to A. A process P1 now wants to pop a node, so it performs the following lock-free operation:

pop()
{
  do {
    ptr = top;             // ptr = top = NodeA
    next_ptr = ptr->next;  // next_ptr = NodeX
  } while (CAS(&top, ptr, next_ptr) != true);
  return ptr;
}

But before P1 executes the CAS, process P2 interrupts it and performs a series of pop and push operations on the stack, changing it to the following structure:

    top
     |
     V
  0x0014
| Node C | --> | Node B | --> | Node X | --> ...

P2 first popped NodeA and then pushed two nodes, B and C. Because of the memory-reuse mechanisms widely employed by memory management, NodeC's address turns out to be identical to NodeA's earlier address.

At this point P1 resumes. When it executes the CAS, top still holds NodeA's address (which in fact now refers to NodeC), so the comparison succeeds and top is set to NodeX. The stack then looks like this:

                                  top
                                   |
  0x0014                           V
| Node C | --> | Node B | --> | Node X | --> ...

After the CAS, the top pointer erroneously points to NodeX rather than NodeB.

Implementation

CAS is implemented on top of atomic instructions provided by the CPU. On Intel x86 processors, an instruction can be prefixed with LOCK to lock the system bus, so that the memory address involved cannot be accessed by others while the instruction executes. Building on this capability, each compiler implements its own atomic-operation functions, as listed below (a minimal example using the GNU builtins follows the list).[5]

  • C: the C11 header <stdatomic.h>; GNU additionally provides the __sync family of builtin functions for atomic operations.[6][7]
  • C++11: the STL provides the std::atomic family.[8][7]
  • Java: sun.misc.Unsafe provides the compareAndSwap family of methods.[9]
  • C#: via the Interlocked class.[10]
  • Go: via the "sync/atomic" package.[11]
  • Windows: the Windows API provides the InterlockedCompareExchange family of functions.[12][7]
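
For example, with the GNU __sync builtins, the cas function from the overview can be written directly (a minimal sketch; this is a GCC/Clang extension, not standard C):

/* Returns 1 if *addr held `old` and was replaced by `new_val`, else 0. */
int cas(long *addr, long old, long new_val)
{
    return __sync_bool_compare_and_swap(addr, old, new_val);
}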

https://en.wikipedia.org/wiki/ABA_problem

ABA problem

From Wikipedia, the free encyclopedia
 
 
 

In multithreaded computing, the ABA problem occurs during synchronization, when a location is read twice, has the same value for both reads, and "value is the same" is used to indicate "nothing has changed". However, another thread can execute between the two reads and change the value, do other work, then change the value back, thus fooling the first thread into thinking "nothing has changed" even though the second thread did work that violates that assumption.

The ABA problem occurs when multiple threads (or processes) accessing shared data interleave. Below is a sequence of events that illustrates the ABA problem:

  1. Process P1 reads value A from some shared memory location,
  2. P1 is preempted, allowing process P2 to run,
  3. P2 writes value B to the shared memory location,
  4. P2 writes value A to the shared memory location,
  5. P2 is preempted, allowing process P1 to run,
  6. P1 reads value A from the shared memory location,
  7. P1 determines that the shared memory value has not changed and continues.

Although P1 can continue executing, it is possible that the behavior will not be correct due to the "hidden" modification in shared memory.

A common case of the ABA problem is encountered when implementing a lock-free data structure. If an item is removed from the list, deleted, and then a new item is allocated and added to the list, it is common for the allocated object to be at the same location as the deleted object due to MRU memory allocation. A pointer to the new item is thus often equal to a pointer to the old item, causing an ABA problem.

Examples

Consider a software example (written in C++) of ABA using a lock-free stack:

/* Naive lock-free stack which suffers from the ABA problem. */
#include <atomic>

// Minimal node type, added here so the example is self-contained.
struct Obj {
  Obj* next;
};

class Stack {
  std::atomic<Obj*> top_ptr;

 public:
  //
  // Pops the top object and returns a pointer to it.
  //
  Obj* Pop() {
    while (1) {
      Obj* ret_ptr = top_ptr;
      if (!ret_ptr) return nullptr;
      // For simplicity, suppose that we can ensure that this dereference is safe
      // (i.e., that no other thread has popped the stack in the meantime).
      Obj* next_ptr = ret_ptr->next;
      // If the top node is still ret, then assume no one has changed the stack.
      // (That statement is not always true because of the ABA problem)
      // Atomically replace top with next.
      if (top_ptr.compare_exchange_weak(ret_ptr, next_ptr)) {
        return ret_ptr;
      }
      // The stack has changed, start over.
    }
  }
  //
  // Pushes the object specified by obj_ptr to stack.
  //
  void Push(Obj* obj_ptr) {
    while (1) {
      Obj* next_ptr = top_ptr;
      obj_ptr->next = next_ptr;
      // If the top node is still next, then assume no one has changed the stack.
      // (That statement is not always true because of the ABA problem)
      // Atomically replace top with obj.
      if (top_ptr.compare_exchange_weak(next_ptr, obj_ptr)) {
        return;
      }
      // The stack has changed, start over.
    }
  }
};

This code can normally prevent problems from concurrent access, but it suffers from the ABA problem. Consider the following sequence:

Stack initially contains top → A → B → C

Thread 1 starts running pop:

ret = A;
next = B;

Thread 1 gets interrupted just before the compare_exchange_weak...

{ // Thread 2 runs pop:
  ret = A;
  next = B;
  compare_exchange_weak(A, B)  // Success, top = B
  return A;
} // Now the stack is top → B → C
{ // Thread 2 runs pop again:
  ret = B;
  next = C;
  compare_exchange_weak(B, C)  // Success, top = C
  return B;
} // Now the stack is top → C
delete B;
{ // Thread 2 now pushes A back onto the stack:
  A->next = C;
  compare_exchange_weak(C, A)  // Success, top = A
}

Now the stack is top → A → C

When Thread 1 resumes:

compare_exchange_weak(A, B)

This instruction succeeds because it finds top == ret (both are A), so it sets top to next (which is B). As B has been deleted the program will access freed memory when it tries to look at the first element on the stack. In C++, as shown here, accessing freed memory is undefined behavior: this may result in crashes, data corruption or even just silently appear to work correctly. ABA bugs such as this can be difficult to debug.

Workarounds

Tagged state reference

A common workaround is to add extra "tag" or "stamp" bits to the quantity being considered. For example, an algorithm using compare-and-swap on a pointer might use the low bits of the address to record how many times the pointer has been successfully modified. Because the tag changes on every update, a subsequent compare-and-swap will fail even if the addresses are the same, since the tag bits will not match. This is sometimes called ABAʹ since the second A is made slightly different from the first. Such tagged state references are also used in transactional memory. Although a tagged pointer can be used for implementation, a separate tag field is preferred if double-width CAS is available.
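
A sketch of such a tag stored in the otherwise-unused low bits of an aligned pointer; the 3-bit tag width assumes 8-byte-aligned nodes and is purely illustrative:

#include <stdint.h>

#define TAG_MASK ((uintptr_t)0x7)  /* low 3 bits free on 8-byte-aligned data */

static inline void *ptr_of(uintptr_t v)     { return (void *)(v & ~TAG_MASK); }
static inline uintptr_t tag_of(uintptr_t v) { return v & TAG_MASK; }

/* Build the value to CAS in: the new pointer plus a bumped tag. With only
   3 bits the tag wraps quickly, so ABA remains possible, as the next
   paragraph notes for short tags. */
static inline uintptr_t with_bumped_tag(void *p, uintptr_t old)
{
    return (uintptr_t)p | ((tag_of(old) + 1) & TAG_MASK);
}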

If "tag" field wraps around, guarantees against ABA do not stand anymore. However, it has been observed that on currently existing CPUs, and using 60-bit tags, no wraparound is possible as long as the program lifetime (that is, without restarting the program) is limited to 10 years; in addition, it was argued that for practical purposes it is usually sufficient to have 40-48 bits of tag to guarantee against wrapping around. As modern CPUs (in particular, all modern x64 CPUs) tend to support 128-bit CAS operations, this can allow firm guarantees against ABA.[1]

Intermediate nodes

A correct but expensive approach is to use intermediate nodes that are not data elements and thus assure invariants as elements are inserted and removed [Valois].

Deferred reclamation

Another approach is to defer reclamation of removed data elements. One way to defer reclamation is to run the algorithm in an environment featuring an automatic garbage collector; a problem here however is that if the GC is not lock-free, then the overall system is not lock-free, even though the data structure itself is.

Another way to defer reclamation is to use one or more hazard pointers, which are pointers to locations that otherwise cannot appear in the list. Each hazard pointer represents an intermediate state of an in-progress change; the presence of the pointer assures further synchronization [Doug Lea]. Hazard pointers are lock-free, but can only track at most a fixed number of elements per thread as being in-use.

Yet another way to defer reclamation is to use read-copy update (RCU), which involves enclosing the update in an RCU read-side critical section and then waiting for an RCU grace period before freeing any removed data elements. Using RCU in this way guarantees that any data element removed cannot reappear until all currently executing operations have completed. RCU is lock-free, but isn't suitable for all workloads.

Alternate instructions

Rather than a single pointer-wide compare-and-swap instruction, some processors provide other instructions intended to be more resistant or immune to the ABA problem.

Some architectures provide "larger" atomic operations such that, for example, both forward and backward links in a doubly linked list can be updated atomically. While this feature is architecture-dependent, it is available in particular on x86/x64 architectures (x86 allows 64-bit CAS, and all modern x64 CPUs allow 128-bit CAS) and on IBM's z/Architecture (which allows up to 128-bit CAS).

Some architectures provide a load-linked, store-conditional instruction pair in which the store is performed only when there have been no other stores to the indicated location. This effectively separates the notion of "storage contains value" from "storage has been changed". Examples include DEC Alpha, MIPS, PowerPC, RISC-V, and ARM (v6 and later). Since these instructions provide atomicity using the address rather than the value, routines using them are immune to the ABA problem.[2]


References

  1. ^ 'No Bugs' Hare, CAS (Re)Actor for Non-Blocking Multithreaded Primitives, reprinted from Overload #142, 2017
  2. ^ John Goodacre and Andrew N. Sloss. "Parallelism and the ARM Instruction Set Architecture". p. 46.

 

posted @ 2017-04-06 12:18 papering