diff options
-rw-r--r-- | Documentation/memory-barriers.txt | 141 | ||||
-rw-r--r-- | include/asm-generic/bitops/lock.h | 14 |
2 files changed, 123 insertions, 32 deletions
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 904ee42d078e..3729cbe60e41 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -232,7 +232,7 @@ And there are a number of things that _must_ or _must_not_ be assumed: with memory references that are not protected by READ_ONCE() and WRITE_ONCE(). Without them, the compiler is within its rights to do all sorts of "creative" transformations, which are covered in - the Compiler Barrier section. + the COMPILER BARRIER section. (*) It _must_not_ be assumed that independent loads and stores will be issued in the order given. This means that for: @@ -555,6 +555,30 @@ between the address load and the data load: This enforces the occurrence of one of the two implications, and prevents the third possibility from arising. +A data-dependency barrier must also order against dependent writes: + + CPU 1 CPU 2 + =============== =============== + { A == 1, B == 2, C = 3, P == &A, Q == &C } + B = 4; + <write barrier> + WRITE_ONCE(P, &B); + Q = READ_ONCE(P); + <data dependency barrier> + *Q = 5; + +The data-dependency barrier must order the read into Q with the store +into *Q. This prohibits this outcome: + + (Q == B) && (B == 4) + +Please note that this pattern should be rare. After all, the whole point +of dependency ordering is to -prevent- writes to the data structure, along +with the expensive cache misses associated with those writes. This pattern +can be used to record rare error conditions and the like, and the ordering +prevents such records from being lost. + + [!] Note that this extremely counterintuitive situation arises most easily on machines with split caches, so that, for example, one cache bank processes even-numbered cache lines and the other bank processes odd-numbered cache @@ -565,21 +589,6 @@ odd-numbered bank is idle, one can see the new value of the pointer P (&B), but the old value of the variable B (2). -Another example of where data dependency barriers might be required is where a -number is read from memory and then used to calculate the index for an array -access: - - CPU 1 CPU 2 - =============== =============== - { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 } - M[1] = 4; - <write barrier> - WRITE_ONCE(P, 1); - Q = READ_ONCE(P); - <data dependency barrier> - D = M[Q]; - - The data dependency barrier is very important to the RCU system, for example. See rcu_assign_pointer() and rcu_dereference() in include/linux/rcupdate.h. This permits the current target of an RCU'd @@ -800,9 +809,13 @@ In summary: use smp_rmb(), smp_wmb(), or, in the case of prior stores and later loads, smp_mb(). - (*) If both legs of the "if" statement begin with identical stores - to the same variable, a barrier() statement is required at the - beginning of each leg of the "if" statement. + (*) If both legs of the "if" statement begin with identical stores to + the same variable, then those stores must be ordered, either by + preceding both of them with smp_mb() or by using smp_store_release() + to carry out the stores. Please note that it is -not- sufficient + to use barrier() at beginning of each leg of the "if" statement, + as optimizing compilers do not necessarily respect barrier() + in this case. (*) Control dependencies require at least one run-time conditional between the prior load and the subsequent store, and this @@ -814,7 +827,7 @@ In summary: (*) Control dependencies require that the compiler avoid reordering the dependency into nonexistence. Careful use of READ_ONCE() or atomic{,64}_read() can help to preserve your control dependency. - Please see the Compiler Barrier section for more information. + Please see the COMPILER BARRIER section for more information. (*) Control dependencies pair normally with other types of barriers. @@ -1257,7 +1270,7 @@ TRANSITIVITY Transitivity is a deeply intuitive notion about ordering that is not always provided by real computer systems. The following example -demonstrates transitivity (also called "cumulativity"): +demonstrates transitivity: CPU 1 CPU 2 CPU 3 ======================= ======================= ======================= @@ -1305,8 +1318,86 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes. General barriers are therefore required to ensure that all CPUs agree on the combined order of CPU 1's and CPU 2's accesses. -To reiterate, if your code requires transitivity, use general barriers -throughout. +General barriers provide "global transitivity", so that all CPUs will +agree on the order of operations. In contrast, a chain of release-acquire +pairs provides only "local transitivity", so that only those CPUs on +the chain are guaranteed to agree on the combined order of the accesses. +For example, switching to C code in deference to Herman Hollerith: + + int u, v, x, y, z; + + void cpu0(void) + { + r0 = smp_load_acquire(&x); + WRITE_ONCE(u, 1); + smp_store_release(&y, 1); + } + + void cpu1(void) + { + r1 = smp_load_acquire(&y); + r4 = READ_ONCE(v); + r5 = READ_ONCE(u); + smp_store_release(&z, 1); + } + + void cpu2(void) + { + r2 = smp_load_acquire(&z); + smp_store_release(&x, 1); + } + + void cpu3(void) + { + WRITE_ONCE(v, 1); + smp_mb(); + r3 = READ_ONCE(u); + } + +Because cpu0(), cpu1(), and cpu2() participate in a local transitive +chain of smp_store_release()/smp_load_acquire() pairs, the following +outcome is prohibited: + + r0 == 1 && r1 == 1 && r2 == 1 + +Furthermore, because of the release-acquire relationship between cpu0() +and cpu1(), cpu1() must see cpu0()'s writes, so that the following +outcome is prohibited: + + r1 == 1 && r5 == 0 + +However, the transitivity of release-acquire is local to the participating +CPUs and does not apply to cpu3(). Therefore, the following outcome +is possible: + + r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 + +As an aside, the following outcome is also possible: + + r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1 + +Although cpu0(), cpu1(), and cpu2() will see their respective reads and +writes in order, CPUs not involved in the release-acquire chain might +well disagree on the order. This disagreement stems from the fact that +the weak memory-barrier instructions used to implement smp_load_acquire() +and smp_store_release() are not required to order prior stores against +subsequent loads in all cases. This means that cpu3() can see cpu0()'s +store to u as happening -after- cpu1()'s load from v, even though +both cpu0() and cpu1() agree that these two operations occurred in the +intended order. + +However, please keep in mind that smp_load_acquire() is not magic. +In particular, it simply reads from its argument with ordering. It does +-not- ensure that any particular value will be read. Therefore, the +following outcome is possible: + + r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 + +Note that this outcome can happen even on a mythical sequentially +consistent system where nothing is ever reordered. + +To reiterate, if your code requires global transitivity, use general +barriers throughout. ======================== @@ -1459,7 +1550,7 @@ of optimizations: the following: a = 0; - /* Code that does not store to variable a. */ + ... Code that does not store to variable a ... a = 0; The compiler sees that the value of variable 'a' is already zero, so @@ -1471,7 +1562,7 @@ of optimizations: wrong guess: WRITE_ONCE(a, 0); - /* Code that does not store to variable a. */ + ... Code that does not store to variable a ... WRITE_ONCE(a, 0); (*) The compiler is within its rights to reorder memory accesses unless diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h index c30266e94806..8ef0ccbf8167 100644 --- a/include/asm-generic/bitops/lock.h +++ b/include/asm-generic/bitops/lock.h @@ -29,16 +29,16 @@ do { \ * @nr: the bit to set * @addr: the address to start counting from * - * This operation is like clear_bit_unlock, however it is not atomic. - * It does provide release barrier semantics so it can be used to unlock - * a bit lock, however it would only be used if no other CPU can modify - * any bits in the memory until the lock is released (a good example is - * if the bit lock itself protects access to the other bits in the word). + * A weaker form of clear_bit_unlock() as used by __bit_lock_unlock(). If all + * the bits in the word are protected by this lock some archs can use weaker + * ops to safely unlock. + * + * See for example x86's implementation. */ #define __clear_bit_unlock(nr, addr) \ do { \ - smp_mb(); \ - __clear_bit(nr, addr); \ + smp_mb__before_atomic(); \ + clear_bit(nr, addr); \ } while (0) #endif /* _ASM_GENERIC_BITOPS_LOCK_H_ */ |