[concurrency-interest] Enforcing total sync order on modern hardware

Aleksey Shipilev aleksey.shipilev at oracle.com
Tue Mar 17 06:46:05 EDT 2015


On 17.03.2015 9:31, Marko Topolnik wrote:
> There is another concern that may be interesting to reconsider. Given
> the lack of total sync order when just using memory barriers, is the
> JSR-133 Cookbook wrong/outdated in this respect? It doesn't at all deal
> with the issue of the sync order, just with the visibility of
> inter-thread actions.

Sigh. The rule of thumb in dealing with concurrency: assume you are
misunderstanding something before you think something agreed upon is
incorrect. In fact, in this thread, you confused yourselves with "memory
barriers are lacking total order". Memory barriers *are* regaining the
total ordering when used correctly. E.g. on x86, this is a sequentially
consistent code:

  SeqCst store:
    mov %blah, (memory)   ; volatile store
    lock addl (%esp), 0

  SeqCst load:
    mov (memory), %blah   ; volatile load

The alternative also works:

  SeqCst store:
    mov %blah, (memory)   ; volatile store
    mfence

  SeqCst load:
    mov (memory), %blah   ; volatile load


The mental model I am having in my head is as follows:

  a) Cache-coherent systems maintain the consistent (coherent) view of
each memory location at any given moment. In fact, most coherency
protocols provide the total order for the operations on a *single*
location. Regardless how the actual interconnect is operating, the cache
coherency protocols are to maintain that illusion. MESI-like protocols
are by nature message-based, and so they do not require shared bus to
begin with, so no problems with QPI.

  b) CPUs can do dirty work either doing multiple accesses at once, or
even without consulting the coherency. That has a potential to break the
invariants the coherency is trying to maintain: either through
overlapping the accesses, or not exposing the values right away. Memory
barriers are here to instruct hardware to stop being smart and expose
the operations to coherency in the instruction order.

  c) The easiest way to provide the memory fence effect is force CPU to
wait for completion of fenced operations, without doing anything else.
What fences are doing is (almost) exactly that: they let coherency to
sort out the final value of the memory location without overlapping the
access with anything else.

E.g. on x86, where total store order is provided by hardware, the
fencing is required to avoid store forwarding effects. With store
forwarding and no fencing, the writer CPU can "see" the written value
before anything else sees it, and can go on pretending everyone else saw
that value as well. This shortcut sets us up for potentially observing
the non-total ordering:

Dekker idiom:

   volatile int x, y;
 -----------------------
  x = 1        y = 1
  r1 = y       r2 = x

(r1, r2) = (0, 0) is forbidden, non-SC.

Imagine CPU1 does the "x = 1" store, and it is on its way to coherency,
now in store buffer. CPU1 then proceeds to read "y", and imagine it
reads "0", because CPU2 is lagging behind. Now, CPU2 does "y = 1" store.
Then CPU2 does "r2 = x" load, and reads "0" (!), because the "x" value
is still stuck in CPU1's store buffer. This is a "transient" condition
when coherency has no idea CPUs have different ideas about the value of
"x", because all values have not yet reached coherency. Locked
instruction / mfence after "x = 1" write basically says "until we are
sure "x = 1" had completed, freeze and don't make any moves."

Even without store buffers, when CPUs can start writing "x = 1" and "y =
1" at the same time, waiting for their local operations to complete
saves us from failures due to overlapped in-flight updates of "x" and
"y". In other words, CPU1 would not be able to start loading "y" before
it completed "x = 1", which means that whatever value of "y" it reads
after completing the write of "x" while CPU1's write of "y = 1" is still
pending, CPU2 is ought to observe "x = 1". Memory barriers provide the
"lock-step" behavior here.

The same line of reasoning applies to non-TSO hardware,
non-multiple-copy-atomic hardware, etc. Memory barriers are routinely
used to regain the order of operations. These things are
well-researched, so instead of relying on mailing list hearsays, see the
authoritative sources here:

http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf
  http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
  http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
  http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html


> If a JVM just followed the rules put forth by the Cookbook, would an
> invalid execution as outlined in my diagram actually be possible?

This is the monospace-friendly version:
 http://cs.oswego.edu/pipermail/concurrency-interest/2015-March/014118.html

Thinking on hardware level is full of surprises, and one should only do
that very carefully.

So, if only the "currentTime" is volatile (sequentially consistent), you
can still read "0" from non-volatile "sharedVar" in Rv0. The easiest way
to achieve that hardware wise is to get Wv1 to stuck in store buffer,
voila. It does not depend on whether the loads and stores to
"currentTime" are totally ordered or not. Even if they are, Wv1 can
still move past Rwt6.

If "sharedVar" is also volatile (sequentially consistent), then Wv1
would complete before reading Rwt6. Reading Rwt6 after the write means
the write is observable near tick 6: it is plausible the clock ticked 6
before we were writing; it is plausible the clock ticked 6 right after
we did the write. Which *really* means the write is guaranteed to be
observable at the *next* tick, T9, since "currentTime" reads/writes are
totally ordered. Therefore, once the reader thread observed t=9, it
should also observe the Wv1, rendering Rv0 reading "0" incorrect.

                                Rrt9 ---> Rv0
  Wv0 --> Wv1 --> Rwt6           ^
         .---------^         .---/
       T6 ---------------> T9

 "global time" -------------------------------->


Notice how this relies on the writer thread to observe Rwt6! That's a
reference frame for you. If writer was to observe Rwt9, you might have
plausibly inferred the Wv1 may be not visible at Rv0:

            Rrt9 ---> Rv0
              ^
  Wv0 --------+---------> Wv1 --> Rwt9
         .----/--------------------^
       T9

 "global time" -------------------------------->

It inherently relies on T6 updates to wait for everyone to see the
update. In other words, guaranteeing every thread sees the consistent
progression of values in some "global time".

Thanks,
-Aleksey.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: OpenPGP digital signature
URL: <http://cs.oswego.edu/pipermail/concurrency-interest/attachments/20150317/f9ca0d9c/attachment.bin>


More information about the Concurrency-interest mailing list