[concurrency-interest] Enforcing total sync order on modern hardware

Vitaly Davidovich vitalyd at gmail.com
Tue Mar 17 11:33:42 EDT 2015

I suspect that part of the confusion is caused by Intel talking about
locked instructions as having total order.  I'm not sure why they call them
out like this, but I suspect it's because the CPUs have to arbitrate over
which one acquires the cacheline in exclusive mode if two locked
instructions happen to run at exactly the same time.  Otherwise, I don't
see it as any different from the normal cache coherence mechanics that
prevent multiple CPUs from performing an update to the same line
concurrently.

Also, total order means each CPU's stores appear in program (and the same)
order to all other CPUs (modulo store buffer forwarding).  But to me this
means there are multiple ordered streams proceeding concurrently - everyone
observing these stores sees them at the same time (if read at the same
time) and in the same order.  But the system isn't moving in some
wall-clock lock-step manner unless code uses happens-before relationships
to make progress by observing phase changes.
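As a rough illustration of what that total order does and does not promise,
here is a minimal IRIW (independent reads of independent writes) sketch in
Java; the class and method names are mine, not from this thread. With
volatile variables, the JMM's total synchronization order forbids the two
readers from disagreeing about the order of the two independent writes:

```java
// IRIW litmus test: two independent writers, two readers reading the
// two variables in opposite orders. With volatile (sequentially
// consistent) accesses, the outcome where reader 1 sees x before y
// AND reader 2 sees y before x is forbidden by the JMM.
public class IriwSketch {
    static volatile int x, y;

    // Runs one instance of the test; returns true iff the forbidden
    // (non-SC) outcome was observed.
    static boolean runOnce() throws InterruptedException {
        x = 0; y = 0;
        int[] r = new int[4];
        Thread w1 = new Thread(() -> x = 1);                    // writer 1
        Thread w2 = new Thread(() -> y = 1);                    // writer 2
        Thread r1 = new Thread(() -> { r[0] = x; r[1] = y; });  // reader 1
        Thread r2 = new Thread(() -> { r[2] = y; r[3] = x; });  // reader 2
        for (Thread t : new Thread[] { r1, r2, w1, w2 }) t.start();
        for (Thread t : new Thread[] { r1, r2, w1, w2 }) t.join();
        return r[0] == 1 && r[1] == 0    // reader 1 saw x's write first
            && r[2] == 1 && r[3] == 0;   // reader 2 saw y's write first
    }

    public static void main(String[] args) throws InterruptedException {
        boolean forbidden = false;
        for (int i = 0; i < 1_000; i++) forbidden |= runOnce();
        System.out.println(forbidden ? "IRIW observed" : "consistent");
    }
}
```

Make the fields non-volatile and the forbidden outcome becomes allowed in
principle, though (per the cacm.pdf quote below) it is apparently not
observable on current x86 implementations.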

sent from my phone
On Mar 17, 2015 11:20 AM, "Aleksey Shipilev" <aleksey.shipilev at oracle.com> wrote:

> On 03/17/2015 04:39 PM, Marko Topolnik wrote:
> > On Tue, Mar 17, 2015 at 11:46 AM, Aleksey Shipilev
> > <aleksey.shipilev at oracle.com <mailto:aleksey.shipilev at oracle.com>>
> wrote:
> > So let's fix the following total order on currentTime:
> >
> > T3 -> Rwt3 -> T6 -> Rwt6 -> Rrt6 -> T9 -> Rrt9
> >
> >
> >     If "sharedVar" is also volatile (sequentially consistent), then Wv1
> >     would complete before reading Rwt6.
> >
> >
> > OK, but this wouldn't necessarily happen on a unique global timescale:
> > the "writing" thread would have the ordering Wv1 -> Rwt6; there would be
> > an _independent_ total order of actions on currentTime, and a third,
> > again independent order of actions by the "reading" thread. Due to the
> > distributed nature of coherence the fact that, on one core, Wv1 precedes
> > Rwt6 does not enforce Rrt6 -> Rv1 on another core. It is not obvious
> > that there is transitivity between these individual orders.
> No one fully understands exactly how hardware concurrency works. All
> the academic papers you and I cited before are trying to recover the
> actual semantics from experiments and anecdotal examples. Therefore,
> lacking a clearly specified model, we can only come up with some model
> and explain the behavior within it.
> I offered the simplistic mental model where fences expose the values to
> global coherency in instruction order, and thus regain the total order.
> When you say "wouldn't necessarily happen on a unique global timescale"
> or "due to distributed nature of coherence", you seem to be implying
> some other model, but that definition is too vague to be useful. Indeed,
> I can come up with an arbitrarily weak model that allows pretty much
> anything.
> > Particularly note this statement
> > in http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
> > <http://www.cl.cam.ac.uk/%7Epes20/weakmemory/cacm.pdf>:
> >
> > "[the CPU vendor specifications] admit the IRIW behaviour above but,
> > under reasonable assumptions on the strongest x86 memory barrier,
> > MFENCE, adding MFENCEs would not suffice to recover sequential
> > consistency (instead, one would have to make liberal use of x86 LOCK’d
> > instructions). Here the specifications seem to be much looser than the
> > behaviour of implemented processors: to the best of our knowledge, and
> > following some testing, IRIW is not observable in practice, even without
> > MFENCEs. It appears that some JVM implementations depend on this fact,
> > and would not be correct if one assumed only the IWP/AMD3.14/x86-CC
> > architecture."
> http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/popl09.pdf has a
> discussion of the suspected semantics of MFENCE, and they argue
> MFENCEs are not enough if you are being paranoid about this. They also
> argue the Intel SDM can be read as if MFENCEs are actually serialized.
> My example above also implied the fences are globally serialized. This
> matches the actual practice in both the C/C++11 and JVM mappings, which
> assume x86 MFENCEs are globally serialized and that MFENCE and LOCKed
> instructions can be used interchangeably.
> POWER's hwsync seems to provide SC guarantees in a similar fashion:
> "Regaining sequential consistency (SC) using sync: If one adds a sync
> between every program-order pair of instructions (creating tests
> SB+syncs, MP+syncs, WRC+syncs, and IRIW+syncs), then all the non-SC
> results above are forbidden,"
> (http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/pldi105-sarkar.pdf).
> These intricate details of hardware concurrency are one of the reasons we
> have the JMM. The JMM explicitly requires a total order on synchronization
> actions, and JVM implementors are responsible for figuring out how to map
> it back onto the hardware, be it with fences, locked instructions, or
> something else.
> Aside: The IRIW example on POWER requires hwsyncs before the volatile
> reads, which incurs performance penalties. The discussion of whether
> IRIW (and more broadly, the total order that mandates IRIW) is something
> we should care about in a language memory model is way, way above
> most people's heads, including mine. Until there is an expert consensus
> on the topic, laymen like us should not try to confuse ourselves with
> it before reading up on and fully understanding the academic trail on
> the topic.
> > Also, for the newer revision of Intel's specification, “P6. In a
> > multiprocessor system, stores to the same location have a total order”
> > has been replaced by: “Any two stores are seen in a consistent order by
> > processors other than those performing the stores.”
> I think that provision is there to allow store forwarding, as I described
> in my original reply about the Dekker idiom. MFENCE should bring the
> total store order guarantees back even for the "writer" processors,
> since as per the Intel SDM:
> "This serializing operation guarantees that every load and store
> instruction that precedes the MFENCE instruction in program order
> becomes *globally visible* before any load or store instruction that
> follows the MFENCE instruction."
> In other words, the relaxation for "writer" processors is a relaxation
> of regular TSO; it does not apply when you are using fences to force
> the order.
> > So here's a consistent order seen by all the processors except those
> > running the two writing threads:
> >
> > Wv0 -> T3 -> T6 -> T9 -> Wv1
> >
> > This also respects the total ordering for each individual site, and a
> > total ordering of each individual processor's stores. The "reading"
> > thread inserts its Rv0 between T9 and Wv1.
> I don't understand this example. If Wv1 is serialized with a fence, and
> there is a read from "t" following it, that read cannot be Rwt6, as it
> should see t=9. That pretty much deconstructs the original "execution".
> -Aleksey.
> _______________________________________________
> Concurrency-interest mailing list
> Concurrency-interest at cs.oswego.edu
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
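The store-buffering behavior Aleksey refers to with the Dekker idiom
(each writer's store can sit in its store buffer while the cross-read
executes) can be sketched in Java as follows; the class and method names
are mine. With plain fields, both threads may read 0 under TSO; declaring
the fields volatile restores the prohibition, since typical JVMs on x86
emit a LOCKed instruction or MFENCE after the volatile store (an
assumption about the mapping, consistent with this thread):

```java
// Store-buffering (Dekker) litmus test: each thread stores to one
// variable and then loads the other. Under TSO without fences both
// loads may return 0 (each store still in the store buffer); with
// volatile accesses the JMM forbids the r[0] == 0 && r[1] == 0 outcome.
public class DekkerSketch {
    static volatile int x, y;

    // Runs one instance; returns true iff the forbidden outcome
    // (both threads read 0) was observed.
    static boolean runOnce() throws InterruptedException {
        x = 0; y = 0;
        int[] r = new int[2];
        Thread t1 = new Thread(() -> { x = 1; r[0] = y; });  // store x, load y
        Thread t2 = new Thread(() -> { y = 1; r[1] = x; });  // store y, load x
        t1.start(); t2.start();
        t1.join(); t2.join();
        return r[0] == 0 && r[1] == 0;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean forbidden = false;
        for (int i = 0; i < 1_000; i++) forbidden |= runOnce();
        System.out.println(forbidden
            ? "store buffering observed" : "consistent");
    }
}
```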
