[concurrency-interest] Enforcing total sync order on modern hardware

Aleksey Shipilev aleksey.shipilev at oracle.com
Tue Mar 17 10:56:39 EDT 2015

On 03/17/2015 04:39 PM, Marko Topolnik wrote:
> On Tue, Mar 17, 2015 at 11:46 AM, Aleksey Shipilev
> <aleksey.shipilev at oracle.com <mailto:aleksey.shipilev at oracle.com>> wrote:
> So let's fix the following total order on currentTime:
> T3 -> Rwt3 -> T6 -> Rwt6 -> Rrt6 -> T9 -> Rrt9
>     If "sharedVar" is also volatile (sequentially consistent), then Wv1
>     would complete before reading Rwt6. 
> OK, but this wouldn't necessarily happen on a unique global timescale:
> the "writing" thread would have the ordering Wv1 -> Rwt6; there would be
> an _independent_ total order of actions on currentTime, and a third,
> again independent order of actions by the "reading" thread. Due to the
> distributed nature of coherence the fact that, on one core, Wv1 precedes
> Rwt6 does not enforce Rrt6 -> Rv1 on another core. It is not obvious
> that there is transitivity between these individual orders.

No one fully understands how exactly hardware concurrency works. All
the academic papers you and I cited before are trying to recover the
actual semantics from experiments and anecdotal examples. Therefore,
lacking a clearly-specified model, we can only come up with some model
and explain the behavior within it.

I offered the simplistic mental model where fences expose the values to
global coherency in instruction order, and thus regain the total order.
When you say "wouldn't necessarily happen on a unique global timescale"
or "due to distributed nature of coherence", you seem to be implying
some other model, but that definition is too vague to be useful. Indeed,
I can come up with an arbitrarily weak model that allows pretty much
any behavior.

> Particularly note this statement
> in http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
> <http://www.cl.cam.ac.uk/%7Epes20/weakmemory/cacm.pdf>:
> "[the CPU vendor specifications] admit the IRIW behaviour above but,
> under reasonable assumptions on the strongest x86 memory barrier,
> MFENCE, adding MFENCEs would not suffice to recover sequential
> consistency (instead, one would have to make liberal use of x86 LOCK’d
> instructions). Here the specifications seem to be much looser than the
> behaviour of implemented processors: to the best of our knowledge, and
> following some testing, IRIW is not observable in practice, even without
> MFENCEs. It appears that some JVM implementations depend on this fact,
> and would not be correct if one assumed only the IWP/AMD3.14/x86-CC
> architecture."

http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/popl09.pdf has a
discussion of the suspected semantics of MFENCE, and they argue
MFENCEs are not enough if you are being paranoid about this. They also
argue the Intel SDM can be read as saying MFENCEs are actually serialized.

My example above also implied the fences are globally serialized. This
matches the actual practice in both the C/C++11 and JVM mappings, which
assume x86 MFENCEs are globally serialized and that MFENCE/LOCK'd
instructions can be used to recover sequential consistency.

POWER's hwsync seems to provide SC guarantees in a similar fashion:
"Regaining sequential consistency (SC) using sync: If one adds a sync
between every program-order pair of instructions (creating tests
SB+syncs, MP+syncs, WRC+syncs, and IRIW+syncs), then all the non-SC
results above are forbidden,"

These intricate details of hardware concurrency are one of the reasons
we have the JMM. The JMM explicitly requires a total order on
synchronization actions, and JVM implementors are responsible for
figuring out how to map it back onto hardware, be it with fences,
locked instructions, or anything else.
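To make that JMM guarantee concrete, here is a minimal IRIW litmus
harness in Java. The class name, iteration count, and harness structure
are my own illustration, not something from this thread: since volatile
accesses are synchronization actions with a total order, the two reader
threads can never observe the two independent writes in opposite orders.

```java
// IRIW (Independent Reads of Independent Writes) litmus sketch.
// Under the JMM, volatile accesses have a single total synchronization
// order, so the outcome r1=1,r2=0 together with r3=1,r4=0 (readers
// disagreeing on the order of the two writes) is forbidden. This
// harness runs the race repeatedly; it demonstrates the guarantee but,
// like any litmus run, cannot prove it.
public class IriwSketch {
    static volatile int x, y;
    static int r1, r2, r3, r4;

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 2_000; i++) {
            x = 0; y = 0;
            Thread w1 = new Thread(() -> x = 1);
            Thread w2 = new Thread(() -> y = 1);
            Thread rA = new Thread(() -> { r1 = x; r2 = y; });
            Thread rB = new Thread(() -> { r3 = y; r4 = x; });
            w1.start(); w2.start(); rA.start(); rB.start();
            w1.join(); w2.join(); rA.join(); rB.join();
            // If this fired, the readers saw the writes in opposite
            // orders, i.e. no total order over the volatile accesses:
            if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) {
                throw new AssertionError("IRIW: readers saw opposite orders");
            }
        }
        System.out.println("no IRIW violation observed");
    }
}
```

Dropping volatile removes the total-order requirement, and on POWER
(per the quote above) the disagreeing outcome then becomes observable.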

Aside: The IRIW example on POWER requires hwsyncs before the volatile
reads, which incurs performance penalties. The discussion of whether
IRIW (and more broadly, a total order that mandates IRIW) is something
we should care about in a language memory model is way, way above
most people's heads, including mine. Until there is expert consensus
on the topic, laymen like us should not try to confuse ourselves with
it before reading up on and fully understanding the academic trail.
> Also, for the newer revision of Intel's specification, “P6. In a
> multiprocessor system, stores to the same location have a total order”
> has been replaced by: “Any two stores are seen in a consistent order by
> processors other than those performing the stores.”

I think that provision is there to allow store forwarding, as I described
in my original reply with the Dekker idiom. MFENCE should bring the total
store order guarantees back even for the "writer" processors, since per
the Intel SDM:

"This serializing operation guarantees that every load and store
instruction that precedes the MFENCE instruction in program order
becomes *globally visible* before any load or store instruction that
follows the MFENCE instruction."

In other words, the relaxation for "writer" processors applies to
regular TSO, not when you are using fences to force the order.
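The Dekker/store-buffering idiom is the classic place where that
store-forwarding relaxation bites. Here is a minimal Java sketch of it
(class name and iteration count are mine, purely for illustration): with
plain fields, each thread may read its own buffered store early and both
loads can return 0; making the fields volatile causes the JVM to emit a
full fence (a LOCK'd instruction or MFENCE on x86) between the store and
the following load, forbidding that outcome.

```java
// Store-buffering (Dekker) litmus sketch. With volatile x and y, the
// JMM forbids the outcome r1 == 0 && r2 == 0: at least one thread must
// observe the other's store. With plain fields, x86 store forwarding
// makes the both-zero outcome readily observable.
public class DekkerSketch {
    static volatile int x, y;
    static int r1, r2;

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 2_000; i++) {
            x = 0; y = 0;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join(); t2.join();
            // Sequential consistency of the volatile accesses guarantees
            // at least one thread saw the other's store:
            if (r1 == 0 && r2 == 0) {
                throw new AssertionError("SB: both loads returned 0");
            }
        }
        System.out.println("no SB violation observed");
    }
}
```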

> So here's a consistent order seen by all the processors except those
> running the two writing threads:
> Wv0 -> T3 -> T6 -> T9 -> Wv1
> This also respects the total ordering for each individual site, and a
> total ordering of each individual processor's stores. The "reading"
> thread inserts its Rv0 between T9 and Wv1.

I don't understand this example. If Wv1 is serialized with a fence, and
there is a read from "t" following it, that read cannot be Rwt6, since
it should see t=9. That pretty much deconstructs the original "execution".

