[concurrency-interest] Enforcing total sync order on modern hardware

Vitaly Davidovich vitalyd at gmail.com
Tue Mar 17 10:18:55 EDT 2015


The Intel spec update is to account for store buffers in the absence of
fences.

In your "inserts a Rv0" example, aren't we back to scheduling? There is no
happens-before relationship between the time value and the two other
threads. Making the time writes volatile means each write becomes
visible to all other cores before the time thread progresses; when
that time value is observed is up to those other cores. If the two threads
communicated by piggybacking on the time value, say the writer writes Svar=9
only when t=9, then the reader is guaranteed to see at least Svar=9 if it also
sees t=9 (on x86), assuming Svar is volatile. But as your example stands,
you have some value being written globally every so often, and two
other threads peeking and poking at it while doing their own thing.
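
For concreteness, here is a rough Java sketch of the three-thread scenario as
I understand it from the thread. The field names currentTime and sharedVar
come from the discussion; the class name TimePiggyback, the methods runOnce
and consistent, and the obs array are made up for illustration. It assumes
both fields are volatile, so the JMM's total synchronization order applies;
it says nothing about what bare hardware fences would give you.

```java
class TimePiggyback {
    // Assumption: both fields are volatile, so every access below is a
    // synchronization action in the JMM sense.
    static volatile long currentTime;
    static volatile int sharedVar;

    /** Returns {time read by writer after Wv1, time read by reader, sharedVar read by reader}. */
    static long[] runOnce() {
        currentTime = 0;
        sharedVar = 0;                       // Wv0
        final long[] obs = new long[3];

        Thread timer = new Thread(() -> {
            for (long t = 3; t <= 9; t += 3) {
                currentTime = t;             // T3, T6, T9
                Thread.yield();
            }
        });
        Thread writer = new Thread(() -> {
            sharedVar = 1;                   // Wv1
            obs[0] = currentTime;            // Rwt
        });
        Thread reader = new Thread(() -> {
            obs[1] = currentTime;            // Rrt
            obs[2] = sharedVar;              // Rv
        });

        timer.start();
        writer.start();
        reader.start();
        try {
            // join() makes the plain writes to obs visible to the caller
            timer.join();
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return obs;
    }

    /** False only for the outcome under debate: reader saw a later time than the writer, yet read 0. */
    static boolean consistent(long[] obs) {
        return !(obs[1] > obs[0] && obs[2] == 0);
    }

    public static void main(String[] args) {
        int violations = 0;
        for (int i = 0; i < 1000; i++) {
            if (!consistent(runOnce())) {
                violations++;
            }
        }
        System.out.println("violations: " + violations);
    }
}
```

Which interleaving you get on any given run is pure scheduling, which is my
point above. The check only rules out one outcome: the reader observing a
strictly later time than the writer did after Wv1, and still reading
sharedVar == 0. With both fields volatile, a compliant JVM should never
produce it.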

On Mar 17, 2015 10:03 AM, "Marko Topolnik" <marko at hazelcast.com> wrote:

> On Tue, Mar 17, 2015 at 11:46 AM, Aleksey Shipilev <
> aleksey.shipilev at oracle.com> wrote:
>
>> On 17.03.2015 9:31, Marko Topolnik wrote:
>> > There is another concern that may be interesting to reconsider. Given
>> > the lack of total sync order when just using memory barriers, is the
>> > JSR-133 Cookbook wrong/outdated in this respect? It doesn't at all deal
>> > with the issue of the sync order, just with the visibility of
>> > inter-thread actions.
>>
>> The mental model I have in my head is as follows:
>>
>>   a) Cache-coherent systems maintain the consistent (coherent) view of
>> each memory location at any given moment. In fact, most coherency
>> protocols provide the total order for the operations on a *single*
>> location. Regardless of how the actual interconnect operates, the cache
>> coherency protocols have to maintain that illusion. MESI-like protocols
>> are by nature message-based, so they do not require a shared bus to
>> begin with, and there are no problems with QPI.
>>
>
> So let's fix the following total order on currentTime:
>
> T3 -> Rwt3 -> T6 -> Rwt6 -> Rrt6 -> T9 -> Rrt9
>
>
>> If "sharedVar" is also volatile (sequentially consistent), then Wv1
>> would complete before reading Rwt6.
>
>
> OK, but this wouldn't necessarily happen on a unique global timescale: the
> "writing" thread would have the ordering Wv1 -> Rwt6; there would be an
> _independent_ total order of actions on currentTime, and a third, again
> independent order of actions by the "reading" thread. Due to the
> distributed nature of coherence the fact that, on one core, Wv1 precedes
> Rwt6 does not enforce Rrt6 -> Rv1 on another core. It is not obvious that
> there is transitivity between these individual orders.
>
> Particularly note this statement in
> http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf:
>
> "[the CPU vendor specifications] admit the IRIW behaviour above but, under
> reasonable assumptions on the strongest x86 memory barrier, MFENCE, adding
> MFENCEs would not suffice to recover sequential consistency (instead, one
> would have to make liberal use of x86 LOCK’d instructions). Here the
> specifications seem to be much looser than the behaviour of implemented
> processors: to the best of our knowledge, and following some testing, IRIW
> is not observable in practice, even without MFENCEs. It appears that some
> JVM implementations depend on this fact, and would not be correct if one
> assumed only the IWP/AMD3.14/x86-CC architecture."
>
> Also, for the newer revision of Intel's specification, “P6. In a
> multiprocessor system, stores to the same location have a total order” has
> been replaced by: “Any two stores are seen in a consistent order by
> processors other than those performing the stores.”
>
> So here's a consistent order seen by all the processors except those
> running the two writing threads:
>
> Wv0 -> T3 -> T6 -> T9 -> Wv1
>
> This also respects the total ordering for each individual site, and a
> total ordering of each individual processor's stores. The "reading" thread
> inserts its Rv0 between T9 and Wv1.
>
>
>
>> Reading Rwt6 after the write means
>> the write is observable near tick 6: it is plausible the clock ticked 6
>> before we were writing; it is plausible the clock ticked 6 right after
>> we did the write. Which *really* means the write is guaranteed to be
>> observable at the *next* tick, T9, since "currentTime" reads/writes are
>> totally ordered. Therefore, once the reader thread observed t=9, it
>> should also observe the Wv1, rendering Rv0 reading "0" incorrect.
>>
>>                                 Rrt9 ---> Rv0
>>   Wv0 --> Wv1 --> Rwt6           ^
>>          .---------^         .---/
>>        T6 ---------------> T9
>>
>>  "global time" -------------------------------->
>>
>>
>> Notice how this relies on the writer thread observing Rwt6! That's a
>> reference frame for you. If the writer were to observe Rwt9, you might
>> plausibly infer that Wv1 may not be visible at Rv0:
>>
>
> Thanks, that was precisely my motivation to add Rwt6 :)
>
> ---
> Marko