[concurrency-interest] Does StampedLock need a releaseFence in theory?

Andrew Haley aph at redhat.com
Thu Jul 14 13:16:44 EDT 2016

On 14/07/16 16:27, Martin Buchholz wrote:
> On Thu, Jul 14, 2016 at 1:23 AM, Andrew Haley <aph at redhat.com> wrote:
>> On 14/07/16 01:53, Hans Boehm wrote:
>>> An ARMv8 compareAndSet operation (using only acquire and release
>>> operations, not dmb, as it should be implemented) will behave like the
>>> lock-based one in this respect.  I think the current code above is
>>> incorrect on ARMv8 (barring compensating pessimizations elsewhere).
>> Umm, what?  The ARMv8 compareAndSet has a sequentially consistent store.
>> I guess I must be missing something important.
> (Pretending to be Hans here ...)
> The idea is that all ARMv8 "load-acquire/store-release" operations
> (including those used for implementing CAS) are sequentially consistent
> when considered as a group in the same way that all "synchronization
> actions" in Java are, but they can still be reordered with plain
> reads/writes, just like Java plain variable access can be reordered with
> volatile variable access (unless a happens-before relationship exists).
> The section in
> https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> on aarch64 should be useful.

I get that.  I'm just trying to understand exactly which scenario Hans
is worried about.

void writer(...) {
    unsigned seq0 = seq;
    while (seq0 & 1 ||
           !seq.cmp_exc_wk(seq0, seq0+1))
    { seq0 = seq; }
    data1 = ...;
    data2 = ...;
    seq = seq0 + 2;
}

> CAS is implemented using a ldaxr followed by stlxr which is efficient, but
> allows subsequent writes to move in between the ldaxr and the stlxr.

OK, got that.  A write of data1 might move before the seq.cmp_exc_wk
has succeeded in bumping seq (the version clock).

Reading the AArch64 specification as carefully as I can, I see

For a Store-Release, observers in the shareability domain of the
address accessed by the Store-Release observe:

1.  Both of the following, if the shareability of the addresses
accessed requires that the observer observes them:

        Reads and writes caused by loads and stores appearing in
        program order before the Store-Release.

        Writes that have been observed by the PE executing the
        Store-Release before executing the Store-Release.

2.  The write caused by the Store-Release.

There are no additional ordering requirements on loads or stores that
appear in program order after the Store-Release.
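The one-way nature of that guarantee can be sketched in C++ terms. This is a minimal illustration, not code from the thread: the names payload, later and ready are hypothetical, and the comment about stlr reflects how compilers typically map a release store on AArch64 rather than anything the specification text above promises about a particular toolchain.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical names, for illustration only.
int payload = 0;                   // plain data, written before the release store
std::atomic<int> later{0};         // written after the release store
std::atomic<bool> ready{false};    // the flag published with release semantics

void producer() {
    payload = 42;                                  // before the release in program order
    ready.store(true, std::memory_order_release);  // typically an stlr on AArch64
    // Nothing orders this store after the release store above; another
    // observer may see later == 7 while ready still reads false.
    later.store(7, std::memory_order_relaxed);
}

// Returns the payload if the flag has been observed, otherwise -1.
int consumer() {
    if (!ready.load(std::memory_order_acquire)) {
        return -1;
    }
    // The acquire load saw the release store, so the earlier write to
    // payload is visible here.
    return payload;
}
```

Run producer() and consumer() on separate threads to exercise the ordering; the point is only that accesses before the release store are covered and accesses after it are not.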

So, yes, it is quite possible for a write of data1 to move before the
write of the clock.  And I can see why that would be bad.  We really
need something like

void writer(...) {
    unsigned seq0 = seq;
    while (seq0 & 1 ||
           !seq.cmp_exc_wk(seq0, seq0+1))
    { seq0 = seq; }
    data1 = ...;
    data2 = ...;
    seq = seq0 + 2;
}
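One way to read "something like" here, going by the thread subject, is a release fence between the successful CAS and the data writes. The sketch below is an assumption about the intended shape, not a quote from the thread: it spells cmp_exc_wk as std::atomic's compare_exchange_weak, makes data1 and data2 relaxed atomics so the C++ version has no data races, and adds a matching reader with an acquire fence so the fence in the writer has something to pair with.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

std::atomic<unsigned> seq{0};            // version clock: odd while a write is in progress
std::atomic<int> data1{0}, data2{0};     // relaxed atomics stand in for the plain data

void writer(int v1, int v2) {
    unsigned seq0 = seq.load(std::memory_order_relaxed);
    while ((seq0 & 1) ||
           !seq.compare_exchange_weak(seq0, seq0 + 1,
                                      std::memory_order_acquire,
                                      std::memory_order_relaxed)) {
        seq0 = seq.load(std::memory_order_relaxed);
    }
    assert((seq0 & 1) == 0);             // we own the lock; seq is now seq0 + 1
    // Assumed fix: without this fence the data stores below may become
    // visible before the CAS's store that made seq odd.
    std::atomic_thread_fence(std::memory_order_release);
    data1.store(v1, std::memory_order_relaxed);
    data2.store(v2, std::memory_order_relaxed);
    seq.store(seq0 + 2, std::memory_order_release);
}

std::pair<int, int> reader() {
    for (;;) {
        unsigned s0 = seq.load(std::memory_order_acquire);
        int r1 = data1.load(std::memory_order_relaxed);
        int r2 = data2.load(std::memory_order_relaxed);
        // Pairs with the writer's release fence: if either load above saw a
        // new value, the re-read of seq below must see the bumped clock.
        std::atomic_thread_fence(std::memory_order_acquire);
        unsigned s1 = seq.load(std::memory_order_relaxed);
        if (s0 == s1 && (s0 & 1) == 0) {
            return {r1, r2};             // consistent snapshot
        }
    }
}
```

Whether a full release fence or a cheaper store-store barrier is the right tool on a given platform is a separate question; the sketch only marks where the ordering is needed.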

> (Back to being Martin ...)
> Reordering a plain store from after to before a stlxr (rather than
> non-exclusive stlr) is still rather surprising because it looks like
> a speculative store - we don't know yet whether the stlxr will
> succeed.  Unlike the case where we implement CAS via a lock.

But even CAS via a lock has the roach motel property, so we're used to
the idea of stores and loads moving into a critical section: none of
this should surprise people.
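For comparison, a lock-based CAS might look like the following sketch. LockedCell, cas_then_store and after_flag are hypothetical names introduced for illustration; the comments describe what the C++ memory model permits, not what any particular compiler or CPU actually does.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical lock-based CAS cell.
struct LockedCell {
    std::mutex m;
    unsigned value = 0;

    bool compare_and_set(unsigned expected, unsigned desired) {
        std::lock_guard<std::mutex> guard(m);   // lock: acquire
        if (value != expected) {
            return false;
        }
        value = desired;
        return true;
    }                                           // unlock: release
};

unsigned after_flag = 0;   // plain store issued after the CAS returns

bool cas_then_store(LockedCell& cell) {
    bool ok = cell.compare_and_set(0, 1);
    // Roach motel: this store may be moved up past the unlock and into the
    // critical section, but not above the lock acquisition.
    after_flag = 1;
    return ok;
}
```

The later store can end up inside the critical section here too, which is the same shape of reordering as a store sliding between the ldaxr and the stlxr.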

> Am I thinking too atomically?  Perhaps the stlxr instruction
> implementation exclusively acquires the cache line, sees that it
> surely will succeed, but will be slow because pending memory
> operations must be completed first.

I don't think it has to be so complicated.  An out-of-order
implementation can move stores which are after this stlxr to before it
simply because there is no logic preventing it from doing so.  Whether
this is a realistic or useful property of a real machine is a whole
'nother matter.

