[concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
aph at redhat.com
Fri Dec 12 05:57:28 EST 2014
On 11/12/14 22:01, Stephan Diestelhorst wrote:
> Am Donnerstag, 11. Dezember 2014, 15:36:27 schrieb Andrew Haley:
>> On 12/11/2014 02:54 PM, Stephan Diestelhorst wrote:
>>> You pretty much swapped DMBs and MFENCEs ;) So MFENCEs are local, in
>>> that they need to drain the write-buffer before allowing a later load
>>> access to the data. DMBs, on the other hand, at least conceptually need
>>> to make sure that stores from other cores have become visible everywhere
>>> when the local core has seen them before the DMB.
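To make the "MFENCE is local" half concrete, here is a toy model. It is an informal store-buffer picture in the spirit of x86-TSO and does not describe any real CPU. Each core has a FIFO store buffer in front of one shared memory. A load first forwards from its own buffer, buffered stores drain to memory one at a time, and a fence just stalls until the local buffer is empty. The names `TsoStoreBufferModel`, `storeBuffering` and `outcomes` are hypothetical. The model exhaustively enumerates the store-buffering (SB) litmus test, with and without a fence between each core's store and its load:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Toy TSO-style model (illustration only): per-core FIFO store buffers
// in front of a single shared memory.
final class TsoStoreBufferModel {
    static final int STORE = 0, FENCE = 1, LOAD = 2;

    // Instructions are {op, address, operand}; address 0 is x, 1 is y.
    // STORE operand = value written; LOAD operand = target register.
    static int[][][] storeBuffering(boolean withFence) {
        int[][] core0 = withFence
                ? new int[][] {{STORE, 0, 1}, {FENCE, 0, 0}, {LOAD, 1, 0}}
                : new int[][] {{STORE, 0, 1}, {LOAD, 1, 0}};
        int[][] core1 = withFence
                ? new int[][] {{STORE, 1, 1}, {FENCE, 0, 0}, {LOAD, 0, 1}}
                : new int[][] {{STORE, 1, 1}, {LOAD, 0, 1}};
        return new int[][][] {core0, core1};
    }

    private static final class State {
        int[] pc = new int[2];
        int[] memory = new int[2];
        int[] regs = new int[2];
        List<ArrayDeque<int[]>> buffers = new ArrayList<>();

        State() {
            buffers.add(new ArrayDeque<>());
            buffers.add(new ArrayDeque<>());
        }

        State copy() {
            State s = new State();
            s.pc = pc.clone();
            s.memory = memory.clone();
            s.regs = regs.clone();
            for (int c = 0; c < 2; c++) {
                // Entries are never mutated, so copying the deque suffices.
                s.buffers.set(c, new ArrayDeque<>(buffers.get(c)));
            }
            return s;
        }
    }

    // Every final register state reachable under the model.
    static Set<String> outcomes(int[][][] program) {
        Set<String> results = new HashSet<>();
        explore(program, new State(), results);
        return results;
    }

    private static void explore(int[][][] program, State s, Set<String> results) {
        boolean moved = false;
        for (int c = 0; c < 2; c++) {
            // Choice 1: core c's oldest buffered store drains to memory.
            if (!s.buffers.get(c).isEmpty()) {
                State next = s.copy();
                int[] entry = next.buffers.get(c).pollFirst();
                next.memory[entry[0]] = entry[1];
                explore(program, next, results);
                moved = true;
            }
            // Choice 2: core c executes its next instruction.
            if (s.pc[c] < program[c].length) {
                int[] ins = program[c][s.pc[c]];
                if (ins[0] == FENCE && !s.buffers.get(c).isEmpty()) {
                    continue; // the fence waits for this core's own buffer only
                }
                State next = s.copy();
                next.pc[c]++;
                if (ins[0] == STORE) {
                    next.buffers.get(c).addLast(new int[] {ins[1], ins[2]});
                } else if (ins[0] == LOAD) {
                    next.regs[ins[2]] = load(next, c, ins[1]);
                }
                explore(program, next, results);
                moved = true;
            }
        }
        if (!moved) {
            results.add("r0=" + s.regs[0] + ",r1=" + s.regs[1]);
        }
    }

    // Store-to-load forwarding: newest own buffered store wins, else memory.
    private static int load(State s, int core, int addr) {
        Iterator<int[]> it = s.buffers.get(core).descendingIterator();
        while (it.hasNext()) {
            int[] e = it.next();
            if (e[0] == addr) {
                return e[1];
            }
        }
        return s.memory[addr];
    }
}
```

Without the fence, `outcomes(storeBuffering(false))` contains all four register combinations, including `r0=0,r1=0`. With the fence, that outcome disappears. The fence in this model never touches another core's buffer. It only waits for its own buffer to drain, which is the sense in which an MFENCE is a local operation.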
>> Excuse me? Now I'm even more confused.
>> If this core has seen a store from another core, then a DMB on this
>> core makes that store visible to all other cores, even if the store
>> had no memory fence?
> Yep. Conceptually it has to. Imagine in the IRIW example, there is not
> even a fence behind the store:
> Memory: foo = bar = 0
> T1          T2          T3          T4
> foo := 1    bar := 1    r1 = foo    r3 = bar
>                         DMB         DMB
>                         r2 = bar    r4 = foo
> r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0 ?
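The IRIW shape above can be written as a Java litmus harness. This is a sketch, and `IriwLitmus` and `countForbiddenOutcomes` are hypothetical names. Because `foo` and `bar` are volatile, the JMM forbids `r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0`. On a weakly ordered machine, the JIT therefore has to emit whatever barriers make the two readers agree on one order of the two independent stores. Starting fresh threads per iteration rarely produces interesting interleavings. A zero count is what the specification requires, not evidence about the hardware; a dedicated tool such as jcstress is the better way to probe that.

```java
// IRIW (independent reads of independent writes) as a Java litmus harness.
final class IriwLitmus {
    // Volatile accesses are totally ordered by the JMM's synchronization
    // order, so the outcome counted below is forbidden for this program.
    private volatile int foo, bar;
    private int r1, r2, r3, r4; // read by the caller only after join()

    static int countForbiddenOutcomes(int iterations) {
        int forbidden = 0;
        for (int i = 0; i < iterations; i++) {
            IriwLitmus s = new IriwLitmus();
            Thread[] threads = {
                new Thread(() -> s.foo = 1),                       // T1
                new Thread(() -> s.bar = 1),                       // T2
                new Thread(() -> { s.r1 = s.foo; s.r2 = s.bar; }), // T3
                new Thread(() -> { s.r3 = s.bar; s.r4 = s.foo; }), // T4
            };
            for (Thread t : threads) {
                t.start();
            }
            try {
                for (Thread t : threads) {
                    t.join(); // join() also makes r1..r4 visible here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            // T3 saw foo before bar while T4 saw bar before foo.
            if (s.r1 == 1 && s.r2 == 0 && s.r3 == 1 && s.r4 == 0) {
                forbidden++;
            }
        }
        return forbidden;
    }
}
```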
Okay, I get that: it enforces a global visibility ordering on the memory
>> So the simple act of reading a memory location and then doing a DMB
>> causes a previous store to that memory location to become visible
>> to all cores.
> Yes, if you read the updated store value with a load before the DMB. If
> you look through the ARM documentation, this is precisely the reason for
> the somewhat complex description with the recursive definition of what
> really is before and after the barrier.
Aha! Thank you.
> The description tells you that the barrier not only orders things
> that were on the same core before / after, but also things that were
> read by instructions before the barrier, likewise for things
> happening after the barrier. This is necessary, as otherwise there
> would be no way to enforce a consistent global store order in a weak
> memory model.
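The "things that were read by instructions before the barrier" case is the classic write-to-read causality (WRC) shape. A relay thread reads a store made by some other thread and then writes a flag. In ARM terms, the barrier between the relay's read and its write must order the writer's store, which the relay merely observed, ahead of the flag. The Java sketch below uses hypothetical names and makes every shared field volatile. That makes the program data-race-free, so the JMM requires sequentially consistent behaviour and the counted outcome cannot occur. As with any such harness, a zero count confirms the requirement rather than testing the hardware.

```java
// WRC (write-to-read causality): a relay observes a store, then writes.
final class WrcLitmus {
    // All shared accesses are volatile, so this program is data-race-free
    // and the JMM guarantees sequentially consistent executions.
    private volatile int x, y;
    private int r1, r2, r3; // read by the caller only after join()

    static int countForbiddenOutcomes(int iterations) {
        int forbidden = 0;
        for (int i = 0; i < iterations; i++) {
            WrcLitmus s = new WrcLitmus();
            Thread[] threads = {
                new Thread(() -> s.x = 1),                     // writer
                new Thread(() -> { s.r1 = s.x; s.y = 1; }),    // relay: read, then write
                new Thread(() -> { s.r2 = s.y; s.r3 = s.x; }), // observer
            };
            for (Thread t : threads) {
                t.start();
            }
            try {
                for (Thread t : threads) {
                    t.join();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            // Forbidden: the relay saw x == 1 and the observer saw the relay's
            // y == 1, yet the observer still reads the stale x == 0.
            if (s.r1 == 1 && s.r2 == 1 && s.r3 == 0) {
                forbidden++;
            }
        }
        return forbidden;
    }
}
```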
The problem with "as if, conceptually" descriptions is that they give
programmers no intuitively understandable model they can get a grip
on. I think I understand MESI/MOESI protocols, write buffers,
invalidate queues, and so on. I suppose that what happens here is
that the processor executing a DMB has to send an invalidate message
to the other processors for the data in its cache, but that sounds
prohibitively expensive to me.
I am aware that the specification is a model, it's not exactly what
happens. However, there is a real problem in the industry
(particularly in Java) that people have a false understanding of
memory barriers and their cost. I am playing an endless game of
whack-a-mole rebutting influential programmers who tell the world to
avoid volatile variables because accesses "flush the cache" and so are
very costly. And I have to reply no, that's not what happens, memory
barriers are a local operation and the cache-coherence protocol does the rest.
It's the "somewhat complex" (i.e. utterly baffling) description in the
ARM ARM which is perhaps the real problem in this particular case.
Just to get on my soapbox for a moment: it's not enough these days
to say to the programmer "do this and your program is correct";
rather, people need to have an intuitive model which gives them a way
to reason informally (and reasonably accurately) about the cost of an
action on a multicore system. I accept that we don't want to
overspecify anything, so I'm not suggesting that the specifications
should be changed; I would like to see a better public discourse
around all this, though. I am aware that we're not even at the point
where programmers can reason correctly about relaxed memory, and that
has to come first.
> The reason for fences being so simple on TSO(-like) architectures is
> precisely that stores are magically globally ordered already, so the
> fence does not have to influence them.
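For reference, the fences under review are typically used in a message-passing pattern like the sketch below. JEP 171 added `loadFence`, `storeFence` and `fullFence` to `sun.misc.Unsafe`, and Java 9 exposed corresponding static methods on `java.lang.invoke.VarHandle`, which the sketch uses. On TSO hardware such as x86, the release and acquire fences here need no hardware fence instruction, though they still restrain the compiler. TSO already keeps loads ordered with later accesses and stores ordered with earlier stores. Only a full fence, which rules out store-to-load reordering, has to wait for the store buffer to drain. The class and method names are hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Message passing with explicit fences instead of a volatile field.
final class FencedPublication {
    private static final VarHandle READY;

    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(FencedPublication.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private int data;      // plain field, published by the fence pair
    private boolean ready; // only accessed through READY, in opaque mode

    void publish(int value) {
        data = value;
        // Loads and stores before this fence are not reordered with stores after it.
        VarHandle.releaseFence();
        READY.setOpaque(this, true);
    }

    int awaitAndRead() {
        // An opaque read is performed on every iteration; a plain read could be hoisted.
        while (!(boolean) READY.getOpaque(this)) {
            Thread.onSpinWait();
        }
        // Loads before this fence are not reordered with loads and stores after it.
        VarHandle.acquireFence();
        return data;
    }

    // Publishes value from the calling thread and returns what a reader saw.
    static int roundTrip(int value) {
        FencedPublication p = new FencedPublication();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = p.awaitAndRead());
        reader.start();
        p.publish(value);
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return seen[0];
    }
}
```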