[concurrency-interest] AtomicReference.updateAndGet() mandatory updating
Nathan and Ila Reynolds
nathanila at gmail.com
Fri May 26 12:15:17 EDT 2017
The particular situation I encountered was on x86 hardware. However,
even on other memory architectures, replacing writes with full fences is
definitely faster. Here's why: x86 hardware implements the MESI cache
protocol, while these other architectures implement the MERSI cache
protocol.
Background information: On some (all?) x86 architectures, CAS is
implemented in the L3 cache. The core executing the CAS has to send a
message to the L3 cache and wait for the L3 cache to reply stating
whether the operation succeeded or failed. This explains why CAS takes
so many cycles. The benefit is that the L3 cache can keep the only copy
of the cache line, so all of the cores on the processor can send CAS
operations and get higher throughput. If CAS were instead implemented
in the L1 cache and the core's pipeline, it would take about 9 cycles:
3 cycles to load the data into the core's pipeline, 3 cycles to execute
the comparison, and 3 cycles to store the data into the cache line.
That would yield higher performance (lower CAS latency) but lower
overall throughput, since the cache line would have to travel among the
cores in the processor.
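For concreteness, here is a minimal sketch (in Java; class and variable
names are mine) of the CAS retry loop that a method like
AtomicReference.updateAndGet() performs. The real implementation
differs in detail, but every failed iteration pays the CAS cost
described above:

    import java.util.concurrent.atomic.AtomicReference;

    public class UpdateAndGetSketch {
        public static void main(String[] args) {
            AtomicReference<Integer> counter = new AtomicReference<>(0);

            // Roughly what counter.updateAndGet(x -> x + 1) does internally.
            Integer prev, next;
            do {
                prev = counter.get();   // volatile load; the line can stay Shared
                next = prev + 1;        // compute the new value locally
            } while (!counter.compareAndSet(prev, next)); // CAS wants the line in E/M

            System.out.println(counter.get()); // prints 1
        }
    }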
In both cache protocols, a write forces the cache line into the M
(modified) state. Thus, the cache line can exist in only one cache and
hence must move from L3 cache to L3 cache in order to be modified by the
cores in each processor. This causes cores to stall while they attempt
to get the cache line into their own L3 cache in the E (exclusive) or M
(modified) state.
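As a toy demonstration of that ping-pong (a sketch, not a rigorous
benchmark; names are mine), two threads CASing the same AtomicLong
force its cache line to bounce between their caches:

    import java.util.concurrent.atomic.AtomicLong;

    public class CasPingPongSketch {
        public static void main(String[] args) throws InterruptedException {
            AtomicLong shared = new AtomicLong();
            Runnable hammer = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    long v;
                    do {
                        v = shared.get();
                    } while (!shared.compareAndSet(v, v + 1)); // each CAS demands M state
                }
            };
            Thread t1 = new Thread(hammer);
            Thread t2 = new Thread(hammer);
            long start = System.nanoTime();
            t1.start(); t2.start();
            t1.join(); t2.join();
            // Compare against a single-threaded run to see the stall cost.
            System.out.printf("2,000,000 contended CASes: %d ms%n",
                              (System.nanoTime() - start) / 1_000_000);
        }
    }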
Instead, if we replace the write with a full fence, then the cache line
can stay longer in the S (shared) state. Thus, the cache line can exist
in all caches, and all of the cores can make progress at higher
throughput. The penalty is that they have to execute a full fence. The
number of cycles to execute a full fence is less than or equal to the
cost of stalling while waiting for a highly contended cache line; it
depends on the case. In the case where the loads and stores can proceed
quickly because all of the cache lines are available, the full fence
will have very little to no impact. In the case where a load or store
before the full fence is waiting for a cache line, the pipeline will
stall. If that load or store is waiting for the highly contended cache
line itself, then the full fence will be no faster (and may even be
slower).
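Since Java 9 this trade-off is visible at the API level. Below is a
sketch (my own, illustrative) of the "conditional fence on failure"
strategy Andrew Haley describes further down, built from a plain CAS
plus an explicit fence:

    import java.lang.invoke.VarHandle;
    import java.util.concurrent.atomic.AtomicReference;

    public class FenceOnFailureSketch {
        public static void main(String[] args) {
            AtomicReference<String> ref = new AtomicReference<>("old");

            // weakCompareAndSetPlain has plain memory semantics (no fences),
            // unlike compareAndSet, which acts as a volatile write even when
            // it fails.
            if (!ref.weakCompareAndSetPlain("expected", "new")) {
                // Pay for the full fence only on the failing path. (A real
                // implementation would also need acquire/release ordering on
                // the success path.)
                VarHandle.fullFence();
            }
        }
    }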
So, if we can replace writes with full fences, we get faster performance
and better cache scalability, since the cache line doesn't have to
travel among processors.
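The classic test-and-test-and-set pattern exploits the same effect from
the read side: spin on plain reads so the line stays Shared, and attempt
the write (a CAS) only when it looks likely to succeed. A minimal
sketch:

    import java.util.concurrent.atomic.AtomicBoolean;

    // Test-and-test-and-set spin lock: the read-only spin lets every
    // waiter keep the cache line in the S (shared) state; only the
    // occasional CAS pulls it into M.
    public class TtasLockSketch {
        private final AtomicBoolean locked = new AtomicBoolean(false);

        public void lock() {
            for (;;) {
                while (locked.get()) {      // spin on a read; line stays Shared
                    Thread.onSpinWait();
                }
                if (locked.compareAndSet(false, true)) {  // one write attempt
                    return;
                }
            }
        }

        public void unlock() {
            locked.set(false);              // volatile write releases the lock
        }
    }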
On 5/26/2017 6:56 AM, Andrew Dinn wrote:
> On 26/05/17 13:46, Doug Lea wrote:
>> On 05/26/2017 04:44 AM, Andrew Haley wrote:
>>>> a compareAndSet has the memory semantics of a volatile
>>>> write regardless of whether or not the write occurred.
>>> That property of compareAndSet is very ugly, and is a pain to
>>> implement. Most of the CAS instructions I know about don't have such
>>> a property, and making it work requires either an unconditional full
>>> fence after every CAS or a conditional one if it fails. Either way
>>> sucks mightily and is unlikely to be useful except in the most
>>> pathological cases.
>> Initially (in Java 5) requiring it has led to some questionable reliance.
>> So we cannot change it. But there's not much motivation to do so anyway:
>> As implied by Nathan Reynolds, encountering some (local) fence overhead
>> on CAS failure typically reduces contention and may improve throughput.
> It would be useful to know if that reduction in contention is specific
> to, say, x86 hardware or also occurs on weak memory architectures like
> AArch64 or ppc. Perhaps Nathan could clarify that?
> Andrew Dinn
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander