[concurrency-interest] The very best CAS loop

Paul Sandoz paul.sandoz at oracle.com
Mon Sep 26 21:07:46 EDT 2016


> On 26 Sep 2016, at 16:32, Peter Levart <peter.levart at gmail.com> wrote:
> 
> Hi,
> 
> I don't know about you, but on my i7 PC, I get consistently better results using strong CAS as opposed to weak CAS for the following benchmark:
> 
> http://cr.openjdk.java.net/~plevart/misc/GetAndUpdate/GetAndUpdateBench.java
> 
> The results:
> 
> Benchmark                                      Mode  Cnt   Score   Error  Units
> GetAndUpdateBench.dflt                         avgt   20  53.947 ± 0.770  ns/op
> GetAndUpdateBench.dflt:getAndUpdate1_dflt      avgt   20  54.103 ± 1.083  ns/op
> GetAndUpdateBench.dflt:getAndUpdate2_dflt      avgt   20  53.791 ± 1.014  ns/op
> GetAndUpdateBench.martin                       avgt   20  52.724 ± 0.218  ns/op
> GetAndUpdateBench.martin:getAndUpdate1_martin  avgt   20  52.379 ± 0.972  ns/op
> GetAndUpdateBench.martin:getAndUpdate2_martin  avgt   20  53.070 ± 1.005  ns/op
> GetAndUpdateBench.shade                        avgt   20  53.437 ± 0.624  ns/op
> GetAndUpdateBench.shade:getAndUpdate1_shade    avgt   20  53.401 ± 0.615  ns/op
> GetAndUpdateBench.shade:getAndUpdate2_shade    avgt   20  53.474 ± 0.652  ns/op
> GetAndUpdateBench.strong                       avgt   20  38.090 ± 0.727  ns/op
> GetAndUpdateBench.strong:getAndUpdate1_strong  avgt   20  37.784 ± 1.633  ns/op
> GetAndUpdateBench.strong:getAndUpdate2_strong  avgt   20  38.397 ± 1.691  ns/op
> 

For C2 on x86, the weak CAS intrinsic wires up in the same manner as the strong CAS intrinsic. IIRC for C1 on x86, weak CAS is not intrinsic.

On platforms/compilers that do not support the weak CAS intrinsic, the Unsafe weak CAS implementation defers to the Unsafe strong CAS.

I would be inclined to switch off tiered compilation and use parallel GC (not G1) and obtain results with a number of forks to see if that makes a difference. I am speculating, i really need to play with the benchmark and look at generated code/perfasm output.

I believe a Raspberry Pi 2 is based on an ARM Cortex-A7 processor, which is 32-bit, and i am unsure of the state of affairs w.r.t. HotSpot on that platform.

Paul.

> 
> The benchmark exhibits 2 threads that don't have data contention (each of them is modifying its own variable), but most probably those threads do modify the same cache line. The update function does have a non-zero cost.
> 
> To complement above results, I also ran the same code on Raspbery PI 2 with Oracle build of JDK 9 (b137):
> 
> Benchmark                                      Mode  Cnt     Score   Error  Units
> GetAndUpdateBench.dflt                         avgt   20  2347.811 ± 2.492  ns/op
> GetAndUpdateBench.dflt:getAndUpdate1_dflt      avgt   20  2347.328 ± 6.697  ns/op
> GetAndUpdateBench.dflt:getAndUpdate2_dflt      avgt   20  2348.294 ± 7.052  ns/op
> GetAndUpdateBench.martin                       avgt   20  2342.773 ± 5.618  ns/op
> GetAndUpdateBench.martin:getAndUpdate1_martin  avgt   20  2342.398 ± 8.662  ns/op
> GetAndUpdateBench.martin:getAndUpdate2_martin  avgt   20  2343.148 ± 9.514  ns/op
> GetAndUpdateBench.shade                        avgt   20  1844.426 ± 4.304  ns/op
> GetAndUpdateBench.shade:getAndUpdate1_shade    avgt   20  1841.747 ± 5.287  ns/op
> GetAndUpdateBench.shade:getAndUpdate2_shade    avgt   20  1847.105 ± 6.416  ns/op
> GetAndUpdateBench.strong                       avgt   20  1846.413 ± 1.155  ns/op
> GetAndUpdateBench.strong:getAndUpdate1_strong  avgt   20  1846.312 ± 8.095  ns/op
> GetAndUpdateBench.strong:getAndUpdate2_strong  avgt   20  1846.513 ± 8.110  ns/op
> 
> 
> Here we may be observing a strange phenomena. Reexecuting the non-zero-cost update function on spurious failure (shade) surprisingly gives better throughput than immediately retrying weak CAS (dflt, martin).
> 
> Also, wasn't strong CAS on ARM supposed to be mapped to a weak CAS loop? Why the difference between dflt/martin and strong then?
> 
> You've got something to study now... ;-)
> 
> Regards, Peter
> 
> 
> 
> On 09/26/2016 07:51 PM, Martin Buchholz wrote:
>> 
>> 
>> On Mon, Sep 26, 2016 at 10:36 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>> 
>>> On 26 Sep 2016, at 10:01, Martin Buchholz <martinrb at google.com> wrote:
>>> 
>>> Hi Aleksey,
>>> 
>>> My goals were to
>>> - have the shortest code path if the CAS succeeds the first time, which is hopefully the common case
>>> - never call the update function multiple times in the case of spurious failure, only for real contention.  We don't know how expensive the update function is, and the number of times it is called is user-detectable.
>>> - yes, I have sort-of reconstructed the retry loop that strong CAS does with LL/SC, BUT we can save the value obtained from get(), which does not happen in the emulation below.  We don't have multiple value return in java!
>>> 
>>>     public final V getAndUpdate(UnaryOperator<V> updateFunction) {
>>>         for (;;) {
>>>             V prev = get();
>>>             V next = updateFunction.apply(prev);
>>>             if (strongCas(prev, next))
>>>                 return prev;
>>>         }
>>>     }
>>>     /** Implement strong CAS using weak CAS. */
>>>     boolean strongCas(V expectedValue, V newValue) {
>>>         while (get() == expectedValue) {
>>>             if (weakCompareAndSetVolatile(expectedValue, newValue))
>>>                 return true;
>>>         }
>>>         return false;
>>>     }
>>> 
>> 
>> If you are gonna do why not just call “compareAndSet" instead of emulating with your own “strongCas” method? and then, i presume, we are back to where we started.
>> 
>> 
>> That code was just demonstrating how the return value from get() was getting lost, whereas if you inline a weak cas loop as I have done, you save the extra call to get().
>> 
>> I don’t think the “updateFunction” can tell a spurious failure from another failure given the variable, under contention, can change from and back to the previous between the “strongCas” call.
>> 
>> I'm imagining user code that counts all the calls to getAndUpdate, and all the calls to updateFunction.  If they differ, I would like that to be evidence of true contention.   True, we tell users updateFunction is supposed to be side-effect-free.  OTOH, we say that attempted updates may fail due to "contention among threads", suggesting we don't call updateFunction spuriously.
>> 
>> 
>> _______________________________________________
>> Concurrency-interest mailing list
>> 
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://cs.oswego.edu/pipermail/concurrency-interest/attachments/20160926/4129072d/attachment-0001.sig>


More information about the Concurrency-interest mailing list