[concurrency-interest] The very best CAS loop

Peter Levart peter.levart at gmail.com
Tue Sep 27 06:44:35 EDT 2016


What I was hoping to measure with the benchmark is some difference 
between using weak CAS and strong CAS in a getAndUpdate retry loop that 
always invokes the update function (even on spurious failures of weak 
CAS). I can understand that this is not possible on x86_64 as both are 
mapped to the same strong CAS intrinsic. But I also haven't observed any 
meaningful difference on arm32. So is it possible that Oracle's JDK 9 
b137 for arm32 also maps both weak and strong CAS to the same 
implementation which is a strong CAS simulation? Is there any platform 
where weak CAS is actually a weak CAS currently?


Regards, Peter


On 09/27/2016 11:39 AM, Peter Levart wrote:
> Hi,
>
> Ah, the pitfalls of benchmarking...
>
> Even though I ran the benchmarks for several times and got consistent 
> results, it seems that the consistency was accidental. As I said: "The 
> benchmark exhibits 2 threads that don't have data contention (each of 
> them is modifying its own variable), but most probably those threads 
> do modify the same cache line."
>
> "most probably" == "not necessarily"!!! Murphy interfering with 
> "strong" benchmark it was. The Vars object was consistently allocated 
> at an address so that value1 and value2 were not in the same cache 
> line only in "strong" benchmark. Corrected benchmark that guarantees 
> same cache-line for two int values in an aligned direct byte buffer:
>
> http://cr.openjdk.java.net/~plevart/misc/GetAndUpdate/GetAndUpdateBench2.java
>
> ...shows more uniform results on x86_64:
>
> Benchmark                                       Mode  Cnt Score   
> Error  Units
> GetAndUpdateBench2.dflt                         avgt 20  56.683 ± 
> 0.539  ns/op
> GetAndUpdateBench2.dflt:getAndUpdate1_dflt      avgt 20  56.703 ± 
> 0.640  ns/op
> GetAndUpdateBench2.dflt:getAndUpdate2_dflt      avgt 20  56.662 ± 
> 0.519  ns/op
> GetAndUpdateBench2.martin                       avgt 20  56.343 ± 
> 0.496  ns/op
> GetAndUpdateBench2.martin:getAndUpdate1_martin  avgt 20  56.382 ± 
> 0.486  ns/op
> GetAndUpdateBench2.martin:getAndUpdate2_martin  avgt 20  56.305 ± 
> 0.520  ns/op
> GetAndUpdateBench2.shade                        avgt 20  55.447 ± 
> 0.398  ns/op
> GetAndUpdateBench2.shade:getAndUpdate1_shade    avgt 20  55.408 ± 
> 0.526  ns/op
> GetAndUpdateBench2.shade:getAndUpdate2_shade    avgt 20  55.486 ± 
> 0.506  ns/op
> GetAndUpdateBench2.strong                       avgt 20  55.689 ± 
> 0.617  ns/op
> GetAndUpdateBench2.strong:getAndUpdate1_strong  avgt 20  55.612 ± 
> 0.601  ns/op
> GetAndUpdateBench2.strong:getAndUpdate2_strong  avgt 20  55.766 ± 
> 0.701  ns/op
>
>
> And on Raspberry PI 2 (ARMv7) too:
>
> Benchmark                                       Mode  Cnt Score    
> Error  Units
> GetAndUpdateBench2.dflt                         avgt 20  1953.743 ±  
> 3.430  ns/op
> GetAndUpdateBench2.dflt:getAndUpdate1_dflt      avgt 20  1953.332 ±  
> 8.565  ns/op
> GetAndUpdateBench2.dflt:getAndUpdate2_dflt      avgt 20  1954.154 ±  
> 8.334  ns/op
> GetAndUpdateBench2.martin                       avgt 20  1955.770 ±  
> 6.025  ns/op
> GetAndUpdateBench2.martin:getAndUpdate1_martin  avgt 20  1956.913 ± 
> 10.290  ns/op
> GetAndUpdateBench2.martin:getAndUpdate2_martin  avgt 20  1954.626 ±  
> 6.957  ns/op
> GetAndUpdateBench2.shade                        avgt 20  1959.441 ±  
> 2.337  ns/op
> GetAndUpdateBench2.shade:getAndUpdate1_shade    avgt 20  1961.199 ±  
> 6.367  ns/op
> GetAndUpdateBench2.shade:getAndUpdate2_shade    avgt 20  1957.682 ±  
> 5.713  ns/op
> GetAndUpdateBench2.strong                       avgt 20  1960.668 ±  
> 8.764  ns/op
> GetAndUpdateBench2.strong:getAndUpdate1_strong  avgt 20  1959.773 ± 
> 10.804  ns/op
> GetAndUpdateBench2.strong:getAndUpdate2_strong  avgt 20  1961.564 ±  
> 8.086  ns/op
>
>
> Regards, Peter
>
>
> On 09/27/2016 03:07 AM, Paul Sandoz wrote:
>>> On 26 Sep 2016, at 16:32, Peter Levart<peter.levart at gmail.com>  wrote:
>>>
>>> Hi,
>>>
>>> I don't know about you, but on my i7 PC, I get consistently better results using strong CAS as opposed to weak CAS for the following benchmark:
>>>
>>> http://cr.openjdk.java.net/~plevart/misc/GetAndUpdate/GetAndUpdateBench.java
>>>
>>> The results:
>>>
>>> Benchmark                                      Mode  Cnt   Score   Error  Units
>>> GetAndUpdateBench.dflt                         avgt   20  53.947 ± 0.770  ns/op
>>> GetAndUpdateBench.dflt:getAndUpdate1_dflt      avgt   20  54.103 ± 1.083  ns/op
>>> GetAndUpdateBench.dflt:getAndUpdate2_dflt      avgt   20  53.791 ± 1.014  ns/op
>>> GetAndUpdateBench.martin                       avgt   20  52.724 ± 0.218  ns/op
>>> GetAndUpdateBench.martin:getAndUpdate1_martin  avgt   20  52.379 ± 0.972  ns/op
>>> GetAndUpdateBench.martin:getAndUpdate2_martin  avgt   20  53.070 ± 1.005  ns/op
>>> GetAndUpdateBench.shade                        avgt   20  53.437 ± 0.624  ns/op
>>> GetAndUpdateBench.shade:getAndUpdate1_shade    avgt   20  53.401 ± 0.615  ns/op
>>> GetAndUpdateBench.shade:getAndUpdate2_shade    avgt   20  53.474 ± 0.652  ns/op
>>> GetAndUpdateBench.strong                       avgt   20  38.090 ± 0.727  ns/op
>>> GetAndUpdateBench.strong:getAndUpdate1_strong  avgt   20  37.784 ± 1.633  ns/op
>>> GetAndUpdateBench.strong:getAndUpdate2_strong  avgt   20  38.397 ± 1.691  ns/op
>>>
>> For C2 on x86, the weak CAS intrinsic wires up in the same manner as the strong CAS intrinsic. IIRC for C1 on x86, weak CAS is not intrinsic.
>>
>> On platforms/compilers that do not support the weak CAS intrinsic, the Unsafe weak CAS implementation defers to the Unsafe strong CAS.
>>
>> I would be inclined to switch off tiered compilation and use parallel GC (not G1) and obtain results with a number of forks to see if that makes a difference. I am speculating, i really need to play with the benchmark and look at generated code/perfasm output.
>>
>> I believe a Raspberry Pi 2 is based on an ARM Cortex-A7 processor, which is 32-bit, and i am unsure of the state of affairs w.r.t. HotSpot on that platform.
>>
>> Paul.
>>
>>> The benchmark exhibits 2 threads that don't have data contention (each of them is modifying its own variable), but most probably those threads do modify the same cache line. The update function does have a non-zero cost.
>>>
>>> To complement above results, I also ran the same code on Raspbery PI 2 with Oracle build of JDK 9 (b137):
>>>
>>> Benchmark                                      Mode  Cnt     Score   Error  Units
>>> GetAndUpdateBench.dflt                         avgt   20  2347.811 ± 2.492  ns/op
>>> GetAndUpdateBench.dflt:getAndUpdate1_dflt      avgt   20  2347.328 ± 6.697  ns/op
>>> GetAndUpdateBench.dflt:getAndUpdate2_dflt      avgt   20  2348.294 ± 7.052  ns/op
>>> GetAndUpdateBench.martin                       avgt   20  2342.773 ± 5.618  ns/op
>>> GetAndUpdateBench.martin:getAndUpdate1_martin  avgt   20  2342.398 ± 8.662  ns/op
>>> GetAndUpdateBench.martin:getAndUpdate2_martin  avgt   20  2343.148 ± 9.514  ns/op
>>> GetAndUpdateBench.shade                        avgt   20  1844.426 ± 4.304  ns/op
>>> GetAndUpdateBench.shade:getAndUpdate1_shade    avgt   20  1841.747 ± 5.287  ns/op
>>> GetAndUpdateBench.shade:getAndUpdate2_shade    avgt   20  1847.105 ± 6.416  ns/op
>>> GetAndUpdateBench.strong                       avgt   20  1846.413 ± 1.155  ns/op
>>> GetAndUpdateBench.strong:getAndUpdate1_strong  avgt   20  1846.312 ± 8.095  ns/op
>>> GetAndUpdateBench.strong:getAndUpdate2_strong  avgt   20  1846.513 ± 8.110  ns/op
>>>
>>>
>>> Here we may be observing a strange phenomena. Reexecuting the non-zero-cost update function on spurious failure (shade) surprisingly gives better throughput than immediately retrying weak CAS (dflt, martin).
>>>
>>> Also, wasn't strong CAS on ARM supposed to be mapped to a weak CAS loop? Why the difference between dflt/martin and strong then?
>>>
>>> You've got something to study now... ;-)
>>>
>>> Regards, Peter
>>>
>>>
>>>
>>> On 09/26/2016 07:51 PM, Martin Buchholz wrote:
>>>> On Mon, Sep 26, 2016 at 10:36 AM, Paul Sandoz<paul.sandoz at oracle.com>  wrote:
>>>>
>>>>> On 26 Sep 2016, at 10:01, Martin Buchholz<martinrb at google.com>  wrote:
>>>>>
>>>>> Hi Aleksey,
>>>>>
>>>>> My goals were to
>>>>> - have the shortest code path if the CAS succeeds the first time, which is hopefully the common case
>>>>> - never call the update function multiple times in the case of spurious failure, only for real contention.  We don't know how expensive the update function is, and the number of times it is called is user-detectable.
>>>>> - yes, I have sort-of reconstructed the retry loop that strong CAS does with LL/SC, BUT we can save the value obtained from get(), which does not happen in the emulation below.  We don't have multiple value return in java!
>>>>>
>>>>>      public final V getAndUpdate(UnaryOperator<V> updateFunction) {
>>>>>          for (;;) {
>>>>>              V prev = get();
>>>>>              V next = updateFunction.apply(prev);
>>>>>              if (strongCas(prev, next))
>>>>>                  return prev;
>>>>>          }
>>>>>      }
>>>>>      /** Implement strong CAS using weak CAS. */
>>>>>      boolean strongCas(V expectedValue, V newValue) {
>>>>>          while (get() == expectedValue) {
>>>>>              if (weakCompareAndSetVolatile(expectedValue, newValue))
>>>>>                  return true;
>>>>>          }
>>>>>          return false;
>>>>>      }
>>>>>
>>>> If you are gonna do why not just call “compareAndSet" instead of emulating with your own “strongCas” method? and then, i presume, we are back to where we started.
>>>>
>>>>
>>>> That code was just demonstrating how the return value from get() was getting lost, whereas if you inline a weak cas loop as I have done, you save the extra call to get().
>>>>
>>>> I don’t think the “updateFunction” can tell a spurious failure from another failure given the variable, under contention, can change from and back to the previous between the “strongCas” call.
>>>>
>>>> I'm imagining user code that counts all the calls to getAndUpdate, and all the calls to updateFunction.  If they differ, I would like that to be evidence of true contention.   True, we tell users updateFunction is supposed to be side-effect-free.  OTOH, we say that attempted updates may fail due to "contention among threads", suggesting we don't call updateFunction spuriously.
>>>>
>>>>
>>>> _______________________________________________
>>>> Concurrency-interest mailing list
>>>>
>>>> Concurrency-interest at cs.oswego.edu
>>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://cs.oswego.edu/pipermail/concurrency-interest/attachments/20160927/0c3e8862/attachment.html>


More information about the Concurrency-interest mailing list