[concurrency-interest] The very best CAS loop

Peter Levart peter.levart at gmail.com
Tue Sep 27 05:39:42 EDT 2016


Hi,

Ah, the pitfalls of benchmarking...

Even though I ran the benchmarks for several times and got consistent 
results, it seems that the consistency was accidental. As I said: "The 
benchmark exhibits 2 threads that don't have data contention (each of 
them is modifying its own variable), but most probably those threads do 
modify the same cache line."

"most probably" == "not necessarily"!!! Murphy interfering with "strong" 
benchmark it was. The Vars object was consistently allocated at an 
address so that value1 and value2 were not in the same cache line only 
in "strong" benchmark. Corrected benchmark that guarantees same 
cache-line for two int values in an aligned direct byte buffer:

http://cr.openjdk.java.net/~plevart/misc/GetAndUpdate/GetAndUpdateBench2.java

...shows more uniform results on x86_64:

Benchmark                                       Mode  Cnt Score   Error  
Units
GetAndUpdateBench2.dflt                         avgt   20 56.683 ± 
0.539  ns/op
GetAndUpdateBench2.dflt:getAndUpdate1_dflt      avgt   20 56.703 ± 
0.640  ns/op
GetAndUpdateBench2.dflt:getAndUpdate2_dflt      avgt   20 56.662 ± 
0.519  ns/op
GetAndUpdateBench2.martin                       avgt   20 56.343 ± 
0.496  ns/op
GetAndUpdateBench2.martin:getAndUpdate1_martin  avgt   20 56.382 ± 
0.486  ns/op
GetAndUpdateBench2.martin:getAndUpdate2_martin  avgt   20 56.305 ± 
0.520  ns/op
GetAndUpdateBench2.shade                        avgt   20 55.447 ± 
0.398  ns/op
GetAndUpdateBench2.shade:getAndUpdate1_shade    avgt   20 55.408 ± 
0.526  ns/op
GetAndUpdateBench2.shade:getAndUpdate2_shade    avgt   20 55.486 ± 
0.506  ns/op
GetAndUpdateBench2.strong                       avgt   20 55.689 ± 
0.617  ns/op
GetAndUpdateBench2.strong:getAndUpdate1_strong  avgt   20 55.612 ± 
0.601  ns/op
GetAndUpdateBench2.strong:getAndUpdate2_strong  avgt   20 55.766 ± 
0.701  ns/op


And on Raspberry PI 2 (ARMv7) too:

Benchmark                                       Mode  Cnt Score    
Error  Units
GetAndUpdateBench2.dflt                         avgt   20 1953.743 ±  
3.430  ns/op
GetAndUpdateBench2.dflt:getAndUpdate1_dflt      avgt   20 1953.332 ±  
8.565  ns/op
GetAndUpdateBench2.dflt:getAndUpdate2_dflt      avgt   20 1954.154 ±  
8.334  ns/op
GetAndUpdateBench2.martin                       avgt   20 1955.770 ±  
6.025  ns/op
GetAndUpdateBench2.martin:getAndUpdate1_martin  avgt   20 1956.913 ± 
10.290  ns/op
GetAndUpdateBench2.martin:getAndUpdate2_martin  avgt   20 1954.626 ±  
6.957  ns/op
GetAndUpdateBench2.shade                        avgt   20 1959.441 ±  
2.337  ns/op
GetAndUpdateBench2.shade:getAndUpdate1_shade    avgt   20 1961.199 ±  
6.367  ns/op
GetAndUpdateBench2.shade:getAndUpdate2_shade    avgt   20 1957.682 ±  
5.713  ns/op
GetAndUpdateBench2.strong                       avgt   20 1960.668 ±  
8.764  ns/op
GetAndUpdateBench2.strong:getAndUpdate1_strong  avgt   20 1959.773 ± 
10.804  ns/op
GetAndUpdateBench2.strong:getAndUpdate2_strong  avgt   20 1961.564 ±  
8.086  ns/op


Regards, Peter


On 09/27/2016 03:07 AM, Paul Sandoz wrote:
>> On 26 Sep 2016, at 16:32, Peter Levart <peter.levart at gmail.com> wrote:
>>
>> Hi,
>>
>> I don't know about you, but on my i7 PC, I get consistently better results using strong CAS as opposed to weak CAS for the following benchmark:
>>
>> http://cr.openjdk.java.net/~plevart/misc/GetAndUpdate/GetAndUpdateBench.java
>>
>> The results:
>>
>> Benchmark                                      Mode  Cnt   Score   Error  Units
>> GetAndUpdateBench.dflt                         avgt   20  53.947 ± 0.770  ns/op
>> GetAndUpdateBench.dflt:getAndUpdate1_dflt      avgt   20  54.103 ± 1.083  ns/op
>> GetAndUpdateBench.dflt:getAndUpdate2_dflt      avgt   20  53.791 ± 1.014  ns/op
>> GetAndUpdateBench.martin                       avgt   20  52.724 ± 0.218  ns/op
>> GetAndUpdateBench.martin:getAndUpdate1_martin  avgt   20  52.379 ± 0.972  ns/op
>> GetAndUpdateBench.martin:getAndUpdate2_martin  avgt   20  53.070 ± 1.005  ns/op
>> GetAndUpdateBench.shade                        avgt   20  53.437 ± 0.624  ns/op
>> GetAndUpdateBench.shade:getAndUpdate1_shade    avgt   20  53.401 ± 0.615  ns/op
>> GetAndUpdateBench.shade:getAndUpdate2_shade    avgt   20  53.474 ± 0.652  ns/op
>> GetAndUpdateBench.strong                       avgt   20  38.090 ± 0.727  ns/op
>> GetAndUpdateBench.strong:getAndUpdate1_strong  avgt   20  37.784 ± 1.633  ns/op
>> GetAndUpdateBench.strong:getAndUpdate2_strong  avgt   20  38.397 ± 1.691  ns/op
>>
> For C2 on x86, the weak CAS intrinsic wires up in the same manner as the strong CAS intrinsic. IIRC for C1 on x86, weak CAS is not intrinsic.
>
> On platforms/compilers that do not support the weak CAS intrinsic, the Unsafe weak CAS implementation defers to the Unsafe strong CAS.
>
> I would be inclined to switch off tiered compilation and use parallel GC (not G1) and obtain results with a number of forks to see if that makes a difference. I am speculating, i really need to play with the benchmark and look at generated code/perfasm output.
>
> I believe a Raspberry Pi 2 is based on an ARM Cortex-A7 processor, which is 32-bit, and i am unsure of the state of affairs w.r.t. HotSpot on that platform.
>
> Paul.
>
>> The benchmark exhibits 2 threads that don't have data contention (each of them is modifying its own variable), but most probably those threads do modify the same cache line. The update function does have a non-zero cost.
>>
>> To complement above results, I also ran the same code on Raspbery PI 2 with Oracle build of JDK 9 (b137):
>>
>> Benchmark                                      Mode  Cnt     Score   Error  Units
>> GetAndUpdateBench.dflt                         avgt   20  2347.811 ± 2.492  ns/op
>> GetAndUpdateBench.dflt:getAndUpdate1_dflt      avgt   20  2347.328 ± 6.697  ns/op
>> GetAndUpdateBench.dflt:getAndUpdate2_dflt      avgt   20  2348.294 ± 7.052  ns/op
>> GetAndUpdateBench.martin                       avgt   20  2342.773 ± 5.618  ns/op
>> GetAndUpdateBench.martin:getAndUpdate1_martin  avgt   20  2342.398 ± 8.662  ns/op
>> GetAndUpdateBench.martin:getAndUpdate2_martin  avgt   20  2343.148 ± 9.514  ns/op
>> GetAndUpdateBench.shade                        avgt   20  1844.426 ± 4.304  ns/op
>> GetAndUpdateBench.shade:getAndUpdate1_shade    avgt   20  1841.747 ± 5.287  ns/op
>> GetAndUpdateBench.shade:getAndUpdate2_shade    avgt   20  1847.105 ± 6.416  ns/op
>> GetAndUpdateBench.strong                       avgt   20  1846.413 ± 1.155  ns/op
>> GetAndUpdateBench.strong:getAndUpdate1_strong  avgt   20  1846.312 ± 8.095  ns/op
>> GetAndUpdateBench.strong:getAndUpdate2_strong  avgt   20  1846.513 ± 8.110  ns/op
>>
>>
>> Here we may be observing a strange phenomena. Reexecuting the non-zero-cost update function on spurious failure (shade) surprisingly gives better throughput than immediately retrying weak CAS (dflt, martin).
>>
>> Also, wasn't strong CAS on ARM supposed to be mapped to a weak CAS loop? Why the difference between dflt/martin and strong then?
>>
>> You've got something to study now... ;-)
>>
>> Regards, Peter
>>
>>
>>
>> On 09/26/2016 07:51 PM, Martin Buchholz wrote:
>>> On Mon, Sep 26, 2016 at 10:36 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>>>
>>>> On 26 Sep 2016, at 10:01, Martin Buchholz <martinrb at google.com> wrote:
>>>>
>>>> Hi Aleksey,
>>>>
>>>> My goals were to
>>>> - have the shortest code path if the CAS succeeds the first time, which is hopefully the common case
>>>> - never call the update function multiple times in the case of spurious failure, only for real contention.  We don't know how expensive the update function is, and the number of times it is called is user-detectable.
>>>> - yes, I have sort-of reconstructed the retry loop that strong CAS does with LL/SC, BUT we can save the value obtained from get(), which does not happen in the emulation below.  We don't have multiple value return in java!
>>>>
>>>>      public final V getAndUpdate(UnaryOperator<V> updateFunction) {
>>>>          for (;;) {
>>>>              V prev = get();
>>>>              V next = updateFunction.apply(prev);
>>>>              if (strongCas(prev, next))
>>>>                  return prev;
>>>>          }
>>>>      }
>>>>      /** Implement strong CAS using weak CAS. */
>>>>      boolean strongCas(V expectedValue, V newValue) {
>>>>          while (get() == expectedValue) {
>>>>              if (weakCompareAndSetVolatile(expectedValue, newValue))
>>>>                  return true;
>>>>          }
>>>>          return false;
>>>>      }
>>>>
>>> If you are gonna do why not just call “compareAndSet" instead of emulating with your own “strongCas” method? and then, i presume, we are back to where we started.
>>>
>>>
>>> That code was just demonstrating how the return value from get() was getting lost, whereas if you inline a weak cas loop as I have done, you save the extra call to get().
>>>
>>> I don’t think the “updateFunction” can tell a spurious failure from another failure given the variable, under contention, can change from and back to the previous between the “strongCas” call.
>>>
>>> I'm imagining user code that counts all the calls to getAndUpdate, and all the calls to updateFunction.  If they differ, I would like that to be evidence of true contention.   True, we tell users updateFunction is supposed to be side-effect-free.  OTOH, we say that attempted updates may fail due to "contention among threads", suggesting we don't call updateFunction spuriously.
>>>
>>>
>>> _______________________________________________
>>> Concurrency-interest mailing list
>>>
>>> Concurrency-interest at cs.oswego.edu
>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://cs.oswego.edu/pipermail/concurrency-interest/attachments/20160927/db826d1a/attachment-0001.html>


More information about the Concurrency-interest mailing list