[concurrency-interest] The very best CAS loop

Peter Levart peter.levart at gmail.com
Wed Sep 28 10:17:28 EDT 2016


Hi Andrew,

If you look at what GetAndUpdateBench actually does, you'll see 
that *all* weak CAS failures (100 %) should be spurious and that there 
should be *no* strong CAS failures: the benchmark runs 2 threads and 
each of them accesses its own variable.
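
For reference, the two loop shapes being compared look roughly like this 
(a sketch against the JDK 9 AtomicInteger API, not the benchmark code 
itself; the names are mine):

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.function.IntUnaryOperator;

    class CasLoops {
        // Strong-CAS loop: compareAndSet only fails when another thread
        // really changed the value, so updateFn is recomputed on every retry.
        static int getAndUpdateStrong(AtomicInteger a, IntUnaryOperator updateFn) {
            int prev, next;
            do {
                prev = a.get();
                next = updateFn.applyAsInt(prev);
            } while (!a.compareAndSet(prev, next));
            return prev;
        }

        // Weak-CAS loop: weakCompareAndSetVolatile may also fail spuriously.
        // After a failure we re-read; if the value is unchanged, the failure
        // was spurious and the already computed 'next' is still valid.
        static int getAndUpdateWeak(AtomicInteger a, IntUnaryOperator updateFn) {
            int prev = a.get(), next = 0;
            for (boolean haveNext = false;;) {
                if (!haveNext)
                    next = updateFn.applyAsInt(prev);
                if (a.weakCompareAndSetVolatile(prev, next))
                    return prev;
                haveNext = (prev == (prev = a.get()));
            }
        }
    }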

The problem with this benchmark is, I think, that all platforms tried so 
far natively implement either strong CAS or weak CAS, but not both (I 
might be wrong; that's just what I have observed so far). On Intel, weak 
CAS delegates to strong CAS, and on AArch64, strong CAS is emulated with 
a weak CAS / re-read / compare loop. So a benchmark that tries to 
show the difference between weak and strong CAS on a single platform 
just compares the same content in two different envelopes; the results 
should be comparable.
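
To illustrate the AArch64 side, a strong CAS can be emulated with a weak 
one roughly like this (just a sketch of the idea, not what HotSpot 
actually emits; vh is assumed to be an int field VarHandle):

    import java.lang.invoke.VarHandle;

    class StrongFromWeak {
        // The weak CAS / re-read / compare loop: retry while the failure
        // could still be spurious, i.e. while a re-read shows the value
        // is still the expected one.
        static boolean strongCompareAndSet(VarHandle vh, Object receiver,
                                           int expected, int next) {
            do {
                if (vh.weakCompareAndSet(receiver, expected, next))
                    return true;                          // CAS succeeded
            } while ((int) vh.getVolatile(receiver) == expected);
            return false;                                 // real failure
        }
    }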

The difference you get between dflt/martin and shade/strong is 
therefore probably just caused by false sharing (or the absence of it) 
between the fields of the allocated Vars object. Remember, each of the 
two threads accesses a different field. There is no data race (and no 
strong CAS failures); there is only false sharing, or not, which shows 
up as spurious failures with weak CASes or as cache contention with 
strong CASes.
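
A hedged picture of what I mean (the benchmark's actual Vars class may 
differ, and manual padding is fragile because the JVM may reorder 
fields; @Contended is the robust tool, but this shows the point):

    class Vars {
        volatile int value1;   // thread 1 CASes only this field
        // Without padding, value1 and value2 typically land on the same
        // 64-byte cache line, so the two threads false-share it.
        long p0, p1, p2, p3, p4, p5, p6, p7;
        volatile int value2;   // thread 2 CASes only this field
    }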

To verify my claims, I replaced the VarHandle view over a direct 
ByteBuffer with Unsafe accesses to the buffer's native memory in the 
following benchmark:

http://cr.openjdk.java.net/~plevart/misc/GetAndUpdate/GetAndUpdateBench3.java

compiled here:

http://cr.openjdk.java.net/~plevart/misc/GetAndUpdate/benchmarks.jar
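
In outline, the Unsafe variant does something like this (a sketch under 
my naming, not the actual benchmark code):

    import java.lang.reflect.Field;
    import java.util.function.IntUnaryOperator;
    import sun.misc.Unsafe;

    class UnsafeGetAndUpdate {
        static final Unsafe U;
        static {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                U = (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        // CAS loop directly against a native address (e.g. the address()
        // of a direct ByteBuffer) - no bounds or read-only checks.
        static int getAndUpdate(long address, IntUnaryOperator updateFn) {
            int prev, next;
            do {
                prev = U.getIntVolatile(null, address);
                next = updateFn.applyAsInt(prev);
            } while (!U.compareAndSwapInt(null, address, prev, next));
            return prev;
        }
    }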

Can you try it on AArch64? You can replace Blackhole.consumeCPU with your 
xorshift loop, but I don't think you'll get much different results.

On Intel (i7), the Unsafe variants are all comparable and just about 5% 
faster than the VarHandle variant, so I think even the VarHandle variant 
is dominated by the CAS and updateFn, not by the indirections or checks 
performed by the VarHandle view over the direct ByteBuffer:

Benchmark                                       (updateFnCpu)  Mode  Cnt    Score   Error  Units
GetAndUpdateBench3.dflt                                     1  avgt   20   44.269 ± 5.233  ns/op
GetAndUpdateBench3.dflt:getAndUpdate1_dflt                  1  avgt   20   44.661 ± 5.533  ns/op
GetAndUpdateBench3.dflt:getAndUpdate2_dflt                  1  avgt   20   43.877 ± 5.128  ns/op
GetAndUpdateBench3.dflt                                    10  avgt   20   48.481 ± 1.235  ns/op
GetAndUpdateBench3.dflt:getAndUpdate1_dflt                 10  avgt   20   48.021 ± 1.156  ns/op
GetAndUpdateBench3.dflt:getAndUpdate2_dflt                 10  avgt   20   48.942 ± 1.773  ns/op
GetAndUpdateBench3.dflt                                    20  avgt   20   54.645 ± 0.647  ns/op
GetAndUpdateBench3.dflt:getAndUpdate1_dflt                 20  avgt   20   54.546 ± 2.042  ns/op
GetAndUpdateBench3.dflt:getAndUpdate2_dflt                 20  avgt   20   54.744 ± 2.250  ns/op
GetAndUpdateBench3.dflt                                    50  avgt   20  111.442 ± 0.236  ns/op
GetAndUpdateBench3.dflt:getAndUpdate1_dflt                 50  avgt   20  111.143 ± 1.168  ns/op
GetAndUpdateBench3.dflt:getAndUpdate2_dflt                 50  avgt   20  111.740 ± 0.944  ns/op
GetAndUpdateBench3.dflt                                   100  avgt   20  202.829 ± 3.689  ns/op
GetAndUpdateBench3.dflt:getAndUpdate1_dflt                100  avgt   20  202.541 ± 4.309  ns/op
GetAndUpdateBench3.dflt:getAndUpdate2_dflt                100  avgt   20  203.117 ± 4.126  ns/op
GetAndUpdateBench3.martin                                   1  avgt   20   44.245 ± 0.295  ns/op
GetAndUpdateBench3.martin:getAndUpdate1_martin              1  avgt   20   44.254 ± 0.310  ns/op
GetAndUpdateBench3.martin:getAndUpdate2_martin              1  avgt   20   44.235 ± 0.289  ns/op
GetAndUpdateBench3.martin                                  10  avgt   20   46.241 ± 3.474  ns/op
GetAndUpdateBench3.martin:getAndUpdate1_martin             10  avgt   20   46.213 ± 3.682  ns/op
GetAndUpdateBench3.martin:getAndUpdate2_martin             10  avgt   20   46.270 ± 3.692  ns/op
GetAndUpdateBench3.martin                                  20  avgt   20   55.344 ± 1.307  ns/op
GetAndUpdateBench3.martin:getAndUpdate1_martin             20  avgt   20   56.396 ± 2.557  ns/op
GetAndUpdateBench3.martin:getAndUpdate2_martin             20  avgt   20   54.291 ± 1.630  ns/op
GetAndUpdateBench3.martin                                  50  avgt   20  111.164 ± 1.533  ns/op
GetAndUpdateBench3.martin:getAndUpdate1_martin             50  avgt   20  111.110 ± 2.465  ns/op
GetAndUpdateBench3.martin:getAndUpdate2_martin             50  avgt   20  111.217 ± 2.781  ns/op
GetAndUpdateBench3.martin                                 100  avgt   20  202.234 ± 0.912  ns/op
GetAndUpdateBench3.martin:getAndUpdate1_martin            100  avgt   20  201.866 ± 2.174  ns/op
GetAndUpdateBench3.martin:getAndUpdate2_martin            100  avgt   20  202.601 ± 2.502  ns/op
GetAndUpdateBench3.shade                                    1  avgt   20   43.170 ± 6.922  ns/op
GetAndUpdateBench3.shade:getAndUpdate1_shade                1  avgt   20   43.200 ± 6.937  ns/op
GetAndUpdateBench3.shade:getAndUpdate2_shade                1  avgt   20   43.140 ± 6.915  ns/op
GetAndUpdateBench3.shade                                   10  avgt   20   47.756 ± 1.122  ns/op
GetAndUpdateBench3.shade:getAndUpdate1_shade               10  avgt   20   47.364 ± 1.209  ns/op
GetAndUpdateBench3.shade:getAndUpdate2_shade               10  avgt   20   48.147 ± 1.329  ns/op
GetAndUpdateBench3.shade                                   20  avgt   20   54.876 ± 0.732  ns/op
GetAndUpdateBench3.shade:getAndUpdate1_shade               20  avgt   20   54.705 ± 1.152  ns/op
GetAndUpdateBench3.shade:getAndUpdate2_shade               20  avgt   20   55.047 ± 1.423  ns/op
GetAndUpdateBench3.shade                                   50  avgt   20  110.661 ± 0.177  ns/op
GetAndUpdateBench3.shade:getAndUpdate1_shade               50  avgt   20  111.847 ± 2.226  ns/op
GetAndUpdateBench3.shade:getAndUpdate2_shade               50  avgt   20  109.476 ± 2.115  ns/op
GetAndUpdateBench3.shade                                  100  avgt   20  203.791 ± 2.609  ns/op
GetAndUpdateBench3.shade:getAndUpdate1_shade              100  avgt   20  204.484 ± 4.302  ns/op
GetAndUpdateBench3.shade:getAndUpdate2_shade              100  avgt   20  203.097 ± 4.031  ns/op
GetAndUpdateBench3.strong                                   1  avgt   20   44.953 ± 0.785  ns/op
GetAndUpdateBench3.strong:getAndUpdate1_strong              1  avgt   20   44.989 ± 0.722  ns/op
GetAndUpdateBench3.strong:getAndUpdate2_strong              1  avgt   20   44.917 ± 0.926  ns/op
GetAndUpdateBench3.strong                                  10  avgt   20   46.965 ± 0.620  ns/op
GetAndUpdateBench3.strong:getAndUpdate1_strong             10  avgt   20   46.799 ± 1.144  ns/op
GetAndUpdateBench3.strong:getAndUpdate2_strong             10  avgt   20   47.130 ± 1.289  ns/op
GetAndUpdateBench3.strong                                  20  avgt   20   54.904 ± 0.956  ns/op
GetAndUpdateBench3.strong:getAndUpdate1_strong             20  avgt   20   55.006 ± 2.855  ns/op
GetAndUpdateBench3.strong:getAndUpdate2_strong             20  avgt   20   54.802 ± 2.413  ns/op
GetAndUpdateBench3.strong                                  50  avgt   20  109.759 ± 0.941  ns/op
GetAndUpdateBench3.strong:getAndUpdate1_strong             50  avgt   20  109.110 ± 2.792  ns/op
GetAndUpdateBench3.strong:getAndUpdate2_strong             50  avgt   20  110.408 ± 2.905  ns/op
GetAndUpdateBench3.strong                                 100  avgt   20  201.612 ± 0.881  ns/op
GetAndUpdateBench3.strong:getAndUpdate1_strong            100  avgt   20  199.593 ± 2.319  ns/op
GetAndUpdateBench3.strong:getAndUpdate2_strong            100  avgt   20  203.630 ± 3.172  ns/op


Regards, Peter


On 09/28/2016 03:08 PM, Andrew Haley wrote:
> I did a few experiments, with simpler code.
>
> The version with the direct byte buffer was not very efficient: it
> does many reads of memory unrelated to the actual thing we're trying
> to test.  This is a shame, given that one of the goals of JEP 193 is
> that the API should be at least as good as Unsafe.  Right now we're a
> long way from that goal.
>
> The test code using the DBB VarHandle does this:
>
>          spill
>          safepoint poll
>          DirectByteBuffer.limit
>          DirectByteBuffer.hb
>          DirectByteBuffer.address
>          {our int value}
>          spill
>          getAndUpdate_shade.updateFn
>          getAndUpdate_shade.updateFn.getClass()
>          ... call updateFn
>          Blackhole
>          .. call consumeCPU
>          spill
>          DirectByteBuffer.hb
>          DirectByteBuffer.address
>          DirectByteBuffer.isReadOnly
>
>          ... and finally
>
>          CAS {our int value}
>
> I used GetAndUpdateBench (which uses simple VarHandles, no
> ByteBuffers) and replaced the call to Blackhole.consumeCPU(20L) with
>
>          for (int i = 0; i < 10; i++) {
>              // Marsaglia xor-shift generator (the 21/35/4 shift
>              // triple is the 64-bit variant, so n is a long)
>              n ^= (n << 21);
>              n ^= (n >>> 35);
>              n ^= (n << 4);
>          }
>
> to try to get a better idea of the real performance of getAndUpdate
> with a (somewhat) time-consuming update function.  I did this because
> Blackhole.consumeCPU() causes all registers to be spilled to the
> stack.
>
> The results, AArch64:
>
> GetAndUpdateBench0.dflt                         avgt   20  160.454 ±  8.667  ns/op
> GetAndUpdateBench0.dflt:getAndUpdate1_dflt      avgt   20  160.459 ±  8.555  ns/op
> GetAndUpdateBench0.dflt:getAndUpdate2_dflt      avgt   20  160.450 ±  8.787  ns/op
> GetAndUpdateBench0.martin                       avgt   20  159.229 ± 10.803  ns/op
> GetAndUpdateBench0.martin:getAndUpdate1_martin  avgt   20  159.279 ± 10.774  ns/op
> GetAndUpdateBench0.martin:getAndUpdate2_martin  avgt   20  159.180 ± 10.833  ns/op
> GetAndUpdateBench0.shade                        avgt   20  136.533 ±  4.101  ns/op
> GetAndUpdateBench0.shade:getAndUpdate1_shade    avgt   20  136.811 ±  4.042  ns/op
> GetAndUpdateBench0.shade:getAndUpdate2_shade    avgt   20  136.255 ±  4.246  ns/op
> GetAndUpdateBench0.strong                       avgt   20  136.746 ±  4.992  ns/op
> GetAndUpdateBench0.strong:getAndUpdate1_strong  avgt   20  136.737 ±  5.007  ns/op
> GetAndUpdateBench0.strong:getAndUpdate2_strong  avgt   20  136.755 ±  4.985  ns/op
>
> strong and shade are the fastest, as I'd expect, because they do the
> least work.  dflt and martin have a lot of jitter, perhaps because of
> branch prediction.  I think shade would be better at times of high
> contention, but that is very hard to measure.
>
> Re the profile data for martin: I'm seeing "spurious" CAS failures at
> a rate of 270 out of 87018, i.e. 0.3%.
>
> I conclude that strong and shade are equally good unless there is high
> contention.  I haven't measured that.  I also haven't measured an
> expensive update function with high jitter.  I suspect that with high
> contention shade would win.  With no contention and no false sharing
> there is nothing to choose between them.
>
> If the update function was very expensive the rate of CAS failures
> might well rise and it would be worth going to heroic attempts to
> avoid calling it, but I suspect that such functions are unusual.
>
> Andrew.
>
>
> NB:
>
> Looking at the profile data, "spurious" CAS failures are pretty
> rare.  I do not believe that this is simply because we "got lucky" and
> the variables were on different cache lines, but it's possible.  On
> one run I saw times spectacularly better than these, and I ignored it.
> I added some padding to prevent the two variables sharing the same
> cache line, and:
>
> Benchmark                                       Mode  Cnt   Score   Error  Units
> GetAndUpdateBench0.dflt                         avgt   20  77.116 ± 0.120  ns/op
> GetAndUpdateBench0.dflt:getAndUpdate1_dflt      avgt   20  77.142 ± 0.159  ns/op
> GetAndUpdateBench0.dflt:getAndUpdate2_dflt      avgt   20  77.090 ± 0.112  ns/op
> GetAndUpdateBench0.martin                       avgt   20  76.865 ± 0.222  ns/op
> GetAndUpdateBench0.martin:getAndUpdate1_martin  avgt   20  76.840 ± 0.260  ns/op
> GetAndUpdateBench0.martin:getAndUpdate2_martin  avgt   20  76.889 ± 0.241  ns/op
> GetAndUpdateBench0.shade                        avgt   20  76.324 ± 0.096  ns/op
> GetAndUpdateBench0.shade:getAndUpdate1_shade    avgt   20  76.286 ± 0.015  ns/op
> GetAndUpdateBench0.shade:getAndUpdate2_shade    avgt   20  76.362 ± 0.183  ns/op
> GetAndUpdateBench0.strong                       avgt   20  76.303 ± 0.080  ns/op
> GetAndUpdateBench0.strong:getAndUpdate1_strong  avgt   20  76.330 ± 0.154  ns/op
> GetAndUpdateBench0.strong:getAndUpdate2_strong  avgt   20  76.277 ± 0.011  ns/op
>
> QED, I think.
>
>
> Appendix
>
> With trivial update function, no false sharing:
>
> Benchmark                                       Mode  Cnt   Score   Error  Units
> GetAndUpdateBench0.dflt                         avgt   20  44.842 ± 0.189  ns/op
> GetAndUpdateBench0.dflt:getAndUpdate1_dflt      avgt   20  44.830 ± 0.188  ns/op
> GetAndUpdateBench0.dflt:getAndUpdate2_dflt      avgt   20  44.855 ± 0.202  ns/op
> GetAndUpdateBench0.martin                       avgt   20  44.900 ± 0.182  ns/op
> GetAndUpdateBench0.martin:getAndUpdate1_martin  avgt   20  44.874 ± 0.338  ns/op
> GetAndUpdateBench0.martin:getAndUpdate2_martin  avgt   20  44.926 ± 0.342  ns/op
> GetAndUpdateBench0.shade                        avgt   20  44.626 ± 0.037  ns/op
> GetAndUpdateBench0.shade:getAndUpdate1_shade    avgt   20  44.622 ± 0.022  ns/op
> GetAndUpdateBench0.shade:getAndUpdate2_shade    avgt   20  44.630 ± 0.071  ns/op
> GetAndUpdateBench0.strong                       avgt   20  44.208 ± 0.051  ns/op
> GetAndUpdateBench0.strong:getAndUpdate1_strong  avgt   20  44.229 ± 0.097  ns/op
> GetAndUpdateBench0.strong:getAndUpdate2_strong  avgt   20  44.188 ± 0.014  ns/op
>
> With trivial update function and false sharing:
>
> # Run complete. Total time: 00:03:35
>
> Benchmark                                       Mode  Cnt    Score    Error  Units
> GetAndUpdateBench0.dflt                         avgt   20  145.885 ±  3.751  ns/op
> GetAndUpdateBench0.dflt:getAndUpdate1_dflt      avgt   20  149.685 ±  9.260  ns/op
> GetAndUpdateBench0.dflt:getAndUpdate2_dflt      avgt   20  142.085 ±  8.205  ns/op
> GetAndUpdateBench0.martin                       avgt   20  155.300 ± 10.353  ns/op
> GetAndUpdateBench0.martin:getAndUpdate1_martin  avgt   20  153.278 ± 32.496  ns/op
> GetAndUpdateBench0.martin:getAndUpdate2_martin  avgt   20  157.323 ± 35.911  ns/op
> GetAndUpdateBench0.shade                        avgt   20  140.776 ±  8.517  ns/op
> GetAndUpdateBench0.shade:getAndUpdate1_shade    avgt   20  140.115 ±  8.897  ns/op
> GetAndUpdateBench0.shade:getAndUpdate2_shade    avgt   20  141.438 ±  9.049  ns/op
> GetAndUpdateBench0.strong                       avgt   20  108.120 ±  3.737  ns/op
> GetAndUpdateBench0.strong:getAndUpdate1_strong  avgt   20  108.126 ±  3.735  ns/op
