[concurrency-interest] The very best CAS loop

Andrew Haley aph at redhat.com
Wed Sep 28 09:08:02 EDT 2016


I did a few experiments, with simpler code.

The version with the direct byte buffer was not very efficient: it
does many reads of memory unrelated to the actual thing we're trying
to test.  This is a shame, given that one of the goals of JEP 193 is
that the API should be at least as good as Unsafe.  Right now we'e a
long way from that goal.

The test code using the DBB VarHandle does this:

        spill
        safepoint poll
        DirectByteBuffer.limit
        DirectByteBuffer.hb
        DirectByteBuffer.address
        {our int value}
        spill
        getAndUpdate_shade.updateFn
        getAndUpdate_shade.updateFn.getClass()
        ... call updateFn
        Blackhole
        .. call consumeCPU
        spill
        DirectByteBuffer.hb
        DirectByteBuffer.address
        DirectByteBuffer.isReadOnly

        ... and finally

        CAS {our int value}

I used GetAndUpdateBench (which uses simple VarHandles, no
ByteBuffers) and replaced the call to Blackhole.consumeCPU(20L) with

        for (int i = 0; i < 10; i++) {
            // Marsaglia xor-shift generator
            n ^= (n << 21);
            n ^= (n >>> 35);
            n ^= (n << 4);
        }

to try to get a better idea of the real performance of getAndUpdate
with a (somewhat) time-consuming update function.  I did this because
Blackhole.consumeCPU() causes all registers to be spilled to the
stack.

The results, AArch64:

GetAndUpdateBench0.dflt                         avgt   20  160.454 ?  8.667  ns/op
GetAndUpdateBench0.dflt:getAndUpdate1_dflt      avgt   20  160.459 ?  8.555  ns/op
GetAndUpdateBench0.dflt:getAndUpdate2_dflt      avgt   20  160.450 ?  8.787  ns/op
GetAndUpdateBench0.martin                       avgt   20  159.229 ? 10.803  ns/op
GetAndUpdateBench0.martin:getAndUpdate1_martin  avgt   20  159.279 ? 10.774  ns/op
GetAndUpdateBench0.martin:getAndUpdate2_martin  avgt   20  159.180 ? 10.833  ns/op
GetAndUpdateBench0.shade                        avgt   20  136.533 ?  4.101  ns/op
GetAndUpdateBench0.shade:getAndUpdate1_shade    avgt   20  136.811 ?  4.042  ns/op
GetAndUpdateBench0.shade:getAndUpdate2_shade    avgt   20  136.255 ?  4.246  ns/op
GetAndUpdateBench0.strong                       avgt   20  136.746 ?  4.992  ns/op
GetAndUpdateBench0.strong:getAndUpdate1_strong  avgt   20  136.737 ?  5.007  ns/op
GetAndUpdateBench0.strong:getAndUpdate2_strong  avgt   20  136.755 ?  4.985  ns/op

strong and shade are the fastest, as I'd expect, because they do the
least work.  dflt and martin have a lot of jitter, perhaps because of
branch prediction.  I think shade would be better at times of high
contention, but that is very hard to measure.

Re the profile data for martin: I'm seeing "spurious" CAS failures in
the profile data of 87018:270, i.e. 0.3%.

I conclude that string and shade are equally good unless there is high
contention.  I haven't measured that.  I also haven't measured an
expensive update function with high jitter.  I suspect that with high
contention shade would win.  With no contention and no false sharing
there is nothing to choose between them.

If the update function was very expensive the rate of CAS failures
might well rise and it would be worth going to heroic attempts to
avoid calling it, but I suspect that such functions are unusual.

Andrew.


NB:

Looking at the profile data, "spurious" CAS failures are pretty
rare.  I do not believe that this is simply because we "got lucky" and
the variables were on different cache lines, but it's possible.  One
one run I saw times spectacularly better than these, and I ignored it.
I added some padding to prevent the two variables sharing the same
cache line, and:

Benchmark                                       Mode  Cnt   Score   Error  Units
GetAndUpdateBench0.dflt                         avgt   20  77.116 ? 0.120  ns/op
GetAndUpdateBench0.dflt:getAndUpdate1_dflt      avgt   20  77.142 ? 0.159  ns/op
GetAndUpdateBench0.dflt:getAndUpdate2_dflt      avgt   20  77.090 ? 0.112  ns/op
GetAndUpdateBench0.martin                       avgt   20  76.865 ? 0.222  ns/op
GetAndUpdateBench0.martin:getAndUpdate1_martin  avgt   20  76.840 ? 0.260  ns/op
GetAndUpdateBench0.martin:getAndUpdate2_martin  avgt   20  76.889 ? 0.241  ns/op
GetAndUpdateBench0.shade                        avgt   20  76.324 ? 0.096  ns/op
GetAndUpdateBench0.shade:getAndUpdate1_shade    avgt   20  76.286 ? 0.015  ns/op
GetAndUpdateBench0.shade:getAndUpdate2_shade    avgt   20  76.362 ? 0.183  ns/op
GetAndUpdateBench0.strong                       avgt   20  76.303 ? 0.080  ns/op
GetAndUpdateBench0.strong:getAndUpdate1_strong  avgt   20  76.330 ? 0.154  ns/op
GetAndUpdateBench0.strong:getAndUpdate2_strong  avgt   20  76.277 ? 0.011  ns/op

QED, I think.


Appendix

With trivial update function, no false sharing:

Benchmark                                       Mode  Cnt   Score   Error  Units
GetAndUpdateBench0.dflt                         avgt   20  44.842 ? 0.189  ns/op
GetAndUpdateBench0.dflt:getAndUpdate1_dflt      avgt   20  44.830 ? 0.188  ns/op
GetAndUpdateBench0.dflt:getAndUpdate2_dflt      avgt   20  44.855 ? 0.202  ns/op
GetAndUpdateBench0.martin                       avgt   20  44.900 ? 0.182  ns/op
GetAndUpdateBench0.martin:getAndUpdate1_martin  avgt   20  44.874 ? 0.338  ns/op
GetAndUpdateBench0.martin:getAndUpdate2_martin  avgt   20  44.926 ? 0.342  ns/op
GetAndUpdateBench0.shade                        avgt   20  44.626 ? 0.037  ns/op
GetAndUpdateBench0.shade:getAndUpdate1_shade    avgt   20  44.622 ? 0.022  ns/op
GetAndUpdateBench0.shade:getAndUpdate2_shade    avgt   20  44.630 ? 0.071  ns/op
GetAndUpdateBench0.strong                       avgt   20  44.208 ? 0.051  ns/op
GetAndUpdateBench0.strong:getAndUpdate1_strong  avgt   20  44.229 ? 0.097  ns/op
GetAndUpdateBench0.strong:getAndUpdate2_strong  avgt   20  44.188 ? 0.014  ns/op

With trivial update function and false sharing:

# Run complete. Total time: 00:03:35

Benchmark                                       Mode  Cnt    Score    Error  Units
GetAndUpdateBench0.dflt                         avgt   20  145.885 ?  3.751  ns/op
GetAndUpdateBench0.dflt:getAndUpdate1_dflt      avgt   20  149.685 ?  9.260  ns/op
GetAndUpdateBench0.dflt:getAndUpdate2_dflt      avgt   20  142.085 ?  8.205  ns/op
GetAndUpdateBench0.martin                       avgt   20  155.300 ? 10.353  ns/op
GetAndUpdateBench0.martin:getAndUpdate1_martin  avgt   20  153.278 ? 32.496  ns/op
GetAndUpdateBench0.martin:getAndUpdate2_martin  avgt   20  157.323 ? 35.911  ns/op
GetAndUpdateBench0.shade                        avgt   20  140.776 ?  8.517  ns/op
GetAndUpdateBench0.shade:getAndUpdate1_shade    avgt   20  140.115 ?  8.897  ns/op
GetAndUpdateBench0.shade:getAndUpdate2_shade    avgt   20  141.438 ?  9.049  ns/op
GetAndUpdateBench0.strong                       avgt   20  108.120 ?  3.737  ns/op
GetAndUpdateBench0.strong:getAndUpdate1_strong  avgt   20  108.126 ?  3.735  ns/op


More information about the Concurrency-interest mailing list