[concurrency-interest] AtomicReferenceArray.get() and intrinsics method inlining

Manuel Dominguez Sarmiento mads at renxo.com
Thu Jan 16 14:43:27 EST 2020


Hi group,

For the past few weeks, we've been attempting to diagnose a caching 
performance issue. Profiling with YourKit made it evident that the 
caching layer (EhCache 2.10.x) was the culprit. This caching library 
uses a fork of ConcurrentHashMapV8 (the pre-Java-8 backport) behind the 
scenes.

The ConcurrentHashMapV8 fork has been customized for internal EhCache 
usage; however, judging from the source, it does not differ much from 
the original ConcurrentHashMapV8. Sources for EhCache 2.10.5 can be 
downloaded from https://www.ehcache.org/downloads/
The ConcurrentHashMapV8 fork is 
net.sf.ehcache.util.concurrent.ConcurrentHashMap

Further profiling revealed that the hotspot was not within the 
ConcurrentHashMapV8 fork itself, but rather in 
j.u.c.atomic.AtomicReferenceArray.get()
Apparently, the EhCache team at some point replaced 
Unsafe.getObjectVolatile() with AtomicReferenceArray, presumably to 
avoid compiler warnings regarding the unsupported use of Unsafe (this 
was well before Java 9, the module system, Unsafe deprecation, etc.)

The logic seems straightforward and there are no obvious errors in their 
use of AtomicReferenceArray; the switch from Unsafe.getObjectVolatile() 
to AtomicReferenceArray.get() is functionally correct.
However, we ran the following micro-benchmark: we inserted a 
single entry [key = Integer.valueOf(1); value = new Object()] and then 
called map.get() a billion times with the same key.

The ConcurrentHashMapV8 took about 40 seconds on average to complete. 
However, stock Java 8 ConcurrentHashMap took only about 800 ms to 
complete the same task. Profiling showed that the ConcurrentHashMapV8 
fork hotspot was at AtomicReferenceArray.get(), which matched the issue 
we found in our production systems with very "hot" cache keys.
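For reference, the stock side of that comparison can be sketched roughly as follows (a minimal, hypothetical harness, not our exact benchmark code; class and method names are made up):

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the stock-CHM side of the comparison: time repeated
// get() calls on a single hot key, mirroring the setup described above.
public class ChmBench {

    public static long chmMillis(long repetitions) {
        ConcurrentHashMap<Integer, Object> map = new ConcurrentHashMap<>();
        map.put(Integer.valueOf(1), new Object());
        Object sink = null;
        long start = System.nanoTime();
        for (long i = 0; i < repetitions; i++) {
            sink = map.get(1); // small ints box via the shared Integer cache
        }
        long end = System.nanoTime();
        if (sink == null) {
            throw new AssertionError(); // keeps the loop from being dead code
        }
        return (end - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("chm=" + chmMillis(1_000_000_000L) + " ms");
    }
}
```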

We produced a simple test class to exclude everything EhCache-related, 
and we were able to reproduce the same behaviour with direct 
AtomicReferenceArray usage. This benchmark does not attempt to measure 
anything useful in the real world, but it is simple enough that the 
performance difference is easily reproducible:

///////////////////////////////////////////////////////////////////////////////////////////////////////////////
import java.lang.reflect.Field;
import java.util.concurrent.atomic.AtomicReferenceArray;

import sun.misc.Unsafe;

public class Test {
    private static final Unsafe UNSAFE;
    private static final int KEY = 1;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    public static void main(String[] args) {
        final long repetitions = 1_000_000_000L;
        System.out.println("ara=" + ara(repetitions) + " ms");
    }

    private static long ara(final long repetitions) {
        final AtomicReferenceArray<Object> ara =
                new AtomicReferenceArray<Object>(100);
        ara.set(KEY, new Object());
        long start = System.nanoTime();
        for (long i = 0; i < repetitions; i++) {
            ara.get(KEY);
        }
        long end = System.nanoTime();
        return (end - start) / (1000 * 1000);
    }
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////

This test took about 40 seconds on the exact same hardware as previous 
CHM tests. So we concluded that AtomicReferenceArray.get() usage instead 
of Unsafe.getObjectVolatile() was the cause behind the 
ConcurrentHashMapV8 EhCache fork being so much slower than Java8 stock 
ConcurrentHashMap.

The difference is so huge that we suspected this could be the result of 
compiler optimizations producing very different machine code, and 
possibly method inlining not working as expected with Unsafe intrinsics. 
This was dead on: we enabled the following JVM diagnostic options:
-server -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions 
-XX:+PrintInlining -XX:-TieredCompilation -XX:MaxInlineLevel=15

And the test yielded the following on standard output (only the relevant 
part is reproduced below):

com.renxo.cms.Test::ara @ 32 (65 bytes)
  @ 34 java.util.concurrent.atomic.AtomicReferenceArray::get (10 bytes)   inline (hot)
    @ 3 java.util.concurrent.atomic.AtomicReferenceArray::checkedByteOffset (45 bytes)   inline (hot)
      @ 41 java.util.concurrent.atomic.AtomicReferenceArray::byteOffset (12 bytes)   inline (hot)
    @ 6 java.util.concurrent.atomic.AtomicReferenceArray::getRaw (12 bytes)   inline (hot)
      @ 8 sun.misc.Unsafe::getObjectVolatile (0 bytes)   failed to inline (intrinsic)
      @ 8 sun.misc.Unsafe::getObjectVolatile (0 bytes)   native method

So this is the interesting bit:
sun.misc.Unsafe::getObjectVolatile (0 bytes)   failed to inline (intrinsic)

Running a similar benchmark against the stock Java 8 ConcurrentHashMap.get() 
showed that sun.misc.Unsafe::getObjectVolatile was being inlined just 
fine. So something in AtomicReferenceArray.get() is preventing 
getObjectVolatile() from being inlined. Since this is an intrinsic, 
the resulting performance difference is huge.

After careful study of stock Java 8 ConcurrentHashMap.get(), we found 
that the reason that method was being successfully inlined is the 
(tab = table) != null check before tabAt() is invoked. Apparently, the 
HotSpot compiler is unable to intrinsify getObjectVolatile() unless it 
can prove that its array argument is always non-null.
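The pattern can be illustrated with a small, hypothetical class (names are ours; this only demonstrates the shape of the null check, since the inlining effect itself can only be observed with -XX:+PrintInlining):

```java
import java.lang.reflect.Field;

import sun.misc.Unsafe;

// Hypothetical illustration of the null-check pattern: both reads are
// functionally identical, but only getChecked() gives HotSpot a proof
// that the array argument of getObjectVolatile() is non-null.
public class NullCheckHint {
    private static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    private static final long BASE = UNSAFE.arrayBaseOffset(Object[].class);
    private static final long SCALE = UNSAFE.arrayIndexScale(Object[].class);

    private final Object[] array = new Object[100];

    private static long offset(int i) {
        return BASE + (long) i * SCALE;
    }

    // Volatile read with no null hint for the JIT.
    Object getPlain(int i) {
        return UNSAFE.getObjectVolatile(array, offset(i));
    }

    // Volatile read preceded by an explicit null check, mirroring the
    // (tab = table) != null pattern in stock ConcurrentHashMap.get().
    Object getChecked(int i) {
        Object[] a = array;
        if (a == null) {
            throw new IllegalStateException();
        }
        return UNSAFE.getObjectVolatile(a, offset(i));
    }

    public static boolean demo() {
        NullCheckHint h = new NullCheckHint();
        h.array[1] = "value";
        return "value".equals(h.getPlain(1))
                && "value".equals(h.getChecked(1));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "true"
    }
}
```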

So, we created an AtomicReferenceArray fork with an alternative getRaw() 
implementation:

     private E getRaw(long offset) {
         Object[] array = this.array;
         if (array == null) {
             // Cannot happen in practice; the explicit check lets the
             // JIT prove the array passed to getObjectVolatile() is non-null.
             throw new IllegalStateException();
         }
         return (E) unsafe.getObjectVolatile(array, offset);
     }

This resulted in the test taking less than 500 ms instead of 40 seconds 
(an 80-fold difference!), and the compiler was now successfully 
inlining:

com.renxo.cms.Test::ara @ 32 (65 bytes)
  @ 34 com.renxo.cms.AtomicReferenceArray::get (10 bytes)   inline (hot)
    @ 3 com.renxo.cms.AtomicReferenceArray::checkedByteOffset (42 bytes)   inline (hot)
      @ 38 com.renxo.cms.AtomicReferenceArray::byteOffset (12 bytes)   inline (hot)
    @ 6 com.renxo.cms.AtomicReferenceArray::getRaw (26 bytes)   inline (hot)
      @ 22 sun.misc.Unsafe::getObjectVolatile (0 bytes)   (intrinsic)

We were extremely surprised that AtomicReferenceArray, being a core 
library class, could benefit so much from such a small change. Our 
particular test is likely not representative of real-world workloads, 
but still, is there any reason why this fix should not be introduced in 
OpenJDK? We see that the current trunk AtomicReferenceArray no longer 
uses Unsafe.getObjectVolatile() and instead uses VarHandles. We're still 
on Java 8, so we have not tested Java 9 or later, but it is possible 
that the same issue is still there.
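For what it's worth, on JDK 9+ the same volatile array element read can be expressed with a VarHandle. A minimal sketch (our own class name, not the actual AtomicReferenceArray implementation):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Minimal sketch of a VarHandle-based volatile array element read,
// the mechanism current AtomicReferenceArray uses instead of Unsafe.
public class VarHandleRead {
    private static final VarHandle AA =
            MethodHandles.arrayElementVarHandle(Object[].class);

    public static Object getVolatile(Object[] array, int i) {
        // Signature-polymorphic call; volatile-read access mode.
        return AA.getVolatile(array, i);
    }

    public static void main(String[] args) {
        Object[] a = new Object[100];
        a[1] = "value";
        System.out.println(getVolatile(a, 1)); // prints "value"
    }
}
```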

Also, we focused only on AtomicReferenceArray.get(). It is likely that 
other methods using Unsafe have similar inlining issues, and other 
j.u.c.atomic classes might exhibit similar inlining behaviour.

Concurrency experts, we'd appreciate your feedback on our findings.
Thanks,

*Manuel Dominguez Sarmiento*
