[concurrency-interest] HotSpot inlining in hot loops.
Dawid Weiss
dawid.weiss at gmail.com
Tue Jan 12 14:47:58 EST 2010
To complete the thread and for archival reasons -- I did cross-post
the question to hotspot-compiler. It seems like the optimisation is
reasonable and possible under Java memory model, but hotspot currently
does not perform it.
Dawid
On Thu, Jan 7, 2010 at 11:42 AM, David Holmes <davidcholmes at aapt.net.au> wrote:
> Dawid Weiss writes:
>>
>> I have been wondering about one thing... knowing the people reading
>> this list this I believe it's the right place to ask.
>
> Well there may be some people here who know the answer, but that doesn't
> make this the right place to ask. ;-)
>
> A better place to ask is the OpenJDK compiler mailing list:
>
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-compiler-dev
>
> Cheers,
> David Holmes
>
>> I have been experimenting with various code patterns and
>> micro-optimisations (and I do realize it's a very subtle topic, but
>> let me continue). I observed the following interesting behavior from
>> SUN's HotSpot implementation. Consider the following two loops:
>>
>> 1.
>> for (int i = 0; i < list.size(); i++)
>> {
>> value = list.get(i);
>> }
>>
>> 2.
>> final int size = list.size();
>> final int [] buffer = list.buffer;
>> for (int i = 0; i < size; i++)
>> {
>> value = buffer[i];
>> }
>>
>> The list class is custom and the implementation of size() and
>> get() is trivial:
>>
>> get == return buffer[index];
>> size == return elementsCount;
>>
>> (no boundary checks, no nothing).
>>
>> Now, the micro-benchmarks (server JVM, with proper warmup, many
>> benchmark repetitions, no parallel activity from the GC or whatsover)
>> are
>> consistently showing that loop (2) is at least TWICE as fast as loop
>> (1) (on various CPU architectures, multi and single-core systems).
>> "value"
>> variable is a public static volatile to force the compiler to actually
>> read the buffer's content and store is somewhere. Example timings:
>>
>> testSimpleGetLoop : time.bench: 1.97, round: 0.20 [+- 0.00],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>> testDirectBufferLoop : time.bench: 0.57, round: 0.06 [+- 0.01],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>>
>> (note the "round" field above, in brackets are standard deviations
>> from multiple test runs).
>>
>> I looked at the assembly dumped by the HotSpot and from it I can
>> conclude that in case of loop (1) the generated code always attempts
>> to re-read the fields referenced
>> through the list field (it is object-scoped, private, final,
>> non-volatile, direct class access -- no interfaces). See listing (A)
>> below. My understanding of the JMM was that
>> in cases such as this one, the compiler can safely assume no side
>> effects for the current thread and move field references to registers.
>> This is exactly what happens
>> in case (2) (see listing (B) below) -- the fields are moved to
>> regiters and additional loop unrolling is performed by the compiler. I
>> am definitely missing something here -- is this my
>> incorrect understanding of the spec or simply the Java compiler does
>> not do aggresive optimisations?
>>
>> I also checked with other JVMs. IBM's Java does not show such a major
>> difference:
>>
>> testSimpleGetLoop time.bench: 1.96, round: 0.20 [+- 0.01],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>> testDirectBufferLoop time.bench: 1.70, round: 0.17 [+- 0.00],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>>
>> JRocki also doesn't:
>>
>> testSimpleGetLoop time.bench: 1.02, round: 0.10 [+- 0.01],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>> testDirectBufferLoop time.bench: 1.01, round: 0.10 [+-
>> 0.01], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>>
>> I have experimented with other scenarios (closure loops, iterables,
>> etc.) and the results are often even more intriguing, but I don't want
>> to start too many threads at once. I will appreciate your comments,
>> even if they simply redirect me to right fragment of the JVM
>> specification.
>>
>> Dawid
>>
>>
>> (A) simple get loop, pseudo-assembly dump (after warmup).
>>
>> 030 B4: # B11 B5 <- B3 B8 Loop: B4-B8 inner Freq: 220753
>> 030 MOV EDX,[EBP + #12] ! Field .buffer
>> 033 NullCheck EBP
>> 033
>> 033 B5: # B12 B6 <- B4 Freq: 220753
>> 033 MOV EBP,[EDX + #8]
>> 036 NullCheck EDX
>> 036
>> 036 B6: # B10 B7 <- B5 Freq: 220752
>> 036 CMPu EDI,EBP
>> 038 Jnb,us B10 P=0.000001 C=-1.000000
>> 038
>> 03a B7: # B13 B8 <- B6 Freq: 220752
>> 03a MOVSX8 EAX,[EDX + #12 + EDI] # byte
>> 03f MEMBAR-release ! (empty encoding)
>> 03f MOV8 [ECX + precise klass Benchmark:
>> 0x048e80e8:Constant:exact *],EAX ! Field Volatile value
>> 045 MEMBAR-volatile (unnecessary so empty encoding)
>> 045 LOCK ADDL [ESP + #0], 0 ! membar_volatile
>> 04a MOV EBP,[EBX + #12] ! Field .list
>> 04d MOV EDX,[EBP + #8] # int ! Field .elementsCount
>> 050 NullCheck EBP
>> 050
>> 050 B8: # B4 B9 <- B7 Freq: 220752
>> 050 INC EDI
>> 051 TSTL #polladdr,EAX ! Safepoint: poll for GC
>> 057 CMP EDI,EDX
>> 059 Jl,s B4 P=1.000000 C=179200.000000
>>
>>
>> (B) direct buffer access loop (after warmup).
>>
>> 01a MOV ECX,[ECX + #12] ! Field .list
>> 01d MOV EAX,[ECX + #8] # int ! Field .elementsCount
>> 020 NullCheck ECX
>> 020
>> 020 B2: # B12 B3 <- B1 Freq: 0.999999
>> 020 TEST EAX,EAX
>> 022 Jle B12 P=0.000000 C=1.000000
>> 022
>> 028 B3: # B4 <- B2 Freq: 0.999999
>> 028 MOV EDI,[ECX + #12] ! Field .buffer
>> 02b XOR EBX,EBX
>> 02d MOV EBP,#360
>> 02d
>> 032 B4: # B14 B5 <- B3 B6 Loop: B4-B6 inner stride:
>> not constant
>> pre of N158 Freq: 1.99999
>> 032 MOV EDX,[EDI + #8]
>> 035 NullCheck EDI
>> 035
>> 035 B5: # B13 B6 <- B4 Freq: 1.99999
>> 035 CMPu EBX,EDX
>> 037 Jnb,u B13 P=0.000001 C=-1.000000
>> 037
>> 03d B6: # B4 B7 <- B5 Freq: 1.99999
>> 03d MOVSX8 ECX,[EDI + #12 + EBX] # byte
>> 042 MEMBAR-release ! (empty encoding)
>> 042 MOV8 [EBP + precise klass :
>> 0x04161d30:Constant:exact *],ECX
>> ! Field Volatile.value
>> 048 MEMBAR-volatile (unnecessary so empty encoding)
>> 048 LOCK ADDL [ESP + #0], 0 ! membar_volatile
>> 04d INC EBX
>> 04e CMP EBX,#1
>> 051 Jl,s B4 # Loop end P=0.500000 C=334848.000000
>>
>> 053 B7: # B9 B8 <- B6 Freq: 0.999994
>> 053 MOV ESI,EAX
>> 055 MIN ESI,EDX
>> 05b SUB ESI,EBX
>> 05d AND ESI,#-4
>> 060 ADD ESI,EBX
>> 062 CMP EBX,ESI
>> 064 Jge,s B9 P=0.000001 C=-1.000000
>> NOP # 10 bytes pad for loops and calls
>>
>> 070 B8: # B8 B9 <- B7 B8 Loop: B8-B8 inner stride:
>> not constant
>> main of N116 Freq: 999993
>> 070 MOVSX8 ECX,[EDI + #12 + EBX] # byte
>> 075 MEMBAR-release ! (empty encoding)
>> 075 MOV8 [EBP + precise klass :
>> 0x04161d30:Constant:exact *],ECX
>> ! Field Volatile.value
>> 07b MEMBAR-volatile (unnecessary so empty encoding)
>> 07b MEMBAR-volatile (unnecessary so empty encoding)
>> 07b MOVSX8 ECX,[EDI + #13 + EBX] # byte
>> 080 MEMBAR-release ! (empty encoding)
>> 080 MOV8 [EBP + precise klass :
>> 0x04161d30:Constant:exact *],ECX
>> ! Field Volatile.value
>> 086 MEMBAR-volatile (unnecessary so empty encoding)
>> 086 MEMBAR-volatile (unnecessary so empty encoding)
>>
>> ... (loop further unrolled here) ...
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>
>
More information about the Concurrency-interest
mailing list