[concurrency-interest] HotSpot inlining in hot loops.

Dawid Weiss dawid.weiss at gmail.com
Tue Jan 12 14:47:58 EST 2010


To complete the thread and for archival reasons -- I did cross-post
the question to hotspot-compiler. It seems like the optimisation is
reasonable and possible under Java memory model, but hotspot currently
does not perform it.

Dawid

On Thu, Jan 7, 2010 at 11:42 AM, David Holmes <davidcholmes at aapt.net.au> wrote:
> Dawid Weiss writes:
>>
>> I have been wondering about one thing... knowing the people reading
>> this list this I believe it's the right place to ask.
>
> Well there may be some people here who know the answer, but that doesn't
> make this the right place to ask. ;-)
>
> A better place to ask is the OpenJDK compiler mailing list:
>
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-compiler-dev
>
> Cheers,
> David Holmes
>
>> I have been experimenting with various code patterns and
>> micro-optimisations (and I do realize it's a very subtle topic, but
>> let me continue). I observed the following interesting behavior from
>> SUN's HotSpot implementation. Consider the following two loops:
>>
>> 1.
>>         for (int i = 0; i < list.size(); i++)
>>         {
>>             value = list.get(i);
>>         }
>>
>> 2.
>>         final int size = list.size();
>>         final int [] buffer = list.buffer;
>>         for (int i = 0; i < size; i++)
>>         {
>>             value = buffer[i];
>>         }
>>
>> The list class is custom and the implementation of size() and
>> get() is trivial:
>>
>> get == return buffer[index];
>> size == return elementsCount;
>>
>> (no boundary checks, no nothing).
>>
>> Now, the micro-benchmarks (server JVM, with proper warmup, many
>> benchmark repetitions, no parallel activity from the GC or whatsover)
>> are
>> consistently showing that loop (2) is at least TWICE as fast as loop
>> (1) (on various CPU architectures, multi and single-core systems).
>> "value"
>> variable is a public static volatile to force the compiler to actually
>> read the buffer's content and store is somewhere. Example timings:
>>
>> testSimpleGetLoop          : time.bench: 1.97, round: 0.20 [+- 0.00],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>> testDirectBufferLoop       : time.bench: 0.57, round: 0.06 [+- 0.01],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>>
>> (note the "round" field above, in brackets are standard deviations
>> from multiple test runs).
>>
>> I looked at the assembly dumped by the HotSpot and from it I can
>> conclude that in case of loop (1) the generated code always attempts
>> to re-read the fields referenced
>> through the list field (it is object-scoped, private, final,
>> non-volatile, direct class access -- no interfaces). See listing (A)
>> below. My understanding of the JMM was that
>> in cases such as this one, the compiler can safely assume no side
>> effects for the current thread and move field references to registers.
>> This is exactly what happens
>> in case (2) (see listing (B) below) -- the fields are moved to
>> regiters and additional loop unrolling is performed by the compiler. I
>> am definitely missing something here -- is this my
>> incorrect understanding of the spec or simply the Java compiler does
>> not do aggresive optimisations?
>>
>> I also checked with other JVMs. IBM's Java does not show such a major
>> difference:
>>
>> testSimpleGetLoop            time.bench: 1.96, round: 0.20 [+- 0.01],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>> testDirectBufferLoop        time.bench: 1.70, round: 0.17 [+- 0.00],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>>
>> JRocki also doesn't:
>>
>> testSimpleGetLoop             time.bench: 1.02, round: 0.10 [+- 0.01],
>> round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>> testDirectBufferLoop             time.bench: 1.01, round: 0.10 [+-
>> 0.01], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
>>
>> I have experimented with other scenarios (closure loops, iterables,
>> etc.) and the results are often even more intriguing, but I don't want
>> to start too many threads at once. I will appreciate your comments,
>> even if they simply redirect me to right fragment of the JVM
>> specification.
>>
>> Dawid
>>
>>
>> (A) simple get loop, pseudo-assembly dump (after warmup).
>>
>> 030   B4: #   B11 B5 &lt;- B3 B8      Loop: B4-B8 inner  Freq: 220753
>> 030           MOV    EDX,[EBP + #12] ! Field .buffer
>> 033           NullCheck EBP
>> 033
>> 033   B5: #   B12 B6 &lt;- B4  Freq: 220753
>> 033           MOV    EBP,[EDX + #8]
>> 036           NullCheck EDX
>> 036
>> 036   B6: #   B10 B7 &lt;- B5  Freq: 220752
>> 036           CMPu   EDI,EBP
>> 038           Jnb,us B10  P=0.000001 C=-1.000000
>> 038
>> 03a   B7: #   B13 B8 &lt;- B6  Freq: 220752
>> 03a           MOVSX8 EAX,[EDX + #12 + EDI]    # byte
>> 03f           MEMBAR-release ! (empty encoding)
>> 03f           MOV8   [ECX + precise klass Benchmark:
>> 0x048e80e8:Constant:exact *],EAX ! Field  Volatile value
>> 045           MEMBAR-volatile (unnecessary so empty encoding)
>> 045           LOCK ADDL [ESP + #0], 0 ! membar_volatile
>> 04a           MOV    EBP,[EBX + #12] ! Field .list
>> 04d           MOV    EDX,[EBP + #8]   # int ! Field .elementsCount
>> 050           NullCheck EBP
>> 050
>> 050   B8: #   B4 B9 &lt;- B7  Freq: 220752
>> 050           INC    EDI
>> 051           TSTL   #polladdr,EAX    ! Safepoint: poll for GC
>> 057           CMP    EDI,EDX
>> 059           Jl,s  B4  P=1.000000 C=179200.000000
>>
>>
>> (B) direct buffer access loop (after warmup).
>>
>> 01a           MOV    ECX,[ECX + #12] ! Field .list
>> 01d           MOV    EAX,[ECX + #8]   # int ! Field .elementsCount
>> 020           NullCheck ECX
>> 020
>> 020   B2: #   B12 B3 &lt;- B1  Freq: 0.999999
>> 020           TEST   EAX,EAX
>> 022           Jle    B12  P=0.000000 C=1.000000
>> 022
>> 028   B3: #   B4 &lt;- B2  Freq: 0.999999
>> 028           MOV    EDI,[ECX + #12] ! Field .buffer
>> 02b           XOR    EBX,EBX
>> 02d           MOV    EBP,#360
>> 02d
>> 032   B4: #   B14 B5 &lt;- B3 B6      Loop: B4-B6 inner stride:
>> not constant
>> pre of N158 Freq: 1.99999
>> 032           MOV    EDX,[EDI + #8]
>> 035           NullCheck EDI
>> 035
>> 035   B5: #   B13 B6 &lt;- B4  Freq: 1.99999
>> 035           CMPu   EBX,EDX
>> 037           Jnb,u  B13  P=0.000001 C=-1.000000
>> 037
>> 03d   B6: #   B4 B7 &lt;- B5  Freq: 1.99999
>> 03d           MOVSX8 ECX,[EDI + #12 + EBX]    # byte
>> 042           MEMBAR-release ! (empty encoding)
>> 042           MOV8   [EBP + precise klass :
>> 0x04161d30:Constant:exact *],ECX
>> ! Field  Volatile.value
>> 048           MEMBAR-volatile (unnecessary so empty encoding)
>> 048           LOCK ADDL [ESP + #0], 0 ! membar_volatile
>> 04d           INC    EBX
>> 04e           CMP    EBX,#1
>> 051           Jl,s  B4        # Loop end  P=0.500000 C=334848.000000
>>
>> 053   B7: #   B9 B8 &lt;- B6  Freq: 0.999994
>> 053           MOV    ESI,EAX
>> 055           MIN    ESI,EDX
>> 05b           SUB    ESI,EBX
>> 05d           AND    ESI,#-4
>> 060           ADD    ESI,EBX
>> 062           CMP    EBX,ESI
>> 064           Jge,s  B9  P=0.000001 C=-1.000000
>>               NOP     # 10 bytes pad for loops and calls
>>
>> 070   B8: #   B8 B9 &lt;- B7 B8       Loop: B8-B8 inner stride:
>> not constant
>> main of N116 Freq: 999993
>> 070           MOVSX8 ECX,[EDI + #12 + EBX]    # byte
>> 075           MEMBAR-release ! (empty encoding)
>> 075           MOV8   [EBP + precise klass :
>> 0x04161d30:Constant:exact *],ECX
>> ! Field  Volatile.value
>> 07b           MEMBAR-volatile (unnecessary so empty encoding)
>> 07b           MEMBAR-volatile (unnecessary so empty encoding)
>> 07b           MOVSX8 ECX,[EDI + #13 + EBX]    # byte
>> 080           MEMBAR-release ! (empty encoding)
>> 080           MOV8   [EBP + precise klass :
>> 0x04161d30:Constant:exact *],ECX
>> ! Field  Volatile.value
>> 086           MEMBAR-volatile (unnecessary so empty encoding)
>> 086           MEMBAR-volatile (unnecessary so empty encoding)
>>
>> ... (loop further unrolled here) ...
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>
>



More information about the Concurrency-interest mailing list