[concurrency-interest] A beginner question (on fork-and-join)

Nathan Reynolds nathan.reynolds at oracle.com
Tue Nov 22 12:24:43 EST 2011


There are many optimizations that are useful with profiling data.  For 
example, I was surprised the other day that a virtual method got 
inlined!  The profile showed that only 1 class of many was being called 
into.  So, the optimizer put an if statement to check the class (3 cycle 
penalty) and then inlined the virtual method.  The check branches out to 
slower code if a different class is encountered.  Without profiling 
data, the optimizer would have to assume that any class was equally 
likely and therefore leave it as a pure virtual call.  However, if I had 
switched to optimizing after 100 invocations, the code would have been 
just as good in this case.  The profile samples would have told the 
optimizer the same thing.
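
A rough Java rendering of the guarded inlining described above may help. It is only a sketch: the JIT emits machine code, not Java, and Shape, Circle, Square, and both method names are made up for illustration.

```java
// Illustrative only: hand-written Java that mimics the shape of the code an
// optimizer could produce. All names here are hypothetical.
interface Shape {
    double area();
}

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    @Override public double area() { return Math.PI * r * r; }
}

final class Square implements Shape {
    final double side;
    Square(double side) { this.side = side; }
    @Override public double area() { return side * side; }
}

final class GuardedInlineSketch {
    // What the source code says: a plain virtual (interface) call.
    static double areaVirtual(Shape s) {
        return s.area();
    }

    // Roughly what happens when the profile saw only Circle at this call site.
    static double areaGuarded(Shape s) {
        if (s.getClass() == Circle.class) {   // the cheap class check (the guard)
            Circle c = (Circle) s;
            return Math.PI * c.r * c.r;       // body of Circle.area(), inlined
        }
        return s.area();                      // slower path for any other class
    }

    public static void main(String[] args) {
        System.out.println(areaGuarded(new Circle(1.0)) == areaVirtual(new Circle(1.0)));
        System.out.println(areaGuarded(new Square(2.0)) == areaVirtual(new Square(2.0)));
    }
}
```

Both methods return the same result for every Shape; the guarded version only changes which path is fast.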

Nathan Reynolds 
<http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | 
Consulting Member of Technical Staff | 602.333.9091
Oracle PSR Engineering <http://psr.us.oracle.com/> | Server Technology

On 11/22/2011 9:52 AM, Gregg Wonderly wrote:
> Nathan, thanks for your insight.  My experience is that most of my 
> code does absolutely little which would be "tuned" for improved 
> throughput by an optimizer.  Instead, my applications are largely 
> client server and remote communications based, and the vast majority 
> of wall clock time involves latency in communications between machines.
>
> The SecurityManager, Permission, PermissionCollection and other 
> related classes are absolutely at odds with throughput because of 
> coarse-grained locking and other extremely poor design choices.  The 
> cost of using a SecurityManager seems to be a much overlooked part of 
> JVM performance.
>
> We have things like the use of reflection in subclasses of Thread and 
> "locking" to single threaded construction (a static container of all 
> things) which kills throughput, no matter when the JIT takes over 
> because of
>
>     private static boolean isCCLOverridden(Class cl);
>
> That mechanism should be using Future (inside CHM perhaps) and not 
> synchronized().  This can, to some degree be worked around by using a 
> Runnable, instead, but sometimes a Thread is what works better because 
> of its accessibility through Thread.currentThread() and yes you could 
> use ThreadLocal instead for many cases.  But, subclassing thread for a 
> Thread factory can be a necessity, and that's where many server kinds 
> of applications with a security manager active get into trouble.
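
The "Future inside CHM" idea can be sketched as follows. This is not the JDK's code for isCCLOverridden: PerClassCheckCache and expensiveCheck are hypothetical names, and the reflective test inside expensiveCheck is only a stand-in for whatever per-class audit is being cached.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Hypothetical helper, NOT the JDK implementation.
final class PerClassCheckCache {
    private final ConcurrentMap<Class<?>, Future<Boolean>> cache =
            new ConcurrentHashMap<Class<?>, Future<Boolean>>();

    // Placeholder for an expensive reflective audit of one class.
    private boolean expensiveCheck(Class<?> cl) {
        try {
            cl.getDeclaredMethod("getContextClassLoader");
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    boolean check(final Class<?> cl) {
        Future<Boolean> f = cache.get(cl);
        if (f == null) {
            FutureTask<Boolean> task = new FutureTask<Boolean>(new Callable<Boolean>() {
                @Override public Boolean call() {
                    return expensiveCheck(cl);
                }
            });
            f = cache.putIfAbsent(cl, task);
            if (f == null) {      // this thread won the race, so it runs the check
                f = task;
                task.run();
            }
        }
        try {
            return f.get();       // other threads wait only on this class's result
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            cache.remove(cl, f);  // do not cache a failed computation
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

Threads asking about different classes never block each other here, and threads asking about the same class block only until the first computation finishes, which is the property a single static lock does not give.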
>
> For me, any client that runs with a security manager starts much 
> faster with 100 invocations.  If there is anything appreciable that I 
> miss by compiling that soon, I've not noticed it.
>
> Your discussion of the heuristics used to make more intelligent 
> decisions about which optimizations might be beneficial (branch 
> optimizations, I'm guessing) helps me see that there might be some 
> better reasons to delay for some heavily compute-bound classes.
>
> Gregg
>
> On 11/21/2011 5:29 PM, Nathan Reynolds wrote:
>> The JVM does profile-guided optimization. If you reduce the warm up 
>> to only 100
>> invocations, then the JVM only looks at those 100 samples and 
>> determines how to
>> optimize the method. I would guess that for some methods 100, or 1000 
>> or 10000
>> invocations isn't going to make any difference on the optimized code. 
>> However,
>> other methods need the full 10,000 invocations in order to fully 
>> understand how
>> the method is used and the best way to optimize it.
>>
>> In production, you could start one JVM with 100 invocations and the 
>> other with
>> the default. If both JVMs have the same CPU usage, response times and 
>> throughput
>> after warmup and compilation, then 100 invocations is sufficient for 
>> your
>> workload. I would guess that the one with 100 invocations will suffer.
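
When comparing two JVMs this way, it helps to log which threshold each one was actually started with. The sketch below assumes the setting under discussion is HotSpot's -XX:CompileThreshold flag (the thread never names it); ShowCompileFlags and flagValue are hypothetical names.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class ShowCompileFlags {
    // Returns the value of a "-XX:<name>=<value>" argument, or null if absent.
    // If the flag appears more than once, the last occurrence is returned.
    static String flagValue(List<String> jvmArgs, String name) {
        String prefix = "-XX:" + name + "=";
        String value = null;
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                value = arg.substring(prefix.length());
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // The arguments passed to this JVM (not to main), e.g. -Xmx and -XX flags.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String threshold = flagValue(jvmArgs, "CompileThreshold");
        System.out.println("CompileThreshold on command line: "
                + (threshold == null ? "(not set, JVM default applies)" : threshold));
    }
}
```

Under that assumption, one JVM would be started with "java -XX:CompileThreshold=100 ShowCompileFlags" and the other without the flag.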
>>
>> I'm not sure, but I believe HotSpot 7 includes tiered compilation. 
>> After 1,000
>> invocations, the method is deemed hot enough that the JVM optimizes 
>> it without
>> any profiling data to guide the optimizations. The JVM adds profiling 
>> code to
>> the method at this time. After 10,000 invocations, the JVM does the
>> profile-guided optimization of the method.
>>
>> On a heavily used server, 10,000 invocations should happen very 
>> quickly. For
>> some servers, they will process that many requests per second or even
>> sub-second. So, the question becomes does the first few minutes of 
>> execution
>> really matter considering the lifespan of the server? In the overall 
>> picture,
>> the start up time is much less than 1% of the total time the server 
>> is running.
>>
>> For client applications, this is a much different story. The 
>> 10,000-invocation threshold
>> won't be reached until the user presses a button 10,000 times. 
>> However, _some_
>> of the time the response time of the program isn't critical. The 
>> response time
>> for fully optimized code might be 1 ms. With unoptimized code it 
>> might be 10 ms.
>> The user may not be able to notice. For example, the older flat-panel 
>> monitors
>> refresh at 60 Hz (one frame about every 16.7 ms). So, if the program 
>> responds within 16.7 ms, the
>> user may not even be able to see that it took a bit longer.
>>
>> However, I hear your pain. I wish there were a good way to have 
>> instant warm up.
>> I and several others have given this problem a lot of thought. All of 
>> the
>> schemes we have come up with have a lot of issues and were rejected 
>> flat-out or
>> were tried and then rejected due to performance issues.
>>
>>
>> On 11/21/2011 3:36 PM, Gregg Wonderly wrote:
>>> So I have to ask, why don't you use the command line property to 
>>> change this
>>> to something like 100 for a faster warm up? For some of my 
>>> applications, doing
>>> this reduces startup time by orders of magnitude because of the 
>>> number of
>>> times some things are invoked. In particular, server applications 
>>> using a
>>> security manager seem to start much faster.
>>>
>>> Gregg Wonderly
>>>
>>> On 11/21/2011 3:40 PM, Nathan Reynolds wrote:
>>>> Microbenchmarks are incredibly hard to get right. For example, 
>>>> HotSpot 7 JVM
>>>> won't do a full optimization of a method until 10,000 invocations. 
>>>> You need to
>>>> bump up the priority of the test thread so that other things on the 
>>>> system don't
>>>> add noise. This probably isn't applicable to your case, but you 
>>>> may also want to force a
>>>> full GC right before running the test.
>>>>
>>>> You probably want to use http://code.google.com/p/caliper/ which 
>>>> deals with all
>>>> of these gotchas.
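
For illustration only, the three gotchas above (warm-up past the compile threshold, thread priority, a GC before timing) look roughly like this when handled by hand. WarmupSketch, its constants, and the linear-scan workload are assumptions made for the sketch, and a hand-rolled loop like this is still no substitute for a real benchmark harness.

```java
final class WarmupSketch {
    static final int WARMUP_CALLS = 20_000;  // comfortably past the 10,000 mentioned above
    static final int TIMED_CALLS = 1_000;

    static volatile int sink;                // published result, so the loops are not dead code

    // Hypothetical workload: linear scan for a value in an array.
    static int find(int[] data, int target) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == target) return i;
        }
        return -1;
    }

    // Returns the average time per call in nanoseconds, measured after warm-up.
    static long timeNanos(int[] data, int target) {
        Thread self = Thread.currentThread();
        int oldPriority = self.getPriority();
        self.setPriority(Thread.MAX_PRIORITY);   // only a hint to the scheduler
        try {
            int acc = 0;
            for (int i = 0; i < WARMUP_CALLS; i++) acc += find(data, target);
            System.gc();                         // a request, not a guarantee
            long start = System.nanoTime();
            for (int i = 0; i < TIMED_CALLS; i++) acc += find(data, target);
            long elapsed = System.nanoTime() - start;
            sink = acc;
            return elapsed / TIMED_CALLS;
        } finally {
            self.setPriority(oldPriority);
        }
    }

    public static void main(String[] args) {
        int[] data = new int[50_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println("avg ns per find: " + timeNanos(data, data.length - 1));
    }
}
```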
>>>>
>>>>
>>>> On 11/21/2011 2:15 PM, David Harrigan wrote:
>>>>> Hi Everyone,
>>>>>
>>>>> I'm learning about the fork and join framework in JDK7 and to test it
>>>>> out I wrote a little program that tries to find a number at the 
>>>>> end of
>>>>> a list with 50,000 elements.
>>>>> What puzzles me is when I run the "find" in a sequential fashion, it
>>>>> returns faster than if I use a fork-and-join implementation. I'm
>>>>> running each "find" 5000 times
>>>>> so as to "warm" up the JVM. I've got a timing listed below:
>>>>>
>>>>> Generating some data...done!
>>>>> Sequential
>>>>> Simon Stopwatch: total 1015 s, counter 5000, max 292 ms, min 195 ms,
>>>>> mean 203 ms [sequential INHERIT]
>>>>> Parallel
>>>>> Simon Stopwatch: total 1352 s, counter 5000, max 4.70 s, min 243 ms,
>>>>> mean 270 ms [parallel INHERIT]
>>>>>
>>>>> (some runtime information)
>>>>>
>>>>> openjdk version "1.7.0-ea"
>>>>> OpenJDK Runtime Environment (build 1.7.0-ea-b215)
>>>>> OpenJDK 64-Bit Server VM (build 21.0-b17, mixed mode)
>>>>>
>>>>> 2.66GHz Intel Core i7 with 8GB RAM (256KB L2 cache per core (4 cores)
>>>>> and 4MB L3 cache) running on a MBP (Lion 10.7.2)
>>>>>
>>>>> Forgive my ignorance but this type of programming is still quite new
>>>>> to me and I'm obviously doing something wrong, but I don't know what.
>>>>> My suspicion is
>>>>> something to do with spinning up and down threads and the overhead
>>>>> that entails. I've posted the src here: http://pastebin.com/p96R24R0
>>>>>
>>>>> My sincere apologies if this list is not appropriate for this 
>>>>> posting,
>>>>> if so I would welcome a pointer on where I can find more information
>>>>> to help me understand
>>>>> better the behaviour of my program when using F&J.
>>>>>
>>>>> I thought that by using F&J I would be able to find the answer 
>>>>> quicker
>>>>> than doing the searching sequentially, perhaps I've chosen a wrong
>>>>> initial problem to
>>>>> test this out (something that is suited to a sequential search and 
>>>>> not
>>>>> a parallel search?)
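
The pastebin source is not reproduced in this thread, so the following is not the program being asked about. It is a minimal sketch of a fork/join search over 50,000 elements, assuming an int[] rather than a List, with a sequential cutoff so that small ranges are scanned directly instead of being split further. FindTask, find, and the THRESHOLD value are all hypothetical and would need tuning by measurement.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

final class FindTask extends RecursiveTask<Integer> {
    static final int THRESHOLD = 10_000;   // assumed cutoff, not a recommended value

    private final int[] data;
    private final int lo;                  // inclusive
    private final int hi;                  // exclusive
    private final int target;

    FindTask(int[] data, int lo, int hi, int target) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
        this.target = target;
    }

    @Override
    protected Integer compute() {
        if (hi - lo <= THRESHOLD) {
            // Small enough: plain sequential scan, no further task overhead.
            for (int i = lo; i < hi; i++) {
                if (data[i] == target) return i;
            }
            return -1;
        }
        int mid = (lo + hi) >>> 1;
        FindTask left = new FindTask(data, lo, mid, target);
        FindTask right = new FindTask(data, mid, hi, target);
        left.fork();                       // run the left half asynchronously
        int r = right.compute();           // search the right half in this thread
        int l = left.join();               // wait for the left half
        return l >= 0 ? l : r;             // prefer the lower index if both match
    }

    // Returns the index of target in data, or -1 if it is absent.
    static int find(ForkJoinPool pool, int[] data, int target) {
        return pool.invoke(new FindTask(data, 0, data.length, target));
    }

    public static void main(String[] args) {
        int[] data = new int[50_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        ForkJoinPool pool = new ForkJoinPool();
        System.out.println(find(pool, data, data.length - 1));
        pool.shutdown();
    }
}
```

Creating the ForkJoinPool once and reusing it across all timed runs keeps thread start-up out of the measurement.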
>>>>>
>>>>> Thank you all in advance.
>>>>>
>>>>> -=david=-
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Concurrency-interest mailing list
>>>> Concurrency-interest at cs.oswego.edu
>>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>
>