[concurrency-interest] A beginner question (on fork-and-join)

Gregg Wonderly gregg at cytetech.com
Tue Nov 22 11:52:24 EST 2011

Nathan, thanks for your insight.  My experience is that most of my code does 
very little that an optimizer could "tune" for improved throughput.  Instead, 
my applications are largely client/server and remote-communications based, and 
the vast majority of wall-clock time is latency in communications between 
machines.

The SecurityManager, Permission, PermissionCollection and other related classes 
are absolutely at odds with throughput because of coarse-grained locking and 
other extremely poor design choices.  The cost of using a SecurityManager seems 
to be a much-overlooked part of JVM performance.

We have things like the use of reflection in subclasses of Thread and "locking" 
to single-threaded construction (a static container, of all things), which kills 
throughput no matter when the JIT takes over, because of

	private static boolean isCCLOverridden(Class cl);

That mechanism should be using Future (inside a ConcurrentHashMap, perhaps) and 
not synchronized().  This can, to some degree, be worked around by using a 
Runnable instead, but sometimes a Thread works better because of its 
accessibility through Thread.currentThread(), and yes, you could use ThreadLocal 
instead in many cases.  But subclassing Thread for a thread factory can be a 
necessity, and that's where many server-type applications with an active 
security manager get into trouble.
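
As a rough sketch of what "Future inside CHM" could look like (this is not the 
JDK's actual code): the class name OverrideCheckCache and the audit() method 
below are hypothetical stand-ins, and audit() only checks whether a class 
declares getContextClassLoader().  The idea is that each class is audited once 
(unless the audit fails), and callers asking about different classes do not 
queue up behind one shared lock.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not the implementation in java.lang.Thread.
final class OverrideCheckCache {
    private final ConcurrentMap<Class<?>, Future<Boolean>> cache =
            new ConcurrentHashMap<>();
    private final AtomicInteger audits = new AtomicInteger();

    // Assumed stand-in for the expensive reflective check.
    private boolean audit(Class<?> cl) {
        audits.incrementAndGet();
        try {
            cl.getDeclaredMethod("getContextClassLoader");
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // Number of times the expensive check has actually run.
    int auditCount() {
        return audits.get();
    }

    boolean isOverridden(final Class<?> cl) {
        Future<Boolean> f = cache.get(cl);
        if (f == null) {
            FutureTask<Boolean> task = new FutureTask<>(new Callable<Boolean>() {
                public Boolean call() {
                    return audit(cl);
                }
            });
            // Only the thread that wins putIfAbsent runs the audit;
            // any other thread waits on the winner's Future.
            f = cache.putIfAbsent(cl, task);
            if (f == null) {
                f = task;
                task.run();
            }
        }
        try {
            return f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for audit", e);
        } catch (ExecutionException e) {
            cache.remove(cl, f); // do not cache a failed audit
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        OverrideCheckCache cache = new OverrideCheckCache();
        System.out.println(cache.isOverridden(Thread.class));
        System.out.println(cache.isOverridden(String.class));
        System.out.println(cache.isOverridden(Thread.class));
        System.out.println("audits run: " + cache.auditCount());
    }
}
```

The second lookup of Thread.class is answered from the cached Future, so the 
audit counter does not move for it.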

For me, any client that runs with a security manager starts much faster with 
the threshold lowered to 100 invocations.  If I lose anything appreciable by 
compiling that soon, I've not noticed it.
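
For anyone wanting to confirm which threshold a given JVM was started with, 
here is a small sketch.  It assumes the setting under discussion is HotSpot's 
-XX:CompileThreshold flag (the thread never names it), and the class and method 
names are made up for illustration.  It only reports a value given explicitly 
on the command line; it does not read the VM's built-in default.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper; assumes the flag in question is -XX:CompileThreshold.
final class CompileThresholdCheck {
    private static final String FLAG = "-XX:CompileThreshold=";

    // Returns the last explicitly given threshold, or -1 if none was given.
    static int explicitThreshold(List<String> jvmArgs) {
        int result = -1;
        for (String arg : jvmArgs) {
            if (arg.startsWith(FLAG)) {
                try {
                    result = Integer.parseInt(arg.substring(FLAG.length()));
                } catch (NumberFormatException e) {
                    // Malformed value: keep whatever was found earlier.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to main().
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();
        int threshold = explicitThreshold(jvmArgs);
        if (threshold < 0) {
            System.out.println("CompileThreshold not set explicitly; VM default applies");
        } else {
            System.out.println("CompileThreshold explicitly set to " + threshold);
        }
    }
}
```

Running it as "java -XX:CompileThreshold=100 CompileThresholdCheck" should 
report the explicit value, which makes it easy to check that a startup script 
is really passing the flag through.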

Your discussion of heuristics being used to make more intelligent decisions 
about what optimizations might be beneficial (branch optimizations, I'm guessing) 
helps me see that there might be better reasons to delay compilation for some 
heavily compute-bound classes.


On 11/21/2011 5:29 PM, Nathan Reynolds wrote:
> The JVM does profile-guided optimization. If you reduce the warm up to only 100
> invocations, then the JVM only looks at those 100 samples and determines how to
> optimize the method. I would guess that for some methods 100, or 1000 or 10000
> invocations isn't going to make any difference on the optimized code. However,
> other methods need the full 10,000 invocations in order to fully understand how
> the method is used and the best way to optimize it.
> In production, you could start one JVM with 100 invocations and the other with
> the default. If both JVMs have the same CPU usage, response times and throughput
> after warmup and compilation, then 100 invocations is sufficient for your
> workload. I would guess that the one with 100 invocations will suffer.
> I'm not sure, but I believe HotSpot 7 includes tiered compilation. After 1,000
> invocations, the method is deemed hot enough that the JVM optimizes it without
> any profiling data to guide the optimizations. The JVM adds profiling code to
> the method at this time. After 10,000 invocations, the JVM does the
> profile-guided optimization of the method.
> On a heavily used server, 10,000 invocations should happen very quickly. For
> some servers, they will process that many requests per second or even
> sub-second. So, the question becomes does the first few minutes of execution
> really matter considering the lifespan of the server? In the overall picture,
> the start up time is much less than 1% of the total time the server is running.
> For client applications, this is a much different story. The 10,000-invocation
> threshold won't be reached until the user presses a button 10,000 times.
> However, _some_ of the time the response time of the program isn't critical.
> The response time for fully optimized code might be 1 ms. With unoptimized code
> it might be 10 ms. The user may not be able to notice. For example, the older
> flat-panel monitors refresh at 60 Hz (= 16.6 ms). So, if the program responds
> within 16.6 ms, the user may not even be able to see that it took a bit longer.
> However, I hear your pain. I wish there were a good way to have instant warm up.
> I and several others have given this problem a lot of thought. All of the
> schemes we have come up with have a lot of issues and were rejected flat-out or
> were tried and then rejected due to performance issues.
> Nathan Reynolds <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> |
> Consulting Member of Technical Staff | 602.333.9091
> Oracle PSR Engineering <http://psr.us.oracle.com/> | Server Technology
> On 11/21/2011 3:36 PM, Gregg Wonderly wrote:
>> So I have to ask, why don't you use the command line property to change this
>> to something like 100 for a faster warm up? For some of my applications, doing
>> this reduces startup time by orders of magnitude because of the number of
>> times some things are invoked. In particular, server applications using a
>> security manager seem to start much faster.
>> Gregg Wonderly
>> On 11/21/2011 3:40 PM, Nathan Reynolds wrote:
>>> Microbenchmarks are incredibly hard to get right. For example, HotSpot 7 JVM
>>> won't do a full optimization of a method until 10,000 invocations. You need to
>>> bump up the priority of the test thread so that other things on the system don't
>>> add noise. These probably aren't applicable to your case, but you may want to
>>> force a full GC right before running the test.
>>> You probably want to use http://code.google.com/p/caliper/ which deals with all
>>> of these gotchas.
>>> On 11/21/2011 2:15 PM, David Harrigan wrote:
>>>> Hi Everyone,
>>>> I'm learning about the fork and join framework in JDK7 and to test it
>>>> out I wrote a little program that tries to find a number at the end of
>>>> a list with 50,000 elements.
>>>> What puzzles me is when I run the "find" in a sequential fashion, it
>>>> returns faster than if I use a fork-and-join implementation. I'm
>>>> running each "find" 5000 times
>>>> so as to "warm" up the JVM. I've got a timing listed below:
>>>> Generating some data...done!
>>>> Sequential
>>>> Simon Stopwatch: total 1015 s, counter 5000, max 292 ms, min 195 ms,
>>>> mean 203 ms [sequential INHERIT]
>>>> Parallel
>>>> Simon Stopwatch: total 1352 s, counter 5000, max 4.70 s, min 243 ms,
>>>> mean 270 ms [parallel INHERIT]
>>>> (some runtime information)
>>>> openjdk version "1.7.0-ea"
>>>> OpenJDK Runtime Environment (build 1.7.0-ea-b215)
>>>> OpenJDK 64-Bit Server VM (build 21.0-b17, mixed mode)
>>>> 2.66GHz Intel Core i7 with 8GB RAM (256KB L2 cache per core (4 cores)
>>>> and 4MB L3 cache) running on a MBP (Lion 10.7.2)
>>>> Forgive my ignorance but this type of programming is still quite new
>>>> to me and I'm obviously doing something wrong, but I don't know what.
>>>> My suspicion is
>>>> something to do with spinning up and down threads and the overhead
>>>> that entails. I've posted the src here: http://pastebin.com/p96R24R0.
>>>> My sincere apologies if this list is not appropriate for this posting,
>>>> if so I would welcome a pointer on where I can find more information
>>>> to help me understand
>>>> better the behaviour of my program when using F&J.
>>>> I thought that by using F&J I would be able to find the answer quicker
>>>> than doing the searching sequentially, perhaps I've chosen a wrong
>>>> initial problem to
>>>> test this out (something that is suited to a sequential search and not
>>>> a parallel search?)
>>>> Thank you all in advance.
>>>> -=david=-
