[concurrency-interest] ForkJoinPool not designed for nested Java 8 streams.parallel().forEach( ... )

Christian Fries email at christian-fries.de
Wed May 14 16:39:14 EDT 2014

Dear All.

Indeed, it appears we were talking a bit past each other in this discussion. I will stick to the executor framework for nested parallelism, but since I have done some further analysis, let me try to give a summary:

My concern was and is that there is a serious implementation bug in the 1.8u5 ForkJoinPool framework (specifically ForkJoinTask, likely line 401), rendering it potentially inappropriate for nested loops, threatening the versatility of Java parallel streams. I do not say that Fork/Join as a theoretical / academic concept is not appropriate for nested parallelism. In fact, I believe the opposite is true. Maybe that was a misunderstanding. - I am just concerned about the implementation.

The bug will lead to unexpected deadlocks and serious performance losses. I do have a workaround/bugfix.

For the performance: my test results are (reproducible, test cases without any blocking operation):

Timing for nested loop test cases, outer loop parallel (parallelism set to 2):
	inner loop sequential           = 27.61 +/- 0.50 (min: 26.62 , max: 28.61)	[CPU Time: 76.91]	(number of threads in F/J pool: 2)
	inner loop parallel             = 41.18 +/- 2.99 (min: 33.14 , max: 45.59)	[CPU Time: 106.00]	(number of threads in F/J pool: 8)
	inner loop parallel with bugfix = 26.99 +/- 0.81 (min: 25.75 , max: 28.77)	[CPU Time: 77.73]	(number of threads in F/J pool: 2)
(test cases were run independently, with 20 warm-up runs and 20 test runs, on an Intel i7 at 2.6 GHz).

If you monitor the CPU usage of the JVM process (the above is just the bare getThreadCpuTime for the F/J threads and the main thread), you will see that it is even worse: without the bug fix, the JVM needs over twice as much CPU time and is still slower by 30%.
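For reference, per-thread CPU times like the ones above can be collected via ThreadMXBean. This is a minimal self-contained sketch of that measurement technique, not the actual benchmark code (the loop is just a stand-in workload):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeDemo {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long start = bean.getCurrentThreadCpuTime();

        // Burn some CPU so there is something to measure.
        long sum = 0;
        for (int i = 0; i < 10_000_000; i++) sum += i;

        long cpuNanos = bean.getCurrentThreadCpuTime() - start;
        System.out.println("cpu time measured: " + (cpuNanos >= 0));
        System.out.println("sum: " + sum);
    }
}
```

For other threads (e.g. the F/J workers), the same bean offers getThreadCpuTime(long threadId), which is what the benchmark figures above refer to.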

For the deadlocks: the following code will reliably result in a deadlock:

	// requires: import java.util.stream.IntStream;
	public static void nestedLoopDeadlockDemo() {
		Object lock = new Object();

		IntStream.range(0,24).parallel().forEach(i -> {
			// do some work here
			synchronized(lock) {
				IntStream.range(0,100).parallel().forEach(j -> {
					// do some work here
				});
			}
		});
	}
(and the problem is gone with that bugfix). Note: I really learned something from this discussion, and discussing the design of my test case is nice, but that test case exists primarily to illustrate the bug. I do know that nesting can be avoided (though I believe nesting improves abstraction in some cases) and I do know that synchronization has to be done with care.

With respect to some arguments that appeared here: the deadlocking has nothing to do with compensation threads. I monitor the creation of threads and there are no compensation threads at all during the whole test. Also note that the documentation of the backend ForkJoinPool states that
"The pool attempts to maintain enough active (or available) threads by dynamically adding, suspending, or resuming internal worker threads, even if some tasks are stalled waiting to join others. However, no such adjustments are guaranteed in the face of blocked IO or other unmanaged synchronization."
- but a deadlock would only be expected if the number of available threads were exhausted (and we are far below that number).

If you consider the same setup without the synchronized - i.e. plain nested parallel loops - then the bug will still result in a big performance loss in some cases.
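To observe the effect yourself, a small sketch that merely counts the distinct worker threads touched by a plain nested parallel stream (this is my illustration, not the benchmark code; the exact count depends on the JDK version and the machine):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class NestedParallelThreadCount {
    public static void main(String[] args) {
        Set<String> threadNames = ConcurrentHashMap.newKeySet();

        IntStream.range(0, 24).parallel().forEach(i -> {
            IntStream.range(0, 100).parallel().forEach(j -> {
                // Record which thread executed this inner iteration.
                threadNames.add(Thread.currentThread().getName());
            });
        });

        System.out.println("distinct threads used: " + threadNames.size());
        System.out.println("done");
    }
}
```

Comparing this count with and without the workaround (and with the F/J parallelism pinned to 2) reproduces the thread-count column of the benchmark table above.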

More details on that bug:

1. Run the program referenced below in a debugger. It will hang in a deadlock in the last test case.
2. In the debugger, suspend all threads and check where these threads are waiting.
   2.1. You will find that all threads wait for the synchronized lock inside the outer loop, and that the lock is owned by one of the threads; let's call that owner ForkJoinPool.commonPool-worker-7.
   2.2. If you check ForkJoinPool.commonPool-worker-7 (the owner of the lock), you will see that it waits for the synchronized lock too, although it is already inside the inner loop. Now, let's check why a piece of code inside the synchronized block waits for the lock: you will see that the wait() is issued by the awaitJoin() in line 402 of ForkJoinTask. This is wrong; instead, that task should have done an externalWaitDone(). Explanation: that task is the main task of the inner loop, i.e., a task of the outer loop that created the inner loop, but (due to a bug) that task considers itself a forked worker (of the inner loop) and creates a join, effectively joining with the outer loop's task. The problem is that if the inner loop is running on a forked worker of the outer loop, we cannot distinguish the forked (inner loop's) workers from main threads, because line 401 is always true.
3. Double checking: if the explanation in 2.2 is correct, the problem should go away if we fix line 401 (and also all other corresponding tests). To fix this we change the type of the thread containing the inner loop from ForkJoinWorkerThread to Thread (by creating a wrapper). Indeed: this fixed the deadlock and greatly improved performance.
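For illustration, the wrapper idea from step 3 can be sketched roughly as follows (the helper name runInPlainThread is mine, not from the actual test code): the inner parallel stream is started from a plain Thread rather than from a ForkJoinWorkerThread, so the worker-thread check in ForkJoinTask fails and the inner loop's main task takes the external wait path instead of awaitJoin():

```java
import java.util.stream.IntStream;

public class NestedLoopWithWrapper {
    // Hypothetical helper (name is mine): runs the body on a plain Thread,
    // so the nested stream is not started from a ForkJoinWorkerThread.
    static void runInPlainThread(Runnable body) {
        Thread t = new Thread(body);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Object lock = new Object();

        IntStream.range(0, 24).parallel().forEach(i -> {
            synchronized (lock) {
                // The inner loop runs on a plain Thread instead of the
                // common-pool worker that currently holds the lock.
                runInPlainThread(() ->
                        IntStream.range(0, 100).parallel().forEach(j -> {
                            // do some work here
                        }));
            }
        });

        System.out.println("completed without deadlock");
    }
}
```

This is only a sketch of the workaround's mechanism; the complete version is in the test programs referenced below.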
For a complete test demonstrating the deadlock and the workaround see: NestedParallelForEachAndSynchronization.java at

For the test producing the performance benchmarks see: NestedParallelForEachBenchmark.java at
If you run test case 1 and test case 3, you will note that without the bug fix the ForkJoinPool spawns many (8) worker threads and burns a lot of CPU time (sometimes running at 500%, where the version with the bug fix always stays around 200% with 2 worker threads).

I am not sure how easy it is to fix this, but since I already provide a workaround, it appears that a bug fix should be a minor correction to the F/J implementation. (If you like, I can comment a bit more on what might be changed in ForkJoinTask.)

Would you consider the deadlocking a bug too? Could anybody look at this? I believe this has to be fixed soon (it cannot wait until JDK 9).


Am 09.05.2014 um 15:25 schrieb Doug Lea <dl at cs.oswego.edu>:

> On 05/09/2014 09:05 AM, Christian Fries wrote:
>> Wouldn’t flattening mean loss of abstraction (no abstraction between F and
>> LM), spaghetti code, etc.
> Somehow we seem to be talking past each other. Here's a restart:
> 1. j.u.Stream.parallel() automates many forms of parallelism.
> 2. It doesn't do a great job for your usage on your computer.
> 3. Because of side issues (like blocking Semaphores), it took
> some effort to discover underlying problems.
> 4. But now we know why, and will try to improve. For reasons you
> probably don't care about, it is hard to improve this usage while
> not hurting others.
> 5. In the mean time, you could if desired manually arrange
> parallelism rather than relying on j.u.streams.
> -Doug
> _______________________________________________
> Concurrency-interest mailing list
> Concurrency-interest at cs.oswego.edu
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
