[concurrency-interest] Some interesting (confusing?) benchmark results

Peter Levart peter.levart at gmail.com
Fri Feb 15 04:54:59 EST 2013


On 02/14/2013 08:40 PM, thurston at nomagicsoftware.com wrote:
> So my results are different in that the MergeSorter version (which 
> creates many subtasks, 16K+) performs better than the tasks == workers 
> (i.e. FJPThreads) == 2.  And not by a few percentage points - between 
> 30-40% on average.
> It's difficult for me to make sense of these conflicting results.

Your MergeSorter uses an inferior algorithm for sorting in the leaf 
tasks (insertion sort). Insertion sort has a time complexity of O(N^2) 
for random inputs, but it is as fast as or even faster than the 
asymptotically better algorithms (quicksort, heapsort, merge sort) for 
very small N (Arrays.sort delegates to insertion sort when N < 47), so 
when the THRESHOLD is big, the N^2 hits you.
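To illustrate (this class and its names are mine, not from your MergeSorter): a textbook insertion sort gives the same answer as Arrays.sort on a leaf-sized segment, but its work grows quadratically, so a leaf of 5M elements pays the full N^2 price.

```java
import java.util.Arrays;
import java.util.Random;

public class InsertSortDemo {
    // Textbook insertion sort: O(N^2) comparisons/moves on random input.
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int v = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > v) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = v;
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] small = rnd.ints(250).toArray(); // a THRESHOLD-sized leaf
        int[] copy = small.clone();
        insertionSort(small);
        Arrays.sort(copy);
        // Same result on a small leaf...
        System.out.println(Arrays.equals(small, copy));
        // ...but doubling N roughly quadruples insertion-sort work,
        // while Arrays.sort grows only by ~2x (plus a log factor).
    }
}
```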

>
> As far as performing sorts in-place on an array; my question: is that 
> multi-thread-safe?
> (although my code faces the same issue since insertionSort() is done 
> in-place).

As long as the algorithm guarantees that the segments of the array used 
(read or written) by different threads (tasks) executing concurrently 
are disjoint, you are fine. When you call SubTask.join(), the part of 
the array that the subtask was responsible for is ready to be used by 
the task that joined it - that's all the synchronization you need.
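A minimal sketch of that pattern (class and names are mine, for illustration): two subtasks sort disjoint halves of the same array in place, and the parent only touches both halves after join() has returned.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Two subtasks write disjoint segments of the shared array;
// join() is the only synchronization the parent needs.
public class DisjointSort extends RecursiveAction {
    static final int THRESHOLD = 8;
    final int[] a;
    final int offset, length;

    DisjointSort(int[] a, int offset, int length) {
        this.a = a; this.offset = offset; this.length = length;
    }

    @Override
    protected void compute() {
        if (length <= THRESHOLD) {
            Arrays.sort(a, offset, offset + length);
            return;
        }
        int half = length >> 1;
        DisjointSort left = new DisjointSort(a, offset, half);
        DisjointSort right = new DisjointSort(a, offset + half, length - half);
        left.fork();
        right.compute();
        left.join(); // left's writes to a[offset..offset+half) are now visible

        // Merge the two sorted halves (simple O(N)-extra-space version).
        int[] merged = new int[length];
        int end = offset + length, midIdx = offset + half;
        for (int i = offset, j = midIdx, k = 0; k < length; k++) {
            if (j >= end || (i < midIdx && a[i] <= a[j])) {
                merged[k] = a[i++];
            } else {
                merged[k] = a[j++];
            }
        }
        System.arraycopy(merged, 0, a, offset, length);
    }

    public static void main(String[] args) {
        int[] a = {5, 3, 9, 1, 7, 2, 8, 4, 6, 0, 11, 10};
        new ForkJoinPool(2).invoke(new DisjointSort(a, 0, a.length));
        System.out.println(Arrays.toString(a));
    }
}
```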

>
> There is nothing in the javadocs that specifically addresses this issue:
>    "Tasks should also not perform blocking IO, and should ideally 
> access variables that   are completely independent of those accessed 
> by other running tasks" is the closest thing I could find.

Individual elements of an array *are* independent variables.

>
> Is a child's returning a value from its compute() method in a 
> *happens-before* relationship with its parent's reading the return 
> value from join(), i.e.
>
> C(return from compute) < P(read return from join)
>
> because P can occur on a different thread than C?
> I guess that's a question for Dr. Lea

Although there's no explicit mention of that in the javadoc (or I just 
can't spot it), this should be self-evident. I don't know how FJPool 
would be useful otherwise.
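Every fork/join program quietly relies on exactly that edge. A tiny sketch of the guarantee (class and field names are mine, purely illustrative): a child writes a plain, non-volatile field before returning from compute(), and the parent reads it after join().

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: the child's write to a plain field before returning from
// compute() (C) is visible to the parent after join() (P), because the
// completion of the task happens-before join()'s return.
public class JoinVisibility extends RecursiveTask<Integer> {
    int sideEffect; // deliberately NOT volatile

    @Override
    protected Integer compute() {
        sideEffect = 42; // C: written in the child task
        return 42;
    }

    public static void main(String[] args) {
        JoinVisibility child = new JoinVisibility();
        ForkJoinPool pool = new ForkJoinPool(2);
        pool.submit(child);
        int result = child.join(); // P: ordered after C by join()
        System.out.println(result + " " + child.sideEffect);
    }
}
```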

Regards, Peter

>
>
>
> On 2013-02-13 08:30, Peter Levart wrote:
>> On 02/13/2013 04:59 PM, thurston at nomagicsoftware.com wrote:
>>> so you have 4 cores?
>>> Makes sense that those results correspond to 20M as opposed to 2M
>>> And given that you changed insertSort() -> Arrays.sort, if you 
>>> change threshold to input.length / ForkJoinPool.threads (presumably 
>>> 4 in your case), does that perform better?
>> It does (N=20M, THRESHOLD=5M, parallelism=4):
>>
>> Arrays.sort: 1529550064 nanos
>> Arrays.sort: 1478430550 nanos
>> Arrays.sort: 1475643365 nanos
>> Arrays.sort: 1481308646 nanos
>> Arrays.sort: 1477199836 nanos
>> MergeSorter2: 500316422 nanos
>> MergeSorter2: 486144409 nanos
>> MergeSorter2: 468044083 nanos
>> MergeSorter2: 624446626 nanos
>> MergeSorter2: 635152854 nanos
>>
>>
>>>
>>> Many people were suggesting that the # of tasks created should be 
>>> bounded by the # of threads in the fjpool which should == # of 
>>> hardware threads. (you can accomplish this by setting the threshold 
>>> to above formula).
>>
>> But as David Holmes has noted, the THRESHOLD should be further
>> divided by 8 as a starting point, to account for unequal execution
>> of threads/tasks, so that there is no waiting at the end for the final
>> long tasks to complete while other threads in the pool sit idle...
>>
>> For example (N=20M, THRESHOLD=500K, parallelism=4):
>>
>> MergeSorter2: 531712994 nanos
>> MergeSorter2: 539048356 nanos
>> MergeSorter2: 586208993 nanos
>> MergeSorter2: 544239837 nanos
>> MergeSorter2: 497440246 nanos
>>
>> And (N=20M, THRESHOLD=50K, parallelism=4):
>>
>> MergeSorter2: 548634308 nanos
>> MergeSorter2: 579119444 nanos
>> MergeSorter2: 555252699 nanos
>> MergeSorter2: 505902467 nanos
>> MergeSorter2: 503907973 nanos
>>
>> And (N=20M, THRESHOLD=5K, parallelism=4):
>>
>> MergeSorter2: 547342635 nanos
>> MergeSorter2: 575145627 nanos
>> MergeSorter2: 559160837 nanos
>> MergeSorter2: 530709697 nanos
>> MergeSorter2: 562272911 nanos
>>
>> ...it does perform a little slower, but on average, with a lower
>> THRESHOLD, the performance should be more stable...
>>
>>
>> Regards, Peter
>>
>>>
>>> And with 20M/250 (threshold), you would be creating what 100Ks?, 
>>> 1Ms? (you can comment in the AtomicInteger.increment()) in the 
>>> constructor to save on the math
>>>
>>> And of course *I* understand the performance characteristics of 
>>> insertionSort, look at the implementation in SortUtil (which is not 
>>> parallel).  What I did makes perfect sense if you understand 
>>> insertSort vs mergeSort
>>>
>>> On 2013-02-13 07:34, Peter Levart wrote:
>>>> On 02/13/2013 04:27 PM, thurston at nomagicsoftware.com wrote:
>>>>> Could you publish your machine specs?
>>>>> And what was the size of your sample array?  2M or something else?
>>>>> If it is 2M (and if I'm doing my math right), it's weird that my 
>>>>> times are less:
>>>>> basically for 2M/250, I get around 370ms +- 25ms for the MergeSorter
>>>>> and around 330ms for java's Array.sort
>>>>
>>>> Oh, sorry. My machine is an i7 Linux PC, using a recent JDK8 build. I
>>>> had to increase N to 20M to actually make it sweat a little...
>>>>
>>>> Your unchanged MergeSorter is better than Arrays.sort with
>>>> parallelism = 4 for example:
>>>>
>>>> MergeSorter: 1134740049 nanos
>>>> MergeSorter: 1082783496 nanos
>>>> MergeSorter: 1033780556 nanos
>>>> MergeSorter2: 720092808 nanos
>>>> MergeSorter2: 649416922 nanos
>>>> MergeSorter2: 605587079 nanos
>>>> Arrays.sort: 1454093373 nanos
>>>> Arrays.sort: 1466476368 nanos
>>>> Arrays.sort: 1461604292 nanos
>>>>
>>>> ...but with only 2 threads it is on-par with Arrays.sort.
>>>>
>>>> Regards, Peter
>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> On 2013-02-13 06:40, Peter Levart wrote:
>>>>>> On 02/12/2013 08:20 PM, thurston at nomagicsoftware.com wrote:
>>>>>>> Here is all of the relevant code in a single blob.  I think it's 
>>>>>>> self-explanatory; the sample test is at the end.
>>>>>>>
>>>>>>> http://pastebin.com/WrfBHYSG
>>>>>>
>>>>>> Hi Thurston,
>>>>>>
>>>>>> You're doing too much copying. I modified your MergeSorter a little:
>>>>>>
>>>>>>
>>>>>> public class MergeSorter2 extends RecursiveTask<int[]> {
>>>>>>     static final int THRESHOLD = 250;
>>>>>>     final int[] array, work;
>>>>>>     final int offset, length;
>>>>>>
>>>>>>     public MergeSorter2(int[] array) {
>>>>>>         this.array = array;
>>>>>>         this.work = new int[array.length];
>>>>>>         offset = 0;
>>>>>>         length = array.length;
>>>>>>     }
>>>>>>
>>>>>>     private MergeSorter2(int[] array, int[] work, int offset, int 
>>>>>> length) {
>>>>>>         this.array = array;
>>>>>>         this.work = work;
>>>>>>         this.offset = offset;
>>>>>>         this.length = length;
>>>>>>     }
>>>>>>
>>>>>>     @Override
>>>>>>     protected int[] compute() {
>>>>>>         if (length < MergeSorter2.THRESHOLD) {
>>>>>>             Arrays.sort(array, offset, offset + length);
>>>>>>             return this.array;
>>>>>>         }
>>>>>>
>>>>>>         int halfLength = length >> 1;
>>>>>>         MergeSorter2 left = new MergeSorter2(array, work, offset, 
>>>>>> halfLength);
>>>>>>         left.fork();
>>>>>>
>>>>>>         int mid = offset + halfLength;
>>>>>>
>>>>>>         MergeSorter2 right = new MergeSorter2(array, work, mid,
>>>>>> length - halfLength);
>>>>>>         right.compute();
>>>>>>         left.join();
>>>>>>
>>>>>>         // copy 1st half from array to work
>>>>>>         System.arraycopy(array, offset, work, offset, halfLength);
>>>>>>
>>>>>>         // merge 1st half of work and 2nd half of array into array
>>>>>>         int end = offset + length;
>>>>>>         int i = offset;
>>>>>>         int j = mid;
>>>>>>         int k = offset;
>>>>>>         int x = work[i];
>>>>>>         int y = array[j];
>>>>>>         while (true) {
>>>>>>             if (j >= end || x <= y) {
>>>>>>                 array[k++] = x;
>>>>>>                 i++;
>>>>>>                 // until we drain the 1st half (the rest of array is
>>>>>> already in place)
>>>>>>                 if (i >= mid) break;
>>>>>>                 x = work[i];
>>>>>>             } else { // j < end
>>>>>>                 array[k++] = y;
>>>>>>                 j++;
>>>>>>                 if (j < end) y = array[j];
>>>>>>             }
>>>>>>         }
>>>>>>
>>>>>>         return this.array;
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> ... it uses a single parallel "work" array for merging, which is
>>>>>> shared among tasks. It tries to minimize the copying. Here are the
>>>>>> results I get on a FJPool with parallelism=2:
>>>>>>
>>>>>>  MergeSorter: 1557638090 nanos
>>>>>>  MergeSorter: 1545600414 nanos
>>>>>>  MergeSorter: 1478185502 nanos
>>>>>> MergeSorter2: 1058750062 nanos
>>>>>> MergeSorter2:  978295370 nanos
>>>>>> MergeSorter2:  976838429 nanos
>>>>>>  Arrays.sort: 1494256525 nanos
>>>>>>  Arrays.sort: 1505639161 nanos
>>>>>>  Arrays.sort: 1483640342 nanos
>>>>>>
>>>>>>
>>>>>> I get even better results with a greater THRESHOLD (but not more than
>>>>>> 10% better). Your MergeSorter is worse with a greater THRESHOLD 
>>>>>> because
>>>>>> you are using insertion sort for the leaf tasks. Arrays.sort() 
>>>>>> delegates
>>>>>> to insertion sort only when N < 47; otherwise it uses quicksort...
>>>>>>
>>>>>> Regards, Peter
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2013-02-12 01:55, thurston at nomagicsoftware.com wrote:
>>>>>>>> Yes, I will push the code and my measurements some time this week.
>>>>>>>> I'm really interested in seeing what others' results would be. 
>>>>>>>> What I
>>>>>>>> can say is that the timings are surprisingly stable (range 
>>>>>>>> 30-40ms)
>>>>>>>> and the input data is randomly generated for each test run 
>>>>>>>> which can
>>>>>>>> certainly affect merge(int[], int[])'s performance
>>>>>>>>
>>>>>>>> On 2013-02-12 01:50, Kirk Pepperdine wrote:
>>>>>>>>> Hi,
>>>>>>>>> On 2013-02-11, at 9:52 PM, thurston at nomagicsoftware.com wrote:
>>>>>>>>> Hi T,
>>>>>>>>>
>>>>>>>>> Can you pub the raw data from your runs? Average tends to hide
>>>>>>>>> effects and it's difficult to say anything with only that 
>>>>>>>>> measure.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Kirk
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I made some initial attempts at using ForkJoin framework for 
>>>>>>>>>> some rather obvious recursively parallel use-cases, namely 
>>>>>>>>>> summing the elements of an int[] and also sorting an int[] 
>>>>>>>>>> (randomly filled).
>>>>>>>>>>
>>>>>>>>>> The problem-sizes I used ranged from 1M - 10M.
>>>>>>>>>>
>>>>>>>>>> I really enjoy using the framework and find it relatively 
>>>>>>>>>> easy to reason about my programs (if not necessarily the 
>>>>>>>>>> internals of the framework).
>>>>>>>>>> But I was disappointed with the results: in both cases they 
>>>>>>>>>> were slower than the corresponding Java serial implementation.
>>>>>>>>>>
>>>>>>>>>> I completely understand the principle of YMMV, and I wasn't 
>>>>>>>>>> expecting 2x speedup, but especially in the case of the sort, 
>>>>>>>>>> but I was expecting that ForkJoin would do at least a little 
>>>>>>>>>> better than the single-threaded version. I guess what I'm 
>>>>>>>>>> asking is: are these results surprising?  Does it mean I'm 
>>>>>>>>>> doing something wrong?
>>>>>>>>>>
>>>>>>>>>> I'll give a brief description of my implementation:
>>>>>>>>>> single-threaded:  Arrays.sort(int[])
>>>>>>>>>>
>>>>>>>>>> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250 
>>>>>>>>>> (arrived at by trial-and-error)
>>>>>>>>>>
>>>>>>>>>> int[] compute()
>>>>>>>>>> {
>>>>>>>>>> if int[].length < THRESHOLD
>>>>>>>>>>    return insertionSort(int[])
>>>>>>>>>> left = new MergeSort(int[] //split in half)
>>>>>>>>>> left.fork()
>>>>>>>>>> right = new MergeSort(int[] //other half)
>>>>>>>>>> return merge(right.compute(), left.join())
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> The insertionSort() and merge() methods are just standard 
>>>>>>>>>> implementations; there is a bit of apples to oranges 
>>>>>>>>>> comparison since Arrays.sort() uses an optimized quicksort, 
>>>>>>>>>> but we're still talking O(nlog(n))
>>>>>>>>>>
>>>>>>>>>> Just ran it on my laptop:
>>>>>>>>>> Windows 7 64-bit
>>>>>>>>>> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
>>>>>>>>>> Core 2 Duo==> 2 cores 2GHz
>>>>>>>>>>
>>>>>>>>>> 2M int[]:
>>>>>>>>>> single-threaded: ~330ms
>>>>>>>>>> ForkJoin (2 workers): ~390ms
>>>>>>>>>>
>>>>>>>>>> Would appreciate any feedback and hopefully the #s are 
>>>>>>>>>> somewhat helpful (and please no scoffing at my antiquated 
>>>>>>>>>> machine)
>>>>>>>>>>
>>>>>>>>>> -T
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Concurrency-interest mailing list
>>>>>>>>>> Concurrency-interest at cs.oswego.edu
>>>>>>>>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>>>>>
>>>>>
>>>
>


