[concurrency-interest] Some interesting (confusing?) benchmark results

thurston at nomagicsoftware.com thurston at nomagicsoftware.com
Wed Feb 13 10:59:38 EST 2013


so you have 4 cores?
Makes sense that those results correspond to 20M as opposed to 2M.
And given that you changed insertSort() -> Arrays.sort(), if you change 
the threshold to input.length / ForkJoinPool.threads (presumably 4 in 
your case), does that perform better?

Many people were suggesting that the # of tasks created should be 
bounded by the # of threads in the fjpool, which should == the # of 
hardware threads. (You can accomplish this by setting the threshold to 
the above formula.)
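
A minimal sketch of that formula (my illustration; the class and 
variable names are not from your code):

import java.util.concurrent.ForkJoinPool;

public class ThresholdDemo {
    public static void main(String[] args) {
        int[] input = new int[20_000_000];  // 20M elements, as in your run
        // default parallelism == number of available processors
        ForkJoinPool pool = new ForkJoinPool();
        // roughly one leaf task per worker thread; exact only when the
        // parallelism is a power of two, since the sort splits by halving
        int threshold = input.length / pool.getParallelism();
        System.out.println("threshold = " + threshold);  // 5,000,000 with 4 workers
    }
}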

And with 20M/250 (threshold), how many tasks would you be creating - 
100Ks? 1Ms? (You can uncomment the AtomicInteger.incrementAndGet() in 
the constructor to save on the math.)
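
If you don't want to run it, the count can be sketched directly; the 
class below is my illustration (it mirrors only the splitting logic), 
not your actual MergeSorter:

import java.util.concurrent.atomic.AtomicInteger;

public class TaskCount {
    static final int THRESHOLD = 250;
    static final AtomicInteger TASKS = new AtomicInteger();

    // one task per node of the recursion tree, splitting by halving
    // until the piece is below THRESHOLD
    static void split(int length) {
        TASKS.incrementAndGet();
        if (length < THRESHOLD) return;
        int half = length >> 1;
        split(half);
        split(length - half);
    }

    public static void main(String[] args) {
        split(20_000_000);
        System.out.println(TASKS.get());  // roughly 260,000 tasks for 20M/250
    }
}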

And of course *I* understand the performance characteristics of 
insertion sort; look at the implementation in SortUtil (which is not 
parallel). What I did makes perfect sense if you understand insertion 
sort vs. merge sort.
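
Since SortUtil isn't posted in this thread, here is the kind of 
standard (sequential, in-place) insertion sort I mean; this is my 
sketch, and the real SortUtil may differ in details:

static void insertionSort(int[] a, int from, int to) {
    for (int i = from + 1; i < to; i++) {
        int key = a[i];
        int j = i - 1;
        // shift larger elements one slot right to make room for key
        while (j >= from && a[j] > key) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;
    }
}

It does O(n^2) comparisons in the worst case, which is exactly why a 
large THRESHOLD hurts when the leaves are insertion-sorted.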

On 2013-02-13 07:34, Peter Levart wrote:
> On 02/13/2013 04:27 PM, thurston at nomagicsoftware.com wrote:
>> Could you publish your machine specs?
>> And what was the size of your sample array?  2M or something else?
>> If it is 2M (and if I'm doing my math right), it's weird that my 
>> times are less:
>> basically for 2M/250, I get around 370ms +/- 25ms for the MergeSorter
>> and around 330ms for Java's Arrays.sort
>
> Oh, sorry. My machine is i7 Linux PC, using recent JDK8 build. I had
> to increase the N to 20M to actually make it sweat a little...
>
> Your unchanged MergeSorter is better than Arrays.sort with
> parallelism = 4 for example:
>
> MergeSorter: 1134740049 nanos
> MergeSorter: 1082783496 nanos
> MergeSorter: 1033780556 nanos
> MergeSorter2: 720092808 nanos
> MergeSorter2: 649416922 nanos
> MergeSorter2: 605587079 nanos
> Arrays.sort: 1454093373 nanos
> Arrays.sort: 1466476368 nanos
> Arrays.sort: 1461604292 nanos
>
> ...but with only 2 threads it is on-par with Arrays.sort.
>
> Regards, Peter
>
>>
>> Thanks
>>
>> On 2013-02-13 06:40, Peter Levart wrote:
>>> On 02/12/2013 08:20 PM, thurston at nomagicsoftware.com wrote:
>>>> Here is all of the relevant code in a single blob.  I think it's 
>>>> self-explanatory; a sample test is at the end.
>>>>
>>>> http://pastebin.com/WrfBHYSG
>>>
>>> Hi Thurston,
>>>
>>> You're doing too much copying. I modified your MergeSorter a 
>>> little:
>>>
>>>
>>> import java.util.Arrays;
>>> import java.util.concurrent.RecursiveTask;
>>>
>>> public class MergeSorter2 extends RecursiveTask<int[]> {
>>>     static final int THRESHOLD = 250;
>>>     final int[] array, work;
>>>     final int offset, length;
>>>
>>>     public MergeSorter2(int[] array) {
>>>         this.array = array;
>>>         this.work = new int[array.length];
>>>         offset = 0;
>>>         length = array.length;
>>>     }
>>>
>>>     private MergeSorter2(int[] array, int[] work, int offset, int length) {
>>>         this.array = array;
>>>         this.work = work;
>>>         this.offset = offset;
>>>         this.length = length;
>>>     }
>>>
>>>     @Override
>>>     protected int[] compute() {
>>>         if (length < MergeSorter2.THRESHOLD) {
>>>             Arrays.sort(array, offset, offset + length);
>>>             return this.array;
>>>         }
>>>
>>>         int halfLength = length >> 1;
>>>         MergeSorter2 left = new MergeSorter2(array, work, offset, halfLength);
>>>         left.fork();
>>>
>>>         int mid = offset + halfLength;
>>>
>>>         MergeSorter2 right = new MergeSorter2(array, work, mid, length - halfLength);
>>>         right.compute();
>>>         left.join();
>>>
>>>         // copy 1st half from array to work
>>>         System.arraycopy(array, offset, work, offset, halfLength);
>>>
>>>         // merge 1st half of work and 2nd half of array into array
>>>         int end = offset + length;
>>>         int i = offset;
>>>         int j = mid;
>>>         int k = offset;
>>>         int x = work[i];
>>>         int y = array[j];
>>>         while (true) {
>>>             if (j >= end || x <= y) {
>>>                 array[k++] = x;
>>>                 i++;
>>>                 // until we drain the 1st half (the rest of array is already in place)
>>>                 if (i >= mid) break;
>>>                 x = work[i];
>>>             } else { // j < end
>>>                 array[k++] = y;
>>>                 j++;
>>>                 if (j < end) y = array[j];
>>>             }
>>>         }
>>>
>>>         return this.array;
>>>     }
>>> }
>>>
>>>
>>> ... it uses a single parallel "work" array for merging, which is
>>> shared among tasks. It tries to minimize the copying. Here are the
>>> results I get on a FJPool with parallelism=2:
>>>
>>>  MergeSorter: 1557638090 nanos
>>>  MergeSorter: 1545600414 nanos
>>>  MergeSorter: 1478185502 nanos
>>> MergeSorter2: 1058750062 nanos
>>> MergeSorter2:  978295370 nanos
>>> MergeSorter2:  976838429 nanos
>>>  Arrays.sort: 1494256525 nanos
>>>  Arrays.sort: 1505639161 nanos
>>>  Arrays.sort: 1483640342 nanos
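
(Aside: the following harness sketch is mine, not Peter's actual test 
code; it assumes Java 8 for Random.ints() and would produce this kind 
of comparison.)

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ForkJoinPool;

public class SortBench {
    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(2);  // parallelism = 2
        int[] data = new Random(42).ints(20_000_000).toArray();

        for (int run = 0; run < 3; run++) {
            int[] copy = data.clone();
            long t0 = System.nanoTime();
            pool.invoke(new MergeSorter2(copy));
            System.out.println("MergeSorter2: " + (System.nanoTime() - t0) + " nanos");

            copy = data.clone();
            t0 = System.nanoTime();
            Arrays.sort(copy);
            System.out.println(" Arrays.sort: " + (System.nanoTime() - t0) + " nanos");
        }
    }
}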
>>>
>>>
>>> I get even better results with a greater THRESHOLD (but not more than
>>> 10% better). Your MergeSorter is worse with a greater THRESHOLD
>>> because you are using insertion sort for the final tasks.
>>> Arrays.sort() delegates to insertion sort only when N < 47; otherwise
>>> it uses quicksort...
>>>
>>> Regards, Peter
>>>
>>>
>>>
>>>>
>>>>
>>>>
>>>> On 2013-02-12 01:55, thurston at nomagicsoftware.com wrote:
>>>>> Yes, I will push the code and my measurements sometime this week.
>>>>> I'm really interested in seeing what others' results would be. What
>>>>> I can say is that the timings are surprisingly stable (range
>>>>> 30-40ms), and the input data is randomly generated for each test
>>>>> run, which can certainly affect merge(int[], int[])'s performance.
>>>>>
>>>>> On 2013-02-12 01:50, Kirk Pepperdine wrote:
>>>>>> Hi,
>>>>>> On 2013-02-11, at 9:52 PM, thurston at nomagicsoftware.com wrote:
>>>>>> Hi T,
>>>>>>
>>>>>> Can you publish the raw data from your runs? Averages tend to hide
>>>>>> effects, and it's difficult to say anything with only that 
>>>>>> measure.
>>>>>>
>>>>>> Regards,
>>>>>> Kirk
>>>>>>
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I made some initial attempts at using ForkJoin framework for 
>>>>>>> some rather obvious recursively parallel use-cases, namely 
>>>>>>> summing the elements of an int[] and also sorting an int[] 
>>>>>>> (randomly filled).
>>>>>>>
>>>>>>> The problem-sizes I used ranged from 1M - 10M.
>>>>>>>
>>>>>>> I really enjoy using the framework and find it relatively easy 
>>>>>>> to reason about my programs (if not necessarily the internals of 
>>>>>>> the framework).
>>>>>>> But I was disappointed with the results: in both cases they 
>>>>>>> were slower than the corresponding Java serial implementation.
>>>>>>>
>>>>>>> I completely understand the principle of YMMV, and I wasn't 
>>>>>>> expecting a 2x speedup, but, especially in the case of the sort, 
>>>>>>> I was expecting that ForkJoin would do at least a little better 
>>>>>>> than the single-threaded version.  I guess what I'm asking is: 
>>>>>>> are these results surprising?  Does it mean I'm doing something 
>>>>>>> wrong?
>>>>>>>
>>>>>>> I'll give a brief description of my implementation:
>>>>>>> single-threaded:  Arrays.sort(int[])
>>>>>>>
>>>>>>> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250 (arrived 
>>>>>>> at by trial-and-error)
>>>>>>>
>>>>>>> int[] compute()
>>>>>>> {
>>>>>>>     if (input.length < THRESHOLD)
>>>>>>>         return insertionSort(input);
>>>>>>>     left = new MergeSort(/* first half of input */);
>>>>>>>     right = new MergeSort(/* second half of input */);
>>>>>>>     left.fork();  // implied by the join below
>>>>>>>     return merge(right.compute(), left.join());
>>>>>>> }
>>>>>>>
>>>>>>> The insertionSort() and merge() methods are just standard 
>>>>>>> implementations; there is a bit of an apples-to-oranges 
>>>>>>> comparison, since Arrays.sort() uses an optimized quicksort, but 
>>>>>>> we're still talking O(n log n).
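
(For context: a "standard implementation" of merge in this style 
allocates a fresh output array on every call. The sketch below is my 
illustration, not the actual SortUtil code; the per-call allocation is 
the extra copying Peter points out above.)

static int[] merge(int[] left, int[] right) {
    int[] out = new int[left.length + right.length];  // new array per merge
    int i = 0, j = 0, k = 0;
    // take the smaller head element until one input is exhausted
    while (i < left.length && j < right.length) {
        out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
    }
    while (i < left.length) out[k++] = left[i++];
    while (j < right.length) out[k++] = right[j++];
    return out;
}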
>>>>>>>
>>>>>>> Just ran it on my laptop:
>>>>>>> Windows 7 64-bit
>>>>>>> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
>>>>>>> Core 2 Duo==> 2 cores 2GHz
>>>>>>>
>>>>>>> 2M int[]:
>>>>>>> single-threaded: ~330ms
>>>>>>> ForkJoin (2 workers): ~390ms
>>>>>>>
>>>>>>> Would appreciate any feedback, and hopefully the #s are somewhat 
>>>>>>> helpful (and please no scoffing at my antiquated machine).
>>>>>>>
>>>>>>> -T
>>>>>>>
>>>>>>>
>>>>
>>


