[concurrency-interest] Some interesting (confusing?) benchmark results

thurston at nomagicsoftware.com thurston at nomagicsoftware.com
Thu Feb 14 14:40:42 EST 2013


So my results differ in that the MergeSorter version (which 
creates many subtasks, 16K+) performs better than the version with 
tasks == workers (i.e. FJPThreads) == 2.  And not by a few percentage 
points: by 30-40% on average.
It's difficult for me to make sense of these conflicting results.

As for performing sorts in-place on an array, my question is: is that 
multi-thread-safe?
(Although my code faces the same issue, since insertionSort() is done 
in-place.)
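
For what it's worth, the usual argument (my own sketch below, not anything from the javadocs) is that in-place work on disjoint index ranges touches no shared memory location, so there is no data race as long as the parent only reads both halves after joining:

```java
import java.util.Arrays;
import java.util.concurrent.RecursiveAction;

public class DisjointSort {
    // Each side sorts its own half of the same array in place. The index
    // ranges never overlap, so no memory location is shared between the
    // two sorts, and join() orders the child's writes before our reads.
    static void sortHalves(int[] a) {
        int mid = a.length >>> 1;
        RecursiveAction left = new RecursiveAction() {
            @Override protected void compute() { Arrays.sort(a, 0, mid); }
        };
        left.fork();                    // child sorts the first half
        Arrays.sort(a, mid, a.length);  // caller sorts the second half
        left.join();
        // a now holds two sorted runs, ready for a sequential merge
    }
}
```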

There is nothing in the javadocs that specifically addresses this 
issue:
    "Tasks should also not perform blocking IO, and should ideally 
access variables that are completely independent of those accessed by 
other running tasks" is the closest thing I could find.

Does a child returning a value from its compute() method *happen-before* 
its parent reading the return value from join()?  I.e.

C(return from compute) < P(read return from join)

because P can occur on a different thread than C.
I guess that's a question for Dr. Lea
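
The closest published guarantee I know of is the java.util.concurrent package javadoc on memory consistency: actions of the asynchronous computation represented by a Future happen-before actions following retrieval of the result via Future.get() in another thread. Since ForkJoinTask implements Future and join() retrieves the result, the same edge should apply. A minimal sketch (my own example, not code from the thread):

```java
import java.util.concurrent.RecursiveTask;

public class JoinVisibility {
    // Written by the child task with a plain, unsynchronized write.
    static int[] scratch = new int[1];

    static class Child extends RecursiveTask<Integer> {
        @Override
        protected Integer compute() {
            scratch[0] = 42;  // side effect in the child task
            return 42;
        }
    }

    public static void main(String[] args) {
        Child c = new Child();
        c.fork();                 // from a non-pool thread, runs on the common pool
        int returned = c.join();  // join() is the happens-before edge
        // Per the Future.get() guarantee, both the returned value and the
        // child's plain write should be visible here.
        System.out.println(returned + " " + scratch[0]);
    }
}
```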



On 2013-02-13 08:30, Peter Levart wrote:
> On 02/13/2013 04:59 PM, thurston at nomagicsoftware.com wrote:
>> so you have 4 cores?
>> Makes sense that those results correspond to 20M as opposed to 2M
>> And given that you changed insertSort() -> Arrays.sort, if you 
>> change threshold to input.length / ForkJoinPool.threads (presumably 4 
>> in your case), does that perform better?
> It does (N=20M, THRESHOLD=5M, parallelism=4):
>
> Arrays.sort: 1529550064 nanos
> Arrays.sort: 1478430550 nanos
> Arrays.sort: 1475643365 nanos
> Arrays.sort: 1481308646 nanos
> Arrays.sort: 1477199836 nanos
> MergeSorter2: 500316422 nanos
> MergeSorter2: 486144409 nanos
> MergeSorter2: 468044083 nanos
> MergeSorter2: 624446626 nanos
> MergeSorter2: 635152854 nanos
>
>
>>
>> Many people were suggesting that the # of tasks created should be 
>> bounded by the # of threads in the fjpool, which should == the # of 
>> hardware threads (you can accomplish this by setting the threshold to 
>> the above formula).
>
> But as David Holmes has noted, the THRESHOLD should be further
> divided by 8 as a starting point, to accommodate unequal execution
> of threads/tasks, so that there is no waiting at the end for the final
> long tasks to complete while the other threads in the pool are idling...
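
The rule of thumb being described (threshold = N / parallelism, further divided by ~8) could be sketched as follows; threshold() is a hypothetical helper of mine, not code from the thread:

```java
public class Thresholds {
    // Rule-of-thumb sequential threshold: aim for roughly 8 tasks per
    // worker so that unequal task lengths can still be load-balanced.
    static int threshold(int n, int parallelism) {
        return Math.max(1, n / (parallelism * 8));
    }

    public static void main(String[] args) {
        // N = 20M with parallelism 4 gives a threshold of 625,000,
        // in the same ballpark as the 500K used below.
        System.out.println(threshold(20_000_000, 4));
    }
}
```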
>
> For example (N=20M, THRESHOLD=500K, parallelism=4):
>
> MergeSorter2: 531712994 nanos
> MergeSorter2: 539048356 nanos
> MergeSorter2: 586208993 nanos
> MergeSorter2: 544239837 nanos
> MergeSorter2: 497440246 nanos
>
> And (N=20M, THRESHOLD=50K, parallelism=4):
>
> MergeSorter2: 548634308 nanos
> MergeSorter2: 579119444 nanos
> MergeSorter2: 555252699 nanos
> MergeSorter2: 505902467 nanos
> MergeSorter2: 503907973 nanos
>
> And (N=20M, THRESHOLD=5K, parallelism=4):
>
> MergeSorter2: 547342635 nanos
> MergeSorter2: 575145627 nanos
> MergeSorter2: 559160837 nanos
> MergeSorter2: 530709697 nanos
> MergeSorter2: 562272911 nanos
>
> ...it does perform a little slower, but on average, with a lower
> THRESHOLD, the performance should be more stable...
>
>
> Regards, Peter
>
>>
>> And with 20M/250 (threshold), you would be creating what, 100Ks? 
>> 1Ms? (You can comment the AtomicInteger.increment() back in, in the 
>> constructor, to save on the math.)
>>
>> And of course *I* understand the performance characteristics of 
>> insertionSort; look at the implementation in SortUtil (which is not 
>> parallel).  What I did makes perfect sense if you understand 
>> insertSort vs. mergeSort.
>>
>> On 2013-02-13 07:34, Peter Levart wrote:
>>> On 02/13/2013 04:27 PM, thurston at nomagicsoftware.com wrote:
>>>> Could you publish your machine specs?
>>>> And what was the size of your sample array?  2M or something else?
>>>> If it is 2M (and if I'm doing my math right), it's weird that my 
>>>> times are less:
>>>> basically for 2M/250, I get around 370ms +- 25ms for the 
>>>> MergeSorter
>>>> and around 330ms for java's Array.sort
>>>
>>> Oh, sorry. My machine is an i7 Linux PC, using a recent JDK8 build. 
>>> I had
>>> to increase N to 20M to actually make it sweat a little...
>>>
>>> Your unchanged MergeSorter is better than Arrays.sort with
>>> parallelism = 4 for example:
>>>
>>> MergeSorter: 1134740049 nanos
>>> MergeSorter: 1082783496 nanos
>>> MergeSorter: 1033780556 nanos
>>> MergeSorter2: 720092808 nanos
>>> MergeSorter2: 649416922 nanos
>>> MergeSorter2: 605587079 nanos
>>> Arrays.sort: 1454093373 nanos
>>> Arrays.sort: 1466476368 nanos
>>> Arrays.sort: 1461604292 nanos
>>>
>>> ...but with only 2 threads it is on-par with Arrays.sort.
>>>
>>> Regards, Peter
>>>
>>>>
>>>> Thanks
>>>>
>>>> On 2013-02-13 06:40, Peter Levart wrote:
>>>>> On 02/12/2013 08:20 PM, thurston at nomagicsoftware.com wrote:
>>>>>> Here is all of the relevant code in a single blob.  I think it's 
>>>>>> self-explanatory; a sample test is at the end.
>>>>>>
>>>>>> http://pastebin.com/WrfBHYSG
>>>>>
>>>>> Hi Thurston,
>>>>>
>>>>> You're doing too much copying. I modified your MergeSorter a 
>>>>> little:
>>>>>
>>>>>
>>>>> public class MergeSorter2 extends RecursiveTask<int[]> {
>>>>>     static final int THRESHOLD = 250;
>>>>>     final int[] array, work;
>>>>>     final int offset, length;
>>>>>
>>>>>     public MergeSorter2(int[] array) {
>>>>>         this.array = array;
>>>>>         this.work = new int[array.length];
>>>>>         offset = 0;
>>>>>         length = array.length;
>>>>>     }
>>>>>
>>>>>     private MergeSorter2(int[] array, int[] work, int offset, int length) {
>>>>>         this.array = array;
>>>>>         this.work = work;
>>>>>         this.offset = offset;
>>>>>         this.length = length;
>>>>>     }
>>>>>
>>>>>     @Override
>>>>>     protected int[] compute() {
>>>>>         if (length < MergeSorter2.THRESHOLD) {
>>>>>             Arrays.sort(array, offset, offset + length);
>>>>>             return this.array;
>>>>>         }
>>>>>
>>>>>         int halfLength = length >> 1;
>>>>>         MergeSorter2 left = new MergeSorter2(array, work, offset, halfLength);
>>>>>         left.fork();
>>>>>
>>>>>         int mid = offset + halfLength;
>>>>>
>>>>>         MergeSorter2 right = new MergeSorter2(array, work, mid, length - halfLength);
>>>>>         right.compute();
>>>>>         left.join();
>>>>>
>>>>>         // copy 1st half from array to work
>>>>>         System.arraycopy(array, offset, work, offset, halfLength);
>>>>>
>>>>>         // merge 1st half of work and 2nd half of array into array
>>>>>         int end = offset + length;
>>>>>         int i = offset;
>>>>>         int j = mid;
>>>>>         int k = offset;
>>>>>         int x = work[i];
>>>>>         int y = array[j];
>>>>>         while (true) {
>>>>>             if (j >= end || x <= y) {
>>>>>                 array[k++] = x;
>>>>>                 i++;
>>>>>                 // until we drain the 1st half (the rest of array is already in place)
>>>>>                 if (i >= mid) break;
>>>>>                 x = work[i];
>>>>>             } else { // j < end
>>>>>                 array[k++] = y;
>>>>>                 j++;
>>>>>                 if (j < end) y = array[j];
>>>>>             }
>>>>>         }
>>>>>
>>>>>         return this.array;
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> ...it uses a single parallel "work" array for merging, which is
>>>>> shared among tasks, and tries to minimize copying. Here are the
>>>>> results I get on a FJPool with parallelism=2:
>>>>>
>>>>>  MergeSorter: 1557638090 nanos
>>>>>  MergeSorter: 1545600414 nanos
>>>>>  MergeSorter: 1478185502 nanos
>>>>> MergeSorter2: 1058750062 nanos
>>>>> MergeSorter2:  978295370 nanos
>>>>> MergeSorter2:  976838429 nanos
>>>>>  Arrays.sort: 1494256525 nanos
>>>>>  Arrays.sort: 1505639161 nanos
>>>>>  Arrays.sort: 1483640342 nanos
>>>>>
>>>>>
>>>>> I get even better results with a greater THRESHOLD (but not more 
>>>>> than
>>>>> 10% better). Your MergeSorter is worse with a greater THRESHOLD 
>>>>> because
>>>>> you are using insertion sort for the final tasks. Arrays.sort() 
>>>>> delegates
>>>>> to insertion sort only when N < 47; otherwise it uses quicksort...
>>>>>
>>>>> Regards, Peter
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2013-02-12 01:55, thurston at nomagicsoftware.com wrote:
>>>>>>> Yes, I will push the code and my measurements some time this 
>>>>>>> week.
>>>>>>> I'm really interested in seeing what others' results would be. 
>>>>>>> What I
>>>>>>> can say is that the timings are surprisingly stable (range 
>>>>>>> 30-40ms)
>>>>>>> and the input data is randomly generated for each test run 
>>>>>>> which can
>>>>>>> certainly affect merge(int[], int[])'s performance
>>>>>>>
>>>>>>> On 2013-02-12 01:50, Kirk Pepperdine wrote:
>>>>>>>> Hi,
>>>>>>>> On 2013-02-11, at 9:52 PM, thurston at nomagicsoftware.com wrote:
>>>>>>>> Hi T,
>>>>>>>>
>>>>>>>> Can you pub the raw data from your runs? Average tends to hide
>>>>>>>> effects and it's difficult to say anything with only that 
>>>>>>>> measure.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Kirk
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I made some initial attempts at using ForkJoin framework for 
>>>>>>>>> some rather obvious recursively parallel use-cases, namely 
>>>>>>>>> summing the elements of an int[] and also sorting an int[] 
>>>>>>>>> (randomly filled).
>>>>>>>>>
>>>>>>>>> The problem-sizes I used ranged from 1M - 10M.
>>>>>>>>>
>>>>>>>>> I really enjoy using the framework and find it relatively 
>>>>>>>>> easy to reason about my programs (if not necessarily the 
>>>>>>>>> internals of the framework).
>>>>>>>>> But I was disappointed with the results: in both cases they 
>>>>>>>>> were slower than the corresponding Java serial implementation.
>>>>>>>>>
>>>>>>>>> I completely understand the principle of YMMV, and I wasn't 
>>>>>>>>> expecting a 2x speedup, but, especially in the case of the sort, 
>>>>>>>>> I was expecting that ForkJoin would do at least a little 
>>>>>>>>> better than the single-threaded version.  I guess what I'm 
>>>>>>>>> asking is: are these results surprising?  Does it mean I'm 
>>>>>>>>> doing something wrong?
>>>>>>>>>
>>>>>>>>> I'll give a brief description of my implementation:
>>>>>>>>> single-threaded:  Arrays.sort(int[])
>>>>>>>>>
>>>>>>>>> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250 
>>>>>>>>> (arrived at by trial-and-error)
>>>>>>>>>
>>>>>>>>> int[] compute()
>>>>>>>>> {
>>>>>>>>>     if (array.length < THRESHOLD)
>>>>>>>>>         return insertionSort(array);
>>>>>>>>>     left = new MergeSort(array /* split in half */);
>>>>>>>>>     right = new MergeSort(array /* other half */);
>>>>>>>>>     left.fork();
>>>>>>>>>     return merge(right.compute(), left.join());
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The insertionSort() and merge() methods are just standard 
>>>>>>>>> implementations; there is a bit of an apples-to-oranges comparison, 
>>>>>>>>> since Arrays.sort() uses an optimized quicksort, but we're 
>>>>>>>>> still talking O(n log n)
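
For reference, minimal standard versions of the two helpers the sketch assumes (my own fill-ins, not the original SortUtil code):

```java
import java.util.Arrays;

public class SortUtilSketch {
    // Standard in-place insertion sort; O(n^2), but fast for tiny n.
    static int[] insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int v = a[i], j = i - 1;
            while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
        return a;
    }

    // Standard two-way merge of two sorted arrays into a new array.
    static int[] merge(int[] x, int[] y) {
        int[] out = new int[x.length + y.length];
        int i = 0, j = 0, k = 0;
        while (i < x.length && j < y.length)
            out[k++] = (x[i] <= y[j]) ? x[i++] : y[j++];
        while (i < x.length) out[k++] = x[i++];
        while (j < y.length) out[k++] = y[j++];
        return out;
    }
}
```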
>>>>>>>>>
>>>>>>>>> Just ran it on my laptop:
>>>>>>>>> Windows 7 64-bit
>>>>>>>>> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
>>>>>>>>> Core 2 Duo==> 2 cores 2GHz
>>>>>>>>>
>>>>>>>>> 2M int[]:
>>>>>>>>> single-threaded: ~330ms
>>>>>>>>> ForkJoin (2 workers): ~390ms
>>>>>>>>>
>>>>>>>>> Would appreciate any feedback; hopefully the #s are 
>>>>>>>>> somewhat helpful (and please, no scoffing at my antiquated 
>>>>>>>>> machine)
>>>>>>>>>
>>>>>>>>> -T
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Concurrency-interest mailing list
>>>>>>>>> Concurrency-interest at cs.oswego.edu
>>>>>>>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>>>>


