[concurrency-interest] Some interesting (confusing?) benchmark results

√iktor Ҡlang viktor.klang at gmail.com
Mon Feb 11 17:39:06 EST 2013


David,


On Mon, Feb 11, 2013 at 11:34 PM, David Holmes <davidcholmes at aapt.net.au> wrote:

> Viktor,
>
> See the description of LoadBalanceFactor. It makes sense to create more
> tasks than there are threads because things progress at different rates. A
> thread can be delayed through scheduling, JIT activity, GC etc etc. Suppose
> with 2 threads one had completed its half while the other was only a
> quarter of the way through. You then have to wait much longer for
> completion. But if you have broken things down into smaller subtasks (say
> 4) then that first thread could be "helping" the second to complete the
> overall job. But break it down too fine and you get too much task overhead.
> Hence the heuristic value of 8 - which is generally a good starting point.
> Plus performance is not super-sensitive to getting the LoadBalanceFactor
> exactly right.
>

True, but cache effects (& hardware) also have to be taken into
consideration (shared L2 vs shared L3 vs going cross-socket/NUMA).
Ideally the user API would avoid encoding such assumptions, as they
will vary from platform to platform…
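
As a minimal sketch of the kind of thing I mean (class and method names are
made up), the cut-off can at least be derived from the pool at runtime using
David's heuristic, rather than hard-coding a platform-specific constant:

import java.util.concurrent.ForkJoinPool;

final class Thresholds {
    // David's heuristic: threshold = N / (loadBalanceFactor * workers),
    // with 8 as a reasonable starting factor.
    static int sequentialThreshold(int problemSize, ForkJoinPool pool) {
        final int loadBalanceFactor = 8;
        return Math.max(1, problemSize / (loadBalanceFactor * pool.getParallelism()));
    }
}

Of course that still says nothing about L2/L3 sharing or socket boundaries,
which is exactly the part that is hard to expose without leaking platform
details into the API.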

Cheers,
√


>
> David
>
> -----Original Message-----
> *From:* viktor klang [mailto:viktor.klang at gmail.com]
> *Sent:* Tuesday, 12 February 2013 8:25 AM
> *To:* dholmes at ieee.org
> *Cc:* thurston; concurrency-interest
> *Subject:* Re: [concurrency-interest] Some interesting (confusing?)
> benchmark results
>
>
>
>
> On Mon, Feb 11, 2013 at 11:18 PM, David Holmes <davidcholmes at aapt.net.au> wrote:
>
>> That doesn't make sense to me. If you have a threshold of 250 then you are
>> creating thousands of tasks, and that overhead should, if anything, bring
>> your laptop to its knees. With a 250K threshold you have far fewer tasks.
>>
>
> I'm not sure it makes sense to create more sort-tasks than there are
> threads in the pool.
>
>
>>
>> That said, I don't know what the memory requirements are for your
>> algorithm. If you are allocating temporary arrays then a large threshold
>> could well cause a problem.
>>
>> The parallel array sort going into JDK 8 is a merge sort but uses a
>> working
>> array equal in size to the original array to be sorted.
>>
>
> No in-place merge sort? ;-)
>
>
>>
>> David
>>
>> > -----Original Message-----
>> > From: concurrency-interest-bounces at cs.oswego.edu
>> > [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
>> > thurston at nomagicsoftware.com
>> > Sent: Tuesday, 12 February 2013 7:49 AM
>> > To: concurrency-interest at cs.oswego.edu
>> > Subject: Re: [concurrency-interest] Some interesting (confusing?)
>> > benchmark results
>> >
>> >
>> > Well, it's the algorithm that's dispositive in this case.
>> > Insertion-sort is empirically better than mergesort up to arrays of
>> > around 25 elements.  That's where I started; I then played around with
>> > different thresholds and 250 seemed to be the best.
>> > Anything larger than 1000 and you start to see significant degradation.
>> > I tried your formula (approximately) with N = 2 000 000 ==> 250K
>> >
>> > That brought my laptop to its knees (as expected): over 33K ms as
>> > opposed to 390ms with 250.
>> >
>> > Following my own admonition (the algorithm matters), it occurs to me
>> > that I should just use my own serial mergesort as the comparison to the
>> > ForkJoin version (as opposed to Arrays.sort).  I'll let y'all know the
>> > results of that.
>> >
>> > -T
>> >
>> >
>> >
>> > On 2013-02-11 13:19, David Holmes wrote:
>> > > Your sequential threshold is far too small for only two worker threads.
>> > > You would get near-optimum speedup just by splitting the problem into two.
>> > >
>> > > The heuristic for sequential threshold is:
>> > >
>> > > Threshold = N / (LoadBalanceFactor * NumCore)
>> > >
>> > > where N is the "size" of the problem.
>> > >
>> > > The LoadBalanceFactor in an ideal world would be 1, but real tasks
>> > > encounter delays and execute differently, so a value > 1 is needed.
>> > > Heuristically, a value of 8 is a good starting value.
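>> > >
>> > > For example: with N = 2,000,000 and 2 cores, that gives
>> > > Threshold = 2,000,000 / (8 * 2) = 125,000 as a starting point.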
>> > >
>> > > David Holmes
>> > >
>> > >> -----Original Message-----
>> > >> From: concurrency-interest-bounces at cs.oswego.edu
>> > >> [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
>> > >> thurston at nomagicsoftware.com
>> > >> Sent: Tuesday, 12 February 2013 6:52 AM
>> > >> To: concurrency-interest at cs.oswego.edu
>> > >> Subject: [concurrency-interest] Some interesting (confusing?)
>> > >> benchmark
>> > >> results
>> > >>
>> > >>
>> > >> Hello,
>> > >>
>> > >> I made some initial attempts at using the ForkJoin framework for some
>> > >> rather obvious recursively parallel use-cases, namely summing the
>> > >> elements of an int[] and also sorting an int[] (randomly filled).
>> > >>
>> > >> The problem-sizes I used ranged from 1M - 10M.
>> > >>
>> > >> I really enjoy using the framework and find it relatively easy to
>> > >> reason about my programs (if not necessarily the internals of the
>> > >> framework).
>> > >> But I was disappointed with the results: in both cases they were
>> > >> slower
>> > >> than the corresponding Java serial implementation.
>> > >>
>> > >> I completely understand the principle of YMMV, and I wasn't expecting a
>> > >> 2x speedup, but especially in the case of the sort I was expecting that
>> > >> ForkJoin would do at least a little better than the single-threaded
>> > >> version.  I guess what I'm asking is: are these results surprising?
>> > >> Does it mean I'm doing something wrong?
>> > >>
>> > >> I'll give a brief description of my implementation:
>> > >> single-threaded:  Arrays.sort(int[])
>> > >>
>> > >> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250 (arrived at by
>> > >> trial-and-error)
>> > >>
>> > >> int[] compute()
>> > >> {
>> > >>     if (array.length < THRESHOLD)
>> > >>         return insertionSort(array);
>> > >>     left = new MergeSort(array);   // first half
>> > >>     right = new MergeSort(array);  // second half
>> > >>     left.fork();
>> > >>     return merge(right.compute(), left.join());
>> > >> }
>> > >>
>> > >> The insertionSort() and merge() methods are just standard
>> > >> implementations; there is a bit of an apples-to-oranges comparison since
>> > >> Arrays.sort() uses an optimized quicksort, but we're still talking
>> > >> O(n log n).
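>> > >>
>> > >> Fleshed out, the task is roughly the following (the helper bodies shown
>> > >> here are illustrative textbook versions rather than my exact code):
>> > >>
>> > >> import java.util.Arrays;
>> > >> import java.util.concurrent.RecursiveTask;
>> > >>
>> > >> class MergeSortTask extends RecursiveTask<int[]> {
>> > >>     private static final int THRESHOLD = 250;  // found by trial and error
>> > >>     private final int[] array;                 // this task's slice
>> > >>
>> > >>     MergeSortTask(int[] array) { this.array = array; }
>> > >>
>> > >>     @Override
>> > >>     protected int[] compute() {
>> > >>         if (array.length < THRESHOLD)
>> > >>             return insertionSort(array);       // small slices: sort serially
>> > >>         int mid = array.length / 2;
>> > >>         MergeSortTask left = new MergeSortTask(Arrays.copyOfRange(array, 0, mid));
>> > >>         MergeSortTask right = new MergeSortTask(Arrays.copyOfRange(array, mid, array.length));
>> > >>         left.fork();                           // hand one half to the pool
>> > >>         return merge(right.compute(), left.join());
>> > >>     }
>> > >>
>> > >>     private static int[] insertionSort(int[] a) {
>> > >>         for (int i = 1; i < a.length; i++) {
>> > >>             int key = a[i];
>> > >>             int j = i - 1;
>> > >>             while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
>> > >>             a[j + 1] = key;
>> > >>         }
>> > >>         return a;
>> > >>     }
>> > >>
>> > >>     private static int[] merge(int[] a, int[] b) {
>> > >>         int[] out = new int[a.length + b.length];
>> > >>         int i = 0, j = 0, k = 0;
>> > >>         while (i < a.length && j < b.length)
>> > >>             out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
>> > >>         while (i < a.length) out[k++] = a[i++];
>> > >>         while (j < b.length) out[k++] = b[j++];
>> > >>         return out;
>> > >>     }
>> > >> }
>> > >>
>> > >> It gets invoked with something like:
>> > >> int[] sorted = new ForkJoinPool(2).invoke(new MergeSortTask(data));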
>> > >>
>> > >> Just ran it on my laptop:
>> > >> Windows 7 64-bit
>> > >> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
>> > >> Core 2 Duo ==> 2 cores @ 2GHz
>> > >>
>> > >> 2M int[]:
>> > >> single-threaded: ~330ms
>> > >> ForkJoin (2 workers): ~390ms
>> > >>
>> > >> Would appreciate any feedback, and hopefully the #s are somewhat helpful
>> > >> (and please no scoffing at my antiquated machine).
>> > >>
>> > >> -T
>> > >>
>> > >>
>> > >> _______________________________________________
>> > >> Concurrency-interest mailing list
>> > >> Concurrency-interest at cs.oswego.edu
>> > >> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>> > >>
>> >
>> > _______________________________________________
>> > Concurrency-interest mailing list
>> > Concurrency-interest at cs.oswego.edu
>> > http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>> >
>>
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
>
>
>
> --
> *Viktor Klang*
> *Director of Engineering*
>
> Typesafe <http://www.typesafe.com/> - The software stack for applications
> that scale
> Twitter: @viktorklang
>
>


-- 
*Viktor Klang*
*Director of Engineering*
Typesafe <http://www.typesafe.com/> - The software stack for applications
that scale
Twitter: @viktorklang

