[concurrency-interest] Some interesting (confusing?) benchmark results

David Holmes davidcholmes at aapt.net.au
Tue Feb 12 00:26:00 EST 2013


> This is in line with my expectation; what amount of work-stealing could 
> happen with two tasks (each merge-sorting a 1M item array)?  With 16K 
> tasks, the work-stealing can really strut its stuff (and at least in my 
> case, can overwhelm the overhead associated with scheduling the tasks)

I think you are seeing effects that are not due to parallelisation of the algorithm as such. 16K tasks for two threads is just ridiculous task creation and management overhead. If you see a speed-up compared to three tasks then either you are not comparing apples to apples, or you are affected by other things - such as memory management when dealing with large arrays.
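
For anyone who wants to see the crossover on their own machine, a quick harness along these lines will do. (MergeSort here stands in for the poster's task class, with an assumed (array, threshold) constructor; one cold run per threshold is crude, but good enough to see a 100x blowup.)

    import java.util.Random;
    import java.util.concurrent.ForkJoinPool;

    public class ThresholdProbe {
        public static void main(String[] args) {
            int[] data = new int[2_000_000];
            Random rnd = new Random(42);
            for (int i = 0; i < data.length; i++)
                data[i] = rnd.nextInt();
            ForkJoinPool pool = new ForkJoinPool(2);     // match the 2-core laptop
            for (int threshold : new int[] {250, 1_000, 250_000}) {
                int[] copy = data.clone();               // fresh input for each run
                long t0 = System.nanoTime();
                pool.invoke(new MergeSort(copy, threshold));
                System.out.printf("threshold %,d: %,d ms%n",
                                  threshold, (System.nanoTime() - t0) / 1_000_000);
            }
        }
    }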

David
-----


> -----Original Message-----
> From: thurston at nomagicsoftware.com [mailto:thurston at nomagicsoftware.com]
> Sent: Tuesday, 12 February 2013 3:14 PM
> To: √iktor Ҡlang
> Cc: dholmes at ieee.org; concurrency-interest
> Subject: Re: [concurrency-interest] Some interesting (confusing?)
> benchmark results
> 
> 
> 
> 
> On 2013-02-11 14:24, √iktor Ҡlang wrote:
> > On Mon, Feb 11, 2013 at 11:18 PM, David Holmes 
> > <davidcholmes at aapt.net.au> wrote:
> >
> >> That doesn't make sense to me. If you had a threshold of 250 then 
> >> you are
> >> creating thousands of tasks and that overhead should potentially 
> >> bring your
> >> laptop to its knees. With 250K threshold you have far fewer tasks.
> >
> > I'm not sure it makes sense to create more sort-tasks than there are
> > threads in the pool.
> 
> Really, Viktor?  That surprises me coming from Akka.  Does that mean 
> you guys only allocate as many message-processing "workers" as there are 
> hardware threads?
> 
> To be clear, with the 2M-element array and the 250-element threshold 
> for the (serial) insertion sort, 16K+ sort tasks are constructed 
> (as D. Holmes mentioned).  This is significantly *faster* than doing 
> what either of you suggested.  If I simply split the problem in two, and 
> just run a serial mergesort on each half and then merge the two halves 
> (that's three total tasks), it's over 50% slower.
> 
> Let me repeat that.  16K+ tasks is much, much faster than 3 tasks.
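> 
> (For concreteness, the three-task variant looks roughly like this inside 
> the top-level task's compute() -- HalfSort standing in for a 
> RecursiveTask<int[]> that serially mergesorts its half, serialMergeSort() 
> and merge() being assumed helpers:
> 
>     HalfSort left = new HalfSort(leftHalf);    // task 1: forked, stealable
>     left.fork();
>     int[] r = serialMergeSort(rightHalf);      // task 2: done in this thread
>     int[] sorted = merge(r, left.join());      // task 3: the final merge
> )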
> 
> This is in line with my expectation; what amount of work-stealing could 
> happen with two tasks (each merge-sorting a 1M item array)?  With 16K 
> tasks, the work-stealing can really strut its stuff (and at least in my 
> case, can overwhelm the overhead associated with scheduling the tasks)
> 
> If you only create as many tasks as there are hardware threads -- that 
> seems a lot like a standard thread-pool implementation.  (Of course, if 
> there are many more cores, the opportunity for work-stealing would 
> increase even with the 1-1 ratio).
> 
> When I first used fork/join, that's the approach I took.
> 1.  split the problem into (hardware threads * 1 or 2) chunks
> 2.  have each task do its work serially (i.e. don't fork)
> 3.  combine the (sub)totals/results (see the sketch just below)
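> 
> A rough sketch of that shape, for the summing case (SerialSum standing 
> in for a RecursiveTask<Long> that sums array[from..to) in a plain loop):
> 
>     ForkJoinPool pool = new ForkJoinPool();        // one worker per hw thread
>     int n = Runtime.getRuntime().availableProcessors();
>     List<ForkJoinTask<Long>> parts = new ArrayList<>();
>     for (int i = 0; i < n; i++)                    // 1. split, one chunk per core
>         parts.add(pool.submit(new SerialSum(array,
>                 i * array.length / n, (i + 1) * array.length / n)));
>     long total = 0;
>     for (ForkJoinTask<Long> part : parts)          // 2. each part ran serially;
>         total += part.join();                      // 3. combine the subtotals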
> 
> Not only is that less appealing from a programming point of view, it 
> also performs more slowly; doing better than that is the promise, nay 
> the very purpose, of the fork/join framework, at least when it comes to 
> implementing naturally recursive algorithms.
> 
> -T
> 
> 
> >
> >> That said I don't know what the memory requirements are for your 
> >> algorithm.
> >> If you are allocating temporary arrays then a large threshold could 
> >> well
> >> cause a problem.
> >>
> >> The parallel array sort going into JDK 8 is a merge sort but uses a 
> >> working
> >> array equal in size to the original array to be sorted.
> >
> > No in-place merge sort? ;-)
> >  
> >
> >> > -----Original Message-----
> >> > From: concurrency-interest-bounces at cs.oswego.edu
> >> > [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
> >> > thurston at nomagicsoftware.com
> >> >
> >>> Sent: Tuesday, 12 February 2013 7:49 AM
> >> > To: concurrency-interest at cs.oswego.edu
> >> > Subject: Re: [concurrency-interest] Some interesting (confusing?)
> >> > benchmark results
> >> >
> >> >
> >> > Well, it's the algorithm that's dispositive in this case.
> >> > Insertion-sort is empirically better than mergesort up to around
> >> > 25-sized arrays.  That's where I started, and I played around with
> >> > different thresholds and 250 seemed to be the best.
> >> > Anything larger than 1000 and you start to see significant
> >> > degradation.
> >> > I tried your formula (approximately) with N = 2 000 000 ==> 250K
> >> >
> >> > That brought my laptop to its knees (as expected): over 33K ms as
> >> > opposed to 390ms with 250.
> >> >
> >> > Following my own admonition (the algorithm matters), it occurs to
> >> > me that I should just use my own serial mergesort as the comparison
> >> > for the ForkJoin version (as opposed to Arrays.sort).  I'll let
> >> > y'all know the results of that.
> >> >
> >> > -T
> >> >
> >> >
> >> >
> >> > On 2013-02-11 13:19, David Holmes wrote:
> >> > > Your sequential threshold is far too small for only two worker
> >> > > threads. You would get near-optimum speedup only by splitting the
> >> > > problem into two.
> >> > >
> >> > > The heuristic for sequential threshold is:
> >> > >
> >> > > Threshold = N / (LoadBalanceFactor * NumCore)
> >> > >
> >> > > where N is the "size" of the problem.
> >> > >
> >> > > The LoadBalanceFactor in an ideal world would be 1, but real
> >> > > tasks encounter delays and execute differently, so a value > 1 is
> >> > > needed. Heuristically, a value of 8 is a good starting value.
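> >> > >
> >> > > (Worked through for the case in this thread: N = 2,000,000 on 2
> >> > > cores gives Threshold = 2,000,000 / (8 * 2) = 125,000, i.e. ~125K.)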
> >> > >
> >> > > David Holmes
> >> > >
> >> > >> -----Original Message-----
> >> > >> From: concurrency-interest-bounces at cs.oswego.edu
> >> > >> [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
> >> > >> thurston at nomagicsoftware.com
> >> > >> Sent: Tuesday, 12 February 2013 6:52 AM
> >> > >> To: concurrency-interest at cs.oswego.edu
> >> > >> Subject: [concurrency-interest] Some interesting (confusing?)
> >> > >> benchmark
> >> > >> results
> >> > >>
> >> > >>
> >> > >> Hello,
> >> > >>
> >> > >> I made some initial attempts at using the ForkJoin framework for
> >> > >> some rather obvious recursively parallel use-cases, namely summing
> >> > >> the elements of an int[] and also sorting an int[] (randomly
> >> > >> filled).
> >> > >>
> >> > >> The problem-sizes I used ranged from 1M to 10M elements.
> >> > >>
> >> > >> I really enjoy using the framework and find it relatively easy
> >> > >> to reason about my programs (if not necessarily the internals of
> >> > >> the framework).  But I was disappointed with the results: in both
> >> > >> cases they were slower than the corresponding Java serial
> >> > >> implementation.
> >> > >>
> >> > >> I completely understand the principle of YMMV, and I wasn't
> >> > >> expecting a 2x speedup, but, especially in the case of the sort,
> >> > >> I was expecting that ForkJoin would do at least a little better
> >> > >> than the single-threaded version.  I guess what I'm asking is:
> >> > >> are these results surprising?  Does it mean I'm doing something
> >> > >> wrong?
> >> > >>
> >> > >> I'll give a brief description of my implementation:
> >> > >> single-threaded:  Arrays.sort(int[])
> >> > >>
> >> > >> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250 (arrived
> >> > >> at by trial-and-error)
> >> > >>
> >> > >> int[] compute()
> >> > >> {
> >> > >>     if (array.length < THRESHOLD)
> >> > >>         return insertionSort(array);      // small enough: sort serially
> >> > >>     MergeSort left  = new MergeSort(array /* first half */);
> >> > >>     MergeSort right = new MergeSort(array /* other half */);
> >> > >>     left.fork();                          // make the left half stealable
> >> > >>     return merge(right.compute(),         // sort the right half here,
> >> > >>                  left.join());            // then wait for the left
> >> > >> }
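> >> > >>
> >> > >> (Assuming MergeSort extends RecursiveTask<int[]>, the whole sort is
> >> > >> kicked off along the lines of:
> >> > >>
> >> > >>     int[] sorted = new ForkJoinPool(2).invoke(new MergeSort(data));
> >> > >> )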
> >> > >>
> >> > >> The insertionSort() and merge() methods are just standard
> >> > >> implementations; there is a bit of an apples-to-oranges comparison
> >> > >> since Arrays.sort() uses an optimized quicksort, but we're still
> >> > >> talking O(n log n)
> >> > >>
> >> > >> Just ran it on my laptop:
> >> > >> Windows 7 64-bit
> >> > >> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
> >> > >> Core 2 Duo==> 2 cores 2GHz
> >> > >>
> >> > >> 2M int[]:
> >> > >> single-threaded: ~330ms
> >> > >> ForkJoin (2 workers): ~390ms
> >> > >>
> >> > >> Would appreciate any feedback and hopefully the #s are somewhat
> >> > >> helpful (and please no scoffing at my antiquated machine)
> >> > >>
> >> > >> -T
> >> > >>
> >> > >>
> >> >
> >>
> >> --
> > 
> > VIKTOR KLANG
> > _Director of Engineering_
> >
> >  Typesafe - The software stack for applications that scale
> >  http://www.typesafe.com/
> >  Twitter: @viktorklang
> 
> 




More information about the Concurrency-interest mailing list