[concurrency-interest] Some interesting (confusing?) benchmark results

David Holmes davidcholmes at aapt.net.au
Tue Feb 12 00:53:00 EST 2013


Small arrays are handled quite differently from large arrays in the VM.

There is something happening here that you don't understand.

Is your full code available?

David

> -----Original Message-----
> From: thurston at nomagicsoftware.com [mailto:thurston at nomagicsoftware.com]
> Sent: Tuesday, 12 February 2013 3:45 PM
> To: dholmes at ieee.org
> Cc: √iktor Ҡlang; concurrency-interest
> Subject: RE: [concurrency-interest] Some interesting (confusing?)
> benchmark results
> 
> 
> The #s are the ultimate arbiter, aren't they?  I can't imagine that the
> 3-task solution copies the array more often than the 16K solution, since
> each level of the DAG results in a duplicate of the original array (of
> course in smaller and smaller chunks).
> 
> 16K tasks:  ~385ms
> 3 tasks:    ~550ms
> 
> On 2013-02-11 21:26, David Holmes wrote:
> >> This is in line with my expectation; what amount of work-stealing
> >> could happen with two tasks (each merge-sorting a 1M item array)?
> >> With 16K tasks, the work-stealing can really strut its stuff (and at
> >> least in my case, can overwhelm the overhead associated with
> >> scheduling the tasks)
> >
> > I think you are seeing effects that are not due to parallelisation of
> > the algorithm as such. 16K tasks for two threads is just ridiculous
> > task creation and management overhead. If you see a speed up compared
> > to three tasks then you are either not comparing apples to apples, or
> > you are affected by other things - such as memory management when
> > dealing with large arrays.
> >
> > David
> > -----
> >
> >
> >> -----Original Message-----
> >> From: thurston at nomagicsoftware.com 
> >> [mailto:thurston at nomagicsoftware.com]
> >> Sent: Tuesday, 12 February 2013 3:14 PM
> >> To: √iktor Ҡlang
> >> Cc: dholmes at ieee.org; concurrency-interest
> >> Subject: Re: [concurrency-interest] Some interesting (confusing?)
> >> benchmark results
> >>
> >>
> >>
> >>
> >> On 2013-02-11 14:24, √iktor Ҡlang wrote:
> >> > On Mon, Feb 11, 2013 at 11:18 PM, David Holmes
> >> > <davidcholmes at aapt.net.au> wrote:
> >> >
> >> >> That doesn't make sense to me. If you had a threshold of 250 then
> >> >> you are creating thousands of tasks, and that overhead should
> >> >> potentially bring your laptop to its knees. With a 250K threshold
> >> >> you have far fewer tasks.
> >> >
> >> > I'm not sure it makes sense to create more sort-tasks than there
> >> > are threads in the pool.
> >>
> >> Really, Viktor?  That surprises me coming from Akka.  Does that mean
> >> you guys only allocate as many message-processing "workers" as there
> >> are hardware threads?
> >>
> >> To be clear, with the 2M-size array data and the threshold of a
> >> 250-element array for the (serial) insertion sort, 16K+ sort tasks
> >> are constructed (as D. Holmes mentioned): 2,000,000 / 250 gives
> >> roughly 8,000 leaf tasks, and a binary task tree with that many
> >> leaves has about twice that many tasks in total.  This is
> >> significantly *faster* than doing what either of you suggested.  If
> >> I simply split the problem in two, run a serial mergesort on each
> >> half and then merge the two halves (that's three total tasks), it's
> >> over 40% slower.
> >>
> >> Let me repeat that.  16K+ tasks is much, much faster than 3 tasks.
> >>
> >> This is in line with my expectation; what amount of work-stealing
> >> could happen with two tasks (each merge-sorting a 1M item array)?
> >> With 16K tasks, the work-stealing can really strut its stuff (and at
> >> least in my case, can overwhelm the overhead associated with
> >> scheduling the tasks).
> >>
> >> If you only create as many tasks as there are hardware threads, that
> >> seems a lot like a standard thread-pool implementation.  (Of course,
> >> if there are many more cores, the opportunity for work-stealing
> >> would increase even with the 1:1 ratio.)
> >>
> >> When I first used fork/join, that's the approach I took (see the
> >> sketch below):
> >> 1.  split the problem into (hardware threads * 1 or 2) tasks
> >> 2.  have each task do its work serially (i.e. don't fork)
> >> 3.  combine the (sub)totals/results
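> >>
> >> A minimal sketch of that manual approach (summing an int[] rather
> >> than sorting, and with hypothetical names, not my actual code):
> >>
> >> import java.util.ArrayList;
> >> import java.util.List;
> >> import java.util.concurrent.Callable;
> >> import java.util.concurrent.ExecutorService;
> >> import java.util.concurrent.Executors;
> >> import java.util.concurrent.Future;
> >>
> >> static long parallelSum(final int[] a) throws Exception {
> >>     // 1. split the problem: one task per hardware thread
> >>     int nTasks = Runtime.getRuntime().availableProcessors();
> >>     final int chunk = (a.length + nTasks - 1) / nTasks;
> >>     ExecutorService pool = Executors.newFixedThreadPool(nTasks);
> >>     List<Future<Long>> parts = new ArrayList<Future<Long>>();
> >>     for (int t = 0; t < nTasks; t++) {
> >>         final int from = t * chunk;
> >>         final int to = Math.min(from + chunk, a.length);
> >>         parts.add(pool.submit(new Callable<Long>() {
> >>             public Long call() {   // 2. each task works serially
> >>                 long sum = 0;
> >>                 for (int i = from; i < to; i++) sum += a[i];
> >>                 return sum;
> >>             }
> >>         }));
> >>     }
> >>     long total = 0;
> >>     for (Future<Long> f : parts) total += f.get();  // 3. combine sub-totals
> >>     pool.shutdown();
> >>     return total;
> >> }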
> >>
> >> Not only is that less appealing from a programming point of view, it
> >> performs slower; that's the promise, nay the very purpose, of the
> >> fork/join framework, at least when it comes to implementing
> >> naturally recursive algorithms.
> >>
> >> -T
> >>
> >>
> >> >
> >> >> That said, I don't know what the memory requirements are for your
> >> >> algorithm.  If you are allocating temporary arrays then a large
> >> >> threshold could well cause a problem.
> >> >>
> >> >> The parallel array sort going into JDK 8 is a merge sort but uses
> >> >> a working array equal in size to the original array to be sorted.
> >> >
> >> > No in-place merge sort? ;-)
> >> >
> >> >
> >> >> > -----Original Message-----
> >> >> > From: concurrency-interest-bounces at cs.oswego.edu
> >> >> > [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
> >> >> > thurston at nomagicsoftware.com
> >> >> > Sent: Tuesday, 12 February 2013 7:49 AM
> >> >> > To: concurrency-interest at cs.oswego.edu
> >> >> > Subject: Re: [concurrency-interest] Some interesting (confusing?)
> >> >> > benchmark results
> >> >> >
> >> >> >
> >> >> > Well, it's the algorithm that's dispositive in this case.
> >> >> > Insertion-sort is empirically better than mergesort for arrays
> >> >> > of up to around 25 elements.  That's where I started; I played
> >> >> > around with different thresholds and 250 seemed to be the best.
> >> >> > With anything larger than 1000 you start to see significant
> >> >> > degradation.  I tried your formula (approximately) with
> >> >> > N = 2 000 000 ==> 250K.
> >> >> >
> >> >> > That brought my laptop to its knees (as expected): over 33K ms,
> >> >> > as opposed to 390ms with 250.
> >> >> >
> >> >> > Following my own admonition (the algorithm matters), it occurs
> >> >> > to me that I should just use my own serial mergesort as the
> >> >> > comparison for the ForkJoin version (as opposed to Arrays.sort).
> >> >> > I'll let y'all know the results of that.
> >> >> >
> >> >> > -T
> >> >> >
> >> >> >
> >> >> >
> >> >> > On 2013-02-11 13:19, David Holmes wrote:
> >> >> > > Your sequential threshold is far too small for only two
> >> >> > > worker threads. You would get near optimum speedup only
> >> >> > > splitting the problem into two.
> >> >> > >
> >> >> > > The heuristic for the sequential threshold is:
> >> >> > >
> >> >> > >     Threshold = N / (LoadBalanceFactor * NumCore)
> >> >> > >
> >> >> > > where N is the "size" of the problem.
> >> >> > >
> >> >> > > The LoadBalanceFactor in an ideal world would be 1, but real
> >> >> > > tasks encounter delays and execute differently, so a value > 1
> >> >> > > is needed.  Heuristically, a value of 8 is a good starting
> >> >> > > value.
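> >> >> > >
> >> >> > > For example, with N = 2,000,000, LoadBalanceFactor = 8 and
> >> >> > > NumCore = 2:
> >> >> > >
> >> >> > >     Threshold = 2,000,000 / (8 * 2) = 125,000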
> >> >> > >
> >> >> > > David Holmes
> >> >> > >
> >> >> > >> -----Original Message-----
> >> >> > >> From: concurrency-interest-bounces at cs.oswego.edu
> >> >> > >> [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
> >> >> > >> thurston at nomagicsoftware.com
> >> >> > >> Sent: Tuesday, 12 February 2013 6:52 AM
> >> >> > >> To: concurrency-interest at cs.oswego.edu
> >> >> > >> Subject: [concurrency-interest] Some interesting (confusing?)
> >> >> > >> benchmark results
> >> >> > >>
> >> >> > >>
> >> >> > >> Hello,
> >> >> > >>
> >> >> > >> I made some initial attempts at using the ForkJoin framework
> >> >> > >> for some rather obvious recursively parallel use-cases,
> >> >> > >> namely summing the elements of an int[] and also sorting an
> >> >> > >> int[] (randomly filled).
> >> >> > >>
> >> >> > >> The problem-sizes I used ranged from 1M - 10M.
> >> >> > >>
> >> >> > >> I really enjoy using the framework and find it relatively
> >> >> > >> easy to reason about my programs (if not necessarily the
> >> >> > >> internals of the framework).  But I was disappointed with
> >> >> > >> the results: in both cases they were slower than the
> >> >> > >> corresponding serial Java implementation.
> >> >> > >>
> >> >> > >> I completely understand the principle of YMMV, and I wasn't
> >> >> > >> expecting a 2x speedup, but, especially in the case of the
> >> >> > >> sort, I was expecting that ForkJoin would do at least a
> >> >> > >> little better than the single-threaded version.  I guess
> >> >> > >> what I'm asking is: are these results surprising?  Does it
> >> >> > >> mean I'm doing something wrong?
> >> >> > >>
> >> >> > >> I'll give a brief description of my implementation:
> >> >> > >> single-threaded:  Arrays.sort(int[])
> >> >> > >>
> >> >> > >> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250
> >> >> > >> (arrived at by trial-and-error)
> >> >> > >>
> >> >> > >> protected int[] compute()
> >> >> > >> {
> >> >> > >>     if (array.length < THRESHOLD)
> >> >> > >>         return insertionSort(array);
> >> >> > >>     MergeSort left  = new MergeSort(array /* first half */);
> >> >> > >>     MergeSort right = new MergeSort(array /* other half */);
> >> >> > >>     left.fork();   // schedule the left half; without this, left.join() has nothing to wait for
> >> >> > >>     return merge(right.compute(), left.join());
> >> >> > >> }
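> >> >> > >>
> >> >> > >> That's pseudocode, of course; assuming MergeSort extends
> >> >> > >> RecursiveTask<int[]>, the whole sort is kicked off along
> >> >> > >> these lines:
> >> >> > >>
> >> >> > >> ForkJoinPool pool = new ForkJoinPool();  // defaults to one worker per core
> >> >> > >> int[] sorted = pool.invoke(new MergeSort(input));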
> >> >> > >>
> >> >> > >> The insertionSort() and merge() methods are just standard
> >> >> > >> implementations; there is a bit of an apples-to-oranges
> >> >> > >> comparison, since Arrays.sort() uses an optimized quicksort,
> >> >> > >> but we're still talking O(n log(n)).
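> >> >> > >>
> >> >> > >> For reference, a textbook merge of two sorted int[] halves
> >> >> > >> (a sketch, not necessarily my exact code) looks like:
> >> >> > >>
> >> >> > >> static int[] merge(int[] a, int[] b)
> >> >> > >> {
> >> >> > >>     int[] out = new int[a.length + b.length];
> >> >> > >>     int i = 0, j = 0, k = 0;
> >> >> > >>     while (i < a.length && j < b.length)       // take the smaller head
> >> >> > >>         out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
> >> >> > >>     while (i < a.length) out[k++] = a[i++];    // drain whichever side
> >> >> > >>     while (j < b.length) out[k++] = b[j++];    // still has elements
> >> >> > >>     return out;
> >> >> > >> }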
> >> >> > >>
> >> >> > >> Just ran it on my laptop:
> >> >> > >> Windows 7 64-bit
> >> >> > >> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
> >> >> > >> Core 2 Duo==> 2 cores 2GHz
> >> >> > >>
> >> >> > >> 2M int[]:
> >> >> > >> single-threaded: ~330ms
> >> >> > >> ForkJoin (2 workers): ~390ms
> >> >> > >>
> >> >> > >> Would appreciate any feedback; hopefully the #s are somewhat
> >> >> > >> helpful (and please no scoffing at my antiquated machine).
> >> >> > >>
> >> >> > >> -T
> >> >> > >>
> >> >> > >>
> >> >> > >> _______________________________________________
> >> >> > >> Concurrency-interest mailing list
> >> >> > >> Concurrency-interest at cs.oswego.edu
> >> >> > >> http://cs.oswego.edu/mailman/listinfo/concurrency-interest [1]
> >> >> > >>
> >> >> >
> >> >> > _______________________________________________
> >> >> > Concurrency-interest mailing list
> >> >> > Concurrency-interest at cs.oswego.edu
> >> >> > http://cs.oswego.edu/mailman/listinfo/concurrency-interest [1]
> >> >> >
> >> >>
> >> >> _______________________________________________
> >> >> Concurrency-interest mailing list
> >> >> Concurrency-interest at cs.oswego.edu
> >> >> http://cs.oswego.edu/mailman/listinfo/concurrency-interest [1]
> >> >>
> >> >> --
> >> >
> >> >
> >> > VIKTOR KLANG
> >> >  _Director of Engineering_
> >> >
> >> >  Typesafe [2] - The software stack for applications that scale
> >> >  Twitter: @viktorklang
> >> >
> >> > Links:
> >> > ------
> >> > [1] http://cs.oswego.edu/mailman/listinfo/concurrency-interest
> >> > [2] http://www.typesafe.com/
> >>
> >>
> 
> 



