[concurrency-interest] Some interesting (confusing?) benchmark results

thurston at nomagicsoftware.com thurston at nomagicsoftware.com
Tue Feb 12 00:45:08 EST 2013


The #s are the ultimate arbiter, aren't they?  I can't imagine that the 
3-task solution copies the array more often than the 16K solution, since 
each level of the DAG results in a duplicate of the original array (of 
course in smaller and smaller chunks).

16K tasks:  ~385ms
3 tasks:    ~550ms
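
For anyone who wants to see the shape of what I'm measuring, here is a
minimal sketch of the ForkJoin merge-sort task.  It is not my exact code
-- the class name, field names and helpers are made up for illustration
-- but the Arrays.copyOfRange() calls show where the per-level copies I
mentioned come from:

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class MergeSortTask extends RecursiveTask<int[]> {

    static final int THRESHOLD = 250;         // leaf size, found by trial and error
    private final int[] data;

    MergeSortTask(int[] data) { this.data = data; }

    @Override
    protected int[] compute() {
        if (data.length < THRESHOLD)
            return insertionSort(data);        // small leaves are sorted serially

        int mid = data.length / 2;
        // every level of the recursion copies its slice of the input,
        // so deeper task trees copy more data in total
        MergeSortTask left  = new MergeSortTask(Arrays.copyOfRange(data, 0, mid));
        MergeSortTask right = new MergeSortTask(Arrays.copyOfRange(data, mid, data.length));
        left.fork();                           // hand the left half to the pool
        int[] sortedRight = right.compute();   // sort the right half in this worker
        return merge(left.join(), sortedRight);
    }

    private static int[] insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int v = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
        return a;
    }

    private static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }
}

Driving it with two workers (to match the Core 2 Duo) is just
new ForkJoinPool(2).invoke(new MergeSortTask(array)); the 3-task variant
replaces the recursion with a single split and a serial mergesort on each
half.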

On 2013-02-11 21:26, David Holmes wrote:
>> This is in line with my expectation; what amount of work-stealing could
>> happen with two tasks (each merge-sorting a 1M item array)?  With 16K
>> tasks, the work-stealing can really strut its stuff (and at least in my
>> case, can overwhelm the overhead associated with scheduling the tasks)
>
> I think you are seeing effects that are not due to parallelisation of
> the algorithm as such. 16K tasks for two threads is just ridiculous
> task creation and management overhead. If you see a speed up compared
> to three tasks then you are either not comparing apples to apples, or
> you are affected by other things - such as memory management when
> dealing with large arrays.
>
> David
> -----
>
>
>> -----Original Message-----
>> From: thurston at nomagicsoftware.com 
>> [mailto:thurston at nomagicsoftware.com]
>> Sent: Tuesday, 12 February 2013 3:14 PM
>> To: viktor klang
>> Cc: dholmes at ieee.org; concurrency-interest
>> Subject: Re: [concurrency-interest] Some interesting (confusing?)
>> benchmark results
>>
>>
>>
>>
>> On 2013-02-11 14:24, √iktor Ҡlang wrote:
>> > On Mon, Feb 11, 2013 at 11:18 PM, David Holmes
>> > <davidcholmes at aapt.net.au> wrote:
>> >
>> >> That doesn't make sense to me. If you had a threshold of 250 then you
>> >> are creating thousands of tasks and that overhead should potentially
>> >> bring your laptop to its knees. With 250K threshold you have far fewer
>> >> tasks.
>> >
>> > I'm not sure it makes sense to create more sort-tasks than there are
>> > threads in the pool.
>>
>> Really, Viktor?  That surprises me coming from Akka.  Does that mean
>> you guys only allocate as many message-processing "workers" as there are
>> hardware threads?
>>
>> To be clear, with the 2M-element array and the 250-element threshold for
>> the (serial) insertion sort, 16K+ sort tasks are constructed (as D.
>> Holmes mentioned).  This is significantly *faster* than doing what
>> either of you suggested.  If I simply split the problem in two, and just
>> run a serial mergesort on each half and then merge the two halves
>> (that's three total tasks), it's over 50% slower.
>>
>> Let me repeat that.  16K+ tasks is much, much faster than 3 tasks.
>>
>> This is in line with my expectation; what amount of work-stealing could
>> happen with two tasks (each merge-sorting a 1M item array)?  With 16K
>> tasks, the work-stealing can really strut its stuff (and at least in my
>> case, can overwhelm the overhead associated with scheduling the tasks)
>>
>> If you only create as many tasks as there are hardware threads -- that
>> seems a lot like a standard thread-pool implementation.  (Of course, if
>> there are many more cores, the opportunity for work-stealing would
>> increase even with the 1-1 ratio.)
>>
>> When I first used fork/join, that's the approach I took (roughly the
>> sketch below):
>> 1.  split the problem into (hardware threads * 1 or 2) chunks
>> 2.  have each task do its work serially (i.e. don't fork)
>> 3.  combine the (sub)totals/results
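>>
>> A sketch of that style, using the summing use-case since its combine
>> step is the simplest (illustrative only, not my actual code; the method
>> name, the fixed thread pool and the chunking arithmetic are all just
>> assumptions for the example):
>>
>> import java.util.ArrayList;
>> import java.util.List;
>> import java.util.concurrent.Callable;
>> import java.util.concurrent.ExecutorService;
>> import java.util.concurrent.Executors;
>> import java.util.concurrent.Future;
>>
>> static long sumByCores(final int[] array) throws Exception {
>>     // step 1: one chunk per hardware thread (or two per thread)
>>     int chunks = Runtime.getRuntime().availableProcessors();
>>     ExecutorService pool = Executors.newFixedThreadPool(chunks);
>>     int chunkSize = array.length / chunks;
>>     List<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
>>     for (int i = 0; i < chunks; i++) {
>>         final int from = i * chunkSize;
>>         final int to = (i == chunks - 1) ? array.length : from + chunkSize;
>>         // step 2: each task sums its own slice serially; no further forking
>>         tasks.add(new Callable<Long>() {
>>             public Long call() {
>>                 long s = 0;
>>                 for (int j = from; j < to; j++) s += array[j];
>>                 return s;
>>             }
>>         });
>>     }
>>     // step 3: combine the sub-totals
>>     long total = 0;
>>     for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();
>>     pool.shutdown();
>>     return total;
>> }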
>>
>> Not only is that less appealing from a programming point of view, it
>> also performs worse; that's the promise, nay the very purpose, of the
>> fork/join framework, at least when it comes to implementing naturally
>> recursive algorithms.
>>
>> -T
>>
>>
>> >
>> >> That said I don't know what the memory requirements are for your
>> >> algorithm.  If you are allocating temporary arrays then a large
>> >> threshold could well cause a problem.
>> >>
>> >> The parallel array sort going into JDK 8 is a merge sort but uses a
>> >> working array equal in size to the original array to be sorted.
>> >
>> > No in-place merge sort? ;-)
>> >
>> >
>> >> > -----Original Message-----
>> >> > From: concurrency-interest-bounces at cs.oswego.edu
>> >> > [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
>> >> > thurston at nomagicsoftware.com
>> >> > Sent: Tuesday, 12 February 2013 7:49 AM
>> >> > To: concurrency-interest at cs.oswego.edu
>> >> > Subject: Re: [concurrency-interest] Some interesting (confusing?)
>> >> > benchmark results
>> >> >
>> >> >
>> >> > Well, it's the algorithm that's dispositive in this case.
>> >> > Insertion-sort is empirically better than mergesort up to arrays of
>> >> > around 25 elements.  That's where I started, and I played around
>> >> > with different thresholds, and 250 seemed to be the best.  Anything
>> >> > larger than 1000 and you start to see significant degradation.
>> >> > I tried your formula (approximately) with N = 2 000 000 ==> 250K
>> >> >
>> >> > That brought my laptop to its knees (as expected): over 33K ms as
>> >> > opposed to 390ms with 250.
>> >> >
>> >> > Following my own admonition (the algorithm matters), it occurs to me
>> >> > that I should just use my own serial mergesort as the comparison for
>> >> > the ForkJoin version (as opposed to Arrays.sort).  I'll let y'all
>> >> > know the results of that.
>> >> >
>> >> > -T
>> >> >
>> >> >
>> >> >
>> >> > On 2013-02-11 13:19, David Holmes wrote:
>> >> > > Your sequential threshold is far too small for only two worker
>> >> > > threads. You would get near optimum speedup only splitting the
>> >> > > problem into two.
>> >> > >
>> >> > > The heuristic for sequential threshold is:
>> >> > >
>> >> > > Threshold = N / (LoadBalanceFactor * NumCore)
>> >> > >
>> >> > > where N is the "size" of the problem.
>> >> > >
>> >> > > The LoadBalanceFactor in an ideal world would be 1 but real tasks
>> >> > > encounter delays and execute differently so a value > 1 is needed.
>> >> > > Heuristically a value of 8 is a good starting value.
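>> >> > > (With N = 2,000,000 and two cores, that works out to
>> >> > > 2,000,000 / (8 * 2) = 125,000 as a first-cut sequential
>> >> > > threshold.)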
>> >> > >
>> >> > > David Holmes
>> >> > >
>> >> > >> -----Original Message-----
>> >> > >> From: concurrency-interest-bounces at cs.oswego.edu
>> >> > >> [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
>> >> > >> thurston at nomagicsoftware.com
>> >> > >> Sent: Tuesday, 12 February 2013 6:52 AM
>> >> > >> To: concurrency-interest at cs.oswego.edu
>> >> > >> Subject: [concurrency-interest] Some interesting (confusing?)
>> >> > >> benchmark results
>> >> > >>
>> >> > >>
>> >> > >> Hello,
>> >> > >>
>> >> > >> I made some initial attempts at using the ForkJoin framework for
>> >> > >> some rather obvious recursively parallel use-cases, namely
>> >> > >> summing the elements of an int[] and also sorting an int[]
>> >> > >> (randomly filled).
>> >> > >>
>> >> > >> The problem-sizes I used ranged from 1M to 10M.
>> >> > >>
>> >> > >> I really enjoy using the framework and find it relatively easy to
>> >> > >> reason about my programs (if not necessarily the internals of the
>> >> > >> framework).  But I was disappointed with the results: in both
>> >> > >> cases they were slower than the corresponding Java serial
>> >> > >> implementation.
>> >> > >>
>> >> > >> I completely understand the principle of YMMV, and I wasn't
>> >> > >> expecting a 2x speedup, but, especially in the case of the sort,
>> >> > >> I was expecting that ForkJoin would do at least a little better
>> >> > >> than the single-threaded version.  I guess what I'm asking is:
>> >> > >> are these results surprising?  Does it mean I'm doing something
>> >> > >> wrong?
>> >> > >>
>> >> > >> I'll give a brief description of my implementation:
>> >> > >> single-threaded:  Arrays.sort(int[])
>> >> > >>
>> >> > >> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250 (arrived
>> >> > >> at by trial-and-error)
>> >> > >>
>> >> > >> int[] compute()
>> >> > >> {
>> >> > >>     if (int[].length < THRESHOLD)
>> >> > >>         return insertionSort(int[])
>> >> > >>     left = new MergeSort(int[] /* first half */)
>> >> > >>     right = new MergeSort(int[] /* second half */)
>> >> > >>     left.fork()    // schedule the left half; without this, left.join() never completes
>> >> > >>     return merge(right.compute(), left.join())
>> >> > >> }
>> >> > >>
>> >> > >> The insertionSort() and merge() methods are just standard
>> >> > >> implementations; there is a bit of an apples-to-oranges
>> >> > >> comparison since Arrays.sort() uses an optimized quicksort, but
>> >> > >> we're still talking O(n log n).
>> >> > >>
>> >> > >> Just ran it on my laptop:
>> >> > >> Windows 7 64-bit
>> >> > >> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
>> >> > >> Core 2 Duo ==> 2 cores, 2GHz
>> >> > >>
>> >> > >> 2M int[]:
>> >> > >> single-threaded: ~330ms
>> >> > >> ForkJoin (2 workers): ~390ms
>> >> > >>
>> >> > >> Would appreciate any feedback and hopefully the #s are somewhat
>> >> > >> helpful (and please no scoffing at my antiquated machine)
>> >> > >>
>> >> > >> -T
>> >> > >>
>> >> > >>
>> >> --
>> >
>> >
>> > VIKTOR KLANG
>> >  _Director of Engineering_
>> >
>> >  Typesafe [2] - The software stack for applications that scale
>> >  Twitter: @viktorklang
>> >
>> > Links:
>> > ------
>> > [1] http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>> > [2] http://www.typesafe.com/
>>
>>


