[concurrency-interest] Some interesting (confusing?) benchmark results

thurston at nomagicsoftware.com thurston at nomagicsoftware.com
Tue Feb 12 01:01:49 EST 2013


Unfortunately, not at the moment.
I absolutely will do that in the next day or so.
I'd be interested in others running the code and comparing results -- 
some people will only be convinced when they run the code themselves 
(Hmmm, I'm like that)

On 2013-02-11 21:53, David Holmes wrote:
> Small arrays are handled quite differently in the VM to large arrays.
>
> There is something happening here that you don't understand.
>
> Is your full code available?
>
> David
>
>> -----Original Message-----
>> From: thurston at nomagicsoftware.com 
>> [mailto:thurston at nomagicsoftware.com]
>> Sent: Tuesday, 12 February 2013 3:45 PM
>> To: dholmes at ieee.org
>> Cc: Viktor Klang; concurrency-interest
>> Subject: RE: [concurrency-interest] Some interesting (confusing?)
>> benchmark results
>>
>>
>> The #s are the ultimate arbiter, aren't they?  I can't imagine that the
>> 3-task solution copies the array more often than the 16K solution, since
>> each level of the DAG results in a duplicate of the original array (of
>> course in smaller and smaller chunks).
>>
>> 16K:  ~385ms
>> 3:    ~550ms
>>
>> On 2013-02-11 21:26, David Holmes wrote:
>> >> This is in line with my expectation; what amount of work-stealing
>> >> could happen with two tasks (each merge-sorting a 1M item array)?
>> >> With 16K tasks, the work-stealing can really strut its stuff (and at
>> >> least in my case, can overwhelm the overhead associated with
>> >> scheduling the tasks)
>> >
>> > I think you are seeing effects that are not due to parallelisation of
>> > the algorithm as such. 16K tasks for two threads is just ridiculous
>> > task creation and management overhead. If you see a speed-up compared
>> > to three tasks then you are either not comparing apples to apples, or
>> > you are affected by other things - such as memory management when
>> > dealing with large arrays.
>> >
>> > David
>> > -----
>> >
>> >
>> >> -----Original Message-----
>> >> From: thurston at nomagicsoftware.com
>> >> [mailto:thurston at nomagicsoftware.com]
>> >> Sent: Tuesday, 12 February 2013 3:14 PM
>> >> To: Viktor Klang
>> >> Cc: dholmes at ieee.org; concurrency-interest
>> >> Subject: Re: [concurrency-interest] Some interesting (confusing?)
>> >> benchmark results
>> >>
>> >>
>> >>
>> >>
>> >> On 2013-02-11 14:24, √iktor Ҡlang wrote:
>> >> > On Mon, Feb 11, 2013 at 11:18 PM, David Holmes
>> >> > <davidcholmes at aapt.net.au> wrote:
>> >> >
>> >> >> That doesn't make sense to me. If you had a threshold of 250 then
>> >> >> you are creating thousands of tasks and that overhead should
>> >> >> potentially bring your laptop to its knees. With a 250K threshold
>> >> >> you have far fewer tasks.
>> >> >
>> >> > I'm not sure it makes sense to create more sort-tasks than there
>> >> > are Threads in the pool.
>> >>
>> >> Really, Viktor?  That surprises me coming from Akka.  Does that mean
>> >> you guys only allocate as many message-processing "workers" as there
>> >> are hardware threads?
>> >>
>> >> To be clear, with the 2M-element array and the threshold of 250
>> >> elements for switching to the (serial) insertion sort, 16K+ sort
>> >> tasks are constructed (as D. Holmes mentioned).  This is
>> >> significantly *faster* than doing what either of you suggested.  If
>> >> I simply split the problem in two, run a serial mergesort on each
>> >> half, and then merge the two halves (that's three total tasks), it's
>> >> over 50% slower.
>> >>
>> >> Let me repeat that.  16K+ tasks is much, much faster than 3 tasks.
>> >>
>> >> This is in line with my expectation; what amount of work-stealing
>> >> could happen with two tasks (each merge-sorting a 1M item array)?
>> >> With 16K tasks, the work-stealing can really strut its stuff (and at
>> >> least in my case, can overwhelm the overhead associated with
>> >> scheduling the tasks)
>> >>
>> >> If you only create as many tasks as there are hardware threads, that
>> >> seems a lot like a standard thread-pool implementation.  (Of course,
>> >> if there are many more cores, the opportunity for work-stealing
>> >> would increase even with the 1-1 ratio.)
>> >>
>> >> When I first used fork/join, that's the approach I took:
>> >> 1.  split the problem into (hardware threads * (1 or 2)) chunks
>> >> 2.  have each task do its work serially (i.e. don't fork)
>> >> 3.  combine the (sub)totals/results
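[Editor's note: the three steps above can be sketched as a small ForkJoin program. This is a hypothetical illustration, not the poster's actual code; all names (ManualSplitSum, slices) are invented, and the example uses the summing use-case mentioned later in the thread.]

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch of the "one task per hardware thread" approach described above:
// split the array into a fixed number of slices, sum each slice serially
// (no further forking), then combine the partial sums.
class ManualSplitSum extends RecursiveTask<Long> {
    private final int[] data;
    private final int from, to, slices;

    ManualSplitSum(int[] data, int from, int to, int slices) {
        this.data = data; this.from = from; this.to = to; this.slices = slices;
    }

    @Override
    protected Long compute() {
        if (slices <= 1) {                 // leaf: work serially, don't fork
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        ManualSplitSum left  = new ManualSplitSum(data, from, mid, slices / 2);
        ManualSplitSum right = new ManualSplitSum(data, mid, to, slices - slices / 2);
        left.fork();                       // run the left slice asynchronously
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        int[] data = new int[2_000_000];
        Arrays.fill(data, 1);
        int workers = Runtime.getRuntime().availableProcessors();
        long total = new ForkJoinPool(workers)
                .invoke(new ManualSplitSum(data, 0, data.length, workers));
        System.out.println(total); // 2000000
    }
}
```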
>> >>
>> >> Not only is that less appealing from a programming point of view, it
>> >> performs slower; that's the promise, nay the very purpose, of the
>> >> fork/join framework, at least when it comes to implementing naturally
>> >> recursive algorithms.
>> >>
>> >> -T
>> >>
>> >>
>> >> >
>> >> >> That said, I don't know what the memory requirements are for your
>> >> >> algorithm.  If you are allocating temporary arrays then a large
>> >> >> threshold could well cause a problem.
>> >> >>
>> >> >> The parallel array sort going into JDK 8 is a merge sort but uses
>> >> >> a working array equal in size to the original array to be sorted.
>> >> >
>> >> > No in-place merge sort? ;-)
>> >> >
>> >> >
>> >> >> > -----Original Message-----
>> >> >> > From: concurrency-interest-bounces at cs.oswego.edu
>> >> >> > [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
>> >> >> > thurston at nomagicsoftware.com
>> >> >> > Sent: Tuesday, 12 February 2013 7:49 AM
>> >> >> > To: concurrency-interest at cs.oswego.edu
>> >> >> > Subject: Re: [concurrency-interest] Some interesting (confusing?)
>> >> >> > benchmark results
>> >> >> >
>> >> >> >
>> >> >> > Well, it's the algorithm that's dispositive in this case.
>> >> >> > Insertion-sort is empirically better than mergesort up to
>> >> >> > around 25-element arrays.  That's where I started, and I played
>> >> >> > around with different thresholds; 250 seemed to be the best.
>> >> >> > Anything larger than 1000 and you start to see significant
>> >> >> > degradation.  I tried your formula (approximately) with
>> >> >> > N = 2 000 000 ==> 250K.
>> >> >> >
>> >> >> > That brought my laptop to its knees (as expected): over 33K ms
>> >> >> > as opposed to 390ms with 250.
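[Editor's note: for reference, the kind of serial insertion sort typically used below such a threshold looks like the following. This is a generic sketch, not the poster's actual implementation.]

```java
// Generic insertion sort over a sub-range [from, to); the sort that
// merge-sort implementations commonly switch to below a small threshold.
class InsertionSort {
    static void insertionSort(int[] a, int from, int to) {
        for (int i = from + 1; i < to; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= from && a[j] > key) {
                a[j + 1] = a[j];   // shift larger elements one slot right
                j--;
            }
            a[j + 1] = key;        // drop the key into its sorted position
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 2, 9, 1, 3};
        insertionSort(data, 0, data.length);
        System.out.println(java.util.Arrays.toString(data)); // [1, 2, 3, 5, 9]
    }
}
```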
>> >> >> >
>> >> >> > Following my own admonition (the algorithm matters), it occurs
>> >> >> > to me that I should just use my own serial mergesort as the
>> >> >> > comparison for the ForkJoin version (as opposed to
>> >> >> > Arrays.sort).  I'll let y'all know the results of that.
>> >> >> >
>> >> >> > -T
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On 2013-02-11 13:19, David Holmes wrote:
>> >> >> > > Your sequential threshold is far too small for only two
>> >> >> > > worker threads. You would get near-optimum speedup only
>> >> >> > > splitting the problem into two.
>> >> >> > >
>> >> >> > > The heuristic for the sequential threshold is:
>> >> >> > >
>> >> >> > > Threshold = N / (LoadBalanceFactor * NumCores)
>> >> >> > >
>> >> >> > > where N is the "size" of the problem.
>> >> >> > >
>> >> >> > > The LoadBalanceFactor in an ideal world would be 1, but real
>> >> >> > > tasks encounter delays and execute differently, so a value > 1
>> >> >> > > is needed.  Heuristically, a value of 8 is a good starting
>> >> >> > > value.
>> >> >> > >
>> >> >> > > David Holmes
>> >> >> > >
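[Editor's note: the heuristic quoted above is straightforward to express in code. The helper below is a hypothetical illustration; the name ThresholdHeuristic is invented, and the example values match a 2M-element problem on the 2-core laptop discussed in this thread.]

```java
// David Holmes's sequential-threshold heuristic from the message above:
//   Threshold = N / (LoadBalanceFactor * NumCores)
public class ThresholdHeuristic {
    static int threshold(int problemSize, int loadBalanceFactor, int numCores) {
        return problemSize / (loadBalanceFactor * numCores);
    }

    public static void main(String[] args) {
        // N = 2,000,000 on 2 cores with the suggested starting factor of 8
        System.out.println(threshold(2_000_000, 8, 2)); // 125000
    }
}
```

With a factor of 1 (the "ideal world" value) the same formula gives 1,000,000, i.e. exactly one task per core; larger factors produce more, smaller tasks to give the work-stealing scheduler room to balance load.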
>> >> >> > >> -----Original Message-----
>> >> >> > >> From: concurrency-interest-bounces at cs.oswego.edu
>> >> >> > >> [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf
>> >> >> > >> Of thurston at nomagicsoftware.com
>> >> >> > >> Sent: Tuesday, 12 February 2013 6:52 AM
>> >> >> > >> To: concurrency-interest at cs.oswego.edu
>> >> >> > >> Subject: [concurrency-interest] Some interesting (confusing?)
>> >> >> > >> benchmark results
>> >> >> > >>
>> >> >> > >>
>> >> >> > >> Hello,
>> >> >> > >>
>> >> >> > >> I made some initial attempts at using the ForkJoin framework
>> >> >> > >> for some rather obvious recursively parallel use-cases,
>> >> >> > >> namely summing the elements of an int[] and also sorting an
>> >> >> > >> int[] (randomly filled).
>> >> >> > >>
>> >> >> > >> The problem-sizes I used ranged from 1M - 10M.
>> >> >> > >>
>> >> >> > >> I really enjoy using the framework and find it relatively
>> >> >> > >> easy to reason about my programs (if not necessarily the
>> >> >> > >> internals of the framework).
>> >> >> > >> But I was disappointed with the results: in both cases they
>> >> >> > >> were slower than the corresponding serial Java
>> >> >> > >> implementation.
>> >> >> > >>
>> >> >> > >> I completely understand the principle of YMMV, and I wasn't
>> >> >> > >> expecting a 2x speedup, especially in the case of the sort,
>> >> >> > >> but I was expecting that ForkJoin would do at least a little
>> >> >> > >> better than the single-threaded version.  I guess what I'm
>> >> >> > >> asking is: are these results surprising?  Does it mean I'm
>> >> >> > >> doing something wrong?
>> >> >> > >>
>> >> >> > >> I'll give a brief description of my implementation:
>> >> >> > >> single-threaded:  Arrays.sort(int[])
>> >> >> > >>
>> >> >> > >> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250
>> >> >> > >> (arrived at by trial-and-error)
>> >> >> > >>
>> >> >> > >> int[] compute()
>> >> >> > >> {
>> >> >> > >>     if (array.length < THRESHOLD)
>> >> >> > >>         return insertionSort(array);
>> >> >> > >>     left = new MergeSort(array /* split in half */);
>> >> >> > >>     right = new MergeSort(array /* other half */);
>> >> >> > >>     left.fork();
>> >> >> > >>     return merge(right.compute(), left.join());
>> >> >> > >> }
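[Editor's note: the pseudocode above can be fleshed out into a self-contained runnable sketch. The class and helper names (MergeSortTask, merge) are illustrative assumptions, not the poster's actual code, and Arrays.sort stands in for the serial insertionSort() below the threshold.]

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join merge sort following the structure of the pseudocode above:
// fork the left half, compute the right half in the current worker,
// then merge the two sorted results.
class MergeSortTask extends RecursiveTask<int[]> {
    static final int THRESHOLD = 250;
    private final int[] array;

    MergeSortTask(int[] array) { this.array = array; }

    @Override
    protected int[] compute() {
        if (array.length < THRESHOLD) {
            int[] copy = array.clone();
            Arrays.sort(copy);          // stands in for insertionSort()
            return copy;
        }
        int mid = array.length / 2;
        MergeSortTask left  = new MergeSortTask(Arrays.copyOfRange(array, 0, mid));
        MergeSortTask right = new MergeSortTask(Arrays.copyOfRange(array, mid, array.length));
        left.fork();                    // schedule left half for stealing
        return merge(right.compute(), left.join());
    }

    // Standard two-way merge of sorted arrays.
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    public static void main(String[] args) {
        int[] data = new java.util.Random(42).ints(1_000, 0, 10_000).toArray();
        int[] sorted = new ForkJoinPool(2).invoke(new MergeSortTask(data));
        int[] expected = data.clone();
        Arrays.sort(expected);
        System.out.println(Arrays.equals(sorted, expected)); // true
    }
}
```

Note that copyOfRange at every level is what makes each level of the DAG duplicate the array, as discussed earlier in the thread.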
>> >> >> > >>
>> >> >> > >> The insertionSort() and merge() methods are just standard
>> >> >> > >> implementations; there is a bit of an apples-to-oranges
>> >> >> > >> comparison since Arrays.sort() uses an optimized quicksort,
>> >> >> > >> but we're still talking O(n log(n)).
>> >> >> > >>
>> >> >> > >> Just ran it on my laptop:
>> >> >> > >> Windows 7 64-bit
>> >> >> > >> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
>> >> >> > >> Core 2 Duo==> 2 cores 2GHz
>> >> >> > >>
>> >> >> > >> 2M int[]:
>> >> >> > >> single-threaded: ~330ms
>> >> >> > >> ForkJoin (2 workers): ~390ms
>> >> >> > >>
>> >> >> > >> Would appreciate any feedback; hopefully the #s are somewhat
>> >> >> > >> helpful (and please no scoffing at my antiquated machine)
>> >> >> > >>
>> >> >> > >> -T
>> >> >> > >>
>> >> >> > >>
>> >> >> > >> _______________________________________________
>> >> >> > >> Concurrency-interest mailing list
>> >> >> > >> Concurrency-interest at cs.oswego.edu
>> >> >> > >> http://cs.oswego.edu/mailman/listinfo/concurrency-interest [1]
>> >> >> > >>
>> >> >> >
>> >> >>
>> >> >> --
>> >> >
>> >> >
>> >> 
>> >> > VIKTOR KLANG
>> >> >  _Director of Engineering_
>> >> >
>> >> >  Typesafe [2] - The software stack for applications that scale
>> >> >  Twitter: @viktorklang
>> >> >
>> >> > Links:
>> >> > ------
>> >> > [1] http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>> >> > [2] http://www.typesafe.com/
>> >>
>> >>
>>
>>



More information about the Concurrency-interest mailing list