[concurrency-interest] Some interesting (confusing?) benchmark results

thurston at nomagicsoftware.com thurston at nomagicsoftware.com
Tue Feb 12 04:40:04 EST 2013



On 2013-02-11 23:47, √iktor Ҡlang wrote:
> On Tue, Feb 12, 2013 at 6:13 AM, <thurston at nomagicsoftware.com> 
> wrote:
>
>> On 2013-02-11 14:24, √iktor Ҡlang wrote:
>>
>>> On Mon, Feb 11, 2013 at 11:18 PM, David Holmes 
>>> <davidcholmes at aapt.net.au> wrote:
>>>
>>>> That doesn't make sense to me. If you had a threshold of 250 then 
>>>> you are
>>>> creating thousands of tasks and that overhead should potentially 
>>>> bring your
>>>> laptop to its knees. With 250K threshold you have far fewer tasks.
>>>
>>> I'm not sure it makes sense to create more sort-tasks than there
>>> are Threads in the pool.
>>
>> Really, Viktor?  That surprises me coming from Akka.
>
> I have found that when I am surprised by someone, I get a much better
> understanding by asking for clarification before I assume too much.
>
>  
>
>>  Does that mean you guys only allocate as many message-processing 
>> "workers" as there are hardware threads? 
>
> What's your definition of "worker"?
Worker in the above sense means the # of Recursive(Action/Task) 
*instances* (in fork/join terms), which probably isn't applicable to 
message-passing/mailbox processing.
I guess I mean: how many dedicated threads are there that can be 
actively processing Actors' messages (i.e. not waiting at the receive), 
if that makes sense?

> Consider this:
>
> Do you agree that thread pool sizing depends on type of work? (IO
> bound vs CPU bound, bursty vs steady etc etc)
Well, yes; i.e., optimal thread-pool sizing (and of course it depends 
on a lot more than that).

> Do you agree that a JVM Thread is not a unit of parallelism?
Not sure what you mean.

> Do you agree that having more JVM Threads than hardware threads is
> bad for CPU-bound workloads?

Not necessarily.  What is true is that you want as few threads as are 
necessary to keep all the CPUs at (or near) 100%.  In the ideal case 
that would be == hardware threads (context switches are expensive), but 
OS thread scheduling is complex and opaque (I'm not pretending to be an 
expert).  I don't know what the magic formula is; you have to measure. 
But it seems plausible that an extra buffer of software threads > 
hardware threads could perform better even in a CPU-bound workload 
(does reading/writing memory count as I/O?  Technically I guess it 
does; it's hard to think of a meaningful workload that wouldn't entail 
memory access).
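To make "you have to measure" concrete, here is the kind of throwaway 
probe I have in mind -- entirely my own sketch, with a made-up SumTask 
standing in for whatever RecursiveTask you actually care about: run the 
same CPU-bound job at several pool sizes around the hardware thread 
count and compare timings.

import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical probe, not code from anyone on this thread.
public class PoolSizeProbe
{
    // Trivial CPU-bound task: recursively sum a slice of the array.
    static class SumTask extends RecursiveAction
    {
        final long[] a;
        final int lo, hi;
        long result;

        SumTask(long[] a, int lo, int hi)
        {
            this.a = a; this.lo = lo; this.hi = hi;
        }

        protected void compute()
        {
            if (hi - lo < 10000)
            {
                long s = 0;
                for (int i = lo; i < hi; i++) s += a[i];
                result = s;
                return;
            }
            int mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(a, lo, mid);
            SumTask right = new SumTask(a, mid, hi);
            left.fork();        // hand one half to the pool
            right.compute();    // keep working on the other half ourselves
            left.join();
            result = left.result + right.result;
        }
    }

    public static void main(String[] args)
    {
        long[] data = new long[2000000];
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextLong();

        int hw = Runtime.getRuntime().availableProcessors();
        for (int p = 1; p <= hw * 2; p++)
        {
            ForkJoinPool pool = new ForkJoinPool(p);
            long start = System.nanoTime();
            pool.invoke(new SumTask(data, 0, data.length));
            long ms = (System.nanoTime() - start) / 1000000;
            System.out.println(p + " threads: " + ms + "ms");
            pool.shutdown();
        }
    }
}

On my Core 2 Duo I'd expect the curve to flatten around 2; the point is 
the loop over pool sizes, not the absolute numbers.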

>  
>> To be clear, with the 2M size array data, and the threshold of 
>> 250-size array for the (serial) insertion sort, 16K+ sort tasks are 
>> constructed (as D. Holmes mentioned).  This is significantly *faster* 
>> than doing what either of you suggested.  If I simply split the 
>> problem in two, and just run a serial mergesort on each half and then 
>> merge the two halves (that's three total tasks), it performs slower.  
>> What sort of work-stealing could happen with two tasks (each 
>> merge-sorting a 1M item array)?  With 16K tasks, the work-stealing 
>> can really strut its stuff (and at least in my case, can overwhelm 
>> the overhead associated with scheduling the tasks)
>>
>> If you only create as many tasks as there are hardware threads -- 
>> that seems a lot like a standard thread-pool implementation.  (Of 
>> course, if there are many more cores, the opportunity for 
>> work-stealing would increase even with the 1-1 ratio).
>>
>> When I first used fork/join, that's the approach I took.
>> 1.  split the problem into (hardware threads * (1||2)) tasks
>> 2.  have each task do its work serially (i.e. don't fork)
>> 3.  combine the (sub)totals/results
>>
>> Not only is that less appealing from a programming point of view, it 
>> performs slower; that's the promise, nay the very purpose, of the 
>> fork/join framework, at least when it comes to implementing naturally 
>> recursive algorithms.
>>
>
> What I meant was to create as many tasks as workers, and only process 
> a chunk per task, forking it again at the end of each task.  (See it 
> as continuation passing style over a pool of workers.)  Which 
> essentially means that there won't be more tasks than there are 
> workers.

Now I'm confused about what you mean by tasks vs. workers.  I 
understood you to write that the # of fork/join recursive task 
*instances* should == the # of hardware threads (i.e. Java threads in 
the ForkJoinPool).  So in the parallel mergesort example, on my laptop 
that would mean 2 task *instances* (actually 3, counting the initial 
task).  The way to think about it is: the total number of nodes in the 
DAG (or tree) at its deepest == the # of tasks.  At least in theory, 
all of those 'nodes' could be active (still on a workqueue) at one 
time.
And in theory two really should be enough, but that entails each 
'node' doing a bigger chunk, since the total work (overhead aside) is 
the same.
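
For what it's worth, here is my best guess at the chunk-plus-fork 
scheme you're describing, transplanted to a simple array sum (entirely 
my own sketch, names and all, so I may well be misreading you): start 
one chain of tasks per worker, and have each task process one chunk and 
then fork its own continuation, so the number of live tasks never 
exceeds the number of workers.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of "CPS over a pool of workers", not anyone's
// actual code.
public class ChunkedSum
{
    static final int CHUNK = 100000;  // chunk size; tune by measuring

    // One link in a per-worker chain: sum one chunk, then fork the
    // continuation that handles this chain's next chunk.
    static class ChunkTask extends RecursiveAction
    {
        final long[] a;
        final int lo, stride;
        final AtomicLong total;

        ChunkTask(long[] a, int lo, int stride, AtomicLong total)
        {
            this.a = a; this.lo = lo; this.stride = stride; this.total = total;
        }

        protected void compute()
        {
            long s = 0;
            int hi = Math.min(lo + CHUNK, a.length);
            for (int i = lo; i < hi; i++) s += a[i];
            total.addAndGet(s);
            if (lo + stride < a.length)                        // more work left
                new ChunkTask(a, lo + stride, stride, total).fork();  // continuation
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        long[] data = new long[2000000];
        for (int i = 0; i < data.length; i++) data[i] = i % 100;

        int workers = Runtime.getRuntime().availableProcessors();
        ForkJoinPool pool = new ForkJoinPool(workers);
        AtomicLong total = new AtomicLong();
        for (int w = 0; w < workers; w++)          // one chain per worker
            pool.submit(new ChunkTask(data, w * CHUNK, workers * CHUNK, total));
        pool.shutdown();                           // chains keep forking until done
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("sum = " + total.get());
    }
}

What I don't yet see is how that maps onto the mergesort, where the 
merge steps impose ordering between chunks.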

It sure looks (empirically) like 16K 'chunks' are processed faster (in 
total) than 3 'chunks'.  Why?  Clearly the 16K have more 'overhead'.  
But for some reason they are utilizing more of the CPUs' power for a 
larger % of the time; that is the only logical conclusion that I can 
draw.
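
(To put a number on the 16K: with THRESHOLD = 250, splitting 2M 
elements in half bottoms out after about 13 levels, since 
2M / 2^13 =~ 244 < 250.  That's 2^13 = 8192 leaf tasks, and counting 
the internal merge nodes gives 2^14 - 1 =~ 16K tasks in total, which 
matches D. Holmes's figure.)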


>
> I hope I cleared things up.
>
> Cheers,
>  -T
>
>  That said, I don't know what the memory requirements are for your 
> algorithm.  If you are allocating temporary arrays then a large 
> threshold could well cause a problem.
>
>> No in-place merge sort? ;-)
>>  
>>
>>> > -----Original Message-----
>>> > From: concurrency-interest-bounces at cs.oswego.edu
>>> > [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
>>> > thurston at nomagicsoftware.com
>>> > Sent: Tuesday, 12 February 2013 7:49 AM
>>> > To: concurrency-interest at cs.oswego.edu
>>> > Subject: Re: [concurrency-interest] Some interesting (confusing?)
>>> > benchmark results
>>> >
>>> >
>>> > Well, it's the algorithm that's dispositive in this case.
>>> > Insertion-sort is empirically better than mergesort up to around
>>> > 25-sized arrays.  That's where I started, and I played around with
>>> > different thresholds, and 250 seemed to be the best.
>>> > Anything larger than 1000 and you start to see significant
>>> > degradation.
>>> > I tried your formula (approximately) with N = 2 000 000 ==> 250K
>>> >
>>> > That brought my laptop to its knees (as expected): over 33K ms as
>>> > opposed to 390ms with 250.
>>> >
>>> > Following my own admonition (the algorithm matters), it occurs to
>>> > me that I should just use my own serial mergesort as the comparison
>>> > to the ForkJoin version (as opposed to Arrays.sort).  I'll let
>>> > y'all know the results of that.
>>> >
>>> > -T
>>> >
>>> >
>>> >
>>> > On 2013-02-11 13:19, David Holmes wrote:
>>> > > Your sequential threshold is far too small for only two worker
>>> > > threads.  You would get near optimum speedup only splitting the
>>> > > problem into two.
>>> > >
>>> > > The heuristic for sequential threshold is:
>>> > >
>>> > > Threshold = N / (LoadBalanceFactor * NumCore)
>>> > >
>>> > > where N is the "size" of the problem.
>>> > >
>>> > > The LoadBalanceFactor in an ideal world would be 1, but real
>>> > > tasks encounter delays and execute differently, so a value > 1 is
>>> > > needed.  Heuristically, a value of 8 is a good starting value.
>>> > >
>>> > > David Holmes
>>> > >
>>> > >> -----Original Message-----
>>> > >> From: concurrency-interest-bounces at cs.oswego.edu
>>> > >> [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
>>> > >> thurston at nomagicsoftware.com
>>> > >> Sent: Tuesday, 12 February 2013 6:52 AM
>>> > >> To: concurrency-interest at cs.oswego.edu
>>> > >> Subject: [concurrency-interest] Some interesting (confusing?)
>>> > >> benchmark
>>> > >> results
>>> > >>
>>> > >>
>>> > >> Hello,
>>> > >>
>>> > >> I made some initial attempts at using the ForkJoin framework for
>>> > >> some rather obvious recursively parallel use-cases, namely
>>> > >> summing the elements of an int[] and also sorting an int[]
>>> > >> (randomly filled).
>>> > >>
>>> > >> The problem-sizes I used ranged from 1M - 10M.
>>> > >>
>>> > >> I really enjoy using the framework and find it relatively easy
>>> > >> to reason about my programs (if not necessarily the internals of
>>> > >> the framework).
>>> > >> But I was disappointed with the results: in both cases they were
>>> > >> slower than the corresponding Java serial implementation.
>>> > >>
>>> > >> I completely understand the principle of YMMV, and I wasn't
>>> > >> expecting 2x speedup, especially in the case of the sort, but I
>>> > >> was expecting that ForkJoin would do at least a little better
>>> > >> than the single-threaded version.  I guess what I'm asking is:
>>> > >> are these results surprising?  Does it mean I'm doing something
>>> > >> wrong?
>>> > >>
>>> > >> I'll give a brief description of my implementation:
>>> > >> single-threaded:  Arrays.sort(int[])
>>> > >>
>>> > >> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250 (arrived
>>> > >> at by trial-and-error)
>>> > >>
>>> > >> protected int[] compute()
>>> > >> {
>>> > >>     if (array.length < THRESHOLD)
>>> > >>         return insertionSort(array);
>>> > >>     MergeSort left = new MergeSort(array /* split in half */);
>>> > >>     MergeSort right = new MergeSort(array /* other half */);
>>> > >>     left.fork();  // must fork left before joining it below
>>> > >>     return merge(right.compute(), left.join());
>>> > >> }
>>> > >>
>>> > >> The insertionSort() and merge() methods are just standard
>>> > >> implementations; there is a bit of an apples-to-oranges
>>> > >> comparison since Arrays.sort() uses an optimized quicksort, but
>>> > >> we're still talking O(n log(n))
>>> > >>
>>> > >> Just ran it on my laptop:
>>> > >> Windows 7 64-bit
>>> > >> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
>>> > >> Core 2 Duo ==> 2 cores @ 2GHz
>>> > >>
>>> > >> 2M int[]:
>>> > >> single-threaded: ~330ms
>>> > >> ForkJoin (2 workers): ~390ms
>>> > >>
>>> > >> Would appreciate any feedback, and hopefully the #s are
>>> > >> somewhat helpful (and please no scoffing at my antiquated
>>> > >> machine)
>>> > >>
>>> > >> -T
>>> > >>
>>> > >>
>
> --
>
> VIKTOR KLANG
> _Director of Engineering_
>
> Typesafe [2] - The software stack for applications that scale
> Twitter: @viktorklang
>
> Links:
> ------
> [1] http://cs.oswego.edu/mailman/listinfo/concurrency-interest
> [2] http://www.typesafe.com/


