[concurrency-interest] Some interesting (confusing?) benchmark results

√iktor Ҡlang viktor.klang at gmail.com
Tue Feb 12 05:09:16 EST 2013


On Tue, Feb 12, 2013 at 10:40 AM, <thurston at nomagicsoftware.com> wrote:

>
>
> On 2013-02-11 23:47, √iktor Ҡlang wrote:
>
>> On Tue, Feb 12, 2013 at 6:13 AM, <thurston at nomagicsoftware.com> wrote:
>>
>>  On 2013-02-11 14:24, √iktor Ҡlang wrote:
>>>
>>>  On Mon, Feb 11, 2013 at 11:18 PM, David Holmes <
>>>> davidcholmes at aapt.net.au> wrote:
>>>>
>>>>  That doesn't make sense to me. If you had a threshold of 250 then you
>>>>> are
>>>>> creating thousands of tasks and that overhead should potentially bring
>>>>> your
>>>>> laptop to its knees. With 250K threshold you have far fewer tasks.
>>>>>
>>>>
>>>> I'm not sure it makes sense to create more sort-tasks than there are
>>>> Threads in the pool.
>>>>
>>>
>>> Really, Viktor?  That surprises me coming from Akka.
>>>
>>
>> I have found that when I am surprised by someone, I get a much better
>> understanding by asking for clarification before I assume too much.
>>
>>
>>
>>   Does that mean you guys only allocate as many message-processing
>>> "workers" as there are hardware threads?
>>>
>>
>> What's your definition of "worker"?
>>
> Worker in the above sense means the # of Recursive(Action/Task) *instances*
> (in fork/join terms), which probably isn't applicable for message
> passing/mailbox processing.
>

Ok, for me the worker is the ForkJoinWorkerThread. The Task is just how to
do the work.


> I guess I mean how many dedicated threads are there that can be actively
> processing Actors' messages (i.e. not waiting at the receive if that makes
> sense)


We only schedule one task for an actor, no matter how many messages it has,
and then submit a new task at the end of each task, if there is more work
to be done (you can see it as continuation passing style over a thread
pool).
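
For the curious, a bare-bones sketch of that scheduling discipline (not
Akka's actual code, just the shape of the idea; all names here are mine):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// At most one task is ever in flight per mailbox; a finishing task
// re-submits the mailbox if messages remain (CPS over the thread pool).
class Mailbox implements Runnable {
    private final Queue<Runnable> messages = new ConcurrentLinkedQueue<Runnable>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false);
    private final Executor pool;

    Mailbox(Executor pool) { this.pool = pool; }

    void send(Runnable message) {
        messages.offer(message);
        trySchedule();
    }

    private void trySchedule() {
        // schedule at most one task for this actor, whatever the queue depth
        if (!messages.isEmpty() && scheduled.compareAndSet(false, true))
            pool.execute(this);
    }

    public void run() {
        try {
            Runnable m = messages.poll();
            if (m != null) m.run();   // process one message (or a small batch)
        } finally {
            scheduled.set(false);     // release the claim...
            trySchedule();            // ...and re-submit if more work arrived
        }
    }
}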


>
>
>  Consider this:
>>
>> Do you agree that thread pool sizing depends on type of work? (IO
>> bound vs CPU bound, bursty vs steady etc etc)
>>
> Well, yes; i.e. optimal thread-pool sizing (and of course it depends on a
> lot more than that)


>
>  Do you agree that a JVM Thread is not a unit of parallelism?
>>
> Not sure what you mean.


A JVM Thread is a unit of concurrency and the only unit of execution
available.


>
>
>  Do you agree that having more JVM Threads than hardware threads is
>> bad for CPU-bound workloads?
>>
>
> Not necessarily. What is true is that you want as few threads as are
> necessary to keep all the CPUs at (or near) 100%.  In an ideal case, that
> would be == hardware threads (context switches are expensive), but OS
> thread scheduling is complex and opaque (I'm not pretending to be
> an expert). I don't know what the magic formula is--you have to measure.


Actually, according to our heuristics the optimal setting (for us & for
memory-bound work) tends to be 0.6-0.7 x the number of available hardware
threads (a long topic that should probably be discussed in a dedicated
thread).
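
(Illustrative only, taking the 0.7 end of that range:)

int hw = Runtime.getRuntime().availableProcessors();
java.util.concurrent.ForkJoinPool pool =
    new java.util.concurrent.ForkJoinPool(Math.max(1, (int) (hw * 0.7)));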


> but it seems plausible that an extra buffer of software threads > hardware
> threads could perform better even for a CPU-bound workload


Have you ever observed that?


> (does reading/writing to memory count as I/O? I guess technically it does;
> hard to think of a meaningful workload that wouldn't entail memory access)
>

Technically all reads and writes are IO.


>
>
>>  To be clear, with the 2M-size array data and the threshold of
>> 250-size arrays for the (serial) insertion sort, 16K+ sort tasks are
>> constructed (as D. Holmes mentioned).  This is significantly *faster*
>> than doing what either of you suggested.  If I simply split the
>> problem in two, and just run a serial mergesort on each half and then
>> merge the two halves (that's three total tasks), it performs slower.
>>
>>> What sort of work-stealing could happen with two tasks (each merge-sorting
>>> a 1M item array)?  With 16K tasks, the work-stealing can really strut its
>>> stuff (and at least in my case, can overwhelm the overhead associated with
>>> scheduling the tasks)
>>>
>>> If you only create as many tasks as there are hardware threads -- that
>>> seems a lot like a standard thread-pool implementation.  (Of course, if
>>> there are many more cores, the opportunity for work-stealing would increase
>>> even with the 1-1 ratio).
>>>
>>> When I first used fork/join, that's the approach I took.
>>> 1.  split the problem into (hardware threads * (1 or 2)) tasks
>>> 2.  have each task do its work serially (i.e. don't fork)
>>> 3.  combine the (sub)totals/results
>>>
>>> Not only is that less appealing from a programming point of view, it
>>> performs slower; that's the promise, nay the very purpose, of the
>>> fork/join framework, at least when it comes to implementing naturally
>>> recursive algorithms.
>>>
>>> What I meant was to create as many tasks as there are workers, have each
>>> task process only a chunk, and at the end of each task fork a new one
>>> (see it as continuation passing style over a pool of workers). Which
>>> essentially means that there won't be more tasks than there are workers.
>>>
>>
> Now, I'm confused what you mean by tasks vs workers.


worker == ForkJoinWorkerThread
task == ForkJoinTask
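
To make the distinction concrete (a trivial, hypothetical illustration): a
pool sized at 2 has exactly two workers, but you can hand it thousands of
tasks:

ForkJoinPool pool = new ForkJoinPool(2);   // 2 workers (ForkJoinWorkerThreads)
for (int i = 0; i < 16_000; i++) {
    pool.submit(new RecursiveAction() {    // 16K tasks (ForkJoinTasks)
        protected void compute() { /* e.g. sort one small chunk */ }
    });
}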


>  I understood you to write that the # of fork/join recursive task
> *instances* should == the # of hardware threads (i.e. Java threads in the
> ForkJoinPool). So in the parallel mergesort example, on my laptop that
> would mean 2 (actually 3 counting the initial task) task *instances*; the
> way to think about it is that the total number of nodes in the DAG (or
> tree) at its deepest == # of tasks.  At least in theory, all of those
> 'nodes' could be active (still on a workqueue) at one time.
> And in theory, two really should be enough, but that entails each 'node'
> doing a bigger chunk, since the total work (overhead aside) is the same.


> It sure looks (empirically) like 16K 'chunks' are processed faster
> (in total) than 3 'chunks'.  Why?  Clearly the 16K have more 'overhead'.
> But for some reason, they are utilizing more of the CPUs' power for a
> larger % of time; that is the only logical conclusion that I can draw
>

Another interesting solution is to spawn as many tasks as workers and, at
the beginning of executing every task, fork a new task. The workload per
task is still a smaller chunk of the array itself. The effect might just be
that there are enough tasks to make sure that workers that finish early
have something to do (either take from the local queue or steal), and you
don't need to create a ton of tasks up front.
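
A minimal, untested sketch of that scheme for the summing case (all names
are mine, claiming chunks via a shared cursor is just one way to hand out
the work, and for brevity it seeds one task rather than one per worker):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.atomic.AtomicInteger;

// Each task claims one chunk, forks its successor *first*, then sums its
// chunk serially; tasks trickle into the queues instead of 16K up front.
class ChunkSum extends RecursiveTask<Long> {
    private final int[] a;
    private final AtomicInteger cursor;   // start index of next unclaimed chunk
    private final int chunk;

    ChunkSum(int[] a, AtomicInteger cursor, int chunk) {
        this.a = a; this.cursor = cursor; this.chunk = chunk;
    }

    @Override protected Long compute() {
        int start = cursor.getAndAdd(chunk);
        if (start >= a.length) return 0L;          // nothing left to claim
        ChunkSum next = new ChunkSum(a, cursor, chunk);
        next.fork();                               // fork before doing the work
        long sum = 0;
        for (int i = start, end = Math.min(start + chunk, a.length); i < end; i++)
            sum += a[i];
        return sum + next.join();                  // fold in the successor's result
    }
}

// Usage: long total = new ForkJoinPool().invoke(
//     new ChunkSum(array, new AtomicInteger(0), 250));
// Caveat: the chain of join()s can get deep for very large arrays; a
// production version would want to bound it.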

Cheers,
√


>
>
>
>>> I hope I cleared things up.
>>>
>>> Cheers,
>>> √
>>>
>>
>>  -T
>>
>>  That said, I don't know what the memory requirements are for your
>> algorithm.
>>  If you are allocating temporary arrays then a large threshold could well
>>  cause a problem.
>>
>>>
>>> No in-place merge sort? ;-)
>>>
>>>
>>>> > -----Original Message-----
>>>> > From: concurrency-interest-bounces at cs.oswego.edu
>>>> > [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf Of
>>>> > thurston at nomagicsoftware.com
>>>>
>>>> > Sent: Tuesday, 12 February 2013 7:49 AM
>>>> > To: concurrency-interest at cs.oswego.edu
>>>> > Subject: Re: [concurrency-interest] Some interesting (confusing?)
>>>> > benchmark results
>>>> >
>>>> >
>>>> > Well, it's the algorithm that's dispositive in this case.
>>>> > Insertion-sort is empirically better than mergesort up to around
>>>> > 25-sized arrays.  That's where I started, and I played around with
>>>> > different thresholds and 250 seemed to be the best.
>>>> > Anything larger than 1000 you start to see significant degradation.
>>>> > I tried your formula (approximately) with N = 2 000 000 ==> 250K
>>>> >
>>>> > That brought my laptop to its knees (as expected): over 33K ms as
>>>> > opposed to 390ms with 250.
>>>> >
>>>> > Following my own admonition (the algorithm matters), it occurs to me
>>>> > that I should just use my own serial mergesort as the comparison to the
>>>> > ForkJoin version (as opposed to Arrays.sort).  I'll let y'all know the
>>>> > results of that
>>>> >
>>>> > -T
>>>> >
>>>> >
>>>> >
>>>> > On 2013-02-11 13:19, David Holmes wrote:
>>>> > > Your sequential threshold is far too small for only two worker
>>>> > > threads. You
>>>> > > would get near optimum speedup only splitting the problem into two.
>>>> > >
>>>> > > The heuristic for sequential threshold is:
>>>> > >
>>>> > > Threshold = N / (LoadBalanceFactor * NumCore)
>>>> > >
>>>> > > where N is the "size" of the problem.
>>>> > >
>>>> > > The LoadBalanceFactor in an ideal world would be 1 but real tasks
>>>> > > encounter
>>>> > > delays and execute differently so a value > 1 is needed.
>>>> > > Heuristically a
>>>> > > value of 8 is a good starting value.
>>>> > >
>>>> > > David Holmes
>>>> > >
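
(For concreteness: with N = 2,000,000, LoadBalanceFactor = 8 and NumCore = 2,
that heuristic gives Threshold = 2,000,000 / (8 * 2) = 125,000, in the same
ballpark as the ~250K tried above and vastly larger than 250.)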
>>>> > >> -----Original Message-----
>>>> > >> From: concurrency-interest-bounces at cs.oswego.edu
>>>> > >> [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf Of
>>>> > >> thurston at nomagicsoftware.com
>>>> > >> Sent: Tuesday, 12 February 2013 6:52 AM
>>>> > >> To: concurrency-interest at cs.oswego.edu
>>>> > >> Subject: [concurrency-interest] Some interesting (confusing?)
>>>> > >> benchmark
>>>> > >> results
>>>> > >>
>>>> > >>
>>>> > >> Hello,
>>>> > >>
>>>> > >> I made some initial attempts at using ForkJoin framework for some
>>>> > >> rather obvious recursively parallel use-cases, namely summing the
>>>> > >> elements of an int[] and also sorting an int[] (randomly filled).
>>>> > >>
>>>> > >> The problem-sizes I used ranged from 1M - 10M.
>>>> > >>
>>>> > >> I really enjoy using the framework and find it relatively easy to
>>>> > >> reason about my programs (if not necessarily the internals of the
>>>> > >> framework).
>>>> > >> But I was disappointed with the results: in both cases they were
>>>> > >> slower
>>>> > >> than the corresponding Java serial implementation.
>>>> > >>
>>>> > >> I completely understand the principle of YMMV, and I wasn't expecting
>>>> > >> a 2x speedup, but, especially in the case of the sort, I was expecting
>>>> > >> that ForkJoin would do at least a little better than the
>>>> > >> single-threaded version.  I guess what I'm asking is: are these
>>>> > >> results surprising?
>>>> > >> Does it mean I'm doing something wrong?
>>>> > >>
>>>> > >> I'll give a brief description of my implementation:
>>>> > >> single-threaded:  Arrays.sort(int[])
>>>> > >>
>>>> > >> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250 (arrived at
>>>> > >> by
>>>> > >> trial-and-error)
>>>> > >>
>>>> > >> int[] compute()
>>>> > >> {
>>>> > >>     if (array.length < THRESHOLD)
>>>> > >>         return insertionSort(array);
>>>> > >>     MergeSort left  = new MergeSort(/* first half of array */);
>>>> > >>     MergeSort right = new MergeSort(/* other half */);
>>>> > >>     left.fork();  // without a fork, left.join() below has nothing to wait on
>>>> > >>     return merge(right.compute(), left.join());
>>>> > >> }
>>>> > >>
>>>> > >> The insertionSort() and merge() methods are just standard
>>>> > >> implementations; there is a bit of apples to oranges comparison
>>>> > >> since
>>>> > >> Arrays.sort() uses an optimized quicksort, but we're still talking
>>>> > >> O(n log n)
>>>> > >>
>>>> > >> Just ran it on my laptop:
>>>> > >> Windows 7 64-bit
>>>> > >> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
>>>> > >> Core 2 Duo==> 2 cores 2GHz
>>>> > >>
>>>> > >> 2M int[]:
>>>> > >> single-threaded: ~330ms
>>>> > >> ForkJoin (2 workers): ~390ms
>>>> > >>
>>>> > >> Would appreciate any feedback and hopefully the #s are somewhat
>>>> > >> helpful
>>>> > >> (and please no scoffing at my antiquated machine)
>>>> > >>
>>>> > >> -T
>>>> > >>
>>>> > >>
>>>> --
>>>>
>>>
>>>
>>> VIKTOR KLANG
>>>  _Director of Engineering_
>>>
>>>  Typesafe [2] - The software stack for applications that scale
>>>  Twitter: @viktorklang
>>>
>>> Links:
>>> ------
>>> [1] http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>> [2] http://www.typesafe.com/
>>>
>>> --
>>>
>>>
>>> VIKTOR KLANG
>>> _Director of Engineering_
>>>
>>
>>  Typesafe [2] - The software stack for applications that scale
>>  Twitter: @viktorklang
>>
>> Links:
>> ------
>> [1] http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>> [2] http://www.typesafe.com/
>>
>
>


-- 
*Viktor Klang*
*Director of Engineering*
Typesafe <http://www.typesafe.com/> - The software stack for applications
that scale
Twitter: @viktorklang

