[concurrency-interest] Some interesting (confusing?) benchmark results

David Holmes davidcholmes at aapt.net.au
Mon Feb 11 19:36:34 EST 2013


Sure, but that is why the heuristic value of 8 is a good starting point.

Of course it can be refined for a given platform, but how are you going to express the logic in a cross-platform way?
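One portable way to express the heuristic (a sketch under my own assumptions; the class and method names are illustrative, not from this thread) is to ask the runtime for its core count:

```java
// Sketch: a cross-platform sequential-threshold heuristic.
// The LoadBalanceFactor of 8 is the starting value discussed in this
// thread; the class/method names and the clamp to 1 are illustrative.
class ThresholdHeuristic {
    static final int LOAD_BALANCE_FACTOR = 8;

    static int sequentialThreshold(int problemSize) {
        // availableProcessors() reports the core count portably.
        int cores = Runtime.getRuntime().availableProcessors();
        return Math.max(1, problemSize / (LOAD_BALANCE_FACTOR * cores));
    }
}
```

On a 2-core machine this gives 2,000,000 / (8 * 2) = 125,000, the same order of magnitude as the ~250K figure quoted later in the thread.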

David
  -----Original Message-----
  From: concurrency-interest-bounces at cs.oswego.edu [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf Of Viktor Klang
  Sent: Tuesday, 12 February 2013 8:39 AM
  To: dholmes at ieee.org
  Cc: thurston; concurrency-interest
  Subject: Re: [concurrency-interest] Some interesting (confusing?) benchmark results


  David,




  On Mon, Feb 11, 2013 at 11:34 PM, David Holmes <davidcholmes at aapt.net.au> wrote:

    Viktor,

    See the description of LoadBalanceFactor. It makes sense to create more tasks than there are threads because things progress at different rates: a thread can be delayed by scheduling, JIT activity, GC, and so on. Suppose with 2 threads one had completed its half while the other was only a quarter of the way through; you would then have to wait much longer for completion. But if you have broken things down into smaller subtasks (say 4), then that first thread could be "helping" the second to complete the overall job. Break it down too fine, though, and you get too much task overhead. Hence the heuristic value of 8, which is generally a good starting point. Plus, performance is not super-sensitive to getting the LoadBalanceFactor exactly right.
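The over-decomposition idea above can be sketched with a simple ForkJoin sum (illustrative code under my own naming, not anything posted in this thread); the point is that the threshold yields more leaf tasks than workers, so an idle thread can steal work from a delayed one:

```java
import java.util.concurrent.RecursiveTask;

// Sketch: summing an int[] with more leaf tasks than worker threads,
// so a delayed worker's half can be "helped" by the others. Names and
// structure are illustrative only.
class SumTask extends RecursiveTask<Long> {
    final int[] a;
    final int lo, hi, threshold;

    SumTask(int[] a, int lo, int hi, int threshold) {
        this.a = a; this.lo = lo; this.hi = hi; this.threshold = threshold;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= threshold) {        // small enough: sum sequentially
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(a, lo, mid, threshold);
        left.fork();                       // make the left half stealable
        long right = new SumTask(a, mid, hi, threshold).compute();
        return right + left.join();        // wait for the forked half
    }
}
```

Invoked as `new ForkJoinPool().invoke(new SumTask(a, 0, a.length, a.length / (8 * cores)))`, this creates roughly 8 leaf tasks per core.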


  True, but cache effects (and hardware) also have to be taken into consideration (shared L2 vs shared L3 vs going cross-socket/NUMA).
  Ideally the user API would be devoid of such encoded assumptions, as they will vary at the platform level…


  Cheers,
  √


    David
      -----Original Message-----
      From: Viktor Klang [mailto:viktor.klang at gmail.com]
      Sent: Tuesday, 12 February 2013 8:25 AM
      To: dholmes at ieee.org
      Cc: thurston; concurrency-interest
      Subject: Re: [concurrency-interest] Some interesting (confusing?) benchmark results







      On Mon, Feb 11, 2013 at 11:18 PM, David Holmes <davidcholmes at aapt.net.au> wrote:

        That doesn't make sense to me. If you had a threshold of 250 then you are
        creating thousands of tasks, and that overhead could well bring your
        laptop to its knees. With a 250K threshold you have far fewer tasks.



      I'm not sure it makes sense to create more sort-tasks than there are threads in the pool.


        That said, I don't know what the memory requirements are for your
        algorithm.  If you are allocating temporary arrays then a large
        threshold could well cause a problem.

        The parallel array sort going into JDK 8 is a merge sort but uses a working
        array equal in size to the original array to be sorted.



      No in-place merge sort? ;-)


        David


        > -----Original Message-----
        > From: concurrency-interest-bounces at cs.oswego.edu
        > [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
        > thurston at nomagicsoftware.com

        > Sent: Tuesday, 12 February 2013 7:49 AM
        > To: concurrency-interest at cs.oswego.edu
        > Subject: Re: [concurrency-interest] Some interesting (confusing?)
        > benchmark results
        >
        >
        > Well, it's the algorithm that's dispositive in this case.
        > Insertion sort is empirically better than merge sort up to arrays
        > of around 25 elements.  That's where I started, and I played
        > around with different thresholds, and 250 seemed to be the best.
        > Anything larger than 1000 and you start to see significant
        > degradation.
        > I tried your formula (approximately) with N = 2 000 000 ==> 250K
        >
        > That brought my laptop to its knees (as expected): over 33K ms as
        > opposed to 390ms with 250.
        >
        > Following my own admonition (the algorithm matters), it occurs to
        > me that I should just use my own serial mergesort as the comparison
        > for the ForkJoin version (as opposed to Arrays.sort).  I'll let
        > y'all know the results of that.
        >
        > -T
        >
        >
        >
        > On 2013-02-11 13:19, David Holmes wrote:
        > > Your sequential threshold is far too small for only two worker
        > > threads.  You would get near-optimum speedup by splitting the
        > > problem into just two.
        > >
        > > The heuristic for sequential threshold is:
        > >
        > > Threshold = N / (LoadBalanceFactor * NumCore)
        > >
        > > where N is the "size" of the problem.
        > >
        > > The LoadBalanceFactor in an ideal world would be 1, but real
        > > tasks encounter delays and execute at different rates, so a
        > > value > 1 is needed.  Heuristically, a value of 8 is a good
        > > starting point.
        > >
        > > David Holmes
        > >
        > >> -----Original Message-----
        > >> From: concurrency-interest-bounces at cs.oswego.edu
        > >> [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
        > >> thurston at nomagicsoftware.com
        > >> Sent: Tuesday, 12 February 2013 6:52 AM
        > >> To: concurrency-interest at cs.oswego.edu
        > >> Subject: [concurrency-interest] Some interesting (confusing?)
        > >> benchmark
        > >> results
        > >>
        > >>
        > >> Hello,
        > >>
        > >> I made some initial attempts at using ForkJoin framework for some
        > >> rather obvious recursively parallel use-cases, namely summing the
        > >> elements of an int[] and also sorting an int[] (randomly filled).
        > >>
        > >> The problem-sizes I used ranged from 1M - 10M.
        > >>
        > >> I really enjoy using the framework and find it relatively easy to
        > >> reason about my programs (if not necessarily the internals of the
        > >> framework).
        > >> But I was disappointed with the results: in both cases they were
        > >> slower
        > >> than the corresponding Java serial implementation.
        > >>
        > >> I completely understand the principle of YMMV, and I wasn't
        > >> expecting a 2x speedup, but (especially in the case of the sort)
        > >> I was expecting that ForkJoin would do at least a little better
        > >> than the single-threaded version.  I guess what I'm asking is:
        > >> are these results surprising?  Does it mean I'm doing something
        > >> wrong?
        > >>
        > >> I'll give a brief description of my implementation:
        > >> single-threaded:  Arrays.sort(int[])
        > >>
        > >> ForkJoin: I'm using a merge-sort, with THRESHOLD = 250 (arrived at
        > >> by
        > >> trial-and-error)
        > >>
        > >> protected int[] compute()
        > >> {
        > >>     if (array.length < THRESHOLD)
        > >>         return insertionSort(array);
        > >>     MergeSort left = new MergeSort(/* first half of array */);
        > >>     MergeSort right = new MergeSort(/* other half */);
        > >>     left.fork();  // without this, left.join() has nothing to wait on
        > >>     return merge(right.compute(), left.join());
        > >> }
        > >>
        > >> The insertionSort() and merge() methods are just standard
        > >> implementations; there is a bit of an apples-to-oranges
        > >> comparison, since Arrays.sort() uses an optimized quicksort,
        > >> but we're still talking O(n log n).
        > >>
        > >> Just ran it on my laptop:
        > >> Windows 7 64-bit
        > >> 1.7.0_03; Java HotSpot(TM) 64-Bit Server VM 22.1-b02
        > >> Core 2 Duo==> 2 cores 2GHz
        > >>
        > >> 2M int[]:
        > >> single-threaded: ~330ms
        > >> ForkJoin (2 workers): ~390ms
        > >>
        > >> Would appreciate any feedback, and hopefully the #s are somewhat
        > >> helpful (and please no scoffing at my antiquated machine).
        > >>
        > >> -T
        > >>
        > >>
        > >> _______________________________________________
        > >> Concurrency-interest mailing list
        > >> Concurrency-interest at cs.oswego.edu
        > >> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
        > >>
        >







      -- 

      Viktor Klang
      Director of Engineering 


      Typesafe - The software stack for applications that scale
      Twitter: @viktorklang




