[concurrency-interest] Matrix multiply with parallelized inner product

jason marshall jdmarshall at gmail.com
Wed Feb 6 11:38:32 EST 2008


But doesn't that still mean that the total number of tasks in the
executor (or indeed, probably on the whole machine) should be 8-16x the
available CPUs at the start of a new task, not that each task should be
subdivided 8-16 ways?  If 4 tasks of roughly comparable cost come in at
about the same time, you want to split each into 2-4, not 8-16, right?
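
To put rough numbers on that, here is a back-of-the-envelope sketch (the
class and method names are invented purely for illustration, and the 8-16x
figure is just the target under discussion, read as a per-CPU multiplier):

    // Rough arithmetic sketch, not anything from the library: keep the TOTAL
    // task count near 8-16x the CPU count, so each task's own per-CPU split
    // budget shrinks as more independent tasks share the pool.
    final class SplitBudget {
        static int perTaskMultiplier(int targetMultiplier, int concurrentTasks) {
            // e.g. a 16x-per-CPU budget shared by 4 concurrent tasks
            // leaves each of them a 4x-per-CPU budget
            return Math.max(1, targetMultiplier / concurrentTasks);
        }

        public static void main(String[] args) {
            int cpus = Runtime.getRuntime().availableProcessors();
            int tasks = 4;                              // tasks arriving together
            int perTask = perTaskMultiplier(16, tasks); // 16x-per-CPU budget -> 4x each
            System.out.println("each task splits into ~" + (perTask * cpus) + " pieces");
        }
    }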

A task that blindly reschedules partitions of itself will grow the queue
to 16-64 entries before the second pass even begins.  If it continues to
divide recursively, that adds up to quite a lot of overhead.  I would think
this only makes sense if there is some cheap calculation you can do to
detect asymmetries in the computation (for instance, a sparsely populated
array might keep subdividing its densest segments until they are 'small
enough').  If there are no gross asymmetries of that sort, then recursively
calling the Executor to split the tasks is going to reduce throughput, no
matter what you do.
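
Something like the following is what I have in mind for the sparse case -- a
rough sketch only (the names nonZeroPrefix, SMALL_ENOUGH and processLeaf are
invented, and RecursiveAction is just a stand-in for whatever task type is in
use): the split decision looks at an O(1) estimate of the real work in a
segment, so dense regions keep subdividing while sparse ones become leaves
right away.

    import java.util.concurrent.RecursiveAction;

    // Sketch: subdivide the densest segments until they are 'small enough'.
    // nonZeroPrefix[i] holds the number of non-zero elements in a[0..i),
    // precomputed once, so estimating a segment's work is O(1).
    final class DensityAwareTask extends RecursiveAction {
        static final int SMALL_ENOUGH = 4096;   // assumed leaf-work threshold
        final double[] a;
        final long[] nonZeroPrefix;
        final int lo, hi;

        DensityAwareTask(double[] a, long[] nonZeroPrefix, int lo, int hi) {
            this.a = a; this.nonZeroPrefix = nonZeroPrefix;
            this.lo = lo; this.hi = hi;
        }

        protected void compute() {
            long work = nonZeroPrefix[hi] - nonZeroPrefix[lo]; // cheap work estimate
            if (work > SMALL_ENOUGH) {
                int mid = (lo + hi) >>> 1;
                invokeAll(new DensityAwareTask(a, nonZeroPrefix, lo, mid),
                          new DensityAwareTask(a, nonZeroPrefix, mid, hi));
            } else {
                processLeaf();  // small or sparse segment: just run it here
            }
        }

        void processLeaf() { /* the actual leaf computation over a[lo..hi) */ }
    }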

There's nothing you can really do to the Executor to protect people from
having to understand how parallelization works.

-Jason

On Feb 5, 2008 4:18 PM, Doug Lea <dl at cs.oswego.edu> wrote:

> jason marshall wrote:
> > I'm curious about the magic number of 8 ( or "p<<3")
>
> It is empirically derived. In combination with the dynamics
> pasted below (from PAS.java in our CVS), it is approximately
> the value that maximizes throughput on near-no-op operations on
> moderate-sized arrays run on common 8-through-32-way
> platforms. The value in part reflects internal task
> overhead. As this overhead decreases, this know-nothing
> estimate can generate finer granularities -- 16 is almost
> as good as 8, and might with some work replace it sometime.
> As discussed for example in sec 4.4 of my CPJ book, if you
> do know something about execution times of base tasks,
> or are just willing to engage in some trial and error, you can
> sometimes do better in choosing granularities. The operation
> arguments to apply, map, reduce etc are opaque to us inside
> Parallel*Array so we cannot do this. But fortunately,
> the sensitivity curves are usually pretty shallow, so
> with only a little dynamic adjustment, exact values don't
> usually matter a lot.
>
> >
> > I understand that you're trying to accommodate asymmetries in the work
> > load, to reduce the latency at the tail end (caused by unequal units of
> > work, unequal CPU loading)
>
> And more things like that. Including differences in
> cache miss rates, hence execution times even
> for "equal" units of work. The finer the granularity,
> the better you can cope, but with overly fine granularities
> you start losing to task creation, work-stealing, and GC
> overhead. See below on how we dynamically
> counterbalance when luck prevails and we can coarsen a bit.
>
>     /**
>      * Base for most divide-and-conquer tasks used for computing
>      * ParallelArray operations. Rather than pure recursion, it links
>      * right-hand-sides and then joins up the tree, exploiting cases
>      * where tasks aren't stolen.  This generates and joins tasks with
>      * a bit less overhead than pure recursive style -- there are only
>      * as many tasks as leaves (no strictly internal nodes).
>      *
>      * Split control relies on pap.getThreshold(), which is
>      * expected to err on the side of generating too many tasks. To
>      * counterbalance, if a task pops off its own smallest subtask, it
>      * directly runs its leaf action rather than possibly resplitting.
>      *
>      * There are, with a few exceptions, three flavors of each FJBase
>      * subclass, prefixed FJO (object reference), FJD (double) and FJL
>      * (long).
>      */
>     static abstract class FJBase extends RecursiveAction {
>         final AbstractParallelAnyArray pap;
>         final int lo;
>         final int hi;
>         final FJBase next; // the next task that creator should join
>         FJBase(AbstractParallelAnyArray pap, int lo, int hi, FJBase next) {
>             this.pap = pap;
>             this.lo = lo;
>             this.hi = hi;
>             this.next = next;
>         }
>
>         public final void compute() {
>             int g = pap.getThreshold();
>             int l = lo;
>             int h = hi;
>             if (h - l > g)
>                 internalCompute(l, h, g);
>             else
>                 atLeaf(l, h);
>         }
>
>         final void internalCompute(int l, int h, int g) {
>             FJBase r = null;
>             do {
>                 int rh = h;
>                 h = (l + h) >>> 1;
>                 (r = newSubtask(h, rh, r)).fork();
>             } while (h - l > g);
>             atLeaf(l, h);
>             do {
>                 if (ForkJoinWorkerThread.removeIfNextLocalTask(r))
>                     r.atLeaf(r.lo, r.hi);
>                 else
>                     r.join();
>                 onReduce(r);
>                 r = r.next;
>             } while (r != null);
>         }
>
>         /** Leaf computation */
>         abstract void atLeaf(int l, int h);
>         /** Operation performed after joining right subtask -- default noop */
>         void onReduce(FJBase right) {}
>         /** Factory method to create new subtask, normally of current type */
>         abstract FJBase newSubtask(int l, int h, FJBase r);
>     }


-- 
- Jason