[concurrency-interest] Matrix multiply with parallelized inner product

Doug Lea dl at cs.oswego.edu
Tue Feb 5 19:18:05 EST 2008

jason marshall wrote:
> I'm curious about the magic number of 8 ( or "p<<3")

It is empirically derived. In combination with the dynamics
pasted below (from PAS.java in our CVS), it is approximately
the value that maximizes throughput on near-no-op operations on
moderate-sized arrays run on common 8-through-32-way
platforms. The value in part reflects internal task
overhead. As this overhead decreases, this know-nothing
estimate can generate finer granularities -- 16 is almost
as good as 8, and might with some work replace it sometime.
As discussed for example in sec 4.4 of my CPJ book, if you
do know something about execution times of base tasks,
or are just willing to engage in some trial and error, you can
sometimes do better in choosing granularities. The operation
arguments to apply, map, reduce etc are opaque to us inside
Parallel*Array so we cannot do this. But fortunately,
the sensitivity curves are usually pretty shallow, so
with only a little dynamic adjustment, exact values don't
usually matter a lot.
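As a rough sketch of the kind of know-nothing heuristic described
above (the names here are illustrative, not the actual Parallel*Array
internals):

```java
// Illustrative sketch only: a split threshold derived from array size n
// and parallelism level p, using the empirically derived factor of 8
// ("p << 3") discussed above. ThresholdSketch and threshold() are
// hypothetical names, not the real Parallel*Array internals.
public class ThresholdSketch {
    static int threshold(int n, int p) {
        // Aim for roughly 8 leaf tasks per worker; never below 1 element.
        return Math.max(1, n / (p << 3));
    }
    public static void main(String[] args) {
        // e.g. a 1<<20 element array on an 8-way machine:
        System.out.println(threshold(1 << 20, 8)); // 16384
        System.out.println(threshold(100, 32));    // 1 (floor keeps it sane)
    }
}
```

With this shape, halving the internal task overhead amounts to doubling
the divisor (finer leaves), which is the "16 is almost as good as 8"
trade-off mentioned above.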

> I understand that you're trying to accommodate asymmetries in the work 
> load, to reduce the latency at the tail end (caused by unequal units of 
> work, unequal CPU loading)

And more things like that. Including differences in
cache miss rates, hence execution times even
for "equal" units of work. The finer the granularity,
the better you can cope, but with overly fine granularities
you start losing to task creation, work-stealing, and GC
overhead. See below on how we dynamically
counterbalance when luck prevails and we can coarsen a bit.

     /**
      * Base for most divide-and-conquer tasks used for computing
      * ParallelArray operations. Rather than pure recursion, it links
      * right-hand sides and then joins up the tree, exploiting cases
      * where tasks aren't stolen.  This generates and joins tasks with
      * a bit less overhead than pure recursive style -- there are only
      * as many tasks as leaves (no strictly internal nodes).
      * Split control relies on pap.getThreshold(), which is
      * expected to err on the side of generating too many tasks. To
      * counterbalance, if a task pops off its own smallest subtask, it
      * directly runs its leaf action rather than possibly resplitting.
      * There are, with a few exceptions, three flavors of each FJBase
      * subclass, prefixed FJO (object reference), FJD (double) and FJL
      * (long).
      */
     static abstract class FJBase extends RecursiveAction {
         final AbstractParallelAnyArray pap;
         final int lo;
         final int hi;
         final FJBase next; // the next task that creator should join
         FJBase(AbstractParallelAnyArray pap, int lo, int hi, FJBase next) {
             this.pap = pap;
             this.lo = lo;
             this.hi = hi;
             this.next = next;
         }

         public final void compute() {
             int g = pap.getThreshold();
             int l = lo;
             int h = hi;
             if (h - l > g)
                 internalCompute(l, h, g);
             else
                 atLeaf(l, h);
         }

         final void internalCompute(int l, int h, int g) {
             FJBase r = null;
             do { // fork right-hand sides, linking them for later joining
                 int rh = h;
                 h = (l + h) >>> 1;
                 (r = newSubtask(h, rh, r)).fork();
             } while (h - l > g);
             atLeaf(l, h);
             do { // run unstolen subtasks directly; join stolen ones
                 if (ForkJoinWorkerThread.removeIfNextLocalTask(r))
                     r.atLeaf(r.lo, r.hi);
                 else
                     r.join();
                 onReduce(r);
                 r = r.next;
             } while (r != null);
         }

         /** Leaf computation */
         abstract void atLeaf(int l, int h);
         /** Operation performed after joining right subtask -- default noop */
         void onReduce(FJBase right) {}
         /** Factory method to create new subtask, normally of current type */
         abstract FJBase newSubtask(int l, int h, FJBase r);
     }

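For readers without the jsr166 extras on hand, the same
link-and-join-up-the-tree idea can be sketched with stock
java.util.concurrent, using ForkJoinTask.tryUnfork() in place of
removeIfNextLocalTask. This is an illustration of the pattern (a
simplified sum reduction with hypothetical names), not the PAS.java
code itself:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Sketch of the FJBase pattern: fork linked right-hand sides, run the
// leftmost leaf, then walk the links, directly running any subtask that
// wasn't stolen (tryUnfork succeeds) and joining the ones that were.
class LinkedSum extends RecursiveAction {
    final long[] a;
    final int lo, hi;
    final LinkedSum next;  // the next task that creator should join
    long result;

    static final int THRESHOLD = 1 << 10;  // illustrative fixed leaf size

    LinkedSum(long[] a, int lo, int hi, LinkedSum next) {
        this.a = a; this.lo = lo; this.hi = hi; this.next = next;
    }

    public void compute() {
        int l = lo, h = hi;
        LinkedSum r = null;
        while (h - l > THRESHOLD) {        // fork right halves, linking them
            int rh = h;
            h = (l + h) >>> 1;
            (r = new LinkedSum(a, h, rh, r)).fork();
        }
        long sum = 0;
        for (int i = l; i < h; ++i) sum += a[i];  // leftmost leaf
        while (r != null) {                // join-or-run linked subtasks
            if (r.tryUnfork()) r.compute(); else r.join();
            sum += r.result;               // the "onReduce" step
            r = r.next;
        }
        result = sum;
    }

    public static void main(String[] args) {
        long[] a = new long[1 << 16];
        for (int i = 0; i < a.length; ++i) a[i] = i;
        LinkedSum task = new LinkedSum(a, 0, a.length, null);
        new ForkJoinPool().invoke(task);
        System.out.println(task.result);   // 2147450880 == 65536*65535/2
    }
}
```

Note that an unstolen subtask run via compute() may still split further
if its range exceeds the threshold, so nothing is lost by popping it off
the local queue instead of recursing eagerly.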
More information about the Concurrency-interest mailing list