I agree that something should be done to improve performance in cases
where the optimal steady state is to engage only a subset of available
worker threads. I'll explore other options (this has been on my todo list
for too long). None of those we've tried so far is very good:

On 11/9/18 2:52 AM, Francesco Nigro via Concurrency-interest wrote:
> In past implementations of the FJ pool there was a SPIN static
> variable used for this purpose that was quite good (with
> parallelism<available cores) at keeping the executor threads from
> parking immediately when there is nothing to do.

... with the disadvantage of unacceptably high CPU wastage on machines
running near CPU saturation, as we discovered only after releasing, and
so it was later removed. There are now only a few j.u.c classes that include
paths allowing a non-trivial amount of spinning, and some of these might
be subject to reconsideration in light of Loom, in which spinning is
rarely a good idea.
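
For readers unfamiliar with the trade-off, the general spin-then-park idea can be sketched roughly as below. This is not the removed ForkJoinPool code: the class SpinThenPark, the method awaitFlag, and its parameters are hypothetical names chosen for illustration. The spin phase is where the CPU wastage comes from when the machine is already saturated.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Illustrative sketch only; names are hypothetical, not from j.u.c.
class SpinThenPark {
    // Returns true if the flag was observed set during the bounded spin
    // phase, false if the caller had to fall back to parking (or was
    // interrupted while waiting).
    static boolean awaitFlag(AtomicBoolean flag, int maxSpins, long parkNanos) {
        for (int i = 0; i < maxSpins; i++) {
            if (flag.get()) {
                return true;          // cheap wakeup, no park/unpark cost
            }
            Thread.onSpinWait();      // spin hint (Java 9+); still burns CPU
        }
        while (!flag.get()) {
            if (Thread.currentThread().isInterrupted()) {
                return false;         // give up rather than busy-loop
            }
            LockSupport.parkNanos(parkNanos);  // timed park, then re-check
        }
        return false;
    }
}
```

A larger maxSpins lowers wakeup latency on an otherwise idle machine, at the price of cycles taken from other threads when cores are scarce.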

> I have the same problem as Carl and I have found a similar "solution":
> reduce the parallelism, hoping to keep the queues busier, but it would
> incur higher contention on the offer side.

When steady state has little variance in task execution rates, reducing
parallelism *is* the right solution. But it has the disadvantage of poorer
throughput when a lot of tasks suddenly appear.
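
As a minimal sketch of what reducing parallelism looks like in user code: the ForkJoinPool(int) constructor takes the target parallelism directly. The class ReducedParallelismDemo, the helper newPool, and the cap of 16 are assumptions made for illustration, not values recommended anywhere in this thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; class and helper names are hypothetical.
class ReducedParallelismDemo {
    // Builds a pool whose parallelism is capped below the core count.
    static ForkJoinPool newPool(int cap) {
        int cores = Runtime.getRuntime().availableProcessors();
        return new ForkJoinPool(Math.max(1, Math.min(cap, cores)));
    }

    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = newPool(16);   // 16 is an arbitrary example cap
        Callable<Integer> task = () -> 21 * 2;
        System.out.println("parallelism = " + pool.getParallelism());
        System.out.println("result = " + pool.submit(task).get());
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The cap trades burst throughput for less park/unpark churn at low load, which is the disadvantage noted above.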

> Wouldn't it make sense to inject some WaitStrategy to allow the user to
> choose what to do before parking (if parking)?

I'm not a big fan of forcing users to make such decisions, since they
can/will encounter the same issues as algorithmic solutions do, as in
Carl's situation below.

> On Fri, 9 Nov 2018, 00:18 Carl Mastrangelo via
> Concurrency-interest <concurrency-interest at cs.oswego.edu
> <mailto:concurrency-interest at cs.oswego.edu>> wrote:
>     Hi,
>     I am using ForkJoinPool as a means to avoid lock contention for
>     submitting tasks to an executor.  FJP is much faster than before,
>     but has some unexpected slowdown when there is not enough work.   In
>     particular, a significant amount of time is spent parking
>     and unparking threads when there is no work to do.  When there is no
>     active work, it seems each worker scans the other work queues
>     looking for work before going to sleep.  
>     In my program I have a parallelism of 64, because when the work load
>     is high, each of the threads can be active.  However, when work load
>     is low, the workers spend too much time looking for more work.

Besides keeping too many threads alive, the main performance issue here
is reduced locality because related tasks are taken by different
threads. So ...
>     One way to fix this (I think) is to lower the number of worker
>     queues, but keep the same number of workers.   In my case, having 32
>     or 16 queues rather than having exactly 64 might help, but I have no
>     way of testing it out.   Has this ever been brought up, or is it
>     possible for me to easily patch FJP to see if it would help?

... I don't think this will be a dramatic improvement, but if you are
interested in experimenting, you could try building a jsr166.jar after
changing ForkJoinPool line 703 from:
    static final int SQMASK       = 0x007e;   // max 64 (even) slots
to:
    static final int SQMASK       = 0x001e;   // max 16 (even) slots
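
The slot counts in those comments follow from the masks themselves: an index ANDed with the mask can only land on the even values from 0 up to the mask. The sketch below just checks that arithmetic. It is not ForkJoinPool code, and the class SubmissionSlots and method evenSlots are hypothetical names.

```java
// Illustrative arithmetic check only; names are hypothetical.
class SubmissionSlots {
    // Counts the even indices i in [0, mask] that survive (i & mask) == i.
    static int evenSlots(int mask) {
        int count = 0;
        for (int i = 0; i <= mask; i += 2) {
            if ((i & mask) == i) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(evenSlots(0x007e));  // prints 64
        System.out.println(evenSlots(0x001e));  // prints 16
    }
}
```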


