[concurrency-interest] PooledLinkedBlockingQueue

Doug Lea dl@cs.oswego.edu
Wed, 3 Nov 2004 09:54:13 -0500

> the performance of this implementation doesn't seem to 
> outperform conventional node allocation/deallocation. 

Right. I continue to doubt that you can outperform the current
GC-based approach using ThreadLocals in most applications. Here are
some rough calculations explaining why:

Allocation and field-initialization of a small object such as a list
node in hotspot (and most JVMs) is typically pretty fast, on the order
of 10 machine instructions. Normally, a node is allocated from a TLAB
(thread-local allocation buffer), which is just a few instructions
(none of them atomic), plus field initialization for nonnull
fields. Occasionally the TLAB runs out of space and needs to be
replenished, but this usually doesn't add much to the average.
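Concretely, the allocation being costed here is just this (Node is an illustrative stand-in for the queue's internal node class, not the real one):

```java
// Illustrative stand-in for a queue's internal list node; not the
// actual class from LinkedBlockingQueue.
class Node {
    Object item;
    Node next;

    Node(Object item) {
        this.item = item;  // the only nonnull field to initialize
    }
}

class AllocDemo {
    // On HotSpot, `new Node(x)` is typically a TLAB bump-pointer
    // allocation plus a couple of field stores: roughly 10 machine
    // instructions, none of them atomic.
    static Node makeNode(Object x) {
        return new Node(x);
    }
}
```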

Accessing a ThreadLocal is also reasonably fast, but typically
slightly slower than an allocation; I'd guess somewhere around 20
instructions. (This varies across machines - x86 may be relatively a
bit slower than others like sparc, because on such register-poor
machines the thread-local base pointer isn't kept in a register.) There
are ways to speed up access of different forms of thread-locals, but
not by much. For example, you can create a Thread subclass, MyThread,
that has a listPool field. If all your Threads are of this subclass,
you can access the pool via
((MyThread)Thread.currentThread()).listPool.  This avoids a few
instructions of ThreadLocal lookup, but forces you to use MyThread
objects vs plain Threads everywhere and entails a checked cast, so
usually isn't worth it.
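The two access paths can be sketched as follows (MyThread, NodePool, and listPool are illustrative names from the discussion, not real library classes):

```java
import java.util.ArrayDeque;

// Illustrative per-thread pool of recycled nodes.
class NodePool {
    final ArrayDeque<Object> freeNodes = new ArrayDeque<>();
}

// Thread subclass carrying the pool directly in a field.
class MyThread extends Thread {
    final NodePool listPool = new NodePool();

    MyThread(Runnable r) { super(r); }
}

class PoolAccess {
    // ThreadLocal version: a per-thread map lookup, roughly 20
    // instructions on typical JVMs.
    static final ThreadLocal<NodePool> POOL =
        ThreadLocal.withInitial(NodePool::new);

    static NodePool viaThreadLocal() {
        return POOL.get();
    }

    // Subclass version: saves a few instructions of ThreadLocal lookup,
    // but entails a checked cast and fails unless every calling thread
    // really is a MyThread.
    static NodePool viaSubclass() {
        return ((MyThread) Thread.currentThread()).listPool;
    }
}
```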

Comparing these, for the GC-based version, the total cost is
  Allocate/initialize + some portion of GC costs
And for recycling, it is
  2 ThreadLocal accesses + reinitialization + pool bookkeeping costs
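Both cost lines can be made concrete in a hedged sketch (QNode and the pool names are illustrative, not the real internals):

```java
import java.util.ArrayDeque;

// Illustrative node type (not the real internal class).
class QNode {
    Object item;
    QNode next;
}

class RecyclingDemo {
    static final ThreadLocal<ArrayDeque<QNode>> POOL =
        ThreadLocal.withInitial(ArrayDeque::new);

    // GC-based path: allocate + initialize, then simply drop the node
    // and let the collector reclaim it.
    static QNode allocate(Object item) {
        QNode n = new QNode();  // TLAB allocation
        n.item = item;          // nonnull field initialization
        return n;
    }

    // Recycling path: one ThreadLocal access plus pool bookkeeping on
    // acquire, the same again on release, plus reinitialization.
    static QNode acquire(Object item) {
        QNode n = POOL.get().pollFirst();  // 1st ThreadLocal access
        if (n == null) n = new QNode();
        n.item = item;                     // reinitialization
        n.next = null;
        return n;
    }

    static void release(QNode n) {
        n.item = null;                     // avoid retaining the old item
        n.next = null;
        POOL.get().addFirst(n);            // 2nd ThreadLocal access
    }
}
```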

In most producer-consumer applications, you'd expect GC to win here.
Usually, the items being queued by producers are themselves being
allocated, so both an item and a list node are allocated per producer
operation; the list nodes therefore only increase the frequency of GC
by some (usually small) constant factor. If both sides of the
producer-consumer pair are relatively fast and don't retain these
items, then most of the garbage is collected in young-generation
scans, where GC time is roughly proportional to the amount of
retained (live) objects, not the amount of garbage. In these kinds of
scenarios, the number of instructions ascribable to GC of a given list
node is likely to be less than the cost of a ThreadLocal access plus
pool bookkeeping.  There are other scenarios where it could go the
other way, but I think this one is the most typical.
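The typical scenario described above, where every producer operation allocates both an item and a list node that both die young, looks like this in outline (a minimal sketch using java.util.concurrent):

```java
import java.util.concurrent.LinkedBlockingQueue;

class ProducerConsumerDemo {
    // One producer, one consumer. Each offer allocates both the boxed
    // item and an internal list node; if the consumer keeps up, both
    // die young and are reclaimed in cheap young-generation scans.
    static long run(int n) throws InterruptedException {
        LinkedBlockingQueue<Integer> q = new LinkedBlockingQueue<>();
        long[] sum = new long[1];

        Thread producer = new Thread(() -> {
            for (int i = 0; i < n; i++) {
                q.offer(i);             // allocates item + node
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) {
                    sum[0] += q.take(); // item and node become garbage
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return sum[0];
    }
}
```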