[concurrency-interest] juc and GC (plus FJ updates)

Doug Lea dl at cs.oswego.edu
Tue Jan 16 12:40:36 EST 2018


As nicely pointed out recently by Nitsan Wakart (with contributions
from Aleksey Shipilev)
(http://psy-lob-saw.blogspot.com/2018/01/what-difference-jvm-makes.html)
performance of concurrent data structures can vary a lot across
garbage collectors. Of which there are at least nine: ParallelGC,
ConcMarkSweepGC, G1, Shenandoah, ZGC, Zing, IBM-J9 GC, Android, each
with different low-level implementations across processors (X86, ARM,
etc), and each with many tuning options.  Different properties of
collectors interact with concurrency: Write barriers may add ordering
and low-level fencing, read-barriers may widen CAS windows, GC pausing
interacts with blocking for synchronization, safepoints interact with
progress, and placing and moving objects interact with cache pollution
and thread-to-memory affinity.  We'd like j.u.c components to work
reasonably under different collectors (and different JVMs). Which is
challenging if not impossible in general, but it is worth the effort
to do the best we can for some of the most commonly used classes like
ConcurrentHashMap and ForkJoin.  (Some others contain too many
user-selected options to systematically deal with; for example
ThreadPoolExecutor allows any kind of Queue to be used.)

One precaution is to cope with the fact that nearly all collectors use
cardmarks, amplifying false-sharing effects by creating memory
contention for reference writes that happen to be nearby in memory
even if not on same cacheline.  The performance impact is just as bad
as when multiple threads are repeatedly writing the same location (8X
slowdowns are not rare).  GCs also tend to move elements of linked
data structures closer together after a garbage collection, which
sometimes causes programs to get slower over time.  For array-based
collections, one workaround is to over-allocate such that different
arrays that are frequently written by different threads are far enough
apart.  The target value is such that bytes marking addresses from
different arrays are on different cachelines. Emprically, 8Kbytes of
separation works across at least most collectors. Under ParallelGC,
you can band-aid this with -XX:+UseCondCardMark, which reduces most
cases of write-write contention to read-write, leading to less cache
traffic.  Even though FJ guards workstealing arrays by overallocating,
(and normally-populated ConcurrentHashMaps intrinsically do so for
their arrays) UseCondCardMark usually improves performance by reducing
impact of accessing bookeeping fields of other objects like executed
FJ tasks. (It doesn't always improve because of extra per-write
overhead that might not have been needed, but even if so, impact is
small.)  Collectors that don't have this option are more prone to
occasional mysterious big slowdowns.

Contemplation of other GC mechanics sometimes reduces other negative
impact. For example, if a reference write might sometimes entail a
memory fence anyway, then you might prefer basing explicitly atomic
operations on them vs surrounding them with locks. ConcurrentHashMap,
ConcurrentLinkedQueue, CompletableFuture, SubmissionPublisher, and
ForkJoin, mostly do this. I just locally (jsr166 CVS) committed some
FJ updates that do this in more cases and add a few other tweaky
adjustments.  The impact on ParallelGC is small, but others now seem
more well-behaved.

Across various tests that mainly exercise concurrency (without doing
very much else), in hostpot, -XX:+UseParallelGC -XX:+UseCondCardMark
nearly always gives best throughput, even better than no-GC (Epsilon),
but with occasional long GC pauses. In most cases, G1 is around 10%
slower; sometimes moreso. I don't have enough empirical experience with
newer in-progress collectors or other JVMs to say anything about them
yet in general.  As always, the main determiners of performance in
ForkJoin (as well as ThreadPoolExecutor and other async frameworks)
are task issue rate (granularity) and isolation (including
not blocking within tasks).  If async tasks contain around 10,000
thread-local instructions, GC choices depend less on concurrency per
se, and more on throughput/latency choices etc.

Please try out tentative updates in
http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166.jar (running with java
--patch-module java.base="$DIR/jsr166.jar"). We also are always
looking to add more performance tests that are informative about basic
functionality or interactions with other frameworks.  Contributions
welcome.  If you are interested in further exploring impact on
hotspot, try out Aleksey's builds with some of the new in-progress
GCs, at https://builds.shipilev.net/

Aside: All of the above emphasize that we are in the era of the "killer
microsecond", where most performance challenges are not handled well
by nanosecod-level (usually CPU-based, like instruction parallelism)
or millisecond-level (usually OS-based, like IO) techniques.  See
https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-killer-microseconds/fulltext

-Doug



More information about the Concurrency-interest mailing list