[concurrency-interest] juc and GC (plus FJ updates)

Gil Tene gil at azul.com
Tue Jan 16 15:34:02 EST 2018

I agree that some level of conditional card marking (like -XX:+UseCondCardMark in ParallelGC) can be critical to avoid sharing contention on card mark. With HotSpot's current byte-per-card and card-per-512-bytes-of-heap ratios, there is real sharing of individual card marks [representing 512 bytes of heap each], false sharing on a card mark line [representing 64 cards on x86] and false sharing for heap reference stores [all stores to the same 32KB-aligned region of heap (in x86) represent a store to the same cardmark cache line]. Arguably, in a world where the likelihood of multiple threads mutating references that are stored in the same 32KB of heap is pretty high, conditional card marking should arguably be a default.

Conditional card marking (in the sense that the mark does not dirty an already dirty card) is [currently] more expensive primarily because it involves reading the card value for evaluation. When this extra cost is paid on every reference mutation, it can offset much of the scaling gains, and presents an annoying tradeoff choice between single-threaded performance and multi-threaded scalability.

But the extra cost of conditional card marking can be combatted fairly effectively. For reference, in Zing we've worked hard to eliminate these tradeoffs (our designs started off on 384-way Vega SMPs, so false sharing on card marks was an obvious non-starter). Zing's card marking scheme uses conditional marking at all times, which avoids line contention at one level (writers to the same card or card cache line). We also use precise card marking [a card bit per heap word rather than a card byte per 512 heap bytes], which completely eliminates all card-related false sharing. In addition, we reduce the cost of the read-and-maybe-write: our conditional card modification check (which requires a card read) is much cheaper in practice because it is fronted by a conditional newgen-ref-being-stored-into-oldgen-location check which is purely based on addresses, does not require any reads, and eliminates 95%-99%+ of card modification attempts.

Historically, the main design reason against bit-based card marking had been that unconditional card marks are faster when using a byte per card (it's a fire-and-forget byte store operation, with no need for a read-modify-write thing). However, once you buy into the conditional card marking notion (with or without generation-based filter ahead of it) , you've introduced a read-check-and-maybe-write thing anyway, making a read-check-and-maybe-modify-and-write thing not that big a deal. The steps from there to precise (1 card per heap word) card marking then become pretty obvious.

I'd encourage other card-marking collectors (which is pretty much all current and most future generational schemes) to look at similar notions for the card table as they can help reduce multi-threaded scaling sensitivity of most workloads. Specifically, reducing the cost of conditional card marking (by e.g. fronting it with a cross-generation-contitional filter) can help eliminate the notion that a tradeoff is involved when turning on -XX:+UseCondCardMark. With the current 32KB-wide reference-mutation-contention window in HotSpot, false-card-line-sharing contention can easily hit unsuspecting and even unrelated data structure, and the obvious healthy default should avoid that.

— Gil.

> On Jan 16, 2018, at 9:40 AM, Doug Lea via Concurrency-interest <Concurrency-interest at cs.oswego.edu> wrote:
> As nicely pointed out recently by Nitsan Wakart (with contributions
> from Aleksey Shipilev)
> (http://psy-lob-saw.blogspot.com/2018/01/what-difference-jvm-makes.html)
> performance of concurrent data structures can vary a lot across
> garbage collectors. Of which there are at least nine: ParallelGC,
> ConcMarkSweepGC, G1, Shenandoah, ZGC, Zing, IBM-J9 GC, Android, each
> with different low-level implementations across processors (X86, ARM,
> etc), and each with many tuning options.  Different properties of
> collectors interact with concurrency: Write barriers may add ordering
> and low-level fencing, read-barriers may widen CAS windows, GC pausing
> interacts with blocking for synchronization, safepoints interact with
> progress, and placing and moving objects interact with cache pollution
> and thread-to-memory affinity.  We'd like j.u.c components to work
> reasonably under different collectors (and different JVMs). Which is
> challenging if not impossible in general, but it is worth the effort
> to do the best we can for some of the most commonly used classes like
> ConcurrentHashMap and ForkJoin.  (Some others contain too many
> user-selected options to systematically deal with; for example
> ThreadPoolExecutor allows any kind of Queue to be used.)
> One precaution is to cope with the fact that nearly all collectors use
> cardmarks, amplifying false-sharing effects by creating memory
> contention for reference writes that happen to be nearby in memory
> even if not on same cacheline.  The performance impact is just as bad
> as when multiple threads are repeatedly writing the same location (8X
> slowdowns are not rare).  GCs also tend to move elements of linked
> data structures closer together after a garbage collection, which
> sometimes causes programs to get slower over time.  For array-based
> collections, one workaround is to over-allocate such that different
> arrays that are frequently written by different threads are far enough
> apart.  The target value is such that bytes marking addresses from
> different arrays are on different cachelines. Emprically, 8Kbytes of
> separation works across at least most collectors. Under ParallelGC,
> you can band-aid this with -XX:+UseCondCardMark, which reduces most
> cases of write-write contention to read-write, leading to less cache
> traffic.  Even though FJ guards workstealing arrays by overallocating,
> (and normally-populated ConcurrentHashMaps intrinsically do so for
> their arrays) UseCondCardMark usually improves performance by reducing
> impact of accessing bookeeping fields of other objects like executed
> FJ tasks. (It doesn't always improve because of extra per-write
> overhead that might not have been needed, but even if so, impact is
> small.)  Collectors that don't have this option are more prone to
> occasional mysterious big slowdowns.
> Contemplation of other GC mechanics sometimes reduces other negative
> impact. For example, if a reference write might sometimes entail a
> memory fence anyway, then you might prefer basing explicitly atomic
> operations on them vs surrounding them with locks. ConcurrentHashMap,
> ConcurrentLinkedQueue, CompletableFuture, SubmissionPublisher, and
> ForkJoin, mostly do this. I just locally (jsr166 CVS) committed some
> FJ updates that do this in more cases and add a few other tweaky
> adjustments.  The impact on ParallelGC is small, but others now seem
> more well-behaved.
> Across various tests that mainly exercise concurrency (without doing
> very much else), in hostpot, -XX:+UseParallelGC -XX:+UseCondCardMark
> nearly always gives best throughput, even better than no-GC (Epsilon),
> but with occasional long GC pauses. In most cases, G1 is around 10%
> slower; sometimes moreso. I don't have enough empirical experience with
> newer in-progress collectors or other JVMs to say anything about them
> yet in general.  As always, the main determiners of performance in
> ForkJoin (as well as ThreadPoolExecutor and other async frameworks)
> are task issue rate (granularity) and isolation (including
> not blocking within tasks).  If async tasks contain around 10,000
> thread-local instructions, GC choices depend less on concurrency per
> se, and more on throughput/latency choices etc.
> Please try out tentative updates in
> http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166.jar (running with java
> --patch-module java.base="$DIR/jsr166.jar"). We also are always
> looking to add more performance tests that are informative about basic
> functionality or interactions with other frameworks.  Contributions
> welcome.  If you are interested in further exploring impact on
> hotspot, try out Aleksey's builds with some of the new in-progress
> GCs, at https://builds.shipilev.net/
> Aside: All of the above emphasize that we are in the era of the "killer
> microsecond", where most performance challenges are not handled well
> by nanosecod-level (usually CPU-based, like instruction parallelism)
> or millisecond-level (usually OS-based, like IO) techniques.  See
> https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-killer-microseconds/fulltext
> -Doug
> _______________________________________________
> Concurrency-interest mailing list
> Concurrency-interest at cs.oswego.edu
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 874 bytes
Desc: Message signed with OpenPGP
URL: <http://cs.oswego.edu/pipermail/concurrency-interest/attachments/20180116/fa399d1e/attachment.sig>

More information about the Concurrency-interest mailing list