[concurrency-interest] ConcurrentSkipListMap

Doug Lea dl at cs.oswego.edu
Mon Aug 14 09:00:58 EDT 2017

As of JDK9 (and in part internally with JDK8), it's possible to
improve performance of some concurrently-readable data structures
using techniques that substantially reduce the number of volatile-mode
reads. This has only a small impact on TSO machines (including X86)
but can be very noticeable on weaker processors (ARM, POWER, RISCV).
ConcurrentSkipListMap (CSLM) is the best target in j.u.c for putting
these techniques into place. Even though CSLM is rarely as fast as
ConcurrentHashMap for insertion and removal, it should be among the
fastest concurrent data structures for traversal etc (plus, sometimes
you need things to be sorted, so need CSLM anyway).  And this
generally seems to hold after reworking CSLM: on some POWER8 and ARMv8
processors tested (thanks to Paul McKenney/OSUOSL and Andrew
Haley/RedHat) performance improvements seem to range from a few
percent to a factor of six, depending on operation load mix, key
types, size, etc. Performance on X86 is also a little better.  The
main interaction is with cache misses -- large and/or heavily
contended maps encounter a lot of them, limiting speedups.  A
paste of some of the internal documentation below describes the
main ideas (which, as usual, took much longer to figure out how to
put into place than I had first guessed).
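As a quick reminder of why CSLM remains attractive for traversal-heavy and ordered workloads, here is a minimal (hypothetical, illustration-only) example of its lock-free, sorted iteration; the class name `CslmTraversal` is made up for this sketch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class CslmTraversal {
    // Returns keys and values concatenated in ascending key order.
    static String ascending() {
        ConcurrentSkipListMap<Integer, String> map = new ConcurrentSkipListMap<>();
        map.put(3, "c");
        map.put(1, "a");
        map.put(2, "b");
        StringBuilder sb = new StringBuilder();
        // Traversal is lock-free, weakly consistent with concurrent
        // updates, and always visits entries in sorted key order.
        for (Map.Entry<Integer, String> e : map.entrySet())
            sb.append(e.getKey()).append(e.getValue());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(ascending()); // prints 1a2b3c
    }
}
```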

While revisiting CSLM, a couple of other improvements were made.  When
it was first introduced, we didn't have the contention-avoiding
LongAdder to track size, so lived with very slow O(n) estimation.
Even though the javadocs tell people not to call size(), some still
do/did. There's now no reason to avoid it.  Also, the indexing
originally maxed out at levels corresponding to a billion elements
(slowing down after that), which is now up to essentially unbounded
(Long.MAX_VALUE) elements. Plus other minor things, with possibly
a few more to come. I haven't completely given up on ways to reduce
floating-garbage issues common to all concurrent linked structures.
And there are several opportunities to use and discover more related
techniques for other j.u.c classes.
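The LongAdder-based size tracking mentioned above can be sketched roughly as follows. This is not CSLM's actual code; the wrapper class and field names (`CountedMap`, `adder`) are hypothetical, and real CSLM folds the counter update into its insertion/removal paths:

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration: maintain an O(1) size estimate with a
// contention-avoiding LongAdder instead of an O(n) node traversal.
class CountedMap<K, V> {
    private final ConcurrentSkipListMap<K, V> map = new ConcurrentSkipListMap<>();
    private final LongAdder adder = new LongAdder();

    V put(K key, V value) {
        V prev = map.put(key, value);
        if (prev == null)
            adder.increment(); // count only newly added keys
        return prev;
    }

    V remove(K key) {
        V prev = map.remove(key);
        if (prev != null)
            adder.decrement();
        return prev;
    }

    // O(1); an estimate, not a snapshot, under concurrent updates.
    long size() {
        return adder.sum();
    }
}
```

LongAdder is preferred over a single AtomicLong here because updates from many threads land on separate cells, avoiding CAS contention on one hot counter.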

It would be great if people using CSLM could try this out and report
experiences.  In the process of developing/testing, I made a JDK8
backport version that I checked in so that others could try. See
http://gee.cs.oswego.edu/dl/concurrency-interest/index.html for
instructions on either getting jsr166.jar and running jdk9 "java
--patch-module java.base=jsr166.jar", or getting jsr166-4jdk8.jar
and running jdk8 with -Xbootclasspath/p:jsr166-4jdk8.jar. (Also, anyone
interested in trying on Android is invited to build from jdk8 source.)

Also, if you have JMH or other performance tests that target
realistic use cases, please let me know. As usual, we can only
roughly guess about typical operation mixes.

... internal documentation excerpts ...

     * This class provides concurrent-reader-style memory consistency,
     * ensuring that read-only methods report status and/or values no
     * staler than those holding at method entry. This is done by
     * performing all publication and structural updates using
     * (volatile) CAS, placing an acquireFence in a few access
     * methods, and ensuring that linked objects are transitively
     * acquired via dependent reads (normally once) unless performing
     * a volatile-mode CAS operation (that also acts as an acquire and
     * release).  This form of fence-hoisting is similar to RCU
     * and related techniques (see McKenney's online book).  It
     * minimizes overhead that may otherwise occur when using so
     * many volatile-mode reads. Using explicit acquireFences is
     * logistically easier than targeting particular fields to be read
     * in acquire mode: fences are just hoisted up as far as possible,
     * to the entry points or loop headers of a few methods. A
     * potential disadvantage is that these few remaining fences are
     * not easily optimized away by compilers under exclusively
     * single-thread use.  It requires some care to avoid volatile-mode
     * reads of other fields. (Note that the memory semantics of a
     * reference dependently read in plain mode exactly once are
     * equivalent to those for atomic opaque mode.)  Iterators and
     * other traversals encounter each node and value exactly once.
     * Other operations locate an element (or position to insert an
     * element) via a sequence of dereferences. This search is broken
     * into two parts. Method findPredecessor (and its specialized
     * embeddings) searches index nodes only, returning a base-level
     * predecessor of the key. Callers carry out the base-level
     * search, restarting if encountering a marker preventing link
     * modification.  In some cases, it is possible to encounter a
     * node multiple times while descending levels. For mutative
     * operations, the reported value is validated using CAS (else
     * retrying), preserving linearizability with respect to each
     * other. Others may return any (non-null) value holding in the
     * course of the method call.
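The fence-hoisting idea in the excerpt above can be illustrated with a much-simplified sketch. This is not CSLM's actual node or traversal code: the `Sketch`/`Node` classes, `putFirst`, and `get` are invented for illustration. The point is the shape: writers publish nodes via CAS (which has release semantics), while a reader issues one `VarHandle.acquireFence()` at method entry and then follows the chain with plain-mode dependent reads, instead of performing a volatile read per field:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical sketch of fence-hoisting, not CSLM's real code.
class Sketch {
    static final class Node {
        final int key;
        Object val;      // written only before publication or via CAS
        Node next;
        Node(int k, Object v, Node n) { key = k; val = v; next = n; }
    }

    Node head;           // plain field; updated only via HEAD CAS

    private static final VarHandle HEAD;
    static {
        try {
            HEAD = MethodHandles.lookup()
                .findVarHandle(Sketch.class, "head", Node.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Publish a new node with a volatile-mode CAS (acts as release).
    void putFirst(int key, Object val) {
        Node h;
        do {
            h = head;
        } while (!HEAD.compareAndSet(this, h, new Node(key, val, h)));
    }

    // One acquireFence hoisted to method entry, then plain dependent
    // reads along the chain -- no per-field volatile reads.
    Object get(int key) {
        VarHandle.acquireFence();
        for (Node n = head; n != null; n = n.next)
            if (n.key == key)
                return n.val;
        return null;
    }
}
```

The excerpt's parenthetical about plain-mode dependent reads matters here: each `n.next` read is a single dependent dereference of a CAS-published reference, which is what makes the hoisted fence sufficient on weakly ordered processors.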

More information about the Concurrency-interest mailing list