[concurrency-interest] ThreadLocal vs ProcessorLocal

Chris Hegarty chris.hegarty at oracle.com
Fri Oct 12 08:30:33 EDT 2012

I am also not aware of any work being done on this, with respect to JDK8.


On 12/10/2012 12:43, David Holmes wrote:
> Hi Antoine,
> I've not seen any sign that this has moved toward an actual API proposal
> so far. As the feature complete date for JDK 8 is January 31, I'd say
> the chances of any such API appearing in JDK 8 would be slim. But
> perhaps someone is fleshing this out quietly and will come forward with
> a fully worked out API design and implementation.
> David
> -------
>     -----Original Message-----
>     *From:* concurrency-interest-bounces at cs.oswego.edu
>     [mailto:concurrency-interest-bounces at cs.oswego.edu]*On Behalf Of
>     *Antoine Chambille
>     *Sent:* Thursday, 11 October 2012 10:44 PM
>     *To:* Concurrency-interest at cs.oswego.edu
>     *Subject:* Re: [concurrency-interest] ThreadLocal vs ProcessorLocal
>     After deploying the "ActivePivot" software (a Java in-memory
>     analytics solution) on a new server, we discovered how inefficient
>     a large-heap Java application can be on a NUMA system.
>     The server has 4 sockets (8C Xeon CPU) with 512GB of memory. There
>     are 4 NUMA nodes each with 8 cores and 128GB. ActivePivot loads
>     large amounts of data in memory (up to hundreds of GB) and then
>     performs complex aggregation queries over the entire data, using a
>     global fork/join pool. Although the data structures are partitioned,
>     this partitioning is random with respect to NUMA nodes and
>     processors. There is a sharp performance drop compared to smaller
>     SMP servers.
>     We are considering launching several JVMs, each bound to a NUMA
>     node and communicating with the others. But that's an entirely new
>     layer of code, and it won't be quite as performant as processor id
>     partitioning in one JVM.
>     I think that's another use case for the processor id API. I hope
>     this RFE is making progress? Is there a chance to see it as part of
>     JDK8?
>     Regards,
>     -Antoine CHAMBILLE
>     Quartet FS
>     On 19 August 2011 01:22, David Holmes <davidcholmes at aapt.net.au
>     <mailto:davidcholmes at aapt.net.au>> wrote:
>         Aside: Just for the public record. I think an API to support use
>         of Processor sets would be a useful addition to the platform to
>         allow better partitioning of CPU resources. The Real-time
>         Specification for Java 1.1 under JSR-282 is looking at such an
>         API primarily to work in with OS-level processor sets. For our
>         purposes, more locally within the JVM, the processors made
>         available to the JVM could be partitioned both internally (i.e.
>         binding GC to a specific set of cores) and at the
>         application/library level (such as allocating ForkJoinPools
>         disjoint sets of cores). The ability to query the current
>         processor ID is inherently needed, and so I would make it part of
>         the Processors class. This would provide the foundation API for
>         ProcessorLocal.
>         David Holmes
>             -----Original Message-----
>             *From:* concurrency-interest-bounces at cs.oswego.edu
>             <mailto:concurrency-interest-bounces at cs.oswego.edu>
>             [mailto:concurrency-interest-bounces at cs.oswego.edu
>             <mailto:concurrency-interest-bounces at cs.oswego.edu>]*On
>             Behalf Of *Nathan Reynolds
>             *Sent:* Thursday, 11 August 2011 2:38 AM
>             *To:* concurrency-interest at cs.oswego.edu
>             <mailto:concurrency-interest at cs.oswego.edu>
>             *Subject:* [concurrency-interest] ThreadLocal vs ProcessorLocal
>             I would like to recommend that we stripe data structures
>             using a ProcessorLocal instead of ThreadLocal.
>             ProcessorLocal (attached) is exactly like ThreadLocal except
>             that the stored objects are keyed off the processor instead of
>             the thread. To implement ProcessorLocal, we need an API that
>             returns the id of the processor the current thread is
>             running on. The HotSpot team has filed an RFE and is
>             planning on providing such an API. (Many of you are already
>             aware of this.)
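
To make the idea concrete, here is a minimal sketch of what a ProcessorLocal-style class might look like. It is not the attached class, which is not reproduced in this thread. The name ProcessorLocalSketch and the currentProcessorId supplier are assumptions for illustration: the JDK has no API that returns the current processor id, so the caller has to plug one in.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch: one lazily created value per processor slot. */
final class ProcessorLocalSketch<T> {
    private final AtomicReferenceArray<T> slots;
    // Assumption: supplied by the caller, because no JDK API exposes the processor id.
    private final IntSupplier currentProcessorId;
    private final Supplier<T> initialValue;

    ProcessorLocalSketch(int processorCount,
                         IntSupplier currentProcessorId,
                         Supplier<T> initialValue) {
        this.slots = new AtomicReferenceArray<>(processorCount);
        this.currentProcessorId = currentProcessorId;
        this.initialValue = initialValue;
    }

    /** Returns the value of the slot for the processor the caller reports running on. */
    T get() {
        // floorMod keeps the index in range even if the id exceeds the slot count.
        int slot = Math.floorMod(currentProcessorId.getAsInt(), slots.length());
        T value = slots.get(slot);
        if (value == null) {
            T fresh = initialValue.get();
            // Another thread may initialize the same slot first; keep whichever won.
            value = slots.compareAndSet(slot, null, fresh) ? fresh : slots.get(slot);
        }
        return value;
    }
}
```

Unlike ThreadLocal, a thread can be migrated to another processor right after get() returns, so two threads may briefly share one slot. The stored objects therefore need to be thread-safe themselves (a lock or an atomic counter, for example); the processor id only makes sharing rare.
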
>             I would like to share a story and some results to further
>             the discussion on processor striping (i.e. ProcessorLocal).
>             A long time ago, we found that an Oracle C++ program
>             bottlenecked on a reader/writer lock. Threads couldn't
>             read-acquire the lock fast enough. The problem was due to
>             the contention on the cache line while executing the CAS
>             instruction. So, I striped the lock. The code non-atomically
>             incremented an int and masked it to select one of the
>             reader/writer locks. Multiple threads could end up selecting
>             the same reader/writer lock because the int was incremented
>             in an unprotected manner. If multiple threads selected the
>             same reader/writer lock, the lock would handle the
>             concurrency and the only negative was lock performance. This
>             optimization worked great until Intel released Nehalem-EX.
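
The striping scheme described above can be sketched in Java roughly as follows. This is an illustration, not the original C++ code: the class name, the power-of-two stripe count, and the rule that a writer takes every stripe are all assumptions, since the message only describes how readers pick a stripe.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch: readers spread over several locks via a racy counter. */
final class StripedReadWriteLockSketch {
    private final ReentrantReadWriteLock[] stripes;
    private final int mask;
    // Deliberately unsynchronized: a lost update only means two readers
    // pick the same stripe, which the stripe's own lock handles correctly.
    private int next;

    StripedReadWriteLockSketch(int stripeCount) {
        if (stripeCount <= 0 || Integer.bitCount(stripeCount) != 1) {
            throw new IllegalArgumentException("stripeCount must be a power of two");
        }
        stripes = new ReentrantReadWriteLock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
        mask = stripeCount - 1;
    }

    /** Read-acquires one stripe and returns it so the caller unlocks the same one. */
    Lock acquireRead() {
        Lock lock = stripes[next++ & mask].readLock();
        lock.lock();
        return lock;
    }

    /** Assumption: a writer excludes all readers by write-locking every stripe in order. */
    void acquireWriteAll() {
        for (ReentrantReadWriteLock stripe : stripes) {
            stripe.writeLock().lock();
        }
    }

    void releaseWriteAll() {
        for (int i = stripes.length - 1; i >= 0; i--) {
            stripes[i].writeLock().unlock();
        }
    }
}
```

A reader would call acquireRead(), do its work in a try block, and call unlock() on the returned Lock in finally. Note that every reader still writes to the single field next, so that field sits on one cache line shared by all cores.
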
>             A while ago, Intel found on Nehalem-EX that the same Oracle
>             C++ program didn't scale to the 4th Nehalem socket. All of
>             the processors/cores were 100% busy, but the throughput
>             didn't improve by adding the 4th Nehalem socket. The problem
>             was that the cores were fighting to get the cache line
>             holding the unprotected int!
>             I tried 4 approaches to select the reader/writer lock.
>             1) Processor id - This performed the best. The cache lines
>             holding the reader/writer locks are almost never invalidated
>             due to another core accessing the reader/writer lock. In
>             other words, almost 0 CAS contention.
>             2) ThreadLocal - ThreadLocal had a 1:1 mapping of threads to
>             locks. It required too many locks and the locks had to
>             migrate with the threads.
>             3) Hash the stack pointer - Hashing the stack pointer caused
>             some collisions but essentially randomly selected locks and
>             this hurt cache performance.
>             4) Shift and mask the cycle counter (i.e. rdtsc) -
>             Contention was rare but again it randomly selected the locks.
>             Compared to non-atomically incrementing an int, processor id
>             resulted in 15% more throughput. The other 3 only showed 5%
>             more throughput.
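
For comparison, the selection strategies above can be expressed as interchangeable stripe selectors. The sketch below uses Java stand-ins, all of them assumptions: Java exposes neither the stack pointer nor rdtsc, so a hash of the current Thread object stands in for the thread-keyed approaches and System.nanoTime() stands in for the cycle counter. The processor id selector takes a hypothetical supplier because the JDK offers no such call. In every case mask is assumed to be a power of two minus one.

```java
import java.util.function.IntSupplier;

/** Hypothetical stripe selectors; each returns an index in [0, mask]. */
final class StripeSelectors {
    private StripeSelectors() {}

    /** Processor id: relies on a caller-supplied id source (no JDK API exists). */
    static IntSupplier processorId(IntSupplier hypotheticalCpuId, int mask) {
        return () -> hypotheticalCpuId.getAsInt() & mask;
    }

    /** Thread-keyed: stable per thread, so the stripe follows the thread across cores. */
    static IntSupplier threadHash(int mask) {
        return () -> {
            // Multiply by a large odd constant to spread the hash bits, then use the high half.
            int h = System.identityHashCode(Thread.currentThread()) * 0x9E3779B9;
            return (h >>> 16) & mask;
        };
    }

    /** Clock-based: shift away the low bits of the timer and mask the rest. */
    static IntSupplier clock(int shift, int mask) {
        return () -> (int) (System.nanoTime() >>> shift) & mask;
    }
}
```

Only the first selector keeps a stripe's cache line on the core that last used it. The other two spread the load but pick stripes without regard to where the thread is running.
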
>             Nathan Reynolds
>             <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds>
>             | Consulting Member of Technical Staff | 602.333.9091
>             <tel:602.333.9091>
>             Oracle PSR Engineering <http://psr.us.oracle.com/> | Server
>             Technology
>         _______________________________________________
>         Concurrency-interest mailing list
>         Concurrency-interest at cs.oswego.edu
>         <mailto:Concurrency-interest at cs.oswego.edu>
>         http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>     --
>     Antoine CHAMBILLE
>     R&D Director
>     Quartet FS
