[concurrency-interest] ThreadLocal vs ProcessorLocal

Aleksey Shipilev aleksey.shipilev at oracle.com
Fri Oct 12 08:44:11 EDT 2012

I think we had better ask Nathan if he feels enthusiastic enough to drive
this into JDK8. But as David notes, this is very close to
feature-completeness for JDK8.


On 10/12/2012 04:30 PM, Chris Hegarty wrote:
> I am also not aware of any work being done on this, with respect to JDK8.
> -Chris.
> On 12/10/2012 12:43, David Holmes wrote:
>> Hi Antoine,
>> I've not seen any sign that this has moved toward an actual API proposal
>> so far. As the feature complete date for JDK 8 is January 31, I'd say
>> the chances of any such API appearing in JDK 8 would be slim. But
>> perhaps someone is fleshing this out quietly and will come forward with
>> a fully worked out API design and implementation.
>> David
>> -------
>>     -----Original Message-----
>>     *From:* concurrency-interest-bounces at cs.oswego.edu
>>     [mailto:concurrency-interest-bounces at cs.oswego.edu]*On Behalf Of
>>     *Antoine Chambille
>>     *Sent:* Thursday, 11 October 2012 10:44 PM
>>     *To:* Concurrency-interest at cs.oswego.edu
>>     *Subject:* Re: [concurrency-interest] ThreadLocal vs ProcessorLocal
>>     After deploying the "ActivePivot" software (a Java in-memory analytics
>>     solution) on a new server, we discovered how inefficient a large-heap
>>     Java application can be on a NUMA system.
>>     The server has 4 sockets (8C Xeon CPU) with 512GB of memory. There
>>     are 4 NUMA nodes each with 8 cores and 128GB. ActivePivot loads
>>     large amounts of data in memory (up to hundreds of GB) and then
>>     performs complex aggregation queries over the entire data, using a
>>     global fork/join pool. Although the data structures are partitioned,
>>     this partitioning is random with respect to NUMA nodes and
>>     processors. There is a sharp performance drop compared to smaller
>>     SMP servers.
>>     We are considering launching several JVMs, each bound to NUMA nodes,
>>     and communicating with each other. But that's an entirely new layer of
>>     code, and this won't be quite as performant as processor id
>>     partitioning in one JVM.
>>     I think that's another use case for the processor id API. I hope
>>     this RFE is making progress? Is there a chance to see it as part of
>>     JDK8?
>>     Regards,
>>     -Antoine CHAMBILLE
>>     Quartet FS
>>     On 19 August 2011 01:22, David Holmes <davidcholmes at aapt.net.au
>>     <mailto:davidcholmes at aapt.net.au>> wrote:
>>         Aside: Just for the public record. I think an API to support use
>>         of Processor sets would be a useful addition to the platform to
>>         allow better partitioning of CPU resources. The Real-time
>>         Specification for Java 1.1 under JSR-282 is looking at such an
>>         API primarily to work in with OS level processor sets. For our
>>         purposes, more locally within the JVM the processors made
>>         available to the JVM could be partitioned both internally (ie
>>         binding GC to a specific set of cores) and at the
>>         application/library level (such as allocating ForkJoinPools
>>         disjoint sets of cores). The ability to query the current
>>         processor ID is inherently needed and so I would make it part of
>>         the Processors class. This would provide the foundation API for
>>         ProcessorLocal.
>>         David Holmes
>>             -----Original Message-----
>>             *From:* concurrency-interest-bounces at cs.oswego.edu
>>             <mailto:concurrency-interest-bounces at cs.oswego.edu>
>>             [mailto:concurrency-interest-bounces at cs.oswego.edu
>>             <mailto:concurrency-interest-bounces at cs.oswego.edu>]*On
>>             Behalf Of *Nathan Reynolds
>>             *Sent:* Thursday, 11 August 2011 2:38 AM
>>             *To:* concurrency-interest at cs.oswego.edu
>>             <mailto:concurrency-interest at cs.oswego.edu>
>>             *Subject:* [concurrency-interest] ThreadLocal vs ProcessorLocal
>>             I would like to recommend that we stripe data structures
>>             using a ProcessorLocal instead of ThreadLocal.
>>             ProcessorLocal (attached) is exactly like ThreadLocal except
>>             that the stored objects are keyed off the processor instead
>>             of the thread. In order to implement ProcessorLocal, it needs an
>>             API that returns the current processor id that the thread is
>>             running on. The HotSpot team has filed an RFE and is
>>             planning on providing such an API. (Many of you are already
>>             aware of this.)
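
A minimal Java sketch of the ProcessorLocal idea described above, assuming some way to obtain the current processor id. The attachment is not reproduced here, so this is not Nathan's implementation. The class name `ProcessorLocalSketch` and the injected `currentProcessorId` supplier are hypothetical; the supplier is a stand-in for the API requested in the RFE.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

// Sketch only: one slot per processor instead of one slot per thread.
final class ProcessorLocalSketch<T> {
    private final AtomicReferenceArray<T> slots;
    private final IntSupplier currentProcessorId; // hypothetical id source
    private final Supplier<T> initialValue;

    ProcessorLocalSketch(int processors,
                         IntSupplier currentProcessorId,
                         Supplier<T> initialValue) {
        this.slots = new AtomicReferenceArray<>(processors);
        this.currentProcessorId = currentProcessorId;
        this.initialValue = initialValue;
    }

    T get() {
        // Map the reported id onto a slot; floorMod guards against ids
        // outside the range [0, processors).
        int idx = Math.floorMod(currentProcessorId.getAsInt(), slots.length());
        T value = slots.get(idx);
        if (value == null) {
            T fresh = initialValue.get();
            // Two threads may initialise the same slot at once; keep the winner.
            value = slots.compareAndSet(idx, null, fresh) ? fresh : slots.get(idx);
        }
        return value;
    }
}
```

Unlike a ThreadLocal value, a slot here can be touched by several threads at once, because a thread may be rescheduled onto another processor right after reading the id. The stored objects therefore need to be thread-safe on their own.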
>>             I would like to share a story and some results to further
>>             the discussion on processor striping (i.e. ProcessorLocal).
>>             A long time ago, we found that an Oracle C++ program
>>             bottlenecked on a reader/writer lock. Threads couldn't
>>             read-acquire the lock fast enough. The problem was due to
>>             the contention on the cache line while executing the CAS
>>             instruction. So, I striped the lock. The code non-atomically
>>             incremented an int and masked it to select one of the
>>             reader/writer locks. Multiple threads could end up selecting
>>             the same reader/writer lock because the int was incremented
>>             in an unprotected manner. If multiple threads selected the
>>             same reader/writer lock, the lock would handle the
>>             concurrency and the only negative was lock performance. This
>>             optimization worked great until Intel released Nehalem-EX.
>>             A while ago, Intel found on Nehalem-EX that the same Oracle
>>             C++ program didn't scale to the 4ᵗʰ Nehalem socket. All of
>>             the processors/cores were 100% busy, but the throughput
>>             didn't improve by adding the 4ᵗʰ Nehalem socket. The problem
>>             was the cores were fighting to get the cache line holding
>>             the unprotected int!
>>             I tried 4 approaches to select the reader/writer lock.
>>             1) Processor id - This performed the best. The cache lines
>>             holding the reader/writer locks are almost never invalidated
>>             due to another core accessing the reader/writer lock. In
>>             other words, almost 0 CAS contention.
>>             2) ThreadLocal - ThreadLocal had a 1:1 mapping of threads to
>>             locks. It required too many locks and the locks had to
>>             migrate with the threads.
>>             3) Hash the stack pointer - Hashing the stack pointer caused
>>             some collisions but essentially randomly selected locks and
>>             this hurt cache performance.
>>             4) Shift and mask the cycle counter (i.e. rdtsc) -
>>             Contention was rare but again it randomly selected the locks.
>>             Compared to non-atomically incrementing an int, processor id
>>             resulted in 15% more throughput. The other 3 only showed 5%
>>             more throughput.
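
The striping scheme above can be sketched in Java as follows. This is an illustration under stated assumptions, not the Oracle C++ code: `StripedReadWriteLockSketch` and its `stripeSelector` supplier are hypothetical names, the selector stands in for whichever of the four approaches is used (processor id, thread-based index, stack-pointer hash, cycle counter), and having writers take every stripe is an assumption the message does not spell out.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.IntSupplier;

// Sketch only: readers contend on one stripe, writers must hold all of them.
final class StripedReadWriteLockSketch {
    private final ReentrantReadWriteLock[] stripes;
    private final IntSupplier stripeSelector; // hypothetical, e.g. a processor id

    StripedReadWriteLockSketch(int stripeCount, IntSupplier stripeSelector) {
        this.stripes = new ReentrantReadWriteLock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
        this.stripeSelector = stripeSelector;
    }

    // Read-acquire a single stripe. The index is returned so the caller
    // releases the same stripe even if the selector would now answer differently.
    int lockRead() {
        int idx = Math.floorMod(stripeSelector.getAsInt(), stripes.length);
        stripes[idx].readLock().lock();
        return idx;
    }

    void unlockRead(int idx) {
        stripes[idx].readLock().unlock();
    }

    // Assumption: a writer excludes all readers by taking every stripe,
    // always in the same order so two writers cannot deadlock.
    void lockWrite() {
        for (ReentrantReadWriteLock stripe : stripes) {
            stripe.writeLock().lock();
        }
    }

    void unlockWrite() {
        for (int i = stripes.length - 1; i >= 0; i--) {
            stripes[i].writeLock().unlock();
        }
    }
}
```

If two threads pick the same stripe, the lock itself still handles the concurrency; the selector only affects how often cache lines bounce between cores, which is the effect the throughput numbers above measure.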
>>             Nathan Reynolds
>>             <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds>
>>             | Consulting Member of Technical Staff | 602.333.9091
>>             <tel:602.333.9091>
>>             Oracle PSR Engineering <http://psr.us.oracle.com/> | Server
>>             Technology
>>         _______________________________________________
>>         Concurrency-interest mailing list
>>         Concurrency-interest at cs.oswego.edu
>>         <mailto:Concurrency-interest at cs.oswego.edu>
>>         http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>     --
>>     Antoine CHAMBILLE
>>     R&D Director
>>     Quartet FS
