[concurrency-interest] ThreadLocal vs ProcessorLocal

Gregg Wonderly gregg at cytetech.com
Fri Oct 12 10:13:13 EDT 2012


Why not go ahead and write your own JNI interface to the OS services for 
processor affinity, just so that you have the partitioning information? 
  From that point, it would seem that you could build the ProcessorLocal kind 
of functionality with some maps.
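A minimal sketch of the map-based ProcessorLocal described above (an editor's illustration, not code from the thread): `currentProcessorId()` stands in for a JNI binding to something like Linux's sched_getcpu(); here it is stubbed with a thread-based placeholder so the sketch compiles and runs.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of a ProcessorLocal backed by a concurrent map, as suggested.
final class ProcessorLocal<T> {
    private final ConcurrentHashMap<Integer, T> values = new ConcurrentHashMap<>();
    private final Supplier<T> initialValue;

    ProcessorLocal(Supplier<T> initialValue) {
        this.initialValue = initialValue;
    }

    // Placeholder: a real implementation would call the OS via JNI
    // (e.g. sched_getcpu() on Linux). Stubbed here with a thread hash.
    static int currentProcessorId() {
        return (int) (Thread.currentThread().getId()
                      % Runtime.getRuntime().availableProcessors());
    }

    // Lazily create one value per (apparent) processor slot.
    T get() {
        return values.computeIfAbsent(currentProcessorId(),
                                      id -> initialValue.get());
    }
}
```

A thread migrating between processors may touch different slots, so the per-slot values must themselves be thread-safe; lost locality only costs performance, not correctness.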

Gregg Wonderly

On 10/12/2012 7:53 AM, Antoine Chambille wrote:
> Thanks for sharing.
>
> This is a bit depressing, though. The upcoming release of Java will probably have
> neither fine-grained memory fences nor support for processor affinity.
>
> I am certainly too much focused on our own requirements (in-memory database, TB
> heaps, tens of cores) but I feel that Java (once the best language in the world
> for concurrent programming, when JDK5 was released) is not ready for the
> many-cores era.
>
>
> With respect to NUMA: in the medium term we will have to try exotic deployments,
> for instance launching several JVMs, one per NUMA node, and connecting them with
> high-speed messaging. For messaging we are looking at memory-mapped buffers over
> files in a RAM drive (/dev/shm on Linux, for instance) or reusing a framework
> like "Chronicle" by Peter Lawrey (https://github.com/peter-lawrey/Java-Chronicle).
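The memory-mapped messaging mentioned above can be sketched with standard java.nio (an editor's illustration; the /dev/shm path is simply whatever RAM-backed file the cooperating JVMs agree on):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: two JVMs exchange data through a memory-mapped file placed on a
// RAM-backed filesystem such as /dev/shm. Each side maps the same region;
// writes by one become visible to the other through the shared page cache.
final class ShmChannel {
    static MappedByteBuffer map(Path file, int size) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }
}
```

One JVM writes into the buffer (e.g. `buf.putInt(0, 42)`) and the other reads it back; message framing, flushing, and memory fencing are deliberately left out of this sketch.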
>
>
> -Antoine
> Quartet FS
>
>
>
> On 12 October 2012 14:30, Chris Hegarty <chris.hegarty at oracle.com> wrote:
>
>     I am also not aware of any work being done on this, with respect to JDK8.
>
>     -Chris.
>
>
>     On 12/10/2012 12:43, David Holmes wrote:
>
>         Hi Antoine,
>         I've not seen any sign that this has moved toward an actual API proposal
>         so far. As the feature complete date for JDK 8 is January 31, I'd say
>         the chances of any such API appearing in JDK 8 would be slim. But
>         perhaps someone is fleshing this out quietly and will come forward with
>         a fully worked out API design and implementation.
>         David
>         -------
>
>              -----Original Message-----
>              *From:* concurrency-interest-bounces at cs.oswego.edu
>              [mailto:concurrency-interest-bounces at cs.oswego.edu]*On Behalf Of
>              *Antoine Chambille
>              *Sent:* Thursday, 11 October 2012 10:44 PM
>              *To:* Concurrency-interest at cs.oswego.edu
>              *Subject:* Re: [concurrency-interest] ThreadLocal vs ProcessorLocal
>
>
>              After deploying the "ActivePivot" software (a Java in-memory
>              analytics solution) on a new server, we discovered how inefficient
>              a large-heap Java application can be on a NUMA system.
>
>              The server has 4 sockets (8C Xeon CPU) with 512GB of memory. There
>              are 4 NUMA nodes each with 8 cores and 128GB. ActivePivot loads
>              large amounts of data in memory (up to hundreds of GB) and then
>              performs complex aggregation queries over the entire data, using a
>              global fork/join pool. Although the data structures are partitioned,
>              this partitioning is random with respect to NUMA nodes and
>              processors. There is a sharp performance drop compared to smaller
>              SMP servers.
>
>
>              We are considering launching several JVMs, each bound to a NUMA
>              node, and communicating with each other. But that's an entirely new
>              layer of code, and it won't be quite as performant as processor-id
>              partitioning within one JVM.
>
>              I think that's another use case for the processor-id API. I hope
>              this RFE is making progress. Is there a chance to see it as part of
>              JDK8?
>
>
>              Regards,
>              -Antoine CHAMBILLE
>              Quartet FS
>
>
>
>
>              On 19 August 2011 01:22, David Holmes <davidcholmes at aapt.net.au
>         <mailto:davidcholmes at aapt.net.au>> wrote:
>
>                  Aside: Just for the public record. I think an API to support use
>                  of Processor sets would be a useful addition to the platform to
>                  allow better partitioning of CPU resources. The Real-time
>                  Specification for Java 1.1 under JSR-282 is looking at such an
>                  API primarily to work in with OS level processor sets. For our
>                  purposes, more locally within the JVM, the processors made
>                  available to the JVM could be partitioned both internally (i.e.
>                  binding GC to a specific set of cores) and at the
>                  application/library level (such as allocating ForkJoinPools
>                  disjoint sets of cores). The ability to query the current
>                  processor ID is inherently needed, and so I would make it part of
>                  the Processors class. This would provide the foundation API for
>                  ProcessorLocal.
>                  David Holmes
>
>                      -----Original Message-----
>                      *From:* concurrency-interest-bounces at cs.oswego.edu
>                      [mailto:concurrency-interest-bounces at cs.oswego.edu]*On
>                      Behalf Of *Nathan Reynolds
>                      *Sent:* Thursday, 11 August 2011 2:38 AM
>                      *To:* concurrency-interest at cs.oswego.edu
>                      *Subject:* [concurrency-interest] ThreadLocal vs ProcessorLocal
>
>
>                      I would like to recommend that we stripe data structures
>                      using a ProcessorLocal instead of a ThreadLocal.
>                      ProcessorLocal (attached) is exactly like ThreadLocal except
>                      that the stored objects are keyed off the processor instead
>                      of the thread. In order to implement ProcessorLocal, it
>                      needs an API that returns the id of the processor the
>                      thread is currently running on. The HotSpot team has filed
>                      an RFE and is planning to provide such an API. (Many of you
>                      are already aware of this.)
>
>                      I would like to share a story and some results to further
>                      the discussion on processor striping (i.e. ProcessorLocal).
>
>                      A long time ago, we found that an Oracle C++ program
>                      bottlenecked on a reader/writer lock. Threads couldn't
>                      read-acquire the lock fast enough. The problem was due to
>                      the contention on the cache line while executing the CAS
>                      instruction. So, I striped the lock. The code non-atomically
>                      incremented an int and masked it to select one of the
>                      reader/writer locks. Multiple threads could end up selecting
>                      the same reader/writer lock because the int was incremented
>                      in an unprotected manner. If multiple threads selected the
>                      same reader/writer lock, the lock would handle the
>                      concurrency and the only negative was lock performance. This
>                      optimization worked great until Intel released Nehalem-EX.
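The counter-based striping described above might be sketched like this (an editor's illustration; names are hypothetical): a plain, deliberately non-atomic int is masked to pick one of a power-of-two array of reader/writer locks.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of counter-based lock striping: a plain int is bumped without any
// atomic instruction and masked to select a stripe. Lost updates just mean
// two threads share a stripe, which the lock itself handles; correctness is
// unaffected, only throughput.
final class StripedRWLock {
    private final ReentrantReadWriteLock[] stripes;
    private final int mask;
    private int next; // deliberately non-atomic, racy by design

    StripedRWLock(int stripeCount) { // stripeCount must be a power of two
        stripes = new ReentrantReadWriteLock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
        mask = stripeCount - 1;
    }

    ReentrantReadWriteLock pick() {
        // The unprotected increment below is the shared hot cache line
        // that became the bottleneck on Nehalem-EX.
        return stripes[next++ & mask];
    }
}
```

Replacing `next++ & mask` with the current processor id is the change that, per the measurements below, yielded 15% more throughput.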
>
>                      A while ago, Intel found on Nehalem-EX that the same Oracle
>                      C++ program didn't scale to the 4th Nehalem socket. All of
>                      the processors/cores were 100% busy, but the throughput
>                      didn't improve by adding the 4th Nehalem socket. The problem
>                      was the cores were fighting to get the cache line holding
>                      the unprotected int!
>
>                      I tried 4 approaches to select the reader/writer lock.
>
>                      1) Processor id - This performed the best. The cache lines
>                      holding the reader/writer locks are almost never invalidated
>                      due to another core accessing the reader/writer lock. In
>                      other words, almost 0 CAS contention.
>                      2) ThreadLocal - ThreadLocal had a 1:1 mapping of threads to
>                      locks. It required too many locks and the locks had to
>                      migrate with the threads.
>                      3) Hash the stack pointer - Hashing the stack pointer caused
>                      some collisions but essentially randomly selected locks and
>                      this hurt cache performance.
>                      4) Shift and mask the cycle counter (i.e. rdtsc) -
>                      Contention was rare but again it randomly selected the locks.
>
>                      Compared to non-atomically incrementing an int, processor id
>                      resulted in 15% more throughput. The other 3 only showed 5%
>                      more throughput.
>
>                      Nathan Reynolds
>
>                      <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds>
>                      | Consulting Member of Technical Staff | 602.333.9091
>                      Oracle PSR Engineering <http://psr.us.oracle.com/> | Server
>                      Technology
>
>                  _______________________________________________
>                  Concurrency-interest mailing list
>                  Concurrency-interest at cs.oswego.edu
>                  http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>
>
>
>
>              --
>              Antoine CHAMBILLE
>              R&D Director
>              Quartet FS
>
>
>
>
>
>
>
> --
> Antoine CHAMBILLE
> R&D Director
> Quartet FS
>
>
>
>



More information about the Concurrency-interest mailing list