[concurrency-interest] ThreadLocal vs ProcessorLocal

David Holmes davidcholmes at aapt.net.au
Thu Aug 18 19:22:06 EDT 2011

Aside: Just for the public record. I think an API to support the use of processor sets would be a useful addition to the platform, allowing better partitioning of CPU resources. The Real-Time Specification for Java 1.1 under JSR-282 is looking at such an API, primarily to work with OS-level processor sets. For our purposes, more locally within the JVM, the processors made available to the JVM could be partitioned both internally (i.e. binding GC to a specific set of cores) and at the application/library level (such as allocating ForkJoinPools to disjoint sets of cores). The ability to query the current processor ID is inherently needed, so I would make it part of the Processors class. This would provide the foundation API for ProcessorLocal.

David Holmes
  -----Original Message-----
  From: concurrency-interest-bounces at cs.oswego.edu [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of Nathan Reynolds
  Sent: Thursday, 11 August 2011 2:38 AM
  To: concurrency-interest at cs.oswego.edu
  Subject: [concurrency-interest] ThreadLocal vs ProcessorLocal

  I would like to recommend that we stripe data structures using a ProcessorLocal instead of ThreadLocal.  ProcessorLocal (attached) is exactly like ThreadLocal except that the stored objects are keyed off the processor instead of the thread.  Implementing ProcessorLocal requires an API that returns the ID of the processor the current thread is running on.  The HotSpot team has filed an RFE and is planning to provide such an API.  (Many of you are already aware of this.)
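
For readers without the attachment, here is a minimal sketch of what a ProcessorLocal might look like. It is not the attached class. Since the JDK has no call that returns the current processor ID, the sketch takes one as an injected IntSupplier (currentProcessorId, a hypothetical stand-in for the API requested in the RFE). The constructor shape and the lazy initialization are also assumptions.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

/**
 * Sketch of a ProcessorLocal: like ThreadLocal, but the slot is selected by
 * the ID of the processor the calling thread is running on.
 */
final class ProcessorLocal<T> {
    private final AtomicReferenceArray<T> slots;
    // Hypothetical: stands in for a "current processor ID" API the JDK lacks.
    private final IntSupplier currentProcessorId;
    // Must return non-null values; null marks an empty slot.
    private final Supplier<? extends T> initialValue;

    ProcessorLocal(int processorCount,
                   IntSupplier currentProcessorId,
                   Supplier<? extends T> initialValue) {
        if (processorCount <= 0) {
            throw new IllegalArgumentException("processorCount must be positive");
        }
        this.slots = new AtomicReferenceArray<>(processorCount);
        this.currentProcessorId = currentProcessorId;
        this.initialValue = initialValue;
    }

    /** Returns the value for the current processor, creating it on first use. */
    T get() {
        // floorMod keeps the index in range even if the ID exceeds the slot count.
        int index = Math.floorMod(currentProcessorId.getAsInt(), slots.length());
        T value = slots.get(index);
        if (value == null) {
            T created = initialValue.get();
            // Two threads can see the same empty slot; the first CAS wins.
            value = slots.compareAndSet(index, null, created) ? created : slots.get(index);
        }
        return value;
    }
}
```

Unlike a ThreadLocal value, the returned object can be shared: a thread may migrate to another processor right after the ID is read, so the stored objects still have to be thread-safe. The processor ID only makes sharing rare.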

  I would like to share a story and some results to further the discussion on processor striping (i.e. ProcessorLocal).

  A long time ago, we found that an Oracle C++ program bottlenecked on a reader/writer lock.  Threads couldn't read-acquire the lock fast enough.  The problem was due to the contention on the cache line while executing the CAS instruction.  So, I striped the lock.  The code non-atomically incremented an int and masked it to select one of the reader/writer locks.  Multiple threads could end up selecting the same reader/writer lock because the int was incremented in an unprotected manner.  If multiple threads selected the same reader/writer lock, the lock would handle the concurrency and the only negative was lock performance.  This optimization worked great until Intel released Nehalem-EX.
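
The original code was C++ and is not shown here. As a rough Java sketch of the same idea: an array of reader/writer locks, with a plain int that is incremented without any synchronization and masked to pick a stripe. The class and method names are hypothetical, and the writer path (acquire every stripe) is an assumption, since the description above only covers read-acquires.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of a striped reader/writer lock selected by a racy counter. */
final class RacyStripedReadWriteLock {
    private final ReadWriteLock[] stripes;
    private final int mask;
    // Deliberately neither volatile nor atomic: lost updates only mean that
    // two readers occasionally pick the same stripe, which the lock handles.
    private int next;

    RacyStripedReadWriteLock(int stripeCount) {
        if (stripeCount <= 0 || Integer.bitCount(stripeCount) != 1) {
            throw new IllegalArgumentException("stripeCount must be a power of two");
        }
        stripes = new ReadWriteLock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
        mask = stripeCount - 1;
    }

    /** Read-locks one stripe and returns its index for the matching unlock. */
    int lockRead() {
        int index = next++ & mask; // unprotected increment, then mask
        stripes[index].readLock().lock();
        return index;
    }

    void unlockRead(int index) {
        stripes[index].readLock().unlock();
    }

    /** Assumption: a writer excludes all readers by taking every stripe. */
    void lockWrite() {
        for (ReadWriteLock stripe : stripes) {
            stripe.writeLock().lock();
        }
    }

    void unlockWrite() {
        for (int i = stripes.length - 1; i >= 0; i--) {
            stripes[i].writeLock().unlock();
        }
    }
}
```

The shared field next is the weak point of this design: every read-acquire writes to it, so all cores contend for the one cache line that holds it.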

  A while ago, Intel found on Nehalem-EX that the same Oracle C++ program didn't scale to the 4th Nehalem socket.  All of the processors/cores were 100% busy, but the throughput didn't improve by adding the 4th Nehalem socket.  The problem was the cores were fighting to get the cache line holding the unprotected int!

  I tried 4 approaches to select the reader/writer lock.

  1) Processor id - This performed the best.  The cache lines holding the reader/writer locks are almost never invalidated due to another core accessing the reader/writer lock.  In other words, almost 0 CAS contention.
  2) ThreadLocal - ThreadLocal had a 1:1 mapping of threads to locks.  It required too many locks and the locks had to migrate with the threads.
  3) Hash the stack pointer - Hashing the stack pointer caused some collisions but essentially randomly selected locks and this hurt cache performance.
  4) Shift and mask the cycle counter (i.e. rdtsc) - Contention was rare but again it randomly selected the locks.
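
To make approaches 1, 3 and 4 concrete, here is a small Java sketch of the index computations only. Java cannot read the processor ID, the stack pointer or rdtsc directly, so the raw values are passed in as parameters. The class name, the multiplicative hash and the shift amount are illustrative assumptions, not what the C++ code used. Approach 2 needs no selector because each thread owns its own lock.

```java
/** Sketch: ways to turn a raw value into a stripe index (mask = stripes - 1). */
final class StripeSelectors {
    private StripeSelectors() {}

    /** 1) Processor ID: threads on the same core always pick the same stripe. */
    static int byProcessorId(int processorId, int mask) {
        return processorId & mask;
    }

    /**
     * 3) Hash of an address-like value such as a stack pointer.
     * The multiplier is a common 64-bit mixing constant, chosen here only
     * for illustration.
     */
    static int byHash(long value, int mask) {
        return (int) ((value * 0x9E3779B97F4A7C15L) >>> 32) & mask;
    }

    /** 4) Cycle counter: drop the low bits, then mask. */
    static int byCycleCounter(long ticks, int shift, int mask) {
        return (int) (ticks >>> shift) & mask;
    }
}
```

Only the first selector is stable for a given core. The other two spread calls from one core across all stripes, which matches the cache behaviour described in items 3 and 4.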

  Compared to non-atomically incrementing an int, processor id resulted in 15% more throughput.  The other 3 only showed 5% more throughput.

  Nathan Reynolds | Consulting Member of Technical Staff | 602.333.9091
  Oracle PSR Engineering | Server Technology 
