[concurrency-interest] ThreadLocal vs ProcessorLocal

Nathan Reynolds nathan.reynolds at oracle.com
Wed Aug 10 12:38:03 EDT 2011


I would like to recommend that we stripe data structures using a 
ProcessorLocal instead of a ThreadLocal.  ProcessorLocal (attached) is 
exactly like ThreadLocal except that the stored objects are keyed off 
the processor instead of the thread.  To implement ProcessorLocal, we 
need an API that returns the id of the processor the current thread is 
running on.  The HotSpot team has filed an RFE and is planning to 
provide such an API.  (Many of you are already aware of this.)
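
For those who don't want to open the attachment, here is a minimal 
sketch of the idea.  The currentProcessorId() method is a placeholder 
for the proposed HotSpot intrinsic (no such public API exists today), 
and the class name and structure are illustrative rather than the 
attached implementation:

import java.util.concurrent.atomic.AtomicReferenceArray;

/**
 * Sketch of a ProcessorLocal: one slot per processor instead of one per
 * thread.  Values are shared by every thread that runs on the same
 * processor, so they must themselves be thread-safe.
 */
public abstract class ProcessorLocal<T> {
    private final AtomicReferenceArray<T> slots =
        new AtomicReferenceArray<T>(Runtime.getRuntime().availableProcessors());

    /** Computes the initial value for a slot, like ThreadLocal.initialValue(). */
    protected abstract T initialValue();

    /**
     * Placeholder for the proposed intrinsic that returns the id of the
     * processor the calling thread is running on.  This stub only makes
     * the sketch compile; it does NOT provide processor locality.
     */
    protected int currentProcessorId() {
        return (int) (Thread.currentThread().getId()
                      % Runtime.getRuntime().availableProcessors());
    }

    public T get() {
        int p = currentProcessorId();
        T value = slots.get(p);
        if (value == null) {
            value = initialValue();
            // Two threads on the same processor may race here; keep the winner.
            if (!slots.compareAndSet(p, null, value)) {
                value = slots.get(p);
            }
        }
        return value;
    }
}

Unlike ThreadLocal, a stored value can be touched by many threads (and a 
thread can migrate right after the lookup), so processor striping only 
makes contention rare; it does not remove the need for thread safety.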

I would like to share a story and some results to further the discussion 
on processor striping (i.e. ProcessorLocal).

A long time ago, we found that an Oracle C++ program bottlenecked on a 
reader/writer lock.  Threads couldn't read-acquire the lock fast 
enough.  The problem was contention on the lock's cache line while 
executing the CAS instruction.  So, I striped the lock.  The code 
non-atomically incremented an int and masked it to select one of the 
reader/writer locks.  Multiple threads could end up selecting the same 
reader/writer lock because the int was incremented without any 
protection.  If multiple threads did select the same lock, the lock 
simply handled the concurrency and the only cost was some lock 
contention.  This optimization worked great until Intel released 
Nehalem-EX.
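
A rough Java analogue of that original scheme (I don't have the C++ 
code to share, so the names and stripe count here are illustrative):

import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Striped reader/writer lock: an unprotected counter picks the stripe. */
class StripedRWLock {
    private static final int STRIPES = 16;   // power of two so we can mask
    private final ReentrantReadWriteLock[] locks =
        new ReentrantReadWriteLock[STRIPES];
    private int next;                         // deliberately NOT atomic or volatile

    StripedRWLock() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new ReentrantReadWriteLock();
        }
    }

    /** Several threads may pick the same stripe; the lock itself copes with that. */
    ReentrantReadWriteLock stripe() {
        return locks[(next++) & (STRIPES - 1)];
    }
}

The lossy increment is harmless for correctness (any stripe works), but 
note that every caller still writes the cache line holding 'next', 
which is where the next part of the story comes in.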

A while ago, Intel found on Nehalem-EX that the same Oracle C++ program 
didn't scale to the 4th Nehalem socket.  All of the processors/cores 
were 100% busy, but throughput didn't improve when the 4th socket was 
added.  The problem was that the cores were fighting over the cache 
line holding the unprotected int!

I tried 4 approaches to select the reader/writer lock (rough Java 
sketches follow the list).

1) Processor id - This performed the best.  The cache lines holding the 
reader/writer locks are almost never invalidated by another core 
accessing the same lock.  In other words, almost zero CAS contention.
2) ThreadLocal - ThreadLocal gave a 1:1 mapping of threads to locks.  It 
required too many locks, and the locks had to migrate with the threads.
3) Hash the stack pointer - Hashing the stack pointer caused some 
collisions, and because it selected locks essentially at random it hurt 
cache performance.
4) Shift and mask the cycle counter (i.e. rdtsc) - Contention was rare, 
but again the locks were selected at random.
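
To make the comparison concrete, here is roughly what the four 
selection strategies look like in Java terms.  The processor id case 
again assumes the proposed intrinsic; Java has no way to read the stack 
pointer or execute rdtsc directly, so an object's identity hash and 
System.nanoTime() stand in.  Everything here is illustrative rather 
than the measured code:

import java.util.concurrent.atomic.AtomicInteger;

class StripeSelectors {
    static final int STRIPES = 64;        // power of two
    static final int MASK = STRIPES - 1;

    // 1) Processor id: a stripe is touched almost exclusively by one core,
    //    so its cache line is rarely invalidated.
    static int byProcessorId(int currentProcessorId) {
        return currentProcessorId & MASK;
    }

    // 2) ThreadLocal: a fixed stripe per thread; needs as many stripes as
    //    threads, and a stripe's cache line migrates with its thread.
    static final AtomicInteger nextIndex = new AtomicInteger();
    static final ThreadLocal<Integer> perThread = new ThreadLocal<Integer>() {
        @Override protected Integer initialValue() {
            return nextIndex.getAndIncrement() & MASK;
        }
    };
    static int byThreadLocal() {
        return perThread.get();
    }

    // 3) "Hash the stack pointer": Java cannot read the stack pointer, so
    //    hashing a freshly allocated local object is only a loose stand-in.
    static int byStackHash() {
        Object probe = new Object();
        return System.identityHashCode(probe) & MASK;
    }

    // 4) Shift and mask the cycle counter (rdtsc in the C++ version);
    //    System.nanoTime() is the closest portable analogue.
    static int byCycleCounter() {
        return (int) (System.nanoTime() >>> 6) & MASK;
    }
}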

Compared to non-atomically incrementing an int, selecting by processor 
id gave 15% more throughput.  The other 3 approaches gave only 5% more.

Nathan Reynolds 
<http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | 
Consulting Member of Technical Staff | 602.333.9091
Oracle PSR Engineering <http://psr.us.oracle.com/> | Server Technology

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ProcessorLocal.java
URL: <http://cs.oswego.edu/pipermail/concurrency-interest/attachments/20110810/80cfbc4c/attachment-0001.ksh>

