[concurrency-interest] ThreadLocal vs ProcessorLocal

Nathan Reynolds nathan.reynolds at oracle.com
Wed Aug 10 18:50:51 EDT 2011

Thank you for your questions, David.  They got me thinking and 
triggered another round of optimization.  I have attached a new version 
of ProcessorLocal which replaces the ConcurrentHashMap with an array.  It 
uses ProcessorIndex (also attached) to map processor ids to indexes.

I wrote ProcessorIndex a while ago, but it used ConcurrentHashMap.  The 
original version of ProcessorLocal didn't use ProcessorIndex, since that 
would have required two ConcurrentHashMap.get() calls: one inside 
ProcessorLocal to translate the processor id into an index, and one to 
map the index to a holder.  (I hadn't considered a copy-on-write array.)
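The attachments were scrubbed by the archive, so here is a minimal sketch of the array-based design described above.  The processor index is taken as a parameter, because Java has no portable way to read the current processor id; a real implementation would obtain it natively.  Class and method names are my own for illustration.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Sketch: per-processor holders in a copy-on-write array indexed by a
// dense processor index, replacing the ConcurrentHashMap lookup.
public final class ProcessorLocalSketch<T> {
    private final Supplier<T> initialValue;
    private volatile Object[] holders = new Object[0];

    public ProcessorLocalSketch(Supplier<T> initialValue) {
        this.initialValue = initialValue;
    }

    @SuppressWarnings("unchecked")
    public T get(int processorIndex) {
        Object[] array = holders;                 // single volatile read
        if (processorIndex < array.length && array[processorIndex] != null) {
            return (T) array[processorIndex];     // fast path: one array load
        }
        return (T) install(processorIndex);
    }

    private synchronized Object install(int processorIndex) {
        Object[] array = holders;
        if (processorIndex < array.length && array[processorIndex] != null) {
            return array[processorIndex];         // another thread beat us to it
        }
        // Copy-on-write: never mutate an array that readers may be reading.
        Object[] copy = Arrays.copyOf(array,
                Math.max(array.length, processorIndex + 1));
        copy[processorIndex] = initialValue.get();
        holders = copy;                           // publish
        return copy[processorIndex];
    }
}
```

The fast path is one volatile read plus one array load, which is the speedup over two ConcurrentHashMap.get() calls.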

Note: Not all (if any) hardware has an instruction or memory location 
that returns a processor index between 0 and N - 1, where N is the 
number of processors.  For example, x86's processor id is a 32-bit 
integer whose values can be spread across the entire 32-bit range.

I came up with a better way to write ProcessorIndex.  The new version 
uses a HashMap (not a ConcurrentHashMap) to translate processor ids to 
indexes.  If a processor id is not found in the HashMap, it uses 
copy-on-write to replace the old HashMap with a new HashMap that 
includes the newly discovered processor id.  The number of 
copy-on-writes is bounded by the number of processors.  After warming 
up, we get HashMap read speed while still being able to handle new 
processors.
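Since the attachment was scrubbed, here is a sketch of that copy-on-write scheme.  The map reference is volatile and a published map is never mutated, so readers need no locking; names are illustrative, not the attached code.

```java
import java.util.HashMap;

// Sketch: map sparse processor ids to dense indexes 0..N-1 using a
// copy-on-write HashMap, as described above.
public final class ProcessorIndexSketch {
    // volatile so readers always see the most recently published map
    private volatile HashMap<Integer, Integer> idToIndex = new HashMap<>();

    public int indexFor(int processorId) {
        Integer index = idToIndex.get(processorId);  // fast path: plain HashMap read
        if (index != null) {
            return index;
        }
        return addProcessor(processorId);            // slow path, bounded by N
    }

    private synchronized int addProcessor(int processorId) {
        // Re-check under the lock in case another thread just added this id.
        Integer index = idToIndex.get(processorId);
        if (index != null) {
            return index;
        }
        // Copy-on-write: never mutate a map that readers may be reading.
        HashMap<Integer, Integer> copy = new HashMap<>(idToIndex);
        index = copy.size();                         // next dense index
        copy.put(processorId, index);
        idToIndex = copy;                            // publish
        return index;
    }

    public synchronized void reset() {
        // Drop stale entries so indexes can be reassigned densely.
        idToIndex = new HashMap<>();
    }
}
```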

If processors are added to or removed from the system (e.g. hot 
swapping, a change to the processor affinity mask, virtual machine 
migration, virtual CPU reassignment), ProcessorIndex will simply add 
more entries to its internal HashMap.  Stale entries are kept, because 
data structures that rely on the index for NUMA affinity would become 
less efficient if the indexes were rearranged underneath them.  Thus, 
data structures should call ProcessorIndex.reset() at ideal 
opportunities (e.g. when clear() is called).

Do the holders need padding?  I am not sure.  If the JVM's allocator and 
GC are NUMA-aware, then false sharing should hopefully not happen, since 
each element is allocated by the processor that will use it.  The 
allocator and GC will keep each element in memory local to that 
processor rather than next to the others.  I suppose false sharing among 
sibling cores could still happen, but its impact is much less 
significant than false sharing across sockets.  If false sharing turns 
out to be a problem, padded holder objects will solve it.
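For reference, a padded holder could look like the sketch below: extra long fields aim to keep each holder's payload on its own 64-byte cache line.  This is illustrative, not a guaranteed layout; the class and field names are my own.

```java
// Sketch: a holder padded on both sides of its payload.  Seven longs give
// 56 bytes of padding per side, enough for common 64-byte cache lines.
// Caveat: the JVM is free to reorder fields, so production code typically
// forces the layout with a superclass/subclass split instead of relying on
// declaration order.
public class PaddedHolder<T> {
    long p0, p1, p2, p3, p4, p5, p6;  // padding before the payload
    public T value;                    // the per-processor payload
    long q0, q1, q2, q3, q4, q5, q6;  // padding after the payload
}
```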

Nathan Reynolds 
<http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds> | 
Consulting Member of Technical Staff | 602.333.9091
Oracle PSR Engineering <http://psr.us.oracle.com/> | Server Technology

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ProcessorIndex.java
URL: <http://cs.oswego.edu/pipermail/concurrency-interest/attachments/20110810/67d0a18a/attachment-0002.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ProcessorLocal.java
URL: <http://cs.oswego.edu/pipermail/concurrency-interest/attachments/20110810/67d0a18a/attachment-0003.ksh>
