[concurrency-interest] ThreadLocal vs ProcessorLocal

Antoine Chambille ach at quartetfs.com
Fri Oct 12 08:53:55 EDT 2012


Thanks for sharing.

This is a bit depressing though. The upcoming release of Java will probably
have neither fine-grained memory fences nor support for processor affinity.

I am certainly too focused on our own requirements (in-memory database,
TB heaps, tens of cores), but I feel that Java (once the best language in
the world for concurrent programming, when JDK 5 was released) is not ready
for the many-core era.


With respect to NUMA: in the medium term we will have to try exotic
deployments. For instance, launching several JVMs, one per NUMA node, and
connecting them with high-speed messaging. For messaging we are looking at
memory-mapped buffers over files in a RAM drive (/dev/shm on Linux for
instance), or reusing a framework like "Chronicle" by Peter Lawrey (
https://github.com/peter-lawrey/Java-Chronicle).
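
As a rough illustration, here is a minimal sketch of the mapped-buffer
idea. The path under /dev/shm, the buffer size, and the single putLong are
illustrative assumptions; a real channel would also need a protocol for
framing and visibility between the two JVMs.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class ShmChannel {
        public static void main(String[] args) throws Exception {
            // Two JVMs mapping the same file under /dev/shm share the
            // pages directly, so writes never touch a physical disk.
            // Each JVM would be pinned to its NUMA node externally,
            // e.g. with numactl --cpunodebind/--membind.
            RandomAccessFile file =
                new RandomAccessFile("/dev/shm/ipc-channel", "rw");
            try {
                FileChannel channel = file.getChannel();
                MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);
                buf.putLong(0, 42L); // visible to the peer JVM mapping this file
            } finally {
                file.close(); // also closes the channel
            }
        }
    }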


-Antoine
Quartet FS



On 12 October 2012 14:30, Chris Hegarty <chris.hegarty at oracle.com> wrote:

> I am also not aware of any work being done on this, with respect to JDK 8.
>
> -Chris.
>
>
> On 12/10/2012 12:43, David Holmes wrote:
>
>> Hi Antoine,
>> I've not seen any sign that this has moved toward an actual API proposal
>> so far. As the feature complete date for JDK 8 is January 31, I'd say
>> the chances of any such API appearing in JDK 8 would be slim. But
>> perhaps someone is fleshing this out quietly and will come forward with
>> a fully worked out API design and implementation.
>> David
>> -------
>>
>>     -----Original Message-----
>>     *From:* concurrency-interest-bounces at cs.oswego.edu
>>     [mailto:concurrency-interest-bounces at cs.oswego.edu]
>>     *On Behalf Of* Antoine Chambille
>>     *Sent:* Thursday, 11 October 2012 10:44 PM
>>     *To:* Concurrency-interest at cs.oswego.edu
>>     *Subject:* Re: [concurrency-interest] ThreadLocal vs ProcessorLocal
>>
>>
>>     After deploying the "ActivePivot" software (a Java in-memory
>>     analytics solution) on a new server, we discovered how inefficient a
>>     large-heap Java application can be on a NUMA system.
>>
>>     The server has 4 sockets (8-core Xeon CPUs) with 512GB of memory.
>>     There are 4 NUMA nodes, each with 8 cores and 128GB. ActivePivot
>>     loads large amounts of data in memory (up to hundreds of GB) and
>>     then performs complex aggregation queries over the entire data set,
>>     using a global fork/join pool. Although the data structures are partitioned,
>>     this partitioning is random with respect to NUMA nodes and
>>     processors. There is a sharp performance drop compared to smaller
>>     SMP servers.
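>>
>>     As a rough sketch of that query pattern (the names and the use of
>>     plain long[] partitions are illustrative, not ActivePivot's actual
>>     code):
>>
>>         import java.util.List;
>>         import java.util.concurrent.ForkJoinPool;
>>         import java.util.concurrent.RecursiveTask;
>>
>>         // Aggregates partitioned data with a shared fork/join pool.
>>         // Nothing here controls which NUMA node holds each partition
>>         // or runs each subtask; that mismatch is the performance
>>         // problem described above.
>>         class AggregateTask extends RecursiveTask<Long> {
>>             private final List<long[]> partitions;
>>             AggregateTask(List<long[]> partitions) {
>>                 this.partitions = partitions;
>>             }
>>             @Override
>>             protected Long compute() {
>>                 if (partitions.size() == 1) {
>>                     long sum = 0;
>>                     for (long v : partitions.get(0)) sum += v;
>>                     return sum;
>>                 }
>>                 int mid = partitions.size() / 2;
>>                 AggregateTask left =
>>                     new AggregateTask(partitions.subList(0, mid));
>>                 AggregateTask right = new AggregateTask(
>>                     partitions.subList(mid, partitions.size()));
>>                 left.fork();
>>                 return right.compute() + left.join();
>>             }
>>         }
>>         // usage: new ForkJoinPool().invoke(new AggregateTask(parts))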
>>
>>
>>     We are considering launching several JVMs, each bound to NUMA nodes,
>>     and communicating with each other. But that's an entirely new layer of
>>     code, and this won't be quite as performant as processor id
>>     partitioning in one JVM.
>>
>>     I think that's another use case for the processor id API. I hope
>>     this RFE is making progress. Is there a chance to see it as part of
>>     JDK 8?
>>
>>
>>     Regards,
>>     -Antoine CHAMBILLE
>>     Quartet FS
>>
>>
>>
>>
>>     On 19 August 2011 01:22, David Holmes <davidcholmes at aapt.net.au>
>>     wrote:
>>
>>
>>         Aside: Just for the public record. I think an API to support use
>>         of Processor sets would be a useful addition to the platform to
>>         allow better partitioning of CPU resources. The Real-time
>>         Specification for Java 1.1 under JSR-282 is looking at such an
>>         API, primarily to work with OS-level processor sets. For our
>>         purposes, more locally within the JVM, the processors made
>>         available to the JVM could be partitioned both internally (i.e.
>>         binding GC to a specific set of cores) and at the
>>         application/library level (such as allocating disjoint sets of
>>         cores to ForkJoinPools). The ability to query the current
>>         processor ID is inherently needed, and so I would make it part of
>>         the Processors class. This would provide the foundation API for
>>         ProcessorLocal.
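>>
>>         A hypothetical shape for such an API, purely for illustration
>>         (no such class exists in the JDK; every name here is invented):
>>
>>             public final class Processors {
>>                 // Index of the processor the calling thread is
>>                 // currently running on. Advisory only: the thread
>>                 // may migrate immediately after this returns.
>>                 public static native int currentProcessorId();
>>
>>                 // Number of processors available to this JVM.
>>                 public static int count() {
>>                     return Runtime.getRuntime().availableProcessors();
>>                 }
>>
>>                 private Processors() {}
>>             }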
>>         David Holmes
>>
>>             -----Original Message-----
>>             *From:* concurrency-interest-bounces at cs.oswego.edu
>>             [mailto:concurrency-interest-bounces at cs.oswego.edu]
>>             *On Behalf Of* Nathan Reynolds
>>             *Sent:* Thursday, 11 August 2011 2:38 AM
>>             *To:* concurrency-interest at cs.oswego.edu
>>             *Subject:* [concurrency-interest] ThreadLocal vs
>>             ProcessorLocal
>>
>>
>>             I would like to recommend that we stripe data structures
>>             using a ProcessorLocal instead of a ThreadLocal.
>>             ProcessorLocal (attached) is exactly like ThreadLocal, except
>>             that the stored objects are keyed off the processor instead
>>             of the thread. Implementing ProcessorLocal requires an API
>>             that returns the id of the processor the current thread is
>>             running on. The HotSpot team has filed an RFE and is
>>             planning to provide such an API. (Many of you are already
>>             aware of this.)
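>>
>>             A minimal sketch of the idea, assuming a hypothetical
>>             Processors.currentProcessorId() like the one sketched
>>             above (the attached implementation may differ):
>>
>>                 import java.util.concurrent.atomic.AtomicReferenceArray;
>>
>>                 public class ProcessorLocal<T> {
>>                     // One slot per processor; assumes ids fall in
>>                     // [0, availableProcessors()).
>>                     private final AtomicReferenceArray<T> slots =
>>                         new AtomicReferenceArray<T>(
>>                             Runtime.getRuntime().availableProcessors());
>>
>>                     // Override to create the per-processor value
>>                     // lazily, mirroring ThreadLocal.initialValue().
>>                     protected T initialValue() { return null; }
>>
>>                     public T get() {
>>                         int id = Processors.currentProcessorId();
>>                         T value = slots.get(id);
>>                         if (value == null) {
>>                             value = initialValue();
>>                             // Threads on the same processor may race;
>>                             // keep whichever value wins the CAS.
>>                             if (!slots.compareAndSet(id, null, value)) {
>>                                 value = slots.get(id);
>>                             }
>>                         }
>>                         return value;
>>                     }
>>                 }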
>>
>>             I would like to share a story and some results to further
>>             the discussion on processor striping (i.e. ProcessorLocal).
>>
>>             A long time ago, we found that an Oracle C++ program
>>             bottlenecked on a reader/writer lock. Threads couldn't
>>             read-acquire the lock fast enough. The problem was due to
>>             the contention on the cache line while executing the CAS
>>             instruction. So, I striped the lock. The code non-atomically
>>             incremented an int and masked it to select one of the
>>             reader/writer locks. Multiple threads could end up selecting
>>             the same reader/writer lock because the int was incremented
>>             in an unprotected manner. If multiple threads selected the
>>             same reader/writer lock, the lock would handle the
>>             concurrency and the only negative was lock performance. This
>>             optimization worked great until Intel released Nehalem-EX.
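>>
>>             In Java terms, the striping scheme looked roughly like the
>>             sketch below (the original was Oracle C++ code; the names
>>             and the stripe count are illustrative):
>>
>>                 import java.util.concurrent.locks.ReentrantReadWriteLock;
>>
>>                 class StripedRWLock {
>>                     private static final int STRIPES = 16; // power of two
>>                     private final ReentrantReadWriteLock[] locks =
>>                         new ReentrantReadWriteLock[STRIPES];
>>                     private int next; // intentionally unsynchronized
>>
>>                     StripedRWLock() {
>>                         for (int i = 0; i < STRIPES; i++) {
>>                             locks[i] = new ReentrantReadWriteLock();
>>                         }
>>                     }
>>
>>                     // Races on 'next' merely send two threads to the
>>                     // same stripe, which the lock handles safely; but
>>                     // every core writes this one cache line, and that
>>                     // line is what stopped scaling on Nehalem-EX.
>>                     ReentrantReadWriteLock pick() {
>>                         return locks[next++ & (STRIPES - 1)];
>>                     }
>>                 }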
>>
>>             A while ago, Intel found on Nehalem-EX that the same Oracle
>>             C++ program didn't scale to the 4ᵗʰ Nehalem socket. All of
>>             the processors/cores were 100% busy, but the throughput
>>             didn't improve by adding the 4ᵗʰ Nehalem socket. The problem
>>             was the cores were fighting to get the cache line holding
>>             the unprotected int!
>>
>>             I tried 4 approaches to select the reader/writer lock.
>>
>>             1) Processor id - This performed the best. The cache lines
>>             holding the reader/writer locks are almost never invalidated
>>             due to another core accessing the reader/writer lock. In
>>             other words, almost 0 CAS contention.
>>             2) ThreadLocal - ThreadLocal had a 1:1 mapping of threads to
>>             locks. It required too many locks and the locks had to
>>             migrate with the threads.
>>             3) Hash the stack pointer - Hashing the stack pointer caused
>>             some collisions but essentially randomly selected locks and
>>             this hurt cache performance.
>>             4) Shift and mask the cycle counter (i.e. rdtsc) -
>>             Contention was rare but again it randomly selected the locks.
>>
>>             Compared to non-atomically incrementing an int, processor id
>>             resulted in 15% more throughput. The other 3 only showed 5%
>>             more throughput.
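>>
>>             The winning variant simply replaces the shared counter with
>>             the current processor id; sketched here as a method added
>>             to the StripedRWLock above (again via the hypothetical
>>             processor id API):
>>
>>                 // Each core keeps returning to the same stripe, so
>>                 // that stripe's cache line stays in the core's cache
>>                 // and the contended counter disappears entirely.
>>                 ReentrantReadWriteLock pickByProcessor() {
>>                     int id = Processors.currentProcessorId();
>>                     return locks[id & (STRIPES - 1)];
>>                 }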
>>
>>             Nathan Reynolds
>>             <http://psr.us.oracle.com/wiki/index.php/User:Nathan_Reynolds>
>>             | Consulting Member of Technical Staff | 602.333.9091
>>             Oracle PSR Engineering <http://psr.us.oracle.com/> | Server
>>             Technology
>>
>>
>>         _______________________________________________
>>         Concurrency-interest mailing list
>>         Concurrency-interest at cs.oswego.edu
>>         http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
>>
>>
>>
>>     --
>>     Antoine CHAMBILLE
>>     R&D Director
>>     Quartet FS
>>
>>
>>
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
>


-- 
Antoine CHAMBILLE
R&D Director
Quartet FS