[concurrency-interest] x86 NOOP memory barriers

Nathan Reynolds nathan.reynolds at oracle.com
Tue Aug 6 15:23:20 EDT 2013


It would be interesting if the charts included data points for 6 
threads.  That would match the 6 "real" cores.

The beauty of hyper-threading is that if one thread stalls, then the 
other thread gets all of the core's computing resources.  So, if one 
thread executes a locked instruction or stalls waiting for a cache line 
to be fetched, the other thread can execute its own operations at full 
speed.

Also, the logical operation of one thread doesn't impact the operation 
of the other thread on the core.  So, if one thread executes a fence, 
the fence only constrains that thread's ordering; the other thread is 
free to execute loads and stores as it wishes.  However, the physical 
operation of a thread does impact the physical operation of the other 
thread, since the two are competing for the core's execution resources.
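
To make that distinction concrete, here is a minimal sketch (the class 
and method names are mine, not from the benchmark) of the two store 
flavors being compared.  On x86, the volatile set is the one that emits 
a locked instruction; lazySet compiles down to a plain store.  Note 
that, as Dmitry points out below, the get is a volatile read in both 
cases.

    import java.util.concurrent.atomic.AtomicLongArray;

    public class StoreFlavors {
        static final AtomicLongArray counters = new AtomicLongArray(64);

        // Volatile write: on x86 the JIT emits a locked instruction,
        // draining the store buffer.  That stalls this hyper-thread
        // only; the sibling thread's loads and stores keep going.
        static void volatileInc(int i) {
            counters.set(i, counters.get(i) + 1);
        }

        // Ordered ("lazy") write: a plain store on x86.  The JIT merely
        // refrains from reordering it with earlier stores (release).
        static void lazyInc(int i) {
            counters.lazySet(i, counters.get(i) + 1);
        }
    }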

-Nathan
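
P.S.  For concreteness, here is a rough JMH-style sketch of the 
shared-counter comparison under discussion.  The real code is in 
Nitsan's post; the names below are mine, and recent JMH spells the 
annotation @Benchmark where older versions used @GenerateMicroBenchmark.

    import java.util.concurrent.atomic.AtomicLongArray;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    public class SharedCounterBench {
        // One slot shared by all threads.  An unshared variant would
        // give each thread its own slot, spaced two cache lines
        // (16 longs) apart to rule out false sharing and
        // adjacent-line prefetching -- Hans' question below.
        final AtomicLongArray counters = new AtomicLongArray(16);

        @Benchmark
        public long volatileInc() {
            long v = counters.get(0) + 1;  // volatile read
            counters.set(0, v);            // volatile write: fenced on x86
            return v;
        }

        @Benchmark
        public long lazyInc() {
            long v = counters.get(0) + 1;  // still a volatile read
            counters.lazySet(0, v);        // ordered write: plain store
            return v;
        }
    }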

On 8/6/2013 11:03 AM, Dmitry Zaslavsky wrote:
> I didn't have a chance to run this benchmark, but one very significant 
> point here is that the system is hyper-threaded: it has only 6 'real' cores.
> My guess would be that the lock instruction generated by the volatile 
> access causes a real stall for the other hyper-thread.
>
> Another factor is that the test has both a read and a write.  The 
> description says that AtomicLongArray was used.
> The .get method is a volatile get, and I guess it was used even for the 
> lazy version?
>
>
>
> Sent from mobile device
>
> On Aug 6, 2013, at 12:15 PM, "Boehm, Hans" <hans.boehm at hp.com> wrote:
>
>> It would be nice to understand exactly what the difference in 
>> generated code is for the different versions whose performance you 
>> plotted in 
>> http://psy-lob-saw.blogspot.com/2013/05/using-jmh-to-benchmark-multi-threaded.html 
>> .  I'm surprised by the increasing differences in the unshared case 
>> at high processor counts.  That suggests you are generating different 
>> memory traffic for some reason, perhaps because of optimization 
>> artifacts for this particular implementation.  AFAICT, the unshared 
>> cases should be embarrassingly parallel; there should be no real 
>> thread interaction?  Are you spacing out the elements far enough to 
>> deal with prefetching artifacts?
>>
>> I'm also surprised by the lazy vs. volatile differences in the shared 
>> case.   It seems to me the time should be completely dominated by 
>> coherence misses in either case.  There may be some unexpected odd 
>> optimization or lack thereof happening here.  In my limited 
>> experience, the impact of memory fences, etc. commonly decreases as 
>> scale increases, since those slowdowns are local to each core, and 
>> don't affect the amount of memory traffic. See for example the 
>> microbenchmark measurements in 
>> http://www.hpl.hp.com/techreports/2012/HPL-2012-218.html .
>>
>> This benchmark is such that I have a hard time guessing what 
>> optimizations would be applied in each case, and I would expect that 
>> to vary a lot across JVMs.  You're probably recalculating the 
>> addresses of the array indices more in some cases than others.  Can 
>> multiple increments even get combined in some cases?
>>
>> Hans
>>
>> *From:* concurrency-interest-bounces at cs.oswego.edu 
>> [mailto:concurrency-interest-bounces at cs.oswego.edu] *On Behalf Of* 
>> Nitsan Wakart
>> *Sent:* Tuesday, August 06, 2013 1:38 AM
>> *To:* concurrency-interest at cs.oswego.edu
>> *Subject:* Re: [concurrency-interest] x86 NOOP memory barriers
>>
>> Summarized in this blog post here:
>>
>> http://psy-lob-saw.blogspot.com/2013/08/memory-barriers-are-not-free.html
>>
>> Please point out any mistakes/omissions/oversight.
>>
>> Thanks for the help guys.
>>
>> ------------------------------------------------------------------------
>>
>> *From:* Michael Barker <mikeb01 at gmail.com>
>> *To:* Nitsan Wakart <nitsanw at yahoo.com>
>> *Cc:* Vitaly Davidovich <vitalyd at gmail.com>; 
>> concurrency-interest at cs.oswego.edu
>> *Sent:* Saturday, August 3, 2013 12:33 AM
>> *Subject:* Re: [concurrency-interest] x86 NOOP memory barriers
>>
>>
>> > So because a putOrdered is a write to memory it cannot be reordered
>> > with other writes, as per "8.2.3.2 Neither Loads Nor Stores Are
>> > Reordered with Like Operations".
>>
>> Yes, in combination with the compiler reordering restrictions.  In
>> HotSpot this is implemented in
>> LibraryCallKit::inline_unsafe_ordered_store (library_call.cpp).
>> Look for:
>>
>> insert_mem_bar(Op_MemBarRelease);
>> insert_mem_bar(Op_MemBarCPUOrder);
>>
>> Mike.
>>
>
>
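
For reference, lazySet bottoms out in Unsafe.putOrdered*, which is where 
the intrinsic Mike points at kicks in.  A minimal sketch of driving it 
directly (the class and field names are mine):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class OrderedStore {
        static final Unsafe U;
        static final long VALUE_OFFSET;
        volatile long value;

        static {
            try {
                // Standard back door to sun.misc.Unsafe.
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                U = (Unsafe) f.get(null);
                VALUE_OFFSET = U.objectFieldOffset(
                        OrderedStore.class.getDeclaredField("value"));
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        }

        // A release store: ordered with respect to earlier stores,
        // but no locked instruction or fence is emitted on x86.
        void orderedSet(long v) {
            U.putOrderedLong(this, VALUE_OFFSET, v);
        }
    }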
